memory-optimization-results.md raw

Memory Optimization Results

This document summarizes the memory optimization work performed on the p256k1 pure Go secp256k1 implementation.

Executive Summary

MetricBeforeAfterImprovement
ECDSA Verify allocations8 allocs, 5,632 B0 allocs, 0 B100% reduction
ECDSA Sign allocations39 allocs, 2,386 B11 allocs, 546 B72% fewer allocs, 77% less memory
Schnorr Verify allocations11 allocs, 5,730 B3 allocs, 98 B73% fewer allocs, 98% less memory
Schnorr Sign allocations~40 allocs6 allocs, 320 B85% fewer allocs
BatchNormalize memory80 MB cumulative0 B100% reduction

Detailed Benchmark Comparison

ECDSA Operations

OperationBeforeAfterChange
ECDSA Verify
- Time~180 μs193 μs-
- Allocations80-100%
- Bytes/op5,632 B0 B-100%
ECDSA Sign
- Time~77 μs77 μs-
- Allocations3911-72%
- Bytes/op2,386 B546 B-77%
ECDSA PubkeyDerivation
- Time~58 μs58 μs-
- Allocations00-
- Bytes/op0 B0 B-

Schnorr Operations

OperationBeforeAfterChange
Schnorr Verify
- Time~185 μs181 μs-2%
- Allocations113-73%
- Bytes/op5,730 B98 B-98%
Schnorr Sign
- Time~138 μs110 μs-20%
- Allocations~406-85%
- Bytes/op~2,500 B320 B-87%
Schnorr PubkeyDerivation
- Time~61 μs58 μs-5%
- Allocations22-
- Bytes/op128 B128 B-

Comparison vs Competitors

vs btcec (Pure Go)

Operationp256k1btcecp256k1 Advantage
Schnorr Sign (time)110 μs235 μs2.1x faster
Schnorr Sign (allocs)63784% fewer
Schnorr Verify (allocs)31681% fewer
ECDSA Verify (allocs)023100% fewer
ECDSA Sign (allocs)112861% fewer

vs libsecp256k1 (C library via purego)

Operationp256k1 (Pure Go)libsecp256k1 (C)Notes
ECDSA Verify193 μs, 0 allocs47 μs, 12 allocsPure Go is 4x slower but zero-alloc
ECDSA Sign77 μs, 11 allocs35 μs, 13 allocsPure Go is 2.2x slower, similar allocs
Schnorr Verify181 μs, 3 allocs51 μs, 8 allocsPure Go is 3.5x slower, fewer allocs
Schnorr Sign110 μs, 6 allocs39 μs, 8 allocsPure Go is 2.8x slower, fewer allocs

Optimizations Applied

1. Fixed-Size Array for Batch Operations

Problem: BatchNormalize and batchInverse used slices, causing 80MB+ cumulative allocations.

Solution: Created batchNormalize16 and batchInverse16 with fixed [16]FieldElement arrays since glvTableSize=16 is constant.

// Before: slice allocation escapes to heap
func BatchNormalize(out []GroupElementAffine, points []GroupElementJacobian)

// After: fixed-size array stays on stack
func batchNormalize16(out *[16]GroupElementAffine, points *[16]GroupElementJacobian)

Files: field.go, field_32bit.go, group.go, ecdh.go

2. SHA256/HMAC Context Pooling

Problem: Each HMAC operation created 2 new SHA256 contexts. RFC6979 nonce generation created 5+ HMAC contexts.

Solution: Added sync.Pool for SHA256, HMAC, and RFC6979 contexts.

var sha256Pool = sync.Pool{
    New: func() interface{} { return sha256.New() },
}

var hmacPool = sync.Pool{
    New: func() interface{} { return &HMACSHA256{} },
}

var rfc6979Pool = sync.Pool{
    New: func() interface{} { return &RFC6979HMACSHA256{} },
}

File: hash.go

3. Pre-allocated Single-Byte Slices

Problem: []byte{0x00} and []byte{0x01} in hot paths allocated new slices each call.

Solution: Pre-allocated package-level variables.

var (
    byte0x00 = []byte{0x00}
    byte0x01 = []byte{0x01}
)

File: hash.go

4. Fixed-Size Arrays in ECDSA/Schnorr

Problem: Dynamic slice allocations in signing functions.

Solution: Changed to fixed-size arrays.

// Before: escapes to heap
nonceKey := make([]byte, 64)

// After: stays on stack
var nonceKey [64]byte

Files: ecdsa.go, schnorr.go, schnorr_wasm.go

5. Precomputed Tag Hashes

Problem: BIP-340 tag hashes (SHA256("BIP0340/challenge"), etc.) computed every call.

Solution: Compute once at init, cache for reuse.

var (
    bip340AuxTagHash       [32]byte
    bip340NonceTagHash     [32]byte
    bip340ChallengeTagHash [32]byte
)

func getTaggedHashPrefix(tag []byte) [32]byte {
    // Returns precomputed hash for known tags
}

File: hash.go

6. Zero-Allocation Hash Finalization

Problem: hash.Sum(nil) allocates a new slice for the result.

Solution: Use hash.Sum(buf[:0]) with pre-sized buffer.

// Before: allocates new []byte
copy(temp[:], h.inner.Sum(nil))

// After: appends to existing buffer
h.inner.Sum(temp[:0])

File: hash.go

Memory Profile Comparison

Before Optimization (Top Allocators)

BatchNormalize:     80 MB (53%)
batchInverse:       32 MB (21%)
SHA256/HMAC:        21 MB (14%)
Other:              18 MB (12%)

After Optimization (Top Allocators)

sync.Pool overhead: ~500 B (amortized)
Remaining allocs:   Minimal, mostly pooled

Impact on GC

MetricBeforeAfter
Allocations per verify8-110-3
GC pressureHighLow
Memory churn~6 KB/op<100 B/op

The reduced allocations significantly decrease garbage collection pressure, improving latency consistency in high-throughput scenarios.

WASM Compatibility

All optimizations are compatible with WASM/js builds:

Benchmark Commands

# Run Pure Go benchmarks
go test -bench="BenchmarkPureGo" -benchmem ./bench/

# Compare with btcec
go test -bench="BenchmarkBtcec" -benchmem ./bench/

# Compare with libsecp256k1 (requires libsecp256k1.so)
go test -bench="BenchmarkLibSecp" -benchmem ./bench/

# Run memory profile
go test -bench="BenchmarkPureGo_ECDSA_Sign" -memprofile=mem.prof ./bench/
go tool pprof -alloc_objects mem.prof

Conclusion

The memory optimization work achieved: