Memory Optimization Results

This document summarizes the memory optimization work performed on the p256k1 pure Go secp256k1 implementation.

Executive Summary

Metric	Before	After	Improvement
ECDSA Verify allocations	8 allocs, 5,632 B	0 allocs, 0 B	100% reduction
ECDSA Sign allocations	39 allocs, 2,386 B	11 allocs, 546 B	72% fewer allocs, 77% less memory
Schnorr Verify allocations	11 allocs, 5,730 B	3 allocs, 98 B	73% fewer allocs, 98% less memory
Schnorr Sign allocations	~40 allocs	6 allocs, 320 B	85% fewer allocs
BatchNormalize memory	80 MB cumulative	0 B	100% reduction

Detailed Benchmark Comparison

ECDSA Operations

Operation	Before	After	Change
ECDSA Verify
- Time	~180 μs	193 μs	-
- Allocations	8	0	-100%
- Bytes/op	5,632 B	0 B	-100%
ECDSA Sign
- Time	~77 μs	77 μs	-
- Allocations	39	11	-72%
- Bytes/op	2,386 B	546 B	-77%
ECDSA PubkeyDerivation
- Time	~58 μs	58 μs	-
- Allocations	0	0	-
- Bytes/op	0 B	0 B	-

Schnorr Operations

Operation	Before	After	Change
Schnorr Verify
- Time	~185 μs	181 μs	-2%
- Allocations	11	3	-73%
- Bytes/op	5,730 B	98 B	-98%
Schnorr Sign
- Time	~138 μs	110 μs	-20%
- Allocations	~40	6	-85%
- Bytes/op	~2,500 B	320 B	-87%
Schnorr PubkeyDerivation
- Time	~61 μs	58 μs	-5%
- Allocations	2	2	-
- Bytes/op	128 B	128 B	-

Comparison vs Competitors

vs btcec (Pure Go)

Operation	p256k1	btcec	p256k1 Advantage
Schnorr Sign (time)	110 μs	235 μs	2.1x faster
Schnorr Sign (allocs)	6	37	84% fewer
Schnorr Verify (allocs)	3	16	81% fewer
ECDSA Verify (allocs)	0	23	100% fewer
ECDSA Sign (allocs)	11	28	61% fewer

vs libsecp256k1 (C library via purego)

Operation	p256k1 (Pure Go)	libsecp256k1 (C)	Notes
ECDSA Verify	193 μs, 0 allocs	47 μs, 12 allocs	Pure Go is 4x slower but zero-alloc
ECDSA Sign	77 μs, 11 allocs	35 μs, 13 allocs	Pure Go is 2.2x slower, similar allocs
Schnorr Verify	181 μs, 3 allocs	51 μs, 8 allocs	Pure Go is 3.5x slower, fewer allocs
Schnorr Sign	110 μs, 6 allocs	39 μs, 8 allocs	Pure Go is 2.8x slower, fewer allocs

Optimizations Applied

1. Fixed-Size Array for Batch Operations

Problem: BatchNormalize and batchInverse used slices, causing 80MB+ cumulative allocations.

Solution: Created batchNormalize16 and batchInverse16 with fixed [16]FieldElement arrays since glvTableSize=16 is constant.

// Before: slice allocation escapes to heap
func BatchNormalize(out []GroupElementAffine, points []GroupElementJacobian)

// After: fixed-size array stays on stack
func batchNormalize16(out *[16]GroupElementAffine, points *[16]GroupElementJacobian)

Files: field.go, field_32bit.go, group.go, ecdh.go

2. SHA256/HMAC Context Pooling

Problem: Each HMAC operation created 2 new SHA256 contexts. RFC6979 nonce generation created 5+ HMAC contexts.

Solution: Added sync.Pool for SHA256, HMAC, and RFC6979 contexts.

var sha256Pool = sync.Pool{
    New: func() interface{} { return sha256.New() },
}

var hmacPool = sync.Pool{
    New: func() interface{} { return &HMACSHA256{} },
}

var rfc6979Pool = sync.Pool{
    New: func() interface{} { return &RFC6979HMACSHA256{} },
}

File: hash.go

3. Pre-allocated Single-Byte Slices

Problem: []byte{0x00} and []byte{0x01} in hot paths allocated new slices each call.

Solution: Pre-allocated package-level variables.

var (
    byte0x00 = []byte{0x00}
    byte0x01 = []byte{0x01}
)

File: hash.go

4. Fixed-Size Arrays in ECDSA/Schnorr

Problem: Dynamic slice allocations in signing functions.

Solution: Changed to fixed-size arrays.

// Before: escapes to heap
nonceKey := make([]byte, 64)

// After: stays on stack
var nonceKey [64]byte

Files: ecdsa.go, schnorr.go, schnorr_wasm.go

5. Precomputed Tag Hashes

Problem: BIP-340 tag hashes (SHA256("BIP0340/challenge"), etc.) computed every call.

Solution: Compute once at init, cache for reuse.

var (
    bip340AuxTagHash       [32]byte
    bip340NonceTagHash     [32]byte
    bip340ChallengeTagHash [32]byte
)

func getTaggedHashPrefix(tag []byte) [32]byte {
    // Returns precomputed hash for known tags
}

File: hash.go

6. Zero-Allocation Hash Finalization

Problem: hash.Sum(nil) allocates a new slice for the result.

Solution: Use hash.Sum(buf[:0]) with pre-sized buffer.

// Before: allocates new []byte
copy(temp[:], h.inner.Sum(nil))

// After: appends to existing buffer
h.inner.Sum(temp[:0])

File: hash.go

Memory Profile Comparison

Before Optimization (Top Allocators)

BatchNormalize:     80 MB (53%)
batchInverse:       32 MB (21%)
SHA256/HMAC:        21 MB (14%)
Other:              18 MB (12%)

After Optimization (Top Allocators)

sync.Pool overhead: ~500 B (amortized)
Remaining allocs:   Minimal, mostly pooled

Impact on GC

Metric	Before	After
Allocations per verify	8-11	0-3
GC pressure	High	Low
Memory churn	~6 KB/op	<100 B/op

The reduced allocations significantly decrease garbage collection pressure, improving latency consistency in high-throughput scenarios.

WASM Compatibility

All optimizations are compatible with WASM/js builds:

Fixed-size arrays work identically
sync.Pool is supported in WASM
No CGO dependencies

Benchmark Commands

# Run Pure Go benchmarks
go test -bench="BenchmarkPureGo" -benchmem ./bench/

# Compare with btcec
go test -bench="BenchmarkBtcec" -benchmem ./bench/

# Compare with libsecp256k1 (requires libsecp256k1.so)
go test -bench="BenchmarkLibSecp" -benchmem ./bench/

# Run memory profile
go test -bench="BenchmarkPureGo_ECDSA_Sign" -memprofile=mem.prof ./bench/
go tool pprof -alloc_objects mem.prof

Conclusion

The memory optimization work achieved:

Zero-allocation ECDSA verification - critical for high-throughput signature validation
85% reduction in Schnorr signing allocations - important for Nostr and Taproot applications
Competitive with btcec while maintaining cleaner architecture
WASM-compatible pure Go implementation with GLV/Strauss/wNAF optimizations

memory-optimization-results.md raw

Memory Optimization Results

Executive Summary

Detailed Benchmark Comparison

ECDSA Operations

Schnorr Operations

Comparison vs Competitors

vs btcec (Pure Go)

vs libsecp256k1 (C library via purego)

Optimizations Applied

1. Fixed-Size Array for Batch Operations

2. SHA256/HMAC Context Pooling

3. Pre-allocated Single-Byte Slices

4. Fixed-Size Arrays in ECDSA/Schnorr

5. Precomputed Tag Hashes

6. Zero-Allocation Hash Finalization

Memory Profile Comparison

Before Optimization (Top Allocators)

After Optimization (Top Allocators)

Impact on GC

WASM Compatibility

Benchmark Commands

Conclusion