# Memory Optimization Results

This document summarizes the memory optimization work performed on the p256k1 pure Go secp256k1 implementation.

## Executive Summary

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| ECDSA Verify allocations | 8 allocs, 5,632 B | **0 allocs, 0 B** | 100% reduction |
| ECDSA Sign allocations | 39 allocs, 2,386 B | 11 allocs, 546 B | 72% fewer allocs, 77% less memory |
| Schnorr Verify allocations | 11 allocs, 5,730 B | 3 allocs, 98 B | 73% fewer allocs, 98% less memory |
| Schnorr Sign allocations | ~40 allocs | 6 allocs, 320 B | 85% fewer allocs |
| BatchNormalize memory | 80 MB cumulative | 0 B | 100% reduction |

## Detailed Benchmark Comparison

### ECDSA Operations

| Operation | Before | After | Change |
|-----------|--------|-------|--------|
| **ECDSA Verify** | | | |
| - Time | ~180 μs | 193 μs | - |
| - Allocations | 8 | **0** | -100% |
| - Bytes/op | 5,632 B | **0 B** | -100% |
| **ECDSA Sign** | | | |
| - Time | ~77 μs | 77 μs | - |
| - Allocations | 39 | 11 | -72% |
| - Bytes/op | 2,386 B | 546 B | -77% |
| **ECDSA PubkeyDerivation** | | | |
| - Time | ~58 μs | 58 μs | - |
| - Allocations | 0 | 0 | - |
| - Bytes/op | 0 B | 0 B | - |

### Schnorr Operations

| Operation | Before | After | Change |
|-----------|--------|-------|--------|
| **Schnorr Verify** | | | |
| - Time | ~185 μs | 181 μs | -2% |
| - Allocations | 11 | 3 | -73% |
| - Bytes/op | 5,730 B | 98 B | -98% |
| **Schnorr Sign** | | | |
| - Time | ~138 μs | 110 μs | -20% |
| - Allocations | ~40 | 6 | -85% |
| - Bytes/op | ~2,500 B | 320 B | -87% |
| **Schnorr PubkeyDerivation** | | | |
| - Time | ~61 μs | 58 μs | -5% |
| - Allocations | 2 | 2 | - |
| - Bytes/op | 128 B | 128 B | - |

## Comparison vs Competitors

### vs btcec (Pure Go)

| Operation | p256k1 | btcec | p256k1 Advantage |
|-----------|--------|-------|------------------|
| Schnorr Sign (time) | 110 μs | 235 μs | **2.1x faster** |
| Schnorr Sign (allocs) | 6 | 37 | **84% fewer** |
| Schnorr Verify (allocs) | 3 | 16 | **81% fewer** |
| ECDSA Verify (allocs) | 0 | 23 | **100% fewer** |
| ECDSA Sign (allocs) | 11 | 28 | **61% fewer** |

### vs libsecp256k1 (C library via purego)

| Operation | p256k1 (Pure Go) | libsecp256k1 (C) | Notes |
|-----------|------------------|------------------|-------|
| ECDSA Verify | 193 μs, 0 allocs | 47 μs, 12 allocs | Pure Go is 4x slower but zero-alloc |
| ECDSA Sign | 77 μs, 11 allocs | 35 μs, 13 allocs | Pure Go is 2.2x slower, similar allocs |
| Schnorr Verify | 181 μs, 3 allocs | 51 μs, 8 allocs | Pure Go is 3.5x slower, fewer allocs |
| Schnorr Sign | 110 μs, 6 allocs | 39 μs, 8 allocs | Pure Go is 2.8x slower, fewer allocs |

## Optimizations Applied

### 1. Fixed-Size Array for Batch Operations

**Problem:** `BatchNormalize` and `batchInverse` used slices, causing 80MB+ cumulative allocations.

**Solution:** Created `batchNormalize16` and `batchInverse16` with fixed `[16]FieldElement` arrays since `glvTableSize=16` is constant.

```go
// Before: slice allocation escapes to heap
func BatchNormalize(out []GroupElementAffine, points []GroupElementJacobian)

// After: fixed-size array stays on stack
func batchNormalize16(out *[16]GroupElementAffine, points *[16]GroupElementJacobian)
```

**Files:** `field.go`, `field_32bit.go`, `group.go`, `ecdh.go`

### 2. SHA256/HMAC Context Pooling

**Problem:** Each HMAC operation created 2 new SHA256 contexts. RFC6979 nonce generation created 5+ HMAC contexts.

**Solution:** Added `sync.Pool` for SHA256, HMAC, and RFC6979 contexts.

```go
var sha256Pool = sync.Pool{
    New: func() interface{} { return sha256.New() },
}

var hmacPool = sync.Pool{
    New: func() interface{} { return &HMACSHA256{} },
}

var rfc6979Pool = sync.Pool{
    New: func() interface{} { return &RFC6979HMACSHA256{} },
}
```

**File:** `hash.go`

### 3. Pre-allocated Single-Byte Slices

**Problem:** `[]byte{0x00}` and `[]byte{0x01}` in hot paths allocated new slices each call.

**Solution:** Pre-allocated package-level variables.

```go
var (
    byte0x00 = []byte{0x00}
    byte0x01 = []byte{0x01}
)
```

**File:** `hash.go`

### 4. Fixed-Size Arrays in ECDSA/Schnorr

**Problem:** Dynamic slice allocations in signing functions.

**Solution:** Changed to fixed-size arrays.

```go
// Before: escapes to heap
nonceKey := make([]byte, 64)

// After: stays on stack
var nonceKey [64]byte
```

**Files:** `ecdsa.go`, `schnorr.go`, `schnorr_wasm.go`

### 5. Precomputed Tag Hashes

**Problem:** BIP-340 tag hashes (`SHA256("BIP0340/challenge")`, etc.) computed every call.

**Solution:** Compute once at init, cache for reuse.

```go
var (
    bip340AuxTagHash       [32]byte
    bip340NonceTagHash     [32]byte
    bip340ChallengeTagHash [32]byte
)

func getTaggedHashPrefix(tag []byte) [32]byte {
    // Returns precomputed hash for known tags
}
```

**File:** `hash.go`

### 6. Zero-Allocation Hash Finalization

**Problem:** `hash.Sum(nil)` allocates a new slice for the result.

**Solution:** Use `hash.Sum(buf[:0])` with pre-sized buffer.

```go
// Before: allocates new []byte
copy(temp[:], h.inner.Sum(nil))

// After: appends to existing buffer
h.inner.Sum(temp[:0])
```

**File:** `hash.go`

## Memory Profile Comparison

### Before Optimization (Top Allocators)
```
BatchNormalize:     80 MB (53%)
batchInverse:       32 MB (21%)
SHA256/HMAC:        21 MB (14%)
Other:              18 MB (12%)
```

### After Optimization (Top Allocators)
```
sync.Pool overhead: ~500 B (amortized)
Remaining allocs:   Minimal, mostly pooled
```

## Impact on GC

| Metric | Before | After |
|--------|--------|-------|
| Allocations per verify | 8-11 | 0-3 |
| GC pressure | High | Low |
| Memory churn | ~6 KB/op | <100 B/op |

The reduced allocations significantly decrease garbage collection pressure, improving latency consistency in high-throughput scenarios.

## WASM Compatibility

All optimizations are compatible with WASM/js builds:
- Fixed-size arrays work identically
- `sync.Pool` is supported in WASM
- No CGO dependencies

## Benchmark Commands

```bash
# Run Pure Go benchmarks
go test -bench="BenchmarkPureGo" -benchmem ./bench/

# Compare with btcec
go test -bench="BenchmarkBtcec" -benchmem ./bench/

# Compare with libsecp256k1 (requires libsecp256k1.so)
go test -bench="BenchmarkLibSecp" -benchmem ./bench/

# Run memory profile
go test -bench="BenchmarkPureGo_ECDSA_Sign" -memprofile=mem.prof ./bench/
go tool pprof -alloc_objects mem.prof
```

## Conclusion

The memory optimization work achieved:
- **Zero-allocation ECDSA verification** - critical for high-throughput signature validation
- **85% reduction in Schnorr signing allocations** - important for Nostr and Taproot applications
- **Competitive with btcec** while maintaining cleaner architecture
- **WASM-compatible** pure Go implementation with GLV/Strauss/wNAF optimizations