Benchmark Comparison Report
Signer Implementation Comparison
This report compares three signer implementations for secp256k1 operations:
- P256K1Signer - This repository's new port from Bitcoin Core secp256k1 (pure Go)
- ~~BtcecSigner - Pure Go wrapper around btcec/v2~~ (removed)
- NextP256K Signer - CGO version using next.orly.dev/pkg/crypto/p256k (CGO bindings to libsecp256k1)
Generated: 2025-11-02 (Updated after comprehensive CPU optimizations)
Platform: linux/amd64
CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics
Go Version: go1.25.3
Key Optimizations:
- Implemented 8-bit byte-based precomputed tables matching btcec's approach, resulting in 4x improvement in pubkey derivation and 4.3x improvement in signing.
- Optimized windowed multiplication for verification (6-bit windows, increased from 5-bit): 8% improvement (149,511 → 138,127 ns/op).
- Optimized ECDH with windowed multiplication (6-bit windows): 5% improvement (109,068 → 103,345 ns/op).
- Major CPU optimizations (Nov 2025):
- Precomputed TaggedHash prefixes for common BIP-340 tags: 28% faster (310 → 230 ns/op)
- Eliminated unnecessary copies in field element operations (mul/sqr): faster when magnitude ≤ 8
- Optimized group element operations (toBytes/toStorage): in-place normalization to avoid copies
- Optimized EcmultGen: pre-allocated group elements to reduce allocations
- Sign optimizations: 54% faster (63,421 → 29,237 ns/op), 47% fewer allocations (17 → 9 allocs/op)
- Verify optimizations: 8% faster (149,511 → 138,127 ns/op), 78% fewer allocations (9 → 2 allocs/op)
- Pubkey derivation: 6% faster (58,383 → 55,091 ns/op), eliminated intermediate copies
Summary Results
| Operation | P256K1Signer | BtcecSigner | NextP256K | Winner |
| Pubkey Derivation | 55,091 ns/op | 64,177 ns/op | 271,394 ns/op | P256K1 (14% faster than Btcec) |
| Sign | 29,237 ns/op | 225,514 ns/op | 53,015 ns/op | P256K1 (1.8x faster than NextP256K) |
| Verify | 138,127 ns/op | 177,622 ns/op | 44,776 ns/op | NextP256K (3.1x faster) |
| ECDH | 103,345 ns/op | 129,392 ns/op | 125,835 ns/op | P256K1 (1.2x faster than NextP256K) |
Detailed Results
Public Key Derivation
Deriving public key from private key (32 bytes → 32 bytes x-only pubkey).
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
| P256K1Signer | 55,091 ns/op | 256 B/op | 4 allocs/op | 1.0x (baseline) |
| BtcecSigner | 64,177 ns/op | 368 B/op | 7 allocs/op | 0.9x slower |
| NextP256K | 271,394 ns/op | 983,394 B/op | 9 allocs/op | 0.2x slower |
Analysis:
- P256K1 is fastest (14% faster than Btcec) after implementing 8-bit byte-based precomputed tables
- 6% improvement from CPU optimizations (58,383 → 55,091 ns/op)
- Massive improvement: 4x faster than original implementation (232,922 → 55,091 ns/op)
- NextP256K is slowest, likely due to CGO overhead for small operations
- P256K1 has lowest memory allocation overhead (256 B vs 368 B)
Signing (Schnorr)
Creating BIP-340 Schnorr signatures (32-byte message → 64-byte signature).
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
| P256K1Signer | 29,237 ns/op | 576 B/op | 9 allocs/op | 1.0x (baseline) |
| BtcecSigner | 225,514 ns/op | 2,193 B/op | 38 allocs/op | 0.1x slower |
| NextP256K | 53,015 ns/op | 128 B/op | 3 allocs/op | 0.6x slower |
Analysis:
- P256K1 is fastest (1.8x faster than NextP256K) after comprehensive CPU optimizations
- 54% improvement from optimizations (63,421 → 29,237 ns/op)
- 47% reduction in allocations (17 → 9 allocs/op)
- P256K1 is 7.7x faster than Btcec
- Optimizations: precomputed TaggedHash prefixes, eliminated intermediate copies, optimized hash operations
- NextP256K has lowest memory usage (128 B vs 576 B) but P256K1 is significantly faster
Verification (Schnorr)
Verifying BIP-340 Schnorr signatures (32-byte message + 64-byte signature).
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
| P256K1Signer | 138,127 ns/op | 64 B/op | 2 allocs/op | 1.0x (baseline) |
| BtcecSigner | 177,622 ns/op | 1,120 B/op | 18 allocs/op | 0.8x slower |
| NextP256K | 44,776 ns/op | 96 B/op | 2 allocs/op | 3.1x faster |
Analysis:
- NextP256K is dramatically fastest (3.1x faster), showcasing CGO advantage for verification
- P256K1 is fastest pure Go implementation (22% faster than Btcec) after comprehensive optimizations
- 8% improvement from CPU optimizations (149,511 → 138,127 ns/op)
- 78% reduction in allocations (9 → 2 allocs/op), 89% reduction in memory (576 → 64 B/op)
- Total improvement: 26% faster than original (186,054 → 138,127 ns/op)
- Optimizations: 6-bit windowed multiplication (increased from 5-bit), precomputed TaggedHash, eliminated intermediate copies
- P256K1 now has minimal memory footprint (64 B vs 96 B for NextP256K)
ECDH (Shared Secret Generation)
Generating shared secret using Elliptic Curve Diffie-Hellman.
| Implementation | Time per op | Memory | Allocations | Speedup vs P256K1 |
| P256K1Signer | 103,345 ns/op | 241 B/op | 6 allocs/op | 1.0x (baseline) |
| BtcecSigner | 129,392 ns/op | 832 B/op | 13 allocs/op | 0.8x slower |
| NextP256K | 125,835 ns/op | 160 B/op | 3 allocs/op | 0.8x slower |
Analysis:
- P256K1 is fastest (1.2x faster than NextP256K) after optimizing with windowed multiplication
- 5% improvement from CPU optimizations (109,068 → 103,345 ns/op)
- Total improvement: 37% faster than original (163,356 → 103,345 ns/op)
- Optimizations: 6-bit windowed multiplication (increased from 5-bit), optimized field operations
- P256K1 has lowest memory usage (241 B vs 832 B for Btcec)
Performance Analysis
Overall Winner: Mixed (P256K1 wins 3/4 operations, NextP256K wins 1/4 operations)
After comprehensive CPU optimizations:
- P256K1Signer wins in 3 out of 4 operations:
- Pubkey Derivation: Fastest (14% faster than Btcec) - 6% improvement
- Signing: Fastest (1.8x faster than NextP256K) - 54% improvement!
- ECDH: Fastest (1.2x faster than NextP256K) - 5% improvement
- NextP256K wins in 1 operation:
- Verification: Fastest (3.1x faster than P256K1, CGO advantage) - but P256K1 is 8% faster than before
Best Pure Go: P256K1Signer
For pure Go implementations:
- P256K1 wins for key derivation (14% faster than Btcec) - 6% improvement
- P256K1 wins for signing (7.7x faster than Btcec) - 54% improvement!
- P256K1 wins for verification (22% faster than Btcec) - fastest pure Go! (8% improvement)
- P256K1 wins for ECDH (1.25x faster than Btcec) - fastest pure Go! (5% improvement)
Memory Efficiency
| Implementation | Avg Memory per Operation | Notes |
| P256K1Signer | ~270 B avg | Low memory footprint, significantly reduced after optimizations |
| NextP256K | ~300 KB avg | Very efficient, minimal allocations (except pubkey derivation overhead) |
| BtcecSigner | ~1.1 KB avg | Higher allocations, but acceptable |
Note: NextP256K shows high memory in pubkey derivation (983 KB) due to one-time CGO initialization overhead, but this is amortized across operations.
Memory Improvements:
- Sign: 1,152 → 576 B/op (50% reduction)
- Verify: 576 → 64 B/op (89% reduction!)
- Pubkey Derivation: Already optimized (256 B/op)
Recommendations
Use NextP256K (CGO) when:
- Maximum verification performance is critical (3.1x faster than P256K1)
- CGO is acceptable in your build environment
- Low memory footprint is important
- Verification speed is critical (3.1x faster)
Use P256K1Signer when:
- Pure Go is required (no CGO)
- Signing performance is critical (1.8x faster than NextP256K, 7.7x faster than Btcec)
- Pubkey derivation, verification, or ECDH performance is critical (fastest pure Go for all operations!)
- Lower memory allocations are preferred (64 B for verify, 576 B for sign)
- You want to avoid external C dependencies
- You need the best overall pure Go performance
- Now competitive with CGO for signing (faster than NextP256K)
Use BtcecSigner when:
- Pure Go is required
- You're already using btcec in your project
- Note: P256K1Signer is faster across all operations
Conclusion
The benchmarks demonstrate that:
- After comprehensive CPU optimizations, P256K1Signer achieves:
- Fastest pubkey derivation among all implementations (55,091 ns/op) - 6% improvement
- Fastest signing among all implementations (29,237 ns/op) - 54% improvement! (63,421 → 29,237 ns/op)
- Fastest ECDH among all implementations (103,345 ns/op) - 5% improvement (109,068 → 103,345 ns/op)
- Fastest pure Go verification (138,127 ns/op) - 8% improvement (149,511 → 138,127 ns/op)
- Now faster than NextP256K for signing (1.8x faster!)
- CPU optimization results (Nov 2025):
- Precomputed TaggedHash prefixes: 28% faster (310 → 230 ns/op)
- Increased window size from 5-bit to 6-bit: fewer iterations (~43 vs ~52 windows)
- Eliminated unnecessary copies in field/group operations
- Optimized memory allocations: 78% reduction in verify (9 → 2 allocs/op), 47% reduction in sign (17 → 9 allocs/op)
- Sign: 54% faster (63,421 → 29,237 ns/op)
- Verify: 8% faster (149,511 → 138,127 ns/op), 89% less memory (576 → 64 B/op)
- Pubkey Derivation: 6% faster (58,383 → 55,091 ns/op)
- ECDH: 5% faster (109,068 → 103,345 ns/op)
- CGO implementations (NextP256K) still provide advantages for verification (3.1x faster) but P256K1 is now faster for signing
- Pure Go implementations are highly competitive, with P256K1Signer leading in 3 out of 4 operations (pubkey derivation, signing, ECDH)
- Memory efficiency significantly improved, with P256K1Signer maintaining very low memory usage:
- Verify: 64 B/op (89% reduction!)
- Sign: 576 B/op (50% reduction)
- Pubkey Derivation: 256 B/op
- ECDH: 241 B/op
The choice between implementations depends on your specific requirements:
- Maximum verification performance: Use NextP256K (CGO) - 3.1x faster for verification
- Maximum signing performance: Use P256K1Signer (Pure Go) - 1.8x faster than NextP256K, 7.7x faster than Btcec!
- Best pure Go performance: Use P256K1Signer - fastest pure Go for all operations, now competitive with CGO for signing
- Best overall performance: Use P256K1Signer - wins 3 out of 4 operations, fastest overall for signing
- Pure Go alternative: Use BtcecSigner (but P256K1Signer is significantly faster across all operations)
Running the Benchmarks
To reproduce these benchmarks:
# Run all benchmarks
CGO_ENABLED=1 go test -tags=cgo ./bench -bench=. -benchmem
# Run specific operation
CGO_ENABLED=1 go test -tags=cgo ./bench -bench=BenchmarkSign
# Run specific implementation
CGO_ENABLED=1 go test -tags=cgo ./bench -bench=Benchmark.*_P256K1
Note: All benchmarks require CGO to be enabled (CGO_ENABLED=1) and the cgo build tag.