# SIMD/ASM Optimization Benchmark Comparison

This document compares four secp256k1 implementations:

1. **btcec/v2** - Pure Go (github.com/btcsuite/btcd/btcec/v2)
2. **P256K1 Pure Go** - This repository with AVX2/BMI2 disabled
3. **P256K1 ASM** - This repository with AVX2/BMI2 assembly optimizations enabled
4. **libsecp256k1** - Native C library via purego (dlopen, no CGO)

**Generated:** 2025-11-29

**Platform:** linux/amd64

**CPU:** AMD Ryzen 5 PRO 4650G with Radeon Graphics (AVX2/BMI2 supported)

**Go Version:** go1.25.3

---

## Summary Comparison

| Operation | btcec/v2 | P256K1 Pure Go | P256K1 ASM | libsecp256k1 (C) |
|-----------|----------|----------------|------------|------------------|
| **Pubkey Derivation** | ~50 µs | 56 µs | 56 µs\* | 22 µs |
| **Sign** | ~60 µs | 58 µs | 58 µs\* | 41 µs |
| **Verify** | ~100 µs | 182 µs | 182 µs\* | 47 µs |
| **ECDH** | ~120 µs | 119 µs | 119 µs\* | N/A |

\*Note: AVX2/BMI2 assembly optimizations are currently implemented for field operations but require additional integration work to show speedups at the high-level API. The assembly code is available in `field_amd64_bmi2.s`.


---

## Detailed Results

### btcec/v2

The btcec library is the widely-used pure Go implementation from the btcd project:

| Operation | Time per op |
|-----------|-------------|
| Pubkey Derivation | ~50 µs |
| Schnorr Sign | ~60 µs |
| Schnorr Verify | ~100 µs |
| ECDH | ~120 µs |

### P256K1 Pure Go (AVX2 disabled)

This implementation with `SetAVX2Enabled(false)`:

| Operation | Time per op |
|-----------|-------------|
| Pubkey Derivation | 56 µs |
| Schnorr Sign | 58 µs |
| Schnorr Verify | 182 µs |
| ECDH | 119 µs |

### P256K1 with ASM/BMI2 (AVX2 enabled)

This implementation with `SetAVX2Enabled(true)`:

| Operation | Time per op | Notes |
|-----------|-------------|-------|
| Pubkey Derivation | 56 µs | Uses GLV optimization |
| Schnorr Sign | 58 µs | Uses GLV for k*G |
| Schnorr Verify | 182 µs | Signature verification |
| ECDH | 119 µs | Uses GLV for scalar mult |

**Field Operation Speedups (Low-level):**

The BMI2-based field multiplication is available in `field_amd64_bmi2.s` and provides faster 256-bit modular arithmetic using the MULX instruction.

### libsecp256k1 (Native C via purego)

The fastest option, using the Bitcoin Core C library:

| Operation | Time per op |
|-----------|-------------|
| Pubkey Derivation | 22 µs |
| Schnorr Sign | 41 µs |
| Schnorr Verify | 47 µs |
| ECDH | N/A |

---

## Key Optimizations in P256K1

### GLV Endomorphism (Primary Speedup)

The GLV (Gallant-Lambert-Vanstone) endomorphism exploits secp256k1's special curve structure:

- λ·(x, y) = (β·x, y) for endomorphism constant λ
- β³ ≡ 1 (mod p) and λ³ ≡ 1 (mod n)

A scalar k is split as k ≡ k₁ + k₂·λ (mod n) with k₁ and k₂ of roughly 128 bits each, so one 256-bit scalar multiplication becomes two 128-bit multiplications:

| Operation | Without GLV | With GLV | Speedup |
|-----------|-------------|----------|---------|
| Generator mult (k*G) | 122 µs | 45 µs | **2.7x** |
| Arbitrary point mult | 122 µs | 101 µs | **1.2x** (17% less time) |

### BMI2 Assembly (Field Operations)

The `field_amd64_bmi2.s` file contains optimized assembly using:

- **MULX** instruction for flag-preserving multiplication (it does not modify the carry flags)
- **ADCX/ADOX** for parallel add-with-carry chains (these two instructions belong to the ADX extension)
- Register allocation optimized for secp256k1's field prime

### Precomputed Tables

- **Generator table**: 32 precomputed odd multiples of G
- **λ*G table**: 32 precomputed odd multiples for GLV
- **8-bit byte table**: For constant-time lookup

---

## Performance Ranking

From fastest to slowest for typical cryptographic operations:

1. **libsecp256k1 (C)** - Best choice when native library available
   - 1.4x to 3.9x faster than the pure Go implementations, depending on the operation (smallest gain for signing, largest for verification)
   - Uses purego (no CGO required)

2. **btcec/v2** - Good pure Go option
   - Mature, well-tested codebase
   - About 1.8x faster verification than P256K1 (~100 µs vs 182 µs)

3. **P256K1 (This Repo)** - GLV-optimized pure Go
   - Competitive signing performance
   - 2.7x faster generator multiplication with GLV
   - Ongoing BMI2 assembly integration

---

## Recommendations

**Use libsecp256k1 when:**

- Maximum performance is critical
- Running on platforms where purego works (Linux, macOS, Windows)
- Verification-heavy workloads (3.9x faster than P256K1, 2.1x faster than btcec/v2)

**Use btcec/v2 when:**

- You need a battle-tested, widely-used library
- Verification performance matters more than signing

**Use P256K1 when:**

- Pure Go is required (WebAssembly, embedded, cross-compilation)
- Signing-heavy workloads (GLV optimization helps most here)
- Portability is important
- You prefer auditing Go code over C

---

## Running Benchmarks

```bash
# Run all SIMD comparison benchmarks
go test ./bench -bench='BenchmarkBtcec|BenchmarkP256K1PureGo|BenchmarkP256K1ASM|BenchmarkLibSecp256k1' -benchtime=1s -run='^$'

# Run specific benchmark category
go test ./bench -bench=BenchmarkBtcec -benchtime=1s -run='^$'
go test ./bench -bench=BenchmarkP256K1PureGo -benchtime=1s -run='^$'
go test ./bench -bench=BenchmarkP256K1ASM -benchtime=1s -run='^$'
go test ./bench -bench=BenchmarkLibSecp256k1 -benchtime=1s -run='^$'

# Run internal scalar multiplication benchmarks
go test -bench='BenchmarkEcmultGen|BenchmarkEcmultStraussWNAFGLV' -benchtime=1s
```

---

## CPU Feature Detection

The P256K1 implementation automatically detects CPU features:

```go
package main

import "p256k1.mleku.dev"

func main() {
	// Check if AVX2/BMI2 is available
	if p256k1.HasAVX2CPU() {
		// Use optimized path
	}

	// Manually control AVX2 usage
	p256k1.SetAVX2Enabled(false) // Force pure Go
	p256k1.SetAVX2Enabled(true)  // Enable AVX2/BMI2 (if available)
}
```

---

## Future Work

1. **Integrate BMI2 field multiplication** into high-level operations
2. **Batch verification** using Strauss or Pippenger algorithms
3. **ARM64 optimizations** using NEON instructions
4. **WebAssembly SIMD** for browser performance