SIMD/ASM Optimization Benchmark Comparison

This document compares four secp256k1 implementations:

btcec/v2 - Pure Go (github.com/btcsuite/btcd/btcec/v2)
P256K1 Pure Go - This repository with AVX2/BMI2 disabled
P256K1 ASM - This repository with AVX2/BMI2 assembly optimizations enabled
libsecp256k1 - Native C library via purego (dlopen, no CGO)

Generated: 2025-11-29
Platform: linux/amd64
CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics (AVX2/BMI2 supported)
Go Version: go1.25.3

Summary Comparison

Operation	btcec/v2	P256K1 Pure Go	P256K1 ASM	libsecp256k1 (C)
Pubkey Derivation	~50 µs	56 µs	56 µs*	22 µs
Sign	~60 µs	58 µs	58 µs*	41 µs
Verify	~100 µs	182 µs	182 µs*	47 µs
ECDH	~120 µs	119 µs	119 µs*	N/A

*Note: The P256K1 ASM figures match the pure Go figures because the AVX2/BMI2 assembly is currently implemented for field operations only and needs additional integration work before it shows speedups at the high-level API. The assembly code is available in field_amd64_bmi2.s.
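
The ratios between columns can be derived directly from the summary table. The following is a minimal sketch that takes the timings above at face value; the `speedup` helper is a hypothetical name used for illustration and is not part of any of the benchmarked libraries.

```go
package main

import "fmt"

// speedup returns how many times faster the candidate is than the baseline,
// given the time per operation of each in the same unit.
func speedup(baseline, candidate float64) float64 {
	return baseline / candidate
}

func main() {
	// Timings in µs copied from the summary table.
	rows := []struct {
		op      string
		p256k1  float64
		libsecp float64
	}{
		{"Pubkey Derivation", 56, 22},
		{"Sign", 58, 41},
		{"Verify", 182, 47},
	}
	for _, r := range rows {
		fmt.Printf("%-18s libsecp256k1 is %.1fx faster than P256K1\n",
			r.op, speedup(r.p256k1, r.libsecp))
	}
}
```

Dividing the baseline by the candidate keeps "Nx faster" consistent across rows, which makes it easier to compare operations than mixing ratios and percentages.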

Detailed Results

btcec/v2

The btcec library is the widely-used pure Go implementation from the btcd project:

Operation	Time per op
Pubkey Derivation	~50 µs
Schnorr Sign	~60 µs
Schnorr Verify	~100 µs
ECDH	~120 µs

P256K1 Pure Go (AVX2 disabled)

This repository's implementation, run with SetAVX2Enabled(false):

Operation	Time per op
Pubkey Derivation	56 µs
Schnorr Sign	58 µs
Schnorr Verify	182 µs
ECDH	119 µs

P256K1 with ASM/BMI2 (AVX2 enabled)

This repository's implementation, run with SetAVX2Enabled(true):

Operation	Time per op	Notes
Pubkey Derivation	56 µs	Uses GLV optimization
Schnorr Sign	58 µs	Uses GLV for k*G
Schnorr Verify	182 µs	Signature verification
ECDH	119 µs	Uses GLV for scalar mult

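Field Operation Speedups (Low-level): The BMI2-based field multiplication in field_amd64_bmi2.s uses the MULX instruction for 256-bit modular arithmetic. Its benefit is confined to low-level field operations for now, which is why the high-level timings in this table match the pure Go results.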

libsecp256k1 (Native C via purego)

The fastest option, using the Bitcoin Core C library:

Operation	Time per op
Pubkey Derivation	22 µs
Schnorr Sign	41 µs
Schnorr Verify	47 µs
ECDH	N/A

Key Optimizations in P256K1

GLV Endomorphism (Primary Speedup)

The GLV (Gallant-Lambert-Vanstone) endomorphism exploits secp256k1's special curve structure:

λ·(x, y) = (β·x, y) for endomorphism constant λ
β³ ≡ 1 (mod p) and λ³ ≡ 1 (mod n)

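This reduces a 256-bit scalar multiplication to two 128-bit multiplications: the scalar k is split as k ≡ k1 + k2·λ (mod n) with k1 and k2 of roughly 128 bits, so that k·P = k1·P + k2·(λ·P), and computing λ·P costs only one field multiplication by β. Measured effect: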

Operation	Without GLV	With GLV	Speedup
Generator mult (k*G)	122 µs	45 µs	2.7x
Arbitrary point mult	122 µs	101 µs	1.2x

BMI2 Assembly (Field Operations)

The field_amd64_bmi2.s file contains optimized assembly using:

MULX instruction (BMI2) for 64×64→128-bit multiplication that leaves the CPU flags untouched
ADCX/ADOX (ADX extension) for two independent add-with-carry chains that can be interleaved
Register allocation optimized for secp256k1's field prime
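
To make the multiply-and-carry pattern concrete, here is a portable Go sketch of one row of a multi-limb multiplication: a 256-bit value in four 64-bit limbs times a single 64-bit limb. It is not the code in field_amd64_bmi2.s; `mul256x64` is a hypothetical helper that uses math/bits, where bits.Mul64 plays the role of a full-width multiply such as MULX and bits.Add64 the role of an add-with-carry.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mul256x64 multiplies a 256-bit value a (four little-endian 64-bit limbs)
// by a 64-bit value b and returns the 320-bit product as five limbs.
func mul256x64(a [4]uint64, b uint64) [5]uint64 {
	var out [5]uint64
	var carry uint64
	for i := 0; i < 4; i++ {
		// Full 128-bit product of two 64-bit limbs.
		hi, lo := bits.Mul64(a[i], b)
		// Add the carry coming from the previous limb.
		var c uint64
		out[i], c = bits.Add64(lo, carry, 0)
		// hi is at most 2^64-2, so adding the carry bit cannot overflow.
		carry = hi + c
	}
	out[4] = carry
	return out
}

func main() {
	a := [4]uint64{0xfffffffefffffc2f, ^uint64(0), ^uint64(0), ^uint64(0)}
	fmt.Printf("%#x\n", mul256x64(a, 3))
}
```

A full 256×256-bit multiplication repeats this row four times and accumulates the partial products, which is where two independent carry chains become useful.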

Precomputed Tables

Generator table: 32 precomputed odd multiples of G
λ*G table: 32 precomputed odd multiples for GLV
8-bit byte table: For constant-time lookup
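
A constant-time table lookup reads every entry and selects the wanted one with a mask, so the memory access pattern does not depend on the secret index. The sketch below illustrates the idea with crypto/subtle; the 32-byte entry size and the `ctLookup` name are assumptions for illustration and do not describe this repository's actual table layout.

```go
package main

import (
	"crypto/subtle"
	"fmt"
)

// entrySize is a hypothetical size for one table entry.
const entrySize = 32

// ctLookup returns table[idx] while touching every entry of the table.
func ctLookup(table [][entrySize]byte, idx int) [entrySize]byte {
	var out [entrySize]byte
	for i := range table {
		// match is 1 when i == idx and 0 otherwise, computed without branching.
		match := subtle.ConstantTimeEq(int32(i), int32(idx))
		// Copies table[i] into out only when match == 1.
		subtle.ConstantTimeCopy(match, out[:], table[i][:])
	}
	return out
}

func main() {
	// A 256-entry table, as an 8-bit index would address.
	table := make([][entrySize]byte, 256)
	for i := range table {
		table[i][0] = byte(i)
	}
	got := ctLookup(table, 0xA7)
	fmt.Printf("first byte of selected entry: %#x\n", got[0])
}
```

The cost is a full pass over the table per lookup, which is the usual trade-off for hiding the index from cache-timing observers.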

Performance Ranking

From fastest to slowest for typical cryptographic operations:

libsecp256k1 (C) - Best choice when native library available

- 1.4x to 3.9x faster than the pure Go implementations, depending on the operation
- Uses purego (no CGO required)

btcec/v2 - Good pure Go option

- Mature, well-tested codebase
- About 1.8x faster verification than P256K1 (~100 µs vs 182 µs)

P256K1 (This Repo) - GLV-optimized pure Go

- Competitive signing performance (58 µs vs ~60 µs for btcec/v2)
- 2.7x faster generator multiplication with GLV
- Ongoing BMI2 assembly integration

Recommendations

Use libsecp256k1 when:

Maximum performance is critical
Running on platforms where purego works (Linux, macOS, Windows)
Verification-heavy workloads (3.9x faster than P256K1, about 2.1x faster than btcec/v2)

Use btcec/v2 when:

A battle-tested, widely used library is needed
Verification performance matters more than signing

Use P256K1 when:

Pure Go is required (WebAssembly, embedded, cross-compilation)
Signing-heavy workloads (GLV optimization helps most here)
Portability is important
Auditing Go code is preferred over auditing C

Running Benchmarks

# Run all SIMD comparison benchmarks
go test ./bench -bench='BenchmarkBtcec|BenchmarkP256K1PureGo|BenchmarkP256K1ASM|BenchmarkLibSecp256k1' -benchtime=1s -run=^$

# Run specific benchmark category
go test ./bench -bench=BenchmarkBtcec -benchtime=1s -run=^$
go test ./bench -bench=BenchmarkP256K1PureGo -benchtime=1s -run=^$
go test ./bench -bench=BenchmarkP256K1ASM -benchtime=1s -run=^$
go test ./bench -bench=BenchmarkLibSecp256k1 -benchtime=1s -run=^$

# Run internal scalar multiplication benchmarks
go test -bench='BenchmarkEcmultGen|BenchmarkEcmultStraussWNAFGLV' -benchtime=1s

CPU Feature Detection

The P256K1 implementation automatically detects CPU features:

package main

import (
	"fmt"

	"p256k1.mleku.dev"
)

func main() {
	// Check whether the CPU supports AVX2/BMI2
	if p256k1.HasAVX2CPU() {
		fmt.Println("AVX2/BMI2 available: the optimized path can be used")
	}

	// Manually control AVX2 usage
	p256k1.SetAVX2Enabled(false) // Force pure Go
	p256k1.SetAVX2Enabled(true)  // Enable AVX2/BMI2 (if available)
}
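
The "(if available)" behaviour can be pictured as a gate that combines a fixed detection result with the caller's request. The following is a hypothetical sketch of that pattern using only the standard library; `featureGate` and its methods are illustrative names and are not taken from the P256K1 source, whose internals may differ.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// featureGate is a hypothetical stand-in for a runtime switch that only
// turns on when the CPU feature was detected at startup.
type featureGate struct {
	supported bool        // result of CPU detection, fixed after construction
	enabled   atomic.Bool // effective state, safe for concurrent readers
}

// newFeatureGate returns a gate that starts enabled if the feature is supported.
func newFeatureGate(supported bool) *featureGate {
	g := &featureGate{supported: supported}
	g.enabled.Store(supported)
	return g
}

// SetEnabled records the request, but never enables an unsupported feature.
func (g *featureGate) SetEnabled(on bool) {
	g.enabled.Store(on && g.supported)
}

// Enabled reports whether the optimized path should be taken.
func (g *featureGate) Enabled() bool {
	return g.enabled.Load()
}

func main() {
	gate := newFeatureGate(false) // pretend detection found no AVX2/BMI2
	gate.SetEnabled(true)
	fmt.Println("optimized path active:", gate.Enabled())
}
```

Storing the effective state in an atomic value lets hot paths read the switch without a lock while benchmarks flip it between runs.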

Future Work

Integrate BMI2 field multiplication into high-level operations
Batch verification using Strauss or Pippenger algorithms
ARM64 optimizations using NEON instructions
WebAssembly SIMD for browser performance
