
SIMD/ASM Optimization Benchmark Comparison

This document compares four secp256k1 implementations:

  1. btcec/v2 - Pure Go (github.com/btcsuite/btcd/btcec/v2)
  2. P256K1 Pure Go - This repository with AVX2/BMI2 disabled
  3. P256K1 ASM - This repository with AVX2/BMI2 assembly optimizations enabled
  4. libsecp256k1 - Native C library via purego (dlopen, no CGO)

Generated: 2025-11-29
Platform: linux/amd64
CPU: AMD Ryzen 5 PRO 4650G with Radeon Graphics (AVX2/BMI2 supported)
Go Version: go1.25.3

Summary Comparison

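| Operation | btcec/v2 | P256K1 Pure Go | P256K1 ASM | libsecp256k1 (C) |
|---|---|---|---|---|
| Pubkey Derivation | ~50 µs | 56 µs | 56 µs* | 22 µs |
| Sign | ~60 µs | 58 µs | 58 µs* | 41 µs |
| Verify | ~100 µs | 182 µs | 182 µs* | 47 µs |
| ECDH | ~120 µs | 119 µs | 119 µs* | N/A |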

*Note: The AVX2/BMI2 assembly optimizations are currently implemented for field operations only and require additional integration work before they show speedups at the high-level API. This is why the P256K1 ASM column currently matches the Pure Go column. The assembly code is available in field_amd64_bmi2.s.
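To make the relative differences in the summary table easier to read, the following sketch computes speed ratios from the published figures. The numbers are copied from the table; the type and function names (`timings`, `ratio`) are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// timings holds per-operation latencies in microseconds, copied from the
// summary table. A zero value marks an entry reported as N/A.
type timings struct {
	op      string
	btcec   float64
	p256k1  float64
	libsecp float64
}

// ratio returns slow/fast, i.e. how many times faster the second figure is.
// It returns 0 when either figure is missing.
func ratio(slow, fast float64) float64 {
	if slow == 0 || fast == 0 {
		return 0
	}
	return slow / fast
}

func main() {
	rows := []timings{
		{"Pubkey Derivation", 50, 56, 22},
		{"Sign", 60, 58, 41},
		{"Verify", 100, 182, 47},
		{"ECDH", 120, 119, 0},
	}
	for _, r := range rows {
		fmt.Printf("%-18s btcec/libsecp=%.2fx  p256k1/libsecp=%.2fx  p256k1/btcec=%.2fx\n",
			r.op,
			ratio(r.btcec, r.libsecp),
			ratio(r.p256k1, r.libsecp),
			ratio(r.p256k1, r.btcec))
	}
}
```

With these figures, libsecp256k1 is roughly 1.4x faster than P256K1 for signing and roughly 3.9x faster for verification; a ratio of 0 in the output means the operation was not measured for that library.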

Detailed Results

btcec/v2

The btcec library is the widely used pure Go implementation from the btcd project:

| Operation | Time per op |
|---|---|
| Pubkey Derivation | ~50 µs |
| Schnorr Sign | ~60 µs |
| Schnorr Verify | ~100 µs |
| ECDH | ~120 µs |

P256K1 Pure Go (AVX2 disabled)

This repository's implementation, measured with SetAVX2Enabled(false):

| Operation | Time per op |
|---|---|
| Pubkey Derivation | 56 µs |
| Schnorr Sign | 58 µs |
| Schnorr Verify | 182 µs |
| ECDH | 119 µs |

P256K1 with ASM/BMI2 (AVX2 enabled)

This repository's implementation, measured with SetAVX2Enabled(true):

| Operation | Time per op | Notes |
|---|---|---|
| Pubkey Derivation | 56 µs | Uses GLV optimization |
| Schnorr Sign | 58 µs | Uses GLV for k*G |
| Schnorr Verify | 182 µs | Signature verification |
| ECDH | 119 µs | Uses GLV for scalar mult |

Field Operation Speedups (Low-level): The BMI2-based field multiplication is available in field_amd64_bmi2.s and provides faster 256-bit modular arithmetic using the MULX instruction.
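As a rough illustration of what MULX contributes, the sketch below performs the same kind of 64x64 to 128-bit widening multiply in portable Go using `math/bits`. It is not the code in field_amd64_bmi2.s, and the helper name `mulWord` is hypothetical; it only shows the limb-by-limb multiply-and-carry pattern that 256-bit field multiplication is built from.

```go
package main

import (
	"fmt"
	"math/bits"
)

// mulWord multiplies a 256-bit value (four little-endian 64-bit limbs) by a
// single 64-bit word and returns the 320-bit product as five limbs.
func mulWord(a [4]uint64, w uint64) [5]uint64 {
	var out [5]uint64
	var carry uint64
	for i := 0; i < 4; i++ {
		// Widening multiply: hi:lo = a[i] * w (the operation MULX provides).
		hi, lo := bits.Mul64(a[i], w)
		var c uint64
		out[i], c = bits.Add64(lo, carry, 0)
		// hi is at most 2^64-2, so adding the carry bit cannot overflow.
		carry = hi + c
	}
	out[4] = carry
	return out
}

func main() {
	max := ^uint64(0)
	product := mulWord([4]uint64{max, max, max, max}, max)
	for i := len(product) - 1; i >= 0; i-- {
		fmt.Printf("%016x ", product[i])
	}
	fmt.Println()
}
```

A full field multiplication repeats this row for each limb of the second operand and then reduces the 512-bit result modulo the field prime. MULX helps here because it leaves the carry flags untouched, so the multiplies and the carry chain can be interleaved.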

libsecp256k1 (Native C via purego)

The fastest option, using the Bitcoin Core C library:

| Operation | Time per op |
|---|---|
| Pubkey Derivation | 22 µs |
| Schnorr Sign | 41 µs |
| Schnorr Verify | 47 µs |
| ECDH | N/A |

Key Optimizations in P256K1

GLV Endomorphism (Primary Speedup)

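The GLV (Gallant-Lambert-Vanstone) endomorphism exploits secp256k1's special curve structure: the map (x, y) → (β·x, y) costs a single field multiplication and is equivalent to multiplying the point by a fixed scalar λ.

A 256-bit scalar k can therefore be split as k = k1 + k2·λ (mod n), with k1 and k2 each about 128 bits, so that k·P = k1·P + k2·(λ·P). This reduces a 256-bit scalar multiplication to two 128-bit multiplications that can share their doublings: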

| Operation | Without GLV | With GLV | Speedup |
|---|---|---|---|
| Generator mult (k*G) | 122 µs | 45 µs | 2.7x |
| Arbitrary point mult | 122 µs | 101 µs | 1.2x (17% less time) |

BMI2 Assembly (Field Operations)

The field_amd64_bmi2.s file contains optimized field-arithmetic assembly built on the BMI2 MULX instruction.

Precomputed Tables

Performance Ranking

From fastest to slowest for typical cryptographic operations:

  1. libsecp256k1 (C) - Best choice when the native library is available
     - Roughly 1.4x to 3.9x faster than the pure Go implementations in the measurements above
     - Uses purego (no CGO required)

  2. btcec/v2 - Good pure Go option
     - Mature, well-tested codebase
     - About 1.8x faster verification than P256K1 (~100 µs vs 182 µs)

  3. P256K1 (This Repo) - GLV-optimized pure Go
     - Competitive signing performance
     - 2.7x faster generator multiplication with GLV
     - Ongoing BMI2 assembly integration

Recommendations

Use libsecp256k1 when:

Use btcec/v2 when:

Use P256K1 when:

Running Benchmarks

# Run all SIMD comparison benchmarks
go test ./bench -bench='BenchmarkBtcec|BenchmarkP256K1PureGo|BenchmarkP256K1ASM|BenchmarkLibSecp256k1' -benchtime=1s -run=^$

# Run specific benchmark category
go test ./bench -bench=BenchmarkBtcec -benchtime=1s -run=^$
go test ./bench -bench=BenchmarkP256K1PureGo -benchtime=1s -run=^$
go test ./bench -bench=BenchmarkP256K1ASM -benchtime=1s -run=^$
go test ./bench -bench=BenchmarkLibSecp256k1 -benchtime=1s -run=^$

# Run internal scalar multiplication benchmarks
go test -bench='BenchmarkEcmultGen|BenchmarkEcmultStraussWNAFGLV' -benchtime=1s
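For readers unfamiliar with how the per-operation figures are obtained, the sketch below shows the general shape of a Go benchmark and how a result converts to µs/op. The workload is a stand-in (SHA-256 hashing), not one of this repository's benchmarks, and the names `benchStandIn` and `microsPerOp` are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchStandIn is a placeholder workload with the same structure as a
// benchmark that `go test -bench` would pick up (func BenchmarkXxx(b *testing.B)).
func benchStandIn(b *testing.B) {
	msg := make([]byte, 32)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		sum := sha256.Sum256(msg)
		msg[0] = sum[0] // feed the result back so the work is not optimized away
	}
}

// microsPerOp converts a benchmark result to microseconds per operation.
func microsPerOp(r testing.BenchmarkResult) float64 {
	if r.N == 0 {
		return 0
	}
	return float64(r.T.Nanoseconds()) / float64(r.N) / 1e3
}

func main() {
	// testing.Benchmark runs the function outside of `go test`,
	// using the default benchmark time of one second.
	r := testing.Benchmark(benchStandIn)
	fmt.Printf("%d iterations, %.3f µs/op\n", r.N, microsPerOp(r))
}
```

Under `go test`, the `-bench` flag selects functions by regular expression and `-run=^$` skips the ordinary unit tests so that only benchmarks execute.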

CPU Feature Detection

The P256K1 implementation automatically detects CPU features:

package main

import (
	"fmt"

	"p256k1.mleku.dev"
)

func main() {
	// Check whether the CPU supports AVX2/BMI2.
	if p256k1.HasAVX2CPU() {
		fmt.Println("AVX2/BMI2 available: optimized path can be used")
	}

	// Manually control AVX2 usage.
	p256k1.SetAVX2Enabled(false) // Force pure Go
	p256k1.SetAVX2Enabled(true)  // Enable AVX2/BMI2 (if available)
}

Future Work

  1. Integrate BMI2 field multiplication into high-level operations
  2. Batch verification using Strauss or Pippenger algorithms
  3. ARM64 optimizations using NEON instructions
  4. WebAssembly SIMD for browser performance