This implementation uses 128-bit limbs with AVX2 256-bit registers for secp256k1 cryptographic operations. The key insight is that AVX2's YMM registers can hold two 128-bit values, enabling efficient parallel processing.
| Type | Size | AVX2 Representation | Registers |
|---|---|---|---|
| Uint128 | 128-bit | 1×128-bit in XMM or half YMM | 0.5 YMM |
| Scalar | 256-bit | 2×128-bit limbs | 1 YMM |
| FieldElement | 256-bit | 2×128-bit limbs | 1 YMM |
| AffinePoint | 512-bit | 2×FieldElement (x, y) | 2 YMM |
| JacobianPoint | 768-bit | 3×FieldElement (x, y, z) | 3 YMM |
Uint128:
[Lo:64][Hi:64] = 128 bits
Scalar/FieldElement (in YMM register):
YMM = [D[0].Lo:64][D[0].Hi:64][D[1].Lo:64][D[1].Hi:64]
├─── 128-bit limb 0 ────┤├─── 128-bit limb 1 ────┤
AffinePoint (2 YMM registers):
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)
JacobianPoint (3 YMM registers):
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)
YMM2 = Z coordinate (256 bits)
File: uint128_amd64.s
- Add — instructions: ADDQ, ADCQ
  - Input: XMM0 (a), XMM1 (b)
  - Output: XMM0 (result), carry flag
- Sub — instructions: SUBQ, SBBQ
- Mul — instructions: MULQ (scalar) or VPMULUDQ (SIMD)
  - The critical operation for field/scalar multiplication
  - Uses Karatsuba or schoolbook multiplication built on VPMULUDQ
File: scalar_amd64.go (stubs), scalar_amd64.s (assembly)
```
Load a into YMM0
Load b into YMM1
VPADDQ YMM0, YMM0, YMM1 ; parallel add of 64-bit lanes
Handle carries between 64-bit lanes
Conditional subtract n if the result >= n
```
- Similar to add but with VPSUBQ and conditional add of n
- Compute the 512-bit product using 128×128 multiplications
- Reduce mod n using Barrett or Montgomery reduction
- The 512-bit intermediate fits in 2 YMM registers
- Negation: compute n - a using subtraction
- Use Fermat's little theorem: a^(n-2) mod n
- Requires efficient square-and-multiply
File: field_amd64.go (stubs), field_amd64.s (assembly)
```
Load a into YMM0
Load b into YMM1
VPADDQ YMM0, YMM0, YMM1
Handle carries
Conditional subtract p if the result >= p
```
- The most performance-critical operation: 256×256→512-bit product, then reduce mod p
- secp256k1 has special structure: p = 2^256 - 2^32 - 977, so 2^256 ≡ 2^32 + 977 (mod p)
- Reduction: multiply the bits at 2^256 and above by (2^32 + 977) and add them back into the lower 256 bits
- Squaring: can save ~25% of the multiplications vs a general multiply
- Fermat: a^(p-2) mod p
- Use an addition chain for efficiency
- p ≡ 3 (mod 4), so sqrt(a) = a^((p+1)/4) mod p
File: point_amd64.go (stubs), point_amd64.s (assembly)
- Requires field inversion
- ~4 field multiplications, ~4 field squarings, ~6 field additions
- All field ops can use the AVX2 versions
- ~12 field multiplications, ~4 field squarings
- ~8 field multiplications, ~3 field squarings
- The common case in scalar multiplication
- Use windowed NAF or GLV decomposition
- Core loop: double + conditional add
- Precompute multiples of the generator G
- Faster than general scalar multiplication
File: ecdsa.go, schnorr.go
YMM0-YMM7: Scratch registers (caller-saved)
YMM8-YMM15: Also available; the Go assembly ABI (like the System V AMD64 ABI) treats all vector registers as caller-saved, so none need to be preserved across calls
For our operations:
YMM0: Primary operand/result
YMM1: Secondary operand
YMM2-YMM5: Intermediate calculations
YMM6-YMM7: Constants (field prime, masks, etc.)
; Data movement
VMOVDQU YMM0, [mem] ; Load 256 bits unaligned
VMOVDQA YMM0, [mem] ; Load 256 bits aligned
VBROADCASTI128 YMM0, [mem] ; Broadcast 128-bit to both lanes
; Arithmetic
VPADDQ YMM0, YMM1, YMM2 ; Add packed 64-bit integers
VPSUBQ YMM0, YMM1, YMM2 ; Subtract packed 64-bit integers
VPMULUDQ YMM0, YMM1, YMM2 ; Multiply low 32 bits of each 64-bit lane, giving full 64-bit products
; Logical
VPAND YMM0, YMM1, YMM2 ; Bitwise AND
VPOR YMM0, YMM1, YMM2 ; Bitwise OR
VPXOR YMM0, YMM1, YMM2 ; Bitwise XOR
; Shifts
VPSLLQ YMM0, YMM1, imm ; Shift left logical 64-bit
VPSRLQ YMM0, YMM1, imm ; Shift right logical 64-bit
; Shuffles and permutes
VPERMQ YMM0, YMM1, imm ; Permute 64-bit elements
VPERM2I128 YMM0, YMM1, YMM2, imm ; Permute 128-bit lanes
VPALIGNR YMM0, YMM1, YMM2, imm ; Byte align
; Comparisons
VPCMPEQQ YMM0, YMM1, YMM2 ; Compare equal 64-bit
VPCMPGTQ YMM0, YMM1, YMM2 ; Compare greater than 64-bit
; Blending
VPBLENDVB YMM0, YMM1, YMM2, YMM3 ; Conditional blend
The tricky part of 128-bit limb arithmetic is carry propagation between the 64-bit halves and between the two 128-bit limbs.
Given: A = [A0.Lo, A0.Hi, A1.Lo, A1.Hi] (256 bits as 4×64)
B = [B0.Lo, B0.Hi, B1.Lo, B1.Hi]
Step 1: Add with VPADDQ (no carries)
R = A + B (per-lane, ignoring overflow)
Step 2: Detect carries
carry_0_to_1 = (R0.Lo < A0.Lo) ? 1 : 0 ; carry from Lo to Hi in limb 0
carry_1_to_2 = (R0.Hi < A0.Hi) ? 1 : 0 ; carry from limb 0 to limb 1
carry_2_to_3 = (R1.Lo < A1.Lo) ? 1 : 0 ; carry within limb 1
carry_out = (R1.Hi < A1.Hi) ? 1 : 0 ; overflow
Step 3: Propagate carries
R0.Hi += carry_0_to_1
R1.Lo += carry_1_to_2 + (R0.Hi < carry_0_to_1 ? 1 : 0)
R1.Hi += carry_2_to_3 + (R1.Lo overflowed in the previous step ? 1 : 0)
This is complex in SIMD. Alternative: use ADCX/ADOX instructions (ADX extension) for scalar carry chains, which may be faster for sequential operations.
For 128×128→256 multiplication:
A = A.Hi * 2^64 + A.Lo
B = B.Hi * 2^64 + B.Lo
A * B = A.Hi*B.Hi * 2^128
+ (A.Hi*B.Lo + A.Lo*B.Hi) * 2^64
+ A.Lo*B.Lo
Using MULX (BMI2) for efficient 64×64→128 multiplication — MULX reads one operand implicitly from RDX, so the shared factor is loaded once:
MOV RDX, B.Lo ; shared operand for the next two multiplies
MULX r1, r0, A.Lo ; r1:r0 = A.Lo * B.Lo
MULX r3, r2, A.Hi ; r3:r2 = A.Hi * B.Lo
... (4 multiplications total, then accumulate)
avx/
├── IMPLEMENTATION_PLAN.md (this file)
├── types.go (type definitions)
├── uint128.go (pure Go fallback)
├── uint128_amd64.go (Go stubs for assembly)
├── uint128_amd64.s (AVX2 assembly)
├── scalar.go (pure Go fallback)
├── scalar_amd64.go (Go stubs)
├── scalar_amd64.s (AVX2 assembly)
├── field.go (pure Go fallback)
├── field_amd64.go (Go stubs)
├── field_amd64.s (AVX2 assembly)
├── point.go (pure Go fallback)
├── point_amd64.go (Go stubs)
├── point_amd64.s (AVX2 assembly)
├── avx_test.go (tests)
└── bench_test.go (benchmarks)
Compared to the current pure Go implementation: