This implementation uses 128-bit limbs with AVX2 256-bit registers for secp256k1 cryptographic operations. The key insight is that AVX2's YMM registers can hold two 128-bit values, enabling efficient parallel processing.
| Type | Size | AVX2 Representation | Registers |
|---|---|---|---|
| Uint128 | 128-bit | 1×128-bit in XMM or half YMM | 0.5 YMM |
| Scalar | 256-bit | 2×128-bit limbs | 1 YMM |
| FieldElement | 256-bit | 2×128-bit limbs | 1 YMM |
| AffinePoint | 512-bit | 2×FieldElement (x, y) | 2 YMM |
| JacobianPoint | 768-bit | 3×FieldElement (x, y, z) | 3 YMM |
Uint128:
[Lo:64][Hi:64] = 128 bits
Scalar/FieldElement (in YMM register):
YMM = [D[0].Lo:64][D[0].Hi:64][D[1].Lo:64][D[1].Hi:64]
├─── 128-bit limb 0 ────┤├─── 128-bit limb 1 ────┤
AffinePoint (2 YMM registers):
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)
JacobianPoint (3 YMM registers):
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)
YMM2 = Z coordinate (256 bits)
File: uint128_amd64.s
- Add — instructions: ADDQ, ADCQ
  - Input: XMM0 (a), XMM1 (b)
  - Output: XMM0 (result), carry flag
- Sub — instructions: SUBQ, SBBQ
- Mul — instructions: MULQ (scalar) or VPMULUDQ (SIMD)
  - The critical operation for field/scalar multiplication
  - Uses Karatsuba or schoolbook multiplication built on VPMULUDQ
File: scalar_amd64.go (stubs), scalar_amd64.s (assembly)
```
Load a into YMM0
Load b into YMM1
VPADDQ YMM0, YMM0, YMM1 ; parallel add of 64-bit lanes
Handle carries between 64-bit lanes
Conditional subtract n if the result >= n
```
- Similar to add but with VPSUBQ and conditional add of n
- Compute the 512-bit product using 128×128 multiplications
- Reduce mod n using Barrett or Montgomery reduction
- The 512-bit intermediate fits in 2 YMM registers
- Negation: compute n - a using subtraction
- Use Fermat's little theorem: a^(n-2) mod n
- Requires efficient square-and-multiply
File: field_amd64.go (stubs), field_amd64.s (assembly)
```
Load a into YMM0
Load b into YMM1
VPADDQ YMM0, YMM0, YMM1
Handle carries
Conditional subtract p if the result >= p
```
- The most performance-critical operation: 256×256→512-bit product, then reduce mod p
- secp256k1 has special structure: p = 2^256 - 2^32 - 977, so 2^256 ≡ 2^32 + 977 (mod p)
- Reduction: multiply the bits at 2^256 and above by (2^32 + 977) and add them back into the lower 256 bits
- Squaring: can save ~25% of the multiplications vs a general multiply
- Fermat: a^(p-2) mod p
- Use an addition chain for efficiency
- p ≡ 3 (mod 4), so sqrt(a) = a^((p+1)/4) mod p
File: point_amd64.go (stubs), point_amd64.s (assembly)
- Requires field inversion
- ~4 field multiplications, ~4 field squarings, ~6 field additions
- All field ops can use the AVX2 versions
- ~12 field multiplications, ~4 field squarings
- ~8 field multiplications, ~3 field squarings
- The common case in scalar multiplication
- Use windowed NAF or GLV decomposition
- Core loop: double + conditional add
- Precompute multiples of the generator G
- Faster than general scalar multiplication
File: ecdsa.go, schnorr.go
YMM0-YMM7: Scratch registers (caller-saved)
YMM8-YMM15: Also available; the Go assembly ABI (like the System V AMD64 ABI) treats all vector registers as caller-saved, so none need to be preserved across calls
For our operations:
YMM0: Primary operand/result
YMM1: Secondary operand
YMM2-YMM5: Intermediate calculations
YMM6-YMM7: Constants (field prime, masks, etc.)
; Data movement
VMOVDQU YMM0, [mem] ; Load 256 bits unaligned
VMOVDQA YMM0, [mem] ; Load 256 bits aligned
VBROADCASTI128 YMM0, [mem] ; Broadcast 128-bit to both lanes
; Arithmetic
VPADDQ YMM0, YMM1, YMM2 ; Add packed 64-bit integers
VPSUBQ YMM0, YMM1, YMM2 ; Subtract packed 64-bit integers
VPMULUDQ YMM0, YMM1, YMM2 ; Multiply low 32 bits of each 64-bit lane, giving full 64-bit products
; Logical
VPAND YMM0, YMM1, YMM2 ; Bitwise AND
VPOR YMM0, YMM1, YMM2 ; Bitwise OR
VPXOR YMM0, YMM1, YMM2 ; Bitwise XOR
; Shifts
VPSLLQ YMM0, YMM1, imm ; Shift left logical 64-bit
VPSRLQ YMM0, YMM1, imm ; Shift right logical 64-bit
; Shuffles and permutes
VPERMQ YMM0, YMM1, imm ; Permute 64-bit elements
VPERM2I128 YMM0, YMM1, YMM2, imm ; Permute 128-bit lanes
VPALIGNR YMM0, YMM1, YMM2, imm ; Byte align
; Comparisons
VPCMPEQQ YMM0, YMM1, YMM2 ; Compare equal 64-bit
VPCMPGTQ YMM0, YMM1, YMM2 ; Compare greater than 64-bit
; Blending
VPBLENDVB YMM0, YMM1, YMM2, YMM3 ; Conditional blend
The tricky part of 128-bit limb arithmetic is carry propagation between the 64-bit halves and between the two 128-bit limbs.
Given: A = [A0.Lo, A0.Hi, A1.Lo, A1.Hi] (256 bits as 4×64)
B = [B0.Lo, B0.Hi, B1.Lo, B1.Hi]
Step 1: Add with VPADDQ (no carries)
R = A + B (per-lane, ignoring overflow)
Step 2: Detect carries
carry_0_to_1 = (R0.Lo < A0.Lo) ? 1 : 0 ; carry from Lo to Hi in limb 0
carry_1_to_2 = (R0.Hi < A0.Hi) ? 1 : 0 ; carry from limb 0 to limb 1
carry_2_to_3 = (R1.Lo < A1.Lo) ? 1 : 0 ; carry within limb 1
carry_out = (R1.Hi < A1.Hi) ? 1 : 0 ; overflow
Step 3: Propagate carries
R0.Hi += carry_0_to_1
R1.Lo += carry_1_to_2 + (R0.Hi < carry_0_to_1 ? 1 : 0)
R1.Hi += carry_2_to_3 + (R1.Lo overflowed in the previous step ? 1 : 0)
This is complex in SIMD. Alternative: use ADCX/ADOX instructions (ADX extension) for scalar carry chains, which may be faster for sequential operations.
For 128×128→256 multiplication:
A = A.Hi * 2^64 + A.Lo
B = B.Hi * 2^64 + B.Lo
A * B = A.Hi*B.Hi * 2^128
+ (A.Hi*B.Lo + A.Lo*B.Hi) * 2^64
+ A.Lo*B.Lo
Using MULX (BMI2) for efficient 64×64→128 multiplication — MULX reads one operand implicitly from RDX, so the shared factor is loaded once:
MOV RDX, B.Lo ; shared operand for the next two multiplies
MULX r1, r0, A.Lo ; r1:r0 = A.Lo * B.Lo
MULX r3, r2, A.Hi ; r3:r2 = A.Hi * B.Lo
... (4 multiplications total, then accumulate)
avx/
├── IMPLEMENTATION_PLAN.md (this file)
├── types.go (type definitions)
├── uint128.go (pure Go fallback)
├── uint128_amd64.go (Go stubs for assembly)
├── uint128_amd64.s (AVX2 assembly)
├── scalar.go (pure Go fallback)
├── scalar_amd64.go (Go stubs)
├── scalar_amd64.s (AVX2 assembly)
├── field.go (pure Go fallback)
├── field_amd64.go (Go stubs)
├── field_amd64.s (AVX2 assembly)
├── point.go (pure Go fallback)
├── point_amd64.go (Go stubs)
├── point_amd64.s (AVX2 assembly)
├── avx_test.go (tests)
└── bench_test.go (benchmarks)
Compared to the current pure Go implementation: