# AVX2 secp256k1 Implementation Plan

## Overview

This implementation uses 128-bit limbs with AVX2 256-bit registers for secp256k1 cryptographic operations. The key insight is that AVX2's YMM registers can hold two 128-bit values, enabling efficient parallel processing.

## Data Layout

### Register Mapping

| Type | Size | AVX2 Representation | Registers |
|------|------|---------------------|-----------|
| Uint128 | 128-bit | 1×128-bit in XMM or half YMM | 0.5 YMM |
| Scalar | 256-bit | 2×128-bit limbs | 1 YMM |
| FieldElement | 256-bit | 2×128-bit limbs | 1 YMM |
| AffinePoint | 512-bit | 2×FieldElement (x, y) | 2 YMM |
| JacobianPoint | 768-bit | 3×FieldElement (x, y, z) | 3 YMM |

### Memory Layout

```
Uint128: [Lo:64][Hi:64] = 128 bits

Scalar/FieldElement (in YMM register):
YMM = [D[0].Lo:64][D[0].Hi:64][D[1].Lo:64][D[1].Hi:64]
      ├─── 128-bit limb 0 ───┤├─── 128-bit limb 1 ───┤

AffinePoint (2 YMM registers):
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)

JacobianPoint (3 YMM registers):
YMM0 = X coordinate (256 bits)
YMM1 = Y coordinate (256 bits)
YMM2 = Z coordinate (256 bits)
```

## Implementation Phases

### Phase 1: Core 128-bit Operations

File: `uint128_amd64.s`

1. **uint128Add** - Add two 128-bit values with carry out
   - Instructions: `ADDQ`, `ADCQ`
   - Input: XMM0 (a), XMM1 (b)
   - Output: XMM0 (result), carry flag
2. **uint128Sub** - Subtract with borrow
   - Instructions: `SUBQ`, `SBBQ`
3. **uint128Mul** - Multiply two 64-bit values into a 128-bit result
   - Instructions: `MULQ` (scalar) or `VPMULUDQ` (SIMD)
4. **uint128Mul128** - Full 128×128→256 multiplication
   - This is the critical operation for field/scalar multiplication
   - Uses Karatsuba or schoolbook with `VPMULUDQ`

### Phase 2: Scalar Operations (mod n)

Files: `scalar_amd64.go` (stubs), `scalar_amd64.s` (assembly)

1. **ScalarAdd** - Add two scalars mod n

   ```
   Load a into YMM0
   Load b into YMM1
   VPADDQ YMM0, YMM0, YMM1   ; parallel add of 64-bit lanes
   Handle carries between 64-bit lanes
   Conditional subtract n if >= n
   ```

2. **ScalarSub** - Subtract scalars mod n
   - Similar to add, but with `VPSUBQ` and a conditional add of n
3. **ScalarMul** - Multiply scalars mod n
   - Compute the 512-bit product using 128×128 multiplications
   - Reduce mod n using Barrett or Montgomery reduction
   - The 512-bit intermediate fits in 2 YMM registers
4. **ScalarNegate** - Compute -a mod n
   - `n - a` using subtraction
5. **ScalarInverse** - Compute a^(-1) mod n
   - Use Fermat's little theorem: a^(n-2) mod n
   - Requires efficient square-and-multiply
6. **ScalarIsZero**, **ScalarIsHigh**, **ScalarEqual** - Comparisons

### Phase 3: Field Operations (mod p)

Files: `field_amd64.go` (stubs), `field_amd64.s` (assembly)

1. **FieldAdd** - Add two field elements mod p (see the pure-Go sketch after this list)

   ```
   Load a into YMM0
   Load b into YMM1
   VPADDQ YMM0, YMM0, YMM1
   Handle carries
   Conditional subtract p if >= p
   ```

2. **FieldSub** - Subtract field elements mod p
3. **FieldMul** - Multiply field elements mod p
   - Most critical operation for performance
   - 256×256→512-bit product, then reduce mod p
   - secp256k1 has special structure: p = 2^256 - 2^32 - 977
   - Reduction: since 2^256 ≡ 2^32 + 977 (mod p), fold the bits above 2^256 back in by multiplying them by (2^32 + 977) and adding to the low 256 bits
4. **FieldSqr** - Square a field element (optimized mul(a, a))
   - Saves ~25% of the multiplications vs a general multiply, since each cross product appears twice and can be computed once and doubled
5. **FieldInv** - Compute a^(-1) mod p
   - Fermat: a^(p-2) mod p
   - Use an addition chain for efficiency
6. **FieldSqrt** - Compute a square root mod p
   - p ≡ 3 (mod 4), so sqrt(a) = a^((p+1)/4) mod p
7. **FieldNegate**, **FieldIsZero**, **FieldEqual** - Basic operations
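The conditional subtraction above can be done branch-free. As a reference for the pure Go fallback (`field.go`), here is a minimal sketch of **FieldAdd**, assuming a 4×64-bit little-endian limb layout rather than the 2×128-bit AVX2 layout; the `fieldAdd` and `fieldP` names are illustrative, not an existing API:

```go
package avx

import "math/bits"

// fieldP is p = 2^256 - 2^32 - 977 as little-endian 64-bit limbs.
var fieldP = [4]uint64{
	0xFFFFFFFEFFFFFC2F,
	0xFFFFFFFFFFFFFFFF,
	0xFFFFFFFFFFFFFFFF,
	0xFFFFFFFFFFFFFFFF,
}

// fieldAdd returns (a + b) mod p without branches, assuming a, b < p.
func fieldAdd(a, b [4]uint64) [4]uint64 {
	var r, s [4]uint64

	// Full 256-bit add; c is the carry out of bit 255.
	var c uint64
	r[0], c = bits.Add64(a[0], b[0], 0)
	r[1], c = bits.Add64(a[1], b[1], c)
	r[2], c = bits.Add64(a[2], b[2], c)
	r[3], c = bits.Add64(a[3], b[3], c)

	// Trial subtraction of p; bw == 0 means r >= p.
	var bw uint64
	s[0], bw = bits.Sub64(r[0], fieldP[0], 0)
	s[1], bw = bits.Sub64(r[1], fieldP[1], bw)
	s[2], bw = bits.Sub64(r[2], fieldP[2], bw)
	s[3], bw = bits.Sub64(r[3], fieldP[3], bw)

	// Select s when the add overflowed 2^256 or r >= p, else keep r.
	mask := -(c | (bw ^ 1)) // all-ones selects s, zero selects r
	for i := range r {
		r[i] = (r[i] &^ mask) | (s[i] & mask)
	}
	return r
}
```

The mask select at the end is the scalar analogue of what a `VPBLENDVB` would do in the AVX2 path.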
### Phase 4: Point Operations

Files: `point_amd64.go` (stubs), `point_amd64.s` (assembly)

1. **AffineToJacobian** - Convert (x, y) to (x, y, 1)
2. **JacobianToAffine** - Convert (X, Y, Z) to (X/Z², Y/Z³)
   - Requires a field inversion
3. **JacobianDouble** - Point doubling
   - ~4 field multiplications, ~4 field squarings, ~6 field additions
   - All field ops can use the AVX2 versions
4. **JacobianAdd** - Add two Jacobian points
   - ~12 field multiplications, ~4 field squarings
5. **JacobianAddAffine** - Add Jacobian + affine point (optimized)
   - ~8 field multiplications, ~3 field squarings
   - The common case in scalar multiplication
6. **ScalarMult** - Compute k*P for scalar k and point P
   - Use windowed NAF or GLV decomposition
   - Core loop: double + conditional add
7. **ScalarBaseMult** - Compute k*G using a precomputed table
   - Precompute multiples of the generator G
   - Faster than general scalar multiplication

### Phase 5: High-Level Operations

Files: `ecdsa.go`, `schnorr.go`

1. **ECDSA Sign/Verify**
2. **Schnorr Sign/Verify** (BIP-340)
3. **ECDH** - Shared-secret computation

## Assembly Conventions

### Register Usage

```
YMM0-YMM7:  Scratch registers (caller-saved)
YMM8-YMM15: Can be used but should be preserved

For our operations:
YMM0:      Primary operand/result
YMM1:      Secondary operand
YMM2-YMM5: Intermediate calculations
YMM6-YMM7: Constants (field prime, masks, etc.)
```

### Key AVX2 Instructions

```asm
; Data movement
VMOVDQU        YMM0, [mem]           ; Load 256 bits, unaligned
VMOVDQA        YMM0, [mem]           ; Load 256 bits, aligned
VBROADCASTI128 YMM0, [mem]           ; Broadcast 128 bits to both lanes

; Arithmetic
VPADDQ   YMM0, YMM1, YMM2            ; Add packed 64-bit integers
VPSUBQ   YMM0, YMM1, YMM2            ; Subtract packed 64-bit integers
VPMULUDQ YMM0, YMM1, YMM2            ; Multiply low 32 bits of each 64-bit lane (32×32→64)

; Logical
VPAND YMM0, YMM1, YMM2               ; Bitwise AND
VPOR  YMM0, YMM1, YMM2               ; Bitwise OR
VPXOR YMM0, YMM1, YMM2               ; Bitwise XOR

; Shifts
VPSLLQ YMM0, YMM1, imm               ; Shift left logical, 64-bit lanes
VPSRLQ YMM0, YMM1, imm               ; Shift right logical, 64-bit lanes

; Shuffles and permutes
VPERMQ     YMM0, YMM1, imm           ; Permute 64-bit elements
VPERM2I128 YMM0, YMM1, YMM2, imm     ; Permute 128-bit lanes
VPALIGNR   YMM0, YMM1, YMM2, imm     ; Byte align

; Comparisons
VPCMPEQQ YMM0, YMM1, YMM2            ; Compare equal, 64-bit lanes
VPCMPGTQ YMM0, YMM1, YMM2            ; Compare greater-than (signed), 64-bit lanes

; Blending
VPBLENDVB YMM0, YMM1, YMM2, YMM3     ; Conditional byte blend
```

## Carry Propagation Strategy

The tricky part of 128-bit limb arithmetic is carry propagation between the 64-bit halves and between the two 128-bit limbs.

### Addition Carry Chain

```
Given: A = [A0.Lo, A0.Hi, A1.Lo, A1.Hi]   (256 bits as 4×64)
       B = [B0.Lo, B0.Hi, B1.Lo, B1.Hi]

Step 1: Add with VPADDQ (no carries)
  R = A + B   (per-lane, ignoring overflow)

Step 2: Detect carries
  carry_0_to_1 = (R0.Lo < A0.Lo) ? 1 : 0   ; carry from Lo to Hi in limb 0
  carry_1_to_2 = (R0.Hi < A0.Hi) ? 1 : 0   ; carry from limb 0 to limb 1
  carry_2_to_3 = (R1.Lo < A1.Lo) ? 1 : 0   ; carry within limb 1
  carry_out    = (R1.Hi < A1.Hi) ? 1 : 0   ; overflow

Step 3: Propagate carries
  R0.Hi += carry_0_to_1
  R1.Lo += carry_1_to_2 + (R0.Hi < carry_0_to_1 ? 1 : 0)
  R1.Hi += carry_2_to_3 + ...
```

This is complex in SIMD; note also that the per-lane `R < A` tests need a workaround, since AVX2's `VPCMPGTQ` is a signed compare (XOR both operands with 2^63 first to turn it into an unsigned compare). Alternative: use `ADCX`/`ADOX` instructions (ADX extension) for scalar carry chains, which may be faster for sequential operations.

### Multiplication Strategy

For 128×128→256 multiplication:

```
A = A.Hi * 2^64 + A.Lo
B = B.Hi * 2^64 + B.Lo

A * B = A.Hi*B.Hi * 2^128
      + (A.Hi*B.Lo + A.Lo*B.Hi) * 2^64
      + A.Lo*B.Lo

Using MULX (BMI2) for efficient 64×64→128 multiplies
(MULX reads its second source operand implicitly from RDX):

MOV  RDX, B.Lo
MULX r1, r0, A.Lo    ; r1:r0 = A.Lo * B.Lo
MULX r3, r2, A.Hi    ; r3:r2 = A.Hi * B.Lo
... (4 multiplications total, then accumulate)
```
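As a cross-check for the accumulation step, here is a pure-Go sketch of the schoolbook 128×128→256 product (the `uint128Mul128` operation from Phase 1). The `Uint128` struct and `mul128` helper are illustrative names; `bits.Mul64` compiles to a single 64×64→128 multiply instruction on amd64:

```go
package avx

import "math/bits"

// Uint128 matches the [Lo:64][Hi:64] memory layout described earlier.
type Uint128 struct {
	Lo, Hi uint64
}

// mul128 returns a*b as a 256-bit value (hi:lo), built from the four
// partial products A.Lo*B.Lo, A.Lo*B.Hi, A.Hi*B.Lo, A.Hi*B.Hi.
func mul128(a, b Uint128) (hi, lo Uint128) {
	llHi, llLo := bits.Mul64(a.Lo, b.Lo)
	lhHi, lhLo := bits.Mul64(a.Lo, b.Hi)
	hlHi, hlLo := bits.Mul64(a.Hi, b.Lo)
	hhHi, hhLo := bits.Mul64(a.Hi, b.Hi)

	lo.Lo = llLo

	// 2^64 column: llHi + lhLo + hlLo; c1, c2 carry into the next column.
	var c1, c2 uint64
	lo.Hi, c1 = bits.Add64(llHi, lhLo, 0)
	lo.Hi, c2 = bits.Add64(lo.Hi, hlLo, 0)

	// 2^128 column: lhHi + hlHi + hhLo, plus the two carries from below.
	var c3, c4 uint64
	hi.Lo, c3 = bits.Add64(lhHi, hlHi, c1)
	hi.Lo, c4 = bits.Add64(hi.Lo, hhLo, c2)

	// 2^192 column: cannot overflow, since the full product fits in 256 bits.
	hi.Hi = hhHi + c3 + c4
	return hi, lo
}
```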
## Testing Strategy

1. **Unit tests for each operation**, comparing against the reference (main package)
2. **Edge cases**: zero, one, max values, values near the modulus
3. **Random tests**: generate random inputs, compare results
4. **Benchmark comparisons**: AVX2 vs the pure Go implementation

## File Structure

```
avx/
├── IMPLEMENTATION_PLAN.md (this file)
├── types.go (type definitions)
├── uint128.go (pure Go fallback)
├── uint128_amd64.go (Go stubs for assembly)
├── uint128_amd64.s (AVX2 assembly)
├── scalar.go (pure Go fallback)
├── scalar_amd64.go (Go stubs)
├── scalar_amd64.s (AVX2 assembly)
├── field.go (pure Go fallback)
├── field_amd64.go (Go stubs)
├── field_amd64.s (AVX2 assembly)
├── point.go (pure Go fallback)
├── point_amd64.go (Go stubs)
├── point_amd64.s (AVX2 assembly)
├── avx_test.go (tests)
└── bench_test.go (benchmarks)
```

## Performance Targets

Compared to the current pure Go implementation:

- Scalar multiplication: 2-3x faster
- Field multiplication: 2-4x faster
- Point operations: 2-3x faster (dominated by field ops)
- ECDSA sign/verify: 2-3x faster overall

## Dependencies

- Go 1.21+
- A CPU with AVX2 support (Intel Haswell+, AMD Excavator+); see the runtime detection sketch below
- Optional: BMI2 for the MULX instruction (faster 64×64→128 multiply)
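Since the assembly paths are optional at runtime, the stub files need a CPU feature check before dispatching to them. A minimal sketch using `golang.org/x/sys/cpu`; the flag names are illustrative:

```go
package avx

import "golang.org/x/sys/cpu"

// Checked once at package init. The *_amd64.go stub files would branch
// on these flags to choose between the AVX2 assembly and the pure Go
// fallback; hasBMI2 additionally gates any MULX-based routines.
var (
	hasAVX2 = cpu.X86.HasAVX2
	hasBMI2 = cpu.X86.HasBMI2
)
```

Branching on package-level flags keeps the per-call cost to a single predictable test instead of re-querying CPUID on every operation.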