# Implementation Plan: wNAF + GLV Endomorphism Optimization ## Overview This plan details implementing the GLV (Gallant-Lambert-Vanstone) endomorphism optimization combined with wNAF (windowed Non-Adjacent Form) for secp256k1 scalar multiplication, based on: - The IACR paper "SIMD acceleration of EC operations" (eprint.iacr.org/2021/1151) - The libsecp256k1 C implementation in `src/ecmult_impl.h` and `src/scalar_impl.h` ### Expected Performance Gain - **50% reduction** in scalar multiplication time by processing two 128-bit scalars instead of one 256-bit scalar - The GLV endomorphism exploits secp256k1's special structure: λ·(x,y) = (β·x, y) --- ## Phase 1: Constants and Basic Infrastructure ### Step 1.1: Add GLV Constants to scalar.go Add the following constants that are already defined in the C implementation: ```go // Lambda: cube root of unity mod n (group order) // λ^3 ≡ 1 (mod n), and λ^2 + λ + 1 ≡ 0 (mod n) var scalarLambda = Scalar{ d: [4]uint64{ 0xDF02967C1B23BD72, // limb 0 0x122E22EA20816678, // limb 1 0xA5261C028812645A, // limb 2 0x5363AD4CC05C30E0, // limb 3 }, } // Constants for scalar splitting (from libsecp256k1 scalar_impl.h lines 142-157) var scalarMinusB1 = Scalar{ d: [4]uint64{0x6F547FA90ABFE4C3, 0xE4437ED6010E8828, 0, 0}, } var scalarMinusB2 = Scalar{ d: [4]uint64{0xD765CDA83DB1562C, 0x8A280AC50774346D, 0xFFFFFFFFFFFFFFFE, 0xFFFFFFFFFFFFFFFF}, } var scalarG1 = Scalar{ d: [4]uint64{0xE893209A45DBB031, 0x3DAA8A1471E8CA7F, 0xE86C90E49284EB15, 0x3086D221A7D46BCD}, } var scalarG2 = Scalar{ d: [4]uint64{0x1571B4AE8AC47F71, 0x221208AC9DF506C6, 0x6F547FA90ABFE4C4, 0xE4437ED6010E8828}, } ``` **Files to modify:** `scalar.go` **Tests:** Add unit tests comparing with known C test vectors --- ### Step 1.2: Add Beta Constant to field.go Add the field element β (cube root of unity mod p): ```go // Beta: cube root of unity mod p (field order) // β^3 ≡ 1 (mod p), and β^2 + β + 1 ≡ 0 (mod p) // This enables: λ·(x,y) = (β·x, y) on secp256k1 var fieldBeta = FieldElement{ // In 5×52-bit representation n: [5]uint64{...}, // Derived from: 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee } ``` **Files to modify:** `field.go` **Tests:** Verify β^3 ≡ 1 (mod p) --- ## Phase 2: Scalar Splitting ### Step 2.1: Implement mul_shift_var This function computes `(a * b) >> shift` for scalar splitting: ```go // mulShiftVar computes (a * b) >> shift, returning the result // This is used in GLV scalar splitting where shift is always 384 func (r *Scalar) mulShiftVar(a, b *Scalar, shift uint) { // Compute full 512-bit product // Extract bits [shift, shift+256) as the result } ``` **Reference:** libsecp256k1 `scalar_4x64_impl.h:secp256k1_scalar_mul_shift_var` **Files to modify:** `scalar.go` **Tests:** Test with known inputs and compare with C implementation --- ### Step 2.2: Implement splitLambda The core GLV scalar splitting function: ```go // splitLambda decomposes scalar k into r1, r2 such that: // r1 + λ·r2 ≡ k (mod n) // where r1 and r2 are approximately 128 bits each func splitLambda(r1, r2, k *Scalar) { // c1 = round(k * g1 / 2^384) // c2 = round(k * g2 / 2^384) var c1, c2 Scalar c1.mulShiftVar(k, &scalarG1, 384) c2.mulShiftVar(k, &scalarG2, 384) // r2 = c1*(-b1) + c2*(-b2) c1.mul(&c1, &scalarMinusB1) c2.mul(&c2, &scalarMinusB2) r2.add(&c1, &c2) // r1 = k - r2*λ r1.mul(r2, &scalarLambda) r1.negate(r1) r1.add(r1, k) } ``` **Reference:** libsecp256k1 `scalar_impl.h:secp256k1_scalar_split_lambda` (lines 140-178) **Files to modify:** `scalar.go` **Tests:** - Verify r1 + λ·r2 ≡ k (mod n) - Verify |r1| < 2^128 and |r2| < 2^128 --- ## Phase 3: Point Operations with Endomorphism ### Step 3.1: Implement mulLambda for Points Apply the endomorphism to a point: ```go // mulLambda applies the GLV endomorphism: λ·(x,y) = (β·x, y) func (r *GroupElementAffine) mulLambda(a *GroupElementAffine) { r.x.mul(&a.x, &fieldBeta) r.y = a.y r.infinity = a.infinity } ``` **Reference:** libsecp256k1 `group_impl.h:secp256k1_ge_mul_lambda` (lines 915-922) **Files to modify:** `group.go` **Tests:** Verify λ·G equals expected point --- ### Step 3.2: Implement isHigh for Scalars Check if a scalar is in the upper half of the group order: ```go // isHigh returns true if s > n/2 func (s *Scalar) isHigh() bool { // Compare with n/2 // n = FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141 // n/2 = 7FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF5D576E7357A4501DDFE92F46681B20A0 } ``` **Files to modify:** `scalar.go` **Tests:** Test boundary cases around n/2 --- ## Phase 4: Strauss Algorithm with GLV ### Step 4.1: Implement Odd Multiples Table with Z-Ratios The C implementation uses an efficient method to build odd multiples while tracking Z-coordinate ratios: ```go // buildOddMultiplesTable builds a table of odd multiples [1*a, 3*a, 5*a, ...] // and tracks Z-coordinate ratios for efficient normalization func buildOddMultiplesTable( n int, preA []GroupElementAffine, zRatios []FieldElement, z *FieldElement, a *GroupElementJacobian, ) { // Uses isomorphic curve trick for efficient Jacobian+Affine addition // See ecmult_impl.h lines 73-115 } ``` **Reference:** libsecp256k1 `ecmult_impl.h:secp256k1_ecmult_odd_multiples_table` **Files to modify:** `ecdh.go` or new file `ecmult.go` **Tests:** Verify table correctness --- ### Step 4.2: Implement Table Lookup Functions ```go // tableGetGE retrieves point from table, handling sign func tableGetGE(r *GroupElementAffine, pre []GroupElementAffine, n, w int) { // n is the wNAF digit (can be negative) // Returns pre[(|n|-1)/2], negated if n < 0 } // tableGetGELambda retrieves λ-transformed point from table func tableGetGELambda(r *GroupElementAffine, pre []GroupElementAffine, betaX []FieldElement, n, w int) { // Same as tableGetGE but uses precomputed β*x values } ``` **Reference:** libsecp256k1 `ecmult_impl.h` lines 125-143 **Files to modify:** `ecmult.go` --- ### Step 4.3: Implement Full Strauss-GLV Algorithm This is the main multiplication function: ```go // ecmultStraussWNAF computes r = na*a + ng*G using Strauss algorithm with GLV func ecmultStraussWNAF(r *GroupElementJacobian, a *GroupElementJacobian, na *Scalar, ng *Scalar) { // 1. Split scalars using GLV endomorphism // na = na1 + λ*na2 (where na1, na2 are ~128 bits) // 2. Build odd multiples table for a // Also precompute β*x for λ-transformed lookups // 3. Convert both half-scalars to wNAF representation // wNAF size is 129 bits (128 + 1 for potential overflow) // 4. For generator G: split scalar and use precomputed tables // ng = ng1 + 2^128*ng2 (simple bit split, not GLV) // 5. Main loop (from MSB to LSB): // - Double result // - Add contributions from wNAF digits for na1, na2, ng1, ng2 } ``` **Reference:** libsecp256k1 `ecmult_impl.h:secp256k1_ecmult_strauss_wnaf` (lines 237-347) **Files to modify:** `ecmult.go` **Tests:** Compare results with existing implementation --- ## Phase 5: Generator Precomputation ### Step 5.1: Precompute Generator Tables For maximum performance, precompute tables for G and 2^128*G: ```go // preG contains precomputed odd multiples of G for window size WINDOW_G // preG[i] = (2*i+1)*G for i = 0 to (1 << (WINDOW_G-2)) - 1 var preG [1 << (WINDOW_G - 2)]GroupElementStorage // preG128 contains precomputed odd multiples of 2^128*G var preG128 [1 << (WINDOW_G - 2)]GroupElementStorage ``` **Options:** 1. Generate at init() time (slower startup, no code bloat) 2. Generate with go:generate and embed (faster startup, larger binary) **Files to modify:** New file `ecmult_gen_table.go` or `precomputed.go` --- ### Step 5.2: Optimize Generator Multiplication ```go // ecmultGen computes r = ng*G using precomputed tables func ecmultGen(r *GroupElementJacobian, ng *Scalar) { // Split ng = ng1 + 2^128*ng2 // Use preG for ng1 lookups // Use preG128 for ng2 lookups // Combine using Strauss algorithm } ``` --- ## Phase 6: Integration and Testing ### Step 6.1: Update Public APIs Update the main multiplication functions to use the new implementation: ```go // Ecmult computes r = na*a + ng*G func Ecmult(r *GroupElementJacobian, a *GroupElementJacobian, na, ng *Scalar) { ecmultStraussWNAF(r, a, na, ng) } // EcmultGen computes r = ng*G (generator multiplication only) func EcmultGen(r *GroupElementJacobian, ng *Scalar) { ecmultGen(r, ng) } ``` --- ### Step 6.2: Comprehensive Testing 1. **Correctness tests:** - Compare with existing slow implementation - Test edge cases (zero scalar, infinity point, scalar = n-1) - Test with random scalars 2. **Property tests:** - Verify r1 + λ·r2 ≡ k (mod n) for splitLambda - Verify λ·(x,y) = (β·x, y) for mulLambda - Verify β^3 ≡ 1 (mod p) - Verify λ^3 ≡ 1 (mod n) 3. **Cross-validation:** - Compare with btcec or other Go implementations - Test vectors from libsecp256k1 --- ### Step 6.3: Benchmarking Add comprehensive benchmarks: ```go func BenchmarkEcmultStraussGLV(b *testing.B) { // Benchmark new GLV implementation } func BenchmarkEcmultOld(b *testing.B) { // Benchmark old implementation for comparison } func BenchmarkScalarSplitLambda(b *testing.B) { // Benchmark scalar splitting } ``` --- ## Implementation Order The recommended order minimizes dependencies: | Step | Description | Dependencies | Estimated Complexity | |------|-------------|--------------|---------------------| | 1.1 | Add GLV scalar constants | None | Low | | 1.2 | Add Beta field constant | None | Low | | 2.1 | Implement mulShiftVar | None | Medium | | 2.2 | Implement splitLambda | 1.1, 2.1 | Medium | | 3.1 | Implement mulLambda for points | 1.2 | Low | | 3.2 | Implement isHigh | None | Low | | 4.1 | Build odd multiples table | None | Medium | | 4.2 | Table lookup functions | 4.1 | Low | | 4.3 | Full Strauss-GLV algorithm | 2.2, 3.1, 3.2, 4.1, 4.2 | High | | 5.1 | Generator precomputation | 4.1 | Medium | | 5.2 | Optimized generator mult | 5.1 | Medium | | 6.x | Testing and integration | All above | Medium | --- ## Key Differences from Current Implementation The current Go implementation in `ecdh.go` has: - Basic wNAF conversion (`scalar.go:wNAF`) - Simple Strauss without GLV (`ecdh.go:ecmultStraussGLV` - misnamed, doesn't use GLV) - Windowed multiplication without endomorphism The new implementation adds: - GLV scalar splitting (reduces 256-bit to two 128-bit multiplications) - β-multiplication for point transformation - Combined processing of original and λ-transformed points - Precomputed generator tables for faster G multiplication --- ## References 1. **libsecp256k1 source:** - `src/scalar_impl.h` - GLV constants and splitLambda - `src/ecmult_impl.h` - Strauss algorithm with wNAF - `src/field.h` - Beta constant - `src/group_impl.h` - Point lambda multiplication 2. **Papers:** - "Faster Point Multiplication on Elliptic Curves with Efficient Endomorphisms" (GLV, 2001) - "Guide to Elliptic Curve Cryptography" (Hankerson, Menezes, Vanstone) - Algorithm 3.74 3. **IACR ePrint 2021/1151:** - SIMD acceleration techniques - Window size optimization analysis