GLV_STRAUSS_WNAF.md raw

GLV Endomorphism + Strauss/wNAF Optimization for secp256k1

What Is an Endomorphism?

An endomorphism is a map from a mathematical structure back to itself that preserves the structure's operations. The word comes from Greek: endo- ("within") and morphism ("shape/form"). In the context of elliptic curves:

An endomorphism is a function phi: E -> E that maps points on the curve to other points on the same curve, and that respects the group law:

phi(P + Q) = phi(P) + phi(Q)

for all points P, Q on E. This means the map is a group homomorphism from the curve's point group to itself.

Every elliptic curve has two trivial endomorphisms: the identity map (P -> P) and the negation map (P -> -P). Most curves over prime fields have only these. But certain curves have additional, non-trivial endomorphisms -- and secp256k1 is one of them.

Why secp256k1 Has a Non-Trivial Endomorphism

The curve equation y^2 = x^3 + 7 has no x term (the a coefficient is zero). This means the right-hand side x^3 + 7 has a cube symmetry: if you replace x with B*x where B is a cube root of unity (B^3 = 1), then:

(B*x)^3 + 7 = B^3 * x^3 + 7 = 1 * x^3 + 7 = x^3 + 7

So the curve equation is unchanged. The map phi(x, y) = (B*x, y) sends a point on the curve to another point on the curve. You can verify it also respects point addition (this follows from the addition formulas depending only on the curve equation, which is invariant under this substitution). Therefore phi is an endomorphism.

This endomorphism has a corresponding eigenvalue in the scalar field: there exists a scalar L such that phi(P) = L * P for all points P of prime order n. This L is also a cube root of unity, but in F_n (the integers mod the curve order) rather than F_p (the integers mod the field prime).

Not All Curves Have This

A curve like secp256r1 (y^2 = x^3 - 3x + b) has a non-zero a coefficient, so the cube substitution trick breaks:

(B*x)^3 - 3*(B*x) + b = x^3 + 7 - 3*B*x + b  ≠  x^3 - 3x + b

The extra -3*B*x term ruins the symmetry. This is why the GLV endomorphism optimization only works on curves with a = 0 (or certain other special structures). secp256k1 was specifically chosen with this property.

The Core Insight: A "Free" Scalar Multiplication

The secp256k1 curve is y^2 = x^3 + 7 over a prime field F_p. Because the a coefficient is zero, this curve has a special property: there exists a **cube root of unity** in both the field and the scalar group. Concretely:

L != 1

The relationship is: for any point P = (x, y) on the curve, the point (B*x, y) is also on the curve, and it equals L*P. That is:

L * (x, y) = (B*x, y)

This is the endomorphism. Computing L*P normally requires ~256 doublings and additions. But with this trick, it costs one field multiplication (multiply x by B). That's the foundation of everything.

The Constants

Beta  = 0x7ae96a2b657c07106e64479eac3434e99cf0497512f58995c1396c28719501ee
Lambda = 0x5363ad4cc05c30e0a5261c028812645a122e22ea20816678df02967c1b23bd72

Stage 1: Scalar Decomposition (The GLV Trick)

Given a 256-bit scalar k and a point P, we want k*P. The GLV method rewrites:

k*P = k1*P + k2*(L*P) = k1*P + k2*(B*P.x, P.y)

where k1 and k2 are each ~128 bits, satisfying k = k1 + k2*L mod n.

How to find k1, k2

This is a closest vector problem on a 2D lattice. The lattice is the kernel of the map f(i,j) = i + L*j mod n -- i.e., all pairs (i,j) where i + L*j = 0 mod n. Two short basis vectors beta and gamma of this lattice are precomputed (using extended Euclidean / Gauss lattice reduction). Then:

  1. Compute t1 = k * gamma_2 / det and t2 = -k * beta_2 / det (where det

is the lattice determinant)

  1. Round to nearest integers: t1_r = round(t1), t2_r = round(t2)
  2. Compute (k1, k2) = (k, 0) - t1_r * beta - t2_r * gamma

The result: |k1|, |k2| < ~sqrt(n), so each is roughly 128 bits.

How libsecp256k1 Does It Without Division

libsecp256k1 precomputes the constants g1, g2, minus_b1, minus_b2 to perform this decomposition using only fixed-point integer arithmetic (no division), via the identity:

round(k * g1 / 2^384)  ≈  k * b2 / n
round(k * g2 / 2^384)  ≈  k * (-b1) / n

The precomputed decomposition constants are:

g1       = 0x3086d221a7d46bcde86c90e49284eb153daa8a1471e8ca7fe893209a45dbb031
g2       = 0xe4437ed6010e88286f547fa90abfe4c4221208ac9df506c61571b4ae8ac47f71
minus_b1 = 0xe4437ed6010e88286f547fa90abfe4c3
minus_b2 = 0xfffffffffffffffe8a280ac50774346dd765cda83db1562c

The decomposition algorithm in secp256k1_scalar_split_lambda():

c1 = round(k * g1 / 2^384)
c2 = round(k * g2 / 2^384)
c1 = c1 * (-b1) mod n
c2 = c2 * (-b2) mod n
k2 = c1 + c2 mod n
k1 = k - k2 * lambda mod n

Both k1 and k2 are guaranteed to be less than 2^128 in absolute value.

Stage 2: wNAF Encoding (Sparse Signed Representation)

Standard binary representation of a 128-bit number has ~64 ones on average. The windowed Non-Adjacent Form (wNAF) with window w represents the scalar as a sequence of signed odd digits in `{-(2^(w-1)-1), ..., -1, 0, 1, ..., 2^(w-1)-1}` with the property that between any two non-zero digits there are at least w-1 zeros.

For w = 5 (libsecp256k1's default for arbitrary points), the digits are in {-15, -13, ..., -1, 0, 1, ..., 15} and on average only 128 / (5+1) ≈ 21 of the 128 positions are non-zero. This means ~21 point additions instead of ~64 -- a huge reduction.

The Tradeoff: Precomputed Tables

You need a precomputed table of odd multiples:

{1*P, 3*P, 5*P, ..., 15*P}    (8 entries for w=5)

For the endomorphism variant, you also need {1*L*P, 3*L*P, ..., 15*L*P}, but these are essentially free -- just multiply each x-coordinate by Beta.

wNAF Conversion Algorithm

Input: scalar a, window width w
Output: wNAF array, bit count

1. If scalar is negative: negate it, set sign = -1
2. For each bit position from 0 to 127:
   a. If current bit matches carry: skip (digit = 0)
   b. Extract w bits starting at position: word = bits[pos:pos+w] + carry
   c. If MSB of word set: carry = 1, else carry = 0
   d. Apply carry: word = word - (carry << w)
   e. wNAF[pos] = sign * word
   f. Jump by w positions (next w-1 positions are guaranteed zero)

Properties guaranteed:

Stage 3: Strauss's Algorithm (Shared Doublings)

Computing k1*P + k2*Q naively means two independent double-and-add loops. Strauss's algorithm (also called Shamir's trick) processes both scalars simultaneously in a single loop:

result = point_at_infinity
for bit = highest_bit downto 0:
    result = 2 * result                          // one doubling (shared!)
    if wNAF_k1[bit] != 0:
        result += table_P[wNAF_k1[bit]]          // add from P's table
    if wNAF_k2[bit] != 0:
        result += table_Q[wNAF_k2[bit]]          // add from Q's table (lambda)

The key: the doublings are shared. Instead of ~128 doublings for k1 and ~128 doublings for k2 (total 256), you do only ~128 doublings total, with additions interleaved from both scalars. Combined with wNAF sparsity, the total operation count is roughly:

Compare to naive binary method on the original 256-bit scalar: **256 doublings + ~128 additions. That's roughly a 2x speedup** from the GLV split plus additional gains from wNAF.

Stage 4: Generator Optimization

For signature verification, libsecp256k1 computes na*A + ng*G where G is the fixed generator. Since G is known at compile time, its multiples are precomputed into massive tables (secp256k1_pre_g and secp256k1_pre_g_128). The generator scalar ng is split into two 128-bit halves via simple bit shifting (not GLV), and each half indexes into the precomputed tables with window size 15. This eliminates all doublings for the generator component.

Summary of Speedup Sources

TechniqueWhat it savesApproximate gain
GLV endomorphismHalves scalar length (256->128 bits)~2x fewer doublings
wNAF encodingSparser digit representation~3x fewer additions
Strauss interleavingShared doublings for multi-scalarEliminates redundant doublings
Precomputed generator tablesNo computation for G multiplesEliminates ~128 doublings for G

Complete Algorithm Flow

Input: scalar k, point P

[Stage 1: GLV Decomposition]
k -> split_lambda() -> (k1, k2) where k = k1 + L*k2 (mod n)
    |k1|, |k2| < 2^128

[Stage 2: Precomputation]
P -> odd_multiples_table() -> [P, 3P, 5P, ..., 15P]
L(P) = (B*P.x, P.y)        -> [L*P, 3*L*P, ..., 15*L*P]  (just multiply x by B)

[Stage 3: wNAF Encoding]
k1 -> ecmult_wnaf(w=5) -> wnaf_k1[]   (~21 non-zero entries out of 128)
k2 -> ecmult_wnaf(w=5) -> wnaf_k2[]   (~21 non-zero entries out of 128)

[Stage 4: Strauss Accumulation]
result = infinity
for bit = 127 downto 0:
    result = 2 * result
    if wnaf_k1[bit] != 0: result += table_P[wnaf_k1[bit]]
    if wnaf_k2[bit] != 0: result += table_LP[wnaf_k2[bit]]

Output: k*P

libsecp256k1 Source Code Map

PhaseFileFunctionKey Variables
Decomposescalar_impl.hsecp256k1_scalar_split_lambda()g1, g2, minus_b1, minus_b2, lambda
Endomorphismgroup_impl.hsecp256k1_ge_mul_lambda()beta
Precomputeecmult_impl.hsecp256k1_ecmult_odd_multiples_table()pre_a[], zr[]
wNAF encodeecmult_impl.hsecp256k1_ecmult_wnaf()wnaf[], carry, sign
Accumulateecmult_impl.hsecp256k1_ecmult_strauss_wnaf()r, bits
Table lookupecmult_impl.hsecp256k1_ecmult_table_get_ge()n (wNAF coeff)
Lambda lookupecmult_impl.hsecp256k1_ecmult_table_get_ge_lambda()n, beta*x cache

References

Original Papers

with Efficient Endomorphisms" (CRYPTO 2001)

on a Large Class of Curves" (EUROCRYPT 2009)

Practical Explanations