# Iskra Consolidation Plan

## Goal

iskra becomes the algorithm layer between raw storage (iskradb) and domain-specific implementations (transdb for NL, iskra/code for programming languages). Each language - natural or programming - is a sub-lattice that shares the same iskradb store and composition machinery.

```
iskradb          raw B-tree storage, no domain knowledge
    ↑
    iskra         coord system, sub-lattice model, abstract pipeline
         ↑
         transdb   JA/EN: JMdict ingest, JA morphology, particle tables
         iskra/code  Moxie/SRC: mxcorpus ingest, AST analysis (currently here)
         (future)  HR/RU: Slavic morphology tables
```

The insight: translation between any two languages is a sub-lattice path problem, whether the pair is EN→JA, SRC→IR, or Moxie→LLVM. The same coord, relax, and cluster machinery applies to each.

---

## What moves where

### Phase 1 — Coord system (transdb/key.mx → iskra/coord.mx)

Move from transdb, generalize:

```
PackCoord(semantic, grammatical, cooccur, morph, pragmatic, valency, register uint64) uint64
RelaxCoord(coord uint64) []uint64
CoordSemanticShift, CoordMorphShift, CoordGrammaticalShift, CoordCooccurShift, ...
SemanticHumanSubj, SemanticHumanObj, SemanticAnimSubj, ... (all 16 flags)
CoordSemantic(coord) uint64
CoordMorph(coord) uint8
CoordCooccur(prev, next uint8) uint64
```

`MakeKey(domain uint8, coord uint64, word string) lattice.Key`: "lang" becomes "domain". Domain 0 is reserved; domains 1-255 are assigned per language. `LangEN=0x01` and `LangJA=0x02` stay in transdb; future code domains such as `DomainMoxieSRC=0x10` go in iskra/code.

`RelaxCoord` is already language-agnostic. The relaxation order (semantic → pragmatic → register → valency → grammatical → cooccur → morph) applies to any domain that uses the coord.
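The relaxation order above can be sketched as follows. This is a Go approximation, not the transdb code: the axis shifts and widths are assumptions chosen only so that seven fields fit in 64 bits, and "relaxing" is assumed to mean clearing one more axis at each step, with the exact coord returned first.

```go
package main

import "fmt"

// axis describes one field of the 64-bit coord. The shifts and widths
// below are assumptions for this sketch, not the layout in transdb/key.mx.
type axis struct {
	shift, width uint
}

var (
	axSemantic    = axis{0, 16}
	axGrammatical = axis{16, 8}
	axCooccur     = axis{24, 16}
	axMorph       = axis{40, 8}
	axPragmatic   = axis{48, 4}
	axValency     = axis{52, 4}
	axRegister    = axis{56, 4}
)

// relaxOrder is the order given in the plan:
// semantic, pragmatic, register, valency, grammatical, cooccur, morph.
var relaxOrder = []axis{
	axSemantic, axPragmatic, axRegister, axValency,
	axGrammatical, axCooccur, axMorph,
}

func (a axis) mask() uint64 { return (uint64(1)<<a.width - 1) << a.shift }

func (a axis) put(v uint64) uint64 { return (v << a.shift) & a.mask() }

// PackCoord places each argument in its axis; out-of-range bits are dropped.
func PackCoord(semantic, grammatical, cooccur, morph, pragmatic, valency, register uint64) uint64 {
	return axSemantic.put(semantic) | axGrammatical.put(grammatical) |
		axCooccur.put(cooccur) | axMorph.put(morph) |
		axPragmatic.put(pragmatic) | axValency.put(valency) |
		axRegister.put(register)
}

// RelaxCoord returns the exact coord first, then one looser coord per axis
// cleared, cumulatively. Steps that would not change the coord are skipped.
func RelaxCoord(coord uint64) []uint64 {
	out := []uint64{coord}
	for _, a := range relaxOrder {
		next := coord &^ a.mask()
		if next != coord {
			out = append(out, next)
			coord = next
		}
	}
	return out
}

func main() {
	c := PackCoord(0x0003, 1, 0, 5, 0, 2, 1)
	for i, r := range RelaxCoord(c) {
		fmt.Printf("step %d: %#016x\n", i, r)
	}
}
```

Because the function only touches bit fields, nothing in it depends on which domain the coord belongs to, which is the property the move to iskra relies on.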

`POSForWord`, `ActiveBranches`, `branchOrderJA` stay in transdb — they are JA-specific branch heuristics.

### Phase 2 — MorphState (transdb/register.mx → iskra/morph.mx)

Move the encoding protocol, not the language-specific states:

```
SetMorphState(rec *lattice.Record, state uint8)
GetMorphState(rec *lattice.Record) uint8
SetSemanticInDataFile(rec *lattice.Record, flags uint64)
GetSemanticFromDataFile(rec *lattice.Record) uint64
PackBranch(pos, reg, dom, spec uint8) uint8
POSFromBranch(b uint8) uint8
RegFromBranch, DomFromBranch, SpecFromBranch
BranchWeirdness, MatchesFilter
```

The MorphState constants (MorphPresAffPlain, MorphPastProgNeg, ...) are the 5-bit wu xing encoding — they describe any language's morphological state space, not JA specifically. Move constants to iskra/morph.mx.
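A minimal Go sketch of the encoding protocol, assuming the 5-bit morph state and the 16 semantic flags share one packed word on the record. `Record` here is a stand-in for `lattice.Record`, and the bit positions are assumptions for illustration only.

```go
package main

import "fmt"

// Record is a stand-in for lattice.Record. Treating DataFile as a packed
// 64-bit word is an assumption of this sketch.
type Record struct {
	DataFile uint64
}

const (
	morphMask = uint64(0x1F) // bits 0-4: 5-bit morph state (assumed position)
	semShift  = 5            // bits 5-20: 16 semantic flags (assumed position)
	semMask   = uint64(0xFFFF) << semShift
)

// SetMorphState overwrites only the morph bits; values above 31 are truncated.
func SetMorphState(rec *Record, state uint8) {
	rec.DataFile = rec.DataFile&^morphMask | uint64(state)&morphMask
}

// GetMorphState reads the 5-bit morph state back.
func GetMorphState(rec *Record) uint8 {
	return uint8(rec.DataFile & morphMask)
}

// SetSemanticInDataFile overwrites only the semantic flag bits.
func SetSemanticInDataFile(rec *Record, flags uint64) {
	rec.DataFile = rec.DataFile&^semMask | flags<<semShift&semMask
}

// GetSemanticFromDataFile reads the semantic flags back.
func GetSemanticFromDataFile(rec *Record) uint64 {
	return (rec.DataFile & semMask) >> semShift
}

func main() {
	rec := &Record{}
	SetMorphState(rec, 0x0A)
	SetSemanticInDataFile(rec, 0x0003)
	fmt.Println(GetMorphState(rec), GetSemanticFromDataFile(rec))
}
```

The accessors never interpret the state value, so language-specific meaning can stay with whoever defines the constants.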

JA-specific verb forms (kuruStateForms, suruFormSuffixes, VerbPatterns, BuildVerbForms, addVerbForms) stay in transdb — they are JA morphology tables.

EN-specific forms (enIrregs, regularPast, regularProg) stay in transdb.

Register/domain/honorific constants (RegNeutral, RegFormal, DomGeneral, ...) move to iskra — they are universal lexical metadata.

### Phase 3 — Language descriptor (transdb/langdesc.mx → iskra/langdesc.mx)

The descriptor framework is language-agnostic:

```
LangDesc{Order, HeadFinal, Particle, PreNomRC, ZeroCopula, Markers}
OrderSVO, OrderSOV, OrderVSO, ...
MarkerPrepositional, MarkerPostpositional, MarkerCase
RoleNone, RoleNPSubjTopic, RoleNPSubjGram, RoleNPObjDirect, RolePPLocative, ...
RegisterLangDesc(tree, pool, domain, desc)
GetLangDesc(tree, domain) (LangDesc, bool)
RegisterParticleRole(tree, pool, domain, semCoord, particle, role)
LookupParticleRole(tree, domain, particle, npFlags) uint8
LookupTargetMarker(dstDomain, role) string  -- table stays here, entries generic
```

JA-specific particle strings ("は", "が", ...) and `jaDefaultRole` map stay in transdb.
EN-specific preposition strings stay in transdb (or move to LookupTargetMarker table in iskra).

The `CoordVerbClass` sentinel constant moves to iskra/langdesc.mx (it is part of the registration protocol, not JA-specific).
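One way to read the split between a generic marker table and language-owned strings is sketched below in Go. The role and domain values are placeholders, `RegisterTargetMarker` is a hypothetical helper that the plan does not name, and a map stands in for whatever storage the real table uses.

```go
package main

import "fmt"

// Role values are placeholders; the real constants live in langdesc.
const (
	RoleNone uint8 = iota
	RoleNPSubjTopic
	RoleNPObjDirect
	RolePPLocative
)

const (
	LangEN uint8 = 0x01
	LangJA uint8 = 0x02
)

type markerKey struct {
	domain, role uint8
}

// targetMarkers is the generic table: it knows nothing about any language.
var targetMarkers = map[markerKey]string{}

// RegisterTargetMarker is a hypothetical helper for filling the table.
func RegisterTargetMarker(domain, role uint8, marker string) {
	targetMarkers[markerKey{domain, role}] = marker
}

// LookupTargetMarker returns "" when the domain has no marker for the role.
func LookupTargetMarker(dstDomain, role uint8) string {
	return targetMarkers[markerKey{dstDomain, role}]
}

// registerDemoMarkers stands in for what a language package would do at
// init time; the entries are illustrative only.
func registerDemoMarkers() {
	RegisterTargetMarker(LangJA, RoleNPSubjTopic, "は")
	RegisterTargetMarker(LangJA, RoleNPObjDirect, "を")
	RegisterTargetMarker(LangJA, RolePPLocative, "で")
	RegisterTargetMarker(LangEN, RolePPLocative, "in")
}

func main() {
	registerDemoMarkers()
	fmt.Printf("%q %q\n",
		LookupTargetMarker(LangJA, RoleNPObjDirect),
		LookupTargetMarker(LangEN, RoleNPObjDirect))
}
```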

### Phase 4 — Inflect framework (transdb/inflect.mx → iskra/inflect.mx)

Registration protocol and class-code scheme move to iskra:

```
VerbClassUnknown, VerbClassV1, VerbClassV5K, ... (0-15)  -- stay as JA codes
RegisterVerbClass(tree, pool, domain, dictForm, code)
GetVerbClass(tree, domain, dictForm) (string, bool)
```

`VerbClassCode(s string) uint8` and `VerbClassStr(code uint8) string` stay in transdb because the class name strings ("v1", "v5k", ...) are JMdict-specific.

`InflectJA(dictForm, class, state) string` and `InflectJAFromTree` stay in transdb.

For future Slavic: `InflectHR(stem, class, case, number) string` goes in a new hr package that imports iskra for the registration protocol.

The general interface to define:

```moxie
// iskra/inflect.mx
type InflectFunc func(dictForm string, classCode uint8, state uint8) string

// Each language registers its inflect function at init time.
func RegisterInflectFunc(domain uint8, fn InflectFunc)
func InflectFromTree(tree *lattice.Tree, domain uint8, dictForm string, state uint8) string
```
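The registration side of this interface might look like the Go sketch below. The dispatcher `inflectByDomain` is a hypothetical name: the real `InflectFromTree` would presumably read the class code from the tree first and then dispatch the same way. The demo domain, the state value, and the toy English rule are all placeholders.

```go
package main

import (
	"fmt"
	"strings"
)

type InflectFunc func(dictForm string, classCode uint8, state uint8) string

// One slot per domain byte; a nil slot means nothing is registered.
var inflectFuncs [256]InflectFunc

func RegisterInflectFunc(domain uint8, fn InflectFunc) {
	inflectFuncs[domain] = fn
}

// inflectByDomain dispatches to the domain's function and returns the
// dictionary form unchanged when the domain has none.
func inflectByDomain(domain uint8, dictForm string, classCode, state uint8) string {
	fn := inflectFuncs[domain]
	if fn == nil {
		return dictForm
	}
	return fn(dictForm, classCode, state)
}

const (
	demoDomain uint8 = 0x01 // placeholder domain byte
	statePast  uint8 = 1    // placeholder, not a real MorphState value
)

// demoInflectEN is a toy regular-past rule used only to exercise the hook.
func demoInflectEN(dictForm string, classCode uint8, state uint8) string {
	if state != statePast {
		return dictForm
	}
	if strings.HasSuffix(dictForm, "e") {
		return dictForm + "d"
	}
	return dictForm + "ed"
}

func main() {
	RegisterInflectFunc(demoDomain, demoInflectEN)
	fmt.Println(inflectByDomain(demoDomain, "walk", 0, statePast))
	fmt.Println(inflectByDomain(0x7F, "walk", 0, statePast))
}
```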

### Phase 3b — Key normalization hook (iskra/langdesc.mx)

Add to `LangDesc` and the domain registration protocol:

```moxie
// iskra/langdesc.mx
type KeyNormalizer func(word string) string

func RegisterKeyNormalizer(domain uint8, fn KeyNormalizer)
func NormalizeKey(domain uint8, word string) string
```

Callers of `MakeKey` call `NormalizeKey` first, so the word is normalized before it is hashed. A domain with no registered normalizer (nil) gets the identity function. Each domain registers its own:

- `LangJA`: identity — JA surface forms are already canonical
- `LangEN`: lowercase — "Cat" and "cat" are the same lookup
- `DomainMoxieSRC`: qualified name resolver — `Method` → `package.Type.Method` when a package context is known, or strip qualifier to bare name for fuzzy lookup (two-pass: try qualified first, relax to bare)
- `DomainMoxieIR`: LLVM mangled name → canonical (strip `@`, strip calling convention prefix)
- Future HR/RU: Unicode normalization + lowercase
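The hook and its identity default can be sketched in Go as below. The registry is a plain array indexed by domain byte, which is an assumption; only the EN and JA behaviors from the list are shown.

```go
package main

import (
	"fmt"
	"strings"
)

type KeyNormalizer func(word string) string

const (
	LangEN uint8 = 0x01
	LangJA uint8 = 0x02
)

// One slot per domain byte; nil means "use identity".
var normalizers [256]KeyNormalizer

func RegisterKeyNormalizer(domain uint8, fn KeyNormalizer) {
	normalizers[domain] = fn
}

// NormalizeKey applies the domain's normalizer, or returns the word
// unchanged when none is registered.
func NormalizeKey(domain uint8, word string) string {
	if fn := normalizers[domain]; fn != nil {
		return fn(word)
	}
	return word
}

// registerDemoNormalizers stands in for per-language init code.
// LangJA registers nothing and so falls through to identity.
func registerDemoNormalizers() {
	RegisterKeyNormalizer(LangEN, strings.ToLower)
}

func main() {
	registerDemoNormalizers()
	fmt.Println(NormalizeKey(LangEN, "Cat"))
	fmt.Println(NormalizeKey(LangJA, "猫"))
}
```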

The `Translate` signature in `sublattice.mx` becomes:

```moxie
func Translate(src SubLattice, dst SubLattice, word string, coord uint64) string
```

where `word` is already normalized for the source domain by the time it reaches `Translate`. Callers use `NormalizeKey(src.Domain, rawWord)` before calling `Translate`. This keeps the composition layer clean — `Translate` never sees raw unnormalized input.

The two-pass pattern for code lookup (qualified → bare) is implemented in the domain's normalizer as a closure over the package context, not in `Translate` itself. `Translate` calls `MakeKey` once with whatever the normalizer produces.
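A possible shape for such a closure, sketched in Go. `NewSRCNormalizer` and its `exists` predicate are hypothetical names: the predicate stands in for whatever check tells the normalizer that a qualified key is present, so that the fallback to the bare name happens inside the normalizer and the result is still a single string.

```go
package main

import (
	"fmt"
	"strings"
)

type KeyNormalizer func(word string) string

// NewSRCNormalizer returns a normalizer closed over a package context.
// exists is a hypothetical predicate reporting whether a key is stored.
func NewSRCNormalizer(pkg, typ string, exists func(string) bool) KeyNormalizer {
	return func(name string) string {
		// Strip any qualifier the caller supplied to get the bare name.
		bare := name
		if i := strings.LastIndex(name, "."); i >= 0 {
			bare = name[i+1:]
		}
		// Pass 1: try the fully qualified name in the current context.
		qualified := pkg + "." + typ + "." + bare
		if exists(qualified) {
			return qualified
		}
		// Pass 2: relax to the bare name for fuzzy lookup.
		return bare
	}
}

func main() {
	index := map[string]bool{"lattice.Tree.Insert": true}
	norm := NewSRCNormalizer("lattice", "Tree", func(k string) bool { return index[k] })
	fmt.Println(norm("Insert"))
	fmt.Println(norm("Delete"))
}
```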

This also fixes an existing ambiguity in transdb: `LookupWordCtx` currently does no case folding, so "Cat" and "cat" are different keys. Moving normalization to a registered hook makes the behavior explicit per domain rather than implicit per call site.

### Phase 5 — Cluster pipeline (transdb/cluster.mx → iskra/cluster.mx)

The pipeline is abstract; particle detection is language-specific input:

```moxie
// iskra/cluster.mx
type ClusterType uint8  // NPSubj, NPObj, VP, PP, Mod

type Cluster struct { Kind, Tokens, Flags, Role, Nested, Trans, Copular }

// Abstract pipeline stages — caller provides language hooks
type ParticleDetector func(tok string) bool
type RoleLookup func(tok string, npFlags uint64) uint8

func ParseClusters(tokens []string, isParticle ParticleDetector, lookupRole RoleLookup,
    hasVerb func([]string) bool) []*Cluster

func TranslateCluster(c *Cluster, lookup HeadLookup, srcDomain, dstDomain uint8)
func ReorderClusters(clusters []*Cluster, srcOrder, dstOrder uint8) []*Cluster
func InsertMarkers(clusters []*Cluster, dstDesc LangDesc, dstDomain uint8) string
```

`jaParticleSet`, `jaDefaultRole`, `jaFunctionWord`, `isPureHiragana`, `accumFlags`, `hasVerb`, `filterContent` stay in transdb as JA-specific implementations of the hooks.
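To show how the hooks keep the pipeline language-neutral, here is a reduced Go sketch of the first stage. It is not the transdb algorithm: the `hasVerb` hook is dropped, `Cluster` keeps only two fields, and the grouping rule (a particle closes the tokens before it) is an assumption. The two demo hooks and their role numbers are placeholders for what transdb would supply.

```go
package main

import (
	"fmt"
	"strings"
)

type ParticleDetector func(tok string) bool
type RoleLookup func(tok string, npFlags uint64) uint8

// Cluster is reduced to the two fields this sketch needs.
type Cluster struct {
	Tokens []string
	Role   uint8
}

// ParseClusters groups tokens into clusters. A particle closes the tokens
// collected so far and assigns their role; trailing tokens with no
// particle form a final cluster with role 0.
func ParseClusters(tokens []string, isParticle ParticleDetector, lookupRole RoleLookup) []*Cluster {
	var out []*Cluster
	var cur []string
	for _, tok := range tokens {
		if isParticle(tok) && len(cur) > 0 {
			out = append(out, &Cluster{Tokens: cur, Role: lookupRole(tok, 0)})
			cur = nil
			continue
		}
		cur = append(cur, tok)
	}
	if len(cur) > 0 {
		out = append(out, &Cluster{Tokens: cur})
	}
	return out
}

// demoIsParticle and demoRole are placeholder hooks; role numbers are made up.
func demoIsParticle(tok string) bool { return tok == "は" || tok == "を" }

func demoRole(tok string, npFlags uint64) uint8 {
	switch tok {
	case "は":
		return 1 // topic subject
	case "を":
		return 2 // direct object
	}
	return 0
}

func main() {
	tokens := []string{"猫", "は", "魚", "を", "食べる"}
	for _, c := range ParseClusters(tokens, demoIsParticle, demoRole) {
		fmt.Printf("role=%d tokens=%s\n", c.Role, strings.Join(c.Tokens, " "))
	}
}
```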

`clusterHeadLookup` (coord-relaxation lookup) moves to iskra — it uses MakeKey and RelaxCoord, not JA-specific logic. The verbStems fallback stays in transdb.

`joinWords`, `clusterKindFromRole`, `clusterFlags` move to iskra.

### Phase 6 — Code lattice modernization (iskra internal)

Map the 5 unused iskradb branches to code-specific axes using the coord system now in iskra:

| Branch | Axis | Code mapping |
|--------|------|--------------|
| Bsemantic (0) | ontological | purity: pure=0, IO=1, mutating=2, unsafe=3 |
| Bcooccur (2) | co-occurrence | call graph: callee prev/next type |
| Bvalency (6) | argument count | arity (0=niladic, 1=unary, 2=binary, 3=variadic) |
| Bregister (7) | register | scope level / allocation class |
| Bphonology (5) | phonological | reserved / package-level grouping |

Replace `FNV-56` key derivation with `SipHash128(coord + name)` using the coord system. The stage tag migrates to the domain byte: `DomainMoxieSRC=0x10`, `DomainMoxieAST=0x11`, `DomainMoxieIR=0x12`, `DomainMoxieASM=0x13`, `DomainMoxieBIN=0x14`.

Cross-stage adjacency is currently implicit in the key (same FNV-56 hash, different stage prefix). After the change it is expressed as the same SipHash of the name with a different domain byte, with RelaxCoord used to "find this construct at a different stage".
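The domain-byte scheme can be illustrated with the Go sketch below. The 17-byte key layout and the `AtStage` helper are hypothetical, the coord is left out, and a truncated SHA-256 digest is used purely as a stand-in for SipHash128, which is not in the Go standard library.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

const (
	DomainMoxieSRC uint8 = 0x10
	DomainMoxieAST uint8 = 0x11
	DomainMoxieIR  uint8 = 0x12
	DomainMoxieASM uint8 = 0x13
	DomainMoxieBIN uint8 = 0x14
)

// Key is a hypothetical layout: one domain byte, then a 16-byte name digest.
type Key [17]byte

// MakeKey hashes only the name. SHA-256 truncated to 128 bits stands in
// for SipHash128 here.
func MakeKey(domain uint8, name string) Key {
	var k Key
	k[0] = domain
	sum := sha256.Sum256([]byte(name))
	copy(k[1:], sum[:16])
	return k
}

// AtStage returns the key of the same construct at another stage by
// swapping the domain byte and leaving the digest untouched.
func AtStage(k Key, domain uint8) Key {
	k[0] = domain
	return k
}

func main() {
	src := MakeKey(DomainMoxieSRC, "main.run")
	ir := AtStage(src, DomainMoxieIR)
	fmt.Println(ir == MakeKey(DomainMoxieIR, "main.run"))
}
```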

Replace linear `BindingSignature` scan with coord-based lookup: encode arity in the valency axis, purity in the semantic axis. `find-sig` becomes `LookupWordCtx(tree, pool, name, DomainMoxieSRC, coord)` with RelaxCoord doing the structural distance search.
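A rough Go sketch of the exact-match half of that lookup, with the relaxation step left out. The axis positions, the `SigIndex` type, and its methods are assumptions for illustration; a map stands in for the tree, so the point is only that purity and arity become part of the key and no scan is needed.

```go
package main

import "fmt"

const (
	semShift = 0  // semantic axis carries purity (assumed position)
	valShift = 52 // valency axis carries arity (assumed position)
)

// Purity values follow the table: pure=0, IO=1, mutating=2, unsafe=3.
const (
	PurityPure uint64 = iota
	PurityIO
	PurityMutating
	PurityUnsafe
)

// SigCoord encodes purity and arity into one coord.
func SigCoord(purity, arity uint64) uint64 {
	return purity<<semShift | arity<<valShift
}

type sigKey struct {
	coord uint64
	name  string
}

// SigIndex is a map-backed stand-in for the coord-keyed tree.
type SigIndex map[sigKey]string

// Add stores a signature under (coord, name).
func (ix SigIndex) Add(name string, purity, arity uint64, sig string) {
	ix[sigKey{SigCoord(purity, arity), name}] = sig
}

// Find is a single keyed lookup in place of a linear scan.
func (ix SigIndex) Find(name string, purity, arity uint64) (string, bool) {
	sig, ok := ix[sigKey{SigCoord(purity, arity), name}]
	return sig, ok
}

func main() {
	ix := SigIndex{}
	ix.Add("push", PurityMutating, 2, "func(s *Stack, v int)")
	fmt.Println(ix.Find("push", PurityMutating, 2))
	fmt.Println(ix.Find("push", PurityPure, 2))
}
```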

---

## Sub-lattice composition model (iskra/sublattice.mx)

New file. A sub-lattice is a slice of the iskradb tree where all keys share the same domain byte. Sub-lattices can be composed (translated between) by finding coord-equivalent records across domains.

```moxie
// iskra/sublattice.mx

// SubLattice is a domain-scoped view of an iskradb tree.
type SubLattice struct {
    Tree   *lattice.Tree
    Pool   []byte
    Domain uint8
}

// Translate finds the best match in dst for a record in src.
// It uses the source record's coord (from the DataFile morph+semantic bits)
// as the lookup coord in the destination sub-lattice.
func Translate(src SubLattice, dst SubLattice, word string, coord uint64) string

// Compose merges two sub-lattices into the same tree.
// Keys don't collide because domain bytes differ.
func Compose(tree *lattice.Tree, a, b SubLattice)
```

Translation between any two languages becomes: `Translate(jaSubLattice, enSubLattice, word, coord)` — the same call regardless of whether word is JA/EN, Moxie/LLVM, or HR/RU.
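The composition model can be made concrete with the Go sketch below. The plan does not say how records in two domains are linked, so the shared `sense` id is an assumption of this sketch, as are the `Store` type (a map-backed stand-in for `lattice.Tree`) and the `Insert` method. Coord relaxation is omitted.

```go
package main

import "fmt"

type recKey struct {
	domain uint8
	coord  uint64
	word   string
}

type senseKey struct {
	domain uint8
	coord  uint64
	sense  uint32
}

// Store is a map-backed stand-in for lattice.Tree. Both sub-lattices share
// it; keys cannot collide because the domain byte is part of every key.
type Store struct {
	senses map[recKey]uint32
	words  map[senseKey]string
}

func NewStore() *Store {
	return &Store{senses: map[recKey]uint32{}, words: map[senseKey]string{}}
}

// SubLattice is a domain-scoped view of the shared store.
type SubLattice struct {
	Tree   *Store
	Domain uint8
}

// Insert records a word under this domain. The sense id linking
// equivalent records across domains is hypothetical.
func (s SubLattice) Insert(word string, coord uint64, sense uint32) {
	s.Tree.senses[recKey{s.Domain, coord, word}] = sense
	s.Tree.words[senseKey{s.Domain, coord, sense}] = word
}

// Translate looks the word up in src, then uses the same coord to find
// the linked record in dst. It returns "" when either lookup misses.
func Translate(src, dst SubLattice, word string, coord uint64) string {
	sense, ok := src.Tree.senses[recKey{src.Domain, coord, word}]
	if !ok {
		return ""
	}
	return dst.Tree.words[senseKey{dst.Domain, coord, sense}]
}

func main() {
	st := NewStore()
	ja := SubLattice{Tree: st, Domain: 0x02}
	en := SubLattice{Tree: st, Domain: 0x01}
	ja.Insert("猫", 7, 100)
	en.Insert("cat", 7, 100)
	fmt.Println(Translate(ja, en, "猫", 7))
}
```

The same `Translate` call works in either direction because neither argument is privileged; only the domain bytes differ.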

---

## Migration sequence

All steps are non-breaking (transdb imports iskra, no API surface changes visible to callers).

1. **coord.mx** — move from transdb. Trivial: no dependencies on JA data. transdb adds `import "git.mleku.dev/iskra"`. ~1 hour.

2. **morph.mx** — move encoding protocol. JA MorphState constants move with it (they are general, not JA-specific). JA verb tables stay. ~1 hour.

3. **langdesc.mx** — move descriptor structs and role constants. JA particle strings stay in transdb. ~2 hours.

4. **inflect.mx** — move registration protocol. Add `InflectFunc` hook. transdb registers its JA inflect function. ~2 hours.

5. **cluster.mx** — parameterize with hooks. Most logic moves; JA particle detection stays in transdb as a hook implementation. ~4 hours.

6. **sublattice.mx** — new. Implement composition model. ~3 hours.

7. **iskra code lattice** — replace FNV key with SipHash coord key, activate 5 dormant branches, replace BindingSignature scan with coord-based lookup. ~1 day.

Total: ~2-3 days of work, none of which blocks current transdb/iskradb development.

---

## What does NOT move

Stays in transdb permanently:
- JMdict / kanjidic ingest
- JA VerbPatterns, BuildVerbForms, kuruStateForms, suruFormSuffixes
- JA tokenizer (TokenizeJA, maxMatchJA, verbStems, inferMorphState, isPureHiragana)
- EN tokenizer (TokenizeEN)
- jaFunctionWord, jaParticleSet, jaDefaultRole
- EN irregular forms (enIrregs, regularPast, regularProg, GenerateENForms)
- Fuzzy matching (fuzzy/ package, BK-trees)
- Cooccurrence extraction (ExtendFromSentences, PMI)
- Propagation (semantic flag diffusion over corpus)

Stays in iskradb permanently:
- B-tree implementation (nodes, records, splits, compaction)
- All 8 branch constants
- CrossWalk, InsertTriple, Transfer
- Key type and SipHash wrapper

---

## Dependency graph after consolidation

```
iskradb  (storage, no imports from iskra or transdb)
    ↑
    iskra  (algorithms; imports iskradb)
         ↑
         transdb  (JA/EN data; imports iskra + iskradb)
         iskra/code  (code lattice; imports iskra + iskradb, no NL dependency)
```

The code lattice and NL lattice share iskra's algorithm layer and coexist in the same iskradb store with non-overlapping domain bytes.