# Iskra Consolidation Plan

## Goal

iskra becomes the algorithm layer between raw storage (iskradb) and domain-specific implementations (transdb for NL, iskra/code for programming languages). Each language, natural or programming, is a sub-lattice that shares the same iskradb store and composition machinery.

```
iskradb        raw B-tree storage, no domain knowledge
   ↑
iskra          coord system, sub-lattice model, abstract pipeline
   ↑
transdb        JA/EN: JMdict ingest, JA morphology, particle tables
iskra/code     Moxie/SRC: mxcorpus ingest, AST analysis (currently here)
(future)       HR/RU: Slavic morphology tables
```

The insight: translation between any two languages is a sub-lattice path problem, whether those languages are EN→JA, SRC→IR, or Moxie→LLVM. The same coord, relax, and cluster machinery applies.

---

## What moves where

### Phase 1: Coord system (transdb/key.mx → iskra/coord.mx)

Move from transdb, generalize:

```
PackCoord(semantic, grammatical, cooccur, morph, pragmatic, valency, register uint64) uint64
RelaxCoord(coord uint64) []uint64
CoordSemanticShift, CoordMorphShift, CoordGrammaticalShift, CoordCooccurShift, ...
SemanticHumanSubj, SemanticHumanObj, SemanticAnimSubj, ... (all 16 flags)
CoordSemantic(coord) uint64
CoordMorph(coord) uint8
CoordCooccur(prev, next uint8) uint64
```

`MakeKey(domain uint8, coord uint64, word string) lattice.Key`: "lang" becomes "domain". Domain 0 = reserved, domains 1-255 assigned per language. `LangEN=0x01`, `LangJA=0x02` stay in transdb; future `DomainMoxieSRC=0x10` etc. go in iskra/code.

`RelaxCoord` is already language-agnostic. The relaxation order (semantic → pragmatic → register → valency → grammatical → cooccur → morph) applies to any domain that uses the coord.

`POSForWord`, `ActiveBranches`, `branchOrderJA` stay in transdb: they are JA-specific branch heuristics.
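To illustrate the packing and relaxation order described in this phase, here is a minimal Go sketch (Go stands in for Moxie). The bit layout is entirely hypothetical, since the plan does not give field widths or shift values. That `RelaxCoord` returns the exact coord first and zeroes axes cumulatively is also an assumption. Only the function signatures and the relaxation order come from the plan.

```go
package main

import "fmt"

// field describes one coord axis as a bit range inside the uint64.
type field struct{ shift, width uint }

// Hypothetical layout (NOT from the plan): 16+8+4+4+8+16+8 = 64 bits.
var (
	fSemantic    = field{48, 16}
	fPragmatic   = field{40, 8}
	fRegister    = field{36, 4}
	fValency     = field{32, 4}
	fGrammatical = field{24, 8}
	fCooccur     = field{8, 16}
	fMorph       = field{0, 8}
)

// relaxOrder follows the order stated in the plan:
// semantic, pragmatic, register, valency, grammatical, cooccur, morph.
var relaxOrder = []field{fSemantic, fPragmatic, fRegister, fValency, fGrammatical, fCooccur, fMorph}

// mask returns the bits occupied by the field.
func (f field) mask() uint64 { return (uint64(1)<<f.width - 1) << f.shift }

// put places v into the field, truncating anything wider than the field.
func (f field) put(v uint64) uint64 { return (v << f.shift) & f.mask() }

// PackCoord combines the seven axes into one 64-bit coord.
func PackCoord(semantic, grammatical, cooccur, morph, pragmatic, valency, register uint64) uint64 {
	return fSemantic.put(semantic) | fGrammatical.put(grammatical) | fCooccur.put(cooccur) |
		fMorph.put(morph) | fPragmatic.put(pragmatic) | fValency.put(valency) | fRegister.put(register)
}

// RelaxCoord returns the exact coord followed by progressively looser
// coords, zeroing one more axis per step. Steps that change nothing
// (the axis was already zero) are skipped.
func RelaxCoord(coord uint64) []uint64 {
	out := []uint64{coord}
	cur := coord
	for _, f := range relaxOrder {
		next := cur &^ f.mask()
		if next != cur {
			out = append(out, next)
			cur = next
		}
	}
	return out
}

func main() {
	c := PackCoord(0x0003, 2, 0x0102, 5, 1, 2, 1)
	for i, r := range RelaxCoord(c) {
		fmt.Printf("step %d: %016x\n", i, r)
	}
}
```

A lookup would try each returned coord in turn and stop at the first hit, so earlier entries represent tighter matches.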
### Phase 2: MorphState (transdb/register.mx → iskra/morph.mx)

Move the encoding protocol, not the language-specific states:

```
SetMorphState(rec *lattice.Record, state uint8)
GetMorphState(rec *lattice.Record) uint8
SetSemanticInDataFile(rec *lattice.Record, flags uint64)
GetSemanticFromDataFile(rec *lattice.Record) uint64
PackBranch(pos, reg, dom, spec uint8) uint8
POSFromBranch(b uint8) uint8
RegFromBranch, DomFromBranch, SpecFromBranch
BranchWeirdness, MatchesFilter
```

The MorphState constants (MorphPresAffPlain, MorphPastProgNeg, ...) are the 5-bit wu xing encoding: they describe any language's morphological state space, not JA specifically. Move constants to iskra/morph.mx.

JA-specific verb forms (kuruStateForms, suruFormSuffixes, VerbPatterns, BuildVerbForms, addVerbForms) stay in transdb: they are JA morphology tables. EN-specific forms (enIrregs, regularPast, regularProg) stay in transdb.

Register/domain/honorific constants (RegNeutral, RegFormal, DomGeneral, ...) move to iskra: they are universal lexical metadata.

### Phase 3: Language descriptor (transdb/langdesc.mx → iskra/langdesc.mx)

The descriptor framework is language-agnostic:

```
LangDesc{Order, HeadFinal, Particle, PreNomRC, ZeroCopula, Markers}
OrderSVO, OrderSOV, OrderVSO, ...
MarkerPrepositional, MarkerPostpositional, MarkerCase
RoleNone, RoleNPSubjTopic, RoleNPSubjGram, RoleNPObjDirect, RolePPLocative, ...
RegisterLangDesc(tree, pool, domain, desc)
GetLangDesc(tree, domain) (LangDesc, bool)
RegisterParticleRole(tree, pool, domain, semCoord, particle, role)
LookupParticleRole(tree, domain, particle, npFlags) uint8
LookupTargetMarker(dstDomain, role) string   -- table stays here, entries generic
```

JA-specific particle strings ("は", "が", ...) and the `jaDefaultRole` map stay in transdb. EN-specific preposition strings stay in transdb (or move to the `LookupTargetMarker` table in iskra).
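One way to picture the generic `LookupTargetMarker` table is a map keyed by (domain, role) that each language package fills in. The Go sketch below rests on several assumptions. `RegisterTargetMarker` is a hypothetical helper that does not appear in the plan. The numeric role values are placeholders. The "を" and "at" entries are illustrative choices, not entries taken from transdb.

```go
package main

import "fmt"

const (
	LangEN uint8 = 0x01
	LangJA uint8 = 0x02
)

// Role names come from the plan; the numeric values are placeholders.
const (
	RoleNone uint8 = iota
	RoleNPSubjTopic
	RoleNPSubjGram
	RoleNPObjDirect
	RolePPLocative
)

// markerKey identifies one table slot: a destination domain and a role.
type markerKey struct {
	domain uint8
	role   uint8
}

// targetMarkers is the generic table; it holds no language data itself.
var targetMarkers = map[markerKey]string{}

// RegisterTargetMarker is a hypothetical helper (not in the plan) that a
// language package would call to contribute its marker strings.
func RegisterTargetMarker(domain, role uint8, marker string) {
	targetMarkers[markerKey{domain, role}] = marker
}

// LookupTargetMarker returns the marker for a role in the destination
// domain, or "" when that domain marks the role with nothing.
func LookupTargetMarker(dstDomain, role uint8) string {
	return targetMarkers[markerKey{dstDomain, role}]
}

// In the real layout these registrations would live in transdb.
func init() {
	RegisterTargetMarker(LangJA, RoleNPSubjTopic, "は")
	RegisterTargetMarker(LangJA, RoleNPSubjGram, "が")
	RegisterTargetMarker(LangJA, RoleNPObjDirect, "を") // illustrative entry
	RegisterTargetMarker(LangEN, RolePPLocative, "at") // illustrative entry
}

func main() {
	fmt.Printf("JA topic marker: %q\n", LookupTargetMarker(LangJA, RoleNPSubjTopic))
	fmt.Printf("EN topic marker: %q\n", LookupTargetMarker(LangEN, RoleNPSubjTopic))
}
```

Keeping the strings in the registering package and only the table in iskra matches the first option in the text; moving the EN strings into iskra would simply relocate the `init` calls.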
The `CoordVerbClass` sentinel constant moves to iskra/langdesc.mx (it is part of the registration protocol, not JA-specific).

### Phase 4: Inflect framework (transdb/inflect.mx → iskra/inflect.mx)

Registration protocol and class-code scheme move to iskra:

```
VerbClassUnknown, VerbClassV1, VerbClassV5K, ... (0-15)   -- stay as JA codes
RegisterVerbClass(tree, pool, domain, dictForm, code)
GetVerbClass(tree, domain, dictForm) (string, bool)
```

`VerbClassCode(s string) uint8` and `VerbClassStr(code uint8) string` stay in transdb because the class name strings ("v1", "v5k", ...) are JMdict-specific.

`InflectJA(dictForm, class, state) string` and `InflectJAFromTree` stay in transdb. For future Slavic: `InflectHR(stem, class, case, number) string` goes in a new hr package that imports iskra for the registration protocol.

The general interface to define:

```moxie
// iskra/inflect.mx
type InflectFunc func(dictForm string, classCode uint8, state uint8) string

// Each language registers its inflect function at init time.
func RegisterInflectFunc(domain uint8, fn InflectFunc)
func InflectFromTree(tree *lattice.Tree, domain uint8, dictForm string, state uint8) string
```

### Phase 3b: Key normalization hook (iskra/langdesc.mx)

Add to `LangDesc` and the domain registration protocol:

```moxie
// iskra/langdesc.mx
type KeyNormalizer func(word string) string

func RegisterKeyNormalizer(domain uint8, fn KeyNormalizer)
func NormalizeKey(domain uint8, word string) string
```

`NormalizeKey` is called by `MakeKey` callers before hashing. Default (nil) = identity.
Each domain registers its own:

- `LangJA`: identity (JA surface forms are already canonical)
- `LangEN`: lowercase ("Cat" and "cat" are the same lookup)
- `DomainMoxieSRC`: qualified name resolver: `Method` → `package.Type.Method` when a package context is known, or strip qualifier to bare name for fuzzy lookup (two-pass: try qualified first, relax to bare)
- `DomainMoxieIR`: LLVM mangled name → canonical (strip `@`, strip calling convention prefix)
- Future HR/RU: Unicode normalization + lowercase

The `Translate` signature in `sublattice.mx` becomes:

```moxie
func Translate(src SubLattice, dst SubLattice, word string, coord uint64) string
```

where `word` is already normalized for the source domain by the time it reaches `Translate`. Callers use `NormalizeKey(src.Domain, rawWord)` before calling `Translate`. This keeps the composition layer clean: `Translate` never sees raw unnormalized input.

The two-pass pattern for code lookup (qualified → bare) is implemented in the domain's normalizer as a closure over the package context, not in `Translate` itself. `Translate` calls `MakeKey` once with whatever the normalizer produces.

This also fixes an existing ambiguity in transdb: `LookupWordCtx` currently does no case folding, so "Cat" and "cat" are treated as different keys. Moving normalization to a registered hook makes the behavior explicit per domain rather than implicit per callsite.
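The hook described in this phase could look roughly like the following Go sketch (Go stands in for Moxie). The `KeyNormalizer`, `RegisterKeyNormalizer`, and `NormalizeKey` signatures and the domain byte values are from the plan. The fixed-size registry array, the `qualifyIn` helper, and the `"lattice.Tree"` package context are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	LangEN         uint8 = 0x01
	LangJA         uint8 = 0x02
	DomainMoxieSRC uint8 = 0x10
)

// KeyNormalizer maps a raw surface form to the canonical lookup form.
type KeyNormalizer func(word string) string

// normalizers has one slot per domain byte; nil means identity.
var normalizers [256]KeyNormalizer

// RegisterKeyNormalizer installs fn for the given domain.
func RegisterKeyNormalizer(domain uint8, fn KeyNormalizer) {
	normalizers[domain] = fn
}

// NormalizeKey applies the domain's normalizer, or returns word
// unchanged when none is registered.
func NormalizeKey(domain uint8, word string) string {
	if fn := normalizers[domain]; fn != nil {
		return fn(word)
	}
	return word
}

// qualifyIn is a hypothetical helper: it returns a normalizer closed
// over a package context that qualifies bare names and leaves already
// qualified names alone.
func qualifyIn(ctx string) KeyNormalizer {
	return func(word string) string {
		if ctx == "" || strings.Contains(word, ".") {
			return word
		}
		return ctx + "." + word
	}
}

// In the real layout each language package would register its own hook.
func init() {
	RegisterKeyNormalizer(LangEN, strings.ToLower)
	RegisterKeyNormalizer(DomainMoxieSRC, qualifyIn("lattice.Tree"))
	// LangJA registers nothing and falls through to identity.
}

func main() {
	fmt.Println(NormalizeKey(LangEN, "Cat"))
	fmt.Println(NormalizeKey(LangJA, "猫"))
	fmt.Println(NormalizeKey(DomainMoxieSRC, "Insert"))
}
```

A caller would then pass `NormalizeKey(src.Domain, rawWord)` to `Translate`, as the text describes. The relax-to-bare second pass is left out of this sketch.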
### Phase 5: Cluster pipeline (transdb/cluster.mx → iskra/cluster.mx)

The pipeline is abstract; particle detection is language-specific input:

```moxie
// iskra/cluster.mx
type ClusterType uint8 // NPSubj, NPObj, VP, PP, Mod

type Cluster struct { Kind, Tokens, Flags, Role, Nested, Trans, Copular }

// Abstract pipeline stages: caller provides language hooks
type ParticleDetector func(tok string) bool
type RoleLookup func(tok string, npFlags uint64) uint8

ParseClusters(tokens []string, isParticle ParticleDetector, lookupRole RoleLookup, hasVerb func([]string) bool) []*Cluster
TranslateCluster(c *Cluster, lookup HeadLookup, srcDomain, dstDomain uint8)
ReorderClusters(clusters []*Cluster, srcOrder, dstOrder uint8) []*Cluster
InsertMarkers(clusters []*Cluster, dstDesc LangDesc, dstDomain uint8) string
```

`jaParticleSet`, `jaDefaultRole`, `jaFunctionWord`, `isPureHiragana`, `accumFlags`, `hasVerb`, `filterContent` stay in transdb as JA-specific implementations of the hooks.

`clusterHeadLookup` (coord-relaxation lookup) moves to iskra: it uses MakeKey and RelaxCoord, not JA-specific logic. The verbStems fallback stays in transdb.

`joinWords`, `clusterKindFromRole`, `clusterFlags` move to iskra.

### Phase 6: Code lattice modernization (iskra internal)

Map the 5 unused iskradb branches to code-specific axes using the coord system now in iskra:

| Branch | Axis | Code mapping |
|--------|------|--------------|
| Bsemantic (0) | ontological | purity: pure=0, IO=1, mutating=2, unsafe=3 |
| Bcooccur (2) | co-occurrence | call graph: callee prev/next type |
| Bvalency (6) | argument count | arity (0=niladic, 1=unary, 2=binary, 3=variadic) |
| Bregister (7) | register | scope level / allocation class |
| Bphonology (5) | phonological | reserved / package-level grouping |

Replace `FNV-56` key derivation with `SipHash128(coord + name)` using the coord system.
The stage tag migrates to the domain byte: `DomainMoxieSRC=0x10`, `DomainMoxieAST=0x11`, `DomainMoxieIR=0x12`, `DomainMoxieASM=0x13`, `DomainMoxieBIN=0x14`. Cross-stage adjacency (currently key-implicit via same FNV-56 + different stage prefix) becomes: same SipHash of name, different domain byte, RelaxCoord for "find this construct at a different stage".

Replace linear `BindingSignature` scan with coord-based lookup: encode arity in the valency axis, purity in the semantic axis. `find-sig` becomes `LookupWordCtx(tree, pool, name, DomainMoxieSRC, coord)` with RelaxCoord doing the structural distance search.

---

## Sub-lattice composition model (iskra/sublattice.mx)

New file. A sub-lattice is a slice of the iskradb tree where all keys share the same domain byte. Sub-lattices can be composed (translated between) by finding coord-equivalent records across domains.

```moxie
// iskra/sublattice.mx

// SubLattice is a domain-scoped view of an iskradb tree.
type SubLattice struct {
    Tree   *lattice.Tree
    Pool   []byte
    Domain uint8
}

// Translate finds the best match in dstDomain for a record in srcDomain.
// Uses the record's coord (from DataFile morph+semantic bits) as the
// lookup coord in the destination sub-lattice.
func Translate(src SubLattice, dst SubLattice, word string, coord uint64) string

// Compose merges two sub-lattices into the same tree.
// Keys don't collide because domain bytes differ.
func Compose(tree *lattice.Tree, a, b SubLattice)
```

Translation between any two languages becomes: `Translate(jaSubLattice, enSubLattice, word, coord)`, the same call regardless of whether word is JA/EN, Moxie/LLVM, or HR/RU.

---

## Migration sequence

All steps are non-breaking (transdb imports iskra, no API surface changes visible to callers).

1. **coord.mx**: move from transdb. Trivial: no dependencies on JA data. transdb adds `import "git.mleku.dev/iskra"`. ~1 hour.
2. **morph.mx**: move encoding protocol.
   JA MorphState constants move with it (they are general, not JA-specific). JA verb tables stay. ~1 hour.
3. **langdesc.mx**: move descriptor structs and role constants. JA particle strings stay in transdb. ~2 hours.
4. **inflect.mx**: move registration protocol. Add `InflectFunc` hook. transdb registers its JA inflect function. ~2 hours.
5. **cluster.mx**: parameterize with hooks. Most logic moves; JA particle detection stays in transdb as a hook implementation. ~4 hours.
6. **sublattice.mx**: new. Implement composition model. ~3 hours.
7. **iskra code lattice**: replace FNV key with SipHash coord key, activate 5 dormant branches, replace BindingSignature scan with coord-based lookup. ~1 day.

Total: ~2-3 days of work, none of which blocks current transdb/iskradb development.

---

## What does NOT move

Stays in transdb permanently:

- JMdict / kanjidic ingest
- JA VerbPatterns, BuildVerbForms, kuruStateForms, suruFormSuffixes
- JA tokenizer (TokenizeJA, maxMatchJA, verbStems, inferMorphState, isPureHiragana)
- EN tokenizer (TokenizeEN)
- jaFunctionWord, jaParticleSet, jaDefaultRole
- EN irregular forms (enIrregs, regularPast, regularProg, GenerateENForms)
- Fuzzy matching (fuzzy/ package, BK-trees)
- Cooccurrence extraction (ExtendFromSentences, PMI)
- Propagation (semantic flag diffusion over corpus)

Stays in iskradb permanently:

- B-tree implementation (nodes, records, splits, compaction)
- All 8 branch constants
- CrossWalk, InsertTriple, Transfer
- Key type and SipHash wrapper

---

## Dependency graph after consolidation

```
iskradb       (storage, no imports from iskra or transdb)
   ↑
iskra         (algorithms; imports iskradb)
   ↑
transdb       (JA/EN data; imports iskra + iskradb)
iskra/code    (code lattice; imports iskra + iskradb, no NL dependency)
```

The code lattice and NL lattice share iskra's algorithm layer and coexist in the same iskradb store with non-overlapping domain bytes.
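To make the coexistence claim concrete, the Go sketch below derives keys whose top byte is the domain and whose remaining 56 bits hash the coord and name. Everything about the layout is an assumption: the plan's `MakeKey` returns a `lattice.Key`, not a `uint64`, and it specifies SipHash. Go's standard library has no SipHash, so SHA-256 truncated to 56 bits is swapped in here purely as a stand-in.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// Domain byte values are taken from the plan.
const (
	LangEN         uint8 = 0x01
	DomainMoxieSRC uint8 = 0x10
	DomainMoxieIR  uint8 = 0x12
)

// hashMask selects the low 56 bits that carry the hash (assumed layout).
const hashMask = uint64(1)<<56 - 1

// MakeKey puts the domain in the top byte and a 56-bit hash of
// (coord, word) below it. SHA-256 stands in for SipHash here.
func MakeKey(domain uint8, coord uint64, word string) uint64 {
	buf := make([]byte, 8, 8+len(word))
	binary.BigEndian.PutUint64(buf, coord)
	buf = append(buf, word...)
	sum := sha256.Sum256(buf)
	return uint64(domain)<<56 | binary.BigEndian.Uint64(sum[:8])&hashMask
}

func main() {
	src := MakeKey(DomainMoxieSRC, 0, "Insert")
	ir := MakeKey(DomainMoxieIR, 0, "Insert")
	en := MakeKey(LangEN, 0, "Insert")

	fmt.Printf("SRC %016x\nIR  %016x\nEN  %016x\n", src, ir, en)
	// The same name and coord hash identically at every stage; only the
	// domain byte differs, so the keys never collide across domains.
	fmt.Println("same hash bits:", src&hashMask == ir&hashMask)
	fmt.Println("distinct keys: ", src != ir && src != en && ir != en)
}
```

Under this layout, finding a construct at another stage amounts to swapping the top byte and retrying with relaxed coords, and an NL key such as the `LangEN` one can never shadow a code key.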