
Iskra Consolidation Plan

Goal

iskra becomes the algorithm layer between raw storage (iskradb) and domain-specific implementations (transdb for NL, iskra/code for programming languages). Each language, natural or programming, is a sub-lattice that shares the same iskradb store and the same composition machinery.

iskradb          raw B-tree storage, no domain knowledge
    ↑
    iskra         coord system, sub-lattice model, abstract pipeline
         ↑
         transdb   JA/EN: JMdict ingest, JA morphology, particle tables
         iskra/code  Moxie/SRC: mxcorpus ingest, AST analysis (currently here)
         (future)  HR/RU: Slavic morphology tables

The insight: translation between any two languages is a sub-lattice path problem, whether those languages are EN→JA, SRC→IR, or Moxie→LLVM. The same coord, relax, and cluster machinery applies.


What moves where

Phase 1 — Coord system (transdb/key.mx → iskra/coord.mx)

Move from transdb, generalize:

PackCoord(semantic, grammatical, cooccur, morph, pragmatic, valency, register uint64) uint64
RelaxCoord(coord uint64) []uint64
CoordSemanticShift, CoordMorphShift, CoordGrammaticalShift, CoordCooccurShift, ...
SemanticHumanSubj, SemanticHumanObj, SemanticAnimSubj, ... (all 16 flags)
CoordSemantic(coord) uint64
CoordMorph(coord) uint8
CoordCooccur(prev, next uint8) uint64

MakeKey(domain uint8, coord uint64, word string) lattice.Key

The "lang" parameter becomes "domain". Domain 0 is reserved; domains 1-255 are assigned per language. LangEN=0x01 and LangJA=0x02 stay in transdb; future DomainMoxieSRC=0x10 etc. go in iskra/code.

RelaxCoord is already language-agnostic. The relaxation order (semantic → pragmatic → register → valency → grammatical → cooccur → morph) applies to any domain that uses the coord.
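
As a rough illustration of the relaxation order, here is a minimal Go sketch. The bit layout (shifts and widths) is an assumption made for the example, as is the choice to clear axes cumulatively and skip axes that are already zero; the real layout and relaxation behavior are whatever transdb/key.mx defines.

```go
package main

import "fmt"

// axis is one field of the packed coord.
// The shifts and widths below are hypothetical, chosen only for this sketch.
type axis struct {
	shift, width uint
}

func (a axis) mask() uint64          { return (uint64(1)<<a.width - 1) << a.shift }
func (a axis) put(v uint64) uint64   { return (v << a.shift) & a.mask() }

var (
	axSemantic    = axis{0, 16}  // 16 semantic flags
	axGrammatical = axis{16, 8}
	axCooccur     = axis{24, 16} // prev and next, 8 bits each
	axMorph       = axis{40, 5}  // 5-bit morph state
	axPragmatic   = axis{45, 4}
	axValency     = axis{49, 4}
	axRegister    = axis{53, 4}
)

// relaxOrder follows the plan:
// semantic, pragmatic, register, valency, grammatical, cooccur, morph.
var relaxOrder = []axis{
	axSemantic, axPragmatic, axRegister, axValency,
	axGrammatical, axCooccur, axMorph,
}

// PackCoord packs the seven axes into one uint64 (assumed layout).
func PackCoord(semantic, grammatical, cooccur, morph, pragmatic, valency, register uint64) uint64 {
	return axSemantic.put(semantic) | axGrammatical.put(grammatical) |
		axCooccur.put(cooccur) | axMorph.put(morph) |
		axPragmatic.put(pragmatic) | axValency.put(valency) |
		axRegister.put(register)
}

// RelaxCoord returns the exact coord first, then progressively looser
// candidates. Each step clears one more axis in relaxOrder; axes that
// are already zero produce no new candidate.
func RelaxCoord(coord uint64) []uint64 {
	out := []uint64{coord}
	cur := coord
	for _, a := range relaxOrder {
		if next := cur &^ a.mask(); next != cur {
			out = append(out, next)
			cur = next
		}
	}
	return out
}

func main() {
	coord := PackCoord(0x3, 1, 0x0102, 5, 0, 2, 0)
	for i, c := range RelaxCoord(coord) {
		fmt.Printf("candidate %d: %#016x\n", i, c)
	}
}
```

Nothing in the sketch refers to a language: a domain only decides what the axis values mean, which is why the function can move to iskra unchanged.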

POSForWord, ActiveBranches, branchOrderJA stay in transdb — they are JA-specific branch heuristics.

Phase 2 — MorphState (transdb/register.mx → iskra/morph.mx)

Move the encoding protocol, not the language-specific states:

SetMorphState(rec *lattice.Record, state uint8)
GetMorphState(rec *lattice.Record) uint8
SetSemanticInDataFile(rec *lattice.Record, flags uint64)
GetSemanticFromDataFile(rec *lattice.Record) uint64
PackBranch(pos, reg, dom, spec uint8) uint8
POSFromBranch(b uint8) uint8
RegFromBranch, DomFromBranch, SpecFromBranch
BranchWeirdness, MatchesFilter

The MorphState constants (MorphPresAffPlain, MorphPastProgNeg, ...) are the 5-bit wu xing encoding — they describe any language's morphological state space, not JA specifically. Move constants to iskra/morph.mx.

JA-specific verb forms (kuruStateForms, suruFormSuffixes, VerbPatterns, BuildVerbForms, addVerbForms) stay in transdb — they are JA morphology tables.

EN-specific forms (enIrregs, regularPast, regularProg) stay in transdb.

Register/domain/honorific constants (RegNeutral, RegFormal, DomGeneral, ...) move to iskra — they are universal lexical metadata.

Phase 3 — Language descriptor (transdb/langdesc.mx → iskra/langdesc.mx)

The descriptor framework is language-agnostic:

LangDesc{Order, HeadFinal, Particle, PreNomRC, ZeroCopula, Markers}
OrderSVO, OrderSOV, OrderVSO, ...
MarkerPrepositional, MarkerPostpositional, MarkerCase
RoleNone, RoleNPSubjTopic, RoleNPSubjGram, RoleNPObjDirect, RolePPLocative, ...
RegisterLangDesc(tree, pool, domain, desc)
GetLangDesc(tree, domain) (LangDesc, bool)
RegisterParticleRole(tree, pool, domain, semCoord, particle, role)
LookupParticleRole(tree, domain, particle, npFlags) uint8
LookupTargetMarker(dstDomain, role) string  -- table stays here, entries generic

JA-specific particle strings ("は", "が", ...) and jaDefaultRole map stay in transdb. EN-specific preposition strings stay in transdb (or move to LookupTargetMarker table in iskra).

The CoordVerbClass sentinel constant moves to iskra/langdesc.mx (it is part of the registration protocol, not JA-specific).

Phase 4 — Inflect framework (transdb/inflect.mx → iskra/inflect.mx)

Registration protocol and class-code scheme move to iskra:

VerbClassUnknown, VerbClassV1, VerbClassV5K, ... (0-15)  -- stay as JA codes
RegisterVerbClass(tree, pool, domain, dictForm, code)
GetVerbClass(tree, domain, dictForm) (string, bool)

VerbClassCode(s string) uint8 and VerbClassStr(code uint8) string stay in transdb because the class name strings ("v1", "v5k", ...) are JMdict-specific.

InflectJA(dictForm, class, state) string and InflectJAFromTree stay in transdb.

For future Slavic: InflectHR(stem, class, case, number) string goes in a new hr package that imports iskra for the registration protocol.

The general interface to define:

// iskra/inflect.mx
type InflectFunc func(dictForm string, classCode uint8, state uint8) string

// Each language registers its inflect function at init time.
func RegisterInflectFunc(domain uint8, fn InflectFunc)
func InflectFromTree(tree *lattice.Tree, domain uint8, dictForm string, state uint8) string
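
One way the registration hook could work, sketched in Go. The registry map, the ClassLookup stand-in for GetVerbClass on a lattice tree, the toy JA inflector, and the numeric values of classV1 and statePastPlain are all assumptions for illustration; only the InflectFunc signature and LangJA=0x02 come from the plan.

```go
package main

import (
	"fmt"
	"strings"
)

// InflectFunc matches the signature proposed in the plan.
type InflectFunc func(dictForm string, classCode uint8, state uint8) string

// inflectFuncs maps a domain byte to its inflector. It is written only
// at init time, so this sketch does not guard it with a mutex.
var inflectFuncs = map[uint8]InflectFunc{}

func RegisterInflectFunc(domain uint8, fn InflectFunc) { inflectFuncs[domain] = fn }

// ClassLookup is a hypothetical stand-in for GetVerbClass on a lattice tree.
type ClassLookup func(domain uint8, dictForm string) (uint8, bool)

// InflectFromTree resolves the class code, then dispatches to the
// domain's inflector. With no inflector or no class it returns the
// dictionary form unchanged.
func InflectFromTree(lookup ClassLookup, domain uint8, dictForm string, state uint8) string {
	fn, ok := inflectFuncs[domain]
	if !ok {
		return dictForm
	}
	code, ok := lookup(domain, dictForm)
	if !ok {
		return dictForm
	}
	return fn(dictForm, code, state)
}

const (
	langJA         uint8 = 0x02 // LangJA from the plan
	classV1        uint8 = 1    // hypothetical code for ichidan verbs
	statePastPlain uint8 = 1    // hypothetical morph state value
)

// toyInflectJA handles a single case: ichidan past plain (drop る, add た).
func toyInflectJA(dictForm string, classCode, state uint8) string {
	if classCode == classV1 && state == statePastPlain {
		return strings.TrimSuffix(dictForm, "る") + "た"
	}
	return dictForm
}

// toyClassLookup knows one verb.
func toyClassLookup(domain uint8, dictForm string) (uint8, bool) {
	if domain == langJA && dictForm == "食べる" {
		return classV1, true
	}
	return 0, false
}

// Each language package registers its inflector at init time.
func init() { RegisterInflectFunc(langJA, toyInflectJA) }

func main() {
	fmt.Println(InflectFromTree(toyClassLookup, langJA, "食べる", statePastPlain))
}
```

With this shape a future hr package would only need its own init registration; iskra never imports the language package.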

Phase 3b — Key normalization hook (iskra/langdesc.mx)

Add to LangDesc and the domain registration protocol:

// iskra/langdesc.mx
type KeyNormalizer func(word string) string

func RegisterKeyNormalizer(domain uint8, fn KeyNormalizer)
func NormalizeKey(domain uint8, word string) string

NormalizeKey is called by MakeKey callers before hashing. The default (nil normalizer) is the identity function. Each domain registers its own normalizer.

The Translate signature in sublattice.mx becomes:

func Translate(src SubLattice, dst SubLattice, word string, coord uint64) string

where word is already normalized for the source domain by the time it reaches Translate. Callers use NormalizeKey(src.Domain, rawWord) before calling Translate. This keeps the composition layer clean — Translate never sees raw unnormalized input.

The two-pass pattern for code lookup (qualified → bare) is implemented in the domain's normalizer as a closure over the package context, not in Translate itself. Translate calls MakeKey once with whatever the normalizer produces.

This also fixes an existing ambiguity in transdb: LookupWordCtx currently lowercases nothing and treats "Cat" and "cat" as different keys. Moving normalization to a registered hook makes the behavior explicit per domain rather than implicit per callsite.
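
The normalization hook could look roughly like the Go sketch below. The registry map, lowercasing as the EN policy, the qualifiedFirst helper, and its "known names" set (a stand-in for a real tree lookup) are assumptions for illustration; the type and function names and the domain bytes come from the plan.

```go
package main

import (
	"fmt"
	"strings"
)

type KeyNormalizer func(word string) string

var normalizers = map[uint8]KeyNormalizer{}

func RegisterKeyNormalizer(domain uint8, fn KeyNormalizer) { normalizers[domain] = fn }

// NormalizeKey applies the domain's normalizer; with none registered
// it is the identity function.
func NormalizeKey(domain uint8, word string) string {
	if fn := normalizers[domain]; fn != nil {
		return fn(word)
	}
	return word
}

const (
	LangEN         uint8 = 0x01
	LangJA         uint8 = 0x02
	DomainMoxieSRC uint8 = 0x10
)

// qualifiedFirst is a hypothetical code-domain normalizer. It closes
// over the package context and implements the two-pass idea: prefer the
// package-qualified name when it is known, else keep the bare name.
func qualifiedFirst(pkg string, known map[string]bool) KeyNormalizer {
	return func(word string) string {
		if strings.Contains(word, ".") {
			return word // already qualified
		}
		if q := pkg + "." + word; known[q] {
			return q
		}
		return word
	}
}

func init() {
	// Assumed EN policy: fold case so "Cat" and "cat" share a key.
	RegisterKeyNormalizer(LangEN, strings.ToLower)
	RegisterKeyNormalizer(DomainMoxieSRC,
		qualifiedFirst("fmt", map[string]bool{"fmt.Println": true}))
}

func main() {
	fmt.Println(NormalizeKey(LangEN, "Cat"))
	fmt.Println(NormalizeKey(LangJA, "猫"))
	fmt.Println(NormalizeKey(DomainMoxieSRC, "Println"))
}
```

Because the code-domain normalizer resolves qualified versus bare names itself, the caller still builds exactly one key per lookup.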

Phase 5 — Cluster pipeline (transdb/cluster.mx → iskra/cluster.mx)

The pipeline is abstract; particle detection is language-specific input:

// iskra/cluster.mx
type ClusterType uint8  // NPSubj, NPObj, VP, PP, Mod

type Cluster struct { Kind, Tokens, Flags, Role, Nested, Trans, Copular }

// Abstract pipeline stages — caller provides language hooks
type ParticleDetector func(tok string) bool
type RoleLookup func(tok string, npFlags uint64) uint8

ParseClusters(tokens []string, isParticle ParticleDetector, lookupRole RoleLookup,
    hasVerb func([]string) bool) []*Cluster

TranslateCluster(c *Cluster, lookup HeadLookup, srcDomain, dstDomain uint8)
ReorderClusters(clusters []*Cluster, srcOrder, dstOrder uint8) []*Cluster
InsertMarkers(clusters []*Cluster, dstDesc LangDesc, dstDomain uint8) string

jaParticleSet, jaDefaultRole, jaFunctionWord, isPureHiragana, accumFlags, hasVerb, filterContent stay in transdb as JA-specific implementations of the hooks.

clusterHeadLookup (coord-relaxation lookup) moves to iskra — it uses MakeKey and RelaxCoord, not JA-specific logic. The verbStems fallback stays in transdb.

joinWords, clusterKindFromRole, clusterFlags move to iskra.
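
To make the hook split concrete, here is a simplified Go sketch of the parsing stage. It assumes postpositional particles (as in JA), trims Cluster to two fields, omits the hasVerb hook, and uses made-up numeric values for the role constants; it is not the existing transdb logic.

```go
package main

import "fmt"

// Hook types as proposed in the plan.
type ParticleDetector func(tok string) bool
type RoleLookup func(tok string, npFlags uint64) uint8

// Cluster is trimmed to the fields this sketch needs.
type Cluster struct {
	Tokens []string
	Role   uint8
}

// Role values are hypothetical; only the names come from the plan.
const (
	RoleNone uint8 = iota
	RoleNPSubjGram
	RoleNPObjDirect
)

// ParseClusters groups tokens into clusters. A particle closes the
// pending cluster and assigns its role; trailing tokens with no
// particle form a final cluster with RoleNone. A particle with no
// pending tokens is dropped.
func ParseClusters(tokens []string, isParticle ParticleDetector, lookupRole RoleLookup) []*Cluster {
	var out []*Cluster
	var pending []string
	for _, tok := range tokens {
		if isParticle(tok) {
			if len(pending) > 0 {
				out = append(out, &Cluster{Tokens: pending, Role: lookupRole(tok, 0)})
				pending = nil
			}
			continue
		}
		pending = append(pending, tok)
	}
	if len(pending) > 0 {
		out = append(out, &Cluster{Tokens: pending, Role: RoleNone})
	}
	return out
}

// JA-side hook implementations; these would live in transdb.
func isJAParticle(tok string) bool { return tok == "が" || tok == "を" }

func jaRole(tok string, npFlags uint64) uint8 {
	switch tok {
	case "が":
		return RoleNPSubjGram
	case "を":
		return RoleNPObjDirect
	}
	return RoleNone
}

func main() {
	tokens := []string{"猫", "が", "魚", "を", "食べる"}
	for _, c := range ParseClusters(tokens, isJAParticle, jaRole) {
		fmt.Println(c.Tokens, c.Role)
	}
}
```

The pipeline function contains no JA strings; everything language-specific arrives through the two hooks.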

Phase 6 — Code lattice modernization (iskra internal)

Map the 5 unused iskradb branches to code-specific axes using the coord system now in iskra:

Branch          Axis            Code mapping
Bsemantic (0)   ontological     purity: pure=0, IO=1, mutating=2, unsafe=3
Bcooccur (2)    co-occurrence   call graph: callee prev/next type
Bvalency (6)    argument count  arity (0=niladic, 1=unary, 2=binary, 3=variadic)
Bregister (7)   register        scope level / allocation class
Bphonology (5)  phonological    reserved / package-level grouping

Replace FNV-56 key derivation with SipHash128(coord + name) using the coord system. The stage tag migrates to the domain byte: DomainMoxieSRC=0x10, DomainMoxieAST=0x11, DomainMoxieIR=0x12, DomainMoxieASM=0x13, DomainMoxieBIN=0x14.

Cross-stage adjacency (currently key-implicit via same FNV-56 + different stage prefix) becomes: same SipHash of name, different domain byte, RelaxCoord for "find this construct at a different stage".
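
The cross-stage idea can be sketched in Go as follows. The 17-byte key layout is an assumption, and because the Go standard library has no SipHash, a truncated SHA-256 stands in for the 128-bit name hash. The coord is left out of the hash here to keep the example small; only the domain byte values come from the plan.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Stage tags as domain bytes, from the plan.
const (
	DomainMoxieSRC uint8 = 0x10
	DomainMoxieAST uint8 = 0x11
	DomainMoxieIR  uint8 = 0x12
)

// Key is a hypothetical layout: one domain byte, then a 128-bit name hash.
type Key [17]byte

// makeKey derives a key for name at the given stage. Truncated SHA-256
// is a stand-in for SipHash128, not the proposed hash itself.
func makeKey(domain uint8, name string) Key {
	var k Key
	k[0] = domain
	sum := sha256.Sum256([]byte(name))
	copy(k[1:], sum[:16])
	return k
}

// atStage returns the key of the same construct at another stage:
// the name hash is kept and only the domain byte changes.
func atStage(k Key, domain uint8) Key {
	k[0] = domain
	return k
}

func main() {
	src := makeKey(DomainMoxieSRC, "parseExpr")
	ir := atStage(src, DomainMoxieIR)
	fmt.Printf("src: %x\nir:  %x\n", src, ir)
}
```

Finding a construct at a different stage is then a one-byte rewrite of the key instead of a separate hash per stage prefix.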

Replace linear BindingSignature scan with coord-based lookup: encode arity in the valency axis, purity in the semantic axis. find-sig becomes LookupWordCtx(tree, pool, name, DomainMoxieSRC, coord) with RelaxCoord doing the structural distance search.

Sub-lattice composition model (iskra/sublattice.mx)

New file. A sub-lattice is a slice of the iskradb tree where all keys share the same domain byte. Sub-lattices can be composed (translated between) by finding coord-equivalent records across domains.

// iskra/sublattice.mx

// SubLattice is a domain-scoped view of an iskradb tree.
type SubLattice struct {
    Tree   *lattice.Tree
    Pool   []byte
    Domain uint8
}

// Translate finds the best match in dstDomain for a record in srcDomain.
// Uses the record's coord (from DataFile morph+semantic bits) as the
// lookup coord in the destination sub-lattice.
func Translate(src SubLattice, dst SubLattice, word string, coord uint64) string

// Compose merges two sub-lattices into the same tree.
// Keys don't collide because domain bytes differ.
func Compose(tree *lattice.Tree, a, b SubLattice)

Translation between any two languages becomes a single call such as Translate(jaSubLattice, enSubLattice, word, coord). The call has the same shape whether the pair is JA/EN, Moxie/LLVM, or HR/RU.
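
A toy Go model of the composition claim, with a plain map standing in for the iskradb tree. The Key struct, the Store type, the Lookup method, and the payload strings are assumptions for illustration; the sketch only demonstrates that records from two domains can share one store without colliding, and does not model Translate.

```go
package main

import "fmt"

// Key is a hypothetical stand-in for lattice.Key: domain byte, coord, word.
type Key struct {
	Domain uint8
	Coord  uint64
	Word   string
}

// Store stands in for the iskradb tree.
type Store map[Key]string

// SubLattice is a domain-scoped view of a store.
type SubLattice struct {
	Tree   Store
	Domain uint8
}

// Lookup sees only records whose key carries this view's domain byte.
func (s SubLattice) Lookup(coord uint64, word string) (string, bool) {
	v, ok := s.Tree[Key{s.Domain, coord, word}]
	return v, ok
}

// Compose copies both sub-lattices into one tree. Keys cannot collide
// because the domain byte is part of every key.
func Compose(tree Store, a, b SubLattice) {
	for _, s := range []SubLattice{a, b} {
		for k, v := range s.Tree {
			if k.Domain == s.Domain {
				tree[k] = v
			}
		}
	}
}

const (
	LangEN         uint8 = 0x01
	DomainMoxieSRC uint8 = 0x10
)

func main() {
	en := SubLattice{Store{{LangEN, 0, "main"}: "EN record"}, LangEN}
	src := SubLattice{Store{{DomainMoxieSRC, 0, "main"}: "SRC record"}, DomainMoxieSRC}

	shared := Store{}
	Compose(shared, en, src)

	// Same word and coord, two domains, two distinct records.
	fmt.Println(SubLattice{shared, LangEN}.Lookup(0, "main"))
	fmt.Println(SubLattice{shared, DomainMoxieSRC}.Lookup(0, "main"))
}
```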


Migration sequence

All steps are non-breaking (transdb imports iskra, no API surface changes visible to callers).

  1. coord.mx: move from transdb. Trivial: no dependencies on JA data. transdb adds import "git.mleku.dev/iskra". ~1 hour.
  2. morph.mx: move encoding protocol. JA MorphState constants move with it (they are general, not JA-specific). JA verb tables stay. ~1 hour.
  3. langdesc.mx: move descriptor structs and role constants. JA particle strings stay in transdb. ~2 hours.
  4. inflect.mx: move registration protocol. Add InflectFunc hook. transdb registers its JA inflect function. ~2 hours.
  5. cluster.mx: parameterize with hooks. Most logic moves; JA particle detection stays in transdb as a hook implementation. ~4 hours.
  6. sublattice.mx: new. Implement composition model. ~3 hours.
  7. iskra code lattice: replace FNV key with SipHash coord key, activate 5 dormant branches, replace BindingSignature scan with coord-based lookup. ~1 day.

Total: ~2-3 days of work, none of which blocks current transdb/iskradb development.

What does NOT move

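Stays in transdb permanently (as stated per phase above):

JMdict ingest
POSForWord, ActiveBranches, branchOrderJA (JA branch heuristics)
kuruStateForms, suruFormSuffixes, VerbPatterns, BuildVerbForms, addVerbForms (JA verb tables)
enIrregs, regularPast, regularProg (EN forms)
JA particle strings and the jaDefaultRole map
VerbClassCode, VerbClassStr, InflectJA, InflectJAFromTree
jaParticleSet, jaFunctionWord, isPureHiragana, accumFlags, hasVerb, filterContent (JA cluster hooks)
verbStems fallback

Stays in iskradb permanently:

Raw B-tree storage, with no domain knowledge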

Dependency graph after consolidation

iskradb  (storage, no imports from iskra or transdb)
    ↑
    iskra  (algorithms; imports iskradb)
         ↑
         transdb  (JA/EN data; imports iskra + iskradb)
         iskra/code  (code lattice; imports iskra + iskradb, no NL dependency)

The code lattice and NL lattice share iskra's algorithm layer and coexist in the same iskradb store with non-overlapping domain bytes.