git clone https://git.smesh.lol/iskra.git

Lattice language model. Semantic walks on a Bethe lattice (Cayley tree), bounded only by memory access speed, with low compute requirements. Iskra maps constructs (natural language morphemes or code symbols) onto a multi-branch B-tree, enabling structural isomorphism detection, cross-domain translation, and lattice-based compilation.
Not a neural model. No embeddings, no floating point, no training loop. Structure is stored as structure - exact integer distances on the lattice, deterministic walks, algebraic operations on keys.
Built in Moxie. Storage engine: iskradb. Natural language translation: transdb (work in progress - English-Japanese bilingual lexicon with morphological conjugation, semantic coord disambiguation, and cluster-based phrase translation).
Iskra uses iskradb's eight-branch lattice. Three primary branches carry the semantic axes; additional branches serve as metadata, pattern, and co-occurrence stores:
| Branch | iskradb name | Language domain | Code domain |
|---|---|---|---|
| Bgrammatical | Bnoun | Nouns, adjectives, nominal forms | Types, structs, variables, constants |
| Bmorphology | Bverb | Verbs, conjugated forms | Functions, methods |
| Bpragmatic | Bmodifier | Particles, function words, modifiers | Expressions, control flow, fields |
| Bsemantic | - | Atoms (lemmas/stems) for cross-language lookup | - |
| Bcooccur | - | Bigrams, language descriptors, verb classes, particle roles | - |
The same lattice structure handles both natural language and code. Code translation (SRC/AST/IR/ASM/BIN stages) is paused - natural language is a superset of what is needed.
Two key schemes depending on domain:
Natural language (coord-based):
Key = SipHash128(seed, [lang(1) || coord(8 LE) || word(N)])
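The input to the hash can be sketched in Go as follows. This builds only the byte string lang(1) || coord(8 LE) || word(N); the SipHash128 step is left out because the Go standard library has no SipHash. The function name keyPreimage and the language id 2 are illustrative assumptions, not iskra identifiers.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// keyPreimage builds lang(1) || coord(8 LE) || word(N), the byte string that
// is fed to SipHash128. The hash itself is not computed here.
func keyPreimage(lang uint8, coord uint64, word string) []byte {
	buf := make([]byte, 0, 1+8+len(word))
	buf = append(buf, lang)
	buf = binary.LittleEndian.AppendUint64(buf, coord) // 8 bytes, little-endian
	buf = append(buf, word...)
	return buf
}

func main() {
	// Language id 2 is a made-up value for illustration.
	p := keyPreimage(2, 0x0102030405060708, "neko")
	fmt.Printf("% x\n", p) // 02 08 07 06 05 04 03 02 01 6e 65 6b 6f
}
```

Because the coord is part of the hashed input, the same word under two different coords yields two unrelated keys.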
Code lattice (stage-based):
Key[0] = (stage << 56) | (FNV-1a-56(name) & 0x00FFFFFFFFFFFFFF)
Key[1] = 0
Cross-stage adjacency is key-implicit: the same name at a different stage produces the same 56-bit hash, different stage prefix.
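A minimal Go sketch of the stage-based key, assuming that "FNV-1a-56" means standard 64-bit FNV-1a truncated to its low 56 bits by the mask shown above. The helper names codeKey0 and sameName and the stage numbers are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const nameMask = 0x00FFFFFFFFFFFFFF // low 56 bits

// codeKey0 packs an 8-bit stage into the top byte and a 56-bit name hash below it.
// Assumption: "FNV-1a-56" is 64-bit FNV-1a truncated by the mask.
func codeKey0(stage uint8, name string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(name)) // hash.Hash.Write never returns an error
	return uint64(stage)<<56 | h.Sum64()&nameMask
}

// sameName reports whether two keys carry the same 56-bit name hash.
func sameName(a, b uint64) bool { return a&nameMask == b&nameMask }

func main() {
	// Stage numbers 0 and 2 are placeholders, not iskra's real stage ids.
	src := codeKey0(0, "parseExpr")
	ir := codeKey0(2, "parseExpr")
	fmt.Printf("%016x\n%016x\n", src, ir)
	fmt.Println(sameName(src, ir)) // true
}
```

Finding the counterpart of a name at another stage needs no extra index: replace the top byte and look the key up.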
Coord layout (uint64):
bits 63-48 semantic (16 bits): 8 subject|object category pairs, 2 bits each
bits 47-32 reserved
bits 31-29 grammatical (3 bits): syntactic role
bits 28-25 cooccur (4 bits): prev_type(2) + next_type(2)
bits 24-20 morphstate (5 bits): tense/aspect/polarity/formality/evidential
bits 19-18 pragmatic (2 bits): domain context
bits 17-16 valency (2 bits): argument count
bits 15-2 reserved (14 bits): available for Slavic case/number
bits 1-0 register (2 bits): social register
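The bit ranges above translate directly into shift-and-mask accessors. The Go sketch below is illustrative only: the field type, its methods, and the sample values are assumptions, not iskra's API.

```go
package main

import "fmt"

// field describes one bit range of the 64-bit coord.
type field struct {
	shift uint
	width uint
}

// Bit ranges copied from the layout above.
var (
	semantic    = field{48, 16}
	grammatical = field{29, 3}
	cooccur     = field{25, 4}
	morphstate  = field{20, 5}
	pragmatic   = field{18, 2}
	valency     = field{16, 2}
	register    = field{0, 2}
)

// mask returns the field's bits in place.
func (f field) mask() uint64 { return ((uint64(1) << f.width) - 1) << f.shift }

// get extracts the field value from coord.
func (f field) get(coord uint64) uint64 { return (coord & f.mask()) >> f.shift }

// set returns coord with the field replaced by v (excess bits of v are dropped).
func (f field) set(coord, v uint64) uint64 {
	return (coord &^ f.mask()) | ((v << f.shift) & f.mask())
}

func main() {
	// Sample values are arbitrary; their meaning is not defined here.
	var c uint64
	c = grammatical.set(c, 5)
	c = morphstate.set(c, 0x11)
	c = register.set(c, 2)
	fmt.Printf("%016x\n", c) // 00000000a1100002
	fmt.Println(grammatical.get(c), morphstate.get(c), register.get(c)) // 5 17 2
}
```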
RelaxCoord(coord) []uint64 returns a cascade from most-specific to coord=0. Stripping order: pragmatic, register, valency, semantic bits MSB-LSB, grammatical, cooccur, morphstate.
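A rough Go sketch of such a cascade, under stated assumptions: the semantic field is cleared one 2-bit category pair at a time from the most significant pair down, steps that change nothing are skipped, and the original coord is the first element. The real RelaxCoord may differ on any of these points; relaxCoord and stripMasks are hypothetical names.

```go
package main

import "fmt"

// stripMasks lists the bit masks in the stripping order given above.
// Assumption: "semantic bits MSB-LSB" means one 2-bit pair per step.
func stripMasks() []uint64 {
	masks := []uint64{
		0x3 << 18, // pragmatic
		0x3,       // register
		0x3 << 16, // valency
	}
	for shift := 62; shift >= 48; shift -= 2 {
		masks = append(masks, uint64(0x3)<<uint(shift)) // one semantic pair
	}
	return append(masks,
		0x7<<29,  // grammatical
		0xF<<25,  // cooccur
		0x1F<<20, // morphstate
	)
}

// relaxCoord returns coord followed by each distinct, progressively less
// specific coord, ending at 0.
func relaxCoord(coord uint64) []uint64 {
	out := []uint64{coord}
	for _, m := range stripMasks() {
		if next := coord &^ m; next != coord {
			coord = next
			out = append(out, coord)
		}
	}
	if coord != 0 { // reserved bits may still be set
		out = append(out, 0)
	}
	return out
}

func main() {
	for _, c := range relaxCoord(0xA1100002) {
		fmt.Printf("%016x\n", c)
	}
}
```

A lookup can then try each element in turn and stop at the first coord for which a key exists.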
TritPath encodes a position in the three-branch semantic space. Up to 16 trits are packed into a uint32 (2 bits each).
Level 1 - declaration kind:
BranchType = 1 BranchFunc = 2 BranchData = 3
Level 2 - subdivision:
Type: SubField=1 SubMethod=2 SubEmbed=3
Func: SubStatement=1 SubExpression=2 SubControl=3
Data: SubConst=1 SubVar=2 SubImport=3
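Packing and unpacking can be sketched in Go as below. Two assumptions are made: level 1 sits in the lowest two bits, and the value 0 marks an unused level (the constants above start at 1). The helper names pack, at, and depth are hypothetical.

```go
package main

import "fmt"

// Constants taken from the listing above.
const (
	BranchType, BranchFunc, BranchData      = 1, 2, 3
	SubStatement, SubExpression, SubControl = 1, 2, 3
)

// pack stores trits[0] (level 1) in the lowest two bits; 0 marks an unused
// level. The bit order is an assumption. Callers pass at most 16 trits.
func pack(trits ...uint32) uint32 {
	var p uint32
	for i, t := range trits {
		p |= (t & 3) << (2 * uint(i))
	}
	return p
}

// at returns the trit at the given 1-based level, or 0 if the path is shorter.
func at(p uint32, level int) uint32 { return (p >> (2 * uint(level-1))) & 3 }

// depth counts the levels in use.
func depth(p uint32) int {
	d := 0
	for ; p != 0; p >>= 2 {
		d++
	}
	return d
}

func main() {
	// A control-flow construct inside a function: Func -> SubControl.
	p := pack(BranchFunc, SubControl)
	fmt.Printf("%#x depth=%d level1=%d level2=%d\n", p, depth(p), at(p, 1), at(p, 2))
}
```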
type Tree struct {
    db            *lattice.Tree
    RecMeta       []MetaEntry
    StringPool    []byte
    TokenPool     []uint32
    TokFile       *os.File             // disk-backed token pool
    TokRaw        []byte               // preloaded token pool
    Dict          *Dict
    BigramIdx     map[string][]uint32  // word -> bigram continuations
    SigIdx        map[uint32][]uint32  // (stage<<24)|sighash -> []recIdx
    BulkMode      bool
    BulkMetaStore *BulkMeta
    BulkPoolFile  *os.File
    CostMap       map[uint32]CostEntry
}
Record index is the single unified index for both the B-tree record and metadata.
MetaEntry fields:
| Field | Type | Description |
|---|---|---|
| Count | uint32 | occurrence count (used in upserts) |
| Kind | NodeKind | construct classification |
| StageTag | uint8 | low 4 bits = srcLang, high 4 bits = generation |
| Extra | [16]byte | packed auxiliary data |
Extra layout:
[0:4] TritPath (uint32 LE)
[4] low nibble: Rotation (branch transition encoding)
[5:8] 24-bit signature hash (fast rejection in isomorphism lookup)
[8:12] ContentOffset into TokenPool (uint32 LE)
[12:16] ContentLen in tokens (uint32 LE)
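Decoding this layout, together with the two StageTag nibbles, might look like the Go sketch below. The byte order of the 24-bit signature hash is not stated above and is assumed to be little-endian here; extraFields, decodeExtra, and splitStageTag are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// extraFields holds the decoded contents of the 16-byte Extra array.
type extraFields struct {
	TritPath      uint32
	Rotation      uint8
	SigHash24     uint32
	ContentOffset uint32
	ContentLen    uint32
}

// decodeExtra unpacks Extra according to the layout above.
func decodeExtra(e [16]byte) extraFields {
	return extraFields{
		TritPath: binary.LittleEndian.Uint32(e[0:4]),
		Rotation: e[4] & 0x0F, // low nibble only
		// Assumption: the 24-bit hash is stored little-endian.
		SigHash24:     uint32(e[5]) | uint32(e[6])<<8 | uint32(e[7])<<16,
		ContentOffset: binary.LittleEndian.Uint32(e[8:12]),
		ContentLen:    binary.LittleEndian.Uint32(e[12:16]),
	}
}

// splitStageTag separates the two nibbles of StageTag.
func splitStageTag(tag uint8) (srcLang, generation uint8) {
	return tag & 0x0F, tag >> 4
}

func main() {
	// Sample bytes are arbitrary.
	e := [16]byte{0x0e, 0, 0, 0, 0x23, 0xAA, 0xBB, 0xCC, 0x10, 0, 0, 0, 0x04, 0, 0, 0}
	fmt.Printf("%+v\n", decodeExtra(e))
	lang, gen := splitStageTag(0x21)
	fmt.Println(lang, gen)
}
```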
In-memory: standard lattice operations, no persistence.
Disk-backed: .iskr (iskradb file) + .meta (metadata) + .tokpool (token pool). StorageCreate/StorageOpen/StorageFlush/StorageClose.
Bulk ingest: .bulkmeta (LRU cache) + .bulkpool (overflow) for bounded-memory ingestion of 100M+ corpora. EnableBulkStorage/FinalizeBulkStorage.
Sentences are decomposed into byte-encoded skeletons:
- Bytes with the high bit set (0x80 | role_id): content positions (128 roles)
- Bytes in the range 0x00-0x7F: structural words (JA particles, EN prepositions, code keywords)

ExtractJA, ExtractEN, ExtractKO, ExtractZH produce ExtractResult containing pattern, slots, roles, deep pattern, set representation, and discourse structure.
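The skeleton byte encoding can be illustrated with a small Go sketch. The helper names, the role ids, and the structural-word table are invented for the example; only the high-bit convention comes from the description above.

```go
package main

import "fmt"

// isSlot reports whether b marks a content position (high bit set).
func isSlot(b byte) bool { return b&0x80 != 0 }

// roleID extracts the 7-bit role id from a slot byte.
func roleID(b byte) byte { return b & 0x7F }

// describe renders a skeleton; structural bytes are looked up in a
// caller-supplied table.
func describe(skel []byte, words map[byte]string) []string {
	out := make([]string, 0, len(skel))
	for _, b := range skel {
		if isSlot(b) {
			out = append(out, fmt.Sprintf("<role %d>", roleID(b)))
		} else {
			out = append(out, words[b])
		}
	}
	return out
}

func main() {
	// Hypothetical ids: roles 1 and 2 are content slots, 0x05 stands for "of".
	words := map[byte]string{0x05: "of"}
	skel := []byte{0x80 | 1, 0x05, 0x80 | 2}
	fmt.Println(describe(skel, words)) // [<role 1> of <role 2>]
}
```

Two sentences with the same skeleton share their structural words and role order and differ only in what fills the slots.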
Each sentence decomposes into a set of SetEntry:
Role, Atom, Morph, Class, Mark, OblRole, Head, ModKind
Three-layer nesting: macrorole (subject/object/oblique), thematic role (agent/patient/goal/source/instrument), modification relation (possessive/attributive/copular/adverbial/relative).
Sidecar index for cross-language translation pairs. AtomIdxEntry stores source/destination atoms with context, role, generation, and weight. Automatically mirrored for bidirectional lookup. Two generations: GenLegacy (bilateral, no context) and GenContexted (context-aware).
All distances are exact integers. No floating point.
TritPath distance - hops via LCA on the ternary tree.
Tree distance - hops via LCA in the B-tree. Measures storage locality.
Geometric distance - L1 distance over binding signatures (alpha-equivalence).
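As an illustration of the first of these, the Go sketch below computes the TritPath distance as hops up to the lowest common ancestor plus hops down. It assumes level 1 is stored in the lowest two bits and 0 marks an unused level; the function names are hypothetical.

```go
package main

import "fmt"

// depth counts the levels in use (assumes 0 marks an unused level).
func depth(p uint32) int {
	d := 0
	for ; p != 0; p >>= 2 {
		d++
	}
	return d
}

// commonPrefix counts the leading levels on which a and b agree,
// which is the depth of their lowest common ancestor.
func commonPrefix(a, b uint32) int {
	n := 0
	for a&3 != 0 && a&3 == b&3 {
		n++
		a >>= 2
		b >>= 2
	}
	return n
}

// tritDistance is the number of hops from a up to the LCA and down to b.
func tritDistance(a, b uint32) int {
	return depth(a) + depth(b) - 2*commonPrefix(a, b)
}

func main() {
	funcControl := uint32(2 | 3<<2)   // Func -> SubControl
	funcStatement := uint32(2 | 1<<2) // Func -> SubStatement
	typeField := uint32(1 | 1<<2)     // Type -> SubField
	fmt.Println(tritDistance(funcControl, funcStatement)) // 2
	fmt.Println(tritDistance(funcControl, typeField))     // 4
}
```

The result is always a non-negative integer, consistent with the no-floating-point rule above.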
WalkStep uses BigramIdx for O(1) continuation lookup. WalkStepCrossDomain translates steps across language domains. BigramWeightRelaxed uses cascading coord relaxation for fallback.
12 register-based text metrics for EN, JA, KO, ZH:
tok_mean, slot_mean, slot_dens, morph_dens, clause_mean, mark_div,
cjk_ratio, archaic_per_s, atom_TTR, pat_TTR, zipf_slope
BindingSignature: geometric structure of declarations and references with names erased. Two functions with identical binding signatures are alpha-equivalent.
SignatureHash24 - 24-bit fast-rejection hash in MetaEntry.Extra[5:8].
find-sig looks up functions by signature match. audit verifies self-isomorphism (100% pass = correct signature generation).
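One simple way to erase names while keeping binding structure is to replace each identifier with the index of its first occurrence. The Go sketch below shows that idea together with a 24-bit pre-check; it is not iskra's actual BindingSignature or SignatureHash24, and the masked FNV-1a hash is a stand-in chosen for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// signature replaces every identifier by the index of its first occurrence,
// so only the binding structure remains.
func signature(idents []string) []int {
	seen := make(map[string]int)
	sig := make([]int, len(idents))
	for i, id := range idents {
		n, ok := seen[id]
		if !ok {
			n = len(seen)
			seen[id] = n
		}
		sig[i] = n
	}
	return sig
}

// hash24 is a stand-in 24-bit hash (32-bit FNV-1a, masked).
func hash24(sig []int) uint32 {
	h := fnv.New32a()
	for _, n := range sig {
		h.Write([]byte{byte(n), byte(n >> 8)})
	}
	return h.Sum32() & 0xFFFFFF
}

// isomorphic rejects on hash mismatch first, then compares in full.
func isomorphic(a, b []int) bool {
	if hash24(a) != hash24(b) || len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	f := signature([]string{"x", "y", "x", "y"}) // params x, y; body uses x, y
	g := signature([]string{"a", "b", "a", "b"}) // same shape, other names
	h := signature([]string{"a", "b", "b", "a"}) // uses swapped
	fmt.Println(f, g, h)
	fmt.Println(isomorphic(f, g), isomorphic(f, h)) // true false
}
```

The hash comparison only rules candidates out; a hash match is always confirmed by the full comparison, so collisions cannot produce false positives.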
Five compiler stages (SRC/AST/IR/ASM/BIN) map code constructs across compilation. iskra compile uses the lattice as a compilation substrate. This work is paused - natural language is the active focus since it subsumes the structural requirements.
MOXIEROOT=../moxie moxie build ./cmd/iskra
# corpus & storage
iskra load <corpus-dir>                       load mxcorpus output, print stats
iskra save <corpus-dir> <out.mesh>            load corpus, save binary mesh
iskra load-multi <dir1> [dir2...] -o <f>      merge multiple corpora, save mesh
iskra load-mesh <file.mesh>                   load mesh, print stats

# lattice queries
iskra lookup <mesh-or-corpus> <name>          lookup segment, show cross-stage links
iskra forward <mesh-or-corpus> <name>         find SRC entry, follow to IR
iskra reverse <mesh-or-corpus> <name>         walk BIN->ASM->IR->AST->SRC
iskra content <mesh-or-corpus> <name>         show stored content for all stages

# natural language
iskra translate <mesh> <word>                 cross-domain word translation
iskra lemma-test <mesh> <word>                test lemmatization (EN/JA/KO/ZH)
iskra fingerprint <mesh> <text>               compute linguistic fingerprint
iskra rt-ja <mesh> <word>                     JA roundtrip test
iskra rt-cross <mesh> <word>                  cross-language roundtrip
iskra rt-corpus <mesh> <corpus>               corpus roundtrip accuracy
iskra rt-atoms <mesh>                         atom-level roundtrip
iskra build-atomidx <mesh>                    create sidecar atom-link index

# pattern analysis
iskra pat-query <mesh> <pattern>              pattern query
iskra pat-stats <mesh>                        pattern statistics
iskra pat-roundtrip <mesh> <text>             pattern roundtrip test

# code analysis
iskra astgen <source.mx>                      generate AST dumps from source
iskra compile -mesh <f> -o <out.ll> <src.mx>  lattice-based compile
iskra pipeline <corpus-dirs...>               evaluate SRC->AST->IR transforms
iskra find-sig <mesh-or-corpus> <name>        find isomorphic functions
iskra subst <mesh-or-corpus> <src> <tgt>      substitute names in IR
iskra classify-pair <mesh> <name1> <name2>    classify structural relationship

# maintenance
iskra quality <corpus-dir>                    reverse reconstruction coverage
iskra stats <file.mesh>                       lattice statistics
iskra verify <corpus-dir>                     verify lattice integrity
iskra audit <corpus-dirs...>                  isomorphism audit
iskra vocab <mesh-or-corpus>                  dictionary statistics
iskra bench-gen -d <dir> [-c <corpus>]        generate benchmark files
iskra bench-run -d <dir> -m <moxieroot>       run benchmarks
iskra bench-ingest <mesh-or-corpus> <tsv>     ingest benchmark costs
iskra test                                    built-in smoke test
Licensed under AGPL-3.0-or-later.