iskra

Lattice language model: semantic walks on a Bethe lattice (Cayley tree). Bounded only by memory access speed; compute requirements are low.

git clone https://git.smesh.lol/iskra.git

Lattice language model: semantic walks on a Bethe lattice (Cayley tree). Performance is bounded only by memory access speed; compute requirements are low. Iskra maps constructs (natural-language morphemes or code symbols) onto a multi-branch B-tree, enabling structural isomorphism detection, cross-domain translation, and lattice-based compilation.

Not a neural model. No embeddings, no floating point, no training loop. Structure is stored as structure - exact integer distances on the lattice, deterministic walks, algebraic operations on keys.

Built in Moxie. Storage engine: iskradb. Natural language translation: transdb (work in progress - English-Japanese bilingual lexicon with morphological conjugation, semantic coord disambiguation, and cluster-based phrase translation).

Architecture

Semantic branches

Iskra uses iskradb's eight-branch lattice. Three primary branches carry the semantic axes; additional branches serve as metadata, pattern, and co-occurrence stores:

Branch        iskradb name  Language domain                                             Code domain
Bgrammatical  Bnoun         Nouns, adjectives, nominal forms                            Types, structs, variables, constants
Bmorphology   Bverb         Verbs, conjugated forms                                     Functions, methods
Bpragmatic    Bmodifier     Particles, function words, modifiers                        Expressions, control flow, fields
Bsemantic     -             Atoms (lemmas/stems) for cross-language lookup              -
Bcooccur      -             Bigrams, language descriptors, verb classes, particle roles -

The same lattice structure handles both natural language and code. Code translation (SRC/AST/IR/ASM/BIN stages) is paused: natural language already covers a superset of the structure that code needs.

Key formats

Two key schemes depending on domain:

Natural language (coord-based):

Key = SipHash128(seed, [lang(1) || coord(8 LE) || word(N)])
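The byte string that feeds the hash can be sketched as follows. This only illustrates the layout `lang(1) || coord(8 LE) || word(N)` from the formula above; the function name `keyPayload` and the language code `1` are hypothetical, and the SipHash128 step itself is left out because Go's standard library does not ship it.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// keyPayload builds the hash input: one language byte, the 64-bit
// coordinate in little-endian order, then the raw bytes of the word.
// (Hypothetical helper; not part of the iskra API.)
func keyPayload(lang byte, coord uint64, word string) []byte {
	buf := make([]byte, 9, 9+len(word))
	buf[0] = lang
	binary.LittleEndian.PutUint64(buf[1:9], coord)
	return append(buf, word...)
}

func main() {
	// lang=1 is an assumed language code, used only for illustration.
	p := keyPayload(1, 0x0102, "cat")
	fmt.Printf("% x\n", p)
}
```

Because the coordinate is part of the hashed payload, the same word under two different coords yields two unrelated keys.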

Code lattice (stage-based):

Key[0] = (stage << 56) | (FNV-1a-56(name) & 0x00FFFFFFFFFFFFFF)
Key[1] = 0

Cross-stage adjacency is key-implicit: the same name at a different stage produces the same 56-bit hash, different stage prefix.
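A minimal sketch of the stage-based key, assuming that "FNV-1a-56" means the standard 64-bit FNV-1a hash truncated to its low 56 bits (the source gives only the mask, not the folding method). The stage numbers used here are placeholders, not iskra's actual values.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const nameMask = 0x00FFFFFFFFFFFFFF // low 56 bits

// codeKey packs the stage into the top byte and a 56-bit name hash
// into the rest of Key[0]; Key[1] is always zero.
// Assumption: FNV-1a-56 = 64-bit FNV-1a masked to 56 bits.
func codeKey(stage uint8, name string) [2]uint64 {
	h := fnv.New64a()
	h.Write([]byte(name))
	return [2]uint64{(uint64(stage) << 56) | (h.Sum64() & nameMask), 0}
}

func main() {
	// Stage numbers 1 and 3 are hypothetical.
	src := codeKey(1, "main.add")
	ir := codeKey(3, "main.add")
	fmt.Printf("%016x\n%016x\n", src[0], ir[0])
	// Same name: identical low 56 bits, different stage prefix.
	fmt.Println(src[0]&nameMask == ir[0]&nameMask)
}
```

Under this scheme, moving from one stage to another for the same name only requires replacing the top byte of the key; no lookup table is involved.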

Coordinate system (64-bit packed)

bits 63-48  semantic    (16 bits): 8 subject|object category pairs, 2 bits each
bits 47-32  reserved
bits 31-29  grammatical (3 bits): syntactic role
bits 28-25  cooccur     (4 bits): prev_type(2) + next_type(2)
bits 24-20  morphstate  (5 bits): tense/aspect/polarity/formality/evidential
bits 19-18  pragmatic   (2 bits): domain context
bits 17-16  valency     (2 bits): argument count
bits 15-2   reserved    (14 bits): available for Slavic case/number
bits  1-0   register    (2 bits): social register
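The layout above can be read and written with plain shifts and masks. The sketch below is illustrative only: the `field` type and its `get`/`set` methods are hypothetical names, while the bit positions and widths are taken from the table.

```go
package main

import "fmt"

// field describes one packed coordinate field by its lowest bit and width.
type field struct{ shift, width uint }

var (
	semantic    = field{48, 16}
	grammatical = field{29, 3}
	cooccur     = field{25, 4}
	morphstate  = field{20, 5}
	pragmatic   = field{18, 2}
	valency     = field{16, 2}
	register    = field{0, 2}
)

// mask returns the field's bits in place within the 64-bit coordinate.
func (f field) mask() uint64 { return ((uint64(1) << f.width) - 1) << f.shift }

// get extracts the field value from a coordinate.
func (f field) get(coord uint64) uint64 { return (coord & f.mask()) >> f.shift }

// set returns coord with the field replaced by v (excess bits of v are dropped).
func (f field) set(coord, v uint64) uint64 {
	return (coord &^ f.mask()) | ((v << f.shift) & f.mask())
}

func main() {
	var c uint64
	c = morphstate.set(c, 0x1F)
	c = grammatical.set(c, 5)
	fmt.Printf("%#016x grammatical=%d morphstate=%d\n",
		c, grammatical.get(c), morphstate.get(c))
}
```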

RelaxCoord(coord) []uint64 returns a cascade of progressively less specific coordinates, from the full coord down to coord=0. Fields are stripped in this order: pragmatic, register, valency, semantic bits (MSB to LSB), grammatical, cooccur, morphstate.
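One way such a cascade could be produced is sketched below. It is not the actual RelaxCoord: it assumes that each non-semantic field is cleared in a single step, that semantic bits are cleared one bit at a time from bit 63 down to bit 48, and that steps which change nothing are skipped. The name `relaxSketch` is hypothetical.

```go
package main

import "fmt"

// relaxSketch returns coord followed by progressively relaxed variants,
// ending at 0. Assumptions: whole-field stripping, bit-by-bit semantic
// stripping, and no duplicate entries in the cascade.
func relaxSketch(coord uint64) []uint64 {
	masks := []uint64{
		0x3 << 18, // pragmatic
		0x3,       // register
		0x3 << 16, // valency
	}
	for bit := 63; bit >= 48; bit-- { // semantic, MSB to LSB
		masks = append(masks, uint64(1)<<uint(bit))
	}
	masks = append(masks,
		0x7<<29,  // grammatical
		0xF<<25,  // cooccur
		0x1F<<20, // morphstate
	)

	out := []uint64{coord}
	cur := coord
	for _, m := range masks {
		if next := cur &^ m; next != cur {
			cur = next
			out = append(out, cur)
		}
	}
	if cur != 0 { // only reserved bits remain; finish at coord=0
		out = append(out, 0)
	}
	return out
}

func main() {
	c := uint64(1)<<18 | 2 | uint64(3)<<20 // pragmatic=1, register=2, morphstate=3
	for _, r := range relaxSketch(c) {
		fmt.Printf("%#x\n", r)
	}
}
```

A lookup that misses at the most specific coordinate can then retry along the returned slice until it finds an entry or reaches coord=0.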

TritPath

Encodes a position in the three-branch semantic space. Up to 16 trits packed into uint32 (2 bits each).

Level 1 - declaration kind:

BranchType = 1    BranchFunc = 2    BranchData = 3

Level 2 - subdivision:

Type:  SubField=1  SubMethod=2  SubEmbed=3
Func:  SubStatement=1  SubExpression=2  SubControl=3
Data:  SubConst=1  SubVar=2  SubImport=3
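A minimal packing sketch for the two levels above. The source states only "2 bits each" in a uint32; this example assumes level 1 occupies the lowest two bits, deeper levels follow upward, and the value 0 marks an unused level. The helper names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

type TritPath uint32

const (
	BranchType = 1
	BranchFunc = 2
	BranchData = 3

	SubStatement  = 1
	SubExpression = 2
	SubControl    = 3
)

// packTrits stores up to 16 trits (values 1..3), 2 bits each.
// Assumption: level 1 sits in the lowest two bits; 0 means "no level".
func packTrits(trits ...uint8) (TritPath, error) {
	if len(trits) > 16 {
		return 0, errors.New("at most 16 trits fit in a uint32")
	}
	var p TritPath
	for i, t := range trits {
		if t < 1 || t > 3 {
			return 0, errors.New("trit must be 1, 2 or 3")
		}
		p |= TritPath(t) << (2 * uint(i))
	}
	return p, nil
}

// at returns the trit stored at the given zero-based level.
func (p TritPath) at(level int) uint8 { return uint8((p >> (2 * uint(level))) & 3) }

func main() {
	p, err := packTrits(BranchFunc, SubControl) // a control-flow construct in a function
	if err != nil {
		panic(err)
	}
	fmt.Printf("%#x level1=%d level2=%d\n", uint32(p), p.at(0), p.at(1))
}
```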

Data structures

Tree

type Tree struct {
    db         *lattice.Tree
    RecMeta    []MetaEntry
    StringPool []byte
    TokenPool  []uint32
    TokFile    *os.File         // disk-backed token pool
    TokRaw     []byte           // preloaded token pool
    Dict       *Dict
    BigramIdx  map[string][]uint32   // word -> bigram continuations
    SigIdx     map[uint32][]uint32   // (stage<<24)|sighash -> []recIdx
    BulkMode   bool
    BulkMetaStore *BulkMeta
    BulkPoolFile  *os.File
    CostMap    map[uint32]CostEntry
}

The record index is a single unified index: the same value identifies a B-tree record and its metadata entry.

MetaEntry (32 bytes)

Count    uint32      occurrence count (used in upserts)
Kind     NodeKind    construct classification
StageTag uint8       low 4 bits = srcLang, high 4 bits = generation
Extra    [16]byte    packed auxiliary data

Extra layout:

[0:4]   TritPath (uint32 LE)
[4]     low nibble: Rotation (branch transition encoding)
[5:8]   24-bit signature hash (fast rejection in isomorphism lookup)
[8:12]  ContentOffset into TokenPool (uint32 LE)
[12:16] ContentLen in tokens (uint32 LE)
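Decoding the Extra field might look like the following. The offsets come from the layout above; the byte order of the 24-bit signature hash is not stated in the source and is assumed to be little-endian here, and `extraFields`/`decodeExtra` are hypothetical names.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// extraFields is an unpacked view of MetaEntry.Extra (illustrative only).
type extraFields struct {
	TritPath      uint32
	Rotation      uint8
	SigHash       uint32 // 24 significant bits
	ContentOffset uint32
	ContentLen    uint32
}

// decodeExtra unpacks the 16-byte Extra field.
// Assumption: the 24-bit hash in [5:8] is little-endian.
func decodeExtra(e [16]byte) extraFields {
	return extraFields{
		TritPath:      binary.LittleEndian.Uint32(e[0:4]),
		Rotation:      e[4] & 0x0F, // low nibble only
		SigHash:       uint32(e[5]) | uint32(e[6])<<8 | uint32(e[7])<<16,
		ContentOffset: binary.LittleEndian.Uint32(e[8:12]),
		ContentLen:    binary.LittleEndian.Uint32(e[12:16]),
	}
}

func main() {
	var e [16]byte
	e[0] = 0x0E                           // TritPath
	e[4] = 0xA7                           // high nibble ignored, rotation = 7
	e[5], e[6], e[7] = 0x56, 0x34, 0x12   // hash 0x123456
	e[8] = 0x10                           // offset 16
	e[12] = 3                             // 3 tokens
	fmt.Printf("%+v\n", decodeExtra(e))
}
```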

Storage modes

In-memory: standard lattice operations, no persistence.

Disk-backed: .iskr (iskradb file) + .meta (metadata) + .tokpool (token pool). StorageCreate/StorageOpen/StorageFlush/StorageClose.

Bulk ingest: .bulkmeta (LRU cache) + .bulkpool (overflow) for bounded-memory ingestion of 100M+ corpora. EnableBulkStorage/FinalizeBulkStorage.

Pattern extraction

Sentences are decomposed into byte-encoded skeletons:

  • Slots (0x80 | role_id): content positions (128 roles)
  • Markers (0x00-0x7F): structural words (JA particles, EN prepositions, code keywords)
  • Morph markers: definiteness, plural, copula, 3sg - enabling lossless roundtrips
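The slot/marker split follows directly from the high bit of each skeleton byte. A small sketch, with hypothetical helper names and made-up marker and role IDs:

```go
package main

import "fmt"

// slotByte encodes a content slot for the given role (0..127).
func slotByte(role uint8) byte { return 0x80 | (role & 0x7F) }

// isSlot reports whether a skeleton byte is a slot (high bit set)
// rather than a structural marker (0x00-0x7F).
func isSlot(b byte) bool { return b&0x80 != 0 }

// roleOf returns the role ID carried by a slot byte.
func roleOf(b byte) uint8 { return b & 0x7F }

// countParts tallies slots and markers in a skeleton.
func countParts(skeleton []byte) (slots, markers int) {
	for _, b := range skeleton {
		if isSlot(b) {
			slots++
		} else {
			markers++
		}
	}
	return slots, markers
}

func main() {
	// Marker IDs 0x05 and 0x11 and role IDs 0 and 3 are invented for illustration.
	sk := []byte{slotByte(0), 0x05, slotByte(3), 0x11}
	s, m := countParts(sk)
	fmt.Println(s, m, roleOf(sk[2]))
}
```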

ExtractJA, ExtractEN, ExtractKO, ExtractZH produce ExtractResult containing pattern, slots, roles, deep pattern, set representation, and discourse structure.

Set representation

Each sentence decomposes into a set of SetEntry:

Role, Atom, Morph, Class, Mark, OblRole, Head, ModKind

Three-layer nesting: macrorole (subject/object/oblique), thematic role (agent/patient/goal/source/instrument), modification relation (possessive/attributive/copular/adverbial/relative).

Atom-link index

Sidecar index for cross-language translation pairs. AtomIdxEntry stores source/destination atoms with context, role, generation, and weight. Automatically mirrored for bidirectional lookup. Two generations: GenLegacy (bilateral, no context) and GenContexted (context-aware).

Distances

All distances are exact integers. No floating point.

TritPath distance - hops via LCA on the ternary tree.
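As a sketch of the LCA computation: the lowest common ancestor of two paths is their longest shared prefix, so the hop count is the number of levels each path extends beyond it. The example assumes the same packing as elsewhere in this sketch series (level 1 in the lowest two bits, 0 meaning "no level"), which the source does not confirm.

```go
package main

import "fmt"

// tritAt returns the trit at a zero-based level of a packed path.
func tritAt(p uint32, level int) uint32 { return (p >> (2 * uint(level))) & 3 }

// depth counts the levels in use (assumption: 0 terminates the path).
func depth(p uint32) int {
	d := 0
	for d < 16 && tritAt(p, d) != 0 {
		d++
	}
	return d
}

// tritDistance returns the hops from a up to the LCA plus the hops
// down to b. The result is an exact integer.
func tritDistance(a, b uint32) int {
	da, db := depth(a), depth(b)
	common := 0
	for common < da && common < db && tritAt(a, common) == tritAt(b, common) {
		common++
	}
	return (da - common) + (db - common)
}

func main() {
	funcStatement := uint32(2 | 1<<2) // Func / Statement
	funcControl := uint32(2 | 3<<2)   // Func / Control
	typeOnly := uint32(1)             // Type
	fmt.Println(tritDistance(funcStatement, funcControl))
	fmt.Println(tritDistance(funcStatement, typeOnly))
}
```

Siblings under the same branch are two hops apart; paths that diverge at level 1 meet only at the root.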

Tree distance - hops via LCA in the B-tree. Measures storage locality.

Geometric distance - L1 distance over binding signatures (alpha-equivalence).

Walks

WalkStep uses BigramIdx for O(1) continuation lookup. WalkStepCrossDomain translates steps across language domains. BigramWeightRelaxed uses cascading coord relaxation for fallback.

Fingerprinting

12 register-based text metrics for EN, JA, KO, ZH:

tok_mean, slot_mean, slot_dens, morph_dens, clause_mean, mark_div,
cjk_ratio, archaic_per_s, atom_TTR, pat_TTR, zipf_slope

Symbol isomorphism

BindingSignature: geometric structure of declarations and references with names erased. Two functions with identical binding signatures are alpha-equivalent.
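The idea of erasing names while keeping binding structure can be illustrated with a deliberately simplified model. This is not iskra's BindingSignature, whose layout is not described here: the sketch treats a function as a flat sequence of identifier occurrences and replaces each name with the index of its first appearance.

```go
package main

import "fmt"

// nameErased maps each identifier occurrence to the order in which its
// name first appeared. Renaming variables consistently leaves the
// result unchanged. (Simplified stand-in for a binding signature.)
func nameErased(idents []string) []int {
	first := make(map[string]int)
	sig := make([]int, len(idents))
	for i, name := range idents {
		id, seen := first[name]
		if !seen {
			id = len(first)
			first[name] = id
		}
		sig[i] = id
	}
	return sig
}

// sameSignature reports whether two signatures are identical.
func sameSignature(a, b []int) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	f := nameErased([]string{"a", "b", "a"}) // e.g. func(a, b) uses a
	g := nameErased([]string{"x", "y", "x"}) // same shape, renamed
	h := nameErased([]string{"x", "y", "y"}) // uses the second parameter instead
	fmt.Println(sameSignature(f, g), sameSignature(f, h))
}
```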

SignatureHash24 - 24-bit fast-rejection hash in MetaEntry.Extra[5:8].
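Combined with the SigIdx key format shown in the Tree struct, `(stage<<24)|sighash`, fast rejection could work roughly as below. The helper names are hypothetical; only the key layout comes from the struct comment.

```go
package main

import "fmt"

// sigKey builds a SigIdx key: stage in the top byte, 24-bit hash below.
func sigKey(stage uint8, sigHash uint32) uint32 {
	return (uint32(stage) << 24) | (sigHash & 0xFFFFFF)
}

// candidates returns the record indices whose stage and 24-bit hash
// match. A hit is only a candidate: equal hashes do not prove equal
// signatures, so a full comparison must follow.
func candidates(idx map[uint32][]uint32, stage uint8, sigHash uint32) []uint32 {
	return idx[sigKey(stage, sigHash)]
}

func main() {
	idx := map[uint32][]uint32{}
	k := sigKey(2, 0xABCDEF) // stage 2 is an arbitrary example value
	idx[k] = append(idx[k], 7, 42)

	fmt.Println(candidates(idx, 2, 0xABCDEF))      // candidate records
	fmt.Println(len(candidates(idx, 3, 0xABCDEF))) // other stage: rejected at once
}
```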

find-sig looks up functions by signature match. audit verifies self-isomorphism: every function must match its own signature, so a 100% pass rate indicates that signature generation is correct.

Code translation (paused)

Five compiler stages (SRC/AST/IR/ASM/BIN) map code constructs across compilation. iskra compile uses the lattice as a compilation substrate. This work is paused - natural language is the active focus since it subsumes the structural requirements.

Building

MOXIEROOT=../moxie moxie build ./cmd/iskra

CLI

# corpus & storage
iskra load <corpus-dir>                       load mxcorpus output, print stats
iskra save <corpus-dir> <out.mesh>            load corpus, save binary mesh
iskra load-multi <dir1> [dir2...] -o <f>      merge multiple corpora, save mesh
iskra load-mesh <file.mesh>                   load mesh, print stats

# lattice queries
iskra lookup <mesh-or-corpus> <name>          lookup segment, show cross-stage links
iskra forward <mesh-or-corpus> <name>         find SRC entry, follow to IR
iskra reverse <mesh-or-corpus> <name>         walk BIN->ASM->IR->AST->SRC
iskra content <mesh-or-corpus> <name>         show stored content for all stages

# natural language
iskra translate <mesh> <word>                 cross-domain word translation
iskra lemma-test <mesh> <word>                test lemmatization (EN/JA/KO/ZH)
iskra fingerprint <mesh> <text>               compute linguistic fingerprint
iskra rt-ja <mesh> <word>                     JA roundtrip test
iskra rt-cross <mesh> <word>                  cross-language roundtrip
iskra rt-corpus <mesh> <corpus>               corpus roundtrip accuracy
iskra rt-atoms <mesh>                         atom-level roundtrip
iskra build-atomidx <mesh>                    create sidecar atom-link index

# pattern analysis
iskra pat-query <mesh> <pattern>              pattern query
iskra pat-stats <mesh>                        pattern statistics
iskra pat-roundtrip <mesh> <text>             pattern roundtrip test

# code analysis
iskra astgen <source.mx>                      generate AST dumps from source
iskra compile -mesh <f> -o <out.ll> <src.mx>  lattice-based compile
iskra pipeline <corpus-dirs...>               evaluate SRC->AST->IR transforms
iskra find-sig <mesh-or-corpus> <name>        find isomorphic functions
iskra subst <mesh-or-corpus> <src> <tgt>      substitute names in IR
iskra classify-pair <mesh> <name1> <name2>    classify structural relationship

# maintenance
iskra quality <corpus-dir>                    reverse reconstruction coverage
iskra stats <file.mesh>                       lattice statistics
iskra verify <corpus-dir>                     verify lattice integrity
iskra audit <corpus-dirs...>                  isomorphism audit
iskra vocab <mesh-or-corpus>                  dictionary statistics
iskra bench-gen -d <dir> [-c <corpus>]        generate benchmark files
iskra bench-run -d <dir> -m <moxieroot>       run benchmarks
iskra bench-ingest <mesh-or-corpus> <tsv>     ingest benchmark costs
iskra test                                    built-in smoke test

License

Licensed under AGPL-3.0-or-later.
