English-Japanese translation lattice. Bilingual lexicon stored on iskradb, with morphological conjugation tables, semantic coord disambiguation, cluster-based phrase translation, on-demand inflection, and progressive semantic propagation.
Module: git.smesh.lol/transdb. Requires: git.smesh.lol/iskradb, git.mleku.dev/iskra.
Built in Moxie.
Four active branches of the iskradb lattice:
| Branch | Alias | Contents |
|---|---|---|
| Bgrammatical (1) | Bnoun | Nouns, adjectives, nominal forms |
| Bmorphology (3) | Bverb | Verbs: dict forms at coord=0, conjugated forms at morph coord |
| Bpragmatic (4) | Bmodifier | Particles, function words, modifiers |
| Bcooccur (2) | - | Metadata: lang descriptors, particle roles, verb class registrations |
Cross-links (Record.Link): JA record Link[0] points to the primary EN translation record; Link[1] to a secondary. EN verb records point back to the JA anchor they were generated from.
Key = SipHash128(defaultSeed, [lang(1 byte) || coord(8 bytes LE) || word(N bytes)])
lang: LangEN=0x01, LangJA=0x02. coord: 64-bit packed bitfield. word: UTF-8 surface form.
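The byte layout that feeds the hash can be sketched in Go as follows. This only builds the preimage `lang || coord || word`; the SipHash128 step and `defaultSeed` are left out because their implementation is not shown here. The function name `keyPreimage` is hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Language tags as given in the text.
const (
	LangEN byte = 0x01
	LangJA byte = 0x02
)

// keyPreimage lays out lang(1 byte) || coord(8 bytes LE) || word(N bytes).
// Hashing with SipHash128 and the default seed is deliberately omitted.
func keyPreimage(lang byte, coord uint64, word string) []byte {
	buf := make([]byte, 9, 9+len(word))
	buf[0] = lang
	binary.LittleEndian.PutUint64(buf[1:9], coord)
	return append(buf, word...)
}

func main() {
	p := keyPreimage(LangJA, 0, "食べる")
	fmt.Printf("%d bytes: % x\n", len(p), p)
}
```

Because lang and coord are part of the preimage, the same surface form hashes to a different key for every language and every context coord.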
bits 63-48 semantic (16 bits): 8 subject|object category pairs, 2 bits each
bits 47-32 reserved
bits 31-29 grammatical (3 bits): syntactic role
bits 28-25 cooccur (4 bits): prev_type(2) + next_type(2)
bits 24-20 morphstate (5 bits): MorphState
bits 19-18 pragmatic (2 bits): domain context
bits 17-16 valency (2 bits): argument count
bits 15-2 reserved (14 bits): available for Slavic case/number
bits 1-0 register (2 bits): social register
coord=0 is the base key (dictionary form, context-free lookups).
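A minimal Go sketch of packing and unpacking the coord fields, using the bit positions listed above. The `field` type and its `Get`/`Set` methods are illustrative helpers, not part of the transdb API.

```go
package main

import "fmt"

// field describes one axis of the coord by its lowest bit and its width.
type field struct{ shift, width uint }

// Bit positions taken from the layout above; reserved ranges are omitted.
var (
	Semantic    = field{48, 16}
	Grammatical = field{29, 3}
	Cooccur     = field{25, 4}
	Morph       = field{20, 5}
	Pragmatic   = field{18, 2}
	Valency     = field{16, 2}
	Register    = field{0, 2}
)

func (f field) mask() uint64 { return (uint64(1)<<f.width - 1) << f.shift }

// Get extracts the field value from coord c.
func (f field) Get(c uint64) uint64 { return (c & f.mask()) >> f.shift }

// Set returns c with the field replaced by v (v is truncated to the field width).
func (f field) Set(c, v uint64) uint64 { return c&^f.mask() | (v<<f.shift)&f.mask() }

func main() {
	var c uint64 // coord=0 is the base key
	c = Morph.Set(c, 20)
	c = Grammatical.Set(c, 5)
	c = Register.Set(c, 3)
	fmt.Printf("coord=%#016x morph=%d gram=%d reg=%d\n",
		c, Morph.Get(c), Grammatical.Get(c), Register.Get(c))
}
```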
MorphState bit flags (5 bits):
bit 4 (earth): tense 0=present 1=past
bit 3 (wood): aspect 0=simple 1=progressive
bit 2 (metal): polarity 0=affirm 1=negative
bit 1 (water): formality 0=plain 1=polite
bit 0 (fire): evidential 0=direct 1=reported
States 0-28 cover all JA verb forms. EN maps tense/aspect/polarity only (formality has no EN surface form).
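The five flags compose by OR. The Go sketch below names each bit, prints a state in readable form, and projects a state onto the axes EN expresses (tense, aspect, polarity). The constant names and the `ForEN` helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

type MorphState uint8

const (
	Reported    MorphState = 1 << iota // bit 0 (fire): evidential
	Polite                             // bit 1 (water): formality
	Negative                           // bit 2 (metal): polarity
	Progressive                        // bit 3 (wood): aspect
	Past                               // bit 4 (earth): tense
)

// enMask keeps only the axes that EN maps: tense, aspect, polarity.
const enMask = Past | Progressive | Negative

// ForEN drops the bits that have no EN surface form.
func (s MorphState) ForEN() MorphState { return s & enMask }

func (s MorphState) String() string {
	pick := func(bit MorphState, off, on string) string {
		if s&bit != 0 {
			return on
		}
		return off
	}
	return strings.Join([]string{
		pick(Past, "present", "past"),
		pick(Progressive, "simple", "progressive"),
		pick(Negative, "affirm", "negative"),
		pick(Polite, "plain", "polite"),
		pick(Reported, "direct", "reported"),
	}, " ")
}

func main() {
	s := Past | Negative | Polite
	fmt.Printf("%d = %s\n", uint8(s), s)
	fmt.Printf("EN projection: %d = %s\n", uint8(s.ForEN()), s.ForEN())
}
```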
Sixteen flags, 2 bits per ontological category (subject bit + object bit):
Human, Animate, Abstract, Place, Artifact, Natural, Event, Collective
Stored in Record.DataFile bits 6-21 for O(1) retrieval at coord=0.
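A hedged Go sketch of reading and writing the sixteen flags in bits 6-21 of a DataFile word. Several details are assumptions: the `uint32` type for DataFile, the category order (Human in the top pair, Collective in the bottom pair), and the subject bit being the higher bit of each pair.

```go
package main

import "fmt"

// Category indices in the order listed above.
const (
	Human uint = iota
	Animate
	Abstract
	Place
	Artifact
	Natural
	Event
	Collective
)

// Assumed bit assignment: pairs run from the MSB, subject bit above object bit.
func subjBit(cat uint) uint16 { return uint16(1) << (15 - 2*cat) }
func objBit(cat uint) uint16  { return uint16(1) << (14 - 2*cat) }

const (
	semShift = 6
	semMask  = uint32(0xFFFF) << semShift // DataFile bits 6-21
)

// semFromDataFile extracts the 16 semantic flags in one shift and mask.
func semFromDataFile(df uint32) uint16 { return uint16((df & semMask) >> semShift) }

// withSem returns df with the semantic flags replaced, other bits untouched.
func withSem(df uint32, sem uint16) uint32 { return df&^semMask | uint32(sem)<<semShift }

func main() {
	df := withSem(0x3F, subjBit(Human)|objBit(Place))
	fmt.Printf("DataFile=%#x semantic=%#04x\n", df, semFromDataFile(df))
}
```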
Branch byte layout:
bits 0-2 POS branch (3 bits)
bits 3-4 register (RegNeutral/Formal/Informal/Vulgar)
bits 5-6 domain (DomGeneral/Technical/Medical/Legal)
bit 7 honorific
For Bcooccur metadata records, the Branch byte is repurposed: lang descriptor bits, particle role codes (0-11), or verb class codes (0-15).
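Assuming the four fields listed above describe the Branch byte of ordinary (non-metadata) records, packing and unpacking can be sketched in Go like this. The numeric values of the register and domain constants follow the order in which they are listed and are an assumption.

```go
package main

import "fmt"

// Assumed numbering, following the order given in the text.
const (
	RegNeutral uint8 = iota
	RegFormal
	RegInformal
	RegVulgar
)

const (
	DomGeneral uint8 = iota
	DomTechnical
	DomMedical
	DomLegal
)

// packBranch builds the byte: bits 0-2 POS, 3-4 register, 5-6 domain, 7 honorific.
func packBranch(pos, register, domain uint8, honorific bool) byte {
	b := pos&0x07 | (register&0x03)<<3 | (domain&0x03)<<5
	if honorific {
		b |= 0x80
	}
	return b
}

// unpackBranch is the inverse of packBranch.
func unpackBranch(b byte) (pos, register, domain uint8, honorific bool) {
	return b & 0x07, (b >> 3) & 0x03, (b >> 5) & 0x03, b&0x80 != 0
}

func main() {
	b := packBranch(3, RegFormal, DomMedical, true) // 3 = Bmorphology
	pos, reg, dom, hon := unpackBranch(b)
	fmt.Printf("byte=%#02x pos=%d reg=%d dom=%d honorific=%v\n", b, pos, reg, dom, hon)
}
```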
RelaxCoord(coord) []uint64 returns a cascade of coords from the most specific down to coord=0. Fields are stripped in this order: pragmatic, register, valency, semantic bits (MSB to LSB), grammatical, cooccur, morphstate. All lookup functions walk this cascade and return the translation stored at the most specific matching context.
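One way such a cascade could be produced is sketched below in Go. This is not the transdb implementation: it assumes each field is cleared as a whole (semantic bits one at a time from bit 63 down to bit 48), that steps which change nothing are skipped, and that coord=0 is always the last entry.

```go
package main

import "fmt"

// Field masks from the coord layout.
const (
	maskPragmatic   uint64 = 0x3 << 18
	maskRegister    uint64 = 0x3
	maskValency     uint64 = 0x3 << 16
	maskGrammatical uint64 = 0x7 << 29
	maskCooccur     uint64 = 0xF << 25
	maskMorph       uint64 = 0x1F << 20
)

// relaxCoord returns coords from most specific to 0, without repeats.
func relaxCoord(coord uint64) []uint64 {
	out := []uint64{coord}
	c := coord
	strip := func(mask uint64) {
		c &^= mask
		if c != out[len(out)-1] {
			out = append(out, c)
		}
	}
	strip(maskPragmatic)
	strip(maskRegister)
	strip(maskValency)
	for bit := uint(63); bit >= 48; bit-- { // semantic bits, MSB first
		strip(uint64(1) << bit)
	}
	strip(maskGrammatical)
	strip(maskCooccur)
	strip(maskMorph)
	strip(^uint64(0)) // guarantee the cascade ends at coord=0
	return out
}

func main() {
	c := uint64(20)<<20 | uint64(1)<<18 | 2 // morph=20, pragmatic=1, register=2
	for _, r := range relaxCoord(c) {
		fmt.Printf("%#016x\n", r)
	}
}
```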
Translate(tree, pool, idx, text, srcLang, dstLang, verbose) string
Five-stage pipeline for phrase-level translation:
TokenizeJA/EN
-> ParseClusters phrase segmentation (particle-bounded JA, position-bounded EN)
-> TranslateCluster head/modifier lookup with morph-coord and semantic-flag propagation
-> ReorderClusters SOV<->SVO rearrangement
-> InsertMarkers prepositions (EN), postpositions (JA), copula insertion
ClusterNPSubj (0) topic/grammatical subject
ClusterNPObj (1) direct object
ClusterVP (2) predicate zone
ClusterPP (3) adpositional phrase (locative, dative, source, etc.)
ClusterMod (4) bare modifier
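To illustrate the reordering stage, the Go sketch below rearranges JA-order (SOV) clusters into EN order (SVO) with a stable sort over cluster kinds. The rank table, the `Cluster` struct, and the decision to place bare modifiers first are assumptions for this example; the actual ReorderClusters logic is not shown here.

```go
package main

import (
	"fmt"
	"sort"
)

type ClusterKind int

const (
	ClusterNPSubj ClusterKind = iota // 0
	ClusterNPObj                     // 1
	ClusterVP                        // 2
	ClusterPP                        // 3
	ClusterMod                       // 4
)

// Cluster is a hypothetical container: kind plus already-translated text.
type Cluster struct {
	Kind ClusterKind
	Text string
}

// svoRank is an assumed target order for EN output.
var svoRank = map[ClusterKind]int{
	ClusterMod: 0, ClusterNPSubj: 1, ClusterVP: 2, ClusterNPObj: 3, ClusterPP: 4,
}

// reorderToSVO returns a reordered copy; clusters of equal rank keep their order.
func reorderToSVO(in []Cluster) []Cluster {
	out := append([]Cluster(nil), in...)
	sort.SliceStable(out, func(i, j int) bool {
		return svoRank[out[i].Kind] < svoRank[out[j].Kind]
	})
	return out
}

func main() {
	ja := []Cluster{
		{ClusterNPSubj, "I"}, {ClusterPP, "at school"},
		{ClusterNPObj, "an apple"}, {ClusterVP, "eat"},
	}
	for _, c := range reorderToSVO(ja) {
		fmt.Print(c.Text, " ")
	}
	fmt.Println()
}
```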
LookupWord(tree, pool, word, srcLang) []string
LookupWordCtx(tree, pool, word, srcLang, coord) []string
FuzzyLookupWord(...) // Damerau-Levenshtein fallback on exact miss
LookupWordCtx tries each coord in the RelaxCoord sequence and returns on the first hit. Branch search order is context-aware (derived from the cooccurrence axis).
Verbs are stored once at dictionary form. Verb class code (v1, v5k, v5g, v5s, v5m, v5n, v5b, v5r, v5t, v5u, v5aru, vs, vk) is stored in Bcooccur. Surface forms are computed on demand:
RegisterVerbClass(tree, lang, dictForm, classCode)
GetVerbClass(tree, lang, dictForm) (string, bool)
InflectJA(dictForm, verbClass, state) string
InflectJAFromTree(tree, lang, dictForm, state) string
13 verb classes x 16 morph states = 208 forms generated from tables. No lattice I/O required for inflection.
The inflection table generalizes to Slavic declension: the same pattern extends to noun case tables (7 cases x 2 numbers per declension class).
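A table-driven inflector can be sketched in Go for a single class. The example covers only v1 (ichidan) verbs and the eight states formed from tense, polarity, and formality; progressive and reported states are left out. The table, the function name `inflectV1`, and the MorphState constant names are illustrative and not the transdb tables.

```go
package main

import (
	"fmt"
	"strings"
)

type MorphState uint8

const (
	Reported    MorphState = 1 << iota // bit 0
	Polite                             // bit 1
	Negative                           // bit 2
	Progressive                        // bit 3
	Past                               // bit 4
)

// v1Suffix maps a morph state to the ending that replaces the final る.
var v1Suffix = map[MorphState]string{
	0:                        "る",
	Past:                     "た",
	Negative:                 "ない",
	Past | Negative:          "なかった",
	Polite:                   "ます",
	Polite | Past:            "ました",
	Polite | Negative:        "ません",
	Polite | Past | Negative: "ませんでした",
}

// inflectV1 computes a surface form purely from the table, with no lookups.
// It reports false for non-る input or for states the table does not cover.
func inflectV1(dictForm string, s MorphState) (string, bool) {
	if !strings.HasSuffix(dictForm, "る") {
		return "", false
	}
	suffix, ok := v1Suffix[s]
	if !ok {
		return "", false
	}
	return strings.TrimSuffix(dictForm, "る") + suffix, true
}

func main() {
	for _, s := range []MorphState{0, Past, Negative, Polite | Past | Negative} {
		form, _ := inflectV1("食べる", s)
		fmt.Printf("state %2d -> %s\n", s, form)
	}
}
```

A noun case table for a Slavic declension class would have the same shape: a map from (case, number) to an ending applied to a stem.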
Registered via transdb lang-init. Stored in Bcooccur.
type LangDesc struct {
Order uint8 // OrderSVO, OrderSOV, OrderVSO, ...
HeadFinal bool // false=EN (head-initial), true=JA (head-final)
Particle bool // false=position-bounded parser, true=particle-bounded
PreNomRC bool // false=post-nominal RC (EN), true=pre-nominal RC (JA)
ZeroCopula bool // false=EN (overt), true=JA (zero copula)
Markers uint8 // MarkerPrepositional, Postpositional, Case
}
Particle-role lookup supports semantic disambiguation:
"ni" + no context -> RoleNPDative (default)
"ni" + SemanticPlaceObj -> RolePPLocative
"ni" + SemanticEventObj -> RolePPTemporal
"ni" + SemanticHumanObj -> RoleNPDative
Uses RelaxCoord on the NP's semantic flags first, falls back to coord=0.
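The lookup order described above can be sketched in Go with an in-memory table. The bit positions of the `Semantic*Obj` flags, the numeric role values, and the `relaxSem` helper (a stand-in for RelaxCoord restricted to the semantic bits) are assumptions; only the four "ni" mappings come from the text.

```go
package main

import "fmt"

type Role int

const (
	RoleNPDative Role = iota
	RolePPLocative
	RolePPTemporal
)

type SemFlag uint16

// Hypothetical bit positions for the object-side flags.
const (
	SemanticHumanObj SemFlag = 1 << 14
	SemanticPlaceObj SemFlag = 1 << 8
	SemanticEventObj SemFlag = 1 << 2
)

// roles holds particle -> (semantic context -> role); key 0 is the default.
var roles = map[string]map[SemFlag]Role{
	"ni": {
		0:                RoleNPDative,
		SemanticPlaceObj: RolePPLocative,
		SemanticEventObj: RolePPTemporal,
		SemanticHumanObj: RoleNPDative,
	},
}

// relaxSem yields sem, then sem with set bits cleared MSB first, ending at 0.
func relaxSem(sem SemFlag) []SemFlag {
	out := []SemFlag{sem}
	for bit := uint(16); bit > 0; bit-- {
		if m := SemFlag(1) << (bit - 1); sem&m != 0 {
			sem &^= m
			out = append(out, sem)
		}
	}
	return out
}

// particleRole returns the role at the most specific matching context.
func particleRole(particle string, npSem SemFlag) (Role, bool) {
	table, ok := roles[particle]
	if !ok {
		return 0, false
	}
	for _, ctx := range relaxSem(npSem) {
		if r, hit := table[ctx]; hit {
			return r, true
		}
	}
	return 0, false
}

func main() {
	r, _ := particleRole("ni", SemanticPlaceObj)
	fmt.Println(r == RolePPLocative)
}
```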
Progressive semantic flag diffusion via subject-verb co-occurrence:
transdb propagate -labels <tsv> -ja <corpus>... [-passes N]
Extracts subject-verb pairs from the JA corpus and propagates semantic flags in both directions across the pair graph. The run checkpoints every N updates, can be resumed from a checkpoint, and can be halted with a stop file.
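The idea of bidirectional diffusion over subject-verb pairs can be illustrated with a toy Go sketch. The voting rule (a word adopts a flag once at least `minVotes` distinct neighbours carry it), the synchronous pass structure, and all names are assumptions made for the example. The actual propagate command may use a different update rule, and checkpointing is omitted.

```go
package main

import "fmt"

// Pair is one subject-verb co-occurrence extracted from the corpus.
type Pair struct{ Subj, Verb string }

// propagate spreads 16-bit semantic flags across the pair graph.
func propagate(pairs []Pair, seed map[string]uint16, passes, minVotes int) map[string]uint16 {
	neighbors := map[string]map[string]bool{}
	link := func(a, b string) {
		if neighbors[a] == nil {
			neighbors[a] = map[string]bool{}
		}
		neighbors[a][b] = true
	}
	for _, p := range pairs { // edges go both ways: subject <-> verb
		link(p.Subj, p.Verb)
		link(p.Verb, p.Subj)
	}
	flags := map[string]uint16{}
	for w, f := range seed {
		flags[w] = f
	}
	for pass := 0; pass < passes; pass++ {
		next := map[string]uint16{}
		for w, f := range flags {
			next[w] = f
		}
		changed := false
		for w, ns := range neighbors {
			var votes [16]int
			for n := range ns {
				for bit := uint(0); bit < 16; bit++ {
					if flags[n]&(1<<bit) != 0 {
						votes[bit]++
					}
				}
			}
			f := flags[w]
			for bit, v := range votes {
				if v >= minVotes {
					f |= 1 << uint(bit)
				}
			}
			if f != flags[w] {
				changed = true
			}
			next[w] = f
		}
		flags = next // synchronous update: each pass reads the previous pass only
		if !changed {
			break
		}
	}
	return flags
}

func main() {
	const animateSubj = 1 << 13 // hypothetical flag position
	pairs := []Pair{{"inu", "hashiru"}, {"neko", "hashiru"}, {"inu", "taberu"}}
	out := propagate(pairs, map[string]uint16{"inu": animateSubj, "neko": animateSubj}, 3, 2)
	fmt.Printf("hashiru=%#04x taberu=%#04x\n", out["hashiru"], out["taberu"])
}
```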
TokenizeEN(text string) []string
TokenizeJA(text string, tree, verbose) []string
TokenizeJA: forward maximum-match against the JA lattice. Branch search order adapts to the preceding token's POS. Progressive compounds are collapsed to single tokens via morph-coord lookup and verbStem fallback.
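Forward maximum-match on its own can be sketched in Go as below. A plain set stands in for the JA lattice, and the POS-adaptive branch order and compound collapsing described above are not modelled. `tokenizeFMM` and `maxLen` are hypothetical names.

```go
package main

import "fmt"

// tokenizeFMM scans left to right and, at each position, takes the longest
// lexicon entry of at most maxLen runes. Unknown runes become one-rune tokens.
func tokenizeFMM(text string, lexicon map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var out []string
	for i := 0; i < len(runes); {
		end := i + maxLen
		if end > len(runes) {
			end = len(runes)
		}
		j := end
		for ; j > i+1; j-- { // shrink the window until a match or a single rune
			if lexicon[string(runes[i:j])] {
				break
			}
		}
		out = append(out, string(runes[i:j]))
		i = j
	}
	return out
}

func main() {
	lexicon := map[string]bool{
		"私": true, "は": true, "りんご": true, "を": true, "食べ": true, "食べる": true,
	}
	fmt.Println(tokenizeFMM("私はりんごを食べる", lexicon, 3))
}
```

The longest candidate wins, so 食べる is preferred over the shorter entry 食べ.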
Two detection modes:
Fast (detect.mx): 5% Japanese script threshold (hiragana/katakana/CJK).
Trigram models (langdetect/): Character trigram profiles trained from a corpus. 300 trigrams per model, cosine similarity scoring, 0.90 confidence threshold. Language-agnostic: supports EN, JA, KO, ZH, and any other language with a training corpus.
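Both modes can be sketched in a few lines of Go. The script check assumes the 5% threshold is measured over non-space runes, and the trigram part shows only raw counts and cosine similarity; truncation to the top 300 trigrams, model storage, and the 0.90 threshold are not modelled. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"unicode"
)

// japaneseRatio is the share of non-space runes in hiragana, katakana, or Han.
func japaneseRatio(text string) float64 {
	var ja, total int
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if unicode.In(r, unicode.Hiragana, unicode.Katakana, unicode.Han) {
			ja++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(ja) / float64(total)
}

// looksJapanese applies the 5% threshold of the fast mode.
func looksJapanese(text string) bool { return japaneseRatio(text) >= 0.05 }

// trigrams counts overlapping three-rune windows.
func trigrams(text string) map[string]float64 {
	runes := []rune(text)
	m := map[string]float64{}
	for i := 0; i+3 <= len(runes); i++ {
		m[string(runes[i:i+3])]++
	}
	return m
}

// cosine returns the cosine similarity of two trigram profiles.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, v := range a {
		dot += v * b[k]
		na += v * v
	}
	for _, v := range b {
		nb += v * v
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(looksJapanese("これはペンです"), looksJapanese("this is a pen"))
	fmt.Printf("%.2f\n", cosine(trigrams("the cat sat"), trigrams("the cat sat on the mat")))
}
```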
Length-bucketed word index with Damerau-Levenshtein distance. DualIndex holds both language indices for bidirectional fuzzy lookup.
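A compact Go sketch of a length-bucketed index with a Damerau-Levenshtein search follows. It uses the restricted variant (optimal string alignment, where each adjacent transposition counts as one edit), which is an assumption; whether transdb uses the restricted or unrestricted variant is not stated. `Index`, `Add`, and `Nearest` are hypothetical names.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// osaDistance is the optimal-string-alignment form of Damerau-Levenshtein.
func osaDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := range d[0] {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			d[i][j] = min3(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if t := d[i-2][j-2] + 1; t < d[i][j] {
					d[i][j] = t // adjacent transposition
				}
			}
		}
	}
	return d[len(ra)][len(rb)]
}

// Index buckets words by rune length so a search only visits nearby lengths.
type Index map[int][]string

func (ix Index) Add(w string) { n := len([]rune(w)); ix[n] = append(ix[n], w) }

// Nearest returns the closest word within maxDist, if any.
func (ix Index) Nearest(q string, maxDist int) (string, bool) {
	n, best, bestD := len([]rune(q)), "", maxDist+1
	for l := n - maxDist; l <= n+maxDist; l++ {
		for _, w := range ix[l] {
			if dist := osaDistance(q, w); dist < bestD {
				best, bestD = w, dist
			}
		}
	}
	return best, bestD <= maxDist
}

func main() {
	ix := Index{}
	for _, w := range []string{"the", "then", "cat"} {
		ix.Add(w)
	}
	fmt.Println(ix.Nearest("teh", 1))
}
```

A word whose length differs from the query by more than `maxDist` cannot be within `maxDist` edits, which is why only neighbouring buckets are scanned.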
transdb load -jmdict <path> [-kanjidic <path>] -o <dir>
For each JMdict entry: insert the base form, insert reading aliases, insert EN glosses with Link[0], generate conjugations (16 morph states), register the verb class, and generate synthetic EN forms (past/progressive/3sg/negative, covering 70 irregular verbs plus regular patterns).
transdb extend -en <file> -ja <file> [-db <dir>]
Extracts bilingual pairs from aligned corpora via PMI-weighted co-occurrence.
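Pointwise mutual information for a candidate pair is log2(P(en, ja) / (P(en) P(ja))), with probabilities estimated from sentence-pair counts. The Go sketch below scores every EN/JA word pair that co-occurs in an aligned sentence pair. The `Aligned` type, per-sentence deduplication, and the absence of any frequency cut-off are assumptions; the actual extend command may weight or filter differently.

```go
package main

import (
	"fmt"
	"math"
)

// Aligned is one tokenized sentence pair from the parallel corpora.
type Aligned struct{ EN, JA []string }

// pmi computes log2( joint*total / (countX*countY) ) from raw counts.
func pmi(joint, countX, countY, total int) float64 {
	if joint == 0 || countX == 0 || countY == 0 || total == 0 {
		return math.Inf(-1)
	}
	return math.Log2(float64(joint) * float64(total) / (float64(countX) * float64(countY)))
}

// uniq removes repeated words so each sentence counts a word once.
func uniq(ws []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, w := range ws {
		if !seen[w] {
			seen[w] = true
			out = append(out, w)
		}
	}
	return out
}

// scorePairs returns PMI for every EN/JA pair seen together at least once.
func scorePairs(corpus []Aligned) map[[2]string]float64 {
	en, ja, joint := map[string]int{}, map[string]int{}, map[[2]string]int{}
	for _, s := range corpus {
		es, js := uniq(s.EN), uniq(s.JA)
		for _, e := range es {
			en[e]++
		}
		for _, j := range js {
			ja[j]++
		}
		for _, e := range es {
			for _, j := range js {
				joint[[2]string{e, j}]++
			}
		}
	}
	out := make(map[[2]string]float64, len(joint))
	for k, c := range joint {
		out[k] = pmi(c, en[k[0]], ja[k[1]], len(corpus))
	}
	return out
}

func main() {
	corpus := []Aligned{
		{[]string{"dog"}, []string{"犬"}}, {[]string{"dog"}, []string{"犬"}},
		{[]string{"cat"}, []string{"猫"}}, {[]string{"cat"}, []string{"猫"}},
	}
	fmt.Printf("%.2f\n", scorePairs(corpus)[[2]string{"dog", "犬"}])
}
```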
transdb consolidate [-db <dir>] [-o <dir>]
Rebuilds the lattice and drops redundant morph-coord records whose forms the inflection engine can reconstruct on demand.
# database creation
transdb load -jmdict <path> [-kanjidic <path>] -o <dir>
transdb lang-init -lang <en|ja> [-db <dir>]
transdb consolidate [-db <dir>] [-o <dir>]
# translation
transdb translate -src <en|ja> -dst <en|ja> [-cluster] [-fuzzy] [-v] <text>
transdb detect <text>
# testing
transdb roundtrip <en-word> [-db <dir>]
transdb roundtrip-ja <ja-word> [-db <dir>]
transdb roundtrip-test [-out <tsv>] [-limit N] [-db <dir>]
# semantic labeling
transdb semantic-label -labels <tsv> [-db <dir>]
transdb propagate -labels <tsv> -ja <file>... [-passes N] [-db <dir>]
transdb apply-checkpoint [-checkpoint <file>] [-db <dir>]
# extension & reranking
transdb extend -en <file> -ja <file> [-db <dir>]
transdb rerank -jmdict <path> [-db <dir>]
transdb apply-overrides -overrides <tsv> [-db <dir>]
# analysis
transdb stats [-db <dir>]
transdb morph-stats [-db <dir>]
transdb branch-count [-db <dir>]
transdb debug <word> [en] [-db <dir>]
transdb wordlist -lang <en|ja> -o <file> [-db <dir>]
transdb posseq [-en <file>] [-ja <file>] [-n N] [-db <dir>]
# utilities
transdb tatoeba-join
transdb langdetect-train -lang <code> <corpus>
Licensed under AGPL-3.0-or-later.