# transdb

English-Japanese translation lattice. Bilingual lexicon stored on [iskradb](https://git.smesh.lol/iskradb), with morphological conjugation tables, semantic coord disambiguation, cluster-based phrase translation, on-demand inflection, and progressive semantic propagation.

Module: `git.smesh.lol/transdb`. Requires: `git.smesh.lol/iskradb`, `git.mleku.dev/iskra`. Built in [Moxie](https://git.smesh.lol/moxie).

---

## Architecture

### Lattice layout

Four active branches of the iskradb lattice:

| Branch | Alias | Contents |
|--------|-------|----------|
| `Bgrammatical` (1) | `Bnoun` | Nouns, adjectives, nominal forms |
| `Bmorphology` (3) | `Bverb` | Verbs: dict forms at coord=0, conjugated forms at morph coord |
| `Bpragmatic` (4) | `Bmodifier` | Particles, function words, modifiers |
| `Bcooccur` (2) | - | Metadata: lang descriptors, particle roles, verb class registrations |

Cross-links (`Record.Link`): JA record `Link[0]` points to the primary EN translation record; `Link[1]` to a secondary. EN verb records point back to the JA anchor they were generated from.

### Key derivation

```
Key = SipHash128(defaultSeed, [lang(1 byte) || coord(8 bytes LE) || word(N bytes)])
```

`lang`: `LangEN=0x01`, `LangJA=0x02`. `coord`: 64-bit packed bitfield. `word`: UTF-8 surface form.

### Coord layout (64-bit)

```
bits 63-48  semantic     (16 bits): 8 subject|object category pairs, 2 bits each
bits 47-32  reserved
bits 31-29  grammatical  (3 bits):  syntactic role
bits 28-25  cooccur      (4 bits):  prev_type(2) + next_type(2)
bits 24-20  morphstate   (5 bits):  MorphState
bits 19-18  pragmatic    (2 bits):  domain context
bits 17-16  valency      (2 bits):  argument count
bits 15-2   reserved     (14 bits): available for Slavic case/number
bits 1-0    register     (2 bits):  social register
```

`coord=0` is the base key (dictionary form, context-free lookups).

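
The coord layout and key preimage described above can be sketched in Go as follows. This is only an illustration under stated assumptions: `CoordFields`, `PackCoord`, and `KeyPreimage` are hypothetical names, not transdb API, and the SipHash128 step is omitted because the seed and hash implementation are not shown in this README. Only the bit positions, the `lang` constants, and the `lang || coord (LE) || word` ordering come from the text.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Language tags as given in the key derivation section.
const (
	LangEN byte = 0x01
	LangJA byte = 0x02
)

// Field offsets taken from the coord layout table.
const (
	semanticShift    = 48 // bits 63-48
	grammaticalShift = 29 // bits 31-29
	cooccurShift     = 25 // bits 28-25
	morphShift       = 20 // bits 24-20
	pragmaticShift   = 18 // bits 19-18
	valencyShift     = 16 // bits 17-16
	registerShift    = 0  // bits 1-0
)

// CoordFields is a hypothetical unpacked view of a coord (not a transdb type).
type CoordFields struct {
	Semantic    uint16
	Grammatical uint8
	Cooccur     uint8
	MorphState  uint8
	Pragmatic   uint8
	Valency     uint8
	Register    uint8
}

// PackCoord masks each field to its documented width and shifts it into place.
// Reserved bits (47-32 and 15-2) are left at zero.
func PackCoord(f CoordFields) uint64 {
	return uint64(f.Semantic)<<semanticShift |
		uint64(f.Grammatical&0x7)<<grammaticalShift |
		uint64(f.Cooccur&0xF)<<cooccurShift |
		uint64(f.MorphState&0x1F)<<morphShift |
		uint64(f.Pragmatic&0x3)<<pragmaticShift |
		uint64(f.Valency&0x3)<<valencyShift |
		uint64(f.Register&0x3)<<registerShift
}

// KeyPreimage builds lang(1 byte) || coord(8 bytes LE) || word(UTF-8 bytes),
// i.e. the input that would be fed to SipHash128 (hashing itself not shown).
func KeyPreimage(lang byte, coord uint64, word string) []byte {
	buf := make([]byte, 9, 9+len(word))
	buf[0] = lang
	binary.LittleEndian.PutUint64(buf[1:9], coord)
	return append(buf, word...)
}

func main() {
	// Illustrative values: tense bit set in MorphState, register value 1.
	c := PackCoord(CoordFields{MorphState: 1 << 4, Register: 1})
	fmt.Printf("coord    = %#016x\n", c)
	fmt.Printf("preimage = %x\n", KeyPreimage(LangJA, c, "食べた"))
}
```

Passing a zero-valued `CoordFields` yields `coord=0`, which corresponds to the base key for dictionary-form lookups.
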
### MorphState (5-bit, Record.DataFile bits 1-5)

```
bit 4 (earth): tense       0=present  1=past
bit 3 (wood):  aspect      0=simple   1=progressive
bit 2 (metal): polarity    0=affirm   1=negative
bit 1 (water): formality   0=plain    1=polite
bit 0 (fire):  evidential  0=direct   1=reported
```

States 0-28 cover all JA verb forms. EN maps tense/aspect/polarity only (formality has no EN surface form).

### Semantic bitfield (bits 63-48)

Sixteen flags, 2 bits per ontological category (subject bit + object bit):

```
Human, Animate, Abstract, Place, Artifact, Natural, Event, Collective
```

Stored in `Record.DataFile` bits 6-21 for O(1) retrieval at coord=0.

### Record.Branch byte

```
bits 0-2  POS branch (3 bits)
bits 3-4  register (RegNeutral/Formal/Informal/Vulgar)
bits 5-6  domain (DomGeneral/Technical/Medical/Legal)
bit 7     honorific
```

For Bcooccur metadata records, the Branch byte is repurposed: lang descriptor bits, particle role codes (0-11), or verb class codes (0-15).

---

## Coord relaxation

`RelaxCoord(coord) []uint64` returns a cascade from most-specific to coord=0. Stripping order: pragmatic, register, valency, semantic bits MSB-LSB, grammatical, cooccur, morphstate. All lookup functions use this to find the best available translation at the most specific matching context.


---

## Translation pipelines

### Token-by-token (Translate)

```moxie
Translate(tree, pool, idx, text, srcLang, dstLang, verbose) string
```

- JA-EN: two-zone SOV-SVO reordering (subject zone / predicate zone split at particle)
- EN-JA: operator accumulation (did/not/apparently) applied to next verb via morphstate

### Cluster-based (TranslateWithClusters)

Five-stage pipeline for phrase-level translation:

```
TokenizeJA/EN
  -> ParseClusters      phrase segmentation (particle-bounded JA, position-bounded EN)
  -> TranslateCluster   head/modifier lookup with morph-coord and semantic-flag propagation
  -> ReorderClusters    SOV<->SVO rearrangement
  -> InsertMarkers      prepositions (EN), postpositions (JA), copula insertion
```

### Cluster types

```
ClusterNPSubj (0)  topic/grammatical subject
ClusterNPObj  (1)  direct object
ClusterVP     (2)  predicate zone
ClusterPP     (3)  adpositional phrase (locative, dative, source, etc.)
ClusterMod    (4)  bare modifier
```

### Lookup with fallback

```moxie
LookupWord(tree, pool, word, srcLang) []string
LookupWordCtx(tree, pool, word, srcLang, coord) []string
FuzzyLookupWord(...)  // Damerau-Levenshtein fallback on exact miss
```

`LookupWordCtx` tries each coord in `RelaxCoord` sequence, returning on first hit. Branch search order is context-aware (derived from cooccurrence axis).

---

## Inflection engine

Verbs are stored once at dictionary form. Verb class code (v1, v5k, v5g, v5s, v5m, v5n, v5b, v5r, v5t, v5u, v5aru, vs, vk) is stored in Bcooccur. Surface forms are computed on demand:

```moxie
RegisterVerbClass(tree, lang, dictForm, classCode)
GetVerbClass(tree, lang, dictForm) (string, bool)
InflectJA(dictForm, verbClass, state) string
InflectJAFromTree(tree, lang, dictForm, state) string
```

13 verb classes x 16 morph states = 208 forms generated from tables. No lattice I/O required for inflection.

The inflect table generalizes to Slavic declension: same pattern extends to noun case tables (7 cases x 2 numbers per declension class).

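
To make the table-driven idea concrete, here is a small Go sketch of on-demand inflection. It is not the transdb tables or the real `InflectJA`. The names `classRow`, `classTable`, and `inflectSketch` are hypothetical, only two of the thirteen classes (`v1`, `v5k`) are filled in, and progressive and reported states are deliberately rejected. The bit values follow the MorphState layout described in the Architecture section.

```go
package main

import (
	"fmt"
	"strings"
)

// MorphState bits, following the documented 5-bit layout.
const (
	Evidential uint8 = 1 << 0
	Formality  uint8 = 1 << 1
	Polarity   uint8 = 1 << 2
	Aspect     uint8 = 1 << 3
	Tense      uint8 = 1 << 4
)

// classRow is a hypothetical table row for one verb class.
type classRow struct {
	ending  string // dictionary-form ending that is dropped
	negStem string // kana inserted before ない
	polStem string // kana inserted before ます
	past    string // plain past suffix
}

// Two illustrative classes only. Irregular members such as 行く (past 行った)
// would need an exception entry that this sketch does not have.
var classTable = map[string]classRow{
	"v1":  {ending: "る", negStem: "", polStem: "", past: "た"},
	"v5k": {ending: "く", negStem: "か", polStem: "き", past: "いた"},
}

// inflectSketch computes a surface form purely from the table: no database access.
func inflectSketch(dictForm, verbClass string, state uint8) (string, bool) {
	row, ok := classTable[verbClass]
	if !ok || !strings.HasSuffix(dictForm, row.ending) {
		return "", false
	}
	if state&(Aspect|Evidential) != 0 {
		return "", false // progressive and reported forms are left out of this sketch
	}
	stem := strings.TrimSuffix(dictForm, row.ending)
	past := state&Tense != 0
	neg := state&Polarity != 0
	if state&Formality != 0 {
		base := stem + row.polStem
		switch {
		case past && neg:
			return base + "ませんでした", true
		case neg:
			return base + "ません", true
		case past:
			return base + "ました", true
		default:
			return base + "ます", true
		}
	}
	switch {
	case past && neg:
		return stem + row.negStem + "なかった", true
	case neg:
		return stem + row.negStem + "ない", true
	case past:
		return stem + row.past, true
	default:
		return dictForm, true
	}
}

func main() {
	for _, st := range []uint8{0, Tense, Polarity, Formality, Tense | Formality} {
		form, _ := inflectSketch("書く", "v5k", st)
		fmt.Printf("state=%05b %s\n", st, form)
	}
}
```

Because each class is a single row, adding a class means adding data rather than code, which is the property that would let the same pattern carry over to declension tables.
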

---

## Lang descriptor system

Registered via `transdb lang-init`. Stored in Bcooccur.

```moxie
type LangDesc struct {
    Order      uint8 // OrderSVO, OrderSOV, OrderVSO, ...
    HeadFinal  bool  // false=EN (head-initial), true=JA (head-final)
    Particle   bool  // false=position-bounded parser, true=particle-bounded
    PreNomRC   bool  // false=post-nominal RC (EN), true=pre-nominal RC (JA)
    ZeroCopula bool  // false=EN (overt), true=JA (zero copula)
    Markers    uint8 // MarkerPrepositional, Postpositional, Case
}
```

### Particle role disambiguation

Particle-role lookup supports semantic disambiguation:

```
"ni" + no context        -> RoleNPDative (default)
"ni" + SemanticPlaceObj  -> RolePPLocative
"ni" + SemanticEventObj  -> RolePPTemporal
"ni" + SemanticHumanObj  -> RoleNPDative
```

Uses `RelaxCoord` on the NP's semantic flags first, falls back to coord=0.

---

## Semantic propagation

Progressive semantic flag diffusion via subject-verb co-occurrence:

```
transdb propagate -labels -ja ... [-passes N]
```

Extracts subject-verb pairs from JA corpus and propagates semantic flags bidirectionally across the pair graph. Checkpoint every N updates, resume-friendly, stop-file controlled.

---

## Tokenizers

```moxie
TokenizeEN(text string) []string
TokenizeJA(text string, tree, verbose) []string
```

`TokenizeJA`: forward maximum-match against the JA lattice. Branch search order adapts to the preceding token's POS. Progressive compounds are collapsed to single tokens via morph-coord lookup and verbStem fallback.

---

## Language detection

Two detection modes:

**Fast (detect.mx):** 5% Japanese script threshold (hiragana/katakana/CJK).

**Trigram models (langdetect/):** Character trigram profiles trained from corpus. 300 trigrams per model, cosine similarity scoring, 0.90 confidence threshold. Language-agnostic - supports EN, JA, KO, ZH, and any language with a training corpus.

---

## Fuzzy matching

Length-bucketed word index with Damerau-Levenshtein distance.
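
As a rough illustration of length-bucketed fuzzy lookup, the Go sketch below groups words by rune length and scans only buckets within the distance budget. It is an assumption-laden sketch, not the transdb index: `bucketIndex`, `add`, `lookup`, and `osaDistance` are hypothetical names, and the distance used is the optimal string alignment (restricted) variant of Damerau-Levenshtein, which may differ from what transdb implements.

```go
package main

import "fmt"

// osaDistance computes the optimal string alignment distance over runes:
// insertions, deletions, substitutions, and adjacent transpositions cost 1 each.
func osaDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := d[i-1][j] + 1 // deletion
			if v := d[i][j-1] + 1; v < best {
				best = v // insertion
			}
			if v := d[i-1][j-1] + cost; v < best {
				best = v // substitution or match
			}
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if v := d[i-2][j-2] + 1; v < best {
					best = v // adjacent transposition
				}
			}
			d[i][j] = best
		}
	}
	return d[len(ra)][len(rb)]
}

// bucketIndex maps rune length to the words of that length.
type bucketIndex map[int][]string

func (ix bucketIndex) add(word string) {
	n := len([]rune(word))
	ix[n] = append(ix[n], word)
}

// lookup scans only buckets whose length differs from the query by at most
// maxDist, since a larger length gap already exceeds the edit budget.
func (ix bucketIndex) lookup(query string, maxDist int) []string {
	n := len([]rune(query))
	var hits []string
	for l := n - maxDist; l <= n+maxDist; l++ {
		for _, w := range ix[l] {
			if osaDistance(query, w) <= maxDist {
				hits = append(hits, w)
			}
		}
	}
	return hits
}

func main() {
	ix := bucketIndex{}
	for _, w := range []string{"translate", "transport", "cat"} {
		ix.add(w)
	}
	fmt.Println(ix.lookup("tranlsate", 1))
}
```

Measuring length in runes rather than bytes keeps the buckets meaningful for kana and kanji, where each character occupies several UTF-8 bytes.
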
`DualIndex` holds both language indices for bidirectional fuzzy lookup.

---

## Ingest pipeline

```
transdb load -jmdict [-kanjidic ] -o
```

For each JMdict entry: insert base form, insert reading aliases, insert EN glosses with Link[0], generate conjugations (16 morph states), register verb class, generate synthetic EN forms (past/progressive/3sg/negative for 70 irregular verbs + regular patterns).

```
transdb extend -en -ja [-db ]
```

Extracts bilingual pairs from aligned corpora via PMI-weighted co-occurrence.

```
transdb consolidate [-db ] [-o ]
```

Rebuilds lattice, drops redundant morph-coord records whose forms can be reconstructed on-demand via inflection engine.

---

## Commands

```
# database creation
transdb load -jmdict [-kanjidic ] -o
transdb lang-init -lang [-db ]
transdb consolidate [-db ] [-o ]

# translation
transdb translate -src -dst [-cluster] [-fuzzy] [-v]
transdb detect

# testing
transdb roundtrip [-db ]
transdb roundtrip-ja [-db ]
transdb roundtrip-test [-out ] [-limit N] [-db ]

# semantic labeling
transdb semantic-label -labels [-db ]
transdb propagate -labels -ja ... [-passes N] [-db ]
transdb apply-checkpoint [-checkpoint ] [-db ]

# extension & reranking
transdb extend -en -ja [-db ]
transdb rerank -jmdict [-db ]
transdb apply-overrides -overrides [-db ]

# analysis
transdb stats [-db ]
transdb morph-stats [-db ]
transdb branch-count [-db ]
transdb debug [en] [-db ]
transdb wordlist -lang -o [-db ]
transdb posseq [-en ] [-ja ] [-n N] [-db ]

# utilities
transdb tatoeba-join
transdb langdetect-train -lang
```

---

## License

Licensed under [AGPL-3.0-or-later](LICENSE).