transdb

English-Japanese translation lattice. Bilingual lexicon stored on iskradb, with morphological conjugation tables, semantic coord disambiguation, cluster-based phrase translation, on-demand inflection, and progressive semantic propagation.

Module: git.smesh.lol/transdb. Requires: git.smesh.lol/iskradb, git.mleku.dev/iskra.

Built in Moxie.

Architecture

Lattice layout

Four active branches of the iskradb lattice:

Branch            Alias      Contents
Bgrammatical (1)  Bnoun      Nouns, adjectives, nominal forms
Bmorphology (3)   Bverb      Verbs: dict forms at coord=0, conjugated forms at morph coord
Bpragmatic (4)    Bmodifier  Particles, function words, modifiers
Bcooccur (2)      -          Metadata: lang descriptors, particle roles, verb class registrations

Cross-links (Record.Link): JA record Link[0] points to the primary EN translation record; Link[1] to a secondary. EN verb records point back to the JA anchor they were generated from.

Key derivation

Key = SipHash128(defaultSeed, [lang(1 byte) || coord(8 bytes LE) || word(N bytes)])

lang: LangEN=0x01, LangJA=0x02. coord: 64-bit packed bitfield. word: UTF-8 surface form.
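
As an illustration of the byte layout above, here is a minimal Go sketch that builds the hash input only. The function name keyPreimage is hypothetical, and the SipHash128 step and defaultSeed are deliberately left out, since their exact form is not described here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// Language tags as given in the README.
const (
	LangEN byte = 0x01
	LangJA byte = 0x02
)

// keyPreimage (hypothetical name) concatenates lang (1 byte), coord
// (8 bytes, little-endian) and the UTF-8 word. Hashing is not performed.
func keyPreimage(lang byte, coord uint64, word string) []byte {
	buf := make([]byte, 9, 9+len(word))
	buf[0] = lang
	binary.LittleEndian.PutUint64(buf[1:9], coord)
	return append(buf, word...)
}

func main() {
	p := keyPreimage(LangJA, 0, "食べる")
	fmt.Printf("%d bytes: % x\n", len(p), p)
}
```

Because coord is part of the hashed bytes, the same word at two different coords yields two unrelated keys.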

Coord layout (64-bit)

bits 63-48  semantic   (16 bits): 8 subject|object category pairs, 2 bits each
bits 47-32  reserved    (16 bits)
bits 31-29  grammatical (3 bits): syntactic role
bits 28-25  cooccur     (4 bits): prev_type(2) + next_type(2)
bits 24-20  morphstate  (5 bits): MorphState
bits 19-18  pragmatic   (2 bits): domain context
bits 17-16  valency     (2 bits): argument count
bits 15-2   reserved    (14 bits): available for Slavic case/number
bits  1-0   register    (2 bits): social register

coord=0 is the base key (dictionary form, context-free lookups).
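
The layout above can be exercised with a small pack/unpack pair. This is a sketch in Go: the shift values come straight from the table, while the type and function names (CoordFields, PackCoord, UnpackCoord) are made up for illustration and may not match the module.

```go
package main

import "fmt"

// Bit offsets of each field, taken from the coord layout table.
const (
	shiftSemantic    = 48
	shiftGrammatical = 29
	shiftCooccur     = 25
	shiftMorph       = 20
	shiftPragmatic   = 18
	shiftValency     = 16
	shiftRegister    = 0
)

// CoordFields is a hypothetical unpacked view of a coord.
type CoordFields struct {
	Semantic    uint16 // 16 bits
	Grammatical uint8  // 3 bits
	Cooccur     uint8  // 4 bits
	Morph       uint8  // 5 bits
	Pragmatic   uint8  // 2 bits
	Valency     uint8  // 2 bits
	Register    uint8  // 2 bits
}

// PackCoord masks each field to its width and places it at its offset.
// Reserved bits are left at zero.
func PackCoord(f CoordFields) uint64 {
	return uint64(f.Semantic)<<shiftSemantic |
		uint64(f.Grammatical&0x07)<<shiftGrammatical |
		uint64(f.Cooccur&0x0F)<<shiftCooccur |
		uint64(f.Morph&0x1F)<<shiftMorph |
		uint64(f.Pragmatic&0x03)<<shiftPragmatic |
		uint64(f.Valency&0x03)<<shiftValency |
		uint64(f.Register&0x03)<<shiftRegister
}

// UnpackCoord reverses PackCoord.
func UnpackCoord(c uint64) CoordFields {
	return CoordFields{
		Semantic:    uint16(c >> shiftSemantic),
		Grammatical: uint8(c>>shiftGrammatical) & 0x07,
		Cooccur:     uint8(c>>shiftCooccur) & 0x0F,
		Morph:       uint8(c>>shiftMorph) & 0x1F,
		Pragmatic:   uint8(c>>shiftPragmatic) & 0x03,
		Valency:     uint8(c>>shiftValency) & 0x03,
		Register:    uint8(c>>shiftRegister) & 0x03,
	}
}

func main() {
	c := PackCoord(CoordFields{Morph: 20, Register: 1})
	fmt.Printf("coord=%#016x fields=%+v\n", c, UnpackCoord(c))
}
```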

MorphState (5-bit, Record.DataFile bits 1-5)

bit 4 (earth): tense      0=present  1=past
bit 3 (wood):  aspect     0=simple   1=progressive
bit 2 (metal): polarity   0=affirm   1=negative
bit 1 (water): formality  0=plain    1=polite
bit 0 (fire):  evidential 0=direct   1=reported

States 0-28 cover all JA verb forms. EN maps tense/aspect/polarity only (formality has no EN surface form).
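
The five bits can be modelled as flags. The Go sketch below uses the bit positions listed above; the constant names (Reported, Polite, Negative, Progressive, Past) are illustrative only and are not taken from the module.

```go
package main

import (
	"fmt"
	"strings"
)

// MorphState is the 5-bit morphological state.
type MorphState uint8

// Flag names are hypothetical; bit positions follow the table above.
const (
	Reported    MorphState = 1 << 0 // fire:  evidential
	Polite      MorphState = 1 << 1 // water: formality
	Negative    MorphState = 1 << 2 // metal: polarity
	Progressive MorphState = 1 << 3 // wood:  aspect
	Past        MorphState = 1 << 4 // earth: tense
)

// pick returns on when set is true, off otherwise.
func pick(set bool, off, on string) string {
	if set {
		return on
	}
	return off
}

// String spells out all five axes, from tense down to evidential.
func (m MorphState) String() string {
	return strings.Join([]string{
		pick(m&Past != 0, "present", "past"),
		pick(m&Progressive != 0, "simple", "progressive"),
		pick(m&Negative != 0, "affirm", "negative"),
		pick(m&Polite != 0, "plain", "polite"),
		pick(m&Reported != 0, "direct", "reported"),
	}, " ")
}

func main() {
	s := Past | Negative
	fmt.Printf("state %d: %s\n", uint8(s), s)
}
```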

Semantic bitfield (bits 63-48)

Sixteen flags, 2 bits per ontological category (subject bit + object bit):

Human, Animate, Abstract, Place, Artifact, Natural, Event, Collective

Stored in Record.DataFile bits 6-21 for O(1) retrieval at coord=0.
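
A sketch of how the flags and the DataFile slots might be read, in Go. Two things here are assumptions rather than documented facts: the order of the bits inside the 16-bit field (Human pair at the top, subject bit above object bit) and the width of Record.DataFile (taken as uint32). Only the bit ranges 1-5 and 6-21 come from the text above.

```go
package main

import "fmt"

// Category order as listed in the README.
var categories = [8]string{
	"Human", "Animate", "Abstract", "Place",
	"Artifact", "Natural", "Event", "Collective",
}

// ASSUMED bit order: category 0 occupies the two highest bits,
// with the subject bit above the object bit.
func subjBit(cat int) uint16 { return uint16(1) << uint(15-2*cat) }
func objBit(cat int) uint16  { return uint16(1) << uint(14-2*cat) }

// semanticFromDataFile extracts DataFile bits 6-21 (16 bits).
func semanticFromDataFile(df uint32) uint16 { return uint16(df >> 6) }

// morphFromDataFile extracts DataFile bits 1-5 (5 bits).
func morphFromDataFile(df uint32) uint8 { return uint8(df>>1) & 0x1F }

// describe lists the flags that are set, e.g. "Human(subj)".
func describe(flags uint16) []string {
	var out []string
	for i, name := range categories {
		if flags&subjBit(i) != 0 {
			out = append(out, name+"(subj)")
		}
		if flags&objBit(i) != 0 {
			out = append(out, name+"(obj)")
		}
	}
	return out
}

func main() {
	df := uint32(subjBit(0)|objBit(3))<<6 | uint32(20)<<1
	fmt.Println(describe(semanticFromDataFile(df)), morphFromDataFile(df))
}
```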

Record.Branch byte

bits 0-2  POS branch (3 bits)
bits 3-4  register   (RegNeutral/Formal/Informal/Vulgar)
bits 5-6  domain     (DomGeneral/Technical/Medical/Legal)
bit  7    honorific

For Bcooccur metadata records, the Branch byte is repurposed: lang descriptor bits, particle role codes (0-11), or verb class codes (0-15).
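
For ordinary (non-metadata) records, the byte can be packed and unpacked as in the Go sketch below. The branch numbers come from the lattice layout table; the numeric values of the register and domain constants are assumed to follow their listed order, and PackBranch/UnpackBranch are hypothetical helper names.

```go
package main

import "fmt"

// Branch numbers from the lattice layout table.
const (
	Bgrammatical uint8 = 1
	Bcooccur     uint8 = 2
	Bmorphology  uint8 = 3
	Bpragmatic   uint8 = 4
)

// ASSUMED numbering: values follow the order the names are listed in.
const (
	RegNeutral uint8 = iota
	RegFormal
	RegInformal
	RegVulgar
)

const (
	DomGeneral uint8 = iota
	DomTechnical
	DomMedical
	DomLegal
)

// PackBranch places POS in bits 0-2, register in 3-4, domain in 5-6
// and the honorific flag in bit 7.
func PackBranch(pos, register, domain uint8, honorific bool) uint8 {
	b := pos&0x07 | (register&0x03)<<3 | (domain&0x03)<<5
	if honorific {
		b |= 1 << 7
	}
	return b
}

// UnpackBranch reverses PackBranch.
func UnpackBranch(b uint8) (pos, register, domain uint8, honorific bool) {
	return b & 0x07, (b >> 3) & 0x03, (b >> 5) & 0x03, b&0x80 != 0
}

func main() {
	b := PackBranch(Bmorphology, RegFormal, DomMedical, true)
	fmt.Printf("branch byte: %#02x\n", b)
	fmt.Println(UnpackBranch(b))
}
```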


Coord relaxation

RelaxCoord(coord) []uint64 returns a cascade of coords from most specific down to coord=0. Fields are stripped in this order: pragmatic, register, valency, semantic bits (MSB to LSB), grammatical, cooccur, morphstate. All lookup functions walk this cascade to find the translation stored at the most specific matching context.
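
One way to read the stripping order is shown in the Go sketch below. It is not the module's implementation: it assumes that a field is skipped when it is already zero, that semantic bits are cleared one at a time, and that a final 0 is appended if reserved bits were set.

```go
package main

import "fmt"

// RelaxCoord returns coords from most specific to 0, clearing one
// field (or one semantic bit) per step in the documented order.
func RelaxCoord(coord uint64) []uint64 {
	out := []uint64{coord}
	cur := coord
	// strip clears mask and records the result if anything changed.
	strip := func(mask uint64) {
		if cur&mask != 0 {
			cur &^= mask
			out = append(out, cur)
		}
	}
	strip(0x3 << 18) // pragmatic, bits 19-18
	strip(0x3)       // register,  bits 1-0
	strip(0x3 << 16) // valency,   bits 17-16
	for bit := 63; bit >= 48; bit-- {
		strip(1 << uint(bit)) // semantic, MSB first
	}
	strip(0x7 << 29)  // grammatical, bits 31-29
	strip(0xF << 25)  // cooccur,     bits 28-25
	strip(0x1F << 20) // morphstate,  bits 24-20
	if cur != 0 {
		out = append(out, 0) // reserved bits were set; still end at the base key
	}
	return out
}

func main() {
	c := uint64(20)<<20 | uint64(1)<<18 | 2
	for _, r := range RelaxCoord(c) {
		fmt.Printf("%#016x\n", r)
	}
}
```

Morphstate is stripped last, so a conjugated form keeps its morph coord until every other context field has been dropped.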

Translation pipelines

Token-by-token (Translate)

Translate(tree, pool, idx, text, srcLang, dstLang, verbose) string

Cluster-based (TranslateWithClusters)

Five-stage pipeline for phrase-level translation:

TokenizeJA/EN
  -> ParseClusters        phrase segmentation (particle-bounded JA, position-bounded EN)
  -> TranslateCluster     head/modifier lookup with morph-coord and semantic-flag propagation
  -> ReorderClusters      SOV<->SVO rearrangement
  -> InsertMarkers        prepositions (EN), postpositions (JA), copula insertion

Cluster types

ClusterNPSubj (0)  topic/grammatical subject
ClusterNPObj  (1)  direct object
ClusterVP     (2)  predicate zone
ClusterPP     (3)  adpositional phrase (locative, dative, source, etc.)
ClusterMod    (4)  bare modifier
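
To make the reordering stage concrete, here is a toy Go sketch of SOV to SVO rearrangement as a stable sort over cluster types. The Cluster struct, the svoRank table, and the chosen target order (subject, predicate, object, adpositional phrase, modifier) are assumptions for illustration; the actual ReorderClusters logic may differ.

```go
package main

import (
	"fmt"
	"sort"
)

type ClusterType int

// Cluster type codes as listed above.
const (
	ClusterNPSubj ClusterType = iota
	ClusterNPObj
	ClusterVP
	ClusterPP
	ClusterMod
)

// Cluster is a hypothetical minimal representation of one phrase.
type Cluster struct {
	Type ClusterType
	Text string
}

// svoRank is an ASSUMED target order for EN output.
var svoRank = map[ClusterType]int{
	ClusterNPSubj: 0,
	ClusterVP:     1,
	ClusterNPObj:  2,
	ClusterPP:     3,
	ClusterMod:    4,
}

// reorderSOVtoSVO returns a reordered copy; clusters of the same type
// keep their relative order because the sort is stable.
func reorderSOVtoSVO(cs []Cluster) []Cluster {
	out := append([]Cluster(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool {
		return svoRank[out[i].Type] < svoRank[out[j].Type]
	})
	return out
}

func main() {
	ja := []Cluster{
		{ClusterNPSubj, "I"},
		{ClusterPP, "at school"},
		{ClusterNPObj, "a book"},
		{ClusterVP, "read"},
	}
	for _, c := range reorderSOVtoSVO(ja) {
		fmt.Print(c.Text, " ")
	}
	fmt.Println()
}
```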

Lookup with fallback

LookupWord(tree, pool, word, srcLang) []string
LookupWordCtx(tree, pool, word, srcLang, coord) []string
FuzzyLookupWord(...)   // Damerau-Levenshtein fallback on exact miss

LookupWordCtx tries each coord in the RelaxCoord sequence and returns on the first hit. Branch search order is context-aware (derived from the cooccurrence axis).
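
The first-hit behaviour can be sketched in Go with an in-memory map standing in for the lattice. Everything here is illustrative: the key type, the lookupCtx helper, the sample entries, and the cascade slice (which plays the role of a RelaxCoord result) are not part of the module.

```go
package main

import "fmt"

// key identifies an entry by context coord and surface form.
type key struct {
	coord uint64
	word  string
}

// lookupCtx walks the cascade from most specific to least specific
// and returns the translations at the first coord that has an entry.
func lookupCtx(db map[key][]string, word string, cascade []uint64) ([]string, uint64, bool) {
	for _, c := range cascade {
		if tr, ok := db[key{c, word}]; ok {
			return tr, c, true
		}
	}
	return nil, 0, false
}

func main() {
	// Made-up entries: a context-specific sense and a base sense.
	db := map[key][]string{
		key{0, "はし"}: {"bridge"},
		key{5, "はし"}: {"chopsticks"},
	}
	tr, at, _ := lookupCtx(db, "はし", []uint64{7, 5, 0})
	fmt.Println(tr, "found at coord", at)
}
```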


Inflection engine

Verbs are stored once, in dictionary form. The verb class code (v1, v5k, v5g, v5s, v5m, v5n, v5b, v5r, v5t, v5u, v5aru, vs, vk) is stored in Bcooccur. Surface forms are computed on demand:

RegisterVerbClass(tree, lang, dictForm, classCode)
GetVerbClass(tree, lang, dictForm) (string, bool)
InflectJA(dictForm, verbClass, state) string
InflectJAFromTree(tree, lang, dictForm, state) string

13 verb classes x 16 morph states = 208 forms generated from tables. No lattice I/O required for inflection.

The inflection-table pattern generalizes to Slavic declension: noun case tables (7 cases x 2 numbers per declension class) can be built the same way.
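
A table-driven inflector for a single class gives a feel for the approach. The Go sketch below covers only the v1 (ichidan) class and only the tense, polarity and formality bits; the table, the constant names and inflectIchidan are illustrative and are not the module's InflectJA.

```go
package main

import (
	"fmt"
	"strings"
)

// Subset of the MorphState bits (names are hypothetical).
const (
	Polite   uint8 = 1 << 1
	Negative uint8 = 1 << 2
	Past     uint8 = 1 << 4
)

// ichidanEndings maps a state to the ending that replaces the final る.
var ichidanEndings = map[uint8]string{
	0:                        "る",
	Past:                     "た",
	Negative:                 "ない",
	Negative | Past:          "なかった",
	Polite:                   "ます",
	Polite | Past:            "ました",
	Polite | Negative:        "ません",
	Polite | Negative | Past: "ませんでした",
}

// inflectIchidan assumes the caller already knows the verb is v1.
// It reports false for states outside the table or forms not ending in る.
func inflectIchidan(dictForm string, state uint8) (string, bool) {
	ending, ok := ichidanEndings[state]
	if !ok || !strings.HasSuffix(dictForm, "る") {
		return "", false
	}
	return strings.TrimSuffix(dictForm, "る") + ending, true
}

func main() {
	for _, s := range []uint8{0, Past, Negative, Polite | Past} {
		form, _ := inflectIchidan("食べる", s)
		fmt.Printf("state %2d: %s\n", s, form)
	}
}
```

The class code matters because the surface form alone is ambiguous: some v5r verbs also end in る, so the class has to be looked up rather than guessed from the ending.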


Lang descriptor system

Registered via transdb lang-init. Stored in Bcooccur.

type LangDesc struct {
    Order      uint8   // OrderSVO, OrderSOV, OrderVSO, ...
    HeadFinal  bool    // false=EN (head-initial), true=JA (head-final)
    Particle   bool    // false=position-bounded parser, true=particle-bounded
    PreNomRC   bool    // false=post-nominal RC (EN), true=pre-nominal RC (JA)
    ZeroCopula bool    // false=EN (overt), true=JA (zero copula)
    Markers    uint8   // MarkerPrepositional, Postpositional, Case
}

Particle role disambiguation

Particle-role lookup supports semantic disambiguation:

"ni" + no context         -> RoleNPDative (default)
"ni" + SemanticPlaceObj   -> RolePPLocative
"ni" + SemanticEventObj   -> RolePPTemporal
"ni" + SemanticHumanObj   -> RoleNPDative

Uses RelaxCoord on the NP's semantic flags first, falls back to coord=0.
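
The decision table for "ni" can be written out directly, as in the Go sketch below. The role and flag names come from the table above, but their numeric values, the flag bit positions, and the precedence when several flags are set (Place, then Event, then the dative default) are assumptions. The real lookup goes through the lattice rather than a hard-coded switch.

```go
package main

import "fmt"

type Role int

// Role names from the README; numeric codes here are ASSUMED.
const (
	RoleNPDative Role = iota
	RolePPLocative
	RolePPTemporal
)

// Object-side semantic flags; bit positions are ASSUMED.
const (
	SemanticHumanObj uint16 = 1 << 14
	SemanticPlaceObj uint16 = 1 << 8
	SemanticEventObj uint16 = 1 << 2
)

// niRole picks a role for the particle "ni" from the semantic flags of
// the noun phrase it marks. No matching flag yields the dative default.
func niRole(flags uint16) Role {
	switch {
	case flags&SemanticPlaceObj != 0:
		return RolePPLocative
	case flags&SemanticEventObj != 0:
		return RolePPTemporal
	default:
		return RoleNPDative // covers SemanticHumanObj and no context
	}
}

func main() {
	fmt.Println(niRole(0), niRole(SemanticPlaceObj), niRole(SemanticEventObj))
}
```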


Semantic propagation

Progressive semantic flag diffusion via subject-verb co-occurrence:

transdb propagate -labels <tsv> -ja <corpus>... [-passes N]

Extracts subject-verb pairs from the JA corpus and propagates semantic flags bidirectionally across the pair graph. A checkpoint is written every N updates, so runs can be resumed, and a stop file halts a run.


Tokenizers

TokenizeEN(text string) []string
TokenizeJA(text string, tree, verbose) []string

TokenizeJA: forward maximum-match against the JA lattice. Branch search order adapts to the preceding token's POS. Progressive compounds are collapsed to single tokens via morph-coord lookup and verbStem fallback.
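
Forward maximum-match itself is a simple greedy loop. The Go sketch below runs it against a plain set of words instead of the lattice, and leaves out the POS-adaptive branch order and the compound collapsing. The function name, the maxLen parameter and the single-rune fallback for unknown text are assumptions.

```go
package main

import "fmt"

// tokenizeMaxMatch scans left to right and, at each position, takes the
// longest dictionary entry of at most maxLen runes. If nothing of two
// or more runes matches, it emits a single rune and moves on.
func tokenizeMaxMatch(text string, dict map[string]bool, maxLen int) []string {
	runes := []rune(text)
	var out []string
	for i := 0; i < len(runes); {
		n := maxLen
		if rem := len(runes) - i; rem < n {
			n = rem
		}
		matched := 1
		for l := n; l > 1; l-- {
			if dict[string(runes[i:i+l])] {
				matched = l
				break
			}
		}
		out = append(out, string(runes[i:i+matched]))
		i += matched
	}
	return out
}

func main() {
	dict := map[string]bool{
		"私": true, "は": true, "日本": true, "日本語": true,
		"を": true, "読む": true,
	}
	fmt.Println(tokenizeMaxMatch("私は日本語を読む", dict, 4))
}
```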


Language detection

Two detection modes:

Fast (detect.mx): 5% Japanese script threshold (hiragana/katakana/CJK).

Trigram models (langdetect/): Character trigram profiles trained from corpus. 300 trigrams per model, cosine similarity scoring, 0.90 confidence threshold. Language-agnostic - supports EN, JA, KO, ZH, and any language with a training corpus.
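
Both modes can be sketched in a few lines of Go. For the fast mode, the 5% threshold is from the text above, while ignoring whitespace and using >= are assumptions. For the trigram mode, the profile is built from raw counts; truncation to the top 300 trigrams and the 0.90 confidence check are left out. All function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
	"unicode"
)

// japaneseRatio is the share of non-space runes that are hiragana,
// katakana or Han.
func japaneseRatio(text string) float64 {
	var ja, total int
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if unicode.In(r, unicode.Hiragana, unicode.Katakana, unicode.Han) {
			ja++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(ja) / float64(total)
}

// looksJapanese applies the 5% script threshold.
func looksJapanese(text string) bool { return japaneseRatio(text) >= 0.05 }

// trigramProfile counts every run of three consecutive runes.
func trigramProfile(text string) map[string]float64 {
	runes := []rune(text)
	p := make(map[string]float64)
	for i := 0; i+3 <= len(runes); i++ {
		p[string(runes[i:i+3])]++
	}
	return p
}

// cosine returns the cosine similarity of two count profiles,
// or 0 if either profile is empty.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, v := range a {
		dot += v * b[k]
		na += v * v
	}
	for _, v := range b {
		nb += v * v
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(looksJapanese("hello world"), looksJapanese("これは本です"))
	en := trigramProfile("the quick brown fox jumps over the lazy dog")
	fmt.Printf("%.3f\n", cosine(trigramProfile("the dog jumps over the fox"), en))
}
```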

Fuzzy matching

Length-bucketed word index with Damerau-Levenshtein distance. DualIndex holds both language indices for bidirectional fuzzy lookup.
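
A minimal Go sketch of the idea follows. It uses the optimal-string-alignment variant of Damerau-Levenshtein (adjacent transpositions cost 1, and a substring is never edited twice). Which variant the module uses is not stated here, and the Index type and its methods are hypothetical. Bucketing by rune length means a query of length n only has to be compared against words of length n-maxDist to n+maxDist.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// osaDistance is the optimal-string-alignment edit distance over runes.
func osaDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := 0; j <= len(rb); j++ {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			v := min3(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
			// Adjacent transposition.
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if t := d[i-2][j-2] + 1; t < v {
					v = t
				}
			}
			d[i][j] = v
		}
	}
	return d[len(ra)][len(rb)]
}

// Index buckets words by rune length.
type Index map[int][]string

func (ix Index) Add(w string) {
	n := len([]rune(w))
	ix[n] = append(ix[n], w)
}

// Nearest returns the closest word within maxDist, scanning only the
// buckets whose length could possibly qualify.
func (ix Index) Nearest(q string, maxDist int) (best string, dist int, ok bool) {
	n := len([]rune(q))
	dist = maxDist + 1
	for l := n - maxDist; l <= n+maxDist; l++ {
		for _, w := range ix[l] {
			if d := osaDistance(q, w); d < dist {
				best, dist, ok = w, d, true
			}
		}
	}
	return best, dist, ok
}

func main() {
	ix := Index{}
	for _, w := range []string{"translation", "transition", "lattice"} {
		ix.Add(w)
	}
	fmt.Println(ix.Nearest("translatoin", 2))
}
```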

Ingest pipeline

transdb load -jmdict <path> [-kanjidic <path>] -o <dir>

For each JMdict entry: insert base form, insert reading aliases, insert EN glosses with Link[0], generate conjugations (16 morph states), register verb class, generate synthetic EN forms (past/progressive/3sg/negative for 70 irregular verbs + regular patterns).

transdb extend -en <file> -ja <file> [-db <dir>]

Extracts bilingual pairs from aligned corpora via PMI-weighted co-occurrence.
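
For reference, pointwise mutual information over aligned sentence pairs can be computed as in the Go sketch below, using PMI(e, j) = log2(N * c(e,j) / (c(e) * c(j))) with counts taken per sentence pair. How transdb weights, thresholds or ranks candidates by PMI is not described here, so this shows only the standard score; the aligned type and function names are illustrative.

```go
package main

import (
	"fmt"
	"math"
)

// aligned is one sentence pair, already tokenized.
type aligned struct {
	en, ja []string
}

func contains(tokens []string, w string) bool {
	for _, t := range tokens {
		if t == w {
			return true
		}
	}
	return false
}

// pmi computes log2(joint*n / (cx*cy)); it returns -Inf when any
// count is zero.
func pmi(joint, cx, cy, n int) float64 {
	if joint == 0 || cx == 0 || cy == 0 || n == 0 {
		return math.Inf(-1)
	}
	return math.Log2(float64(joint) * float64(n) / (float64(cx) * float64(cy)))
}

// cooccurPMI counts, per sentence pair, whether e and j occur, and
// scores how much more often they occur together than chance predicts.
func cooccurPMI(corpus []aligned, e, j string) float64 {
	var ce, cj, cej int
	for _, s := range corpus {
		he, hj := contains(s.en, e), contains(s.ja, j)
		if he {
			ce++
		}
		if hj {
			cj++
		}
		if he && hj {
			cej++
		}
	}
	return pmi(cej, ce, cj, len(corpus))
}

func main() {
	corpus := []aligned{
		{[]string{"the", "dog", "runs"}, []string{"犬", "が", "走る"}},
		{[]string{"a", "dog", "eats"}, []string{"犬", "が", "食べる"}},
		{[]string{"the", "cat", "runs"}, []string{"猫", "が", "走る"}},
		{[]string{"a", "cat", "eats"}, []string{"猫", "が", "食べる"}},
	}
	fmt.Println(cooccurPMI(corpus, "dog", "犬"), cooccurPMI(corpus, "dog", "が"))
}
```

A particle such as が co-occurs with everything, so its PMI with any single EN word stays near zero, which is why PMI is preferred over raw co-occurrence counts for pair extraction.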

transdb consolidate [-db <dir>] [-o <dir>]

Rebuilds the lattice and drops redundant morph-coord records whose forms the inflection engine can reconstruct on demand.


Commands

# database creation
transdb load -jmdict <path> [-kanjidic <path>] -o <dir>
transdb lang-init -lang <en|ja> [-db <dir>]
transdb consolidate [-db <dir>] [-o <dir>]

# translation
transdb translate -src <en|ja> -dst <en|ja> [-cluster] [-fuzzy] [-v] <text>
transdb detect <text>

# testing
transdb roundtrip <en-word> [-db <dir>]
transdb roundtrip-ja <ja-word> [-db <dir>]
transdb roundtrip-test [-out <tsv>] [-limit N] [-db <dir>]

# semantic labeling
transdb semantic-label -labels <tsv> [-db <dir>]
transdb propagate -labels <tsv> -ja <file>... [-passes N] [-db <dir>]
transdb apply-checkpoint [-checkpoint <file>] [-db <dir>]

# extension & reranking
transdb extend -en <file> -ja <file> [-db <dir>]
transdb rerank -jmdict <path> [-db <dir>]
transdb apply-overrides -overrides <tsv> [-db <dir>]

# analysis
transdb stats [-db <dir>]
transdb morph-stats [-db <dir>]
transdb branch-count [-db <dir>]
transdb debug <word> [en] [-db <dir>]
transdb wordlist -lang <en|ja> -o <file> [-db <dir>]
transdb posseq [-en <file>] [-ja <file>] [-n N] [-db <dir>]

# utilities
transdb tatoeba-join
transdb langdetect-train -lang <code> <corpus>

License

Licensed under AGPL-3.0-or-later.