# transdb

English-Japanese translation lattice. Bilingual lexicon stored on [iskradb](https://git.smesh.lol/iskradb), with morphological conjugation tables, semantic coord disambiguation, cluster-based phrase translation, on-demand inflection, and progressive semantic propagation.

Module: `git.smesh.lol/transdb`. Requires: `git.smesh.lol/iskradb`, `git.mleku.dev/iskra`.

Built in [Moxie](https://git.smesh.lol/moxie).

---

## Architecture

### Lattice layout

Four active branches of the iskradb lattice:

| Branch | Alias | Contents |
|--------|-------|----------|
| `Bgrammatical` (1) | `Bnoun` | Nouns, adjectives, nominal forms |
| `Bmorphology` (3) | `Bverb` | Verbs: dict forms at coord=0, conjugated forms at morph coord |
| `Bpragmatic` (4) | `Bmodifier` | Particles, function words, modifiers |
| `Bcooccur` (2) | - | Metadata: lang descriptors, particle roles, verb class registrations |

Cross-links (`Record.Link`): JA record `Link[0]` points to the primary EN translation record; `Link[1]` to a secondary. EN verb records point back to the JA anchor they were generated from.

### Key derivation

```
Key = SipHash128(defaultSeed, [lang(1 byte) || coord(8 bytes LE) || word(N bytes)])
```

`lang`: `LangEN=0x01`, `LangJA=0x02`. `coord`: 64-bit packed bitfield. `word`: UTF-8 surface form.
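
As a rough illustration of the hash input described above, the following Go sketch assembles the byte string `lang || coord (LE) || word`. The helper name `keyPreimage` is hypothetical, and the SipHash128 step and `defaultSeed` are deliberately left out because their exact implementation is not shown here.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	LangEN byte = 0x01
	LangJA byte = 0x02
)

// keyPreimage is a hypothetical helper: it builds the bytes that would be
// hashed, i.e. lang (1 byte) || coord (8 bytes, little-endian) || word (UTF-8).
func keyPreimage(lang byte, coord uint64, word string) []byte {
	var c [8]byte
	binary.LittleEndian.PutUint64(c[:], coord)
	buf := make([]byte, 0, 1+len(c)+len(word))
	buf = append(buf, lang)
	buf = append(buf, c[:]...)
	buf = append(buf, word...)
	return buf
}

func main() {
	p := keyPreimage(LangJA, 0, "食べる")
	fmt.Printf("%d bytes: % x\n", len(p), p)
}
```

Because `lang` and `coord` are part of the hashed bytes, the same surface form yields a different key per language and per context.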

### Coord layout (64-bit)

```
bits 63-48  semantic   (16 bits): 8 subject|object category pairs, 2 bits each
bits 47-32  reserved    (16 bits)
bits 31-29  grammatical (3 bits): syntactic role
bits 28-25  cooccur     (4 bits): prev_type(2) + next_type(2)
bits 24-20  morphstate  (5 bits): MorphState
bits 19-18  pragmatic   (2 bits): domain context
bits 17-16  valency     (2 bits): argument count
bits 15-2   reserved    (14 bits): available for Slavic case/number
bits  1-0   register    (2 bits): social register
```

`coord=0` is the base key (dictionary form, context-free lookups).
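
The layout above can be read as a set of shift/width pairs. The Go sketch below packs and unpacks those fields; the `field` type and the `f...` variable names are assumptions made for illustration, only the bit positions come from the table.

```go
package main

import "fmt"

// field describes one coord bitfield: lowest bit position and width in bits.
type field struct{ shift, width uint }

// Positions follow the coord layout table; the names are illustrative.
var (
	fSemantic    = field{48, 16}
	fGrammatical = field{29, 3}
	fCooccur     = field{25, 4}
	fMorphState  = field{20, 5}
	fPragmatic   = field{18, 2}
	fValency     = field{16, 2}
	fRegister    = field{0, 2}
)

func (f field) mask() uint64 { return (uint64(1)<<f.width - 1) << f.shift }

// get extracts the field value from coord.
func (f field) get(coord uint64) uint64 { return (coord & f.mask()) >> f.shift }

// set returns coord with the field replaced by v, truncated to the field width.
func (f field) set(coord, v uint64) uint64 {
	return coord&^f.mask() | (v<<f.shift)&f.mask()
}

func main() {
	c := fMorphState.set(0, 0b10100) // past + negative
	c = fRegister.set(c, 2)
	fmt.Printf("coord=%#x morph=%05b register=%d\n",
		c, fMorphState.get(c), fRegister.get(c))
}
```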

### MorphState (5-bit, Record.DataFile bits 1-5)

```
bit 4 (earth): tense      0=present  1=past
bit 3 (wood):  aspect     0=simple   1=progressive
bit 2 (metal): polarity   0=affirm   1=negative
bit 1 (water): formality  0=plain    1=polite
bit 0 (fire):  evidential 0=direct   1=reported
```

States 0-28 cover all JA verb forms. EN maps tense/aspect/polarity only (formality has no EN surface form).
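
A minimal Go sketch of the five MorphState bits follows. The constant names and the `String` method are hypothetical; only the bit positions and their meanings are taken from the table above.

```go
package main

import (
	"fmt"
	"strings"
)

type MorphState uint8

// Bit assignments from the MorphState table; names are illustrative.
const (
	MorphReported    MorphState = 1 << 0 // fire: evidential
	MorphPolite      MorphState = 1 << 1 // water: formality
	MorphNegative    MorphState = 1 << 2 // metal: polarity
	MorphProgressive MorphState = 1 << 3 // wood: aspect
	MorphPast        MorphState = 1 << 4 // earth: tense
)

// String spells out each axis in table order (tense first).
func (m MorphState) String() string {
	pick := func(bit MorphState, off, on string) string {
		if m&bit != 0 {
			return on
		}
		return off
	}
	return strings.Join([]string{
		pick(MorphPast, "present", "past"),
		pick(MorphProgressive, "simple", "progressive"),
		pick(MorphNegative, "affirm", "negative"),
		pick(MorphPolite, "plain", "polite"),
		pick(MorphReported, "direct", "reported"),
	}, " ")
}

func main() {
	fmt.Println(MorphPast | MorphNegative)
}
```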

### Semantic bitfield (bits 63-48)

Sixteen flags, 2 bits per ontological category (subject bit + object bit):

```
Human, Animate, Abstract, Place, Artifact, Natural, Event, Collective
```

Stored in `Record.DataFile` bits 6-21 for O(1) retrieval at coord=0.
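
To make the `Record.DataFile` packing concrete, here is a small Go sketch that pulls the MorphState (bits 1-5) and the semantic flags (bits 6-21) out of a data word and lifts the flags into coord position (bits 63-48). It assumes `DataFile` is at least 32 bits wide and says nothing about bit 0; the function names are hypothetical.

```go
package main

import "fmt"

// morphFromDataFile reads DataFile bits 1-5.
func morphFromDataFile(df uint32) uint8 { return uint8(df >> 1 & 0x1F) }

// semanticFromDataFile reads DataFile bits 6-21.
func semanticFromDataFile(df uint32) uint16 { return uint16(df >> 6 & 0xFFFF) }

// semanticToCoord places the 16 semantic flags at coord bits 63-48.
func semanticToCoord(sem uint16) uint64 { return uint64(sem) << 48 }

func main() {
	df := uint32(0x0041)<<6 | uint32(0b10100)<<1 // example data word
	sem := semanticFromDataFile(df)
	fmt.Printf("morph=%05b semantic=%016b coord=%#x\n",
		morphFromDataFile(df), sem, semanticToCoord(sem))
}
```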

### Record.Branch byte

```
bits 0-2  POS branch (3 bits)
bits 3-4  register   (RegNeutral/Formal/Informal/Vulgar)
bits 5-6  domain     (DomGeneral/Technical/Medical/Legal)
bit  7    honorific
```

For Bcooccur metadata records, the Branch byte is repurposed: lang descriptor bits, particle role codes (0-11), or verb class codes (0-15).
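
For ordinary (non-metadata) records, the Branch byte can be decoded along these lines. This is a Go sketch with a hypothetical `BranchInfo` type; the numeric values of the register and domain constants are not given above, so the fields are left as raw numbers.

```go
package main

import "fmt"

// BranchInfo is an illustrative view of the Record.Branch byte.
type BranchInfo struct {
	POS       uint8 // bits 0-2
	Register  uint8 // bits 3-4
	Domain    uint8 // bits 5-6
	Honorific bool  // bit 7
}

// decodeBranch splits the byte per the layout above. It does not apply to
// Bcooccur metadata records, where the byte is repurposed.
func decodeBranch(b byte) BranchInfo {
	return BranchInfo{
		POS:       b & 0x07,
		Register:  b >> 3 & 0x03,
		Domain:    b >> 5 & 0x03,
		Honorific: b&0x80 != 0,
	}
}

func main() {
	fmt.Printf("%+v\n", decodeBranch(0xC9))
}
```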

---

## Coord relaxation

`RelaxCoord(coord) []uint64` returns a cascade of coords from the most specific down to coord=0. Fields are stripped in this order: pragmatic, register, valency, semantic bits (MSB to LSB), grammatical, cooccur, morphstate. All lookup functions walk this cascade to find the best available translation at the most specific matching context.
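
One way to picture the cascade is the Go sketch below. It is not the project's `RelaxCoord`; it assumes that stripping a field that is already zero produces no new entry, that semantic bits are dropped one at a time, and that any leftover reserved bits are cleared in a final step. Mask names are hypothetical, positions follow the coord layout.

```go
package main

import "fmt"

// Field masks derived from the coord layout.
const (
	maskPragmatic   uint64 = 0x3 << 18
	maskRegister    uint64 = 0x3
	maskValency     uint64 = 0x3 << 16
	maskGrammatical uint64 = 0x7 << 29
	maskCooccur     uint64 = 0xF << 25
	maskMorphState  uint64 = 0x1F << 20
)

// relaxCoord returns coords from most specific to 0, stripping fields in the
// documented order and skipping steps that would not change the coord.
func relaxCoord(coord uint64) []uint64 {
	out := []uint64{coord}
	strip := func(mask uint64) {
		if coord&mask != 0 {
			coord &^= mask
			out = append(out, coord)
		}
	}
	strip(maskPragmatic)
	strip(maskRegister)
	strip(maskValency)
	for bit := 63; bit >= 48; bit-- { // semantic bits, MSB to LSB
		strip(uint64(1) << uint(bit))
	}
	strip(maskGrammatical)
	strip(maskCooccur)
	strip(maskMorphState)
	if coord != 0 { // only reserved bits remain
		out = append(out, 0)
	}
	return out
}

func main() {
	c := uint64(1)<<63 | uint64(20)<<20 | uint64(1)<<18 | 2
	for _, r := range relaxCoord(c) {
		fmt.Printf("%#018x\n", r)
	}
}
```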

---

## Translation pipelines

### Token-by-token (Translate)

```moxie
Translate(tree, pool, idx, text, srcLang, dstLang, verbose) string
```

- JA to EN: two-zone SOV-to-SVO reordering (subject zone / predicate zone, split at the particle)
- EN to JA: operator accumulation (did/not/apparently), applied to the next verb via morphstate
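
The operator-accumulation idea for the EN side can be sketched in Go as follows. This is an illustration under assumptions: the operator-to-bit mapping (did = past, not = negative, apparently = reported) is inferred from the MorphState table, and `accumulate`, `token`, and the `isVerb` callback are hypothetical stand-ins for the real lexicon lookup.

```go
package main

import "fmt"

const (
	morphReported uint8 = 1 << 0
	morphNegative uint8 = 1 << 2
	morphPast     uint8 = 1 << 4
)

// operators maps EN function words to the morph bit they contribute (assumed).
var operators = map[string]uint8{
	"did": morphPast, "not": morphNegative, "apparently": morphReported,
}

type token struct {
	Word  string
	State uint8
}

// accumulate drops operator words, ORs their bits into a pending state, and
// attaches that state to the next verb it encounters.
func accumulate(words []string, isVerb func(string) bool) []token {
	var out []token
	var pending uint8
	for _, w := range words {
		if bit, ok := operators[w]; ok {
			pending |= bit
			continue
		}
		t := token{Word: w}
		if isVerb(w) {
			t.State, pending = pending, 0
		}
		out = append(out, t)
	}
	return out
}

func main() {
	isVerb := func(w string) bool { return w == "eat" }
	for _, t := range accumulate([]string{"she", "did", "not", "eat"}, isVerb) {
		fmt.Printf("%-4s %05b\n", t.Word, t.State)
	}
}
```

The resulting state on the verb can then be placed in the morphstate field of the coord used for the JA lookup or inflection.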

### Cluster-based (TranslateWithClusters)

Five-stage pipeline for phrase-level translation:

```
TokenizeJA/EN
  -> ParseClusters        phrase segmentation (particle-bounded JA, position-bounded EN)
  -> TranslateCluster     head/modifier lookup with morph-coord and semantic-flag propagation
  -> ReorderClusters      SOV<->SVO rearrangement
  -> InsertMarkers        prepositions (EN), postpositions (JA), copula insertion
```

### Cluster types

```
ClusterNPSubj (0)  topic/grammatical subject
ClusterNPObj  (1)  direct object
ClusterVP     (2)  predicate zone
ClusterPP     (3)  adpositional phrase (locative, dative, source, etc.)
ClusterMod    (4)  bare modifier
```
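
As a rough picture of the SOV-to-SVO direction of `ReorderClusters`, the Go sketch below does a stable sort by cluster type. The placement of `ClusterPP` and `ClusterMod` after the object is an assumption for illustration, as are the `Cluster` struct and the function name; the type codes match the list above.

```go
package main

import (
	"fmt"
	"sort"
)

type ClusterType int

const (
	ClusterNPSubj ClusterType = iota // 0
	ClusterNPObj                     // 1
	ClusterVP                        // 2
	ClusterPP                        // 3
	ClusterMod                       // 4
)

type Cluster struct {
	Type ClusterType
	Text string
}

// reorderSOVtoSVO returns a copy ordered subject, verb, object, then
// everything else in its original relative order (assumed placement).
func reorderSOVtoSVO(cs []Cluster) []Cluster {
	rank := func(t ClusterType) int {
		switch t {
		case ClusterNPSubj:
			return 0
		case ClusterVP:
			return 1
		case ClusterNPObj:
			return 2
		default:
			return 3
		}
	}
	out := append([]Cluster(nil), cs...)
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i].Type) < rank(out[j].Type)
	})
	return out
}

func main() {
	in := []Cluster{
		{ClusterNPSubj, "I"}, {ClusterPP, "in Tokyo"},
		{ClusterNPObj, "sushi"}, {ClusterVP, "eat"},
	}
	for _, c := range reorderSOVtoSVO(in) {
		fmt.Print(c.Text, " ")
	}
	fmt.Println()
}
```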

### Lookup with fallback

```moxie
LookupWord(tree, pool, word, srcLang) []string
LookupWordCtx(tree, pool, word, srcLang, coord) []string
FuzzyLookupWord(...)   // Damerau-Levenshtein fallback on exact miss
```

`LookupWordCtx` tries each coord in `RelaxCoord` sequence, returning on first hit. Branch search order is context-aware (derived from cooccurrence axis).
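
The first-hit behaviour can be illustrated with a plain map standing in for the lattice. In this Go sketch, `entryKey`, `lookupFirstHit`, and the sample data are hypothetical; the cascade is passed in already relaxed, and branch search order is not modelled.

```go
package main

import "fmt"

// entryKey stands in for the hashed lattice key (coord + surface form).
type entryKey struct {
	Coord uint64
	Word  string
}

// lookupFirstHit walks the cascade from most specific to coord=0 and returns
// the translations stored at the first coord that has an entry.
func lookupFirstHit(store map[entryKey][]string, word string, cascade []uint64) ([]string, bool) {
	for _, c := range cascade {
		if tr, ok := store[entryKey{c, word}]; ok {
			return tr, true
		}
	}
	return nil, false
}

func main() {
	const locative = uint64(1) << 54 // illustrative coord value
	store := map[entryKey][]string{
		{0, "に"}:        {"to"},
		{locative, "に"}: {"at", "in"},
	}
	cascade := []uint64{locative | 2, locative, 0}
	tr, ok := lookupFirstHit(store, "に", cascade)
	fmt.Println(tr, ok)
}
```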

---

## Inflection engine

Verbs are stored once at dictionary form. Verb class code (v1, v5k, v5g, v5s, v5m, v5n, v5b, v5r, v5t, v5u, v5aru, vs, vk) is stored in Bcooccur. Surface forms are computed on demand:

```moxie
RegisterVerbClass(tree, lang, dictForm, classCode)
GetVerbClass(tree, lang, dictForm) (string, bool)
InflectJA(dictForm, verbClass, state) string
InflectJAFromTree(tree, lang, dictForm, state) string
```

13 verb classes x 16 morph states = 208 forms generated from tables. No lattice I/O required for inflection.
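
To show what table-driven inflection looks like for one class, here is a Go sketch covering only `v1` (ichidan) verbs across the 16 tense/aspect/polarity/formality combinations. It is not the project's `InflectJA`: the function name, the ending table, and the choice to ignore the evidential bit are assumptions, and progressive forms are built as te-form plus the auxiliary いる.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	morphPolite      = 1 << 1
	morphNegative    = 1 << 2
	morphProgressive = 1 << 3
	morphPast        = 1 << 4
)

// ichidanEndings is indexed by [polite][negative][past].
var ichidanEndings = [2][2][2]string{
	{{"る", "た"}, {"ない", "なかった"}},
	{{"ます", "ました"}, {"ません", "ませんでした"}},
}

func bit(state, mask int) int {
	if state&mask != 0 {
		return 1
	}
	return 0
}

// inflectV1 inflects an ichidan verb given in dictionary form (ending in る).
func inflectV1(dictForm string, state int) string {
	stem := strings.TrimSuffix(dictForm, "る")
	if state&morphProgressive != 0 {
		// Progressive: te-form + いる, where いる is itself ichidan.
		return stem + "て" + inflectV1("いる", state&^morphProgressive)
	}
	return stem + ichidanEndings[bit(state, morphPolite)][bit(state, morphNegative)][bit(state, morphPast)]
}

func main() {
	for s := 0; s < 32; s += 2 { // even states: evidential bit left at 0
		fmt.Printf("%05b %s\n", s, inflectV1("食べる", s))
	}
}
```

The godan classes would need their own stem-change tables; the lookup shape stays the same.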

The inflection-table approach generalizes to Slavic declension: the same pattern extends to noun case tables (7 cases x 2 numbers per declension class).

---

## Lang descriptor system

Registered via `transdb lang-init`. Stored in Bcooccur.

```moxie
type LangDesc struct {
    Order      uint8   // OrderSVO, OrderSOV, OrderVSO, ...
    HeadFinal  bool    // false=EN (head-initial), true=JA (head-final)
    Particle   bool    // false=position-bounded parser, true=particle-bounded
    PreNomRC   bool    // false=post-nominal RC (EN), true=pre-nominal RC (JA)
    ZeroCopula bool    // false=EN (overt), true=JA (zero copula)
    Markers    uint8   // MarkerPrepositional, Postpositional, Case
}
```

### Particle role disambiguation

Particle-role lookup supports semantic disambiguation:

```
"ni" + no context         -> RoleNPDative (default)
"ni" + SemanticPlaceObj   -> RolePPLocative
"ni" + SemanticEventObj   -> RolePPTemporal
"ni" + SemanticHumanObj   -> RoleNPDative
```

Lookup applies `RelaxCoord` to the NP's semantic flags first, then falls back to coord=0.
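
The "ni" cases above can be sketched in Go as a small table lookup with relaxation. The role and flag names come from the examples, but their numeric values, the flag bit positions, and the `particleRole` helper are assumptions made for illustration.

```go
package main

import "fmt"

type Role int

const (
	RoleNPDative Role = iota
	RolePPLocative
	RolePPTemporal
)

// Assumed bit positions for three object-side semantic flags.
const (
	SemanticHumanObj uint16 = 1 << 0
	SemanticPlaceObj uint16 = 1 << 6
	SemanticEventObj uint16 = 1 << 12
)

// niRoles holds the roles registered for "ni"; key 0 is the default.
var niRoles = map[uint16]Role{
	0:                RoleNPDative,
	SemanticPlaceObj: RolePPLocative,
	SemanticEventObj: RolePPTemporal,
	SemanticHumanObj: RoleNPDative,
}

// particleRole drops semantic bits from MSB to LSB until an entry matches,
// ending at the context-free default.
func particleRole(table map[uint16]Role, sem uint16) Role {
	for b := 15; b >= 0; b-- {
		if r, ok := table[sem]; ok {
			return r
		}
		sem &^= 1 << uint(b)
	}
	return table[0]
}

func main() {
	fmt.Println(particleRole(niRoles, SemanticPlaceObj)) // locative
	fmt.Println(particleRole(niRoles, 0))                // default dative
}
```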

---

## Semantic propagation

Progressive semantic flag diffusion via subject-verb co-occurrence:

```
transdb propagate -labels <tsv> -ja <corpus>... [-passes N]
```

Extracts subject-verb pairs from the JA corpus and propagates semantic flags bidirectionally across the pair graph. Checkpoints are written every N updates; runs are resume-friendly and stop-file controlled.

---

## Tokenizers

```moxie
TokenizeEN(text string) []string
TokenizeJA(text string, tree, verbose) []string
```

`TokenizeJA`: forward maximum-match against the JA lattice. Branch search order adapts to the preceding token's POS. Progressive compounds are collapsed to single tokens via morph-coord lookup and verbStem fallback.
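
Forward maximum-match itself is a small algorithm. The Go sketch below uses a plain set as a stand-in for the JA lattice and omits the POS-adaptive branch order and the progressive-compound collapsing; `tokenizeMaxMatch` and its `maxLen` parameter are hypothetical.

```go
package main

import "fmt"

// tokenizeMaxMatch scans left to right, taking at each position the longest
// dictionary entry of up to maxLen runes; unknown text falls back to single
// runes so the scan always advances.
func tokenizeMaxMatch(text string, dict map[string]bool, maxLen int) []string {
	if maxLen < 1 {
		maxLen = 1
	}
	runes := []rune(text)
	var out []string
	for i := 0; i < len(runes); {
		n := maxLen
		if rest := len(runes) - i; rest < n {
			n = rest
		}
		for ; n > 1; n-- {
			if dict[string(runes[i:i+n])] {
				break
			}
		}
		out = append(out, string(runes[i:i+n]))
		i += n
	}
	return out
}

func main() {
	dict := map[string]bool{"私": true, "は": true, "寿司": true, "を": true, "食べる": true}
	fmt.Println(tokenizeMaxMatch("私は寿司を食べる", dict, 4))
}
```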

---

## Language detection

Two detection modes:

**Fast (detect.mx):** 5% Japanese script threshold (hiragana/katakana/CJK).

**Trigram models (langdetect/):** Character trigram profiles trained from corpus. 300 trigrams per model, cosine similarity scoring, 0.90 confidence threshold. Language-agnostic: supports EN, JA, KO, ZH, and any other language with a training corpus.
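
Both modes reduce to short computations, sketched here in Go. The function names are hypothetical; the fast check assumes the ratio is taken over non-space characters, and the trigram part shows only raw counting and cosine scoring, not the top-300 truncation or the 0.90 threshold logic.

```go
package main

import (
	"fmt"
	"math"
	"unicode"
)

// japaneseRatio returns the share of non-space runes that are hiragana,
// katakana, or Han (CJK) characters.
func japaneseRatio(text string) float64 {
	var ja, total int
	for _, r := range text {
		if unicode.IsSpace(r) {
			continue
		}
		total++
		if unicode.In(r, unicode.Hiragana, unicode.Katakana, unicode.Han) {
			ja++
		}
	}
	if total == 0 {
		return 0
	}
	return float64(ja) / float64(total)
}

// trigramProfile counts overlapping character trigrams.
func trigramProfile(text string) map[string]float64 {
	rs := []rune(text)
	p := make(map[string]float64)
	for i := 0; i+3 <= len(rs); i++ {
		p[string(rs[i:i+3])]++
	}
	return p
}

// cosine returns the cosine similarity of two trigram profiles.
func cosine(a, b map[string]float64) float64 {
	var dot, na, nb float64
	for k, v := range a {
		dot += v * b[k]
		na += v * v
	}
	for _, v := range b {
		nb += v * v
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	fmt.Println(japaneseRatio("これは test") >= 0.05)
	en := trigramProfile("the quick brown fox jumps over the lazy dog")
	fmt.Printf("%.2f\n", cosine(en, trigramProfile("the dog jumps over the fox")))
}
```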

---

## Fuzzy matching

Length-bucketed word index with Damerau-Levenshtein distance. `DualIndex` holds both language indices for bidirectional fuzzy lookup.
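
A minimal Go sketch of the idea follows. It uses the optimal string alignment variant of Damerau-Levenshtein (adjacent transpositions, no substring edited twice), which may differ from the variant the project uses, and a plain `map[int][]string` as the length-bucketed index. All names are hypothetical.

```go
package main

import "fmt"

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// osaDistance computes the optimal-string-alignment edit distance over runes.
func osaDistance(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	d := make([][]int, len(ra)+1)
	for i := range d {
		d[i] = make([]int, len(rb)+1)
		d[i][0] = i
	}
	for j := range d[0] {
		d[0][j] = j
	}
	for i := 1; i <= len(ra); i++ {
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			d[i][j] = min3(d[i-1][j]+1, d[i][j-1]+1, d[i-1][j-1]+cost)
			if i > 1 && j > 1 && ra[i-1] == rb[j-2] && ra[i-2] == rb[j-1] {
				if t := d[i-2][j-2] + 1; t < d[i][j] {
					d[i][j] = t // adjacent transposition
				}
			}
		}
	}
	return d[len(ra)][len(rb)]
}

// buildIndex buckets words by rune length.
func buildIndex(words []string) map[int][]string {
	idx := make(map[int][]string)
	for _, w := range words {
		n := len([]rune(w))
		idx[n] = append(idx[n], w)
	}
	return idx
}

// fuzzy scans only buckets whose length is within maxDist of the query.
func fuzzy(idx map[int][]string, word string, maxDist int) []string {
	n := len([]rune(word))
	var out []string
	for l := n - maxDist; l <= n+maxDist; l++ {
		for _, cand := range idx[l] {
			if osaDistance(word, cand) <= maxDist {
				out = append(out, cand)
			}
		}
	}
	return out
}

func main() {
	idx := buildIndex([]string{"the", "then", "cat"})
	fmt.Println(fuzzy(idx, "teh", 1))
}
```

Bucketing by length works because two words whose lengths differ by more than `maxDist` cannot be within that edit distance.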

---

## Ingest pipeline

```
transdb load -jmdict <path> [-kanjidic <path>] -o <dir>
```

For each JMdict entry:

- insert the base form
- insert reading aliases
- insert EN glosses with `Link[0]`
- generate conjugations (16 morph states)
- register the verb class
- generate synthetic EN forms (past/progressive/3sg/negative for 70 irregular verbs + regular patterns)

```
transdb extend -en <file> -ja <file> [-db <dir>]
```

Extracts bilingual pairs from aligned corpora via PMI-weighted co-occurrence.
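
For reference, pointwise mutual information over aligned sentence pairs is `log2(p(e,j) / (p(e) p(j)))`. The Go sketch below computes it from sentence-level co-occurrence counts; it assumes pre-tokenized input and counts each word at most once per sentence, and the names `pair`, `uniq`, and `pmiScores` are hypothetical. How `transdb extend` thresholds or weights these scores is not shown here.

```go
package main

import (
	"fmt"
	"math"
)

type pair struct{ EN, JA string }

// uniq removes repeated words so each counts once per sentence.
func uniq(ws []string) []string {
	seen := make(map[string]bool, len(ws))
	var out []string
	for _, w := range ws {
		if !seen[w] {
			seen[w] = true
			out = append(out, w)
		}
	}
	return out
}

// pmiScores returns log2(p(e,j) / (p(e) * p(j))) for every EN/JA word pair
// that co-occurs in at least one aligned sentence pair.
func pmiScores(en, ja [][]string) map[pair]float64 {
	enN, jaN := map[string]float64{}, map[string]float64{}
	both := map[pair]float64{}
	n := len(en)
	if len(ja) < n {
		n = len(ja)
	}
	for i := 0; i < n; i++ {
		es, js := uniq(en[i]), uniq(ja[i])
		for _, e := range es {
			enN[e]++
		}
		for _, j := range js {
			jaN[j]++
		}
		for _, e := range es {
			for _, j := range js {
				both[pair{e, j}]++
			}
		}
	}
	out := make(map[pair]float64, len(both))
	for p, c := range both {
		out[p] = math.Log2(c * float64(n) / (enN[p.EN] * jaN[p.JA]))
	}
	return out
}

func main() {
	en := [][]string{{"cat"}, {"dog"}, {"cat"}, {"dog"}}
	ja := [][]string{{"猫"}, {"犬"}, {"猫"}, {"犬"}}
	fmt.Println(pmiScores(en, ja)[pair{"cat", "猫"}])
}
```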

```
transdb consolidate [-db <dir>] [-o <dir>]
```

Rebuilds the lattice and drops redundant morph-coord records whose forms the inflection engine can reconstruct on demand.

---

## Commands

```
# database creation
transdb load -jmdict <path> [-kanjidic <path>] -o <dir>
transdb lang-init -lang <en|ja> [-db <dir>]
transdb consolidate [-db <dir>] [-o <dir>]

# translation
transdb translate -src <en|ja> -dst <en|ja> [-cluster] [-fuzzy] [-v] <text>
transdb detect <text>

# testing
transdb roundtrip <en-word> [-db <dir>]
transdb roundtrip-ja <ja-word> [-db <dir>]
transdb roundtrip-test [-out <tsv>] [-limit N] [-db <dir>]

# semantic labeling
transdb semantic-label -labels <tsv> [-db <dir>]
transdb propagate -labels <tsv> -ja <file>... [-passes N] [-db <dir>]
transdb apply-checkpoint [-checkpoint <file>] [-db <dir>]

# extension & reranking
transdb extend -en <file> -ja <file> [-db <dir>]
transdb rerank -jmdict <path> [-db <dir>]
transdb apply-overrides -overrides <tsv> [-db <dir>]

# analysis
transdb stats [-db <dir>]
transdb morph-stats [-db <dir>]
transdb branch-count [-db <dir>]
transdb debug <word> [en] [-db <dir>]
transdb wordlist -lang <en|ja> -o <file> [-db <dir>]
transdb posseq [-en <file>] [-ja <file>] [-n N] [-db <dir>]

# utilities
transdb tatoeba-join
transdb langdetect-train -lang <code> <corpus>
```

---

## License

Licensed under [AGPL-3.0-or-later](LICENSE).