Replace token-by-token translation with a five-stage pipeline:
source tokens
→ ParseClusters (phrase segmentation, particle/position mode)
→ TranslateCluster (semantic-coord lookup per cluster)
→ ReorderClusters (canonical order from language descriptor)
→ InsertMarkers (particles, prepositions, copula)
→ string output
All stages driven by language descriptor records in the lattice. Adding a new language = registering a descriptor record + particle role map. No code changes required.
New file: transdb/langdesc.mx
Record at MakeKey(lang, 0, "") — empty word, lang-scoped:
Branch bits 0-2: canonical order (SVO=0, SOV=1, VSO=2, VOS=3, OVS=4, OSV=5)
Branch bit 3: head direction (initial=0, final=1)
Branch bit 4: parsing mode (position-bounded=0, particle-bounded=1)
Branch bit 5: relative clause (post-nominal=0, pre-nominal=1)
Branch bit 6: copula (overt=0, zero-copula=1)
DataFile bits 6-7: marker system (prepositional=0, postpositional=1, case=2)
| Language | Order | Head | Mode | RC | Copula | Markers |
|---|---|---|---|---|---|---|
| EN | SVO | initial | position | post | overt | prep |
| JA | SOV | final | particle | pre | zero | postpos |
| KO | SOV | final | particle | pre | overt | postpos |
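The bit layout above can be sketched as a small encode/decode pair. This is only an illustration under stated assumptions: the `LangDesc` field names, the `flag` helper, and the treatment of Branch and DataFile as 8-bit values are hypothetical and are not taken from `langdesc.mx`.

```go
package main

import "fmt"

// Canonical-order codes, as listed for Branch bits 0-2.
const (
	OrderSVO uint8 = iota
	OrderSOV
	OrderVSO
	OrderVOS
	OrderOVS
	OrderOSV
)

// Marker-system codes, as listed for DataFile bits 6-7.
const (
	MarkersPrep uint8 = iota
	MarkersPostpos
	MarkersCase
)

// LangDesc is a hypothetical decoded view of one descriptor record.
type LangDesc struct {
	Order        uint8
	HeadFinal    bool
	ParticleMode bool
	PreNominalRC bool
	ZeroCopula   bool
	Markers      uint8
}

// flag returns a byte with bit pos set when b is true.
func flag(b bool, pos uint) uint8 {
	if b {
		return uint8(1) << pos
	}
	return 0
}

// Encode packs the descriptor into (assumed 8-bit) Branch and DataFile values.
func (d LangDesc) Encode() (branch, dataFile uint8) {
	branch = d.Order&0x07 |
		flag(d.HeadFinal, 3) |
		flag(d.ParticleMode, 4) |
		flag(d.PreNominalRC, 5) |
		flag(d.ZeroCopula, 6)
	dataFile = (d.Markers & 0x03) << 6
	return branch, dataFile
}

// DecodeLangDesc reverses Encode.
func DecodeLangDesc(branch, dataFile uint8) LangDesc {
	return LangDesc{
		Order:        branch & 0x07,
		HeadFinal:    branch&(1<<3) != 0,
		ParticleMode: branch&(1<<4) != 0,
		PreNominalRC: branch&(1<<5) != 0,
		ZeroCopula:   branch&(1<<6) != 0,
		Markers:      dataFile >> 6,
	}
}

func main() {
	// The JA row of the table: SOV, head-final, particle mode, pre-nominal RC,
	// zero copula, postpositional markers.
	ja := LangDesc{Order: OrderSOV, HeadFinal: true, ParticleMode: true,
		PreNominalRC: true, ZeroCopula: true, Markers: MarkersPostpos}
	branch, dataFile := ja.Encode()
	fmt.Printf("JA branch=%08b dataFile=%08b\n", branch, dataFile)
	fmt.Printf("%+v\n", DecodeLangDesc(branch, dataFile))
}
```

With this layout the EN row encodes to all zeros, so an unregistered language would be indistinguishable from English unless record presence is checked separately.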
New command: transdb lang-init -lang <en|ja|ko|...>
Particle roles stored in branch 2 (Bcooccur, currently empty — repurposed as role registry) at MakeKey(lang, NP_semantic_coord, particle_form).
Non-ambiguous particles at coord=0:
MakeKey(LangJA, 0, "は") → role = NP_subj_topic
MakeKey(LangJA, 0, "が") → role = NP_subj_gram
MakeKey(LangJA, 0, "を") → role = NP_obj_direct
MakeKey(LangJA, 0, "から") → role = PP_source
MakeKey(LangJA, 0, "まで") → role = PP_limit
MakeKey(LangJA, 0, "と") → role = PP_comitative
Ambiguous particles with coord-keyed entries (RelaxCoord resolves via NP semantic flags):
MakeKey(LangJA, SemanticPlace<<CoordSemanticShift, "に") → role = PP_locative
MakeKey(LangJA, SemanticEvent<<CoordSemanticShift, "に") → role = PP_temporal
MakeKey(LangJA, SemanticHumanObj<<CoordSemanticShift, "に") → role = NP_dative
MakeKey(LangJA, 0, "に") → role = NP_dative (default)
MakeKey(LangJA, SemanticPlace<<CoordSemanticShift, "で") → role = PP_locative_static
MakeKey(LangJA, 0, "で") → role = PP_instrumental
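The coord-keyed lookup with fallback to coord 0 might look roughly like the sketch below. It is a simplification, not the lattice code: an in-memory map stands in for branch 2, a single `semClass` value stands in for the shifted semantic coord, and `roleKey`, `LangJA = 1`, and the `Sem*` constants are hypothetical. The real `RelaxCoord` works on NP semantic flags and may relax in more steps than this one.

```go
package main

import "fmt"

// roleKey loosely mirrors MakeKey(lang, coord, particle). semClass 0 is the
// default (coord=0) entry. All names here are illustrative assumptions.
type roleKey struct {
	lang     uint8
	semClass uint8
	particle string
}

const LangJA uint8 = 1 // hypothetical language code

const (
	SemNone uint8 = iota
	SemPlace
	SemEvent
	SemHumanObj
)

// roles holds a subset of the JA entries listed above.
var roles = map[roleKey]string{
	{LangJA, SemNone, "は"}:     "NP_subj_topic",
	{LangJA, SemNone, "が"}:     "NP_subj_gram",
	{LangJA, SemNone, "を"}:     "NP_obj_direct",
	{LangJA, SemPlace, "に"}:    "PP_locative",
	{LangJA, SemEvent, "に"}:    "PP_temporal",
	{LangJA, SemHumanObj, "に"}: "NP_dative",
	{LangJA, SemNone, "に"}:     "NP_dative",
	{LangJA, SemPlace, "で"}:    "PP_locative_static",
	{LangJA, SemNone, "で"}:     "PP_instrumental",
}

// LookupParticleRole tries the coord-keyed entry first, then relaxes to the
// coord=0 default. The second result reports whether any entry was found.
func LookupParticleRole(lang uint8, particle string, semClass uint8) (string, bool) {
	if r, ok := roles[roleKey{lang, semClass, particle}]; ok {
		return r, true
	}
	r, ok := roles[roleKey{lang, SemNone, particle}]
	return r, ok
}

func main() {
	for _, sem := range []uint8{SemPlace, SemEvent, SemNone} {
		role, _ := LookupParticleRole(LangJA, "に", sem)
		fmt.Println("に", sem, role)
	}
}
```

Keeping the default at coord 0 means an NP with no recognised semantic class still resolves to a role, which matches the "(default)" entry for に above.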
New file: transdb/cluster.mx
type ClusterType uint8
const (
    ClusterNPSubj ClusterType = 0
    ClusterNPObj  ClusterType = 1
    ClusterVP     ClusterType = 2
    ClusterPP     ClusterType = 3
    ClusterMod    ClusterType = 4
)
type Cluster struct {
    Kind    ClusterType
    Tokens  []string
    Flags   uint64     // accumulated semantic flags
    Role    uint8      // particle role code
    Nested  []*Cluster // relative clause VP if present
    Trans   string     // filled by TranslateCluster
    Copular bool       // sentence is copular (no Bverb in predicate zone)
}
RC signal check: jaRecordBranch(tree, tok) == uint8(lattice.Bverb) — Bverb (branch 3) only.
大きい猫 stays a modified NP because 大きい is Bmodifier, not Bverb.
ParseClusters assumes TokenizeJA has already run and collapsed progressive compounds. 食べている is produced as a SINGLE Bverb token by the morph-coord tokenizer (inferMorphState("食べている") → MorphPresProgPlain=8, coord-keyed lattice record found). The cluster parser never sees a split て-form. This dependency must be maintained — cluster parsing must run on already-tokenized input, not raw text.
acc := []
for each token:
    if isParticle(token, lang):
        role := LookupParticleRole(lang, token, SemanticFlagsOf(acc))
        // A Bverb token inside acc means the NP carries a relative clause
        if hasFiniteVerb(acc):
            vpTokens, headTokens := splitAtVerbBoundary(acc)
            nested := ParseClusters(vpTokens, tree, lang) // recurse
            emit Cluster{Tokens: headTokens, Role: role, Nested: nested}
        else:
            emit Cluster{Tokens: acc, Role: role}
        acc = []
    else:
        // Includes Bverb tokens (jaRecordBranch(tree, token) == Bverb): a finite
        // verb inside the accumulation is the relative clause signal, but the
        // decision is deferred to the next particle boundary.
        // i-adjectives are Bmodifier and never trigger it.
        acc.append(token)
// Trailing tokens = predicate zone
if len(acc) > 0:
    copular := !hasFiniteVerb(acc) // no Bverb = copular sentence
    emit Cluster{Tokens: acc, Kind: ClusterVP, Copular: copular}
Copular = (no Bverb token in predicate zone)
This covers all cases:
だ/です at sentence-final position: already in jaFunctionWord, silently consumed by the particle/function-word detection. Copularity is signaled by absence of Bverb, not presence of だ.
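As a rough illustration of the particle-bounded mode, including the relative-clause split and the copula rule, here is a toy version in Go. It is a sketch under heavy assumptions: hard-coded maps (`particles`, `verbs`, `functionWords`) stand in for lattice lookups such as `jaRecordBranch`, roles are strings rather than `uint8` codes, and the `tree` and `lang` parameters are dropped.

```go
package main

import "fmt"

type ClusterType uint8

const (
	ClusterNPSubj ClusterType = iota
	ClusterNPObj
	ClusterVP
)

type Cluster struct {
	Kind    ClusterType
	Tokens  []string
	Role    string
	Nested  []*Cluster
	Copular bool
}

type particle struct {
	kind ClusterType
	role string
}

// Toy lexicon standing in for the lattice (illustrative only).
var particles = map[string]particle{
	"は": {ClusterNPSubj, "NP_subj_topic"},
	"が": {ClusterNPSubj, "NP_subj_gram"},
	"を": {ClusterNPObj, "NP_obj_direct"},
}
var verbs = map[string]bool{"食べた": true, "食べている": true, "鳴いた": true}
var functionWords = map[string]bool{"だ": true, "です": true}

func hasFiniteVerb(toks []string) bool {
	for _, t := range toks {
		if verbs[t] {
			return true
		}
	}
	return false
}

// splitAtVerbBoundary returns everything up to and including the last verb
// (the relative clause) and the remaining tokens (the head noun phrase).
func splitAtVerbBoundary(acc []string) (vp, head []string) {
	last := -1
	for i, t := range acc {
		if verbs[t] {
			last = i
		}
	}
	return acc[:last+1], acc[last+1:]
}

// ParseClusters expects already-tokenized input, never raw text.
func ParseClusters(tokens []string) []*Cluster {
	var out []*Cluster
	var acc []string
	for _, tok := range tokens {
		if functionWords[tok] {
			continue // だ/です are consumed silently
		}
		p, isParticle := particles[tok]
		if !isParticle {
			acc = append(acc, tok) // verbs are deferred to the next boundary
			continue
		}
		c := &Cluster{Kind: p.kind, Tokens: acc, Role: p.role}
		if hasFiniteVerb(acc) {
			vp, head := splitAtVerbBoundary(acc)
			c.Tokens = head
			c.Nested = ParseClusters(vp)
		}
		out = append(out, c)
		acc = nil
	}
	if len(acc) > 0 { // trailing tokens form the predicate zone
		out = append(out, &Cluster{Kind: ClusterVP, Tokens: acc, Copular: !hasFiniteVerb(acc)})
	}
	return out
}

func main() {
	for _, c := range ParseClusters([]string{"食べている", "鳥", "が", "鳴いた"}) {
		fmt.Printf("%+v\n", *c)
	}
}
```

Note that 東京 は 大きい 都市 だ yields a copular predicate cluster here only because neither 大きい nor 都市 is in the verb set, which mirrors the absence-of-Bverb rule rather than any check for だ.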
TranslateCluster(c *Cluster, tree, pool, srcLang, dstLang uint8) string
PackCoord(c.Flags, 0, 0, morphstate, 0, 0, 0): cluster semantic flags in the semantic axis, morphstate from the verb form.
VP clusters: morphstate from the source verb's conjugation flows into the coord → target verb found at matching morph coord automatically.
ReorderClusters(clusters []*Cluster, srcDesc, dstDesc LangDesc) []*Cluster
Pure rearrangement, no translation. Maps source cluster roles to target canonical positions:
srcOrder = {NP_subj:0, VP:1, NP_obj:2} // SVO
dstOrder = {NP_subj:0, NP_obj:1, VP:2} // SOV
Bare modifiers (ClusterMod) attach to nearest semantically compatible cluster by flag overlap.
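One way to realise the rearrangement is a stable sort on each cluster's slot in the target order, sketched below. Several things here are assumptions rather than the planned design: the target order is passed as a string such as "SOV" instead of a `LangDesc`, PP clusters are treated as riding in the object slot, and `ClusterMod` attachment by flag overlap is not modelled. With a sort, the source descriptor is not needed at all.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type ClusterType uint8

const (
	ClusterNPSubj ClusterType = iota
	ClusterNPObj
	ClusterVP
	ClusterPP
)

type Cluster struct {
	Kind  ClusterType
	Trans string
}

// rank returns the position of a cluster's slot (S, V or O) in an order
// string such as "SVO". Assumption: anything that is not a subject or a
// verb, including PP clusters, shares the object slot.
func rank(k ClusterType, order string) int {
	letter := byte('O')
	switch k {
	case ClusterNPSubj:
		letter = 'S'
	case ClusterVP:
		letter = 'V'
	}
	return strings.IndexByte(order, letter)
}

// ReorderClusters returns a reordered copy; the input slice is untouched.
// The stable sort keeps clusters that share a slot in their source order.
func ReorderClusters(clusters []*Cluster, dstOrder string) []*Cluster {
	out := append([]*Cluster(nil), clusters...)
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i].Kind, dstOrder) < rank(out[j].Kind, dstOrder)
	})
	return out
}

func main() {
	// Source order is SOV (JA); clusters already carry their translations.
	src := []*Cluster{
		{ClusterNPSubj, "cat"},
		{ClusterNPObj, "fish"},
		{ClusterVP, "ate"},
	}
	for _, c := range ReorderClusters(src, "SVO") {
		fmt.Print(c.Trans, " ")
	}
	fmt.Println()
}
```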
InsertMarkers(clusters []*Cluster, tree, dstDesc LangDesc, dstLang uint8) string
For each cluster in target order:
- Postpositional target (JA): LookupTargetMarker(dstLang, cluster.Role, cluster.Flags) → は/を/に/で, emitted after the cluster
- Prepositional target (EN): LookupTargetMarker(dstLang, cluster.Role, cluster.Flags) → in/to/from/with, emitted before the cluster
- Copular sentence (cluster.Copular == true):
  - overt-copula target (EN): insert "is/are" between NP_subj and predicate cluster
  - zero-copula target (JA): omit copula, emit predicate directly
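The marker and copula rules can be illustrated with a small sketch. Everything in it is hypothetical scaffolding: the `Target` struct replaces `LangDesc` plus `LookupTargetMarker`, the role-to-marker tables are hand-written samples, roles are strings, and number agreement ("is" vs "are") is not modelled.

```go
package main

import (
	"fmt"
	"strings"
)

type Cluster struct {
	Role    string
	Trans   string
	Copular bool
}

// Target bundles the descriptor bits this stage needs (illustrative only).
type Target struct {
	Markers        map[string]string // role → marker; missing role = no marker
	Postpositional bool              // marker follows the cluster
	ZeroCopula     bool
	Copula         string
	Sep            string // token separator in the output string
}

// InsertMarkers expects clusters already in target order.
func InsertMarkers(clusters []Cluster, t Target) string {
	parts := make([]string, 0, len(clusters)+1)
	for _, c := range clusters {
		// Overt-copula targets get the copula right before the predicate.
		if c.Copular && !t.ZeroCopula {
			parts = append(parts, t.Copula)
		}
		marker := t.Markers[c.Role]
		switch {
		case marker == "":
			parts = append(parts, c.Trans)
		case t.Postpositional:
			parts = append(parts, c.Trans+marker)
		default:
			parts = append(parts, marker+" "+c.Trans)
		}
	}
	return strings.Join(parts, t.Sep)
}

var en = Target{
	Markers: map[string]string{"PP_locative": "to", "PP_source": "from", "PP_comitative": "with"},
	Copula:  "is",
	Sep:     " ",
}

var ja = Target{
	Markers: map[string]string{
		"NP_subj_topic": "は", "NP_obj_direct": "を",
		"PP_locative": "に", "PP_instrumental": "で",
	},
	Postpositional: true,
	ZeroCopula:     true,
}

func main() {
	fmt.Println(InsertMarkers([]Cluster{
		{Role: "NP_subj_topic", Trans: "Tokyo"},
		{Role: "VP", Trans: "big", Copular: true},
	}, en))
	fmt.Println(InsertMarkers([]Cluster{
		{Role: "NP_subj_topic", Trans: "彼"},
		{Role: "PP_locative", Trans: "東京"},
		{Role: "VP", Trans: "行った"},
	}, ja))
}
```

Placing the copula immediately before the predicate cluster only puts it "between NP_subj and predicate" when the reordered subject directly precedes the predicate; other target orders would need a position rule of their own.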
| File | Lines | Purpose |
|---|---|---|
| transdb/langdesc.mx | ~120 | LangDesc struct, GetLangDesc, RegisterLangDesc, encoding constants |
| transdb/cluster.mx | ~350 | Cluster struct, ParseClusters (both modes + recursion), TranslateCluster, ReorderClusters, InsertMarkers |
| transdb/translate.mx | ~30 changed | Replace token loop in Translate() with cluster pipeline; keep old path as fallback |
| cmd/transdb/main.mx | ~120 added | lang-init command (descriptor + particle map registration) |
Total new code: ~620 lines.
1. lang-init: register EN + JA descriptors + particle role maps
2. ParseClusters: particle-bounded only (JA→EN direction first)
3. TranslateCluster: reuses existing coord lookup machinery
4. ReorderClusters: pure array rearrangement
5. InsertMarkers: postposition/preposition/copula insertion
6. -cluster flag; run quality sample JA→EN vs current token-by-token

| Source | Expected | Tests |
|---|---|---|
| 猫が魚を食べた | cat ate fish | Basic SOV→SVO, object positioning |
| 彼は東京に行った | he went to Tokyo | Locative に via SemanticPlace coord |
| 友達にあげた | gave to friend | Dative に via SemanticHumanObj coord |
| 食べている鳥が鳴いた | the eating bird chirped | RC (Bverb inside NP) + SemanticAnim→鳴く |
| 東京は大きい都市だ | Tokyo is a big city | Copular: no Bverb, overt copula in EN |
| 東京は大きい | Tokyo is big | Copular: predicate adjective, no だ |

Reverse direction (EN→JA):

| Source | Expected | Tests |
|---|---|---|
| the bird sang | 鳥は鳴いた | SemanticAnim subject → 鳴く over 歌う |
| she sang beautifully | 彼女は美しく歌った | SemanticHuman subject → 歌う |
| he went to Tokyo | 彼は東京に行った | Locative PP → postposition に |
- Input: TokenizeJA output. The morph-coord tokenizer has already collapsed progressives (食べている = single token). Never parse raw text.
- RC signal: jaRecordBranch == Bverb only. Bmodifier adjectives never trigger it.
- Copula: no Bverb in predicate zone is the single rule. だ/です suppression is a consequence, not the cause.
- Particle disambiguation: LookupParticleRole(lang, particle, NP_flags) runs RelaxCoord on the NP's semantic coord. Same mechanism as verb disambiguation.