# Cluster-based Phrase Transformation Plan

## Architecture

Replace token-by-token translation with a four-stage pipeline:

```
source tokens
  → ParseClusters     (phrase segmentation, particle/position mode)
  → TranslateCluster  (semantic-coord lookup per cluster)
  → ReorderClusters   (canonical order from language descriptor)
  → InsertMarkers     (particles, prepositions, copula)
  → string output
```

All stages are driven by language descriptor records in the lattice. Adding a new language means registering a descriptor record and a particle role map. No code changes required.

---

## Step 1: Language descriptor records

New file: `transdb/langdesc.mx`

Record at `MakeKey(lang, 0, "")` (empty word, lang-scoped):

```
Branch bits 0-2:   canonical order  (SVO=0, SOV=1, VSO=2, VOS=3, OVS=4, OSV=5)
Branch bit 3:      head direction   (initial=0, final=1)
Branch bit 4:      parsing mode     (position-bounded=0, particle-bounded=1)
Branch bit 5:      relative clause  (post-nominal=0, pre-nominal=1)
Branch bit 6:      copula           (overt=0, zero-copula=1)
DataFile bits 6-7: marker system    (prepositional=0, postpositional=1, case=2)
```

| Language | Order | Head | Mode | RC | Copula | Markers |
|----------|-------|------|------|----|--------|---------|
| EN | SVO | initial | position | post | overt | prep |
| JA | SOV | final | particle | pre | zero | postpos |
| KO | SOV | final | particle | pre | overt | postpos |

New command: `transdb lang-init -lang `

---

## Step 2: Particle role map records

Particle roles are stored in branch 2 (Bcooccur, currently empty, repurposed as a role registry) at `MakeKey(lang, NP_semantic_coord, particle_form)`.
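
One way to picture this registry is as a map keyed on language, NP semantic coord, and particle form. The Go sketch below is a stand-in for the project's `.mx` code, not its actual implementation. `RoleMap`, `roleKey`, the flag bit positions, the `LangJA` value, and the two に role names used in the usage note are hypothetical. The exact-match-then-coord-0 fallback is a deliberately simplified substitute for RelaxCoord, whose real behavior is not specified here.

```go
package main

// SemFlag stands in for the NP semantic coord. The bit positions are
// assumptions made for this sketch only.
type SemFlag uint32

const (
	SemNone  SemFlag = 0
	SemPlace SemFlag = 1 << 0
	SemHuman SemFlag = 1 << 1
)

// LangJA is a hypothetical language id.
const LangJA uint8 = 1

// roleKey mirrors the shape of MakeKey(lang, NP_semantic_coord, particle_form).
type roleKey struct {
	lang     uint8
	coord    SemFlag
	particle string
}

// RoleMap is an in-memory stand-in for the branch 2 role registry.
type RoleMap map[roleKey]string

// Register stores one particle role record.
func (m RoleMap) Register(lang uint8, coord SemFlag, particle, role string) {
	m[roleKey{lang, coord, particle}] = role
}

// Lookup tries the coord-keyed entry first (ambiguous particles), then
// falls back to the coord=0 entry (non-ambiguous particles).
func (m RoleMap) Lookup(lang uint8, particle string, npFlags SemFlag) (string, bool) {
	if role, ok := m[roleKey{lang, npFlags, particle}]; ok {
		return role, true
	}
	role, ok := m[roleKey{lang, SemNone, particle}]
	return role, ok
}
```

With を registered at coord 0 and に registered twice (once under `SemPlace`, once under `SemHuman`, with placeholder role names), `Lookup` returns the same role for を regardless of the NP's flags, while に resolves differently for a place NP and a human NP.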

Non-ambiguous particles at coord=0:

```
MakeKey(LangJA, 0, "は")   → role = NP_subj_topic
MakeKey(LangJA, 0, "が")   → role = NP_subj_gram
MakeKey(LangJA, 0, "を")   → role = NP_obj_direct
MakeKey(LangJA, 0, "から") → role = PP_source
MakeKey(LangJA, 0, "まで") → role = PP_limit
MakeKey(LangJA, 0, "と")   → role = PP_comitative
```

Ambiguous particles with coord-keyed entries (RelaxCoord resolves via NP semantic flags):

```
MakeKey(LangJA, SemanticPlace< 0:
        // Finite verb inside accumulation = relative clause signal
        // (i-adjectives are Bmodifier, never trigger here)
        acc.append(tok)   // defer: let particle boundary decide
    else:
        acc.append(tok)

// Trailing tokens = predicate zone
if len(acc) > 0:
    hasCopula := !containsFiniteVerb(acc)   // no Bverb = copular sentence
    emit Cluster{tokens: acc, Kind: ClusterVP, Copular: hasCopula}
```

### Copula detection: single rule

`hasCopula = (no Bverb token in predicate zone)`

This covers all cases:

- 東京は大きい都市だ: だ is jaFunctionWord, no Bverb → copular ✓
- 東京は大きい: no Bverb, 大きい is Bmodifier → copular ✓
- 東京は都市だ: no Bverb (だ is filtered) → copular ✓
- 猫が走る: 走る is Bverb → NOT copular ✓

だ/です at sentence-final position: already in jaFunctionWord, silently consumed by the particle/function-word detection. Copularity is signaled by the absence of a Bverb, not by the presence of だ.

### Position-bounded parser (EN, ZH, FR), migration step 7

1. Find main verbs (IsENVerb equivalent: Bverb in the EN lattice)
2. Pre-verbal contiguous nominals → NP_subj candidates
3. Post-verbal contiguous nominals → NP_obj candidates
4. Prepositions mark PP boundaries within groups
5. Relative markers (that/which/who) or bare participles → recurse

---

## Step 4: Cluster translation

`TranslateCluster(c *Cluster, tree, pool, srcLang, dstLang uint8) string`

1. Identify the head word (last for a head-final source, first for a head-initial source)
2. Compute the lookup coord: `PackCoord(c.Flags, 0, 0, morphstate, 0, 0, 0)` (cluster semantic flags in the semantic axis, morphstate from the verb form)
3. Look up the head word in the target language via RelaxCoord, which finds the best semantic-coord match
4. Translate modifiers (they inherit the parent cluster's semantic context)
5. Assemble per the target descriptor's head-direction bit

VP clusters: the morphstate from the source verb's conjugation flows into the coord, so the target verb is found at the matching morph coord automatically.

---

## Step 5: Cluster reordering

`ReorderClusters(clusters []*Cluster, srcDesc, dstDesc LangDesc) []*Cluster`

Pure rearrangement, no translation. Maps source cluster roles to target canonical positions:

```
srcOrder = {NP_subj:0, VP:1, NP_obj:2}   // SVO
dstOrder = {NP_subj:0, NP_obj:1, VP:2}   // SOV
```

Bare modifiers (ClusterMod) attach to the nearest semantically compatible cluster by flag overlap.

---

## Step 6: Marker insertion

`InsertMarkers(clusters []*Cluster, tree, dstDesc LangDesc, dstLang uint8) string`

For each cluster in target order:

- Particle-bounded target (JA output): append postposition after cluster
  - `LookupTargetMarker(dstLang, cluster.Role, cluster.Flags)` → は/を/に/で
- Position-bounded target (EN output): prepend preposition before PP clusters
  - `LookupTargetMarker(dstLang, cluster.Role, cluster.Flags)` → in/to/from/with
- Copular sentence (`cluster.Copular == true`):
  - overt-copula target (EN): insert "is/are" between NP_subj and predicate cluster
  - zero-copula target (JA): omit copula, emit predicate directly

---

## File structure

| File | Lines | Purpose |
|------|-------|---------|
| `transdb/langdesc.mx` | ~120 | LangDesc struct, GetLangDesc, RegisterLangDesc, encoding constants |
| `transdb/cluster.mx` | ~350 | Cluster struct, ParseClusters (both modes + recursion), TranslateCluster, ReorderClusters, InsertMarkers |
| `transdb/translate.mx` | ~30 changed | Replace token loop in Translate() with cluster pipeline; keep old path as fallback |
| `cmd/transdb/main.mx` | ~120 added | lang-init command (descriptor + particle map registration) |

Total new code: ~620 lines.

---

## Migration sequence

1. Implement `lang-init`, register EN + JA descriptors + particle role maps
2. Implement `ParseClusters`, particle-bounded only (JA→EN direction first)
3. Implement `TranslateCluster`, which reuses the existing coord lookup machinery
4. Implement `ReorderClusters` as pure array rearrangement
5. Implement `InsertMarkers` for postposition/preposition/copula insertion
6. Gate behind `-cluster` flag; run quality sample JA→EN vs current token-by-token
7. Add position-bounded parser for EN→JA direction
8. Remove flag when cluster path matches or exceeds token-by-token quality

---

## Test cases

### JA→EN (steps 1-6)

| Source | Expected | Tests |
|--------|----------|-------|
| 猫が魚を食べた | cat ate fish | Basic SOV→SVO, object positioning |
| 彼は東京に行った | he went to Tokyo | Locative に via SemanticPlace coord |
| 友達にあげた | gave to friend | Dative に via SemanticHuman coord |
| 食べている鳥が鳴いた | the eating bird chirped | RC (Bverb inside NP) + SemanticAnim→鳴く |
| 東京は大きい都市だ | Tokyo is a big city | Copular: no Bverb, overt copula in EN |
| 東京は大きい | Tokyo is big | Copular: predicate adjective, no だ |

### EN→JA (step 7)

| Source | Expected | Tests |
|--------|----------|-------|
| the bird sang | 鳥は鳴いた | SemanticAnim subject → 鳴く over 歌う |
| she sang beautifully | 彼女は美しく歌った | SemanticHuman subject → 歌う |
| he went to Tokyo | 彼は東京に行った | Locative PP → postposition に |

---

## Key invariants

- Cluster parsing runs on `TokenizeJA` output: the morph-coord tokenizer has already collapsed progressives (食べている = single token). Never parse raw text.
- RC detection uses `jaRecordBranch == Bverb` only. Bmodifier adjectives never trigger it.
- Copula detection: `no Bverb in predicate zone` is the single rule. だ/です suppression is a consequence, not the cause.
- Ambiguous particle role resolution: `LookupParticleRole(lang, particle, NP_flags)`, i.e. RelaxCoord on the NP's semantic coord. Same mechanism as verb disambiguation.
- Three-level coord propagation (embedded VP → head noun → main VP) is just recursion of the same cluster mechanism. No special-case machinery.
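
As a closing illustration of how a descriptor record can drive a stage without language-specific code, the sketch below reads the canonical-order bits (Branch bits 0-2 from Step 1) and uses them to rearrange clusters as in Step 5. It is written in Go as a stand-in for the project's `.mx` sources and is not the planned `ReorderClusters` implementation: `reorderSketch`, the `Role` constants, the `canonical` table, and the reduced `Cluster` struct are hypothetical, and PP and modifier clusters are left out.

```go
package main

import "sort"

// Role labels a cluster's grammatical function (hypothetical names for
// the plan's NP_subj, VP, and NP_obj roles).
type Role uint8

const (
	RoleSubj Role = iota
	RoleVerb
	RoleObj
)

// Canonical order values, matching Branch bits 0-2 in Step 1.
const (
	SVO uint8 = iota
	SOV
	VSO
	VOS
	OVS
	OSV
)

// canonical lists the roles in surface order for each order value.
var canonical = [6][3]Role{
	SVO: {RoleSubj, RoleVerb, RoleObj},
	SOV: {RoleSubj, RoleObj, RoleVerb},
	VSO: {RoleVerb, RoleSubj, RoleObj},
	VOS: {RoleVerb, RoleObj, RoleSubj},
	OVS: {RoleObj, RoleVerb, RoleSubj},
	OSV: {RoleObj, RoleSubj, RoleVerb},
}

// Cluster is a minimal stand-in: a role plus already-translated text.
type Cluster struct {
	Role Role
	Text string
}

// reorderSketch returns a copy of clusters arranged in the canonical order
// encoded in the target descriptor's Branch bits 0-2. Unassigned order
// values (6 and 7) leave the input order unchanged. Only the three roles
// declared above are supported.
func reorderSketch(clusters []Cluster, dstBranch uint8) []Cluster {
	out := append([]Cluster(nil), clusters...)
	order := dstBranch & 0x07
	if order > OSV {
		return out
	}
	var rank [3]int
	for pos, role := range canonical[order] {
		rank[role] = pos
	}
	// Stable sort keeps the relative order of clusters that share a role.
	sort.SliceStable(out, func(i, j int) bool {
		return rank[out[i].Role] < rank[out[j].Role]
	})
	return out
}
```

Feeding the translated clusters of 猫が魚を食べた (subject "cat", object "fish", verb "ate") through `reorderSketch` with an EN descriptor whose order bits are SVO yields the sequence cat, ate, fish, which matches the first JA→EN test case. The input slice is copied, so the source-order clusters remain available for a fallback path.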