Cluster-based Phrase Transformation Plan

Architecture

Replace token-by-token translation with a four-stage pipeline:

source tokens
    → ParseClusters         (phrase segmentation, particle/position mode)
    → TranslateCluster      (semantic-coord lookup per cluster)
    → ReorderClusters       (canonical order from language descriptor)
    → InsertMarkers         (particles, prepositions, copula)
    → string output

All stages are driven by language descriptor records in the lattice. Adding a new language means registering a descriptor record and a particle role map; no code changes are required.

Step 1 — Language descriptor records

New file: transdb/langdesc.mx

Record at MakeKey(lang, 0, "") — empty word, lang-scoped:

Branch bits 0-2: canonical order  (SVO=0, SOV=1, VSO=2, VOS=3, OVS=4, OSV=5)
Branch bit 3:    head direction   (initial=0, final=1)
Branch bit 4:    parsing mode     (position-bounded=0, particle-bounded=1)
Branch bit 5:    relative clause  (post-nominal=0, pre-nominal=1)
Branch bit 6:    copula           (overt=0, zero-copula=1)
DataFile bits 6-7: marker system (prepositional=0, postpositional=1, case=2)

Language	Order	Head	Mode	RC	Copula	Markers
EN	SVO	initial	position	post	overt	prep
JA	SOV	final	particle	pre	zero	postpos
KO	SOV	final	particle	pre	overt	postpos
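The bit layout above can be sketched as a pack/unpack pair. This is a minimal illustration, assuming the branch and DataFile fields are each at least one byte wide; the LangDesc field names, the Pack/Unpack helpers and the constant names are hypothetical and not taken from the existing code.

```go
package main

import "fmt"

// Word-order codes for branch bits 0-2, as listed in the plan.
const (
	OrderSVO uint8 = iota
	OrderSOV
	OrderVSO
	OrderVOS
	OrderOVS
	OrderOSV
)

// Marker-system codes for DataFile bits 6-7, as listed in the plan.
const (
	MarkersPrep uint8 = iota
	MarkersPostpos
	MarkersCase
)

// LangDesc is a hypothetical in-memory view of one descriptor record.
type LangDesc struct {
	Order           uint8
	HeadFinal       bool
	ParticleBounded bool
	PreNominalRC    bool
	ZeroCopula      bool
	Markers         uint8
}

// bit returns a byte with bit pos set when b is true.
func bit(b bool, pos uint) uint8 {
	if b {
		return uint8(1) << pos
	}
	return 0
}

// Pack encodes d into a branch byte and a DataFile byte.
func (d LangDesc) Pack() (branch, dataFile uint8) {
	branch = d.Order&0x07 | bit(d.HeadFinal, 3) | bit(d.ParticleBounded, 4) |
		bit(d.PreNominalRC, 5) | bit(d.ZeroCopula, 6)
	dataFile = (d.Markers & 0x03) << 6
	return branch, dataFile
}

// Unpack reverses Pack.
func Unpack(branch, dataFile uint8) LangDesc {
	return LangDesc{
		Order:           branch & 0x07,
		HeadFinal:       branch&(1<<3) != 0,
		ParticleBounded: branch&(1<<4) != 0,
		PreNominalRC:    branch&(1<<5) != 0,
		ZeroCopula:      branch&(1<<6) != 0,
		Markers:         (dataFile >> 6) & 0x03,
	}
}

func main() {
	// JA row of the table: SOV, head-final, particle, pre-nominal RC, zero copula, postpositional.
	ja := LangDesc{Order: OrderSOV, HeadFinal: true, ParticleBounded: true,
		PreNominalRC: true, ZeroCopula: true, Markers: MarkersPostpos}
	b, df := ja.Pack()
	fmt.Printf("JA branch=%#x dataFile=%#x\n", b, df) // JA branch=0x79 dataFile=0x40
}
```

With this layout the EN row encodes to all zero bits, so an unregistered (zeroed) record would read as English-like defaults; whether that is desirable is a design decision the plan does not address.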

New command: transdb lang-init -lang <en|ja|ko|...>

Step 2 — Particle role map records

Particle roles stored in branch 2 (Bcooccur, currently empty — repurposed as role registry) at MakeKey(lang, NP_semantic_coord, particle_form).

Non-ambiguous particles at coord=0:

MakeKey(LangJA, 0, "は") → role = NP_subj_topic
MakeKey(LangJA, 0, "が") → role = NP_subj_gram
MakeKey(LangJA, 0, "を") → role = NP_obj_direct
MakeKey(LangJA, 0, "から") → role = PP_source
MakeKey(LangJA, 0, "まで") → role = PP_limit
MakeKey(LangJA, 0, "と") → role = PP_comitative

Ambiguous particles with coord-keyed entries (RelaxCoord resolves via NP semantic flags):

MakeKey(LangJA, SemanticPlace<<CoordSemanticShift,   "に") → role = PP_locative
MakeKey(LangJA, SemanticEvent<<CoordSemanticShift,   "に") → role = PP_temporal
MakeKey(LangJA, SemanticHumanObj<<CoordSemanticShift,"に") → role = NP_dative
MakeKey(LangJA, 0,                                   "に") → role = NP_dative (default)
MakeKey(LangJA, SemanticPlace<<CoordSemanticShift,   "で") → role = PP_locative_static
MakeKey(LangJA, 0,                                   "で") → role = PP_instrumental
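The fallback from coord-keyed entries to the coord=0 default can be illustrated with a plain map. This sketch is a simplification: it replaces RelaxCoord with an exact-match lookup over an assumed priority order of semantic flags, and the flag bit values, roleKey type and lookupJAParticleRole function are hypothetical.

```go
package main

import "fmt"

// Hypothetical semantic flag bits; the real values live in the lattice code.
const (
	SemanticPlace uint64 = 1 << iota
	SemanticEvent
	SemanticHumanObj
)

// roleKey stands in for MakeKey(LangJA, sem<<CoordSemanticShift, particle).
type roleKey struct {
	sem      uint64 // 0 means the default (coord=0) entry
	particle string
}

// jaRoles mirrors the Step 2 listing for Japanese.
var jaRoles = map[roleKey]string{
	{0, "は"}:                "NP_subj_topic",
	{0, "が"}:                "NP_subj_gram",
	{0, "を"}:                "NP_obj_direct",
	{0, "から"}:               "PP_source",
	{0, "まで"}:               "PP_limit",
	{0, "と"}:                "PP_comitative",
	{SemanticPlace, "に"}:    "PP_locative",
	{SemanticEvent, "に"}:    "PP_temporal",
	{SemanticHumanObj, "に"}: "NP_dative",
	{0, "に"}:                "NP_dative",
	{SemanticPlace, "で"}:    "PP_locative_static",
	{0, "で"}:                "PP_instrumental",
}

// semPriority is an assumed tie-break order when an NP carries several flags.
var semPriority = []uint64{SemanticPlace, SemanticEvent, SemanticHumanObj}

// lookupJAParticleRole tries coord-keyed entries first, then the default entry.
func lookupJAParticleRole(particle string, npFlags uint64) (string, bool) {
	for _, sem := range semPriority {
		if npFlags&sem == 0 {
			continue
		}
		if role, ok := jaRoles[roleKey{sem, particle}]; ok {
			return role, true
		}
	}
	role, ok := jaRoles[roleKey{0, particle}]
	return role, ok
}

func main() {
	role, _ := lookupJAParticleRole("に", SemanticPlace)
	fmt.Println(role) // PP_locative
}
```

A particle with no coord-keyed entry for the NP's flags (for example で after an event noun) falls through to its coord=0 record, which matches the default rows in the listing.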

Step 3 — Cluster parser

New file: transdb/cluster.mx

type ClusterType uint8
const (
    ClusterNPSubj ClusterType = 0
    ClusterNPObj  ClusterType = 1
    ClusterVP     ClusterType = 2
    ClusterPP     ClusterType = 3
    ClusterMod    ClusterType = 4
)

type Cluster struct {
    Kind   ClusterType
    Tokens []string
    Flags  uint64      // accumulated semantic flags
    Role   uint8       // particle role code
    Nested []*Cluster  // relative clause VP if present
    Trans  string      // filled by TranslateCluster
    Copular bool       // sentence is copular (no Bverb in predicate zone)
}

Verb detection — CRITICAL

RC signal check: jaRecordBranch(tree, tok) == uint8(lattice.Bverb) — Bverb (branch 3) only.

い-adjectives (大きい, 美しい) → Bmodifier → NOT a verb → do NOT trigger RC detection
な-adjective stems (きれい) → Bnoun → NOT a verb → do NOT trigger RC detection
Verbal nouns (勉強) → Bnoun → NOT a verb

大きい猫 stays a modified NP because 大きい is Bmodifier, not Bverb.

て-form dependency

ParseClusters assumes TokenizeJA has already run and collapsed progressive compounds. 食べている is produced as a SINGLE Bverb token by the morph-coord tokenizer (inferMorphState("食べている") → MorphPresProgPlain=8, coord-keyed lattice record found). The cluster parser never sees a split て-form. This dependency must be maintained — cluster parsing must run on already-tokenized input, not raw text.

Particle-bounded parser (JA, KO, TR)

acc := []
for each tok in tokens:
    if isParticle(tok, lang):
        role := LookupParticleRole(lang, tok, SemanticFlagsOf(acc))
        // A Bverb token inside acc marks a relative clause on the head noun.
        if containsFiniteVerb(acc):
            vpTokens, headTokens := splitAtVerbBoundary(acc)
            nested := ParseClusters(vpTokens, tree, lang)  // recurse
            emit Cluster{Tokens: headTokens, Role: role, Nested: nested}
        else:
            emit Cluster{Tokens: acc, Role: role}
        acc = []
    else:
        // Verbs (jaRecordBranch(tree, tok) == Bverb) are accumulated like any
        // other token; the next particle boundary decides whether they form a
        // relative clause. i-adjectives are Bmodifier and never count as verbs.
        acc.append(tok)

// Trailing tokens = predicate zone
if len(acc) > 0:
    isCopular := !containsFiniteVerb(acc)  // no Bverb = copular sentence
    emit Cluster{Tokens: acc, Kind: ClusterVP, Copular: isCopular}
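The loop above translates fairly directly into Go. The following is a rough sketch under several assumptions: the three lookup tables are toy stand-ins for jaRecordBranch, the particle role records and jaFunctionWord; the Cluster struct is trimmed (no Kind or Flags, an IsVP flag instead); and the relative-clause split is taken to fall right after the last verb in the accumulated tokens, which the plan does not specify.

```go
package main

import "fmt"

// Cluster is a trimmed-down version of the plan's struct.
type Cluster struct {
	Tokens  []string
	Role    string
	Nested  []*Cluster
	IsVP    bool
	Copular bool
}

// Toy tables standing in for lattice lookups. All entries are illustrative.
var (
	verbs     = map[string]bool{"食べた": true, "食べている": true, "鳴いた": true, "走る": true}
	particles = map[string]string{"は": "NP_subj_topic", "が": "NP_subj_gram", "を": "NP_obj_direct"}
	funcWords = map[string]bool{"だ": true, "です": true}
)

// lastVerb returns the index of the last verb token in toks, or -1.
func lastVerb(toks []string) int {
	for i := len(toks) - 1; i >= 0; i-- {
		if verbs[toks[i]] {
			return i
		}
	}
	return -1
}

// ParseClusters segments already-tokenized input at particle boundaries.
func ParseClusters(tokens []string) []*Cluster {
	var out []*Cluster
	var acc []string
	for _, tok := range tokens {
		role, isParticle := particles[tok]
		switch {
		case isParticle:
			c := &Cluster{Role: role}
			if i := lastVerb(acc); i >= 0 {
				// Verb inside the accumulation: relative clause + head noun.
				c.Nested = ParseClusters(acc[:i+1])
				c.Tokens = acc[i+1:]
			} else {
				c.Tokens = acc
			}
			out = append(out, c)
			acc = nil
		case funcWords[tok]:
			// だ/です are consumed silently.
		default:
			acc = append(acc, tok)
		}
	}
	if len(acc) > 0 {
		// Trailing tokens form the predicate zone; no verb means copular.
		out = append(out, &Cluster{Tokens: acc, IsVP: true, Copular: lastVerb(acc) < 0})
	}
	return out
}

func main() {
	for _, c := range ParseClusters([]string{"食べている", "鳥", "が", "鳴いた"}) {
		fmt.Printf("tokens=%v role=%q nested=%d vp=%v copular=%v\n",
			c.Tokens, c.Role, len(c.Nested), c.IsVP, c.Copular)
	}
}
```

Because the nested clause is parsed by the same function, deeper embedding needs no extra machinery, which is the recursion property the plan relies on.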

Copula detection — single rule

isCopular = (no Bverb token in predicate zone)

This covers all cases:

東京は大きい都市だ: だ is jaFunctionWord, no Bverb → copular ✓
東京は大きい: no Bverb, 大きい is Bmodifier → copular ✓
東京は都市だ: no Bverb (だ is filtered) → copular ✓
猫が走る: 走る is Bverb → NOT copular ✓

だ/です at sentence-final position: already in jaFunctionWord, silently consumed by the particle/function-word detection. Copularity is signaled by absence of Bverb, not presence of だ.

Position-bounded parser (EN, ZH, FR) — Step 7

Find main verbs (IsENVerb equivalent — Bverb in EN lattice)
Pre-verbal contiguous nominals → NP_subj candidates
Post-verbal contiguous nominals → NP_obj candidates
Prepositions mark PP boundaries within groups
Relative markers (that/which/who) or bare participles → recurse
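A first cut of the position-bounded mode might look like the sketch below. It covers only the first four rules (verb pivot, pre- and post-verbal nominals, preposition boundaries); relative markers, participles and adverbs are left out. The word lists, the string-valued Kind and the Marker field are hypothetical stand-ins for EN lattice lookups.

```go
package main

import (
	"fmt"
	"strings"
)

// Cluster is a simplified stand-in for the plan's struct.
type Cluster struct {
	Kind   string   // "NP_subj", "VP", "NP_obj" or "PP"
	Marker string   // preposition that opened a PP, if any
	Tokens []string // content tokens of the cluster
}

// Toy word lists standing in for Bverb and preposition lookups in the EN lattice.
var (
	enVerbs = map[string]bool{"sang": true, "went": true, "ate": true}
	enPreps = map[string]bool{"to": true, "in": true, "from": true, "with": true}
)

// parsePositional splits tokens around the main verb and at prepositions.
func parsePositional(tokens []string) []Cluster {
	var out []Cluster
	cur := Cluster{Kind: "NP_subj"} // everything before the verb is a subject candidate
	flush := func() {
		if len(cur.Tokens) > 0 {
			out = append(out, cur)
		}
	}
	for _, tok := range tokens {
		switch {
		case enVerbs[tok]:
			flush()
			out = append(out, Cluster{Kind: "VP", Tokens: []string{tok}})
			cur = Cluster{Kind: "NP_obj"} // post-verbal nominals are object candidates
		case enPreps[tok]:
			flush()
			cur = Cluster{Kind: "PP", Marker: tok} // preposition opens a new PP
		default:
			cur.Tokens = append(cur.Tokens, tok)
		}
	}
	flush()
	return out
}

func main() {
	for _, c := range parsePositional(strings.Fields("he went to Tokyo")) {
		fmt.Printf("%s marker=%q tokens=%v\n", c.Kind, c.Marker, c.Tokens)
	}
}
```

Storing the preposition separately from the content tokens keeps it out of TranslateCluster and lets the target side choose its own marker from the role.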

Step 4 — Cluster translation

TranslateCluster(c *Cluster, tree, pool, srcLang, dstLang uint8) string

Identify head word (last for head-final source, first for head-initial)
Compute lookup coord: PackCoord(c.Flags, 0, 0, morphstate, 0, 0, 0) — cluster semantic flags in semantic axis, morphstate from verb form
Look up head word in target language via RelaxCoord — finds best semantic-coord match
Translate modifiers (inherit parent cluster semantic context)
Assemble per target descriptor's head-direction bit

VP clusters: morphstate from source verb's conjugation flows into coord → target verb found at matching morph coord automatically.

Step 5 — Cluster reordering

ReorderClusters(clusters []*Cluster, srcDesc, dstDesc LangDesc) []*Cluster

Pure rearrangement, no translation. Maps source cluster roles to target canonical positions:

srcOrder = {NP_subj:0, VP:1, NP_obj:2}   // SVO
dstOrder = {NP_subj:0, NP_obj:1, VP:2}   // SOV

Bare modifiers (ClusterMod) attach to nearest semantically compatible cluster by flag overlap.
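Since each cluster already carries its kind, reordering can be expressed as a stable sort on the target order's slot numbers. The sketch below is one possible reading: it takes only the target order code (the source order is implicit in the cluster kinds), places PP clusters in the object slot, and does not model ClusterMod attachment. The slots table and rank helper are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type ClusterType uint8

const (
	ClusterNPSubj ClusterType = 0
	ClusterNPObj  ClusterType = 1
	ClusterVP     ClusterType = 2
	ClusterPP     ClusterType = 3
	ClusterMod    ClusterType = 4
)

// Cluster is reduced to the fields reordering needs.
type Cluster struct {
	Kind  ClusterType
	Trans string
}

// slots[order] holds the positions of subject, verb and object,
// indexed by the canonical-order codes SVO=0 ... OSV=5.
var slots = [6][3]int{
	{0, 1, 2}, // SVO
	{0, 2, 1}, // SOV
	{1, 0, 2}, // VSO
	{2, 0, 1}, // VOS
	{2, 1, 0}, // OVS
	{1, 2, 0}, // OSV
}

// rank maps a cluster kind to its slot in the target order.
// Assumption: PP (and any other non-subject, non-verb kind) shares the object slot.
func rank(k ClusterType, order uint8) int {
	s := slots[order]
	switch k {
	case ClusterNPSubj:
		return s[0]
	case ClusterVP:
		return s[1]
	default:
		return s[2]
	}
}

// ReorderClusters returns a reordered copy; clusters in the same slot keep
// their relative source order because the sort is stable.
func ReorderClusters(clusters []*Cluster, dstOrder uint8) []*Cluster {
	out := append([]*Cluster(nil), clusters...)
	sort.SliceStable(out, func(i, j int) bool {
		return rank(out[i].Kind, dstOrder) < rank(out[j].Kind, dstOrder)
	})
	return out
}

func main() {
	svo := []*Cluster{{ClusterNPSubj, "cat"}, {ClusterVP, "ate"}, {ClusterNPObj, "fish"}}
	for _, c := range ReorderClusters(svo, 1) { // 1 = SOV
		fmt.Print(c.Trans, " ")
	}
	fmt.Println() // cat fish ate
}
```

Copying the slice before sorting keeps the source-order clusters intact, which is convenient when comparing the cluster path against the token-by-token fallback.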

Step 6 — Marker insertion

InsertMarkers(clusters []*Cluster, tree, dstDesc LangDesc, dstLang uint8) string

For each cluster in target order:

Particle-bounded target (JA output): append postposition after cluster

- LookupTargetMarker(dstLang, cluster.Role, cluster.Flags) → は/を/に/で

Position-bounded target (EN output): prepend preposition before PP clusters

- LookupTargetMarker(dstLang, cluster.Role, cluster.Flags) → in/to/from/with

Copular sentence (cluster.Copular == true):

- overt-copula target (EN): insert "is/are" between NP_subj and predicate cluster
- zero-copula target (JA): omit copula, emit predicate directly
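The three cases above can be sketched in one pass over the reordered clusters. This is an illustration only: the markers map is a toy stand-in for LookupTargetMarker, the LangDesc fields (including the Sep separator) are hypothetical, number agreement is ignored so the copula is always "is", and cluster kinds are plain strings.

```go
package main

import (
	"fmt"
	"strings"
)

// Cluster is reduced to the fields marker insertion needs.
type Cluster struct {
	Kind    string // "NP_subj", "NP_obj", "VP" or "PP"
	Role    string // particle role code, e.g. "PP_locative"
	Trans   string // translated text from TranslateCluster
	Copular bool
}

// LangDesc holds the two descriptor bits used here plus a hypothetical separator.
type LangDesc struct {
	Postpositional bool
	ZeroCopula     bool
	Sep            string
}

// markers is a toy table standing in for LookupTargetMarker(dstLang, role, flags).
var markers = map[string]map[string]string{
	"ja": {"NP_subj_topic": "は", "NP_obj_direct": "を", "PP_locative": "に"},
	"en": {"PP_locative": "to", "PP_source": "from", "PP_comitative": "with"},
}

// InsertMarkers renders clusters (already in target order) to a string.
func InsertMarkers(clusters []Cluster, d LangDesc, lang string) string {
	var parts []string
	for _, c := range clusters {
		m := markers[lang][c.Role] // empty string when no marker is registered
		text := c.Trans
		switch {
		case d.Postpositional:
			text += m // particle attaches after the cluster
		case c.Kind == "PP" && m != "":
			text = m + " " + text // preposition precedes the PP
		}
		if c.Copular && !d.ZeroCopula {
			text = "is " + text // overt copula before the predicate
		}
		parts = append(parts, text)
	}
	return strings.Join(parts, d.Sep)
}

func main() {
	en := LangDesc{Sep: " "}
	fmt.Println(InsertMarkers([]Cluster{
		{Kind: "NP_subj", Role: "NP_subj_topic", Trans: "he"},
		{Kind: "VP", Trans: "went"},
		{Kind: "PP", Role: "PP_locative", Trans: "Tokyo"},
	}, en, "en")) // he went to Tokyo
}
```

Keeping the role on the cluster rather than the source particle is what lets the same cluster list render as 彼は東京に行った for a postpositional target.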

File structure

File	Lines	Purpose
`transdb/langdesc.mx`	~120	LangDesc struct, GetLangDesc, RegisterLangDesc, encoding constants
`transdb/cluster.mx`	~350	Cluster struct, ParseClusters (both modes + recursion), TranslateCluster, ReorderClusters, InsertMarkers
`transdb/translate.mx`	~30 changed	Replace token loop in Translate() with cluster pipeline; keep old path as fallback
`cmd/transdb/main.mx`	~120 added	lang-init command (descriptor + particle map registration)

Total new code: ~620 lines.

Migration sequence

Implement lang-init, register EN + JA descriptors + particle role maps
Implement ParseClusters — particle-bounded only (JA→EN direction first)
Implement TranslateCluster — reuses existing coord lookup machinery
Implement ReorderClusters — pure array rearrangement
Implement InsertMarkers — postposition/preposition/copula insertion
Gate behind -cluster flag; run quality sample JA→EN vs current token-by-token
Add position-bounded parser for EN→JA direction
Remove flag when cluster path matches or exceeds token-by-token quality
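The gating step could be wired roughly as follows, using Go's standard flag package. The two translate functions are placeholders that only tag their input so the chosen path is visible; the run helper and the fallback-on-failure behaviour are assumptions about how "keep old path as fallback" would be realised.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// translateTokenwise is a placeholder for the existing token-by-token path.
func translateTokenwise(tokens []string) string {
	return "tokenwise:" + strings.Join(tokens, " ")
}

// translateClusters is a placeholder for the cluster pipeline. It reports
// ok=false when it cannot handle the input (here: empty input).
func translateClusters(tokens []string) (string, bool) {
	if len(tokens) == 0 {
		return "", false
	}
	return "cluster:" + strings.Join(tokens, " "), true
}

// Translate uses the cluster path when enabled and falls back otherwise.
func Translate(tokens []string, useCluster bool) string {
	if useCluster {
		if out, ok := translateClusters(tokens); ok {
			return out
		}
	}
	return translateTokenwise(tokens)
}

// run parses command-line arguments and translates the remaining tokens.
func run(args []string) (string, error) {
	fs := flag.NewFlagSet("transdb", flag.ContinueOnError)
	useCluster := fs.Bool("cluster", false, "use the cluster pipeline instead of token-by-token")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return Translate(fs.Args(), *useCluster), nil
}

func main() {
	out, err := run(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println(out)
}
```

Running the same sample with and without -cluster gives the paired outputs needed for the quality comparison in the step above.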

Test cases

JA→EN (steps 1-6)

Source	Expected	Tests
猫が魚を食べた	cat ate fish	Basic SOV→SVO, object positioning
彼は東京に行った	he went to Tokyo	Locative に via SemanticPlace coord
友達にあげた	gave to friend	Dative に via SemanticHuman coord
食べている鳥が鳴いた	the eating bird chirped	RC (Bverb inside NP) + SemanticAnim→鳴く
東京は大きい都市だ	Tokyo is a big city	Copular: no Bverb, overt copula in EN
東京は大きい	Tokyo is big	Copular: predicate adjective, no だ

EN→JA (step 7)

Source	Expected	Tests
the bird sang	鳥は鳴いた	SemanticAnim subject → 鳴く over 歌う
she sang beautifully	彼女は美しく歌った	SemanticHuman subject → 歌う
he went to Tokyo	彼は東京に行った	Locative PP → postposition に

Key invariants

Cluster parsing runs on TokenizeJA output — morph-coord tokenizer has already collapsed progressives (食べている = single token). Never parse raw text.
RC detection uses jaRecordBranch == Bverb only. Bmodifier adjectives never trigger it.
Copula detection: no Bverb in predicate zone is the single rule. だ/です suppression is a consequence, not the cause.
Ambiguous particle role resolution: LookupParticleRole(lang, particle, NP_flags) — RelaxCoord on the NP's semantic coord. Same mechanism as verb disambiguation.
Three-level coord propagation (embedded VP → head noun → main VP) is just recursion of the same cluster mechanism. No special-case machinery.
