
Cluster-based Phrase Transformation Plan

Architecture

Replace token-by-token translation with a four-stage pipeline:

source tokens
    → ParseClusters         (phrase segmentation, particle/position mode)
    → TranslateCluster      (semantic-coord lookup per cluster)
    → ReorderClusters       (canonical order from language descriptor)
    → InsertMarkers         (particles, prepositions, copula)
    → string output

All stages driven by language descriptor records in the lattice. Adding a new language = registering a descriptor record + particle role map. No code changes required.
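
As a rough illustration of how the stages could be wired together, here is a minimal sketch. The stage bodies are placeholders, the signatures are simplified (no tree, pool, or descriptor arguments), and `toyDict` is made-up sample data; none of this is the planned implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// Cluster is a cut-down stand-in for the struct defined in Step 3.
type Cluster struct {
	Tokens []string
	Trans  string
}

// toyDict is hypothetical sample data, not part of the plan.
var toyDict = map[string]string{"猫": "cat", "魚": "fish"}

// parseClusters is a placeholder segmentation: one cluster per token.
func parseClusters(tokens []string) []*Cluster {
	out := make([]*Cluster, 0, len(tokens))
	for _, t := range tokens {
		out = append(out, &Cluster{Tokens: []string{t}})
	}
	return out
}

// translateCluster is a placeholder lookup of the joined tokens in toyDict.
func translateCluster(c *Cluster) {
	key := strings.Join(c.Tokens, "")
	if tr, ok := toyDict[key]; ok {
		c.Trans = tr
	} else {
		c.Trans = key // pass unknown material through unchanged
	}
}

// reorderClusters is a placeholder that keeps source order.
func reorderClusters(cs []*Cluster) []*Cluster { return cs }

// insertMarkers is a placeholder that joins translations with spaces.
func insertMarkers(cs []*Cluster) string {
	parts := make([]string, len(cs))
	for i, c := range cs {
		parts[i] = c.Trans
	}
	return strings.Join(parts, " ")
}

// Translate runs the stages in the order the plan lists them.
func Translate(tokens []string) string {
	cs := parseClusters(tokens)
	for _, c := range cs {
		translateCluster(c)
	}
	return insertMarkers(reorderClusters(cs))
}

func main() {
	fmt.Println(Translate([]string{"猫", "魚"}))
}
```

The point of the shape is that each stage only sees clusters, so a new language changes the data the stages consult rather than the wiring.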


Step 1 — Language descriptor records

New file: transdb/langdesc.mx

Record at MakeKey(lang, 0, "") — empty word, lang-scoped:

Branch bits 0-2: canonical order  (SVO=0, SOV=1, VSO=2, VOS=3, OVS=4, OSV=5)
Branch bit 3:    head direction   (initial=0, final=1)
Branch bit 4:    parsing mode     (position-bounded=0, particle-bounded=1)
Branch bit 5:    relative clause  (post-nominal=0, pre-nominal=1)
Branch bit 6:    copula           (overt=0, zero-copula=1)
DataFile bits 6-7: marker system (prepositional=0, postpositional=1, case=2)
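
| Language | Order | Head    | Mode     | RC   | Copula | Markers |
|----------|-------|---------|----------|------|--------|---------|
| EN       | SVO   | initial | position | post | overt  | prep    |
| JA       | SOV   | final   | particle | pre  | zero   | postpos |
| KO       | SOV   | final   | particle | pre  | overt  | postpos |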
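
The bit layout above could be packed and unpacked along these lines. This is a sketch only: it assumes Branch and DataFile are 8-bit fields, and the `LangDesc` field names, `Encode`, and `DecodeLangDesc` are illustrative guesses, not the planned API.

```go
package main

import "fmt"

// Canonical orders, numbered as in the plan (Branch bits 0-2).
const (
	OrderSVO uint8 = iota
	OrderSOV
	OrderVSO
	OrderVOS
	OrderOVS
	OrderOSV
)

// Marker systems, numbered as in the plan (DataFile bits 6-7).
const (
	MarkerPrep uint8 = iota
	MarkerPostpos
	MarkerCase
)

const (
	orderMask     = 0x07   // Branch bits 0-2
	headFinalBit  = 1 << 3 // Branch bit 3
	particleBit   = 1 << 4 // Branch bit 4
	preNominalBit = 1 << 5 // Branch bit 5
	zeroCopulaBit = 1 << 6 // Branch bit 6
	markerShift   = 6      // DataFile bits 6-7
)

// LangDesc is a hypothetical decoded view of one descriptor record.
type LangDesc struct {
	Order        uint8
	HeadFinal    bool
	ParticleMode bool
	PreNominalRC bool
	ZeroCopula   bool
	Markers      uint8
}

// Encode packs the descriptor into the two assumed 8-bit fields.
func (d LangDesc) Encode() (branch, dataFile uint8) {
	branch = d.Order & orderMask
	if d.HeadFinal {
		branch |= headFinalBit
	}
	if d.ParticleMode {
		branch |= particleBit
	}
	if d.PreNominalRC {
		branch |= preNominalBit
	}
	if d.ZeroCopula {
		branch |= zeroCopulaBit
	}
	dataFile = (d.Markers & 0x03) << markerShift
	return branch, dataFile
}

// DecodeLangDesc is the inverse of Encode.
func DecodeLangDesc(branch, dataFile uint8) LangDesc {
	return LangDesc{
		Order:        branch & orderMask,
		HeadFinal:    branch&headFinalBit != 0,
		ParticleMode: branch&particleBit != 0,
		PreNominalRC: branch&preNominalBit != 0,
		ZeroCopula:   branch&zeroCopulaBit != 0,
		Markers:      dataFile >> markerShift,
	}
}

func main() {
	ja := LangDesc{Order: OrderSOV, HeadFinal: true, ParticleMode: true,
		PreNominalRC: true, ZeroCopula: true, Markers: MarkerPostpos}
	b, f := ja.Encode()
	fmt.Printf("JA branch=%#x dataFile=%#x roundtrip=%v\n", b, f, DecodeLangDesc(b, f) == ja)
}
```

Under these assumptions the JA row encodes to branch 0x79 and DataFile 0x40, and the EN row encodes to all zeros.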

New command: transdb lang-init -lang <en|ja|ko|...>


Step 2 — Particle role map records

Particle roles stored in branch 2 (Bcooccur, currently empty — repurposed as role registry) at MakeKey(lang, NP_semantic_coord, particle_form).

Non-ambiguous particles at coord=0:

MakeKey(LangJA, 0, "は") → role = NP_subj_topic
MakeKey(LangJA, 0, "が") → role = NP_subj_gram
MakeKey(LangJA, 0, "を") → role = NP_obj_direct
MakeKey(LangJA, 0, "から") → role = PP_source
MakeKey(LangJA, 0, "まで") → role = PP_limit
MakeKey(LangJA, 0, "と") → role = PP_comitative

Ambiguous particles with coord-keyed entries (RelaxCoord resolves via NP semantic flags):

MakeKey(LangJA, SemanticPlace<<CoordSemanticShift,   "に") → role = PP_locative
MakeKey(LangJA, SemanticEvent<<CoordSemanticShift,   "に") → role = PP_temporal
MakeKey(LangJA, SemanticHumanObj<<CoordSemanticShift,"に") → role = NP_dative
MakeKey(LangJA, 0,                                   "に") → role = NP_dative (default)
MakeKey(LangJA, SemanticPlace<<CoordSemanticShift,   "で") → role = PP_locative_static
MakeKey(LangJA, 0,                                   "で") → role = PP_instrumental
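
The coord-keyed fallback can be pictured with a plain map. In the sketch below the constant values, the `roleKey` struct, and the fixed flag-priority loop are all assumptions made for illustration; the plan resolves ambiguity through RelaxCoord and stores roles as numeric codes, neither of which is reproduced here.

```go
package main

import "fmt"

// Hypothetical values; the real constants live in the lattice package.
const (
	LangJA             uint8  = 1
	CoordSemanticShift        = 8
	SemanticPlace      uint64 = 1 << 0
	SemanticEvent      uint64 = 1 << 1
	SemanticHumanObj   uint64 = 1 << 2
)

// roleKey stands in for MakeKey(lang, coord, form).
type roleKey struct {
	lang  uint8
	coord uint64
	form  string
}

// roles mirrors part of the JA table above, with role names as strings.
var roles = map[roleKey]string{
	{LangJA, 0, "は"}: "NP_subj_topic",
	{LangJA, 0, "を"}: "NP_obj_direct",
	{LangJA, SemanticPlace << CoordSemanticShift, "に"}:    "PP_locative",
	{LangJA, SemanticEvent << CoordSemanticShift, "に"}:    "PP_temporal",
	{LangJA, SemanticHumanObj << CoordSemanticShift, "に"}: "NP_dative",
	{LangJA, 0, "に"}: "NP_dative",
	{LangJA, SemanticPlace << CoordSemanticShift, "で"}: "PP_locative_static",
	{LangJA, 0, "で"}: "PP_instrumental",
}

// LookupParticleRole tries a coord-keyed entry for each semantic flag set on
// the preceding NP, then falls back to the coord=0 default.
func LookupParticleRole(lang uint8, form string, flags uint64) (string, bool) {
	for _, f := range []uint64{SemanticPlace, SemanticEvent, SemanticHumanObj} {
		if flags&f == 0 {
			continue
		}
		if r, ok := roles[roleKey{lang, f << CoordSemanticShift, form}]; ok {
			return r, true
		}
	}
	r, ok := roles[roleKey{lang, 0, form}]
	return r, ok
}

func main() {
	r, _ := LookupParticleRole(LangJA, "に", SemanticPlace)
	fmt.Println("に after a place NP:", r)
	r, _ = LookupParticleRole(LangJA, "で", SemanticEvent)
	fmt.Println("で after an event NP:", r)
}
```

Because every particle has a coord=0 entry, an NP with no recognised semantic flags still resolves to a role.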

Step 3 — Cluster parser

New file: transdb/cluster.mx

type ClusterType uint8
const (
    ClusterNPSubj ClusterType = 0
    ClusterNPObj  ClusterType = 1
    ClusterVP     ClusterType = 2
    ClusterPP     ClusterType = 3
    ClusterMod    ClusterType = 4
)

type Cluster struct {
    Kind   ClusterType
    Tokens []string
    Flags  uint64      // accumulated semantic flags
    Role   uint8       // particle role code
    Nested []*Cluster  // relative clause VP if present
    Trans  string      // filled by TranslateCluster
    Copular bool       // sentence is copular (no Bverb in predicate zone)
}

Verb detection — CRITICAL

RC signal check: jaRecordBranch(tree, tok) == uint8(lattice.Bverb) — Bverb (branch 3) only.

大きい猫 stays a modified NP because 大きい is Bmodifier, not Bverb.

て-form dependency

ParseClusters assumes TokenizeJA has already run and collapsed progressive compounds. 食べている is produced as a SINGLE Bverb token by the morph-coord tokenizer (inferMorphState("食べている") → MorphPresProgPlain=8, coord-keyed lattice record found). The cluster parser never sees a split て-form. This dependency must be maintained — cluster parsing must run on already-tokenized input, not raw text.

Particle-bounded parser (JA, KO, TR)

acc := []
for each tok in tokens:
    if isParticle(tok, lang):
        role := LookupParticleRole(lang, tok, SemanticFlagsOf(acc))
        // A Bverb token inside acc means acc holds a relative clause
        // followed by its head noun
        if containsFiniteVerb(acc):
            vpTokens, headTokens := splitAtVerbBoundary(acc)
            nested := ParseClusters(vpTokens, tree, lang)  // recurse
            emit Cluster{Tokens: headTokens, Role: role, Nested: nested}
        else:
            emit Cluster{Tokens: acc, Role: role}
        acc = []
    else:
        // Verbs are accumulated like any other token. A Bverb
        // (jaRecordBranch(tree, tok) == Bverb) followed by more tokens is the
        // relative-clause signal; the next particle boundary resolves it.
        // i-adjectives are Bmodifier and never trigger it.
        acc.append(tok)

// Trailing tokens = predicate zone
if len(acc) > 0:
    copular := !containsFiniteVerb(acc)  // no Bverb = copular sentence
    emit Cluster{Tokens: acc, Kind: ClusterVP, Copular: copular}
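
The pseudocode above might translate into Go roughly as follows. This is a sketch under stated assumptions: the `particleRole` and `isVerb` maps are toy stand-ins for the lattice lookups (isParticle, LookupParticleRole, jaRecordBranch), roles are strings rather than codes, and the tree and lang arguments are dropped. It also only splits off a relative clause when a head noun follows the verb, which is a simplification of splitAtVerbBoundary.

```go
package main

import "fmt"

// Cluster is a reduced version of the struct defined earlier in this step.
type Cluster struct {
	Tokens  []string
	Role    string
	Nested  []*Cluster
	IsVP    bool
	Copular bool
}

// Toy stand-ins for lattice lookups (hypothetical data).
var particleRole = map[string]string{
	"は": "NP_subj_topic", "が": "NP_subj_gram", "を": "NP_obj_direct",
}
var isVerb = map[string]bool{"食べた": true, "食べている": true, "鳴いた": true}

// lastVerb returns the index of the last verb token in toks, or -1.
func lastVerb(toks []string) int {
	for i := len(toks) - 1; i >= 0; i-- {
		if isVerb[toks[i]] {
			return i
		}
	}
	return -1
}

// ParseClusters segments already-tokenized input at particle boundaries.
func ParseClusters(tokens []string) []*Cluster {
	var out []*Cluster
	var acc []string
	for _, tok := range tokens {
		role, ok := particleRole[tok]
		if !ok {
			acc = append(acc, tok) // verbs are deferred too; the particle decides
			continue
		}
		c := &Cluster{Role: role}
		if v := lastVerb(acc); v >= 0 && v < len(acc)-1 {
			// Verb followed by a head noun: relative clause, so recurse on it.
			c.Nested = ParseClusters(acc[:v+1])
			c.Tokens = acc[v+1:]
		} else {
			c.Tokens = acc
		}
		out = append(out, c)
		acc = nil
	}
	if len(acc) > 0 {
		// Trailing tokens form the predicate zone; no verb means copular.
		out = append(out, &Cluster{Tokens: acc, IsVP: true, Copular: lastVerb(acc) < 0})
	}
	return out
}

func main() {
	for _, c := range ParseClusters([]string{"食べている", "鳥", "が", "鳴いた"}) {
		fmt.Printf("tokens=%v role=%q nested=%d vp=%v copular=%v\n",
			c.Tokens, c.Role, len(c.Nested), c.IsVP, c.Copular)
	}
}
```

Running it on 食べている鳥が鳴いた should give a subject cluster headed by 鳥 with one nested VP, followed by the main predicate 鳴いた.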

Copula detection — single rule

Copular = (no Bverb token in predicate zone)

This covers all cases:

だ/です at sentence-final position: already in jaFunctionWord, silently consumed by the particle/function-word detection. Copularity is signaled by absence of Bverb, not presence of だ.

Position-bounded parser (EN, ZH, FR) — migration step 7

  1. Find main verbs (IsENVerb equivalent — Bverb in EN lattice)
  2. Pre-verbal contiguous nominals → NP_subj candidates
  3. Post-verbal contiguous nominals → NP_obj candidates
  4. Prepositions mark PP boundaries within groups
  5. Relative markers (that/which/who) or bare participles → recurse
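
A toy version of steps 1 to 4 might look like the sketch below. The verb and preposition sets are hard-coded stand-ins for lattice lookups, cluster kinds are strings, and step 5 (relative markers and participles) is left out entirely; adverbs and articles get no special handling either.

```go
package main

import "fmt"

// Cluster is a reduced stand-in for the struct defined in Step 3.
type Cluster struct {
	Kind   string
	Tokens []string
}

// Toy stand-ins for Bverb and preposition lookups in the EN lattice.
var enVerbs = map[string]bool{"went": true, "ate": true, "sang": true}
var enPreps = map[string]bool{"to": true, "in": true, "from": true, "with": true}

// ParsePositional segments tokens around the first main verb.
func ParsePositional(tokens []string) []*Cluster {
	verb := -1
	for i, t := range tokens {
		if enVerbs[t] {
			verb = i
			break
		}
	}
	if verb < 0 {
		return []*Cluster{{Kind: "NP", Tokens: tokens}} // no verb found
	}
	var out []*Cluster
	if verb > 0 {
		out = append(out, &Cluster{Kind: "NP_subj", Tokens: tokens[:verb]})
	}
	out = append(out, &Cluster{Kind: "VP", Tokens: tokens[verb : verb+1]})

	// Post-verbal material starts as an object; each preposition opens a PP.
	kind := "NP_obj"
	var acc []string
	flush := func() {
		if len(acc) > 0 {
			out = append(out, &Cluster{Kind: kind, Tokens: acc})
			acc = nil
		}
	}
	for _, t := range tokens[verb+1:] {
		if enPreps[t] {
			flush()
			kind = "PP"
		}
		acc = append(acc, t)
	}
	flush()
	return out
}

func main() {
	for _, c := range ParsePositional([]string{"he", "went", "to", "Tokyo"}) {
		fmt.Println(c.Kind, c.Tokens)
	}
}
```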

Step 4 — Cluster translation

TranslateCluster(c *Cluster, tree, pool, srcLang, dstLang uint8) string

  1. Identify head word (last for head-final source, first for head-initial)
  2. Compute lookup coord: PackCoord(c.Flags, 0, 0, morphstate, 0, 0, 0) — cluster semantic flags in semantic axis, morphstate from verb form
  3. Look up head word in target language via RelaxCoord — finds best semantic-coord match
  4. Translate modifiers (inherit parent cluster semantic context)
  5. Assemble per target descriptor's head-direction bit

VP clusters: morphstate from source verb's conjugation flows into coord → target verb found at matching morph coord automatically.
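
The morphstate hand-off for VP clusters can be illustrated with two small tables. Everything here is a simplification: the concept labels, the `analysis` struct, and the base-form fallback are hypothetical stand-ins for PackCoord and RelaxCoord, and only `MorphPresProgPlain = 8` is a value taken from this plan.

```go
package main

import "fmt"

// MorphPresProgPlain = 8 is the value quoted in this plan; the other two
// constants are hypothetical values chosen for the sketch.
const (
	MorphBase          uint8 = 0
	MorphPastPlain     uint8 = 3
	MorphPresProgPlain uint8 = 8
)

// analysis pairs a language-neutral concept with a morph state. It plays the
// role of the lookup coord in this toy model.
type analysis struct {
	concept string
	morph   uint8
}

// srcForms maps source verb tokens to their analysis (toy data).
var srcForms = map[string]analysis{
	"食べた":   {"EAT", MorphPastPlain},
	"食べている": {"EAT", MorphPresProgPlain},
	"行った":   {"GO", MorphPastPlain},
}

// dstForms maps an analysis to a target form (toy data, deliberately sparse).
var dstForms = map[analysis]string{
	{"EAT", MorphBase}:          "eat",
	{"EAT", MorphPastPlain}:     "ate",
	{"EAT", MorphPresProgPlain}: "is eating",
	{"GO", MorphBase}:           "go",
}

// TranslateVerb looks for a target form at the same morph state and relaxes
// to the base form when no exact match is registered.
func TranslateVerb(src string) (string, bool) {
	a, ok := srcForms[src]
	if !ok {
		return "", false
	}
	if out, ok := dstForms[a]; ok {
		return out, true
	}
	out, ok := dstForms[analysis{a.concept, MorphBase}]
	return out, ok
}

func main() {
	for _, v := range []string{"食べた", "食べている", "行った"} {
		out, ok := TranslateVerb(v)
		fmt.Println(v, "->", out, ok)
	}
}
```

The third input has no past-tense entry in the toy target table, so it shows the relaxed match rather than an exact one.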

Step 5 — Cluster reordering

ReorderClusters(clusters []*Cluster, srcDesc, dstDesc LangDesc) []*Cluster

Pure rearrangement, no translation. Maps source cluster roles to target canonical positions:

srcOrder = {NP_subj:0, VP:1, NP_obj:2}   // SVO
dstOrder = {NP_subj:0, NP_obj:1, VP:2}   // SOV

Bare modifiers (ClusterMod) attach to nearest semantically compatible cluster by flag overlap.
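
One way to express the rearrangement is a stable sort on the target order's slot numbers. The sketch below assumes string role names and an order name in place of the srcDesc/dstDesc arguments, and it does not attempt the ClusterMod attachment rule; roles with no canonical slot simply sort to the end.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a reduced stand-in for the struct defined in Step 3.
type Cluster struct {
	Role  string
	Trans string
}

// orderSlots gives each role's canonical position per word order.
// Only the two orders shown in the plan are filled in.
var orderSlots = map[string]map[string]int{
	"SVO": {"NP_subj": 0, "VP": 1, "NP_obj": 2},
	"SOV": {"NP_subj": 0, "NP_obj": 1, "VP": 2},
}

// slot returns the canonical position of role, or a position past the end
// for roles the table does not cover.
func slot(slots map[string]int, role string) int {
	if s, ok := slots[role]; ok {
		return s
	}
	return len(slots)
}

// ReorderClusters returns a reordered copy; the input slice is not modified.
func ReorderClusters(clusters []*Cluster, dstOrder string) []*Cluster {
	slots := orderSlots[dstOrder]
	out := append([]*Cluster(nil), clusters...)
	sort.SliceStable(out, func(i, j int) bool {
		return slot(slots, out[i].Role) < slot(slots, out[j].Role)
	})
	return out
}

func main() {
	sov := []*Cluster{
		{Role: "NP_subj", Trans: "cat"},
		{Role: "NP_obj", Trans: "fish"},
		{Role: "VP", Trans: "ate"},
	}
	for _, c := range ReorderClusters(sov, "SVO") {
		fmt.Print(c.Trans, " ")
	}
	fmt.Println()
}
```

A stable sort keeps clusters that share a slot in their source order, which matters once several PPs or modifiers are present.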


Step 6 — Marker insertion

InsertMarkers(clusters []*Cluster, tree, dstDesc LangDesc, dstLang uint8) string

For each cluster in target order:

- postpositional target (JA): LookupTargetMarker(dstLang, cluster.Role, cluster.Flags) → は/を/に/で, appended after the cluster

- prepositional target (EN): LookupTargetMarker(dstLang, cluster.Role, cluster.Flags) → in/to/from/with, placed before the cluster

- overt-copula target (EN): insert "is/are" between NP_subj and predicate cluster

- zero-copula target (JA): omit copula, emit predicate directly
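
A compact sketch of this stage follows. `TargetDesc`, its `Sep` and `Markers` fields, and the string role names are assumptions made for the example; the marker tables are toy data standing in for LookupTargetMarker, and the copula is always "is" because agreement is not modelled.

```go
package main

import (
	"fmt"
	"strings"
)

// Cluster is a reduced stand-in for the struct defined in Step 3.
type Cluster struct {
	Role    string
	Trans   string
	Copular bool
}

// TargetDesc holds the descriptor bits this stage needs. Sep (the word
// separator) and Markers (role -> marker) are hypothetical additions.
type TargetDesc struct {
	Postpositional bool
	ZeroCopula     bool
	Sep            string
	Markers        map[string]string
}

// InsertMarkers attaches a marker to each cluster and joins the result.
func InsertMarkers(clusters []*Cluster, d TargetDesc) string {
	parts := make([]string, 0, len(clusters))
	for _, c := range clusters {
		text := c.Trans
		if m, ok := d.Markers[c.Role]; ok {
			if d.Postpositional {
				text += m // postposition follows the cluster
			} else {
				text = m + " " + text // preposition precedes the cluster
			}
		}
		if c.Role == "VP" && c.Copular && !d.ZeroCopula {
			text = "is " + text // overt copula before the predicate
		}
		parts = append(parts, text)
	}
	return strings.Join(parts, d.Sep)
}

var en = TargetDesc{Sep: " ", Markers: map[string]string{"PP_locative": "to"}}

var ja = TargetDesc{Postpositional: true, ZeroCopula: true, Markers: map[string]string{
	"NP_subj_topic": "は", "NP_obj_direct": "を", "PP_locative": "に",
}}

func main() {
	fmt.Println(InsertMarkers([]*Cluster{
		{Role: "NP_subj_topic", Trans: "he"},
		{Role: "VP", Trans: "went"},
		{Role: "PP_locative", Trans: "Tokyo"},
	}, en))
	fmt.Println(InsertMarkers([]*Cluster{
		{Role: "NP_subj_topic", Trans: "彼"},
		{Role: "PP_locative", Trans: "東京"},
		{Role: "VP", Trans: "行った"},
	}, ja))
}
```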

File structure

| File | Lines | Purpose |
|------|-------|---------|
| transdb/langdesc.mx | ~120 | LangDesc struct, GetLangDesc, RegisterLangDesc, encoding constants |
| transdb/cluster.mx | ~350 | Cluster struct, ParseClusters (both modes + recursion), TranslateCluster, ReorderClusters, InsertMarkers |
| transdb/translate.mx | ~30 changed | Replace token loop in Translate() with cluster pipeline; keep old path as fallback |
| cmd/transdb/main.mx | ~120 added | lang-init command (descriptor + particle map registration) |

Total new code: ~620 lines.

Migration sequence

  1. Implement lang-init, register EN + JA descriptors + particle role maps
  2. Implement ParseClusters — particle-bounded only (JA→EN direction first)
  3. Implement TranslateCluster — reuses existing coord lookup machinery
  4. Implement ReorderClusters — pure array rearrangement
  5. Implement InsertMarkers — postposition/preposition/copula insertion
  6. Gate behind -cluster flag; run quality sample JA→EN vs current token-by-token
  7. Add position-bounded parser for EN→JA direction
  8. Remove flag when cluster path matches or exceeds token-by-token quality
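
Gating the new path behind a flag while keeping the old path as a fallback could be done with the standard `flag` package, as sketched here. The two translate functions are placeholders, and the rule "fall back when the cluster path returns an error" is an assumption about how the fallback would be triggered.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
	"strings"
)

// translateTokens is a placeholder for the existing token-by-token path.
func translateTokens(tokens []string) string {
	return strings.Join(tokens, " ")
}

// translateClusters is a placeholder for the cluster pipeline. It reports an
// error when it cannot produce a result so the caller can fall back.
func translateClusters(tokens []string) (string, error) {
	if len(tokens) == 0 {
		return "", errors.New("no clusters parsed")
	}
	return "[cluster] " + strings.Join(tokens, " "), nil
}

// Translate picks the cluster path when enabled and falls back otherwise.
func Translate(tokens []string, useCluster bool) string {
	if useCluster {
		if out, err := translateClusters(tokens); err == nil {
			return out
		}
	}
	return translateTokens(tokens)
}

// run parses the command line and translates the remaining arguments.
func run(args []string) (string, error) {
	fs := flag.NewFlagSet("transdb", flag.ContinueOnError)
	cluster := fs.Bool("cluster", false, "use the cluster pipeline")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return Translate(fs.Args(), *cluster), nil
}

func main() {
	out, err := run(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println(out)
}
```

Keeping both paths callable from one entry point also makes the quality comparison in step 6 a matter of running the same sample twice.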

Test cases

JA→EN (steps 1-6)

| Source | Expected | Tests |
|--------|----------|-------|
| 猫が魚を食べた | cat ate fish | Basic SOV→SVO, object positioning |
| 彼は東京に行った | he went to Tokyo | Locative に via SemanticPlace coord |
| 友達にあげた | gave to friend | Dative に via SemanticHuman coord |
| 食べている鳥が鳴いた | the eating bird chirped | RC (Bverb inside NP) + SemanticAnim→鳴く |
| 東京は大きい都市だ | Tokyo is a big city | Copular: no Bverb, overt copula in EN |
| 東京は大きい | Tokyo is big | Copular: predicate adjective, no だ |

EN→JA (step 7)

| Source | Expected | Tests |
|--------|----------|-------|
| the bird sang | 鳥は鳴いた | SemanticAnim subject → 鳴く over 歌う |
| she sang beautifully | 彼女は美しく歌った | SemanticHuman subject → 歌う |
| he went to Tokyo | 彼は東京に行った | Locative PP → postposition に |

Key invariants