# Cluster-based Phrase Transformation Plan

## Architecture

Replace token-by-token translation with a four-stage pipeline:

```
source tokens
    → ParseClusters         (phrase segmentation, particle/position mode)
    → TranslateCluster      (semantic-coord lookup per cluster)
    → ReorderClusters       (canonical order from language descriptor)
    → InsertMarkers         (particles, prepositions, copula)
    → string output
```

All stages are driven by language descriptor records in the lattice. Adding a new language means registering a descriptor record and a particle role map; no code changes are required.

---

## Step 1 — Language descriptor records

New file: `transdb/langdesc.mx`

Record at `MakeKey(lang, 0, "")` — empty word, lang-scoped:

```
Branch bits 0-2: canonical order  (SVO=0, SOV=1, VSO=2, VOS=3, OVS=4, OSV=5)
Branch bit 3:    head direction   (initial=0, final=1)
Branch bit 4:    parsing mode     (position-bounded=0, particle-bounded=1)
Branch bit 5:    relative clause  (post-nominal=0, pre-nominal=1)
Branch bit 6:    copula           (overt=0, zero-copula=1)
DataFile bits 6-7: marker system (prepositional=0, postpositional=1, case=2)
```

| Language | Order | Head | Mode | RC | Copula | Markers |
|----------|-------|------|------|----|--------|---------|
| EN | SVO | initial | position | post | overt | prep |
| JA | SOV | final | particle | pre | zero | postpos |
| KO | SOV | final | particle | pre | overt | postpos |
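
The bit layout above can be sketched as a pack/unpack pair. This is a minimal illustration in Go (the Moxie snippets in this plan read as Go-like), not the contents of `langdesc.mx`. The field names, the `EncodeLangDesc`/`DecodeLangDesc` helpers, and the assumption that Branch and DataFile are each 8 bits wide are hypothetical.

```go
package main

// Canonical word orders, numbered as in the bit layout above.
const (
	OrderSVO uint8 = iota
	OrderSOV
	OrderVSO
	OrderVOS
	OrderOVS
	OrderOSV
)

// Marker systems stored in DataFile bits 6-7.
const (
	MarkerPrep uint8 = iota
	MarkerPostpos
	MarkerCase
)

// LangDesc is a hypothetical decoded view of a descriptor record.
type LangDesc struct {
	Order        uint8 // Branch bits 0-2
	HeadFinal    bool  // Branch bit 3
	ParticleMode bool  // Branch bit 4
	PreNominalRC bool  // Branch bit 5
	ZeroCopula   bool  // Branch bit 6
	Markers      uint8 // DataFile bits 6-7
}

func bit(b bool) uint8 {
	if b {
		return 1
	}
	return 0
}

// EncodeLangDesc packs a descriptor into the Branch and DataFile bytes.
// Assumption: both fields are 8 bits wide.
func EncodeLangDesc(d LangDesc) (branch, dataFile uint8) {
	branch = d.Order & 0x07
	branch |= bit(d.HeadFinal) << 3
	branch |= bit(d.ParticleMode) << 4
	branch |= bit(d.PreNominalRC) << 5
	branch |= bit(d.ZeroCopula) << 6
	dataFile = (d.Markers & 0x03) << 6
	return branch, dataFile
}

// DecodeLangDesc is the inverse of EncodeLangDesc.
func DecodeLangDesc(branch, dataFile uint8) LangDesc {
	return LangDesc{
		Order:        branch & 0x07,
		HeadFinal:    branch&(1<<3) != 0,
		ParticleMode: branch&(1<<4) != 0,
		PreNominalRC: branch&(1<<5) != 0,
		ZeroCopula:   branch&(1<<6) != 0,
		Markers:      (dataFile >> 6) & 0x03,
	}
}
```

Under this layout the EN row of the table encodes to all-zero bits, while the JA row sets order SOV plus bits 3 through 6 and the postpositional marker code.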

New command: `transdb lang-init -lang <en|ja|ko|...>`

---

## Step 2 — Particle role map records

Particle roles stored in branch 2 (Bcooccur, currently empty — repurposed as role registry) at `MakeKey(lang, NP_semantic_coord, particle_form)`.

Non-ambiguous particles at coord=0:
```
MakeKey(LangJA, 0, "は") → role = NP_subj_topic
MakeKey(LangJA, 0, "が") → role = NP_subj_gram
MakeKey(LangJA, 0, "を") → role = NP_obj_direct
MakeKey(LangJA, 0, "から") → role = PP_source
MakeKey(LangJA, 0, "まで") → role = PP_limit
MakeKey(LangJA, 0, "と") → role = PP_comitative
```

Ambiguous particles with coord-keyed entries (RelaxCoord resolves via NP semantic flags):
```
MakeKey(LangJA, SemanticPlace<<CoordSemanticShift,   "に") → role = PP_locative
MakeKey(LangJA, SemanticEvent<<CoordSemanticShift,   "に") → role = PP_temporal
MakeKey(LangJA, SemanticHumanObj<<CoordSemanticShift,"に") → role = NP_dative
MakeKey(LangJA, 0,                                   "に") → role = NP_dative (default)
MakeKey(LangJA, SemanticPlace<<CoordSemanticShift,   "で") → role = PP_locative_static
MakeKey(LangJA, 0,                                   "で") → role = PP_instrumental
```
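
The lookup order implied by these records (a coord-keyed entry first, then the coord=0 default) can be sketched as follows. This Go sketch is a stand-in, not `RelaxCoord` itself, whose algorithm this plan does not describe. The `Role` numbering, the `Sem*` flag values, the map-based table, and the flag priority order are all assumptions, and the flags are used directly as keys instead of being shifted by `CoordSemanticShift`.

```go
package main

// Role codes are hypothetical; the plan does not fix their numbering.
type Role uint8

const (
	RoleUnknown Role = iota
	NPSubjTopic
	NPSubjGram
	NPObjDirect
	NPDative
	PPSource
	PPLimit
	PPComitative
	PPLocative
	PPTemporal
	PPLocativeStatic
	PPInstrumental
)

// Hypothetical flag bits standing in for SemanticPlace, SemanticEvent
// and SemanticHumanObj.
const (
	SemPlace uint64 = 1 << iota
	SemEvent
	SemHumanObj
)

// roleKey mirrors MakeKey(lang, coord, particle) for a single language.
type roleKey struct {
	coord    uint64
	particle string
}

// jaRoles holds the JA entries listed above.
var jaRoles = map[roleKey]Role{
	{0, "は"}:           NPSubjTopic,
	{0, "が"}:           NPSubjGram,
	{0, "を"}:           NPObjDirect,
	{0, "から"}:          PPSource,
	{0, "まで"}:          PPLimit,
	{0, "と"}:           PPComitative,
	{SemPlace, "に"}:    PPLocative,
	{SemEvent, "に"}:    PPTemporal,
	{SemHumanObj, "に"}: NPDative,
	{0, "に"}:           NPDative,
	{SemPlace, "で"}:    PPLocativeStatic,
	{0, "で"}:           PPInstrumental,
}

// LookupParticleRole tries a coord-keyed entry for each semantic flag set
// on the NP, then falls back to the coord=0 default.
// Assumption: flags are tried in the fixed order place, event, human.
func LookupParticleRole(table map[roleKey]Role, particle string, npFlags uint64) Role {
	for _, flag := range []uint64{SemPlace, SemEvent, SemHumanObj} {
		if npFlags&flag == 0 {
			continue
		}
		if r, ok := table[roleKey{flag, particle}]; ok {
			return r
		}
	}
	if r, ok := table[roleKey{0, particle}]; ok {
		return r
	}
	return RoleUnknown
}
```

With this table, に after a place noun resolves to the locative role, while に after a noun with no matching flag falls through to the dative default.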

---

## Step 3 — Cluster parser

New file: `transdb/cluster.mx`

```moxie
type ClusterType uint8
const (
    ClusterNPSubj ClusterType = 0
    ClusterNPObj  ClusterType = 1
    ClusterVP     ClusterType = 2
    ClusterPP     ClusterType = 3
    ClusterMod    ClusterType = 4
)

type Cluster struct {
    Kind    ClusterType
    Tokens  []string
    Flags   uint64     // accumulated semantic flags
    Role    uint8      // particle role code
    Nested  []*Cluster // relative clause VP if present
    Trans   string     // filled by TranslateCluster
    Copular bool       // sentence is copular (no Bverb in predicate zone)
}
```

### Verb detection — CRITICAL

RC signal check: `jaRecordBranch(tree, tok) == uint8(lattice.Bverb)` — Bverb (branch 3) only.

- い-adjectives (大きい, 美しい) → Bmodifier → NOT a verb → do NOT trigger RC detection
- な-adjective stems (きれい) → Bnoun → NOT a verb → do NOT trigger RC detection
- Verbal nouns (勉強) → Bnoun → NOT a verb

大きい猫 stays a modified NP because 大きい is Bmodifier, not Bverb.

### て-form dependency

`ParseClusters` assumes `TokenizeJA` has already run and collapsed progressive compounds. 食べている is produced as a SINGLE Bverb token by the morph-coord tokenizer (inferMorphState("食べている") → MorphPresProgPlain=8, coord-keyed lattice record found). The cluster parser never sees a split て-form. This dependency must be maintained — cluster parsing must run on already-tokenized input, not raw text.

### Particle-bounded parser (JA, KO, TR)

```
acc := []
for each tok in tokens:
    if isParticle(tok, lang):
        role := LookupParticleRole(lang, tok, SemanticFlagsOf(acc))
        // A Bverb token inside acc means acc holds a relative clause plus its head noun
        if containsFiniteVerb(acc):
            vpTokens, headTokens := splitAtVerbBoundary(acc)
            nested := ParseClusters(vpTokens, tree, lang)  // recurse
            emit Cluster{Tokens: headTokens, Role: role, Nested: nested}
        else:
            emit Cluster{Tokens: acc, Role: role}
        acc = []
    else:
        // Verbs are accumulated like any other token. A Bverb
        // (jaRecordBranch(tree, tok) == Bverb) inside acc is the relative
        // clause signal; the next particle boundary decides.
        // (i-adjectives are Bmodifier, never trigger here)
        acc.append(tok)

// Trailing tokens = predicate zone
if len(acc) > 0:
    copular := !containsFiniteVerb(acc)  // no Bverb = copular sentence
    emit Cluster{Tokens: acc, Kind: ClusterVP, Copular: copular}
```
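
For reference, the same loop as a compilable Go sketch. It is a simplified illustration under stated assumptions, not the planned `cluster.mx` code: the `Lexicon` type replaces the lattice lookups (`isParticle`, `jaRecordBranch`, `jaFunctionWord`), roles are plain strings, semantic flags are omitted, and the verb boundary is assumed to fall right after the last Bverb token in the accumulator.

```go
package main

// Cluster is a reduced version of the struct defined above.
type Cluster struct {
	Kind    string // "VP" for the predicate zone; left empty otherwise
	Tokens  []string
	Role    string
	Nested  []*Cluster
	Copular bool
}

// Lexicon is a hypothetical stand-in for the lattice lookups.
type Lexicon struct {
	Particles map[string]string // particle form -> role name
	Verbs     map[string]bool   // tokens whose record is in Bverb
	Function  map[string]bool   // function words such as だ/です
}

// lastVerb returns the index of the last Bverb token, or -1 if none.
func (lx Lexicon) lastVerb(toks []string) int {
	for i := len(toks) - 1; i >= 0; i-- {
		if lx.Verbs[toks[i]] {
			return i
		}
	}
	return -1
}

// ParseClusters segments already-tokenized input at particle boundaries.
func ParseClusters(toks []string, lx Lexicon) []*Cluster {
	var out []*Cluster
	var acc []string
	for _, tok := range toks {
		role, isParticle := lx.Particles[tok]
		if !isParticle {
			// Function words are consumed silently; everything else,
			// verbs included, waits for the next particle boundary.
			if !lx.Function[tok] {
				acc = append(acc, tok)
			}
			continue
		}
		c := &Cluster{Role: role}
		if v := lx.lastVerb(acc); v >= 0 {
			// Bverb inside the accumulator: relative clause + head noun.
			c.Nested = ParseClusters(acc[:v+1], lx)
			c.Tokens = acc[v+1:]
		} else {
			c.Tokens = acc
		}
		out = append(out, c)
		acc = nil
	}
	if len(acc) > 0 {
		// Trailing tokens form the predicate zone.
		out = append(out, &Cluster{
			Kind:    "VP",
			Tokens:  acc,
			Copular: lx.lastVerb(acc) < 0,
		})
	}
	return out
}
```

Feeding it the tokens 食べている, 鳥, が, 鳴いた yields a subject cluster headed by 鳥 with a nested VP for 食べている, followed by a non-copular predicate cluster for 鳴いた.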

### Copula detection — single rule

`Copular = (no Bverb token in predicate zone)`

This covers all cases:
- 東京は大きい都市だ: だ is jaFunctionWord, no Bverb → copular ✓
- 東京は大きい: no Bverb, 大きい is Bmodifier → copular ✓
- 東京は都市だ: no Bverb (だ is filtered) → copular ✓
- 猫が走る: 走る is Bverb → NOT copular ✓

だ/です at sentence-final position are already in jaFunctionWord and are silently consumed by the particle/function-word detection. Copularity is signaled by the absence of a Bverb, not by the presence of だ.

### Position-bounded parser (EN, ZH, FR) — migration step 7

1. Find main verbs (IsENVerb equivalent — Bverb in EN lattice)
2. Pre-verbal contiguous nominals → NP_subj candidates
3. Post-verbal contiguous nominals → NP_obj candidates
4. Prepositions mark PP boundaries within groups
5. Relative markers (that/which/who) or bare participles → recurse
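
Items 1 through 4 of the list above can be sketched in Go as below. This is a rough illustration only: the verb and preposition sets are hypothetical stand-ins for lattice lookups, relative-clause recursion (item 5) is left out, only the first verb is treated as the main verb, and returning `nil` when no verb is found is an assumption about how a caller might fall back to the token-by-token path.

```go
package main

// Cluster is a reduced version of the plan's struct; Marker is a
// hypothetical field holding the preposition that opened a PP.
type Cluster struct {
	Kind   string // "NP_subj", "VP", "NP_obj", "PP"
	Marker string
	Tokens []string
}

// ParsePositional segments a position-bounded sentence around its first verb.
func ParsePositional(toks []string, verbs, preps map[string]bool) []*Cluster {
	// 1. Find the main verb (first Bverb-like token in this sketch).
	verbAt := -1
	for i, t := range toks {
		if verbs[t] {
			verbAt = i
			break
		}
	}
	if verbAt < 0 {
		return nil // no verb found; caller decides what to do
	}

	var out []*Cluster
	// 2. Pre-verbal tokens become the subject candidate.
	if verbAt > 0 {
		out = append(out, &Cluster{Kind: "NP_subj", Tokens: toks[:verbAt]})
	}
	out = append(out, &Cluster{Kind: "VP", Tokens: toks[verbAt : verbAt+1]})

	// 3 and 4. Post-verbal tokens start as an object candidate; each
	// preposition closes the current group and opens a PP.
	cur := &Cluster{Kind: "NP_obj"}
	for _, t := range toks[verbAt+1:] {
		if preps[t] {
			if len(cur.Tokens) > 0 {
				out = append(out, cur)
			}
			cur = &Cluster{Kind: "PP", Marker: t}
			continue
		}
		cur.Tokens = append(cur.Tokens, t)
	}
	if len(cur.Tokens) > 0 {
		out = append(out, cur)
	}
	return out
}
```

On the tokens he, went, to, Tokyo this produces a subject cluster, a VP, and a PP whose marker is "to".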

---

## Step 4 — Cluster translation

`TranslateCluster(c *Cluster, tree, pool, srcLang, dstLang uint8) string`

1. Identify head word (last for head-final source, first for head-initial)
2. Compute lookup coord: `PackCoord(c.Flags, 0, 0, morphstate, 0, 0, 0)` — cluster semantic flags in semantic axis, morphstate from verb form
3. Look up head word in target language via RelaxCoord — finds best semantic-coord match
4. Translate modifiers (inherit parent cluster semantic context)
5. Assemble per target descriptor's head-direction bit

VP clusters: morphstate from source verb's conjugation flows into coord → target verb found at matching morph coord automatically.

---

## Step 5 — Cluster reordering

`ReorderClusters(clusters []*Cluster, srcDesc, dstDesc LangDesc) []*Cluster`

Pure rearrangement, no translation. Maps source cluster roles to target canonical positions:

```
srcOrder = {NP_subj:0, VP:1, NP_obj:2}   // SVO
dstOrder = {NP_subj:0, NP_obj:1, VP:2}   // SOV
```

Bare modifiers (ClusterMod) attach to nearest semantically compatible cluster by flag overlap.
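
A minimal Go sketch of the rearrangement, assuming clusters already carry a kind and that only the target order matters for a pure sort. Two points are assumptions the plan does not settle: PP and other non-core clusters are placed in the object slot, and `ClusterMod` attachment by flag overlap is not modeled. The string kinds and the `slot` helper are hypothetical.

```go
package main

import (
	"sort"
	"strings"
)

// Cluster is a reduced version of the plan's struct.
type Cluster struct {
	Kind  string // "NP_subj", "NP_obj", "VP", "PP", ...
	Trans string
}

// canonicalOrders is indexed by the descriptor's order code (bits 0-2).
var canonicalOrders = [6]string{"SVO", "SOV", "VSO", "VOS", "OVS", "OSV"}

// slot returns the position of a cluster kind within a canonical order.
func slot(kind string, order uint8) int {
	letter := byte('O') // assumption: PP and other clusters share the object slot
	switch kind {
	case "NP_subj":
		letter = 'S'
	case "VP":
		letter = 'V'
	}
	return strings.IndexByte(canonicalOrders[order], letter)
}

// ReorderClusters returns a copy of clusters arranged in the target order.
// The sort is stable, so clusters sharing a slot keep their source order.
func ReorderClusters(clusters []*Cluster, dstOrder uint8) []*Cluster {
	out := append([]*Cluster(nil), clusters...)
	if int(dstOrder) >= len(canonicalOrders) {
		return out // unknown order code: leave the sequence untouched
	}
	sort.SliceStable(out, func(i, j int) bool {
		return slot(out[i].Kind, dstOrder) < slot(out[j].Kind, dstOrder)
	})
	return out
}
```

For a JA source such as 猫が魚を食べた, the clusters arrive as subject, object, VP and leave as subject, VP, object when the target order code is 0 (SVO).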

---

## Step 6 — Marker insertion

`InsertMarkers(clusters []*Cluster, tree, dstDesc LangDesc, dstLang uint8) string`

For each cluster in target order:

- Particle-bounded target (JA output): append postposition after cluster
  - `LookupTargetMarker(dstLang, cluster.Role, cluster.Flags)` → は/を/に/で
- Position-bounded target (EN output): prepend preposition before PP clusters
  - `LookupTargetMarker(dstLang, cluster.Role, cluster.Flags)` → in/to/from/with
- Copular sentence (`cluster.Copular == true`):
  - overt-copula target (EN): insert "is/are" between NP_subj and predicate cluster
  - zero-copula target (JA): omit copula, emit predicate directly
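
The three rules above can be sketched in Go as follows. This is an illustration under assumptions, not the planned implementation: the `Target` struct and its `Markers` map stand in for the descriptor bits and `LookupTargetMarker`, roles and kinds are plain strings, the copula form is a fixed string (no is/are agreement), and articles are not generated.

```go
package main

import "strings"

// Cluster is a reduced version of the plan's struct.
type Cluster struct {
	Kind    string // "VP" marks the predicate cluster
	Role    string
	Trans   string
	Copular bool
}

// Target is a hypothetical bundle of what InsertMarkers needs to know
// about the destination language.
type Target struct {
	Postpositional bool              // marker follows the cluster
	ZeroCopula     bool              // omit the copula entirely
	Copula         string            // overt copula form, e.g. "is"
	Markers        map[string]string // role -> marker form
	Sep            string            // separator between output pieces
}

// InsertMarkers renders clusters that are already in target order.
func InsertMarkers(clusters []*Cluster, t Target) string {
	var parts []string
	for _, c := range clusters {
		marker := t.Markers[c.Role]
		switch {
		case c.Kind == "VP" && c.Copular && !t.ZeroCopula:
			// Overt-copula target: copula goes before the predicate,
			// which follows the subject in target order.
			parts = append(parts, t.Copula, c.Trans)
		case marker == "":
			parts = append(parts, c.Trans)
		case t.Postpositional:
			parts = append(parts, c.Trans+marker)
		default:
			parts = append(parts, marker, c.Trans)
		}
	}
	return strings.Join(parts, t.Sep)
}
```

With an EN-like target (prepositional, overt copula "is", space separator), the clusters for he / went / Tokyo with a locative role on the last one render as "he went to Tokyo"; a JA-like target with an empty separator appends は and に instead.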

---

## File structure

| File | Lines | Purpose |
|------|-------|---------|
| `transdb/langdesc.mx` | ~120 | LangDesc struct, GetLangDesc, RegisterLangDesc, encoding constants |
| `transdb/cluster.mx` | ~350 | Cluster struct, ParseClusters (both modes + recursion), TranslateCluster, ReorderClusters, InsertMarkers |
| `transdb/translate.mx` | ~30 changed | Replace token loop in Translate() with cluster pipeline; keep old path as fallback |
| `cmd/transdb/main.mx` | ~120 added | lang-init command (descriptor + particle map registration) |

Total new and changed code: ~620 lines.

---

## Migration sequence

1. Implement `lang-init`, register EN + JA descriptors + particle role maps
2. Implement `ParseClusters` — particle-bounded only (JA→EN direction first)
3. Implement `TranslateCluster` — reuses existing coord lookup machinery
4. Implement `ReorderClusters` — pure array rearrangement
5. Implement `InsertMarkers` — postposition/preposition/copula insertion
6. Gate behind `-cluster` flag; run quality sample JA→EN vs current token-by-token
7. Add position-bounded parser for EN→JA direction
8. Remove flag when cluster path matches or exceeds token-by-token quality

---

## Test cases

### JA→EN (migration steps 1-6)

| Source | Expected | Tests |
|--------|----------|-------|
| 猫が魚を食べた | cat ate fish | Basic SOV→SVO, object positioning |
| 彼は東京に行った | he went to Tokyo | Locative に via SemanticPlace coord |
| 友達にあげた | gave to friend | Dative に via SemanticHumanObj coord |
| 食べている鳥が鳴いた | the eating bird chirped | RC (Bverb inside NP) + SemanticAnim→鳴く |
| 東京は大きい都市だ | Tokyo is a big city | Copular: no Bverb, overt copula in EN |
| 東京は大きい | Tokyo is big | Copular: predicate adjective, no だ |

### EN→JA (migration step 7)

| Source | Expected | Tests |
|--------|----------|-------|
| the bird sang | 鳥は鳴いた | SemanticAnim subject → 鳴く over 歌う |
| she sang beautifully | 彼女は美しく歌った | SemanticHuman subject → 歌う |
| he went to Tokyo | 彼は東京に行った | Locative PP → postposition に |

---

## Key invariants

- Cluster parsing runs on `TokenizeJA` output — morph-coord tokenizer has already collapsed progressives (食べている = single token). Never parse raw text.
- RC detection uses `jaRecordBranch == Bverb` only. Bmodifier adjectives never trigger it.
- Copula detection: `no Bverb in predicate zone` is the single rule. だ/です suppression is a consequence, not the cause.
- Ambiguous particle role resolution: `LookupParticleRole(lang, particle, NP_flags)` — RelaxCoord on the NP's semantic coord. Same mechanism as verb disambiguation.
- Three-level coord propagation (embedded VP → head noun → main VP) is just recursion of the same cluster mechanism. No special-case machinery.