name: distributed-systems
description: This skill should be used when designing or implementing distributed systems, understanding consensus protocols (Paxos, Raft, PBFT, Nakamoto, PnyxDB), analyzing CAP theorem trade-offs, implementing logical clocks (Lamport, Vector, ITC), or building fault-tolerant architectures. Provides comprehensive knowledge of consensus algorithms, Byzantine fault tolerance, adversarial oracle protocols, replication strategies, causality tracking, and distributed system design principles.
This skill provides deep knowledge of distributed systems design, consensus protocols, fault tolerance, and the fundamental trade-offs in building reliable distributed architectures.
The CAP theorem, introduced by Eric Brewer in 2000, states that a distributed data store cannot simultaneously provide more than two of:
In any distributed system over a network:
Behavior during partition: Refuses some requests to maintain consistency.
Examples:
Use when:
Behavior during partition: Continues serving requests, may return stale data.
Examples:
Use when:
Theoretical only: Cannot exist in distributed systems because partitions are inevitable.
Single-node databases are technically CA but aren't distributed.
PACELC extends CAP to address normal operation:
If there is a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.
| System | P: A or C | E: L or C |
|---|---|---|
| DynamoDB | A | L |
| Cassandra | A | L |
| MongoDB | C | C |
| PNUTS | C | L |
Every read returns the most recent write. Achieved through:
Trade-off: Higher latency, lower availability during failures.
If no new updates, all replicas eventually converge to the same state.
Variants:
Operations appear instantaneous at some point between invocation and response.
Provides:
Transactions appear to execute in some serial order.
Note: Linearizability ≠ Serializability. Linearizability constrains single-object operations in real time; serializability only requires that multi-object transactions appear in some serial order, with no real-time constraint.
Getting distributed nodes to agree on a single value despite failures.
Requirements:
Developed by Leslie Lamport (submitted 1989, published 1998), Paxos is the foundational consensus algorithm.
Phase 1a: Prepare
Proposer → Acceptors: PREPARE(n)
- n is unique proposal number
Phase 1b: Promise
Acceptor → Proposer: PROMISE(n, accepted_proposal)
- If n > highest_seen: promise to ignore lower proposals
- Return previously accepted proposal if any
Phase 2a: Accept
Proposer → Acceptors: ACCEPT(n, v)
- v = value from highest accepted proposal, or proposer's own value
Phase 2b: Accepted
Acceptor → Learners: ACCEPTED(n, v)
- If n >= highest_promised: accept the proposal
Decision: Value is decided when majority of acceptors accept it.
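The acceptor rules above can be sketched in code. This is a minimal single-decree Paxos acceptor plus the proposer's value-selection rule; the class and function names are illustrative, and message transport, learners, and retries are omitted.

```python
# Minimal single-decree Paxos acceptor (sketch; transport omitted).
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # number of the accepted proposal, if any
        self.accepted_v = None      # accepted value, if any

    def on_prepare(self, n):
        """Phase 1b: promise to ignore proposals lower than n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("PROMISE", n, self.accepted_n, self.accepted_v)
        return ("NACK", n)

    def on_accept(self, n, v):
        """Phase 2b: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return ("ACCEPTED", n, v)
        return ("NACK", n)

def choose_value(promises, own_value):
    """Phase 2a: adopt the value of the highest previously accepted proposal."""
    accepted = [(an, av) for (_, _, an, av) in promises if av is not None]
    return max(accepted)[1] if accepted else own_value
```

The key safety step is `choose_value`: once a majority has accepted some value, any later proposer's prepare phase will surface it, forcing the new proposer to re-propose it rather than its own value.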
Optimization for sequences of values:
Strengths:
Weaknesses:
Designed by Diego Ongaro and John Ousterhout (2013) for understandability.
1. Follower times out (no heartbeat from leader)
2. Becomes Candidate, increments term, votes for self
3. Requests votes from other servers
4. Wins with majority votes → becomes Leader
5. Loses (another leader) → becomes Follower
6. Timeout → starts new election
Safety: Only candidates with up-to-date logs can win.
1. Client sends command to Leader
2. Leader appends to local log
3. Leader sends AppendEntries to Followers
4. On majority acknowledgment: entry is committed
5. Leader applies to state machine, responds to client
6. Followers apply committed entries
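The commit rule in steps 3–5 can be sketched as follows. This is an illustrative leader-side fragment, not a full Raft implementation: elections, log consistency checks, and AppendEntries RPCs are omitted, and all names are assumptions.

```python
# Sketch of Raft's commit rule: an entry is committed once it is stored on a
# majority of servers, at which point the leader may apply and respond.
class Leader:
    def __init__(self, term, cluster_size):
        self.term = term
        self.cluster_size = cluster_size
        self.log = []                 # list of (term, command)
        self.commit_index = -1
        self.match_index = {}         # follower id -> highest replicated index

    def client_command(self, cmd):
        self.log.append((self.term, cmd))
        return len(self.log) - 1      # index of the new entry

    def on_append_ack(self, follower_id, index):
        self.match_index[follower_id] = index
        # Count the leader itself plus followers that have the entry.
        replicas = 1 + sum(1 for m in self.match_index.values() if m >= index)
        if replicas > self.cluster_size // 2 and index > self.commit_index:
            self.commit_index = index  # committed: safe to apply
```

In a 5-node cluster, an entry commits after two follower acknowledgments (3 of 5 copies), so losing any two servers cannot lose a committed entry.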
If two logs contain an entry with the same index and term, then the entries store the same command and all preceding entries are identical (the Log Matching Property).
Logical clock that increases with each election:
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Complex | Designed for clarity |
| Leader | Optional (Multi-Paxos) | Required |
| Log gaps | Allowed | Not allowed |
| Membership changes | Complex | Joint consensus |
| Implementations | Many variants | Consistent |
Developed by Castro and Liskov (1999) for Byzantine faults.
Nodes can behave arbitrarily:
Tolerates f Byzantine faults with 3f+1 nodes.
Why 3f+1?
Normal Operation (leader is honest):
1. REQUEST: Client → Primary (leader)
2. PRE-PREPARE: Primary → All replicas
- Primary assigns sequence number
3. PREPARE: Each replica → All replicas
- Validates pre-prepare
4. COMMIT: Each replica → All replicas
- After receiving 2f+1 prepares
5. REPLY: Each replica → Client
- After receiving 2f+1 commits
Client waits for f+1 matching replies.
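The quorum sizes used above (2f+1 within, f+1 replies) follow directly from n = 3f+1. A small arithmetic sketch, with illustrative names:

```python
# Why PBFT needs n = 3f + 1: quorums of size 2f + 1 must (a) be reachable even
# when f nodes are silent, and (b) any two quorums must overlap in f + 1 nodes,
# so the overlap always contains at least one honest node.
def pbft_sizes(f: int):
    n = 3 * f + 1
    quorum = 2 * f + 1
    overlap = 2 * quorum - n      # = f + 1
    return n, quorum, overlap

n, q, overlap = pbft_sizes(1)     # f = 1 -> n = 4, quorum = 3, overlap = 2
```

Because the overlap between any two quorums exceeds f, Byzantine nodes alone can never make two conflicting requests both gather a quorum.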
When primary appears faulty:
Scalability challenge: Quadratic messaging limits cluster size.
The consensus mechanism powering Bitcoin, introduced by Satoshi Nakamoto (2008).
Combines three elements:
1. Transactions broadcast to network
2. Miners collect transactions into blocks
3. Miners race to solve PoW puzzle:
- Find nonce such that Hash(block_header) < target
- Difficulty adjusts to maintain ~10 min block time
4. First miner to solve broadcasts block
5. Other nodes verify and append to longest chain
6. Miner receives block reward + transaction fees
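Step 3's puzzle can be sketched as a brute-force nonce search. This is a toy model: real Bitcoin hashes an 80-byte header with double SHA-256 and a much harder target; the header bytes and target here are arbitrary assumptions chosen so the sketch runs quickly.

```python
# Toy proof-of-work: find a nonce so that SHA-256(header || nonce),
# read as an integer, falls below the difficulty target.
import hashlib

def mine(header: bytes, target: int) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce            # first nonce meeting the difficulty
        nonce += 1

# Easy target (top 12 bits zero) so the search finishes in a few thousand tries.
target = 1 << (256 - 12)
nonce = mine(b"block-header", target)
```

Halving the target doubles the expected work, which is how difficulty adjustment holds the ~10 minute block interval.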
When forks occur:
Chain A: [genesis] → [1] → [2] → [3]
Chain B: [genesis] → [1] → [2'] → [3'] → [4']
Nodes follow Chain B (more accumulated work)
Chain A blocks become "orphaned"
Note: Actually "most accumulated work" not "most blocks"—a chain with fewer but harder blocks wins.
Honest Majority Assumption: Protocol secure if honest mining power > 50%.
Formal analysis (Ren 2019):
Safe if: g²α > β
Where:
α = honest mining rate
β = adversarial mining rate
g = growth rate accounting for network delay
Δ = maximum network delay
Implications:
No instant finality—deeper blocks are exponentially harder to reverse:
| Confirmations | Attack Probability (30% attacker) |
|---|---|
| 1 | ~50% |
| 3 | ~12% |
| 6 | ~0.2% |
| 12 | ~0.003% |
Convention: 6 confirmations (~1 hour) considered "final" for Bitcoin.
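The shape of the table above can be reproduced with the gambler's-ruin approximation from Nakamoto's analysis: an attacker with hashrate fraction q who is z blocks behind eventually catches up with probability (q/(1-q))^z. The table's figures are rough; Nakamoto's full calculation adds a Poisson term for the attacker's progress during the confirmation window, so treat this as an order-of-magnitude sketch.

```python
# Gambler's-ruin catch-up probability for an attacker z blocks behind
# with fraction q of total hashrate (sketch, not the full Nakamoto formula).
def catchup_probability(q: float, z: int) -> float:
    p = 1.0 - q
    return 1.0 if q >= p else (q / p) ** z

probs = {z: catchup_probability(0.30, z) for z in (1, 3, 6, 12)}
```

With q = 0.30 this decays geometrically in z, which is why a handful of confirmations already makes reversal economically unattractive, and why a majority attacker (q ≥ 0.5) can always catch up.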
51% Attack: Attacker with majority hashrate can:
Selfish Mining: Strategic block withholding to waste honest miners' work.
Long-Range Attacks: Not applicable to PoW (unlike PoS).
| Aspect | Nakamoto | Classical BFT |
|---|---|---|
| Finality | Probabilistic | Immediate |
| Throughput | Low (~7 TPS) | Higher |
| Participants | Permissionless | Permissioned |
| Energy | High (PoW) | Low |
| Fault tolerance | 50% hashrate | 33% nodes |
| Scalability | Global | Limited nodes |
Developed by Bonniot, Neumann, and Taïani (2019) for consortium applications.
Unlike leader-based BFT, PnyxDB uses leaderless quorums with conditional endorsements:
1. Client broadcasts transaction to endorsers
2. Endorsers evaluate against application-defined policies
3. If no conflicts: endorser sends acknowledgment
4. If conflicts detected: conditional endorsement specifying
which transactions must NOT be committed for this to be valid
5. Transaction commits when quorum of valid endorsements collected
6. BVP resolves conflicting transactions
Ensures termination with conflicting transactions:
Unique feature: nodes can endorse or reject transactions based on application-defined policies without compromising consistency.
Use cases:
Compared to BFT-SMaRt and Tendermint:
Clients → Leader → Followers
Pros: Simple, strong consistency possible.
Cons: Leader bottleneck, failover complexity.
| Type | Durability | Latency | Availability |
|---|---|---|---|
| Synchronous | Guaranteed | High | Lower |
| Asynchronous | At-risk | Low | Higher |
| Semi-synchronous | Balanced | Medium | Medium |
Multiple nodes accept writes, replicate to each other.
Use cases:
Challenges:
Any node can accept reads and writes.
Examples: Dynamo, Cassandra, Riak
n = total replicas
w = write quorum (nodes that must acknowledge write)
r = read quorum (nodes that must respond to read)
For strong consistency: w + r > n
Common configurations:
During partitions:
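The quorum condition can be checked mechanically. The configurations below are common illustrative choices, not prescriptions:

```python
# Quorum arithmetic (sketch): with n replicas, write quorum w, and read quorum r,
# every read sees the latest write whenever w + r > n, because any read quorum
# must intersect any write quorum in at least one replica.
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    return w + r > n

configs = {
    "read-heavy":  (3, 3, 1),   # cheap reads, expensive writes
    "write-heavy": (3, 1, 3),   # cheap writes, expensive reads
    "balanced":    (3, 2, 2),   # majority on both paths
}
assert all(is_strongly_consistent(*c) for c in configs.values())
assert not is_strongly_consistent(3, 1, 1)  # sloppy: reads may be stale
```

Dropping to w + r ≤ n trades consistency for availability: reads and writes can succeed on disjoint replica sets during a partition, at the cost of stale reads.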
Node stops responding. Simplest failure model.
Detection: Heartbeats, timeouts.
Tolerance: 2f+1 nodes for f failures (Paxos, Raft).
Arbitrary behavior including malicious.
Detection: Difficult without redundancy.
Tolerance: 3f+1 nodes for f failures (PBFT).
Nodes can't communicate with some other nodes.
Impact: Forces CP vs AP choice.
Recovery: Reconciliation after partition heals.
Multiple nodes believe they are leader.
Prevention:
Replicate deterministic state machine across nodes:
Requires: Total order broadcast (consensus)
Head → Node2 → Node3 → ... → Tail
Primary handles all operations, synchronously replicates to backups.
Failover: Backup promoted to primary on failure.
Intersecting sets ensure consistency:
- Is data loss acceptable?
- Can operations be reordered?
- Are conflicts resolvable?
- What's acceptable downtime?
- Geographic distribution needs?
- Partition recovery strategy?
- Latency targets?
- Throughput needs?
- Consistency cost tolerance?
Vulnerabilities:
Mitigations:
Vulnerabilities:
Mitigations:
| Scenario | Recommended | Rationale |
|---|---|---|
| Internal infrastructure | Raft | Simple, well-understood |
| High consistency needs | Raft/Paxos | Proven correctness |
| Public/untrusted network | PBFT variant | Byzantine tolerance |
| Blockchain | HotStuff/Tendermint | Linear complexity BFT |
| Eventually consistent | Dynamo-style | High availability |
| Global distribution | Multi-leader + CRDTs | Partition tolerance |
What must be persisted before acknowledgment:
Dynamic cluster membership:
Oracles bridge on-chain smart contracts with off-chain data, but introduce trust assumptions into trustless systems.
Definition: The security, authenticity, and trust conflict between third-party oracles and the trustless execution of smart contracts.
Core Challenge: Blockchains cannot verify correctness of external data. Oracles become:
Flash Loan Attacks:
1. Borrow large amount via flash loan (no collateral)
2. Manipulate price on DEX (large trade)
3. Oracle reads manipulated price
4. Smart contract executes with wrong price
5. Profit from arbitrage/liquidation
6. Repay flash loan in same transaction
Notable Example: Harvest Finance ($30M+ loss, 2020)
Cost of Corruption Analysis:
If oracle controls value V:
- Attack profit: V
- Attack cost: oracle stake + reputation
- Rational to attack if: profit > cost
Implication: Oracles must have stake > value they secure.
Multi-layer Security:
1. Multiple independent data sources
2. Multiple independent node operators
3. Aggregation (median, weighted average)
4. Reputation system
5. Cryptoeconomic incentives (staking)
Data Aggregation:
Nodes: [Oracle₁: $100, Oracle₂: $101, Oracle₃: $150, Oracle₄: $100]
Median: $100.50
Outlier (Oracle₃) has minimal impact
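The aggregation above is plain median-of-reports. A minimal sketch, with oracle names matching the scenario:

```python
# Median aggregation: the published price is the median of independent oracle
# submissions, so a single manipulated outlier barely moves the result.
import statistics

reports = {"Oracle1": 100, "Oracle2": 101, "Oracle3": 150, "Oracle4": 100}
price = statistics.median(reports.values())   # midpoint of the two middle values
```

With an even number of reports, `statistics.median` averages the two middle values, giving $100.50 here; corrupting Oracle3 alone cannot push the result toward $150.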
Node reputation based on:
- Historical accuracy
- Response time
- Uptime
- Stake amount
Job assignment weighted by reputation
Slashing for misbehavior
Resist single-block manipulation:
TWAP = Σ(price_i × duration_i) / total_duration
Example over 1 hour:
- 30 min at $100: 30 × 100 = 3000
- 20 min at $101: 20 × 101 = 2020
- 10 min at $150 (manipulation): 10 × 150 = 1500
TWAP = 6520 / 60 = $108.67 (vs $150 spot)
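The worked example above can be computed directly. A minimal sketch over (price, duration) observations; the function name and window are illustrative:

```python
# Time-weighted average price over recorded (price, minutes) observations.
def twap(observations):
    total = sum(duration for _, duration in observations)
    return sum(price * duration for price, duration in observations) / total

window = [(100, 30), (101, 20), (150, 10)]   # 10-minute manipulation to $150
average = twap(window)                        # far below the $150 spot price
```

Because each price is weighted by how long it held, an attacker must sustain a manipulated price for a large share of the window to move the TWAP materially.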
Prevent front-running oracle updates:
Phase 1 (Commit):
- Oracle commits: hash(price || salt)
- Cannot be read by others
Phase 2 (Reveal):
- Oracle reveals: price, salt
- Contract verifies hash matches
- All oracles reveal simultaneously
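The two phases above can be sketched with a hash commitment. This is a simplified off-chain model (real schemes run the verification in a smart contract); the function names are assumptions:

```python
# Commit-reveal sketch: publish hash(price || salt) first, reveal both later;
# the verifier recomputes the hash before trusting the price.
import hashlib
import secrets

def commit(price: int, salt: bytes) -> bytes:
    return hashlib.sha256(price.to_bytes(32, "big") + salt).digest()

def verify(commitment: bytes, price: int, salt: bytes) -> bool:
    return commit(price, salt) == commitment

salt = secrets.token_bytes(16)       # random salt hides the committed value
c = commit(100, salt)                # phase 1: only the hash is visible
assert verify(c, 100, salt)          # phase 2: reveal checks out
assert not verify(c, 101, salt)      # a different price fails verification
```

The salt matters: without it, an observer could brute-force small price values against the published hash before the reveal.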
Game-theoretic oracle coordination:
1. Multiple oracles submit answers
2. Consensus answer determined
3. Oracles matching consensus rewarded
4. Outliers penalized
Assumption: The honest answer is the "obvious" Schelling point.
Hardware-based oracle security:
TEE (Intel SGX, ARM TrustZone):
- Isolated execution environment
- Code attestation
- Protected memory
- External data fetching inside enclave
Benefits:
Limitations:
| Type | Source | Trust Model | Use Case |
|---|---|---|---|
| Price feeds | Exchanges | Multiple sources | DeFi |
| Randomness | VRF/DRAND | Cryptographic | Gaming, NFTs |
| Event outcomes | Manual report | Reputation | Prediction markets |
| Cross-chain | Other blockchains | Bridge security | Interoperability |
| Computation | Off-chain compute | Verifiable | Complex logic |
Optimistic Oracles:
1. Oracle posts answer + bond
2. Dispute window (e.g., 2 hours)
3. If disputed: escalate to arbitration
4. If not disputed: answer accepted
5. Incorrect oracle loses bond
Example: UMA Protocol's Optimistic Oracle
Physical clocks cannot reliably order events in distributed systems due to clock drift and synchronization issues. Logical clocks provide ordering based on causality.
Defined by Leslie Lamport (1978):
Event a happened-before event b (a → b) if:
1. a and b occur on the same process and a comes first, or
2. a is the sending of a message and b is its receipt, or
3. there exists an event c with a → c and c → b (transitivity).
If neither a → b nor b → a, the events are concurrent (a || b).
Simple scalar timestamps providing partial ordering.
Rules:
1. Each process maintains counter C
2. Before each event: C = C + 1
3. Send message m with timestamp C
4. On receive: C = max(C, message_timestamp) + 1
Properties:
Use cases:
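The four rules above fit in a few lines. A minimal sketch with an illustrative class name:

```python
# Lamport clock: tick before every local event or send; on receive,
# jump past the sender's timestamp so causality is preserved.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or message send."""
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        """Message receipt: advance past both clocks."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()
t_recv = b.receive(t_send)
assert t_recv > t_send       # a -> b implies C(a) < C(b)
```

The converse does not hold: C(a) < C(b) does not imply a → b, which is exactly the limitation vector clocks address.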
Array of counters, one per process. Captures full causality.
Structure (for n processes):
VC[1..n] where VC[i] is process i's logical time
Rules (at process i):
1. Before each event: VC[i] = VC[i] + 1
2. Send message with full vector VC
3. On receive from j:
for k in 1..n:
VC[k] = max(VC[k], received_VC[k])
VC[i] = VC[i] + 1
Comparison (for vectors V1 and V2):
V1 = V2 iff ∀i: V1[i] = V2[i]
V1 ≤ V2 iff ∀i: V1[i] ≤ V2[i]
V1 < V2 iff V1 ≤ V2 and V1 ≠ V2
V1 || V2 iff NOT(V1 ≤ V2) and NOT(V2 ≤ V1) # concurrent
Properties:
Trade-off: O(n) space per event, where n = number of processes.
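The comparison and merge rules above translate directly to code. A minimal sketch; the helper names are illustrative:

```python
# Vector clock helpers: component-wise comparison detects causal order,
# and a receive merges the vectors before ticking the local slot.
def vc_leq(v1, v2):
    """True iff v1 causally precedes or equals v2."""
    return all(a <= b for a, b in zip(v1, v2))

def concurrent(v1, v2):
    """True iff neither vector dominates the other."""
    return not vc_leq(v1, v2) and not vc_leq(v2, v1)

def vc_receive(local, received, i):
    """Process i receives a message: merge, then tick own slot."""
    merged = [max(a, b) for a, b in zip(local, received)]
    merged[i] += 1
    return merged

assert vc_leq((1, 0, 2), (1, 1, 2))        # causally ordered
assert concurrent((2, 0, 0), (0, 1, 0))    # neither dominates
```

Unlike Lamport clocks, this recovers the full happened-before relation: two events are causally ordered exactly when one vector dominates the other.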
Developed by Almeida, Baquero, and Fonte (2008) for dynamic systems.
Problem with Vector Clocks:
ITC Solution:
Core Operations:
fork(id): Split ID into two children
- Parent retains left half
- New process gets right half
join(id1, id2): Merge two IDs
- Combine ID trees
- Localized operation, no global coordination
event(id, stamp): Increment logical clock
peek(id, stamp): Read without increment
ID Space Representation:
1 # Full ID space
/ \
0 1 # After one fork
/ \
0 1 # After another fork (left child)
Stamp (Clock) Representation:
Example:
Initial: id=(1), stamp=0
Fork: id1=(1,0), stamp1=0
id2=(0,1), stamp2=0
Event at id1: stamp1=(0,(1,0))
Join id1+id2: id=(1), stamp=max of both
Advantages over Vector Clocks:
Use cases:
Specialization of vector clocks for tracking data versions.
Difference from Vector Clocks:
Usage in Dynamo-style systems:
Client reads with version vector V1
Client writes with version vector V2
Server compares:
- If V1 < current: stale read, conflict possible
- If V1 = current: safe update
- If V1 || current: concurrent writes, need resolution
Combines physical and logical time.
Structure:
HLC = (physical_time, logical_counter)
Rules:
1. On local/send event:
pt = physical_clock()
if pt > l:
l = pt
c = 0
else:
c = c + 1
return (l, c)
2. On receive with timestamp (l', c'):
pt = physical_clock()
if pt > l and pt > l':
l = pt
c = 0
elif l' > l:
l = l'
c = c' + 1
elif l > l':
c = c + 1
else: # l = l'
c = max(c, c') + 1
return (l, c)
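The two rules above can be made runnable. This sketch injects the physical clock (`now_fn`) so behavior is deterministic; the class and method names are assumptions:

```python
# Hybrid logical clock: (l, c) pairs where l tracks the largest physical time
# seen and c breaks ties when the physical clock has not advanced.
class HLC:
    def __init__(self, now_fn):
        self.now = now_fn    # injected physical clock
        self.l = 0           # physical component
        self.c = 0           # logical counter

    def send(self):
        """Local or send event."""
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, lm, cm):
        """Receive a message stamped (lm, cm)."""
        pt = self.now()
        if pt > self.l and pt > lm:
            self.l, self.c = pt, 0
        elif lm > self.l:
            self.l, self.c = lm, cm + 1
        elif self.l > lm:
            self.c += 1
        else:                       # self.l == lm
            self.c = max(self.c, cm) + 1
        return (self.l, self.c)
```

With a frozen physical clock, repeated sends bump only the counter; receiving a stamp from a node whose clock runs ahead adopts that larger physical component, keeping timestamps close to real time while preserving causal order.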
Properties:
| Clock Type | Space | Causality | Concurrency | Dynamic |
|---|---|---|---|---|
| Lamport | O(1) | Partial | No | Yes |
| Vector | O(n) | Full | Yes | No |
| ITC | O(log n)* | Full | Yes | Yes |
| HLC | O(1) | Partial | No | Yes |
*ITC space varies based on tree structure
Conflict Detection (Vector Clocks):
if V1 < V2:
    # V1 is an ancestor of V2, no conflict
elif V1 > V2:
    # V2 is an ancestor of V1, no conflict
else:  # V1 || V2
    # Concurrent updates, need conflict resolution
Causal Broadcast:
Deliver message m with VC only when:
1. VC[sender] = local_VC[sender] + 1 (next expected from sender)
2. ∀j ≠ sender: VC[j] ≤ local_VC[j] (all causal deps satisfied)
Snapshot Algorithms:
Consistent cut: set of events S where
if e ∈ S and f → e, then f ∈ S
Vector clocks make this efficiently verifiable
For detailed protocol specifications and proofs, see:
- references/consensus-protocols.md - Detailed protocol descriptions
- references/consistency-models.md - Formal consistency definitions
- references/failure-scenarios.md - Failure mode analysis
- references/logical-clocks.md - Clock algorithms and implementations