name: distributed-systems
description: This skill should be used when designing or implementing distributed systems, understanding consensus protocols (Paxos, Raft, PBFT, Nakamoto, PnyxDB), analyzing CAP theorem trade-offs, implementing logical clocks (Lamport, Vector, ITC), or building fault-tolerant architectures. Provides comprehensive knowledge of consensus algorithms, Byzantine fault tolerance, adversarial oracle protocols, replication strategies, causality tracking, and distributed system design principles.
This skill provides deep knowledge of distributed systems design, consensus protocols, fault tolerance, and the fundamental trade-offs in building reliable distributed architectures.
The CAP theorem, introduced by Eric Brewer in 2000, states that a distributed data store cannot simultaneously provide more than two of:
In any distributed system over a network:
Behavior during partition: Refuses some requests to maintain consistency.
Examples:
Use when:
Behavior during partition: Continues serving requests, may return stale data.
Examples:
Use when:
Theoretical only: Cannot exist in distributed systems because partitions are inevitable.
Single-node databases are technically CA but aren't distributed.
PACELC extends CAP to address normal operation:
If there is a Partition, choose between Availability and Consistency. Else (normal operation), choose between Latency and Consistency.
| System | P: A or C | E: L or C |
|---|---|---|
| DynamoDB | A | L |
| Cassandra | A | L |
| MongoDB | C | C |
| PNUTS | C | L |
Every read returns the most recent write. Achieved through:
Trade-off: Higher latency, lower availability during failures.
If no new updates, all replicas eventually converge to the same state.
Variants:
Operations appear instantaneous at some point between invocation and response.
Provides:
Transactions appear to execute in some serial order.
Note: Linearizability ≠ Serializability. Linearizability constrains single-object operations in real time; serializability only requires that multi-object transactions appear in some serial order, with no real-time constraint.
Getting distributed nodes to agree on a single value despite failures.
Requirements:
Developed by Leslie Lamport (submitted 1989, published 1998), Paxos is the foundational consensus algorithm.
Phase 1a: Prepare
Proposer → Acceptors: PREPARE(n)
- n is unique proposal number
Phase 1b: Promise
Acceptor → Proposer: PROMISE(n, accepted_proposal)
- If n > highest_seen: promise to ignore lower proposals
- Return previously accepted proposal if any
Phase 2a: Accept
Proposer → Acceptors: ACCEPT(n, v)
- v = value from highest accepted proposal, or proposer's own value
Phase 2b: Accepted
Acceptor → Learners: ACCEPTED(n, v)
- If n >= highest_promised: accept the proposal
Decision: Value is decided when majority of acceptors accept it.
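The acceptor rules above can be sketched in code. This is a minimal single-decree Paxos acceptor plus the proposer's value-selection rule; the class and function names are illustrative, and message transport, learners, and retries are omitted.

```python
# Minimal single-decree Paxos acceptor (sketch; transport omitted).
class Acceptor:
    def __init__(self):
        self.promised_n = -1        # highest proposal number promised
        self.accepted_n = -1        # number of the accepted proposal, if any
        self.accepted_v = None      # accepted value, if any

    def on_prepare(self, n):
        """Phase 1b: promise to ignore proposals lower than n."""
        if n > self.promised_n:
            self.promised_n = n
            return ("PROMISE", n, self.accepted_n, self.accepted_v)
        return ("NACK", n)

    def on_accept(self, n, v):
        """Phase 2b: accept unless a higher-numbered prepare was promised."""
        if n >= self.promised_n:
            self.promised_n = self.accepted_n = n
            self.accepted_v = v
            return ("ACCEPTED", n, v)
        return ("NACK", n)

def choose_value(promises, own_value):
    """Phase 2a: adopt the value of the highest previously accepted proposal."""
    accepted = [(an, av) for (_, _, an, av) in promises if av is not None]
    return max(accepted)[1] if accepted else own_value
```

The key safety step is `choose_value`: once a majority has accepted some value, any later proposer's prepare phase will surface it, forcing the new proposer to re-propose it rather than its own value.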
Optimization for sequences of values:
Strengths:
Weaknesses:
Designed by Diego Ongaro and John Ousterhout (2013) for understandability.
1. Follower times out (no heartbeat from leader)
2. Becomes Candidate, increments term, votes for self
3. Requests votes from other servers
4. Wins with majority votes → becomes Leader
5. Loses (another leader) → becomes Follower
6. Timeout → starts new election
Safety: Only candidates with up-to-date logs can win.
1. Client sends command to Leader
2. Leader appends to local log
3. Leader sends AppendEntries to Followers
4. On majority acknowledgment: entry is committed
5. Leader applies to state machine, responds to client
6. Followers apply committed entries
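The commit rule in steps 3–5 can be sketched as follows. This is an illustrative leader-side fragment, not a full Raft implementation: elections, log consistency checks, and AppendEntries RPCs are omitted, and all names are assumptions.

```python
# Sketch of Raft's commit rule: an entry is committed once it is stored on a
# majority of servers, at which point the leader may apply and respond.
class Leader:
    def __init__(self, term, cluster_size):
        self.term = term
        self.cluster_size = cluster_size
        self.log = []                 # list of (term, command)
        self.commit_index = -1
        self.match_index = {}         # follower id -> highest replicated index

    def client_command(self, cmd):
        self.log.append((self.term, cmd))
        return len(self.log) - 1      # index of the new entry

    def on_append_ack(self, follower_id, index):
        self.match_index[follower_id] = index
        # Count the leader itself plus followers that have the entry.
        replicas = 1 + sum(1 for m in self.match_index.values() if m >= index)
        if replicas > self.cluster_size // 2 and index > self.commit_index:
            self.commit_index = index  # committed: safe to apply
```

In a 5-node cluster, an entry commits after two follower acknowledgments (3 of 5 copies), so losing any two servers cannot lose a committed entry.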
If two logs contain an entry with the same index and term, then the entries store the same command and all preceding entries are identical (the Log Matching Property).
Logical clock that increases with each election:
| Aspect | Paxos | Raft |
|---|---|---|
| Understandability | Complex | Designed for clarity |
| Leader | Optional (Multi-Paxos) | Required |
| Log gaps | Allowed | Not allowed |
| Membership changes | Complex | Joint consensus |
| Implementations | Many variants | Consistent |
Developed by Castro and Liskov (1999) for Byzantine faults.
Nodes can behave arbitrarily:
Tolerates f Byzantine faults with 3f+1 nodes.
Why 3f+1?
Normal Operation (leader is honest):
1. REQUEST: Client → Primary (leader)
2. PRE-PREPARE: Primary → All replicas
- Primary assigns sequence number
3. PREPARE: Each replica → All replicas
- Validates pre-prepare
4. COMMIT: Each replica → All replicas
- After receiving 2f+1 prepares
5. REPLY: Each replica → Client
- After receiving 2f+1 commits
Client waits for f+1 matching replies.
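The quorum sizes used above (2f+1 within, f+1 replies) follow directly from n = 3f+1. A small arithmetic sketch, with illustrative names:

```python
# Why PBFT needs n = 3f + 1: quorums of size 2f + 1 must (a) be reachable even
# when f nodes are silent, and (b) any two quorums must overlap in f + 1 nodes,
# so the overlap always contains at least one honest node.
def pbft_sizes(f: int):
    n = 3 * f + 1
    quorum = 2 * f + 1
    overlap = 2 * quorum - n      # = f + 1
    return n, quorum, overlap

n, q, overlap = pbft_sizes(1)     # f = 1 -> n = 4, quorum = 3, overlap = 2
```

Because the overlap between any two quorums exceeds f, Byzantine nodes alone can never make two conflicting requests both gather a quorum.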
When primary appears faulty:
Scalability challenge: Quadratic messaging limits cluster size.
The consensus mechanism powering Bitcoin, introduced by Satoshi Nakamoto (2008).
Combines three elements:
1. Transactions broadcast to network
2. Miners collect transactions into blocks
3. Miners race to solve PoW puzzle:
- Find nonce such that Hash(block_header) < target
- Difficulty adjusts to maintain ~10 min block time
4. First miner to solve broadcasts block
5. Other nodes verify and append to longest chain
6. Miner receives block reward + transaction fees
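Step 3's puzzle can be sketched as a brute-force nonce search. This is a toy model: real Bitcoin hashes an 80-byte header with double SHA-256 and a much harder target; the header bytes and target here are arbitrary assumptions chosen so the sketch runs quickly.

```python
# Toy proof-of-work: find a nonce so that SHA-256(header || nonce),
# read as an integer, falls below the difficulty target.
import hashlib

def mine(header: bytes, target: int) -> int:
    nonce = 0
    while True:
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce            # first nonce meeting the difficulty
        nonce += 1

# Easy target (top 12 bits zero) so the search finishes in a few thousand tries.
target = 1 << (256 - 12)
nonce = mine(b"block-header", target)
```

Halving the target doubles the expected work, which is how difficulty adjustment holds the ~10 minute block interval.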
When forks occur:
Chain A: [genesis] → [1] → [2] → [3]
Chain B: [genesis] → [1] → [2'] → [3'] → [4']
Nodes follow Chain B (more accumulated work)
Chain A blocks become "orphaned"
Note: Actually "most accumulated work" not "most blocks"—a chain with fewer but harder blocks wins.
Honest Majority Assumption: Protocol secure if honest mining power > 50%.
Formal analysis (Ren 2019):
Safe if: g²α > β
Where:
α = honest mining rate
β = adversarial mining rate
g = growth rate accounting for network delay
Δ = maximum network delay
Implications:
No instant finality—deeper blocks are exponentially harder to reverse:
| Confirmations | Attack Probability (30% attacker) |
|---|---|
| 1 | ~50% |
| 3 | ~12% |
| 6 | ~0.2% |
| 12 | ~0.003% |
Convention: 6 confirmations (~1 hour) considered "final" for Bitcoin.
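The shape of the table above can be reproduced with the gambler's-ruin approximation from Nakamoto's analysis: an attacker with hashrate fraction q who is z blocks behind eventually catches up with probability (q/(1-q))^z. The table's figures are rough; Nakamoto's full calculation adds a Poisson term for the attacker's progress during the confirmation window, so treat this as an order-of-magnitude sketch.

```python
# Gambler's-ruin catch-up probability for an attacker z blocks behind
# with fraction q of total hashrate (sketch, not the full Nakamoto formula).
def catchup_probability(q: float, z: int) -> float:
    p = 1.0 - q
    return 1.0 if q >= p else (q / p) ** z

probs = {z: catchup_probability(0.30, z) for z in (1, 3, 6, 12)}
```

With q = 0.30 this decays geometrically in z, which is why a handful of confirmations already makes reversal economically unattractive, and why a majority attacker (q ≥ 0.5) can always catch up.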
51% Attack: Attacker with majority hashrate can:
Selfish Mining: Strategic block withholding to waste honest miners' work.
Long-Range Attacks: Not applicable to PoW (unlike PoS).
| Aspect | Nakamoto | Classical BFT |
|---|---|---|
| Finality | Probabilistic | Immediate |
| Throughput | Low (~7 TPS) | Higher |
| Participants | Permissionless | Permissioned |
| Energy | High (PoW) | Low |
| Fault tolerance | 50% hashrate | 33% nodes |
| Scalability | Global | Limited nodes |
Developed by Bonniot, Neumann, and Taïani (2019) for consortium applications.
Unlike leader-based BFT, PnyxDB uses leaderless quorums with conditional endorsements:
1. Client broadcasts transaction to endorsers
2. Endorsers evaluate against application-defined policies
3. If no conflicts: endorser sends acknowledgment
4. If conflicts detected: conditional endorsement specifying
which transactions must NOT be committed for this to be valid
5. Transaction commits when quorum of valid endorsements collected
6. BVP resolves conflicting transactions
Ensures termination with conflicting transactions:
Unique feature: nodes can endorse or reject transactions based on application-defined policies without compromising consistency.
Use cases:
Compared to BFT-SMaRt and Tendermint:
Clients → Leader → Followers
Pros: Simple, strong consistency possible.
Cons: Leader bottleneck, failover complexity.
| Type | Durability | Latency | Availability |
|---|---|---|---|
| Synchronous | Guaranteed | High | Lower |
| Asynchronous | At-risk | Low | Higher |
| Semi-synchronous | Balanced | Medium | Medium |
Multiple nodes accept writes, replicate to each other.
Use cases:
Challenges:
Any node can accept reads and writes.
Examples: Dynamo, Cassandra, Riak
n = total replicas
w = write quorum (nodes that must acknowledge write)
r = read quorum (nodes that must respond to read)
For strong consistency: w + r > n
Common configurations:
During partitions:
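The quorum condition can be checked mechanically. The configurations below are common illustrative choices, not prescriptions:

```python
# Quorum arithmetic (sketch): with n replicas, write quorum w, and read quorum r,
# every read sees the latest write whenever w + r > n, because any read quorum
# must intersect any write quorum in at least one replica.
def is_strongly_consistent(n: int, w: int, r: int) -> bool:
    return w + r > n

configs = {
    "read-heavy":  (3, 3, 1),   # cheap reads, expensive writes
    "write-heavy": (3, 1, 3),   # cheap writes, expensive reads
    "balanced":    (3, 2, 2),   # majority on both paths
}
assert all(is_strongly_consistent(*c) for c in configs.values())
assert not is_strongly_consistent(3, 1, 1)  # sloppy: reads may be stale
```

Dropping to w + r ≤ n trades consistency for availability: reads and writes can succeed on disjoint replica sets during a partition, at the cost of stale reads.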
Node stops responding. Simplest failure model.
Detection: Heartbeats, timeouts.
Tolerance: 2f+1 nodes for f failures (Paxos, Raft).
Arbitrary behavior including malicious.
Detection: Difficult without redundancy.
Tolerance: 3f+1 nodes for f failures (PBFT).
Nodes can't communicate with some other nodes.
Impact: Forces CP vs AP choice.
Recovery: Reconciliation after partition heals.
Multiple nodes believe they are leader.
Prevention:
Replicate deterministic state machine across nodes:
Requires: Total order broadcast (consensus)
Head → Node2 → Node3 → ... → Tail
Primary handles all operations, synchronously replicates to backups.
Failover: Backup promoted to primary on failure.
Intersecting sets ensure consistency:
- Is data loss acceptable?
- Can operations be reordered?
- Are conflicts resolvable?
- What's acceptable downtime?
- Geographic distribution needs?
- Partition recovery strategy?
- Latency targets?
- Throughput needs?
- Consistency cost tolerance?
Vulnerabilities:
Mitigations:
Vulnerabilities:
Mitigations:
| Scenario | Recommended | Rationale |
|---|---|---|
| Internal infrastructure | Raft | Simple, well-understood |
| High consistency needs | Raft/Paxos | Proven correctness |
| Public/untrusted network | PBFT variant | Byzantine tolerance |
| Blockchain | HotStuff/Tendermint | Linear complexity BFT |
| Eventually consistent | Dynamo-style | High availability |
| Global distribution | Multi-leader + CRDTs | Partition tolerance |
What must be persisted before acknowledgment:
Dynamic cluster membership:
Oracles bridge on-chain smart contracts with off-chain data, but introduce trust assumptions into trustless systems.
Definition: The security, authenticity, and trust conflict between third-party oracles and the trustless execution of smart contracts.
Core Challenge: Blockchains cannot verify correctness of external data. Oracles become:
Flash Loan Attacks:
1. Borrow large amount via flash loan (no collateral)
2. Manipulate price on DEX (large trade)
3. Oracle reads manipulated price
4. Smart contract executes with wrong price
5. Profit from arbitrage/liquidation
6. Repay flash loan in same transaction
Notable Example: Harvest Finance ($30M+ loss, 2020)
Cost of Corruption Analysis:
If oracle controls value V:
- Attack profit: V
- Attack cost: oracle stake + reputation
- Rational to attack if: profit > cost
Implication: Oracles must have stake > value they secure.
Multi-layer Security:
1. Multiple independent data sources
2. Multiple independent node operators
3. Aggregation (median, weighted average)
4. Reputation system
5. Cryptoeconomic incentives (staking)
Data Aggregation:
Nodes: [Oracle₁: $100, Oracle₂: $101, Oracle₃: $150, Oracle₄: $100]
Median: $100.50
Outlier (Oracle₃) has minimal impact
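The aggregation above is plain median-of-reports. A minimal sketch, with oracle names matching the scenario:

```python
# Median aggregation: the published price is the median of independent oracle
# submissions, so a single manipulated outlier barely moves the result.
import statistics

reports = {"Oracle1": 100, "Oracle2": 101, "Oracle3": 150, "Oracle4": 100}
price = statistics.median(reports.values())   # midpoint of the two middle values
```

With an even number of reports, `statistics.median` averages the two middle values, giving $100.50 here; corrupting Oracle3 alone cannot push the result toward $150.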
Node reputation based on:
- Historical accuracy
- Response time
- Uptime
- Stake amount
Job assignment weighted by reputation
Slashing for misbehavior
Resist single-block manipulation:
TWAP = Σ(price_i × duration_i) / total_duration
Example over 1 hour:
- 30 min at $100: 30 × 100 = 3000
- 20 min at $101: 20 × 101 = 2020
- 10 min at $150 (manipulation): 10 × 150 = 1500
TWAP = 6520 / 60 = $108.67 (vs $150 spot)
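The worked example above can be computed directly. A minimal sketch over (price, duration) observations; the function name and window are illustrative:

```python
# Time-weighted average price over recorded (price, minutes) observations.
def twap(observations):
    total = sum(duration for _, duration in observations)
    return sum(price * duration for price, duration in observations) / total

window = [(100, 30), (101, 20), (150, 10)]   # 10-minute manipulation to $150
average = twap(window)                        # far below the $150 spot price
```

Because each price is weighted by how long it held, an attacker must sustain a manipulated price for a large share of the window to move the TWAP materially.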
Prevent front-running oracle updates:
Phase 1 (Commit):
- Oracle commits: hash(price || salt)
- Cannot be read by others
Phase 2 (Reveal):
- Oracle reveals: price, salt
- Contract verifies hash matches
- All oracles reveal simultaneously
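The two phases above can be sketched with a hash commitment. This is a simplified off-chain model (real schemes run the verification in a smart contract); the function names are assumptions:

```python
# Commit-reveal sketch: publish hash(price || salt) first, reveal both later;
# the verifier recomputes the hash before trusting the price.
import hashlib
import secrets

def commit(price: int, salt: bytes) -> bytes:
    return hashlib.sha256(price.to_bytes(32, "big") + salt).digest()

def verify(commitment: bytes, price: int, salt: bytes) -> bool:
    return commit(price, salt) == commitment

salt = secrets.token_bytes(16)       # random salt hides the committed value
c = commit(100, salt)                # phase 1: only the hash is visible
assert verify(c, 100, salt)          # phase 2: reveal checks out
assert not verify(c, 101, salt)      # a different price fails verification
```

The salt matters: without it, an observer could brute-force small price values against the published hash before the reveal.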
Game-theoretic oracle coordination:
1. Multiple oracles submit answers
2. Consensus answer determined
3. Oracles matching consensus rewarded
4. Outliers penalized
Assumption: The honest answer is the "obvious" Schelling point.
Hardware-based oracle security:
TEE (Intel SGX, ARM TrustZone):
- Isolated execution environment
- Code attestation
- Protected memory
- External data fetching inside enclave
Benefits:
Limitations:
| Type | Source | Trust Model | Use Case |
|---|---|---|---|
| Price feeds | Exchanges | Multiple sources | DeFi |
| Randomness | VRF/DRAND | Cryptographic | Gaming, NFTs |
| Event outcomes | Manual report | Reputation | Prediction markets |
| Cross-chain | Other blockchains | Bridge security | Interoperability |
| Computation | Off-chain compute | Verifiable | Complex logic |
Optimistic Oracles:
1. Oracle posts answer + bond
2. Dispute window (e.g., 2 hours)
3. If disputed: escalate to arbitration
4. If not disputed: answer accepted
5. Incorrect oracle loses bond
Example: UMA Protocol's Optimistic Oracle
Physical clocks cannot reliably order events in distributed systems due to clock drift and synchronization issues. Logical clocks provide ordering based on causality.
Defined by Leslie Lamport (1978):
Event a happened-before event b (a → b) if:
1. a and b occur on the same process and a comes first, or
2. a is the sending of a message and b is its receipt, or
3. there exists an event c with a → c and c → b (transitivity).
If neither a → b nor b → a, the events are concurrent (a || b).
Simple scalar timestamps providing partial ordering.
Rules:
1. Each process maintains counter C
2. Before each event: C = C + 1
3. Send message m with timestamp C
4. On receive: C = max(C, message_timestamp) + 1
Properties:
Use cases:
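The four rules above fit in a few lines. A minimal sketch with an illustrative class name:

```python
# Lamport clock: tick before every local event or send; on receive,
# jump past the sender's timestamp so causality is preserved.
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self) -> int:
        """Local event or message send."""
        self.time += 1
        return self.time

    def receive(self, msg_time: int) -> int:
        """Message receipt: advance past both clocks."""
        self.time = max(self.time, msg_time) + 1
        return self.time

a, b = LamportClock(), LamportClock()
t_send = a.tick()
t_recv = b.receive(t_send)
assert t_recv > t_send       # a -> b implies C(a) < C(b)
```

The converse does not hold: C(a) < C(b) does not imply a → b, which is exactly the limitation vector clocks address.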
Array of counters, one per process. Captures full causality.
Structure (for n processes):
VC[1..n] where VC[i] is process i's logical time
Rules (at process i):
1. Before each event: VC[i] = VC[i] + 1
2. Send message with full vector VC
3. On receive from j:
for k in 1..n:
VC[k] = max(VC[k], received_VC[k])
VC[i] = VC[i] + 1
Comparison (for vectors V1 and V2):
V1 = V2 iff ∀i: V1[i] = V2[i]
V1 ≤ V2 iff ∀i: V1[i] ≤ V2[i]
V1 < V2 iff V1 ≤ V2 and V1 ≠ V2
V1 || V2 iff NOT(V1 ≤ V2) and NOT(V2 ≤ V1) # concurrent
Properties:
Trade-off: O(n) space per event, where n = number of processes.
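The comparison and merge rules above translate directly to code. A minimal sketch; the helper names are illustrative:

```python
# Vector clock helpers: component-wise comparison detects causal order,
# and a receive merges the vectors before ticking the local slot.
def vc_leq(v1, v2):
    """True iff v1 causally precedes or equals v2."""
    return all(a <= b for a, b in zip(v1, v2))

def concurrent(v1, v2):
    """True iff neither vector dominates the other."""
    return not vc_leq(v1, v2) and not vc_leq(v2, v1)

def vc_receive(local, received, i):
    """Process i receives a message: merge, then tick own slot."""
    merged = [max(a, b) for a, b in zip(local, received)]
    merged[i] += 1
    return merged

assert vc_leq((1, 0, 2), (1, 1, 2))        # causally ordered
assert concurrent((2, 0, 0), (0, 1, 0))    # neither dominates
```

Unlike Lamport clocks, this recovers the full happened-before relation: two events are causally ordered exactly when one vector dominates the other.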
Developed by Almeida, Baquero, and Fonte (2008) for dynamic systems.
Problem with Vector Clocks:
ITC Solution:
Core Operations:
fork(id): Split ID into two children
- Parent retains left half
- New process gets right half
join(id1, id2): Merge two IDs
- Combine ID trees
- Localized operation, no global coordination
event(id, stamp): Increment logical clock
peek(id, stamp): Read without increment
ID Space Representation:
1 # Full ID space
/ \
0 1 # After one fork
/ \
0 1 # After another fork (left child)
Stamp (Clock) Representation:
Example:
Initial: id=(1), stamp=0
Fork: id1=(1,0), stamp1=0
id2=(0,1), stamp2=0
Event at id1: stamp1=(0,(1,0))
Join id1+id2: id=(1), stamp=max of both
Advantages over Vector Clocks:
Use cases:
Specialization of vector clocks for tracking data versions.
Difference from Vector Clocks:
Usage in Dynamo-style systems:
Client reads with version vector V1
Client writes with version vector V2
Server compares:
- If V1 < current: stale read, conflict possible
- If V1 = current: safe update
- If V1 || current: concurrent writes, need resolution
Combines physical and logical time.
Structure:
HLC = (physical_time, logical_counter)
Rules:
1. On local/send event:
pt = physical_clock()
if pt > l:
l = pt
c = 0
else:
c = c + 1
return (l, c)
2. On receive with timestamp (l', c'):
pt = physical_clock()
if pt > l and pt > l':
l = pt
c = 0
elif l' > l:
l = l'
c = c' + 1
elif l > l':
c = c + 1
else: # l = l'
c = max(c, c') + 1
return (l, c)
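The two rules above can be made runnable. This sketch injects the physical clock (`now_fn`) so behavior is deterministic; the class and method names are assumptions:

```python
# Hybrid logical clock: (l, c) pairs where l tracks the largest physical time
# seen and c breaks ties when the physical clock has not advanced.
class HLC:
    def __init__(self, now_fn):
        self.now = now_fn    # injected physical clock
        self.l = 0           # physical component
        self.c = 0           # logical counter

    def send(self):
        """Local or send event."""
        pt = self.now()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, lm, cm):
        """Receive a message stamped (lm, cm)."""
        pt = self.now()
        if pt > self.l and pt > lm:
            self.l, self.c = pt, 0
        elif lm > self.l:
            self.l, self.c = lm, cm + 1
        elif self.l > lm:
            self.c += 1
        else:                       # self.l == lm
            self.c = max(self.c, cm) + 1
        return (self.l, self.c)
```

With a frozen physical clock, repeated sends bump only the counter; receiving a stamp from a node whose clock runs ahead adopts that larger physical component, keeping timestamps close to real time while preserving causal order.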
Properties:
| Clock Type | Space | Causality | Concurrency | Dynamic |
|---|---|---|---|---|
| Lamport | O(1) | Partial | No | Yes |
| Vector | O(n) | Full | Yes | No |
| ITC | O(log n)* | Full | Yes | Yes |
| HLC | O(1) | Partial | No | Yes |
*ITC space varies based on tree structure
Conflict Detection (Vector Clocks):
if V1 < V2:
    # V1 is an ancestor of V2, no conflict
elif V1 > V2:
    # V2 is an ancestor of V1, no conflict
else:  # V1 || V2
    # Concurrent updates, need conflict resolution
Causal Broadcast:
Deliver message m with VC only when:
1. VC[sender] = local_VC[sender] + 1 (next expected from sender)
2. ∀j ≠ sender: VC[j] ≤ local_VC[j] (all causal deps satisfied)
Snapshot Algorithms:
Consistent cut: set of events S where
if e ∈ S and f → e, then f ∈ S
Vector clocks make this efficiently verifiable
For detailed protocol specifications and proofs, see:
- references/consensus-protocols.md - Detailed protocol descriptions
- references/consistency-models.md - Formal consistency definitions
- references/failure-scenarios.md - Failure mode analysis
- references/logical-clocks.md - Clock algorithms and implementations