~/posts/2025-04-04_understanding-and-scaling-raft.md
$

cat 2025-04-04_understanding-and-scaling-raft.md

If you have worked with distributed systems, you have probably come across this famous quote:

In a distributed system, the only thing two nodes can agree on is that they can't agree on anything.

This quote has stuck with me because it perfectly captures the complexity of building distributed systems. The challenge of getting multiple machines to agree on a single value is so fundamental that it has spawned entire fields of research. Today, we'll look at one elegant solution to this problem - the Raft consensus algorithm.

Understanding Raft: A Simplified Approach to Consensus

Raft was designed with understandability as its primary goal. Unlike more complex consensus algorithms such as Paxos, Raft divides the consensus problem into three relatively independent subproblems:

Leader Election

In Raft, a single server acts as the leader for a given term, handling all client requests and serving as the source of truth:

  • Each server starts as a Follower
  • If a Follower hears no heartbeat from a leader before its election timeout expires, it becomes a Candidate
  • The Candidate increments the term and requests votes from the other servers
  • If it receives votes from a majority, it becomes the Leader
  • If the vote is split, the election times out and a new one starts in a higher term (a minimal sketch of this loop follows below)
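
To make the role transitions concrete, here is a heavily simplified Go sketch of the election loop. It is illustrative only: it calls peers directly instead of over RPC, skips persistence, heartbeat sending, and the log-comparison check in votes, and the names (Node, requestVote, and so on) are invented for this post rather than taken from any real library.

```go
package raft

import (
	"math/rand"
	"sync"
	"time"
)

type Role int

const (
	Follower Role = iota
	Candidate
	Leader
)

// Node is a stripped-down Raft participant for illustration only.
type Node struct {
	mu          sync.Mutex
	id          int
	peers       []*Node       // in a real system these would be RPC clients
	role        Role
	term        int
	votedFor    int           // -1 means "not voted this term"
	heartbeatCh chan struct{} // signaled whenever a leader heartbeat arrives
}

// runElectionTimer turns silence into an election: if no heartbeat arrives
// before a randomized timeout, the node becomes a Candidate and asks for votes.
func (n *Node) runElectionTimer() {
	for {
		timeout := time.Duration(150+rand.Intn(150)) * time.Millisecond
		select {
		case <-n.heartbeatCh:
			// The leader is alive; loop around and start a fresh timeout.
		case <-time.After(timeout):
			n.startElection()
		}
	}
}

func (n *Node) startElection() {
	n.mu.Lock()
	n.role = Candidate
	n.term++          // new term for this election
	n.votedFor = n.id // vote for ourselves
	term := n.term
	n.mu.Unlock()

	votes := 1
	for _, p := range n.peers {
		if p.requestVote(term, n.id) {
			votes++
		}
	}

	n.mu.Lock()
	defer n.mu.Unlock()
	if n.role == Candidate && votes > (len(n.peers)+1)/2 {
		n.role = Leader // majority reached; start sending heartbeats
	}
	// Otherwise the vote was split or lost; the election timer fires again
	// and a new election starts in a higher term.
}

// requestVote grants at most one vote per term, which gives Election Safety.
// (The real rule also rejects candidates whose log is less up to date.)
func (n *Node) requestVote(term, candidateID int) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if term > n.term {
		n.term = term
		n.votedFor = -1
		n.role = Follower
	}
	if term == n.term && (n.votedFor == -1 || n.votedFor == candidateID) {
		n.votedFor = candidateID
		return true
	}
	return false
}
```

The randomized timeout is what breaks split votes: two candidates rarely time out at exactly the same moment twice in a row, so one of them usually gets a head start in the next term.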

Log Replication

Once a leader is elected, it manages all client requests:

  • The leader receives all client requests
  • It appends each request as a new entry in its log
  • It replicates the entry to followers via AppendEntries messages
  • Once a majority of servers have stored the entry, the leader commits it and applies it to its state machine

This ensures that all operations are applied in the same order on every server, so the cluster behaves like a single consistent state machine. The sketch below shows the rule the leader uses to decide what is committed.
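
Here is one way a leader might compute its commit index, assuming 1-based log indexes and a matchIndex slice that tracks how far each follower has replicated. The names are illustrative, not taken from a specific implementation.

```go
package raft

import "sort"

// Entry is a single replicated log entry.
type Entry struct {
	Term    int
	Command []byte
}

// advanceCommitIndex returns the leader's new commit index. Indexes are
// 1-based: matchIndex[i] is the highest index known to be replicated on
// follower i, and the leader's own log counts toward the majority too.
// Per Raft's safety rules, only entries from the current term are committed
// by counting replicas.
func advanceCommitIndex(log []Entry, matchIndex []int, commitIndex, currentTerm int) int {
	// Replication progress of every server, leader included.
	progress := append([]int{len(log)}, matchIndex...)
	sort.Sort(sort.Reverse(sort.IntSlice(progress)))

	// After sorting in descending order, the value at the majority position
	// is an index that a majority of servers have stored.
	majority := progress[len(progress)/2]

	if majority > commitIndex && majority > 0 && log[majority-1].Term == currentTerm {
		return majority
	}
	return commitIndex
}
```

Sorting the replication progress and taking the value at the majority position is a compact way to find the highest index that a majority of servers has stored.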

Safety Guarantees

Raft ensures several critical safety properties:

  • Election Safety: at most one leader can be elected in a given term
  • Leader Append-Only: a leader never overwrites or deletes entries in its log
  • Log Matching: if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index
  • Leader Completeness: if an entry is committed, it will be present in the logs of all future leaders

The Cost of Consensus

There's a reason why we don't use Raft for everything. Each committed write requires, at a minimum:

  • One hop from the client to the leader
  • One round trip from the leader to a majority of followers, each durably appending the entry
  • One hop from the leader back to the client

In a 5-node cluster, a committed write therefore waits on the leader plus at least two followers persisting the entry, and on two sequential network exchanges, which makes Raft relatively expensive for high-throughput or latency-sensitive systems.
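
As a rough illustration with made-up numbers: if the client-to-leader round trip costs 0.5 ms, the leader-to-follower round trip another 0.5 ms, and a durable log write about 1 ms, a single un-batched write lands around 0.5 + 0.5 + 1 = 2 ms, which caps one Raft group at roughly 500 strictly sequential writes per second. Real systems claw back throughput by pipelining and batching entries, but the per-write consensus cost never disappears.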

Scaling Raft: The Multi-Raft Pattern

To scale Raft to larger systems, the Multi-Raft pattern has emerged as a key architecture in distributed databases like CockroachDB and TiDB:

  • Data is split into ranges/shards
  • Each range has its own independent Raft group
  • Different leaders for different ranges allow parallel operations

This pattern allows systems to scale horizontally while maintaining strong consistency guarantees.
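
The core of the pattern is a routing layer that maps each key to the range, and hence the Raft group, that owns it. Below is a deliberately naive Go sketch; CockroachDB's ranges and TiKV's regions play this role in practice, with far more metadata and a smarter lookup than a linear scan.

```go
package multiraft

import "bytes"

// Range describes a contiguous span of keys owned by one Raft group.
type Range struct {
	StartKey []byte // inclusive
	EndKey   []byte // exclusive
	GroupID  uint64 // the Raft group responsible for this range
}

// Router maps keys to the Raft group that owns them.
// Ranges are assumed to be non-overlapping and sorted by StartKey.
type Router struct {
	ranges []Range
}

// GroupFor returns the Raft group a key should be proposed to.
func (r *Router) GroupFor(key []byte) (uint64, bool) {
	for _, rng := range r.ranges {
		if bytes.Compare(key, rng.StartKey) >= 0 && bytes.Compare(key, rng.EndKey) < 0 {
			return rng.GroupID, true
		}
	}
	return 0, false
}
```

A write for key k is then proposed only to GroupFor(k), so writes to different ranges flow through entirely independent Raft instances in parallel.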

Key Optimizations in Multi-Raft Systems

Resource Sharing

Multiple Raft groups on the same node can share resources:

  • Thread pools for processing
  • Batched disk writes across groups
  • Unified network connections
  • Shared memory pools

Message Batching

To reduce network overhead (the per-destination batching idea is sketched after this list):

  • Multiple heartbeats combined into single packets
  • Log entries from different groups bundled together
  • Responses aggregated for efficient network usage
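
Here is a minimal, hypothetical sketch of that idea: messages from many Raft groups destined for the same node are queued and flushed as one batch. The Message and Batcher types are invented for illustration; real transports add size limits, timers, and flow control.

```go
package multiraft

// Message is an illustrative Raft message (heartbeat, append, vote, ...).
type Message struct {
	GroupID uint64 // which Raft group this message belongs to
	To      uint64 // destination node
	Payload []byte
}

// Batcher coalesces messages from many Raft groups that share a destination
// node, so one network packet carries traffic for many groups.
type Batcher struct {
	pending map[uint64][]Message // keyed by destination node
}

func NewBatcher() *Batcher {
	return &Batcher{pending: make(map[uint64][]Message)}
}

// Add queues a message instead of sending it immediately.
func (b *Batcher) Add(m Message) {
	b.pending[m.To] = append(b.pending[m.To], m)
}

// Flush sends one batch per destination node and clears the queue.
// send is whatever transport the system actually uses.
func (b *Batcher) Flush(send func(to uint64, batch []Message)) {
	for to, msgs := range b.pending {
		send(to, msgs)
		delete(b.pending, to)
	}
}
```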

Range Leases

To optimize read operations (a minimal lease check is sketched after this list):

  • Long-term read delegation to a single node
  • Reads don't need full Raft consensus
  • Significantly reduces read latency
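
A lease boils down to a simple check before serving a read locally. The sketch below skips the hard part, which is clock uncertainty and lease transfer during failures, and the types are illustrative only.

```go
package multiraft

import "time"

// Lease grants one node the right to serve reads for a range without a
// Raft round trip, until Expiration. Clock handling is drastically
// simplified here; real systems account for clock skew and uncertainty.
type Lease struct {
	Holder     uint64
	Expiration time.Time
}

// CanServeLocalRead reports whether this node may answer a read from its
// local state. If not, the read must go through the Raft leader (or a
// read-index/consensus round).
func CanServeLocalRead(l Lease, nodeID uint64, now time.Time) bool {
	return l.Holder == nodeID && now.Before(l.Expiration)
}
```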

Dynamic Range Management

  • Hot ranges split automatically
  • Cold ranges merge to reduce overhead
  • Load-based splitting for better distribution (a toy version of these checks is sketched below)
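
The sketch below is invented for illustration: the thresholds are made up, and production systems add hysteresis and per-key load sampling so ranges don't flap between splitting and merging.

```go
package multiraft

// RangeStats is illustrative per-range accounting a node might keep.
type RangeStats struct {
	SizeBytes    int64
	WritesPerSec float64
}

// Invented thresholds; real systems make these configurable.
const (
	splitSizeBytes = 512 << 20 // split when a range grows past ~512 MiB
	splitWriteRate = 2000      // ...or sustains a hot write rate
	mergeSizeBytes = 16 << 20  // merge when a range shrinks below ~16 MiB
	mergeWriteRate = 50        // ...and is mostly idle
)

func shouldSplit(s RangeStats) bool {
	return s.SizeBytes > splitSizeBytes || s.WritesPerSec > splitWriteRate
}

func shouldMerge(s RangeStats) bool {
	return s.SizeBytes < mergeSizeBytes && s.WritesPerSec < mergeWriteRate
}
```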

Challenges with Multi-Raft

While Multi-Raft solves scaling problems, it introduces complexity:

Operational Complexity

Running thousands of Raft groups means:

  • More state to track and debug
  • Complex failure scenarios
  • Increased monitoring overhead

Resource Management

Each Raft group consumes resources, and managing thousands requires careful planning:

  • Memory for in-flight operations
  • Disk space for logs
  • Network bandwidth for replication
  • CPU for leader election and log processing

Cross-Range Transactions

When operations span multiple ranges (a bare-bones two-phase commit is sketched after this list):

  • Atomic commits across Raft groups become necessary
  • Coordination protocols add complexity
  • Higher latency for distributed transactions
  • Increased chance of conflicts and retries
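
One common shape for this is a two-phase commit layered on top of the individual Raft groups, where each prepare and commit record is itself replicated through that group's log. The sketch below is deliberately bare-bones and invented for this post; it omits the durable coordinator record and the recovery path that real systems require, and it is not how CockroachDB or TiDB implement their transaction protocols.

```go
package multiraft

import "errors"

// Group is the minimal interface a coordinator needs from each Raft group:
// Propose replicates a command through that group's own Raft log.
// This is an illustrative interface, not a real library API.
type Group interface {
	Propose(cmd string) error
}

var ErrAborted = errors.New("transaction aborted")

// commitAcrossGroups sketches a two-phase commit across several Raft groups.
func commitAcrossGroups(groups []Group, txnID string) error {
	// Phase 1: prepare in every group (each prepare is itself a Raft commit).
	for _, g := range groups {
		if err := g.Propose("PREPARE " + txnID); err != nil {
			// Phase 1 failed: tell everyone to abort.
			for _, a := range groups {
				_ = a.Propose("ABORT " + txnID)
			}
			return ErrAborted
		}
	}
	// Phase 2: commit in every group.
	for _, g := range groups {
		if err := g.Propose("COMMIT " + txnID); err != nil {
			return err // recovery would have to finish the commit later
		}
	}
	return nil
}
```

Notice that every phase is a full Raft round in every participating group, which is exactly why cross-range transactions cost noticeably more latency than single-range writes.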

Best Practices for Using Raft

When to Use Raft

Use Raft when:

  • Strong consistency is required
  • The system can tolerate some latency
  • The dataset can be partitioned effectively

When Not to Use Raft

Consider alternatives when:

  • Eventual consistency is sufficient
  • Ultra-low latency is critical
  • The system has extremely high write throughput

State Machine Management

Your Raft implementation needs to consider the following, since the log grows without bound (a minimal compaction trigger is sketched after the list):

  • Log compaction strategies
  • Snapshot mechanisms for large state machines
  • Efficient recovery processes
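
As a minimal illustration, a compaction trigger can be as simple as the following: once the log passes a threshold, snapshot the state machine, persist the snapshot, and drop the entries it covers. All names and thresholds here are invented; Entry has the same shape as in the earlier log-replication sketch.

```go
package raft

// Entry is a single replicated log entry.
type Entry struct {
	Term    int
	Command []byte
}

// StateMachine is whatever the application keeps on top of the log; all it
// needs to offer for compaction is a way to serialize its current state.
type StateMachine interface {
	Snapshot() ([]byte, error)
}

// Log holds the in-memory tail of the Raft log. snapshotIndex is the last
// 1-based log index already covered by the most recent snapshot.
type Log struct {
	entries       []Entry
	snapshotIndex int
}

const maxLogEntries = 10000 // invented threshold

// maybeCompact snapshots the state machine and truncates the log once it
// grows too large. appliedIndex is the last index applied to the state
// machine; save persists the snapshot durably before anything is dropped.
func (l *Log) maybeCompact(sm StateMachine, appliedIndex int, save func(snapshot []byte, index int) error) error {
	if len(l.entries) < maxLogEntries {
		return nil
	}
	snap, err := sm.Snapshot()
	if err != nil {
		return err
	}
	// Persist the snapshot first, so a crash never loses committed state.
	if err := save(snap, appliedIndex); err != nil {
		return err
	}
	// Drop every entry the snapshot already covers.
	l.entries = append([]Entry(nil), l.entries[appliedIndex-l.snapshotIndex:]...)
	l.snapshotIndex = appliedIndex
	return nil
}
```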

Handling Network Partitions

The system should be designed to handle:

  • Temporary partitions without data loss
  • Leader isolation scenarios
  • Recovery from network failures

Conclusion

The beauty of Raft lies in its simplicity. While other consensus algorithms might be more efficient in specific scenarios, Raft's understandability makes it the go-to choice for many distributed systems.

Remember:

Consensus is expensive. Use it only when you absolutely need it.

Take the time to understand when you need strong consistency versus when eventual consistency is good enough. Your system's scalability might depend on it.

I hope you enjoyed reading this article.

Please reach out to me here with ideas or improvements.