Implementing Raft in RabbitMQ
https://content.pivotal.io/rabbitmq/implementing-raft-in-rabbitmq
RAFT is a distributed consensus protocol.
RabbitMQ High Availability
- Replication of data and operations
- Message replication is done at queue level
- Called "Queue mirroring"
- Internally uses a component called "guaranteed multicast"
- Provides replication and total ordering of operations
- Chain replication ensures strong consistency and good availability guarantees in fail-stop scenarios. http://www.cs.cornell.edu/home/rvr/papers/OSDI04.pdf
- In a cluster of RabbitMQ nodes a queue can have a mirror on one or more nodes
- Provides fail-over and redundancy
Ring
Works well most of the time. Requires good failure detection. Membership changes are expensive (they require a queue sync). The master election algorithm is only informally specified.
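A minimal sketch of the ring idea, assuming each operation is passed node to node until it gets back to the sender. `RingNode` and `ring_multicast` are hypothetical names used only for illustration. They are not RabbitMQ APIs, and failure handling is left out.

```python
from dataclasses import dataclass, field


@dataclass
class RingNode:
    name: str
    log: list = field(default_factory=list)  # operations applied, in order


def ring_multicast(ring, sender_index, operation):
    """Pass `operation` once around the ring, starting at the sender.

    Returns the number of hops taken. In this sketch the sender treats the
    operation as replicated only after it has travelled the whole ring.
    """
    ring[sender_index].log.append(operation)
    hops = 0
    i = (sender_index + 1) % len(ring)
    while i != sender_index:
        ring[i].log.append(operation)  # each node applies, then forwards
        hops += 1
        i = (i + 1) % len(ring)
    return hops


nodes = [RingNode("a"), RingNode("b"), RingNode("c")]
hops = ring_multicast(nodes, 0, "publish m1")
ring_multicast(nodes, 0, "publish m2")
```

Every node ends up with the same log in the same order, but the hops are sequential, so the cost of one operation grows with the size of the ring.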
We can do better with RAFT.
RAFT
Requirements:
- Strong consistency guarantees.
- Total order of operations.
- Predictable behavior in response to failure events (well-defined recovery procedure)
- Safe queue master fail-over
- Parallel replication
Options:
- Paxos
- Viewstamped Replication
- Raft
What is RAFT:
- A group of algorithms for reaching consensus in a distributed system
- Similar problem space to RabbitMQ queue mirroring
- Oriented towards implementers
- Requires no external dependencies
RAFT provides:
- A state machine log abstraction
- Fits many domains
- Leader-follower model
- State machine log replication
- Consistency-oriented, with defined availability characteristics
- Total order of operations
- Well-defined algorithms important for implementers
- Leader election
- Safe cluster membership changes
- Durable storage expectations
- Recovery
- Replay log to restore state
- Snapshotting
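The recovery items above (log replay and snapshotting) can be sketched roughly as follows. This is an illustration under assumptions, not RabbitMQ's or any Raft library's implementation: `QueueStateMachine`, its commands, and the snapshot layout are all hypothetical.

```python
class QueueStateMachine:
    """Toy state machine: a FIFO queue driven by logged commands."""

    def __init__(self):
        self.messages = []
        self.last_applied = 0  # index of the last log entry applied

    def apply(self, index, command):
        op, payload = command
        if op == "enqueue":
            self.messages.append(payload)
        elif op == "dequeue":
            if self.messages:
                self.messages.pop(0)
        else:
            raise ValueError(f"unknown command: {op}")
        self.last_applied = index

    def snapshot(self):
        # Capture the state so that earlier log entries are no longer needed.
        return {"last_applied": self.last_applied,
                "messages": list(self.messages)}

    @classmethod
    def restore(cls, snapshot, log):
        """Rebuild state from a snapshot plus the log entries after it."""
        sm = cls()
        sm.messages = list(snapshot["messages"])
        sm.last_applied = snapshot["last_applied"]
        for index, command in log:
            if index > sm.last_applied:  # skip entries the snapshot covers
                sm.apply(index, command)
        return sm


log = [
    (1, ("enqueue", "a")),
    (2, ("enqueue", "b")),
    (3, ("dequeue", None)),
    (4, ("enqueue", "c")),
]

live = QueueStateMachine()
for index, command in log[:2]:
    live.apply(index, command)
snap = live.snapshot()           # taken after entry 2
for index, command in log[2:]:
    live.apply(index, command)

recovered = QueueStateMachine.restore(snap, log)
```

Because commands are applied in log order, the recovered instance reaches the same state as the live one. The snapshot only shortens how much of the log has to be replayed.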
RAFT Protocol
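One core rule of the protocol is that the leader treats a log entry as committed once it is stored on a majority of servers. The sketch below illustrates only that rule. `majority` and `commit_index` are hypothetical helper names, and the sketch omits Raft's additional condition that the entry must belong to the leader's current term.

```python
def majority(cluster_size):
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


def commit_index(match_index):
    """Highest log index stored on a majority of servers.

    `match_index` maps every server (leader included) to the highest log
    index known to be stored on it.
    """
    ranked = sorted(match_index.values(), reverse=True)
    return ranked[majority(len(match_index)) - 1]


# Five servers: index 3 is the highest entry held by at least three of them.
match = {"leader": 5, "f1": 5, "f2": 3, "f3": 2, "f4": 1}
committed = commit_index(match)
```

Because only a majority has to acknowledge, the leader can send entries to all followers in parallel, and slow or failed followers in the minority do not block progress.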
RAFT vs RabbitMQ
What to do when detecting a potential failure?
- Nothing
- most reliable / least useful
- Try to fix stuff
- evict down nodes, reform topology
- communicate changes to other nodes
- The minimum required
- regain / retain availability and consistency