Each machine would require lease from CM, and CM would also require lease from them.
Reconfiguration
The reconfiguration protocol moves a FaRM instance from one configuration to the next. For this part, implementing precise membership is necessary, because the server's CPU cannot check if it holds the lease. After a failure, all machines in a new configuration must agree on its membership before allowing object mutations. This allows FaRM to perform the check at the client rather than at the server.
The reconfiguration contains below steps:
- Suspect: When a lease for a machine expires at the CM, it suspects that machine of failure and initiates recon- figuration. At this point it starts blocking all external client requests.
- Probe: The new CM issues an RDMA read to all the machines in the configuration except the machine that is suspected.
- Update Configuration: After receiving replies to the probes, the new CM attempts to update the configuration data stored in Zookeeper.
- Remap regions: The new CM then reassigns regions previously mapped to failed machines to restore the number of replicas.
- Send new configuration: After remapping regions, the CM sends a NEW- CONFIG message to all the machines in the configuration.
- Apply new configuration: When a machine receives a NEW- CONFIG with a configuration identifier that is greater than its own, it updates its current configuration. It also starts blocking requests from external clients
- Commit new configuration: All members now unblock previously blocked external client requests and initiate transaction recovery.
Transaction state recovery
FaRM recovers transaction state after a configuration change using the logs distributed across the replicas of objects modified by a transaction. This involves recovering the state both at the replicas of objects modified by the transaction and at the coordinator to decide on the outcome of the transaction.
- Block access to recovering regions: When the primary of a region fails, one of the backups is promoted to be the new primary during reconfiguration. Requests would be blocked until all transactions that updated it have been reflected at the new primary.
- Drain logs: to ensure that all relevant records are processed during recovery.
- Find recovering transactions: A recovering transaction is one whose commit phase spans configuration changes, and for which some replica of a written object, some primary of a read object, or the coordinator has changed due to reconfiguration. All machines must agree on whether a given transaction is a recovering transaction or not.
- Lock recovery: In parallel, the threads in the primary fetch any transaction log records from backups that are not already stored locally and then lock any objects modified by recovering transactions.
- Replicate log records: The threads in the primary replicate log records by sending backups the REPLICATE-TX- STATE message for any transactions that they are missing.
- Vote: The coordinator for a recovering transaction decides whether to commit or abort the transaction based on votes from each region updated by the transaction. These votes are sent by the primaries of each region.
- Decide: The coordinator decides to commit a transaction if it receives a commit-primary vote from any region. Otherwise, it waits for all regions to vote and commits.
For correctness, the key idea is that recovery preserves the outcome for transactions that were previously committed or aborted. A transaction is committed when either a primary exposes transaction modifications, or the coordinator notifies the application that the transaction committed. A transaction is aborted when the coordinator sends an abort message or notifies the application that the transaction has aborted.
Recovering data
Data recovery is not necessary to resume nor- mal case operation, so we delay it until all regions become active to minimize impact on latency-critical lock recovery. FaRM begins data recovery for new backups in parallel with foreground operations. To reduce impact on foreground performance, recovery is paced by scheduling the next read to start at a random point within an interval.
Each recovered object must be examined before being copied to the backup. The steps contain:
- check version number (greater than local)
- lock local version with a compare-and-swap
- update object state
- unlock
Summary
FaRM is the first system to simultaneously provide high availability, high throughput, low latency, and strict serializability with a new fast recovery protocol and an optimized transaction and replication protocol. To be more specific, it is a distributed main memory computing platform for modern data centers that provides strictly serializable transactions with high throughput, low latency, and high availability. Key to achieving this are new transaction, replication, and recovery protocols designed from first principles to leverage commodity networks with RDMA and a new, inexpensive approach to providing non-volatile DRAM.
2021-2022, Wenzhe Zhang Revision
e7f77fc Wenzhe Zhang's Notebook
develop