GFS: The Google File System

This part is based on the GFS paper, which is absolutely worth reading carefully.

What is GFS

GFS is a scalable distributed file system for large distributed data-intensive applications. It shares many goals with previous distributed file systems, but its design is driven by a few key observations about Google's workload:

  1. Component failures are the norm rather than the exception: do not try to avoid failures, but accept and handle them
  2. Files are huge by traditional standards: I/O operations and block sizes must be chosen carefully
  3. Most files are mutated by appending new data rather than overwriting: appending is the focus of performance optimization and atomicity guarantees
  4. Co-designing the applications and the file system API: this increases flexibility and benefits the whole system

GFS was not built as a research experiment; it was designed around very specific goals. It does not pay equal attention to every kind of file operation: following Google's workload, it focuses on large streaming reads and large sequential appends. Some design choices may not meet requirements outside Google's scope, but that does not make GFS a bad design.

The structure of GFS

  1. Single master: GFS has one single master, which maintains all file system metadata and is obviously a central component. A single master simplifies the whole system, but GFS's designers treat it carefully: only control flow passes to and from the master, while data flow does not. This keeps the master's I/O load and network bandwidth usage low, and it has proved to be a sound design.

  2. Multiple chunkservers: They store the file data and its replicas. Files are divided into fixed-size chunks (64 MB), which act as buckets, or blocks. GFS writes into these buckets carefully: a single write may straddle a chunk boundary, so the two parts of the data can land in different chunks, and these details must be recorded and tracked so that reads return the right bytes.
  3. Multiple clients: A client sends a metadata request to the master, which checks its records and finds out which chunkservers hold the chunks of the file the client wants and where they are located. This information is sent back to the client, and the client then communicates with the chunkservers directly for the data, as sketched below.
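
To make the division of labor concrete, here is a minimal sketch of the read path in Go. The Master and Chunkserver interfaces and the FindChunk/ReadChunk calls are hypothetical stand-ins, not the real GFS RPCs; the point is only that the master is consulted for metadata while the bytes come straight from a chunkserver.

    // A minimal sketch of the read path; the interfaces below are assumptions.
    package gfs

    import "fmt"

    // ChunkLocations is what the (hypothetical) master returns for one chunk.
    type ChunkLocations struct {
        Handle  uint64   // globally unique chunk handle
        Servers []string // addresses of chunkservers holding replicas
    }

    // Master maps (file name, chunk index) to a chunk handle and replica locations.
    type Master interface {
        FindChunk(path string, chunkIndex int64) (ChunkLocations, error)
    }

    // Chunkserver serves the actual bytes; data never flows through the master.
    type Chunkserver interface {
        ReadChunk(handle uint64, offset, length int64) ([]byte, error)
    }

    const chunkSize = 64 << 20 // 64 MB chunks, as in the paper

    // Read translates a byte offset into a chunk index, asks the master for
    // locations (control flow), then reads from one chunkserver (data flow).
    func Read(m Master, pick func([]string) Chunkserver, path string, offset, length int64) ([]byte, error) {
        loc, err := m.FindChunk(path, offset/chunkSize)
        if err != nil {
            return nil, fmt.Errorf("master lookup failed: %w", err)
        }
        cs := pick(loc.Servers) // e.g. choose the closest replica
        return cs.ReadChunk(loc.Handle, offset%chunkSize, length)
    }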

GFS also maintains an operation log that records every metadata mutation, and it periodically creates checkpoints of the master's state based on this log. Together they are powerful recovery tools: the file system is restored by loading the latest checkpoint and replaying the log records written after it, as sketched below.
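
A minimal sketch of that recovery path, assuming a simplified in-memory State and hypothetical loadCheckpoint/logAfter helpers (the real master serializes a compact B-tree-like checkpoint and reads the log from disk):

    // Checkpoint-plus-log recovery; the types and helpers are assumptions.
    package gfs

    // LogRecord describes one metadata mutation (create, delete, lease grant, ...).
    type LogRecord struct {
        Seq int64  // position in the operation log
        Op  string // hypothetical encoding of the mutation
    }

    // State is the master's in-memory metadata.
    type State struct {
        AppliedSeq int64
        // file namespace, file-to-chunk mapping, ... omitted
    }

    // Apply replays one logged mutation against the metadata.
    func (s *State) Apply(r LogRecord) {
        // ... mutate namespace / chunk tables ...
        s.AppliedSeq = r.Seq
    }

    // Recover rebuilds master state by loading the newest checkpoint and then
    // replaying only the log records written after that checkpoint.
    func Recover(loadCheckpoint func() (*State, error), logAfter func(seq int64) []LogRecord) (*State, error) {
        s, err := loadCheckpoint()
        if err != nil {
            return nil, err
        }
        for _, r := range logAfter(s.AppliedSeq) {
            s.Apply(r)
        }
        return s, nil
    }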

After a sequence of successful mutations, the mutated file region is guaranteed to be defined and to contain the data written by the last mutation. GFS achieves this (see the sketch after this list) by:

  1. applying mutations to a chunk in the same order on all of its replicas
  2. using chunk version numbers to detect replicas that have missed mutations
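
The version-number part can be sketched roughly as follows; the ChunkInfo layout and the way replica versions are collected are assumptions, not the master's real data structures (the real master learns versions from chunkserver reports and bumps the version whenever it grants a new lease).

    // Detecting stale replicas by comparing version numbers; a rough sketch.
    package gfs

    // ChunkInfo is the master's record for one chunk.
    type ChunkInfo struct {
        Handle   uint64
        Version  uint64            // bumped each time a new lease is granted
        Replicas map[string]uint64 // chunkserver address -> version it reports
    }

    // StaleReplicas returns replicas whose version lags the master's, meaning
    // they missed mutations (for example while their chunkserver was down);
    // the master removes them and never hands them out to clients.
    func (c *ChunkInfo) StaleReplicas() []string {
        var stale []string
        for addr, v := range c.Replicas {
            if v < c.Version {
                stale = append(stale, addr)
            }
        }
        return stale
    }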

GFS also uses checksums to verify the validity of records and of the data stored on chunkservers (see Data Integrity below).

How GFS runs

Lease

Because each mutation must eventually be applied to all replicas, GFS uses leases to maintain a consistent mutation order across replicas: the master grants a chunk lease to one replica, which becomes the primary and picks the order. This mechanism is designed to minimize management overhead at the master. A write proceeds in the following steps (a sketch of the primary's side follows the list):

  1. The client asks the master for the location of the chunkserver holding the lease of the chunk and locations of other replicas
  2. The master replies with the identity of the primary and the locations of the secondary replicas
  3. The client pushes the data to all replicas
  4. Once all replicas have acknowledged receiving the data, the client sends a write request to the primary. The primary assigns serial numbers to the mutations and applies them to its own local state
  5. The primary forwards the write request to all secondary replicas, which apply the mutations in the same serial-number order
  6. The secondaries all reply to the primary
  7. The primary replies to the client
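
A condensed sketch of steps 4-6 on the primary, assuming a hypothetical Secondary.Apply stub and that the data has already been pushed to every replica; lease checks, retries, and concurrency control are omitted.

    // The primary serializes mutations and forwards them; a simplified sketch.
    package gfs

    import "fmt"

    // Mutation identifies data (already buffered on every replica) to be applied.
    type Mutation struct {
        DataID string // identifier the client used when pushing the data
        Serial int64  // order assigned by the primary
    }

    // Secondary is a replica that applies mutations in the primary's order.
    type Secondary interface {
        Apply(m Mutation) error
    }

    // Primary holds the chunk lease and serializes all mutations to the chunk.
    type Primary struct {
        nextSerial  int64
        secondaries []Secondary
        applyLocal  func(Mutation) error
    }

    // Write assigns the next serial number, applies the mutation locally, then
    // forwards it so every replica applies mutations in the same order.
    func (p *Primary) Write(dataID string) error {
        p.nextSerial++
        m := Mutation{DataID: dataID, Serial: p.nextSerial}
        if err := p.applyLocal(m); err != nil {
            return err
        }
        for _, s := range p.secondaries {
            if err := s.Apply(m); err != nil {
                // The client is told the write failed and must retry, which is
                // why a region can end up inconsistent across replicas.
                return fmt.Errorf("secondary failed: %w", err)
            }
        }
        return nil
    }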

Data Flow

GFS decouples the flow of data from the flow of control to use the network efficiently.

  • Control Flow: Client -> Primary -> Secondary
  • Data Flow: between chunkservers (and clients)

The key is that each machine forwards the data to the closest machine (measured by network topology) that has not yet received it, and the transfer is pipelined: once a chunkserver starts receiving data, it begins forwarding immediately rather than waiting for the whole transfer.
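
A rough sketch of this chained push; the closest heuristic and the Send stub are assumptions (the paper estimates distances from IP addresses and streams data over TCP as it arrives, which this non-streaming sketch glosses over).

    // Chained data push: each hop forwards to the nearest remaining replica.
    package gfs

    // Replica is a chunkserver that can receive pushed data and forward it on.
    type Replica struct {
        Addr string
        Send func(data []byte, rest []*Replica) error // forward data plus the remaining chain
    }

    // PushData sends the data to the closest replica, which forwards it to the
    // next closest, and so on, so each machine's outbound bandwidth is used once.
    func PushData(data []byte, replicas []*Replica, closest func([]*Replica) int) error {
        if len(replicas) == 0 {
            return nil
        }
        i := closest(replicas)
        next := replicas[i]
        rest := append(append([]*Replica{}, replicas[:i]...), replicas[i+1:]...)
        return next.Send(data, rest)
    }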

Atomic Record Appends

GFS guarantees that the data is appended to the file at least once atomically, at an offset of GFS's choosing, and that offset is returned to the client. GFS does not guarantee that all replicas are bytewise identical; it only guarantees that the data is written at least once as an atomic unit. After a successful append, all replicas are at least as long as the end of the record, so any future record will be assigned a higher offset or a different chunk, even if a different replica later becomes the primary.
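
From the application's point of view, at-least-once append means the client simply retries on failure, so a record may appear more than once and readers deduplicate using identifiers inside the records. The AppendFn signature below is a hypothetical stand-in for the record-append client call, not the real GFS client library.

    // Client-side retry loop for at-least-once record append; a sketch.
    package gfs

    // AppendFn appends one record somewhere in the file and returns the offset
    // GFS chose for it.
    type AppendFn func(path string, record []byte) (offset int64, err error)

    // AppendAtLeastOnce retries until some attempt succeeds. Failed attempts may
    // still have left padding or duplicate data in some replicas; that is allowed
    // by the consistency model and filtered out by readers.
    func AppendAtLeastOnce(app AppendFn, path string, record []byte, maxTries int) (int64, error) {
        var lastErr error
        for i := 0; i < maxTries; i++ {
            off, err := app(path, record)
            if err == nil {
                return off, nil
            }
            lastErr = err
        }
        return 0, lastErr
    }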

Snapshot

The snapshot operation makes a copy of a file or a directory tree almost instantaneously. It proceeds in these steps (a copy-on-write sketch follows the list):

  1. The master revokes any outstanding leases on the chunks of the files about to be snapshotted
  2. The master logs the snapshot operation and duplicates the metadata, so that the snapshot files initially point to the same chunks as the source files
  3. The first write to any of these shared chunks afterwards is applied to a newly copied chunk (copy-on-write), instead of the snapshotted original
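
A sketch of the copy-on-write bookkeeping using a simple per-chunk reference count; the Namespace layout and the copyChunk callback are assumptions (in real GFS the new chunk is created locally on the chunkservers that already hold the data, to avoid moving bytes over the network).

    // Copy-on-write after a snapshot, tracked with reference counts; a sketch.
    package gfs

    // Namespace maps file paths to the handles of their chunks; refs counts how
    // many files currently point at each chunk.
    type Namespace struct {
        chunks    map[string][]uint64
        refs      map[uint64]int
        copyChunk func(src uint64) uint64 // hypothetical: duplicate a chunk, return the new handle
    }

    // Snapshot duplicates only the metadata: both files share the same chunks,
    // and every shared chunk's reference count is bumped.
    func (n *Namespace) Snapshot(src, dst string) {
        handles := append([]uint64{}, n.chunks[src]...)
        n.chunks[dst] = handles
        for _, h := range handles {
            n.refs[h]++
        }
    }

    // BeforeWrite is called when a client wants to mutate chunk i of path: if
    // the chunk is shared, it is copied first and the file points at the copy.
    func (n *Namespace) BeforeWrite(path string, i int) uint64 {
        h := n.chunks[path][i]
        if n.refs[h] > 1 {
            n.refs[h]--
            h = n.copyChunk(h)
            n.refs[h] = 1
            n.chunks[path][i] = h
        }
        return h
    }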

Master operation

The master executes all namespace operations. In addition, it manages chunk replicas throughout the system:

  1. Namespace Management and Locking: allow multiple operations to be active at once and use locks over regions of the namespace to ensure proper serialization (see the locking sketch after this list).
  2. Replica Placement: maximize data reliability and availability, and maximize network bandwidth utilization.
  3. Creation, Re-replication, Rebalancing: re-replication happens when the number of replicas is not enough; rebalancing happens for better disk space usage and load balancing.
  4. Garbage Collection: After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels.
  5. Stale Replica Detection: the master uses chunk version numbers to detect stale replicas and remove them.
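
For item 1, here is a sketch of the locking rule the paper describes: an operation takes read locks on every ancestor directory of a path and a read or write lock on the full pathname, so unrelated operations in the same directory can run concurrently. The per-path RWMutex table is an assumption about representation, and the paper's extra rule of acquiring locks in a consistent total order (by namespace level, then lexicographically) to prevent deadlock is skipped here.

    // Namespace locking: read-lock ancestors, write-lock the target path.
    package gfs

    import (
        "strings"
        "sync"
    )

    // LockTable lazily allocates one RWMutex per full pathname.
    type LockTable struct {
        mu    sync.Mutex
        locks map[string]*sync.RWMutex
    }

    func (t *LockTable) get(path string) *sync.RWMutex {
        t.mu.Lock()
        defer t.mu.Unlock()
        if t.locks == nil {
            t.locks = map[string]*sync.RWMutex{}
        }
        if t.locks[path] == nil {
            t.locks[path] = &sync.RWMutex{}
        }
        return t.locks[path]
    }

    // LockForWrite read-locks /d1, /d1/d2, ... and write-locks the full path,
    // then returns a function that releases everything in reverse order.
    func (t *LockTable) LockForWrite(path string) (unlock func()) {
        parts := strings.Split(strings.Trim(path, "/"), "/")
        var held []func()
        prefix := ""
        for i, p := range parts {
            prefix += "/" + p
            l := t.get(prefix)
            if i == len(parts)-1 {
                l.Lock()
                held = append(held, l.Unlock)
            } else {
                l.RLock()
                held = append(held, l.RUnlock)
            }
        }
        return func() {
            for i := len(held) - 1; i >= 0; i-- {
                held[i]()
            }
        }
    }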

Fault tolerance

High availability

GFS achieves high availability mainly through fast recovery and chunk replication; the master's own state is also replicated, and read-only shadow masters can serve metadata when the master is down.

Data Integrity

GFS does not guarantee identical replicas. Therefore, each chunkserver must independently verify the integrity of its own copy by maintaining checksums.

For reads, the chunkserver verifies the checksum of data blocks that overlap the read range before returning any data.

For writes that overwrite an existing range, the chunkserver reads and verifies the first and last blocks of the range, then performs the write and computes and records the new checksums. For appends, the checksum of the last partial block is simply updated incrementally.
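
A sketch of the read-side verification, assuming CRC-32 here for concreteness (the paper only says each 64 KB block has a 32-bit checksum) and a plain slice of stored checksums rather than the chunkserver's persistent metadata:

    // Per-block checksum verification on the read path; a simplified sketch.
    package gfs

    import (
        "errors"
        "hash/crc32"
    )

    const blockSize = 64 << 10 // checksums cover 64 KB blocks, as in the paper

    // VerifyRead checks every block that overlaps [offset, offset+length) before
    // any data is returned; a mismatch is reported (and, in GFS, also sent to
    // the master so the replica can be re-created from a good copy).
    func VerifyRead(chunk []byte, sums []uint32, offset, length int) ([]byte, error) {
        first := offset / blockSize
        last := (offset + length - 1) / blockSize
        for b := first; b <= last; b++ {
            end := (b + 1) * blockSize
            if end > len(chunk) {
                end = len(chunk)
            }
            if crc32.ChecksumIEEE(chunk[b*blockSize:end]) != sums[b] {
                return nil, errors.New("checksum mismatch: corrupted block")
            }
        }
        return chunk[offset : offset+length], nil
    }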