Spark loads only the data it needs from external storage, rather than the whole dataset. For example, we can ask Spark for just the lines of a log that contain errors; after loading, that filtered data is exactly what resides in memory, and the original dataset is never pulled into RAM.
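A minimal Scala sketch of this pattern, following the style of the paper's log-mining example (the driver's SparkContext is named spark as in the paper, and the HDFS path and "ERROR" prefix are illustrative):

    // Define an RDD backed by a file in HDFS; nothing is loaded yet.
    val lines = spark.textFile("hdfs://...")

    // A coarse-grained transformation: keep only the error lines.
    val errors = lines.filter(line => line.startsWith("ERROR"))

    // Ask Spark to keep the filtered RDD in RAM once it is computed.
    errors.persist()

    // Only when an action runs does Spark scan the file, and only the
    // error lines are materialized in memory, not the whole dataset.
    val numErrors = errors.count()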
RDDs can only be created ("written") through coarse-grained transformations, which allows for more efficient fault tolerance. Furthermore, only the lost partitions of an RDD need to be recomputed upon failure, and they can be recomputed in parallel on different nodes, without having to roll back the whole program.
A second benefit of RDDs is that their immutable nature lets a system mitigate slow nodes (stragglers) by running backup copies of slow tasks as in MapReduce.
Spark uses Scala as its programming language, which is based on the Java VM. To use Spark, developers write a driver program that connects to a cluster of workers. The driver defines one or more RDDs and invokes actions on them. Users provide arguments to RDD operations like map by passing closures. Scala represents each closure as a Java object, and these objects can be serialized and loaded on another node to pass the closure across the network.
RDDs themselves are statically typed objects parametrized by an element type. For example, RDD[Int] is an RDD of integers.
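A hedged sketch of these points, assuming the same SparkContext named spark as above (the numbers RDD is hypothetical):

    import org.apache.spark.rdd.RDD

    // parallelize turns a local collection into a distributed RDD of integers.
    val numbers: RDD[Int] = spark.parallelize(1 to 1000)

    // The closure (x => x * 2) is represented as a Java object; Spark can
    // serialize it and load it on the worker nodes holding the partitions.
    val doubled: RDD[Int] = numbers.map(x => x * 2)

    // Mapping to a different element type changes the static type parameter.
    val asStrings: RDD[String] = doubled.map(x => x.toString)

    // Actions are invoked from the driver and return a result to it.
    val total: Int = doubled.reduce(_ + _)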
Ideally, a system implementing RDDs should provide as rich a set of transformation operators as possible. Spark's design achieves this with a simple graph-based representation for RDDs: it supports a wide range of transformations without adding special logic to the scheduler for each one, which greatly simplifies the system design.
The designers propose representing each RDD through a common interface that exposes five pieces of information: a set of partitions, which are atomic pieces of the dataset; a set of dependencies on parent RDDs; a function for computing the dataset based on its parents; metadata about its partitioning scheme; and data-placement hints (preferred locations for each partition).
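A hedged Scala sketch of what such a common interface could look like; the trait and method names below loosely follow the paper's description and are illustrative, not Spark's actual internals:

    // One atomic piece of the dataset.
    trait Partition { def index: Int }

    // A dependency on a parent RDD (narrow or wide).
    trait Dependency[T] { def rdd: RDD[T] }

    // Describes how the data is partitioned (e.g. hash-partitioned by key).
    trait Partitioner { def numPartitions: Int }

    trait RDD[T] {
      def partitions: Seq[Partition]                          // pieces of the dataset
      def dependencies: Seq[Dependency[_]]                    // links to parent RDDs
      def compute(split: Partition): Iterator[T]              // derive a partition from its parents
      def partitioner: Option[Partitioner]                    // partitioning-scheme metadata
      def preferredLocations(split: Partition): Seq[String]   // data-placement hints
    }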
An interesting point is how to represent dependencies between RDDs. The designers classify dependencies into two types: narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD; and wide dependencies, where multiple child partitions may depend on a single parent partition.
Distinctions: First, narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions; wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation. Second, recovery after a node failure is more efficient with a narrow dependency, since only the lost parent partitions need to be recomputed, and this can be done in parallel on other nodes; with wide dependencies, a single failed node may invalidate partitions in every ancestor of the RDD, requiring a complete re-execution.
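For instance, map and filter create narrow dependencies, while groupByKey creates a wide dependency that forces a shuffle; a brief illustrative sketch, reusing the lines RDD from the earlier example:

    // Narrow dependencies: each output partition reads exactly one parent
    // partition, so these steps can be pipelined on the same node.
    val nonEmpty = lines.filter(line => line.nonEmpty)
    val pairs    = nonEmpty.map(line => (line.split(" ")(0), 1))

    // Wide dependency: every output partition may need data from every
    // parent partition, so records are shuffled across nodes, much like
    // the shuffle between the map and reduce phases of MapReduce.
    val grouped = pairs.groupByKey()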
Spark can share resources with Hadoop, MPI, and other applications by running on top of the Mesos cluster manager. Each Spark program runs as a separate Mesos application, with its own driver (master) and workers, and resource sharing between these applications is handled by Mesos.
Spark's job scheduler uses the RDD representation described above and assigns tasks to machines based on data locality using delay scheduling. If a task fails, Spark re-runs it on another node, as long as its stage's parents are still available.
For memory management, Spark provides three options for storage of persistent RDDs: in-memory storage as deserialized Java objects, in-memory storage as serialized data, and on-disk storage. To manage the limited memory available, Spark uses an LRU eviction policy at the level of RDDs: when a new RDD partition is computed but there is not enough space to store it, Spark evicts a partition from the least recently accessed RDD, unless that is the same RDD as the one with the new partition.
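In today's Spark API these three options roughly correspond to storage levels passed to persist; a hedged sketch (hotData, warmData, and coldData are hypothetical RDDs):

    import org.apache.spark.storage.StorageLevel

    // Deserialized Java objects in RAM: fastest to access.
    hotData.persist(StorageLevel.MEMORY_ONLY)

    // Serialized bytes in RAM: more compact, slower to read back.
    warmData.persist(StorageLevel.MEMORY_ONLY_SER)

    // On disk: for RDDs too large for memory but expensive to recompute.
    coldData.persist(StorageLevel.DISK_ONLY)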
Checkpointing is most useful for RDDs with long lineage graphs containing wide dependencies. In contrast, for RDDs with narrow dependencies on data in stable storage, checkpointing may never be worthwhile: if a node fails, lost partitions from these RDDs can be recomputed in parallel on other nodes, at a fraction of the cost of replicating the whole RDD. Finally, the read-only nature of RDDs makes them simpler to checkpoint than general shared memory, because consistency is not a concern.
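Current Spark exposes checkpointing through a checkpoint directory and a per-RDD call; a hedged sketch, assuming spark is the SparkContext from the earlier examples and wideResult is a hypothetical RDD with a long lineage of wide dependencies:

    // Checkpoints are written to reliable storage such as HDFS.
    spark.setCheckpointDir("hdfs://...")

    // Mark the RDD for checkpointing before any job runs on it.
    wideResult.checkpoint()

    // The checkpoint is materialized when an action first computes the RDD;
    // afterwards its lineage is truncated at the checkpointed data.
    wideResult.count()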
RDDs can efficiently express a number of cluster programming models that have so far been proposed independently. They are an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications, offering an API based on coarse-grained transformations that lets applications recover lost data efficiently using lineage.