Speed is the key requirement for big data processing. Spark can run programs up to 100x faster than Hadoop MapReduce in memory and up to 10x faster on disk, largely because it reduces the number of read-write operations to disk.
- Fault Tolerance
- Lazy Evaluation
- In-Memory Computing
- Real-time Stream Processing
- Multiple Language Support
- High level API Support
- Optimized Performance
- Advanced Analytics
Fault Tolerance: Apache Spark is designed to handle worker node failures in the cluster. It achieves this through its DAG (Directed Acyclic Graph) and its RDD (Resilient Distributed Dataset). The DAG contains the lineage graph of all the transformations and actions needed to complete the task. If a worker node fails, the same results can be obtained by rerunning the steps recorded in the existing DAG. Thus, the loss of data and information is negligible.
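The idea of recovering lost results from lineage can be sketched in a few lines of plain Python. This is a toy model, not Spark's API: the `compute_partition` helper, the `source` dictionary, and the two transformation steps are all hypothetical names chosen for illustration.

```python
def compute_partition(source_rows, lineage):
    """Rebuild one partition by replaying every recorded step on its source rows."""
    rows = list(source_rows)
    for step in lineage:
        rows = step(rows)
    return rows


# Hypothetical input data, split into two partitions keyed by partition id.
source = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage: an ordered record of the transformations to apply.
lineage = [
    lambda rows: [r * 10 for r in rows],       # multiply every row by 10
    lambda rows: [r for r in rows if r > 10],  # keep rows greater than 10
]

# Normal run: every partition is computed once.
results = {pid: compute_partition(rows, lineage) for pid, rows in source.items()}
expected = dict(results)

# Simulate a worker failure: the computed result for partition 1 is lost.
del results[1]

# Recovery: replay the lineage for the lost partition only.
results[1] = compute_partition(source[1], lineage)

print(results == expected)
```

Only the lost partition is recomputed here; the surviving partition's result is reused as-is.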
Lazy Evaluation: As the name suggests, Spark evaluates transformations lazily: each transformation is only added to the DAG, and the computation runs only when an action is called. Spark maintains the lineage graph to remember the operations on an RDD and evaluates them only when the result is needed. This lets Spark avoid unnecessary work, which reduces the execution time of RDD operations and improves performance.
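A minimal sketch of lazy evaluation in plain Python follows. It is a toy stand-in for an RDD, not Spark's implementation: the `LazyDataset` class and its `plan` attribute are hypothetical, and only the method names `map`, `filter`, and `collect` are borrowed from the RDD vocabulary.

```python
class LazyDataset:
    """Toy dataset: transformations are recorded, and run only on an action."""

    def __init__(self, source, plan=()):
        self.source = source  # the input data
        self.plan = plan      # recorded transformations, in order

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self.source, self.plan + (("map", fn),))

    def filter(self, predicate):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self.source, self.plan + (("filter", predicate),))

    def collect(self):
        # Action: only now is the recorded plan actually executed.
        rows = iter(self.source)
        for kind, fn in self.plan:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)


calls = []  # records every value that double() is actually invoked with


def double(x):
    calls.append(x)
    return x * 2


dataset = LazyDataset(range(5)).map(double).filter(lambda x: x > 4)
calls_before_action = list(calls)  # still empty: nothing has run yet

result = dataset.collect()
print(calls_before_action, result)
```

Building `dataset` does not call `double` at all; the function runs only once `collect()` is invoked.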
In-Memory Computing: Disk seeks become very costly as data volumes grow. Reading terabytes to petabytes of data from disk and writing it back to disk, again and again, is not acceptable.
Spark overcomes this issue by processing data in memory. In-memory processing is faster because no time is spent moving data to and from disk. Keeping data in the servers' RAM makes stored data quick to access.
Spark can be up to 100 times faster than MapReduce when the data it works on fits in memory.
Spark has an advanced DAG execution engine that supports in-memory computation and acyclic data flow, resulting in high speed.
If the data is too large to fit in memory, Spark does not load all of it at once. It loads only as much data as fits in memory, processes it, and then loads the next portion from disk and processes that. This repeats until all the data on disk has been processed.
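The load-process-repeat pattern described above can be illustrated with a small plain-Python sketch. It is not how Spark manages memory internally. The `chunks` and `chunked_sum` helpers and the `memory_budget` parameter (the number of rows assumed to fit in memory at once) are hypothetical.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def chunked_sum(rows, memory_budget):
    """Sum all rows while holding at most `memory_budget` rows at a time."""
    total = 0
    peak_rows_held = 0
    for batch in chunks(rows, memory_budget):
        peak_rows_held = max(peak_rows_held, len(batch))
        total += sum(batch)  # process the loaded portion, then move on
    return total, peak_rows_held


# 1,000 rows processed with room for only 100 rows at a time.
total, peak_rows_held = chunked_sum(range(1, 1001), memory_budget=100)
print(total, peak_rows_held)
```

The full result is still computed, but no more than `memory_budget` rows are ever held at once.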
Real-time Stream Processing: Apache Spark supports stream processing, which involves continuous input and output of data. It emphasizes the velocity of the data: data is processed within a short period of time after it arrives.
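One common way to process a continuous stream is to cut it into small batches, one per fixed time interval, and process each batch as soon as it closes. The plain-Python sketch below illustrates that idea only and is not Spark's streaming API. The `micro_batches` helper and the sample `events` are hypothetical, and the events are assumed to arrive sorted by timestamp, starting at time 0.

```python
from collections import Counter


def micro_batches(events, interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    batch, batch_end = [], interval
    for timestamp, value in events:
        # Close every batch whose time window ended before this event.
        while timestamp >= batch_end:
            yield batch
            batch, batch_end = [], batch_end + interval
        batch.append(value)
    if batch:
        yield batch


# Hypothetical stream of (second, word) events.
events = [(0, "a"), (0, "b"), (1, "a"), (1, "a"), (3, "c")]

# Process each one-second batch as it closes: here, count the words in it.
counts = [Counter(batch) for batch in micro_batches(events, interval=1)]
for second, count in enumerate(counts):
    print(second, dict(count))
```

The interval with no events (second 2) yields an empty batch, so each output still corresponds to exactly one time window.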
Spark's high-level operators also keep boilerplate code to a minimum.
Multiple Language Support: Apache Spark supports the following four languages.
- Scala
- Java
- Python
- R
Spark is written in Scala, but we can work with any of the above languages. Scala and Python have interactive shells that programmers use to interact with Spark. Apache Spark also comes with a REPL (Read-Evaluate-Print Loop) that beginners can use to understand the Spark programming model. The Spark REPL is also known as the Spark CLI (Command Line Interface).
High-Level API Support: Apache Spark provides high-level APIs such as Spark SQL, Spark Streaming, MLlib, and GraphX for interacting with its core functionality. Spark also offers several core data abstractions over distributed collections of data: RDDs, DataFrames, and Datasets.
Optimized Performance: Spark's optimizations complement its in-memory computation. It achieves faster job execution by using resources efficiently, through techniques such as partitioning, data caching, an optimized shuffle, and the Catalyst optimizer.
Advanced Analytics: Spark supports complex analytics on large datasets through Spark MLlib and Spark GraphX.
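Of the techniques listed above, caching is the easiest to illustrate. The plain-Python sketch below is a toy model rather than Spark's caching mechanism. The `Dataset` class, its `evaluations` counter, and the `square` function are hypothetical. It shows why keeping a computed dataset in memory pays off when several actions reuse it.

```python
class Dataset:
    """Toy dataset whose mapped rows are recomputed per action unless cached."""

    def __init__(self, source, fn):
        self.source = list(source)
        self.fn = fn
        self.use_cache = False
        self._cached_rows = None
        self.evaluations = 0  # how many times fn has been applied to a row

    def cache(self):
        # Mark the dataset so the first action keeps its rows in memory.
        self.use_cache = True
        return self

    def _rows(self):
        if self._cached_rows is not None:
            return self._cached_rows  # reuse the rows held in memory
        rows = []
        for x in self.source:
            self.evaluations += 1
            rows.append(self.fn(x))
        if self.use_cache:
            self._cached_rows = rows
        return rows

    def count(self):
        return len(self._rows())

    def total(self):
        return sum(self._rows())


def square(x):
    return x * x


uncached = Dataset(range(4), square)
uncached_results = (uncached.count(), uncached.total())

cached = Dataset(range(4), square).cache()
cached_results = (cached.count(), cached.total())

print(uncached.evaluations, cached.evaluations)
```

Both versions return the same answers, but the cached dataset applies `square` to each row once instead of once per action.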