RDD in Apache Spark supports two types of operations.
Transformations: Transformation is an function that we can apply on RDD. It produces new RDD from the existing RDD . Each time it creates a new RDD when we apply any transformation. So, input RDD cannot be changed as RDD’s are immutable in nature. These are lazily evaluated means they do not compute their results immediately. Spark remembers all the transformations applied in the form of DAG. All transformations are evaluated only when an action is invoked.
There are two types of transformations.
- Narrow transformation
- Wide transformation.
Actions return final results of RDD computations. It triggers execution using lineage graph to load the data into original RDD, carry out all intermediate transformations and return final results to Driver program or write it out to file system. Lineage graph is dependency graph of all parallel RDDs of RDD.
Actions are RDD operations that produce non-RDD values. An Action is one of the ways to send result from executors to the driver.
Using transformations, one can create RDD from the existing one. But when we want to work with the actual dataset, at that point we use Action. When the Action occurs it does not create the new RDD, unlike transformation. Thus, actions are RDD operations that give no RDD values. Action stores its value either to drivers or to the external storage system. It brings laziness of RDD.
Transformations and actions in Spark.
Data Structure / I/O
Set Theory / Relational
Q) What is the difference between Narrow transformation and Wide transformation?
RDD operations like map, union, filter can operate on a single partition and map the data of that partition to resulting single partition. These kind of operations which maps data from one to one partition are referred as Narrow operations. Narrow operations doesn’t required to distribute the data across the partitions.
Any row of the child RDD will depend on only 1 row of the parent RDD. Since each child row can point to the 1 parent row it depends on, there is a narrow dependency.
RDD operations like groupByKey, distinct, join may require to map the data across the partitions in new RDD. These kind of operations which maps data from one to many partitions are referred as Wide operations.