RDD Operations

An RDD in Apache Spark supports two types of operations:

  1. Transformations
  2. Actions

Transformations: A transformation is a function applied to an RDD that produces a new RDD from the existing one. Every transformation creates a new RDD; the input RDD is never changed, because RDDs are immutable. Transformations are lazily evaluated, meaning they do not compute their results immediately. Spark records the applied transformations as a DAG (directed acyclic graph) and evaluates them only when an action is invoked.
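The lazy behaviour described above can be sketched in plain Python. This is a toy model, not Spark's API: the LazyDataset class, its method bodies, and the double helper are all hypothetical names invented for illustration. It only shows that a transformation records a step and returns a new object, and that nothing runs until an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark code)."""

    def __init__(self, source, ops=()):
        self._source = tuple(source)   # input data is never modified
        self._ops = tuple(ops)         # recorded transformations, in order

    def map(self, fn):
        # Transformation: returns a NEW dataset and computes nothing yet.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only records the step.
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def collect(self):
        # Action: replays every recorded step and returns a plain list.
        rows = list(self._source)
        for kind, fn in self._ops:
            if kind == "map":
                rows = [fn(x) for x in rows]
            else:
                rows = [x for x in rows if fn(x)]
        return rows


calls = []  # records each time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


base = LazyDataset([1, 2, 3, 4])
doubled = base.map(double)                # nothing executed yet
big = doubled.filter(lambda x: x > 4)     # still nothing executed
calls_before_action = len(calls)          # no calls recorded so far
result = big.collect()                    # the action triggers the work
```

In this sketch base is left untouched by map and filter, which mirrors the immutability described above; real Spark additionally distributes the work across partitions and executors.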

There are two types of transformations.

  1. Narrow transformation
  2. Wide transformation

Actions:

Actions return the final results of RDD computations. An action triggers execution: Spark uses the lineage graph to load the data into the original RDD, carries out all intermediate transformations, and returns the final result to the driver program or writes it to the file system. The lineage graph is the dependency graph of all the parent RDDs of an RDD.

Actions are RDD operations that produce non-RDD values. An action is one of the ways to send results from the executors to the driver.

Transformations create a new RDD from an existing one. When we want to work with the actual dataset, we use an action. Unlike a transformation, an action does not create a new RDD; it produces a non-RDD value, which is either returned to the driver or written to an external storage system. Invoking an action is what ends the lazy evaluation of an RDD and sets the computation in motion.
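As a rough illustration of how an action walks the lineage and hands back a non-RDD value, here is a small pure-Python sketch. The Node class and its fields are hypothetical and exist only for this example; Spark's real lineage tracking is far more involved (partitions, stages, fault recovery).

```python
from functools import reduce as _reduce


class Node:
    """Toy dataset node: remembers its parent and the step that made it."""

    def __init__(self, parent=None, name="source", step=None, data=None):
        self.parent, self.name, self.step, self.data = parent, name, step, data

    # Transformations: return a new Node and compute nothing.
    def map(self, fn):
        return Node(self, "map", lambda rows: [fn(r) for r in rows])

    def filter(self, pred):
        return Node(self, "filter", lambda rows: [r for r in rows if pred(r)])

    def lineage(self):
        # Walk the parent links back to the source, then report oldest first.
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]

    def _compute(self):
        # Load the source data, then apply each step along the lineage.
        if self.parent is None:
            return list(self.data)
        return self.step(self.parent._compute())

    # Actions: run the lineage and return plain Python values.
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

    def reduce(self, fn):
        return _reduce(fn, self._compute())


source = Node(data=[1, 2, 3, 4, 5])
evens_squared = source.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

steps = evens_squared.lineage()                        # names of the recorded steps
total = evens_squared.reduce(lambda a, b: a + b)       # a plain int, not a dataset
```

Note that filter and map return Node objects, while collect, count, and reduce return ordinary lists and numbers, which is the distinction the text draws between transformations and actions.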

Transformations and actions in Spark.

General:

Transformations:

  1. map
  2. filter
  3. flatMap
  4. mapPartitions
  5. mapPartitionsWithIndex
  6. groupBy
  7. sortBy

Actions:

  1. reduce
  2. collect
  3. aggregate
  4. fold
  5. first
  6. take
  7. foreach
  8. top
  9. treeAggregate
  10. treeReduce
  11. foreachPartition
  12. collectAsMap

Math/Statistical

Transformation:

  1. sample
  2. randomSplit

Action:

  1. count
  2. takeSample
  3. max
  4. min
  5. sum
  6. histogram
  7. mean
  8. variance
  9. stdev
  10. sampleVariance
  11. countApprox
  12. countApproxDistinct

Set Theory/Relational

Transformation:

  1. union
  2. intersection
  3. subtract
  4. distinct
  5. cartesian
  6. zip

Action:

  1. takeOrdered

Data Structure / I/O

Transformation:

  1. keyBy
  2. zipWithIndex
  3. zipWithUniqueId
  4. zipPartitions
  5. coalesce
  6. repartition
  7. repartitionAndSortWithinPartitions
  8. pipe

Action:

  1. saveAsTextFile
  2. saveAsSequenceFile
  3. saveAsObjectFile
  4. saveAsHadoopDataset
  5. saveAsHadoopFile
  6. saveAsNewAPIHadoopDataset
  7. saveAsNewAPIHadoopFile

PairRDD operations:

General:

Transformations:

  1. flatMapValues
  2. groupByKey
  3. reduceByKey
  4. foldByKey
  5. aggregateByKey
  6. sortByKey
  7. combineByKey
  8. keys
  9. values

Action:

  1. reduceByKeyLocally

Math/ Statistical

Transformations:

  1. sampleByKey
  2. sampleByKeyExact

Action:

  1. countByKey
  2. countByValue
  3. countByValueApprox
  4. countApproxDistinctByKey
  5. countByKeyApprox

Set Theory / Relational

Transformation:

  1. cogroup (alias: groupWith)
  2. join
  3. subtractByKey
  4. fullOuterJoin
  5. leftOuterJoin
  6. rightOuterJoin

Data Structure:

Transformation:

  1. partitionBy

Q) What is the difference between Narrow transformation and Wide transformation?

Narrow Transformation:

RDD operations such as map, union, and filter can work on a single partition and map its data to a single resulting partition. Operations that map data from one partition to one partition are called narrow operations. Narrow operations do not need to redistribute (shuffle) data across partitions.

Each partition of the child RDD depends on only one partition of the parent RDD. For an operation like map, each child row depends on exactly one parent row. Because each child partition can point to the single parent partition it depends on, the dependency is narrow.

Wide Transformation:

RDD operations such as groupByKey, distinct, and join may need to move data across partitions when building the new RDD. Operations in which data from one partition may end up in many partitions are called wide operations; they typically require a shuffle.
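The contrast between the two kinds of transformation can be sketched with partitions modelled as a plain list of lists. This is only an illustration under simplifying assumptions: the function names narrow_map and wide_group_by_key are hypothetical, keys are integers, and the key-modulo partitioner is a stand-in for whatever partitioner Spark would actually use.

```python
def narrow_map(partitions, fn):
    # Narrow: each output partition is built from exactly one input partition,
    # so no data has to move between partitions.
    return [[fn(row) for row in part] for part in partitions]


def wide_group_by_key(partitions, num_partitions):
    # Wide: every (key, value) pair is routed by its key, so one output
    # partition may receive rows from many input partitions (a shuffle).
    shuffled = [dict() for _ in range(num_partitions)]
    for part in partitions:
        for key, value in part:
            target = key % num_partitions   # assumed partitioner for int keys
            shuffled[target].setdefault(key, []).append(value)
    # Sort by key only to make the toy output deterministic.
    return [sorted(groups.items()) for groups in shuffled]


parts = [[(1, "a"), (2, "b")], [(1, "c"), (3, "d")]]

upper = narrow_map(parts, lambda kv: (kv[0], kv[1].upper()))
grouped = wide_group_by_key(parts, 2)
```

In upper, rows stay in the partition they started in. In grouped, the two values for key 1 began in different input partitions and end up together in one output partition, which is why wide operations are generally the more expensive ones.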
