Spark Introduction

Apache Spark is an open-source cluster computing framework and a fast processing engine for large-scale datasets. It was developed to overcome the limitations of Hadoop MapReduce.

Spark runs programs up to 100 times faster in memory, and up to 10 times faster on disk, than Hadoop MapReduce.

Spark is written in Scala, but it also provides APIs for Java, Python, and R.
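
As a taste of the API, here is a minimal word-count sketch in Scala. It is illustrative only: the application name, input path, and master URL are placeholders, and Spark 3.x is assumed to be on the classpath.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        // "local[*]" runs Spark locally on all available cores.
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()

        // Split each line into words, pair each word with 1,
        // and sum the counts per word.
        val counts = spark.sparkContext
          .textFile("input.txt")        // hypothetical input path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.collect().foreach(println)
        spark.stop()
      }
    }

The equivalent program looks very similar in the Java, Python (PySpark), and R APIs.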

Spark can run on its own (standalone) or on top of Hadoop via YARN.
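
The deployment mode is selected through the master URL when the application is submitted. A sketch of both modes, assuming the WordCount example above has been packaged into a JAR (the JAR and class names are placeholders):

    # Run locally on all cores, no cluster required:
    spark-submit --class WordCount --master "local[*]" wordcount.jar

    # Run on a Hadoop cluster via YARN:
    spark-submit --class WordCount --master yarn --deploy-mode cluster wordcount.jar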

History of Spark

Spark was started at UC Berkeley’s AMPLab in 2009 by Matei Zaharia. It was open sourced in 2010 under a BSD license and donated to the Apache Software Foundation in 2013. Apache Spark became a Top-Level Project in February 2014.

Is Spark a replacement for Hadoop?

The answer is no; Spark is not a replacement for Hadoop.

Hadoop provides both HDFS (storage) and MapReduce (processing), whereas Spark is only a processing framework; it can run on top of Hadoop.

Spark’s aim is to provide a unified platform for Big Data.

Limitations of Spark

  • Spark does not have its own file management system, so it must be integrated with Hadoop or a cloud-based storage platform (see the sketch after this list).
  • Keeping data in memory can become a bottleneck when cost-efficient processing of Big Data is the goal, because RAM is far more expensive than disk.
  • Memory consumption is very high.
  • It is designed for large datasets and offers little advantage on small workloads.
  • MLlib offers fewer algorithms than more mature machine learning libraries.
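
To illustrate the first limitation: since Spark ships no storage layer, a job points at external storage by URI scheme. A hedged sketch, reusing the spark session from the earlier example (host, port, bucket, and paths are all placeholders):

    // Spark itself stores nothing; data is addressed by URI scheme.
    val fromHdfs = spark.read.textFile("hdfs://namenode:9000/data/events.txt") // HDFS (hypothetical host/path)
    val fromS3   = spark.read.textFile("s3a://my-bucket/data/events.txt")      // S3 (requires the hadoop-aws connector)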