Programming : Spark is written in Scala, and it also provides APIs for Java, Python and R.
Storage : In Spark, we can load data from sources such as the following.
- Local File System
- RDBMS (Relational Database Management System)
After loading, we can transform the data using operations such as filter and join. In Spark, these operations are called transformations.
Finally, we can store the processed data in the same kinds of storage, such as the local file system, HDFS, S3, an RDBMS or a NoSQL store.
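The load, transform and store flow described above can be sketched on a single machine. This is a minimal illustration in plain Python, not Spark code: the file names, column names and the `run_pipeline` helper are assumptions made up for the example, and the local file system stands in for whichever storage we would use with Spark.

```python
import csv
import os
import tempfile


def load_csv(path):
    """Load step: read rows from a local CSV file as dictionaries."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def run_pipeline(orders_path, customers_path, out_path, min_amount):
    """Load two files, filter and join them, then store the result."""
    orders = load_csv(orders_path)
    customers = load_csv(customers_path)

    # Transformation 1 (filter): keep orders at or above min_amount.
    large = [o for o in orders if float(o["amount"]) >= min_amount]

    # Transformation 2 (inner join): attach the customer name by customer_id.
    names = {c["customer_id"]: c["name"] for c in customers}
    joined = [
        {"order_id": o["order_id"],
         "name": names[o["customer_id"]],
         "amount": o["amount"]}
        for o in large
        if o["customer_id"] in names
    ]

    # Store step: write the processed rows back to the local file system.
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["order_id", "name", "amount"])
        writer.writeheader()
        writer.writerows(joined)
    return joined


def demo():
    """Create two small hypothetical input files and run the pipeline."""
    workdir = tempfile.mkdtemp()
    orders_path = os.path.join(workdir, "orders.csv")
    customers_path = os.path.join(workdir, "customers.csv")
    out_path = os.path.join(workdir, "large_orders.csv")

    with open(orders_path, "w", newline="") as f:
        csv.writer(f).writerows([
            ["order_id", "customer_id", "amount"],
            ["1", "c1", "250.0"],
            ["2", "c2", "40.0"],
            ["3", "c3", "900.0"],   # c3 has no customer record
            ["4", "c1", "120.0"],
        ])
    with open(customers_path, "w", newline="") as f:
        csv.writer(f).writerows([
            ["customer_id", "name"],
            ["c1", "Asha"],
            ["c2", "Ben"],
        ])

    result = run_pipeline(orders_path, customers_path, out_path, 100.0)
    return result, out_path
```

In Spark the same three steps would run on distributed data, but the shape of the program (load, apply transformations, store) stays the same.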
Spark is not a replacement for Hadoop, and it does not depend on Hadoop, because it has its own cluster management. Hadoop is just one of the ways to deploy Spark.
Library : Apache Spark consists of the following components.
- Spark Core
- Spark SQL
- Spark Streaming
- Spark MLlib
- Spark GraphX
Spark Core is the heart of the Spark ecosystem.
It is the fundamental component of Spark and exposes the core API. It is built around a fundamental data structure known as the RDD (Resilient Distributed Dataset). It contains the basic functionality of Spark, including components for task scheduling, memory management, fault recovery and interaction with storage systems.
All other components of the Spark ecosystem are built on top of Spark Core.
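To make the RDD idea more concrete, here is a toy, single-machine model in plain Python. It is only a sketch under stated assumptions: `ToyRDD` and `compute_partition` are hypothetical names, not Spark's classes. The method names `parallelize`, `map`, `filter` and `collect` mirror operations that exist in Spark. The sketch shows two properties of RDDs: transformations are only recorded until a result is requested, and a single partition can be rebuilt from the source data plus the recorded steps, which is the basis of fault recovery.

```python
class ToyRDD:
    """A toy stand-in for an RDD: partitioned data plus recorded steps."""

    def __init__(self, partitions, lineage=()):
        self._source = partitions   # source data, split into partitions
        self._lineage = lineage     # transformations recorded so far

    @classmethod
    def parallelize(cls, data, num_partitions):
        """Split the data into contiguous partitions."""
        data = list(data)
        size = max(1, -(-len(data) // num_partitions))  # ceiling division
        return cls([data[i:i + size] for i in range(0, len(data), size)])

    def map(self, fn):
        # Lazy: only record the step and return a new ToyRDD.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        # Lazy: only record the step and return a new ToyRDD.
        return ToyRDD(self._source, self._lineage + (("filter", fn),))

    def compute_partition(self, index):
        """Rebuild one partition by replaying the lineage on its source.

        If a partition were lost, this replay is all that is needed.
        """
        rows = iter(self._source[index])
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return list(rows)

    def collect(self):
        """Action: compute every partition and gather the results."""
        result = []
        for index in range(len(self._source)):
            result.extend(self.compute_partition(index))
        return result


numbers = ToyRDD.parallelize(range(1, 11), 3)
# Nothing is computed on the next line; two steps are recorded.
even_squares = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
# The work happens only when an action such as collect() is called.
all_values = even_squares.collect()
```

In real Spark the partitions live on different machines and the scheduler decides where each one is computed. The toy keeps everything in one process so that the recorded steps are easy to see.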
Spark SQL is a component on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in Spark 1.3), which provides support for structured and semi-structured data. It is used to execute SQL queries written in either basic SQL syntax or HiveQL.
It supports many sources of data, including Hive tables, Parquet and JSON.
Spark Streaming is a component that enables processing of live streams of data.
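Spark Streaming handles a live stream by cutting it into small batches at a fixed interval and running a batch computation on each one. The idea can be sketched in plain Python. This is a toy illustration, not Spark's API: the function names, the event format `(timestamp, text)` and the one-second interval are assumptions chosen for the example.

```python
from collections import Counter


def micro_batches(events, batch_interval):
    """Group (timestamp, value) events into fixed-length time windows.

    Intervals that received no events are simply skipped in this sketch.
    """
    batches = {}
    for timestamp, value in events:
        batch_id = int(timestamp // batch_interval)
        batches.setdefault(batch_id, []).append(value)
    return [batches[batch_id] for batch_id in sorted(batches)]


def word_counts_per_batch(events, batch_interval):
    """Run the same small batch job (a word count) on every micro-batch."""
    return [
        dict(Counter(word for line in batch for word in line.split()))
        for batch in micro_batches(events, batch_interval)
    ]


# Hypothetical events: (arrival time in seconds, line of text).
events = [
    (0.2, "spark streaming"),
    (0.9, "spark core"),
    (1.4, "spark sql"),
    (2.7, "streaming"),
]
counts = word_counts_per_batch(events, batch_interval=1.0)
```

Each entry in `counts` is the result for one interval, so a consumer sees a sequence of small batch results rather than one result for the whole stream.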
Spark comes with a library containing common machine learning (ML) functionality, called MLlib. It provides multiple types of machine learning algorithms, including classification, regression, clustering and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
GraphX is a component on top of Spark Core. It is a library for manipulating graphs (e.g., a social network's friend graph) and performing graph-parallel computations.
Management : Spark supports three types of cluster resource managers.