Spark: What is Spark?


Apache Spark is an open-source, distributed system for processing large amounts of data. It is primarily used for analytics, machine learning, and other applications that require the fast processing of massive datasets.


History of Spark:

In 2009, a project called Mesos was started at UC Berkeley. Mesos is a cluster resource management system, similar to YARN in Hadoop.

In Hadoop 1.x, the MapReduce processing layer consists of two daemons: the JobTracker and the TaskTracker. The creators of Mesos were well aware of MapReduce's drawbacks, so they developed Spark as a test application for Mesos, even though their initial goal was Mesos itself.

The initial Spark program was just 100 lines of code, yet it turned out to be almost 10x faster than Hadoop. This discovery shifted the focus from Mesos to Spark. Spark was open-sourced in 2010 and donated to the Apache Software Foundation in 2013.

In February 2014, Spark became a Top-Level Project at Apache. With the rise of Spark, people began to move away from MapReduce, as Spark was found to be up to 100x faster.

Today, Spark is an essential Big Data technology for data processing.


What is the main programming language for Spark (Spark 1.x)?

The main programming language for Spark is Scala. Wrappers are available to support other languages; Python, for example, talks to Spark through the Py4J library.

Supported programming languages include:

  • Scala
  • Python
  • Java
  • R
  • SQL

Why Scala?

Scala is built on top of the Java platform. It has access to all of Java's features and lets you use Java code directly within Scala code. Scala (short for "scalable language") began development in the early 2000s with the goal of addressing shortcomings in Java. Since Hadoop is also built in Java, Scala complements the ecosystem well.

Key points:

  • To implement new features in Spark itself, either Java or Scala should be used.
  • Scala is the major programming language for Spark.
  • Python can also be used, via the Py4J library.
  • Scala can use Java code directly; calling Scala code from Java is possible but far less seamless (see the sketch below).
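
As a minimal sketch of this interoperability (the class and values here are illustrative and not tied to Spark):

// Scala can use Java classes directly because both compile to JVM bytecode.
import java.util.ArrayList

object JavaInterop {
  def main(args: Array[String]): Unit = {
    val names = new ArrayList[String]() // a plain Java collection, used from Scala
    names.add("spark")
    names.add("scala")
    println(names) // prints [spark, scala]
  }
}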

Note: This context applies to Spark 1.x.


From Spark 2.x: Unified Engine for Large-Scale Data Analytics

From Spark 2.x onwards, Spark introduced a unified engine for large-scale data analytics.

Key improvements include:

  • Performance is broadly the same regardless of which programming language is used, since DataFrame queries compile down to the same execution plan.
  • The API in Scala, Java, and Python is nearly identical (roughly 90% similar).
  • This makes it easy to switch from one language to another: if you learned Spark in Python, transitioning to Scala only requires basic knowledge of Scala.

Note: Spark 3.x addressed several minor issues present in Spark 2.x.


Code Snippets: Python, Scala, and Java Comparison

The code for reading data and performing operations in Spark is almost identical across Python, Scala, and Java. This dispels the myth that you need in-depth programming knowledge to learn Spark: basic programming skills are sufficient, as long as you understand Spark's core concepts. All three snippets below assume an existing SparkSession named spark.

Python Code (Read data from logs.json):

df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()

Scala Code (Read data from logs.json):

val df = spark.read.json("logs.json")
df.where("age > 21").select("name.first").show()

Java Code (Read data from logs.json):

Dataset<Row> df = spark.read.json("logs.json");
df.where("age > 21").select("name.first").show();

More Information about Spark:

The heart of Spark is Spark Core, and on top of it sit the following 4 main libraries:

  1. Spark SQL
  2. Spark Streaming
  3. Spark MLlib
  4. Spark GraphX

The Spark Context and the RDD (Resilient Distributed Dataset) are the primary concepts of Spark; these four libraries are built on top of them.

Spark Context: the entry point for any Spark work. Using the Spark Context, you create an RDD, and operations (such as filter, groupBy, min, max, etc.) are then run on that RDD, as the sketch below shows.
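
As a minimal sketch of these two concepts together (assuming Spark's Scala API running locally; the app name and sample values are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object RddExample {
  def main(args: Array[String]): Unit = {
    // The Spark Context is the entry point (Spark 1.x style)
    val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Create an RDD from an in-memory collection
    val ages = sc.parallelize(Seq(18, 25, 30, 42, 17))

    // filter is a transformation; count, min, and max are actions
    val adults = ages.filter(_ > 21)
    println(s"count=${adults.count()}, min=${adults.min()}, max=${adults.max()}")

    sc.stop()
  }
}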


Conclusion:

Spark is fundamentally about processing large amounts of data (Big Data). In upcoming blogs, I will explore how to process large datasets and how to do the same in the cloud (using AWS, Azure, etc.). Stay tuned for more details on Spark in future discussions!
