🔥Apache Spark Architecture with RDD & DAG

Apache Spark follows a master-slave architecture, designed for fast, distributed data processing. It consists of three main components:

  1. Driver Node (Master)
  2. Cluster Manager
  3. Executors (Workers)

Additionally, two important internal components play a crucial role in execution:

  1. RDD (Resilient Distributed Dataset)
  2. DAG (Directed Acyclic Graph)

1. Driver Node (Master)

The Driver Node is responsible for coordinating and executing a Spark application.

Responsibilities of the Driver:

  • Starts a SparkSession (entry point of Spark).
  • Divides the program into tasks and schedules them.
  • Sends tasks to executors for execution.
  • Monitors task execution and collects results.

Working of the Driver Node:

  1. User submits a Spark job using spark-submit.
  2. Driver requests resources from the Cluster Manager.
  3. The job is converted into a DAG (Directed Acyclic Graph).
  4. The DAG is split into Stages, and each Stage contains Tasks.
  5. The Cluster Manager assigns Executors, and the Driver schedules tasks on them.
  6. Executors process data, and results are sent back to the driver.
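
These steps can be seen in a minimal driver program. The sketch below is illustrative (the file name my_driver.py is an assumption): submitted with spark-submit, it runs on the driver, which builds the DAG and schedules tasks on the executors obtained from the cluster manager.

# my_driver.py : a minimal driver program sketch (names are illustrative)
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # 1. Start a SparkSession (and, with it, a SparkContext) on the driver
    spark = SparkSession.builder.appName("DriverExample").getOrCreate()

    # 2. Transformations only record lineage; nothing executes yet
    rdd = spark.sparkContext.parallelize(range(1, 101))
    squared = rdd.map(lambda x: x * x)

    # 3. The action triggers DAG construction, scheduling, and execution on executors
    print(squared.sum())  # 338350

    # 4. Release the executors
    spark.stop()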

2. Cluster Manager

The Cluster Manager is responsible for allocating resources to Spark applications. Spark supports multiple cluster managers:

  • Standalone: Spark’s built-in cluster manager, shipped with Spark itself.
  • YARN: Used in Hadoop clusters.
  • Mesos: A general-purpose cluster manager.
  • Kubernetes: Manages Spark in a containerized environment.
  • Local Mode: Runs Spark on a single machine (for testing and development).
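
The choice of cluster manager is made through the master URL when the application starts. A brief sketch (the host names and ports below are placeholders):

from pyspark.sql import SparkSession

# Pick ONE master URL depending on the cluster manager (hosts/ports are placeholders):
#   "local[*]"                -> Local Mode (all cores on this machine)
#   "spark://host:7077"       -> Standalone
#   "yarn"                    -> Hadoop YARN
#   "mesos://host:5050"       -> Mesos
#   "k8s://https://host:6443" -> Kubernetes
spark = (
    SparkSession.builder
    .appName("ClusterManagerExample")
    .master("local[*]")
    .getOrCreate()
)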

Working of the Cluster Manager:

  1. The driver requests executors from the cluster manager.
  2. The cluster manager allocates executors on worker nodes.
  3. Executors are started on worker nodes.
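
As a rough sketch of this hand-off, executor resources are requested through configuration; the values below are placeholders, and the exact options vary by cluster manager:

from pyspark.sql import SparkSession

# The driver asks the cluster manager for executors with these resources;
# the cluster manager then launches them on worker nodes.
spark = (
    SparkSession.builder
    .appName("ResourceRequestExample")
    .config("spark.executor.instances", "4")   # number of executors (YARN/Kubernetes)
    .config("spark.executor.memory", "4g")     # memory per executor
    .config("spark.executor.cores", "2")       # cores per executor
    .getOrCreate()
)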

3. Executors (Workers)

Executors are worker processes, launched on the cluster’s worker nodes, that perform the actual computation.

Responsibilities of Executors:

  • Receive tasks from the driver.
  • Process data using RDDs.
  • Store data in memory for caching.
  • Return results to the driver.
  • Run multiple tasks in parallel, typically one per available core.

Working of Executors:

  1. Executors receive tasks from the driver.
  2. Each task runs on a partitioned dataset (RDD).
  3. Transformations (like map, filter) are applied to RDDs.
  4. Multiple tasks run in parallel within each executor, typically one per core.
  5. Actions (like count, collect) trigger execution.
  6. The processed data is sent back to the driver.
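
A small illustrative sketch, assuming the spark session created earlier, of how executors work on partitions and keep cached data in memory:

# 8 partitions -> up to 8 tasks can run in parallel across the executors
rdd = spark.sparkContext.parallelize(range(1_000_000), 8)
print(rdd.getNumPartitions())  # 8

# cache() asks executors to keep the computed partitions in memory
evens = rdd.filter(lambda x: x % 2 == 0).cache()
print(evens.count())  # first action: partitions are computed and cached
print(evens.count())  # second action: served from executor memory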

4. SparkContext & SparkSession

SparkContext

  • SparkContext is the entry point to low-level Spark functionalities.
  • Used to initialize RDDs and interact with cluster managers.
  • Required when working with RDD APIs.

Example:

from pyspark import SparkContext

# Create a SparkContext with a master URL ("local") and an application name
sc = SparkContext("local", "SparkContextExample")
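
A short follow-up sketch showing the context in use:

rdd = sc.parallelize([1, 2, 3])
print(rdd.count())  # 3
sc.stop()  # release resources when the application is finished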

SparkSession

  • SparkSession is the unified entry point to Spark applications (introduced in Spark 2.0).
  • Subsumes SQLContext and HiveContext and wraps the underlying SparkContext.
  • Recommended for DataFrame and Dataset APIs.

Example:

from pyspark.sql import SparkSession

# Build (or reuse) the application's SparkSession
spark = SparkSession.builder.appName("SparkSessionExample").getOrCreate()
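
The underlying SparkContext remains available from the session when the RDD API is needed; a brief sketch:

sc = spark.sparkContext   # low-level entry point for RDD work
df = spark.range(5)       # DataFrame API via the session
print(df.count())         # 5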

5. RDD (Resilient Distributed Dataset)

RDD is Spark’s fundamental data structure, designed for fault tolerance and distributed computing.

Characteristics of RDD:

  • Resilient: Recovers automatically from node failures by recomputing lost partitions from lineage.
  • Distributed: Data is split across multiple nodes for parallel processing.
  • Immutable: Once created, an RDD cannot be changed; transformations produce new RDDs.
  • Lazy Evaluation: Transformations are not executed immediately, but only when an Action is called.
  • Partitioned: Data is automatically split into partitions for parallel execution.

Types of RDD Operations

  • Transformations → Create a new RDD from an existing one.
    • Examples: map(), filter(), flatMap(), reduceByKey()
  • Actions → Trigger execution and return results to the driver.
    • Examples: collect(), count(), saveAsTextFile(), first()
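
This split is what makes evaluation lazy; a small sketch, assuming the spark session from the earlier example:

words = spark.sparkContext.parallelize(["spark", "rdd", "dag", "spark"])
pairs = words.map(lambda w: (w, 1))             # transformation: nothing executes yet
counts = pairs.reduceByKey(lambda a, b: a + b)  # transformation: still nothing executes
print(counts.collect())                         # action: triggers the whole computation
# Output (order may vary): [('spark', 2), ('rdd', 1), ('dag', 1)]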

Working of RDD

  1. Spark reads data from a source (HDFS, S3, Kafka, etc.).
  2. The data is split into partitions, and an RDD is created.
  3. Transformations are applied to the RDD, creating a DAG.
  4. When an Action is called, Spark executes the DAG.

Example:

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("RDDExample").getOrCreate()

# Create an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Apply a transformation (multiply each element by 2)
transformed_rdd = rdd.map(lambda x: x * 2)

# Perform an action (collect results)
result = transformed_rdd.collect()
print(result)  # Output: [2, 4, 6, 8, 10]

6. DAG (Directed Acyclic Graph)

The DAG is the execution plan of a Spark job. It records the dependencies between RDD transformations, which lets Spark schedule stages and tasks efficiently and recompute lost data when needed.

How the DAG Works

  1. RDD transformations build up a DAG of dependencies (the lineage).
  2. When an Action is called, the DAG Scheduler splits the job into Stages at shuffle boundaries.
  3. Each Stage is divided into Tasks (one per partition), which are executed in parallel.
  4. Execution follows the dependencies between RDDs; lost partitions can be recomputed from the lineage, ensuring fault tolerance.
  5. The DAG Scheduler pipelines narrow transformations (such as map and filter) within a single stage for efficiency.
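
The DAG (the RDD lineage) can be inspected directly with toDebugString(); a small sketch, noting that the exact output format varies between Spark versions:

lines = spark.sparkContext.parallelize(["a b", "b c"])
counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
print(counts.toDebugString().decode())  # shows the ShuffledRDD and its pipelined parent RDDs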

Example of DAG Execution

rdd = spark.sparkContext.textFile("data.txt")  # narrow: read the file
rdd = rdd.flatMap(lambda x: x.split(" "))      # narrow: pipelined into Stage 1
rdd = rdd.map(lambda x: (x, 1))                # narrow: pipelined into Stage 1
rdd = rdd.reduceByKey(lambda a, b: a + b)      # wide: shuffle creates Stage 2
rdd.collect()                                  # Action: triggers the job

Stages Breakdown:

  1. Stage 1: Load the file, split lines into words, and map each word to a (word, 1) pair. These are narrow transformations, so Spark pipelines them into a single stage.
  2. Stage 2: reduceByKey requires a shuffle, so the counts are aggregated in a second stage.
  3. Action: collect() triggers the job and returns the final word counts to the driver.

Apache Spark Architecture Workflow

Workflow Steps:

  1. User submits a job → Spark Driver initializes SparkSession and SparkContext.
  2. Driver constructs DAG → Converts transformations into a Directed Acyclic Graph.
  3. DAG Scheduler splits DAG into Stages.
  4. Stages are divided into Tasks → Each Task is mapped to a partition.
  5. Cluster Manager allocates resources → Executors receive assigned tasks.
  6. Executors process tasks in parallel and return results to the driver.
  7. Results are collected and displayed.

This architecture enables high-performance, distributed data processing in Apache Spark.
