Apache Spark Architecture with RDD & DAG
Apache Spark follows a master-slave architecture, designed for fast, distributed data processing. It consists of three main components:
- Driver Node (Master)
- Cluster Manager
- Executors (Workers)
Additionally, two important internal components play a crucial role in execution:
- RDD (Resilient Distributed Dataset)
- DAG (Directed Acyclic Graph)
1. Driver Node (Master)
The Driver Node runs the main program of a Spark application and coordinates its execution.
Responsibilities of the Driver:
- Starts a SparkSession (entry point of Spark).
- Divides the program into tasks and schedules them.
- Sends tasks to executors for execution.
- Monitors task execution and collects results.
Working of the Driver Node:
- User submits a Spark job using `spark-submit`.
- Driver requests resources from the Cluster Manager.
- The job is converted into a DAG (Directed Acyclic Graph).
- The DAG is split into Stages, and each Stage contains Tasks.
- The Cluster Manager assigns Executors, and the Driver schedules tasks on them.
- Executors process data, and results are sent back to the driver.
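For illustration, here is a minimal sketch of a driver program and how it might be submitted. The file name my_app.py, the master URL, and the deploy mode are assumptions that depend on your cluster setup:
# my_app.py (hypothetical file name) -- a minimal sketch of a driver program
from pyspark.sql import SparkSession

# The driver starts here: it creates the SparkSession (and SparkContext),
# builds a DAG from the code below, and schedules tasks on executors.
spark = SparkSession.builder.appName("DriverExample").getOrCreate()

rdd = spark.sparkContext.parallelize(range(10))   # data is split into partitions
total = rdd.map(lambda x: x * x).sum()            # transformation + action
print("Sum of squares:", total)                   # result is returned to the driver

spark.stop()

# It could be submitted to a cluster with something like:
#   spark-submit --master yarn --deploy-mode cluster my_app.py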
2. Cluster Manager
The Cluster Manager is responsible for allocating resources to Spark applications. Spark supports multiple cluster managers:
| Cluster Manager | Description |
|---|---|
| Standalone | Spark’s built-in cluster manager (default). |
| YARN | Used in Hadoop clusters. |
| Mesos | A general-purpose cluster manager. |
| Kubernetes | Manages Spark in a containerized environment. |
| Local Mode | Runs Spark on a single machine (for testing). |
Working of the Cluster Manager:
- The driver requests executors from the cluster manager.
- The cluster manager allocates executors on worker nodes.
- Executors are started on worker nodes.
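As a sketch, the snippet below shows how an application can describe the resources it wants the cluster manager to allocate. The master URL and the numbers are illustrative assumptions, not recommendations; it assumes a YARN cluster, while on a single machine you would use .master("local[*]") (where per-executor settings like these are not applied):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ClusterManagerExample")
    .master("yarn")                             # or "spark://host:7077", "k8s://...", "local[*]"
    .config("spark.executor.instances", "2")    # number of executors to request
    .config("spark.executor.cores", "2")        # CPU cores per executor
    .config("spark.executor.memory", "2g")      # memory per executor
    .getOrCreate()
)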
3. Executors (Workers)
Executors are distributed processing engines that perform actual computation.
Responsibilities of Executors:
- Receive tasks from the driver.
- Process data using RDDs.
- Store data in memory for caching.
- Return results to the driver.
- Run multiple tasks in parallel across the executor's cores.
Working of Executors:
- Executors receive tasks from the driver.
- Each task processes one partition of the dataset (RDD).
- Transformations (like map, filter) are applied to those partitions.
- Multiple tasks run concurrently on each executor, one per available core.
- Actions (like count, collect) trigger execution.
- The processed data is sent back to the driver.
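A small sketch of this behaviour, assuming a running SparkSession named spark; the choice of 4 partitions is illustrative:
# Split the data into 4 partitions; each partition becomes one task,
# and the executors run those tasks in parallel.
rdd = spark.sparkContext.parallelize(range(1, 101), numSlices=4)
print(rdd.getNumPartitions())   # 4

# Each executor applies this function to the partitions it was assigned.
partial_sums = rdd.mapPartitions(lambda part: [sum(part)]).collect()
print(partial_sums)             # one partial sum per partition, e.g. [325, 950, 1575, 2200]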
4. SparkContext & SparkSession
SparkContext
`SparkContext` is the entry point to low-level Spark functionality.
- Used to initialize RDDs and interact with the cluster manager.
- Required when working with the RDD API.
Example:
from pyspark import SparkContext
sc = SparkContext("local", "SparkContextExample")
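Continuing the example above, the context can then be used to build RDDs directly:
rdd = sc.parallelize([1, 2, 3, 4, 5])   # create an RDD from a local collection
print(rdd.count())                      # Action: returns 5

sc.stop()   # release the context when finished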
SparkSession
`SparkSession` is the unified entry point to a Spark application.
- Combines SQLContext, HiveContext, and SparkContext.
- Recommended for the DataFrame and Dataset APIs.
Example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SparkSessionExample").getOrCreate()
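Continuing the example above, the session can be used directly with the DataFrame API (a small illustrative sketch):
# Create a DataFrame from an in-memory list of rows.
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
df.show()   # prints the two rows as a small table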
5. RDD (Resilient Distributed Dataset)
RDD is Spark’s fundamental data structure, designed for fault tolerance and distributed computing.
Characteristics of RDD:
| Feature | Description |
|---|---|
| Resilient | Recovers automatically from node failures. |
| Distributed | Data is split across multiple nodes for parallel processing. |
| Immutable | Once created, RDDs cannot be changed. |
| Lazy Evaluation | Transformations are not executed immediately, but only when an Action is called. |
| Partitioned | Data is automatically split into partitions for parallel execution. |
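A short sketch of lazy evaluation and immutability, assuming the SparkSession named spark from the earlier example:
nums = spark.sparkContext.parallelize([1, 2, 3])

# Lazy evaluation: this line only records the transformation; nothing runs yet.
doubled = nums.map(lambda x: x * 2)

# Immutability: the original RDD is unchanged; map() produced a new RDD.
print(nums.collect())      # [1, 2, 3]  (the Action is what triggers execution)
print(doubled.collect())   # [2, 4, 6]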
Types of RDD Operations
- Transformations → Create a new RDD from an existing one.
  - Examples: `map()`, `filter()`, `flatMap()`, `reduceByKey()`
- Actions → Trigger execution and return results.
  - Examples: `collect()`, `count()`, `saveAsTextFile()`, `first()`
Working of RDD
- Spark reads data from a source (HDFS, S3, Kafka, etc.).
- The data is split into partitions, and an RDD is created.
- Transformations are applied to the RDD, creating a DAG.
- When an Action is called, Spark executes the DAG.
Example:
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("RDDExample").getOrCreate()
# Create an RDD from a list
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
# Apply a transformation (multiply each element by 2)
transformed_rdd = rdd.map(lambda x: x * 2)
# Perform an action (collect results)
result = transformed_rdd.collect()
print(result) # Output: [2, 4, 6, 8, 10]
6. DAG (Directed Acyclic Graph)
The DAG is the execution plan of a Spark job: it records the dependencies between RDD transformations so Spark can schedule tasks efficiently and optimize execution.
How the DAG Works
- RDD Transformations create a DAG.
- DAG breaks tasks into Stages.
- Stages are divided into Tasks, which are executed in parallel.
- Execution follows dependencies between RDDs, ensuring fault tolerance.
- The DAG Scheduler optimizes execution by pipelining narrow transformations together within a stage wherever possible.
Example of DAG Execution
rdd = spark.sparkContext.textFile("data.txt")      # Step 1: load the file
rdd = rdd.flatMap(lambda x: x.split(" "))          # Step 2: split lines into words
rdd = rdd.map(lambda x: (x, 1))                    # Step 3: build (word, 1) pairs
rdd = rdd.reduceByKey(lambda a, b: a + b)          # Step 4: aggregate counts (shuffle)
rdd.collect()                                      # Step 5: Action triggers execution
Stages Breakdown:
- Steps 1–3 (loading the file, splitting lines into words, building key-value pairs) are narrow transformations, so Spark pipelines them into Stage 1.
- Step 4 (reduceByKey) requires a shuffle, so it starts Stage 2, where the counts are aggregated.
- Step 5 (collect) is the Action that triggers execution of the whole DAG and returns the results to the driver.
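To inspect the plan Spark actually built, you can print the RDD's lineage with toDebugString(); the indentation in its output marks shuffle boundaries, i.e. stage splits. The Spark UI (typically at http://localhost:4040 while an application runs locally) shows the same DAG graphically.
# Print the lineage of the word-count RDD built above.
lineage = rdd.toDebugString()
print(lineage.decode("utf-8") if isinstance(lineage, bytes) else lineage)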
Apache Spark Architecture Workflow Diagram
Workflow Steps:
- User submits a job → the Spark Driver initializes `SparkSession` and `SparkContext`.
- Driver constructs the DAG → transformations are converted into a Directed Acyclic Graph.
- DAG Scheduler splits DAG into Stages.
- Stages are divided into Tasks → Each Task is mapped to a partition.
- Cluster Manager allocates resources → Executors receive assigned tasks.
- Executors process tasks in parallel and return results to the driver.
- Results are collected and displayed.
This architecture enables high-performance, distributed data processing in Apache Spark.