Understanding DStream, RDD, and Structured Streaming in Apache Spark
Understanding DStream, RDD, and Structured Streaming in Apache Spark
When working with Apache Spark Streaming, we encounter terms like DStream, RDD, and Structured Streaming. Let's break down these concepts and fully understand them.
1. What is an RDD (Resilient Distributed Dataset)?
Definition
An RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster.
Key Properties of RDDs
- Immutable – Once created, an RDD cannot be changed; transformations create new RDDs.
- Distributed – Data is split across multiple nodes in the cluster.
- Fault-Tolerant – Can recover lost data automatically using lineage.
- Lazy Evaluation – Transformations are not executed immediately but only when an action is triggered.
Example of Creating an RDD in PySpark
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("RDDExample").getOrCreate()
# Convert a Python list into an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)
# Perform a transformation (multiply each element by 2)
rdd_transformed = rdd.map(lambda x: x * 2)
# Perform an action (collect results)
print(rdd_transformed.collect())
🔹 Output: [2, 4, 6, 8, 10]
2. What is a DStream (Discretized Stream)?
Definition
A DStream (Discretized Stream) is the basic abstraction in Spark Streaming. It is a continuous sequence of RDDs, where each RDD represents data from a particular time interval.
How DStreams Work?
- Receives Streaming Data from a source (e.g., Kafka, HDFS, TCP socket).
- Splits Data into Micro-Batches (small time intervals, like 1 second).
- Processes Each Batch as an RDD using transformations.
- Sends the Processed Data to an output sink (console, database, Kafka, etc.).
Example of Spark Streaming (DStream API)
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
# Create a SparkContext and StreamingContext (batch interval: 1 second)
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)
# Create a DStream that listens to a TCP socket
lines = ssc.socketTextStream("localhost", 9999)
# Split lines into words
words = lines.flatMap(lambda line: line.split(" "))
# Count words in each batch
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
# Print output to console
wordCounts.pprint()
# Start streaming
ssc.start()
ssc.awaitTermination()
Explanation of Code:
-
Streaming Context (
ssc
):- The
StreamingContext
is created with a batch interval of 1 second. - This means Spark will process new data every 1 second.
- The
-
Reading Data (
socketTextStream
):- The program listens for incoming text data from a TCP socket (e.g.,
localhost:9999
). - This is commonly used to read real-time data streams.
- The program listens for incoming text data from a TCP socket (e.g.,
-
Processing Data (
flatMap
andreduceByKey
):flatMap(lambda line: line.split(" "))
→ Splits each line of text into words.map(lambda word: (word, 1))
→ Maps each word to a(word, 1)
tuple.reduceByKey(lambda a, b: a + b)
→ Aggregates word counts.
-
Printing Results (
pprint
):- The final word count is printed to the console every batch interval (1 second).
3. What is Structured Streaming?
Definition
Structured Streaming is the modern and more advanced streaming engine in Spark. It is built on Spark SQL and DataFrames instead of RDDs and DStreams.
Why Use Structured Streaming?
✅ Faster and More Optimized – Uses Catalyst optimizer and Tungsten engine.
✅ Easy to Use – Uses SQL and DataFrame APIs instead of RDDs.
✅ Supports Exactly-Once Processing – Prevents duplicate data processing.
✅ Built-in Event-Time Processing – Handles late-arriving data using watermarks.
✅ Integrates with Kafka, HDFS, S3, JDBC, and More.
Example of Structured Streaming
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()
# Read a stream from a socket
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
# Split lines into words
words = lines.selectExpr("explode(split(value, ' ')) as word")
# Count words
wordCounts = words.groupBy("word").count()
# Output the result to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
Explanation of Code:
-
Reading Data (
readStream
)- Reads text data from a TCP socket on localhost:9999.
-
Processing Data (
selectExpr
,groupBy
,count
)- Splits each line into words.
- Groups words and counts occurrences.
-
Outputting Data (
writeStream
)- Writes the results continuously to the console.
- Supports output modes:
- Append – Shows new results only.
- Complete – Displays all results so far.
- Update – Updates changed results.
4. Comparison: DStream vs. Structured Streaming
Feature | Spark Streaming (DStreams) | Structured Streaming (DataFrames) |
---|---|---|
API | RDD-based (DStream) | DataFrame/Dataset-based |
Processing Model | Micro-batch (fixed time) | Micro-batch & Continuous |
Latency | Higher (~500ms - few sec) | Lower (few ms - sec) |
Ease of Use | Complex (RDD Transformations) | Simple (SQL and DataFrame API) |
Optimizations | Manual tuning required | Automatic (Catalyst, Tungsten) |
Event-Time Processing | Hard to implement | Built-in support |
Fault Tolerance | Checkpointing, WAL | Exactly-once processing |
Source Support | Kafka, Flume, HDFS | Kafka, HDFS, S3, Delta Lake |
5. Which One Should You Use?
Scenario | Recommended Choice |
---|---|
New Streaming Projects | Structured Streaming |
Low Latency Needs | Structured Streaming |
SQL-based Queries | Structured Streaming |
Legacy Codebases | Spark Streaming (DStreams) |
Final Recommendation:
✅ Use Structured Streaming for new projects because it is faster, easier, and future-proof.
❌ Avoid DStreams unless working with legacy applications.
Conclusion
- RDDs: The basic distributed data structure in Spark.
- DStreams: A sequence of RDDs used in Spark Streaming (legacy).
- Structured Streaming: The modern and optimized approach to real-time data streaming.
Comments
Post a Comment