Understanding DStream, RDD, and Structured Streaming in Apache Spark


When working with Apache Spark Streaming, you'll encounter the terms DStream, RDD, and Structured Streaming. Let's break down each of these concepts and how they relate.


1. What is an RDD (Resilient Distributed Dataset)?

Definition

An RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster.

Key Properties of RDDs

  • Immutable – Once created, an RDD cannot be changed; transformations create new RDDs.
  • Distributed – Data is split across multiple nodes in the cluster.
  • Fault-Tolerant – Lost partitions can be recomputed automatically using lineage (see the lineage sketch after the example below).
  • Lazy Evaluation – Transformations are not executed immediately but only when an action is triggered.

Example of Creating an RDD in PySpark

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("RDDExample").getOrCreate()

# Convert a Python list into an RDD
data = [1, 2, 3, 4, 5]
rdd = spark.sparkContext.parallelize(data)

# Perform a transformation (lazy: multiply each element by 2; nothing executes yet)
rdd_transformed = rdd.map(lambda x: x * 2)

# Perform an action (collect() triggers execution and returns the results)
print(rdd_transformed.collect())

🔹 Output: [2, 4, 6, 8, 10]
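
Because fault tolerance comes from lineage, it can help to see the dependency chain Spark keeps for an RDD. A minimal sketch, continuing the session from the example above:

# Inspect the lineage of rdd_transformed. Spark records how the RDD was
# derived (parallelize -> map) and replays this chain to recompute lost
# partitions after a failure. In PySpark, toDebugString() returns bytes,
# so decode it before printing; the exact text varies by Spark version.
print(rdd_transformed.toDebugString().decode("utf-8"))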


2. What is a DStream (Discretized Stream)?

Definition

A DStream (Discretized Stream) is the basic abstraction in Spark Streaming. It is a continuous sequence of RDDs, where each RDD represents data from a particular time interval.

How Do DStreams Work?

  1. Receives Streaming Data from a source (e.g., Kafka, HDFS, TCP socket).
  2. Splits Data into Micro-Batches (small time intervals, like 1 second).
  3. Processes Each Batch as an RDD using transformations.
  4. Sends the Processed Data to an output sink (console, database, Kafka, etc.).

Example of Spark Streaming (DStream API)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Create a SparkContext and StreamingContext (batch interval: 1 second)
sc = SparkContext("local[2]", "NetworkWordCount")
ssc = StreamingContext(sc, 1)

# Create a DStream that listens to a TCP socket
lines = ssc.socketTextStream("localhost", 9999)

# Split lines into words
words = lines.flatMap(lambda line: line.split(" "))

# Count words in each batch
wordCounts = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Print output to console
wordCounts.pprint()

# Start streaming
ssc.start()
ssc.awaitTermination()

Explanation of Code:

  1. Streaming Context (ssc):
    • The StreamingContext is created with a batch interval of 1 second, so Spark processes newly arrived data once per second.
  2. Reading Data (socketTextStream):
    • The program listens for incoming text data on a TCP socket (here, localhost:9999), a common way to prototype real-time streams.
  3. Processing Data (flatMap and reduceByKey):
    • flatMap(lambda line: line.split(" ")) → Splits each line of text into words.
    • map(lambda word: (word, 1)) → Maps each word to a (word, 1) tuple.
    • reduceByKey(lambda a, b: a + b) → Sums the counts for each word.
  4. Printing Results (pprint):
    • pprint() prints the first ten elements of each batch's result to the console every batch interval (1 second).
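
To try this example locally, you can generate input with netcat: run nc -lk 9999 in a separate terminal and type lines of text. Each one-second batch, Spark prints the word counts for the lines received during that interval.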

3. What is Structured Streaming?

Definition

Structured Streaming is Spark's modern streaming engine. It is built on Spark SQL and the DataFrame API rather than on RDDs and DStreams.

Why Use Structured Streaming?

  • Faster and More Optimized – Uses the Catalyst optimizer and the Tungsten execution engine.
  • Easier to Use – Works with SQL and DataFrame APIs instead of raw RDDs.
  • Exactly-Once Processing – Prevents duplicate data processing.
  • Built-in Event-Time Processing – Handles late-arriving data using watermarks (see the watermark sketch after the code walkthrough below).
  • Broad Integration – Works with Kafka, HDFS, S3, JDBC, and more.

Example of Structured Streaming

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("StructuredStreamingExample").getOrCreate()

# Read a stream from a socket
lines = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()

# Split lines into words
words = lines.selectExpr("explode(split(value, ' ')) as word")

# Count words
wordCounts = words.groupBy("word").count()

# Output the result to the console
query = wordCounts.writeStream.outputMode("complete").format("console").start()

query.awaitTermination()

Explanation of Code:

  1. Reading Data (readStream):
    • Reads text data from a TCP socket on localhost:9999.
  2. Processing Data (selectExpr, groupBy, count):
    • explode(split(value, ' ')) splits each line into one row per word.
    • groupBy("word").count() counts the occurrences of each word.
  3. Outputting Data (writeStream):
    • Writes the results continuously to the console.
    • Supports three output modes:
      • Append – Emits only rows added since the last trigger.
      • Complete – Emits the full result table on every trigger (used here).
      • Update – Emits only rows whose values changed since the last trigger.
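
The event-time support mentioned earlier relies on watermarks. Below is a minimal, self-contained sketch of a watermarked windowed count. It uses Spark's built-in rate source (which emits timestamped rows) purely so the example runs without external infrastructure; the window and watermark durations are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Create a Spark session
spark = SparkSession.builder.appName("WatermarkExample").getOrCreate()

# The built-in "rate" source emits rows of (timestamp, value); it is used
# here only to keep the sketch self-contained. Substitute your real source.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Accept events arriving up to 10 minutes late, then count rows per
# 5-minute event-time window. State for windows older than the watermark
# is dropped automatically.
windowedCounts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count()
)

# "update" mode emits only the windows whose counts changed in each trigger.
query = (
    windowedCounts.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
)

query.awaitTermination()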

4. Comparison: DStream vs. Structured Streaming

| Feature | Spark Streaming (DStreams) | Structured Streaming (DataFrames) |
| --- | --- | --- |
| API | RDD-based (DStream) | DataFrame/Dataset-based |
| Processing Model | Micro-batch (fixed time) | Micro-batch & Continuous |
| Latency | Higher (~500 ms to a few seconds) | Lower (milliseconds to seconds) |
| Ease of Use | Complex (RDD transformations) | Simple (SQL and DataFrame API) |
| Optimizations | Manual tuning required | Automatic (Catalyst, Tungsten) |
| Event-Time Processing | Hard to implement | Built-in support |
| Fault Tolerance | Checkpointing, WAL | Exactly-once processing |
| Source Support | Kafka, Flume, HDFS | Kafka, HDFS, S3, Delta Lake |
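
As an illustration of the Kafka support listed in the table, here is a minimal Structured Streaming sketch for reading from a Kafka topic. The broker address localhost:9092 and the topic name events are placeholders, and the spark-sql-kafka connector package matching your Spark version must be on the classpath:

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("KafkaSourceExample").getOrCreate()

# Placeholder broker address and topic name; adjust for your environment.
kafka_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
)

# Kafka delivers key and value as binary, so cast them to strings.
messages = kafka_df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

# Write the decoded messages to the console as they arrive.
query = messages.writeStream.outputMode("append").format("console").start()

query.awaitTermination()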

5. Which One Should You Use?

| Scenario | Recommended Choice |
| --- | --- |
| New streaming projects | Structured Streaming |
| Low-latency needs | Structured Streaming |
| SQL-based queries | Structured Streaming |
| Legacy codebases | Spark Streaming (DStreams) |

Final Recommendation:

✅ Use Structured Streaming for new projects: it is faster, easier to use, and actively developed.
❌ Avoid DStreams unless you are maintaining a legacy application.


Conclusion

  • RDDs: The basic distributed data structure in Spark.
  • DStreams: A sequence of RDDs used in Spark Streaming (legacy).
  • Structured Streaming: The modern and optimized approach to real-time data streaming.

