Understanding DStream, RDD, and Structured Streaming in Apache Spark

When working with Apache Spark Streaming, we encounter terms like DStream, RDD, and Structured Streaming. Let's break down these concepts and fully understand them.

1. What is an RDD (Resilient Distributed Dataset)?

Definition

An RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster.

Key Properties of RDDs

Immutable – Once created, an RDD cannot be changed; transformations create new RDDs.
Distributed – Data is split across multiple nodes in the cluster.
Fault-Tolerant – Can recover lost data automatically using lineage.
Lazy Evaluation – Transformations are not executed immediately but only when an action is triggered.

Example of Creating an RDD in PySpark

from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("RDD...
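
The RDD properties described above (immutability, lazy evaluation, and recovery through lineage) can be illustrated without a cluster. The sketch below is a toy, pure-Python model and not Spark's implementation: the class ToyRDD and all of its methods are hypothetical names invented here for illustration, and it runs on a single machine with no partitioning.

```python
class ToyRDD:
    """Hypothetical single-machine stand-in for an RDD (NOT a Spark API)."""

    def __init__(self, source=None, parent=None, op=None, label="source"):
        # Source data is frozen into a tuple so it cannot be mutated later.
        self._source = tuple(source) if source is not None else None
        self._parent = parent   # the ToyRDD this one was derived from
        self._op = op           # the recorded, not-yet-executed step
        self.label = label

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD and only records the step.
        return ToyRDD(parent=self, op=lambda rows: (fn(x) for x in rows), label="map")

    def filter(self, pred):
        # Transformation: also lazy, also returns a new object.
        return ToyRDD(parent=self, op=lambda rows: (x for x in rows if pred(x)), label="filter")

    def lineage(self):
        # Walk back to the source to list the recorded steps in order.
        steps, node = [], self
        while node is not None:
            steps.append(node.label)
            node = node._parent
        return list(reversed(steps))

    def collect(self):
        # Action: replays the lineage from the source and returns the result.
        if self._parent is None:
            return list(self._source)
        return list(self._op(self._parent.collect()))


calls = []  # records every element the mapped function actually processes


def double(x):
    calls.append(x)
    return x * 2


numbers = ToyRDD(source=[1, 2, 3, 4, 5])
doubled_multiples_of_four = numbers.map(double).filter(lambda x: x % 4 == 0)

calls_before_action = len(calls)              # 0: no transformation has run yet
result = doubled_multiples_of_four.collect()  # [4, 8]: the action triggers the work
```

Because every ToyRDD keeps a reference to its parent and the step that produced it, calling collect() a second time recomputes the same result from the source. That is the idea behind lineage-based recovery, although real Spark recomputes only the lost partitions rather than the whole dataset.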