Posts

Mastering PySpark Column Operations

Mastering PySpark Column Operations

Apache Spark is a powerful distributed computing framework, and PySpark is its Python API, widely used for big data processing. This blog post explores various column operations in PySpark, covering basic to advanced transformations with detailed explanations and code examples.

1. Installing and Setting Up PySpark

Before working with PySpark, install it using:

pip install pyspark

Import the required libraries and set up a Spark session:

from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, lit, upper, lower, when, count, avg, sum, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("PySparkColumnOperations").getOrCreate()

2. Understanding Row Objects

Basic Row Object Creation

from pyspark.sql import Row

r1 = Row(1, 'divya')
print(r1)  # Output: <Row(1, 'divya')>

Since no field names are provided, PySpark assigns the default field names _1 and _2 when such a Row is converted to a DataFrame. N...
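
To round out the Row example above, here is a small sketch that combines named Row fields with a couple of the column functions imported earlier (upper, when, withColumn). The column names and the tiny dataset are illustrative only, not taken from the post.

```python
# Sketch: named Row fields plus a few column operations using the imports above.
# Column and field names here are illustrative.
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, upper, when

spark = SparkSession.builder.appName("PySparkColumnOperations").getOrCreate()

# Rows with explicit field names become DataFrame columns
people = [Row(id=1, name="divya", salary=50000),
          Row(id=2, name="arjun", salary=72000)]
df = spark.createDataFrame(people)

result = (df
          .withColumn("name_upper", upper(col("name")))                 # derive a new column
          .withColumn("band", when(col("salary") > 60000, "senior")     # conditional column
                              .otherwise("junior")))

result.show()
```
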
Mastering PySpark: Top 100 Functions for Data Engineers

Apache Spark has revolutionized big data processing with its distributed computing capabilities. PySpark, the Python API for Spark, is widely used for data engineering, analysis, and machine learning. This blog will cover the top 100 PySpark functions every data engineer should know, along with a practice dataset to apply these concepts.

Index of PySpark Functions

1. Basic DataFrame Operations
   Creating an Empty DataFrame
   Converting RDD to DataFrame
   Converting DataFrame to Pandas
   Displaying Data (show())
   Defining Schema (StructType & StructField)

2. Column & Row Operations
   Column Class
   Selecting Columns (select())
   Collecting Data (collect())
   Adding & Modifying Columns (withColumn())
   Renaming Columns (withColumnRenamed())
   Filtering Data (where() & filter())
   Dropping Columns & Duplicates (drop(), dropDuplicates())
   Sorting Data (orderBy(), sort())
   Grouping Data (groupBy())
   Joining Dat...
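As a taste of how a few of the indexed functions fit together, here is a short sketch combining select(), filter(), join, groupBy(), and orderBy(). The tiny in-memory dataset and column names are made up for illustration; they are not the practice dataset mentioned above.

```python
# Sketch combining a handful of the indexed operations:
# filter(), join, groupBy(), agg(), orderBy(), and select().
# The small in-memory dataset is illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col

spark = SparkSession.builder.appName("Top100Functions").getOrCreate()

emp = spark.createDataFrame(
    [(1, "divya", "IT", 50000), (2, "arjun", "HR", 45000), (3, "meera", "IT", 65000)],
    ["emp_id", "name", "dept", "salary"])
dept = spark.createDataFrame([("IT", "Bengaluru"), ("HR", "Pune")], ["dept", "city"])

summary = (emp
           .filter(col("salary") > 40000)            # filter rows
           .join(dept, on="dept", how="inner")       # join with another DataFrame
           .groupBy("dept", "city")                  # group and aggregate
           .agg(avg("salary").alias("avg_salary"))
           .orderBy(col("avg_salary").desc()))       # sort results

summary.select("dept", "city", "avg_salary").show()
```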

Breaking Down an End-to-End Data Pipeline into Detailed Baby Steps

I'll break down the entire process of creating this end-to-end data pipeline into detailed baby steps so you can follow along without any confusion. The pipeline will use:

Apache NiFi for ingesting data from a REST API
Apache Kafka for data streaming
Apache Spark Structured Streaming for processing data
Snowflake for storing processed data

🔧 Step 1: Set Up Apache NiFi

Purpose: To fetch data from the online server and send it to Kafka.

✅ 1.1 Download and Install NiFi

Go to the NiFi Download Page
Download nifi-1.28.1-bin.zip
Extract the ZIP file to: C:\bigdata
Navigate to: C:\bigdata\nifi-1.28.1\bin
Run NiFi: Double-click run-nifi.bat (Windows) or use the command line: nifi.bat start
Wait for NiFi to initialize.

✅ 1.2 Access and Log In to the NiFi UI

Open your browser and go to: https://localhost:8443/nifi/
Locate nifi-app.log for the login credentials: C:\bigdata\nifi-1.28.1\logs\nifi-app.log
Log in using the credentials found in the log.

🔗 Step ...
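Before wiring up NiFi, it can help to preview the payload that the InvokeHTTP processor will fetch from the source endpoint. The sketch below is only a sanity check in plain Python using the requests package (a tooling choice for this sketch, not part of the pipeline itself); the field names printed follow the publicly documented randomuser.me response shape.

```python
# Sanity-check sketch: preview the payload NiFi's InvokeHTTP processor will fetch.
# Requires: pip install requests
import requests

response = requests.get("https://randomuser.me/api/", timeout=10)
response.raise_for_status()

payload = response.json()          # {"results": [...], "info": {...}}
user = payload["results"][0]       # one random user per default request

print(user["name"]["first"], user["name"]["last"])
print(user["email"])
print(user["location"]["country"])
```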

Fake Apache Log Generator: A Powerful Tool for Testing and Analytics

# 🚀 Fake Apache Log Generator: A Powerful Tool for Testing and Analytics

🌟 Introduction

The Fake Apache Log Generator is a Python script that generates a large number of fake Apache logs quickly and efficiently. ⚡ It is particularly useful for creating synthetic workloads for data ingestion pipelines, analytics applications, and testing environments. 🧪 This tool can output log lines to the console, log files, or directly to gzip files, providing flexibility depending on your use case. 💡

🌈 Key Features

✨ High-Speed Log Generation: Generate large volumes of logs rapidly.
🖨️ Flexible Output Options: Supports output to the console, log files (.log), or compressed gzip files (.gz).
Realistic Log Data: Leverages the Faker library to create realistic IP addresses, URIs, and more.
🛠️ Customizable Output: Allows customization of the number of lines, output file prefix, and interval between log entries.
🔄 Infinite Generation: Supports infinite log generation, ...
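The generator's own script isn't reproduced in this excerpt, but the core idea can be sketched in a few lines with the Faker library. This is a rough illustration under that assumption, not the tool's actual code; the output format is the standard Apache combined log layout.

```python
# Rough sketch of the idea behind a fake Apache log generator (not the tool's actual code).
# Requires: pip install Faker
import random
from datetime import datetime, timezone

from faker import Faker

fake = Faker()

def fake_log_line() -> str:
    """Build one line in Apache combined log format from Faker-generated values."""
    ip = fake.ipv4()
    timestamp = datetime.now(timezone.utc).strftime("%d/%b/%Y:%H:%M:%S %z")
    method = random.choice(["GET", "POST", "PUT", "DELETE"])
    path = fake.uri_path()
    status = random.choice([200, 301, 404, 500])
    size = random.randint(200, 5000)
    referer = fake.uri()
    user_agent = fake.user_agent()
    return (f'{ip} - - [{timestamp}] "{method} /{path} HTTP/1.1" '
            f'{status} {size} "{referer}" "{user_agent}"')

if __name__ == "__main__":
    for _ in range(5):          # write to the console; redirect or gzip as needed
        print(fake_log_line())
```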

End-to-End Data Flow Pipeline using Apache NiFi, Kafka-Spark Structured Streaming, and Snowflake

🚀 End-to-End Data Flow Pipeline using Apache NiFi, Kafka-Spark Structured Streaming, and Snowflake

💬 Personal Note: 🌟 I was unwell for a while, which caused a pause in my blogging journey. However, I'm feeling much better now and back on track. From now on, I will be posting blogs consistently. Thank you all for your support! ✨

🔄 Flow of Data in this Pipeline:

Server (https://randomuser.me/api/)
  ↓ (REST API)
Apache NiFi (InvokeHTTP Processor)
  ↓
Kafka (Kafka Brokers - PublishKafkaRecord_2_6 Processor)
  ↓
Consumer (Kafka Structured Streaming - Spark)
  ↓
Snowflake (Data Storage)

Project Overview:

This project demonstrates a real-time data streaming pipeline that integrates data collection, processing, and storage using industry-standard tools:

Data Collection: Fetched from randomuser.me using Apache NiFi's InvokeHTTP processor.
Streaming Data: Pushed into Kafka using PublishKafkaRecord_2_6.
⚡ Data Processing: Apache Spark Structured Streaming co...
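The Spark consumer stage of this pipeline typically looks something like the sketch below: a minimal Structured Streaming job that subscribes to the Kafka topic and prints incoming records to the console. The broker address, topic name, and connector version here are placeholders, not necessarily the values used in the post.

```python
# Minimal sketch of the Spark consumer stage, assuming Kafka at localhost:9092
# and a topic named "users" (placeholders; substitute your own values).
# Run with: spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.0 consumer.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("KafkaToConsole")
         .getOrCreate())

# Subscribe to the Kafka topic that NiFi publishes to
raw_stream = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "users")
              .option("startingOffsets", "latest")
              .load())

# Kafka delivers key/value as binary; cast the value to a readable string
messages = raw_stream.selectExpr("CAST(value AS STRING) AS json_value")

query = (messages.writeStream
         .format("console")
         .option("truncate", "false")
         .start())

query.awaitTermination()
```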

Kafka Integrated with Spark Structured Streaming

Kafka Integrated with Spark Structured Streaming

🚀 Apache Kafka

Apache Kafka is an open-source data streaming platform that stores, processes, and analyzes large amounts of real-time data. It is used to build real-time data pipelines and applications that can adapt to data streams.

🔥 Event Streaming

Event streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation for the "always-on" world, where businesses are increasingly software-defined and automated. Technically, event streaming involves capturing data in real-time from event sources like databases, sensors, mobile devices, cloud services, and applications. It then stores, processes, and routes these events to different destinations as needed, ensuring continuous data flow.

⚡ Kafka as an Event Streaming Platform

Kafka combines three key capabilities:

📤 Publishing and subscribing to streams of events.
💾 Storing streams of events durably and reliably.
⏳ Proc...
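The publish/subscribe capability listed above can be illustrated with a tiny standalone sketch. It uses the kafka-python client, which is a library choice made for this example rather than one prescribed by the post, and the broker address and topic name are placeholders.

```python
# Tiny illustration of Kafka's publish/subscribe capability, using the
# kafka-python client (a library choice for this sketch, not prescribed by the post).
# Broker address and topic name are placeholders.
# Requires: pip install kafka-python
from kafka import KafkaConsumer, KafkaProducer

# Publish a few events to a topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("demo-events", f"event-{i}".encode("utf-8"))
producer.flush()

# Subscribe and read the events back
consumer = KafkaConsumer(
    "demo-events",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,       # stop iterating after 5s without messages
)
for message in consumer:
    print(message.topic, message.offset, message.value.decode("utf-8"))
```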

Azure Key Vault: Securely Managing Secrets in ADF Pipelines

Azure Key Vault: Securely Managing Secrets in ADF Pipelines

Introduction

Azure Key Vault (AKV) is a cloud service that securely stores secrets like API keys, passwords, and database connection strings. Integrating Azure Data Factory (ADF) pipelines with Azure Key Vault ensures enhanced security by managing sensitive information centrally. This guide provides step-by-step instructions to:

Create an Azure Key Vault.
Store secrets securely.
Integrate Key Vault with ADF pipelines.
Retrieve credentials from Key Vault within ADF.

Step 1: Creating Azure Key Vault

Search for Azure Key Vault in the Azure portal.
➕ Click Create Key Vault.
Configure the settings:
  Subscription: Free Tier
  Resource Group: divyalearn_az
  Key Vault Name: mysecurevault22
  Region: Central India
  Pricing Tier: Standard
✅ Click Next.
Access configuration:
  Permission Model: Vault Access Policy
  Access Policy: + Create
Create An Access Policy: Configure from Templ...
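ADF itself reads these secrets through a Key Vault linked service, but for a quick check outside ADF the same secret can be fetched programmatically with the Azure SDK for Python. The sketch below assumes the azure-identity and azure-keyvault-secrets packages and an authenticated session; the vault name comes from the steps above, while the secret name is a hypothetical example.

```python
# Sketch: read a secret from the vault created above using the Azure SDK for Python.
# Requires: pip install azure-identity azure-keyvault-secrets
# Assumes you are signed in (e.g. via `az login`) or running with a managed identity.
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

vault_url = "https://mysecurevault22.vault.azure.net"   # vault name from the steps above
secret_name = "sql-connection-string"                   # hypothetical secret name

credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)

secret = client.get_secret(secret_name)
print(f"Retrieved secret '{secret.name}' (value hidden, length={len(secret.value)})")
```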