Posts

Showing posts from February, 2025

Mastering PySpark Column Operations

Mastering PySpark Column Operations

Apache Spark is a powerful distributed computing framework, and PySpark is its Python API, widely used for big data processing. This blog post explores various column operations in PySpark, covering basic to advanced transformations with detailed explanations and code examples.

1. Installing and Setting Up PySpark

Before working with PySpark, install it using:

```bash
pip install pyspark
```

Import the required libraries and set up a Spark session:

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, lit, upper, lower, when, count, avg, sum, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("PySparkColumnOperations").getOrCreate()
```

2. Understanding Row Objects

Basic Row Object Creation

```python
from pyspark.sql import Row

r1 = Row(1, 'divya')
print(r1)  # Output: Row(_1=1, _2='divya')
```

Since no field names are provided, PySpark assigns default field names: _1 and _2. N...
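The excerpt above is truncated just as it reaches named Row objects. As a hedged continuation sketch (not the post's own code), the snippet below shows named Rows and a couple of the column functions imported above applied to a tiny DataFrame; the sample names and values are invented for illustration.

```python
from pyspark.sql import SparkSession, Row
from pyspark.sql.functions import col, upper, when

spark = SparkSession.builder.appName("RowAndColumnSketch").getOrCreate()

# Named Row objects: fields are accessible by name instead of _1, _2
person = Row(id=1, name="divya")
print(person.id, person.name)  # 1 divya

# Build a small DataFrame from Row objects (sample data for illustration only)
df = spark.createDataFrame([Row(id=1, name="divya"), Row(id=2, name="arjun")])

# A few common column operations from the post's outline
df = (df
      .withColumn("name_upper", upper(col("name")))                           # derive a new column
      .withColumn("id_type", when(col("id") > 1, "high").otherwise("low")))   # conditional column

df.show()
spark.stop()
```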
Mastering PySpark: Top 100 Functions for Data Engineers

Apache Spark has revolutionized big data processing with its distributed computing capabilities. PySpark, the Python API for Spark, is widely used for data engineering, analysis, and machine learning. This blog covers the top 100 PySpark functions every data engineer should know, along with a practice dataset to apply these concepts.

Index of PySpark Functions

1. Basic DataFrame Operations
- Creating an Empty DataFrame
- Converting RDD to DataFrame
- Converting DataFrame to Pandas
- Displaying Data (show())
- Defining Schema (StructType & StructField)

2. Column & Row Operations
- Column Class
- Selecting Columns (select())
- Collecting Data (collect())
- Adding & Modifying Columns (withColumn())
- Renaming Columns (withColumnRenamed())
- Filtering Data (where() & filter())
- Dropping Columns & Duplicates (drop(), dropDuplicates())
- Sorting Data (orderBy(), sort())
- Grouping Data (groupBy())
- Joining Dat...
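As a quick sketch (not part of the original post), the snippet below exercises a handful of the functions from the index above on a toy DataFrame; the column names and values are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, avg

spark = SparkSession.builder.appName("Top100Sampler").getOrCreate()

# Toy dataset (hypothetical values, just to exercise the listed functions)
data = [("divya", "IT", 50000), ("ravi", "IT", 60000), ("meena", "HR", 45000)]
df = spark.createDataFrame(data, ["name", "dept", "salary"])

df.select("name", "salary").show()                                  # select()
df.withColumnRenamed("dept", "department").show()                   # withColumnRenamed()
df.filter(col("salary") > 48000).show()                             # filter() / where()
df.orderBy(col("salary").desc()).show()                             # orderBy() / sort()
df.groupBy("dept").agg(avg("salary").alias("avg_salary")).show()    # groupBy() + aggregation
df.dropDuplicates(["dept"]).show()                                  # dropDuplicates()

spark.stop()
```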

Breaking Down an End-to-End Data Pipeline into Detailed Baby Steps

I'll break down the entire process of creating this end-to-end data pipeline into detailed baby steps so you can follow along without any confusion. The pipeline will use:

- Apache NiFi for ingesting data from a REST API
- Apache Kafka for data streaming
- Apache Spark Structured Streaming for processing data
- Snowflake for storing processed data

🔧 Step 1: Set Up Apache NiFi

Purpose: To fetch data from the online server and send it to Kafka.

✅ 1.1 Download and Install NiFi
- Go to the NiFi Download Page and download nifi-1.28.1-bin.zip.
- Extract the ZIP file to C:\bigdata and navigate to C:\bigdata\nifi-1.28.1\bin.
- Run NiFi: double-click run-nifi.bat (Windows) or use the command line: nifi.bat start
- Wait for NiFi to initialize.

✅ 1.2 Access and Log In to the NiFi UI
- Open your browser and go to https://localhost:8443/nifi/
- Locate nifi-app.log for the login credentials: C:\bigdata\nifi-1.28.1\logs\nifi-app.log
- Log in using the credentials found in the log.

🔗 Step ...

Fake Apache Log Generator: A Powerful Tool for Testing and Analytics

# 🚀 Fake Apache Log Generator: A Powerful Tool for Testing and Analytics 📝

🌟 Introduction

The Fake Apache Log Generator is a Python script that generates a large number of fake Apache logs quickly and efficiently. ⚡ It is particularly useful for creating synthetic workloads for data ingestion pipelines, analytics applications, and testing environments. 🧪 This tool can output log lines to the console, log files, or directly to gzip files, providing flexibility depending on your use case. 💡

🌈 Key Features
- ✨ High-Speed Log Generation: Generate large volumes of logs rapidly.
- 🖨️ Flexible Output Options: Supports output to the console, log files (.log), or compressed gzip files (.gz).
- 🌍 Realistic Log Data: Leverages the Faker library to create realistic IP addresses, URIs, and more.
- 🛠️ Customizable Output: Allows customization of the number of lines, output file prefix, and interval between log entries.
- 🔄 Infinite Generation: Supports infinite log generation, ...
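The excerpt does not show the generator's own code, so here is a minimal sketch of the core idea only, not the actual script: using the Faker library to emit Apache-style log lines to the console, a .log file, or a .gz file. The function names, the exact log format, and the simplified interface are assumptions; the real tool exposes more options (file prefix, infinite mode, CLI flags).

```python
import gzip
import random
import time
from datetime import datetime, timezone

from faker import Faker  # pip install Faker

fake = Faker()

METHODS = ["GET", "POST", "PUT", "DELETE"]
STATUS_CODES = [200, 200, 200, 301, 404, 500]  # weighted toward 200s


def fake_log_line():
    """Build one Apache-style (combined log format) line from Faker data."""
    ts = datetime.now(timezone.utc).strftime("%d/%b/%Y:%H:%M:%S +0000")
    return (
        f'{fake.ipv4()} - - [{ts}] '
        f'"{random.choice(METHODS)} /{fake.uri_path()} HTTP/1.1" '
        f'{random.choice(STATUS_CODES)} {random.randint(200, 5000)} '
        f'"-" "{fake.user_agent()}"'
    )


def generate(num_lines=10, out_file=None, sleep_seconds=0.0):
    """Write fake log lines to the console, a .log file, or a .gz file."""
    if out_file and out_file.endswith(".gz"):
        handle = gzip.open(out_file, "wt")   # compressed output
    elif out_file:
        handle = open(out_file, "w")         # plain .log output
    else:
        handle = None                        # console output
    try:
        for _ in range(num_lines):
            line = fake_log_line()
            if handle:
                handle.write(line + "\n")
            else:
                print(line)
            if sleep_seconds:
                time.sleep(sleep_seconds)    # optional interval between entries
    finally:
        if handle:
            handle.close()


if __name__ == "__main__":
    generate(num_lines=10)  # print 10 sample lines to the console
```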

End-to-End Data Flow Pipeline using Apache NiFi, Kafka-Spark Structured Streaming, and Snowflake

🚀 End-to-End Data Flow Pipeline using Apache NiFi, Kafka-Spark Structured Streaming, and Snowflake

💬 Personal Note: 🌟 I was unwell for a while, which caused a pause in my blogging journey. However, I'm feeling much better now and back on track. From now on, I will be posting blogs consistently. Thank you all for your support! 🙏✨

🔄 Flow of Data in this Pipeline:

Server (https://randomuser.me/api/)
↓ (REST API)
Apache NiFi (InvokeHTTP Processor)
↓
Kafka (Kafka Brokers - PublishKafkaRecord_2_6 Processor)
↓
Consumer (Kafka Structured Streaming - Spark)
↓
Snowflake (Data Storage)

🌐 Project Overview:

This project demonstrates a real-time data streaming pipeline that integrates data collection, processing, and storage using industry-standard tools:
- 🌐 Data Collection: Fetched from randomuser.me using Apache NiFi's InvokeHTTP processor.
- 🏭 Streaming Data: Pushed into Kafka using PublishKafkaRecord_2_6.
- ⚡ Data Processing: Apache Spark Structured Streaming co...
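The excerpt is cut off before the processing step, so below is a minimal, hedged sketch of what the Spark consumer stage of this flow could look like: reading the NiFi-produced topic from Kafka, parsing the JSON payload, and writing each micro-batch to Snowflake via foreachBatch. The topic name, schema fields, checkpoint path, and every Snowflake option are placeholders rather than values from the post, and the job assumes the spark-sql-kafka and Snowflake Spark connector packages are on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("KafkaToSnowflake").getOrCreate()

# Minimal schema for the randomuser.me payload; the real fields depend on how NiFi
# splits and flattens the API response before publishing to Kafka
schema = StructType([
    StructField("gender", StringType()),
    StructField("email", StringType()),
    StructField("nat", StringType()),
])

# Read the NiFi-produced topic; broker address and topic name are placeholders
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "localhost:9092")
       .option("subscribe", "randomuser_topic")
       .option("startingOffsets", "latest")
       .load())

parsed = (raw.selectExpr("CAST(value AS STRING) AS json_str")
             .select(from_json(col("json_str"), schema).alias("data"))
             .select("data.*"))

# Snowflake connection options: all values here are placeholders
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

def write_to_snowflake(batch_df, batch_id):
    # foreachBatch lets each micro-batch reuse the batch Snowflake connector
    (batch_df.write
        .format("net.snowflake.spark.snowflake")
        .options(**sf_options)
        .option("dbtable", "RANDOM_USERS")
        .mode("append")
        .save())

query = (parsed.writeStream
         .foreachBatch(write_to_snowflake)
         .option("checkpointLocation", "/tmp/chk/kafka_to_snowflake")
         .start())

query.awaitTermination()
```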

Kafka Integrated with Spark Structured Streaming

Kafka Integrated with Spark Structured Streaming

🚀 Apache Kafka

Apache Kafka is an open-source data streaming platform that stores, processes, and analyzes large amounts of real-time data. It is used to build real-time data pipelines and applications that can adapt to data streams.

🔥 Event Streaming

Event streaming is the digital equivalent of the human body's central nervous system. It is the technological foundation for the "always-on" world, where businesses are increasingly software-defined and automated. Technically, event streaming involves capturing data in real time from event sources like databases, sensors, mobile devices, cloud services, and applications. It then stores, processes, and routes these events to different destinations as needed, ensuring continuous data flow.

⚡ Kafka as an Event Streaming Platform

Kafka combines three key capabilities:
- 📤 Publishing and subscribing to streams of events.
- 💾 Storing streams of events durably and reliably.
- ⏳ Proc...
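To make the publish, store, and subscribe capabilities listed above concrete, here is a small sketch of a round trip using the kafka-python client. This is an illustration alongside the post, not the post's own code: the broker address and topic name are placeholders, and the post itself uses Spark Structured Streaming as the consumer (as in the pipeline excerpt above) rather than a plain Python consumer.

```python
# A minimal publish/subscribe round trip with kafka-python (pip install kafka-python)
import json

from kafka import KafkaConsumer, KafkaProducer

BROKER = "localhost:9092"   # placeholder broker
TOPIC = "events_demo"       # placeholder topic

# Publish a few JSON events to the topic
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
for i in range(5):
    producer.send(TOPIC, {"event_id": i, "action": "click"})
producer.flush()

# Subscribe and read the events back; Kafka stores them durably, so a consumer
# started later still sees them within the topic's retention period
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",      # read from the beginning of the topic
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,          # stop iterating after 5s with no new messages
)
for message in consumer:
    print(message.topic, message.offset, message.value)
```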

Azure Key Vault: Securely Managing Secrets in ADF Pipelines

🔐 Azure Key Vault: Securely Managing Secrets in ADF Pipelines

🏗️ Introduction

Azure Key Vault (AKV) is a cloud service that securely stores secrets like API keys, passwords, and database connection strings. Integrating Azure Data Factory (ADF) pipelines with Azure Key Vault enhances security by managing sensitive information centrally. This guide provides step-by-step instructions to:

- Create an Azure Key Vault.
- Store secrets securely.
- Integrate Key Vault with ADF pipelines.
- Retrieve credentials from Key Vault within ADF.

🏛️ Step 1: Creating an Azure Key Vault
- 🔍 Search for Azure Key Vault in the Azure portal.
- ➕ Click Create Key Vault.
- Configure the settings:
  - Subscription: Free Tier
  - Resource Group: divyalearn_az
  - Key Vault Name: mysecurevault22
  - Region: Central India
  - Pricing Tier: Standard
- ✅ Click Next.
- Access configuration:
  - Permission Model: Vault Access Policy
  - Access Policy: + Create
- Create an Access Policy: Configure from Templ...
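The post configures everything through the Azure portal and the ADF UI. As a companion illustration only, here is how the same vault could be queried programmatically with the Azure Python SDK; the vault name comes from the post (mysecurevault22), while the secret name and the authentication path are assumptions.

```python
# pip install azure-identity azure-keyvault-secrets
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# Vault name taken from the post; the secret name below is hypothetical
vault_url = "https://mysecurevault22.vault.azure.net/"
secret_name = "sql-connection-string"

# DefaultAzureCredential tries environment variables, managed identity, az login, etc.
credential = DefaultAzureCredential()
client = SecretClient(vault_url=vault_url, credential=credential)

secret = client.get_secret(secret_name)
print(f"Retrieved secret '{secret.name}' (value hidden, length={len(secret.value)})")
```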

Azure Data Factory: Copying Data from ADLS to MSSQL

🚀 Azure Data Factory: Copying Data from ADLS to MSSQL

🔑 Key Vault & Initial Setup

To copy data from Azure Data Lake Storage (ADLS) to Microsoft SQL Server (MSSQL) using Azure Data Factory (ADF), follow these structured steps.

🌐 Step 1: Creating Azure Data Factory (ADF) & Linking Services

🏗️ Creating an Azure Data Factory
- 🔍 Search for "Data Factory" in the Azure portal.
- ➕ Click "Create Data Factory".
- 📦 Select the Free Tier.
- 📂 Attach the resource group: rock2025.
- 🏷️ Provide a name, e.g., learnazure (lowercase, no special characters or numbers).
- ✅ Click "Review + Create" and proceed with deployment.

🔗 Linking Azure Data Lake Storage to Data Factory
- Navigate to Azure Data Factory → Launch ADF Studio.
- In the Manage section, click Linked Services → + New Linked Service.
- Select Azure Data Lake Storage Gen2.
- Provide a name, e.g., LS_FEB13.
- Choose the subscription (Free Tier).
- Select the Storage Account (e.g., divyabucket02...

Getting Started with Azure: Creating an Account, Subscription, Resource Group, Storage Account, and Linked Services

🚀 Getting Started with Azure: Creating an Account, Subscription, Resource Group, and Storage Account

Azure is a powerful cloud platform that provides a range of services for managing, storing, and processing data. Before utilizing Azure services, you need to follow four essential steps:

1. 🆕 Create an Azure Account
2. 📝 Set Up a Subscription
3. 📦 Create a Resource Group
4. 🗂️ Use Resources (e.g., Storage Account)

🛠️ Step 1: Creating a Resource Group in Azure

A Resource Group in Azure is a logical container that holds related Azure resources, allowing for easy management, organization, and access control. Any resource in Azure must belong to a resource group.

📌 How to Create a Resource Group?
- 🔍 Search for "Resource Group" in the Azure search bar.
- 🏢 Click "Create Resource Group" and provide a name.
- ✅ Click "Create" to finalize the resource group.

Alternative Method: While creating a Storage Account, you can create a new Resource Group...

How to Create an Azure Free Account in 2025

🌟 How to Create an Azure Free Account in 2025 🌟

📝 Introduction

If you are looking to create an Azure free account, you might face an issue where Indian phone numbers are not accepted for verification. This can be a roadblock, but there is a simple workaround: create a new Microsoft account using an email like abcd@outlook.com. Once the new Microsoft account is created, you can proceed with the Azure free account registration. In this blog, we will guide you step by step to create an Azure free account successfully. 🚀

🔹 Step 1: Create a New Microsoft Account

Since Indian numbers are not being accepted, follow these steps to create a new Microsoft account:
- Go to the 🌍 Microsoft Signup Page.
- Click 🆕 Create one! under the sign-in box.
- Enter a new email address, preferably an Outlook email (e.g., abcd@outlook.com). You can also choose Get a new email address if you don't have one.
- Set a strong password and click Next.
- Enter your First Name and Last ...
Here is your Lambda function with detailed line-by-line explanations using comments:

```python
import json   # Import the JSON module for handling event data (optional)
import boto3  # Import the AWS SDK for Python (Boto3) to interact with AWS services

# Create an AWS Glue client to start Glue jobs
glue_client = boto3.client('glue')

def lambda_handler(event, context):
    """
    AWS Lambda function triggered when a new file is uploaded to an S3 bucket.
    It extracts the bucket name and file name, then starts an AWS Glue job.
    """
    for record in event['Records']:  # Loop through each record in the event
        # Extract the S3 bucket name where the file was uploaded
        bucket_name = record['s3']['bucket']['name']

        # Extract the file name (key) of the uploaded file
        object_key = record['s3']['object']['key']

        # Print the file path for logging/debugging
        print...
```
This is one of the important concepts where we will see how an end-to-end pipeline works in AWS. 🚀 We are going to see how to continuously monitor a common source like S3/Redshift from Lambda (using Boto3 code) and initiate a trigger to start a Glue job (Spark code) and perform some action. 💡

Let's assume that:
1️⃣ AWS Lambda should trigger another AWS service, Glue, as soon as a file gets uploaded to an AWS S3 bucket.
2️⃣ Lambda should pass this file information to Glue so that the Glue job can perform some transformation and upload the transformed data into AWS RDS (MySQL).

🛠️ Understanding the Flow Chart 📊
1️⃣ A client uploads files (.csv/.json) into an AWS storage location (e.g., S3).
2️⃣ Once uploaded, a trigger is initiated in AWS Lambda using Boto3 code.
3️⃣ AWS Glue (ETL tool) starts a PySpark job to process this file, perform transformations, and load it into AWS RDS (MySQL).
4️⃣ When a file is uploa...
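On the Lambda side of this flow, the handler shown earlier would typically finish by calling glue_client.start_job_run(JobName=..., Arguments={...}) to hand the bucket and key over to Glue. On the Glue side, a hedged sketch of the receiving job is below. It is not the post's code: the argument names, table name, and RDS endpoint are placeholders, the transformation is a trivial example, and a real Glue script often uses GlueContext rather than a bare SparkSession (the MySQL JDBC driver must also be available to the job).

```python
import sys

from awsglue.utils import getResolvedOptions
from pyspark.sql import SparkSession

# Job arguments passed by the Lambda trigger via start_job_run(..., Arguments={...});
# the argument names here are assumptions, not taken from the original post
args = getResolvedOptions(sys.argv, ["JOB_NAME", "bucket_name", "object_key"])

spark = SparkSession.builder.appName(args["JOB_NAME"]).getOrCreate()

# Read the file that triggered the pipeline
input_path = f"s3://{args['bucket_name']}/{args['object_key']}"
df = spark.read.option("header", True).csv(input_path)

# Example transformation: drop fully empty rows
cleaned = df.dropna(how="all")

# Write to AWS RDS (MySQL) over JDBC; endpoint, database, and credentials are placeholders
(cleaned.write
    .format("jdbc")
    .option("url", "jdbc:mysql://<rds-endpoint>:3306/<database>")
    .option("dbtable", "processed_data")
    .option("user", "<user>")
    .option("password", "<password>")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .mode("append")
    .save())
```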
1. RDD (Resilient Distributed Dataset)

What is it? RDD is a fundamental data structure in Spark that represents a distributed collection of data. The data is split across multiple machines (distributed), and it's immutable (you can't change it after it's created). You can only transform it or perform actions on it.

Real-life example: Imagine you have a large library of books, and the books are spread across different shelves (distributed). Each shelf contains different genres (data). Once the books are placed on the shelves, you can't change them, but you can:
- Sort them (transformation)
- Count how many books you have (action)
- Filter out books of a certain genre (transformation)

Why is it important? RDD allows you to process large amounts of data efficiently. It's fault-tolerant, meaning if one machine fails, it can recover the lost data using the original instructions.

2. DStream (Discretized Stream)

What is it? A DStream is an abstraction that represents a...
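To make the library analogy above concrete before moving on to DStreams, here is a small sketch (not from the post) that parallelizes a list of books and applies the same sort, count, and filter ideas; the titles and genres are invented sample data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("LibraryRDD").getOrCreate()
sc = spark.sparkContext

# The "library": (title, genre) pairs spread across partitions (sample data for illustration)
books = [("Dune", "sci-fi"), ("Emma", "classic"), ("Neuromancer", "sci-fi"), ("Hamlet", "classic")]
library_rdd = sc.parallelize(books)

# Transformations build new RDDs; the original stays unchanged (immutability)
scifi_rdd = library_rdd.filter(lambda book: book[1] == "sci-fi")
sorted_rdd = library_rdd.sortBy(lambda book: book[0])

# Actions actually trigger computation (lazy evaluation)
print("Total books:", library_rdd.count())       # count (action)
print("Sci-fi books:", scifi_rdd.collect())      # filter (transformation) + collect (action)
print("Sorted:", sorted_rdd.collect())           # sortBy (transformation) + collect (action)

spark.stop()
```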

Understanding DStream, RDD, and Structured Streaming in Apache Spark

Understanding DStream, RDD, and Structured Streaming in Apache Spark

When working with Apache Spark Streaming, we encounter terms like DStream, RDD, and Structured Streaming. Let's break down these concepts and fully understand them.

1. What is an RDD (Resilient Distributed Dataset)?

Definition: An RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster.

Key Properties of RDDs
- Immutable – Once created, an RDD cannot be changed; transformations create new RDDs.
- Distributed – Data is split across multiple nodes in the cluster.
- Fault-Tolerant – Can recover lost data automatically using lineage.
- Lazy Evaluation – Transformations are not executed immediately but only when an action is triggered.

Example of Creating an RDD in PySpark

```python
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("RDD...
```
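The excerpt is cut off at the RDD example. Since the post also contrasts RDDs and DStreams with Structured Streaming, here is a separate, hedged sketch of a Structured Streaming query using Spark's built-in rate source, so it runs without Kafka or a socket server; the window length and output mode are illustrative choices, not values from the original post.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("StructuredStreamingSketch").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows, handy for demos
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 5)
          .load())

# Treat the stream as an unbounded table: count events per 10-second window
counts = stream.groupBy(window("timestamp", "10 seconds")).count()

query = (counts.writeStream
         .outputMode("complete")   # windows keep updating, so emit the full result table
         .format("console")
         .start())

query.awaitTermination()
```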