
Data Nexus

1.15


March 10, 2025



🌐Filtering and Copying Files Dynamically in Azure Data Factory (ADF)

February 28, 2025

Introduction
In Azure Data Factory (ADF), automating file processing is a common use case. This guide demonstrates how to:
Retrieve file metadata (including filenames inside a folder).
Filter files based on today’s date.
Apply advanced filter conditions such as contains(), startswith(), and endswith().
Copy only filtered files dynamically using ForEach and Copy Data activities.

Step 1: Set Variable for Today's Date
The first step is to create a Set Variable activity to store today’s date in ddMMyyyy format.
Activity: Set Variable
Variable Name: dt
Value: @formatDateTime(utcNow(), 'ddMMyyyy')

Explanation
utcNow() fetches the current UTC time.
formatDateTime(utcNow(), 'ddMMyyyy') converts it into a day-month-year format.

Example Output:
{ "name": "dt", "value": "28022025" }

Step 2: Get Metadata of Folder Contents ...
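To get a feel for what Step 1 and the filter conditions do, here is a minimal Python sketch that mirrors the same logic outside ADF. It is only an analogy, not ADF expression syntax: the helper names `today_stamp` and `filter_files` and the sample file names are assumptions made up for illustration.

```python
from datetime import datetime, timezone


def today_stamp(now=None):
    """Return the date as ddMMyyyy, similar to formatDateTime(utcNow(), 'ddMMyyyy')."""
    now = now or datetime.now(timezone.utc)
    return now.strftime("%d%m%Y")


def filter_files(names, stamp, prefix="", suffix=""):
    """Keep names that contain the stamp and match the optional prefix and suffix."""
    return [
        name
        for name in names
        if stamp in name and name.startswith(prefix) and name.endswith(suffix)
    ]


# Hypothetical file names, used only to exercise the filter
files = ["sales_28022025.csv", "sales_27022025.csv", "log_28022025.txt"]

# Fix the date so the result is reproducible
stamp = today_stamp(datetime(2025, 2, 28, tzinfo=timezone.utc))
print(stamp)                                                   # 28022025
print(filter_files(files, stamp, prefix="sales_", suffix=".csv"))  # ['sales_28022025.csv']
```

In ADF itself the same idea would be expressed with the expression functions named in the excerpt; the sketch only shows how a ddMMyyyy stamp combines with contains, startswith, and endswith style checks.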

🔥Understanding DStream, RDD, and Structured Streaming in Apache Spark

February 05, 2025

When working with Apache Spark Streaming, we encounter terms like DStream, RDD, and Structured Streaming. Let's break down these concepts and fully understand them.

1. What is an RDD (Resilient Distributed Dataset)?

Definition
An RDD (Resilient Distributed Dataset) is the fundamental data structure in Apache Spark. It is an immutable, distributed collection of objects that can be processed in parallel across a cluster.

Key Properties of RDDs
Immutable – Once created, an RDD cannot be changed; transformations create new RDDs.
Distributed – Data is split across multiple nodes in the cluster.
Fault-Tolerant – Can recover lost data automatically using lineage.
Lazy Evaluation – Transformations are not executed immediately but only when an action is triggered.

Example of Creating an RDD in PySpark
from pyspark.sql import SparkSession
# Create a Spark session
spark = SparkSession.builder.appName("RDD...
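The immutability, lineage, and lazy evaluation properties listed above can be illustrated without a cluster. The sketch below is a toy, pure-Python stand-in and not Spark's API: `ToyRDD` and its methods are hypothetical names chosen to resemble RDD operations, and it runs on a single machine with no distribution.

```python
class ToyRDD:
    """Toy stand-in for an RDD: an immutable source plus a recorded lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # immutable snapshot of the input
        self._lineage = tuple(lineage)  # transformations to replay, in order

    def map(self, fn):
        # Transformation: returns a NEW ToyRDD, nothing is computed yet
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also lazy, only extends the lineage
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the recorded lineage over the source data
        data = iter(self._source)
        for kind, fn in self._lineage:
            data = map(fn, data) if kind == "map" else filter(fn, data)
        return list(data)


calls = []  # records when the mapped function actually runs


def square(x):
    calls.append(x)
    return x * x


base = ToyRDD([1, 2, 3, 4])
evens = base.map(square).filter(lambda v: v % 2 == 0)

print(calls)            # [] because no action has been triggered yet
print(evens.collect())  # [4, 16]
print(base.collect())   # [1, 2, 3, 4] since the original is unchanged
```

Because each object keeps its source and its list of transformations, a lost result can be rebuilt by replaying them, which is the intuition behind lineage-based recovery.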

🚀End-to-End Data Flow Pipeline using Apache NiFi, Kafka-Spark Structured Streaming, and Snowflake

February 20, 2025

💬 Personal Note:
🌟 I was unwell for a while, which caused a pause in my blogging journey. However, I’m feeling much better now and back on track. From now on, I will be posting blogs consistently. Thank you all for your support! 🙏✨

🔄 Flow of Data in this Pipeline:
Server (https://randomuser.me/api/)
↓ (REST API)
Apache NiFi (InvokeHTTP Processor)
↓
Kafka (Kafka Brokers - PublishKafkaRecord_2_6 Processor)
↓
Consumer (Kafka Structured Streaming - Spark)
↓
Snowflake (Data Storage)

🌐 Project Overview:
This project demonstrates a real-time data streaming pipeline that integrates data collection, processing, and storage using industry-standard tools:
🌐 Data Collection: Fetched from randomuser.me using Apache NiFi’s InvokeHTTP processor.
🏭 Streaming Data: Pushed into Kafka using PublishKafkaRecord_2_6.
⚡ Data Processing: Apache Spark Structured Streaming co...
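To make the shape of the data moving through this pipeline more concrete, here is a small Python sketch that flattens one randomuser-style record and encodes it as a JSON message value. This is an illustration under assumptions, not the post's actual NiFi or Spark logic: the choice of fields, the helper names `flatten_user` and `to_kafka_value`, and the sample record (which is made up, not real API output) are all hypothetical.

```python
import json


def flatten_user(payload):
    """Flatten the first entry of a randomuser-style response into one flat dict."""
    user = payload["results"][0]
    return {
        "first_name": user["name"]["first"],
        "last_name": user["name"]["last"],
        "email": user["email"],
        "country": user["location"]["country"],
    }


def to_kafka_value(record):
    """Encode a flat record as UTF-8 JSON bytes, a common Kafka message format."""
    return json.dumps(record, sort_keys=True).encode("utf-8")


# Made-up sample shaped like a randomuser.me response (illustrative only)
sample = {
    "results": [
        {
            "name": {"title": "Ms", "first": "Asha", "last": "Rao"},
            "email": "asha.rao@example.com",
            "location": {"city": "Pune", "country": "India"},
        }
    ]
}

record = flatten_user(sample)
message = to_kafka_value(record)
print(record)
print(json.loads(message) == record)  # True: the message round-trips cleanly
```

A flat record like this maps naturally onto table columns downstream, which is one reason to flatten nested JSON somewhere before the storage step.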

Labels

⚡PySpark
🌐Azure
🐍python history
🐍python interview guide
🐍python learing path
🐍python string
🐍python string short answer
🐍python_cheat_tabel
🐍python_list
🐍python_list_shortanswer

🐘Hadoop
🐝Hive
📊SQL
📝Jupyter Notebook
🔶Databricks
🖥️☁️aws
🦘kafka
ec2
nifisnowflake
structured streaming
website
