🌐From S3 to ADLS Gen2: A Step-by-Step ETL Data Flow and Pipeline Execution

 


Introduction

When working with large-scale data processing, ensuring smooth data flow and being able to debug on the fly are crucial. This guide explains how to set up a SQL-like data pipeline in Azure Data Factory (ADF) with debugging enabled, so you can watch live data as it moves through each ETL stage.

Step 1: Creating Datasets

Creating a Dataset in ADLS Gen2

  1. Dataset Name: ds_copy_one_by_one
  2. Storage Type: ADLS Gen2 (Delimited Text)
  3. Linked Service Name: ls_adls
  4. Storage Account Access Tier: Hot
  5. Storage Account Name: divya22
  6. File Path (Input): asl.csv
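
If you would rather script this step than click through ADF Studio, the azure-mgmt-datafactory Python SDK can register the same dataset. The following is a minimal sketch, assuming azure-identity authentication; the subscription, resource group, factory, and container names are placeholders, while the dataset name, linked service, and file path come from the settings above.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        DatasetResource, DelimitedTextDataset,
        AzureBlobFSLocation, LinkedServiceReference,
    )

    # Placeholder scope values -- replace with your own.
    subscription_id = "<subscription-id>"
    resource_group = "<resource-group>"
    factory_name = "<data-factory-name>"

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Delimited-text dataset on ADLS Gen2 pointing at asl.csv via ls_adls.
    adls_dataset = DatasetResource(
        properties=DelimitedTextDataset(
            linked_service_name=LinkedServiceReference(
                type="LinkedServiceReference", reference_name="ls_adls"
            ),
            location=AzureBlobFSLocation(
                file_system="<container-name>",  # assumed container name
                file_name="asl.csv",
            ),
            column_delimiter=",",
            first_row_as_header=True,
        )
    )
    adf_client.datasets.create_or_update(
        resource_group, factory_name, "ds_copy_one_by_one", adls_dataset
    )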

Creating a Dataset from S3

  1. Dataset Name: ds_s3_nep
  2. Dataset Type: Delimited Text
  3. Linked Service Name: ls_awss3
  4. File Path: Bucket Name and Path
  5. Authentication Type: Access Key (this and the following settings are configured on the ls_awss3 linked service)
  6. Access Key ID: Your AWS Access Key ID
  7. Secret Access Key: Your AWS Secret Access Key
  8. Service URL: the S3 service endpoint (for example, https://s3.amazonaws.com)
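
A similar sketch covers the S3 side: it registers the ls_awss3 linked service with access-key authentication and then the ds_s3_nep dataset. The bucket, folder, file, and credential values are placeholders, and the client setup mirrors the previous sketch.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, AmazonS3LinkedService, SecureString,
        DatasetResource, DelimitedTextDataset,
        AmazonS3Location, LinkedServiceReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    resource_group, factory_name = "<resource-group>", "<data-factory-name>"

    # Linked service ls_awss3 with access-key authentication (placeholder keys).
    s3_linked_service = LinkedServiceResource(
        properties=AmazonS3LinkedService(
            authentication_type="AccessKey",
            access_key_id="<aws-access-key-id>",
            secret_access_key=SecureString(value="<aws-secret-access-key>"),
            service_url="https://s3.amazonaws.com",  # S3 service endpoint
        )
    )
    adf_client.linked_services.create_or_update(
        resource_group, factory_name, "ls_awss3", s3_linked_service
    )

    # Delimited-text dataset ds_s3_nep reading from the bucket (names assumed).
    s3_dataset = DatasetResource(
        properties=DelimitedTextDataset(
            linked_service_name=LinkedServiceReference(
                type="LinkedServiceReference", reference_name="ls_awss3"
            ),
            location=AmazonS3Location(
                bucket_name="<bucket-name>",
                folder_path="<folder-path>",
                file_name="<file-name>.csv",
            ),
            column_delimiter=",",
            first_row_as_header=True,
        )
    )
    adf_client.datasets.create_or_update(
        resource_group, factory_name, "ds_s3_nep", s3_dataset
    )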

Step 2: Testing Connections

  • Navigate to Datasets → Select ds_s3_nep and ds_copy_one_by_one
  • Test Connection for both datasets to verify access

Step 3: Creating an ETL Data Flow

  1. Data Flow Name: s3dataflow
  2. Add Data Flow: Click the + symbol and choose Dataflow

Step 3.1: Adding Sources

  • Source 1 (S3 Dataset):

    • Output Stream Name: s3pathsources
    • Dataset: ds_s3_nep
    • Test Source Connection
  • Source 2 (ADLS Gen2 Dataset):

    • Output Stream Name: asl_Asls
    • Dataset: ds_copy_one_by_one
    • Allow schema drift (drifted column detection): Enabled
    • Infer Schema Projection: Enabled
    • Data Preview: On
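
Behind the UI, ADF stores a mapping data flow as a Data Flow Script. The two sources above correspond roughly to the fragment below (the exact script ADF generates may differ); allowSchemaDrift reflects the drifted-column option, and leaving the projection out lets the schema be inferred at run time.

    # Approximate Data Flow Script fragment for the two sources above,
    # kept as a Python string so it can be reused when the flow is assembled.
    SOURCE_SCRIPT = """
    source(allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false) ~> s3pathsources
    source(allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false) ~> asl_Asls
    """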

Step 4: Joining Two Sources

  1. Click the + symbol and select Join.
  2. Join Settings:
    • Join Name: join1
    • Left Stream: s3pathsources
    • Right Stream: asl_Asls
    • Join Type: Inner, Left, or Right Join
    • Join Condition: name == name (match rows on the name column from both streams)
    • Enable Data Preview
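
In Data Flow Script terms, this join corresponds roughly to the fragment below; the stream@column syntax picks the name column from each incoming stream, and joinType:'inner' would change accordingly for a left or right join.

    # Approximate Data Flow Script fragment for join1: an inner join that
    # matches the name column from both incoming streams.
    JOIN_SCRIPT = """
    s3pathsources, asl_Asls join(s3pathsources@name == asl_Asls@name,
        joinType:'inner',
        broadcast: 'auto') ~> join1
    """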

Step 5: Adding a Filter Transformation

  1. Click the + symbol and select Filter.
  2. Filter Settings:
    • Incoming Stream: join1
    • Condition: age > 18
    • The condition must evaluate to a Boolean expression
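
The filter becomes a one-line Data Flow Script fragment, sketched below; the transformation name filter1 is assumed, since the guide does not name it.

    # Approximate Data Flow Script fragment for the filter: only rows whose
    # Boolean expression age > 18 evaluates to true continue downstream.
    FILTER_SCRIPT = """
    join1 filter(age > 18) ~> filter1
    """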

Step 6: Writing Data to ADLS Gen2 (Sink)

  1. Sink Dataset: ds_sinkadls
  2. File Path (Output): joinresult.csv
  3. Validate Schema Before Execution
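
Putting the pieces together, the complete data flow could be registered through the SDK roughly as follows. The sink stream name sink1 and the exact script text are assumptions; ds_sinkadls is assumed to be an existing delimited-text dataset pointing at joinresult.csv, and validateSchema: true on the sink mirrors the schema-validation setting above.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        DataFlowResource, MappingDataFlow, DataFlowSource,
        DataFlowSink, Transformation, DatasetReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    resource_group, factory_name = "<resource-group>", "<data-factory-name>"

    def ds_ref(name):
        # Dataset reference helper used by the sources and the sink.
        return DatasetReference(type="DatasetReference", reference_name=name)

    # Approximate full Data Flow Script: two sources, join, filter, sink.
    script = """
    source(allowSchemaDrift: true, validateSchema: false) ~> s3pathsources
    source(allowSchemaDrift: true, validateSchema: false) ~> asl_Asls
    s3pathsources, asl_Asls join(s3pathsources@name == asl_Asls@name,
        joinType:'inner', broadcast: 'auto') ~> join1
    join1 filter(age > 18) ~> filter1
    filter1 sink(validateSchema: true,
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> sink1
    """

    data_flow = DataFlowResource(
        properties=MappingDataFlow(
            sources=[
                DataFlowSource(name="s3pathsources", dataset=ds_ref("ds_s3_nep")),
                DataFlowSource(name="asl_Asls", dataset=ds_ref("ds_copy_one_by_one")),
            ],
            sinks=[DataFlowSink(name="sink1", dataset=ds_ref("ds_sinkadls"))],
            transformations=[Transformation(name="join1"), Transformation(name="filter1")],
            script=script,
        )
    )
    adf_client.data_flows.create_or_update(
        resource_group, factory_name, "s3dataflow", data_flow
    )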

Step 7: Calling the Data Flow in a Pipeline

  1. Pipeline Name: s3dataprocessing
  2. Add Data Flow Activity: Choose s3dataflow created earlier
  3. Enable Debug Mode in the pipeline.
  4. Monitor Data Preview for each transformation to observe data changes in real time.
  5. Validate the schema and ensure proper data mapping.
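
Finally, a sketch of the pipeline itself, again with placeholder scope values and an assumed activity name. Debug mode and per-transformation data preview are interactive ADF Studio features, so the code simply registers the pipeline, triggers a run, and checks its status.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, ExecuteDataFlowActivity, DataFlowReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    resource_group, factory_name = "<resource-group>", "<data-factory-name>"

    # Pipeline s3dataprocessing with one Execute Data Flow activity that
    # runs the s3dataflow created earlier.
    pipeline = PipelineResource(
        activities=[
            ExecuteDataFlowActivity(
                name="RunS3DataFlow",  # assumed activity name
                data_flow=DataFlowReference(
                    type="DataFlowReference", reference_name="s3dataflow"
                ),
            )
        ]
    )
    adf_client.pipelines.create_or_update(
        resource_group, factory_name, "s3dataprocessing", pipeline
    )

    # Trigger a run and check its status (debug sessions and data preview
    # live in ADF Studio, not in the SDK).
    run = adf_client.pipelines.create_run(resource_group, factory_name, "s3dataprocessing")
    print(adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id).status)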

Conclusion

This ETL pipeline enables SQL-like data processing while providing real-time debugging, making it easier to track data flow and resolve issues dynamically. Using this approach ensures smooth data transformation and reliable outputs.
