🌐From S3 to ADLS Gen2: A Step-by-Step ETL Data Flow and Pipeline Execution

 


Introduction

When working with large-scale data processing, ensuring smooth data flow and being able to debug on the fly are crucial. This guide explains how to set up a SQL-like data pipeline in Azure Data Factory (ADF) with debugging enabled, so you can watch live data as it moves through each ETL stage.

Step 1: Creating Datasets

Creating a Dataset in ADLS Gen2

  1. Dataset Name: ds_copy_one_by_one
  2. Storage Type: ADLS Gen2 (Delimited Text)
  3. Linked Service Name: ls_adls
  4. Storage Account Access Tier: Hot
  5. Storage Account Name: divya22
  6. File Path (Input): asl.csv
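
If you would rather script this step than click through ADF Studio, the azure-mgmt-datafactory Python SDK can register the same dataset. The following is a minimal sketch, assuming azure-identity authentication; the subscription, resource group, factory, and container names are placeholders, while the dataset name, linked service, and file path come from the settings above.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        DatasetResource, DelimitedTextDataset,
        AzureBlobFSLocation, LinkedServiceReference,
    )

    # Placeholder scope values -- replace with your own.
    subscription_id = "<subscription-id>"
    resource_group = "<resource-group>"
    factory_name = "<data-factory-name>"

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

    # Delimited-text dataset on ADLS Gen2 pointing at asl.csv via ls_adls.
    adls_dataset = DatasetResource(
        properties=DelimitedTextDataset(
            linked_service_name=LinkedServiceReference(
                type="LinkedServiceReference", reference_name="ls_adls"
            ),
            location=AzureBlobFSLocation(
                file_system="<container-name>",  # assumed container name
                file_name="asl.csv",
            ),
            column_delimiter=",",
            first_row_as_header=True,
        )
    )
    adf_client.datasets.create_or_update(
        resource_group, factory_name, "ds_copy_one_by_one", adls_dataset
    )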

Creating a Dataset from S3

  1. Dataset Name: ds_s3_nep
  2. Dataset Type: Delimited Text
  3. Linked Service Name: ls_awss3
  4. File Path: Bucket Name and Path
  5. Authentication Type: Access Key (this and the following settings are configured on the ls_awss3 linked service)
  6. Access Key ID: Your AWS Access Key ID
  7. Secret Access Key: Your AWS Secret Access Key
  8. Service URL: the S3 service endpoint (for example, https://s3.amazonaws.com)
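
A similar sketch covers the S3 side: it registers the ls_awss3 linked service with access-key authentication and then the ds_s3_nep dataset. The bucket, folder, file, and credential values are placeholders, and the client setup mirrors the previous sketch.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        LinkedServiceResource, AmazonS3LinkedService, SecureString,
        DatasetResource, DelimitedTextDataset,
        AmazonS3Location, LinkedServiceReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    resource_group, factory_name = "<resource-group>", "<data-factory-name>"

    # Linked service ls_awss3 with access-key authentication (placeholder keys).
    s3_linked_service = LinkedServiceResource(
        properties=AmazonS3LinkedService(
            authentication_type="AccessKey",
            access_key_id="<aws-access-key-id>",
            secret_access_key=SecureString(value="<aws-secret-access-key>"),
            service_url="https://s3.amazonaws.com",  # S3 service endpoint
        )
    )
    adf_client.linked_services.create_or_update(
        resource_group, factory_name, "ls_awss3", s3_linked_service
    )

    # Delimited-text dataset ds_s3_nep reading from the bucket (names assumed).
    s3_dataset = DatasetResource(
        properties=DelimitedTextDataset(
            linked_service_name=LinkedServiceReference(
                type="LinkedServiceReference", reference_name="ls_awss3"
            ),
            location=AmazonS3Location(
                bucket_name="<bucket-name>",
                folder_path="<folder-path>",
                file_name="<file-name>.csv",
            ),
            column_delimiter=",",
            first_row_as_header=True,
        )
    )
    adf_client.datasets.create_or_update(
        resource_group, factory_name, "ds_s3_nep", s3_dataset
    )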

Step 2: Testing Connections

  • Navigate to Datasets → Select ds_s3_nep and ds_copy_one_by_one
  • Test Connection for both datasets to verify access

Step 3: Creating an ETL Data Flow

  1. Data Flow Name: s3dataflow
  2. Add Data Flow: Click the + symbol and choose Dataflow

Step 3.1: Adding Sources

  • Source 1 (S3 Dataset):

    • Output Stream Name: s3pathsources
    • Dataset: ds_s3_nep
    • Test Source Connection
  • Source 2 (ADLS Gen2 Dataset):

    • Output Stream Name: asl_Asls
    • Dataset: ds_copy_one_by_one
    • Allow schema drift (drifted column detection): Enabled
    • Infer Schema Projection: Enabled
    • Data Preview: On
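
Behind the UI, ADF stores a mapping data flow as a Data Flow Script. The two sources above correspond roughly to the fragment below (the exact script ADF generates may differ); allowSchemaDrift reflects the drifted-column option, and leaving the projection out lets the schema be inferred at run time.

    # Approximate Data Flow Script fragment for the two sources above,
    # kept as a Python string so it can be reused when the flow is assembled.
    SOURCE_SCRIPT = """
    source(allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false) ~> s3pathsources
    source(allowSchemaDrift: true,
        validateSchema: false,
        ignoreNoFilesFound: false) ~> asl_Asls
    """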

Step 4: Joining Two Sources

  1. Click the + symbol and select Join.
  2. Join Settings:
    • Join Name: join1
    • Left Stream: s3pathsources
    • Right Stream: asl_Asls
    • Join Type: Inner, Left, or Right Join
    • Join Condition: name == name (match rows on the name column from both streams)
    • Enable Data Preview
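
In Data Flow Script terms, this join corresponds roughly to the fragment below; the stream@column syntax picks the name column from each incoming stream, and joinType:'inner' would change accordingly for a left or right join.

    # Approximate Data Flow Script fragment for join1: an inner join that
    # matches the name column from both incoming streams.
    JOIN_SCRIPT = """
    s3pathsources, asl_Asls join(s3pathsources@name == asl_Asls@name,
        joinType:'inner',
        broadcast: 'auto') ~> join1
    """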

Step 5: Adding a Filter Transformation

  1. Click the + symbol and select Filter.
  2. Filter Settings:
    • Incoming Stream: join1
    • Condition: age > 18
    • The condition must evaluate to a Boolean expression
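
The filter becomes a one-line Data Flow Script fragment, sketched below; the transformation name filter1 is assumed, since the guide does not name it.

    # Approximate Data Flow Script fragment for the filter: only rows whose
    # Boolean expression age > 18 evaluates to true continue downstream.
    FILTER_SCRIPT = """
    join1 filter(age > 18) ~> filter1
    """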

Step 6: Writing Data to ADLS Gen2 (Sink)

  1. Sink Dataset: ds_sinkadls
  2. File Path (Output): joinresult.csv
  3. Validate Schema Before Execution
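
Putting the pieces together, the complete data flow could be registered through the SDK roughly as follows. The sink stream name sink1 and the exact script text are assumptions; ds_sinkadls is assumed to be an existing delimited-text dataset pointing at joinresult.csv, and validateSchema: true on the sink mirrors the schema-validation setting above.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        DataFlowResource, MappingDataFlow, DataFlowSource,
        DataFlowSink, Transformation, DatasetReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    resource_group, factory_name = "<resource-group>", "<data-factory-name>"

    def ds_ref(name):
        # Dataset reference helper used by the sources and the sink.
        return DatasetReference(type="DatasetReference", reference_name=name)

    # Approximate full Data Flow Script: two sources, join, filter, sink.
    script = """
    source(allowSchemaDrift: true, validateSchema: false) ~> s3pathsources
    source(allowSchemaDrift: true, validateSchema: false) ~> asl_Asls
    s3pathsources, asl_Asls join(s3pathsources@name == asl_Asls@name,
        joinType:'inner', broadcast: 'auto') ~> join1
    join1 filter(age > 18) ~> filter1
    filter1 sink(validateSchema: true,
        skipDuplicateMapInputs: true,
        skipDuplicateMapOutputs: true) ~> sink1
    """

    data_flow = DataFlowResource(
        properties=MappingDataFlow(
            sources=[
                DataFlowSource(name="s3pathsources", dataset=ds_ref("ds_s3_nep")),
                DataFlowSource(name="asl_Asls", dataset=ds_ref("ds_copy_one_by_one")),
            ],
            sinks=[DataFlowSink(name="sink1", dataset=ds_ref("ds_sinkadls"))],
            transformations=[Transformation(name="join1"), Transformation(name="filter1")],
            script=script,
        )
    )
    adf_client.data_flows.create_or_update(
        resource_group, factory_name, "s3dataflow", data_flow
    )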

Step 7: Calling the Data Flow in a Pipeline

  1. Pipeline Name: s3dataprocessing
  2. Add Data Flow Activity: Choose s3dataflow created earlier
  3. Enable Debug Mode in the pipeline.
  4. Monitor Data Preview for each transformation to observe data changes in real time.
  5. Validate the schema and ensure proper data mapping.
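
Finally, a sketch of the pipeline itself, again with placeholder scope values and an assumed activity name. Debug mode and per-transformation data preview are interactive ADF Studio features, so the code simply registers the pipeline, triggers a run, and checks its status.

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        PipelineResource, ExecuteDataFlowActivity, DataFlowReference,
    )

    adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")
    resource_group, factory_name = "<resource-group>", "<data-factory-name>"

    # Pipeline s3dataprocessing with one Execute Data Flow activity that
    # runs the s3dataflow created earlier.
    pipeline = PipelineResource(
        activities=[
            ExecuteDataFlowActivity(
                name="RunS3DataFlow",  # assumed activity name
                data_flow=DataFlowReference(
                    type="DataFlowReference", reference_name="s3dataflow"
                ),
            )
        ]
    )
    adf_client.pipelines.create_or_update(
        resource_group, factory_name, "s3dataprocessing", pipeline
    )

    # Trigger a run and check its status (debug sessions and data preview
    # live in ADF Studio, not in the SDK).
    run = adf_client.pipelines.create_run(resource_group, factory_name, "s3dataprocessing")
    print(adf_client.pipeline_runs.get(resource_group, factory_name, run.run_id).status)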

Conclusion

This ETL pipeline enables SQL-like data processing while providing real-time debugging, making it easier to track data flow and resolve issues dynamically. Using this approach ensures smooth data transformation and reliable outputs.
