🌐From S3 to ADLS Gen2: A Step-by-Step ETL Data Flow and Pipeline Execution
Introduction
When working with large-scale data processing, keeping data flowing smoothly and being able to debug on the fly are crucial. This guide explains how to set up a SQL-like data pipeline with debugging enabled, so you can see live data as it moves through each ETL stage.
Step 1: Creating Datasets
Creating a Dataset in ADLS Gen2
- Dataset Name: ds_copy_one_by_one
- Storage Type: ADLS Gen2 (Delimited Text)
- Linked Service Name: ls_adls
- Storage Account Tier: Hot
- Storage Account Name: divya22
- File Path (Input): asl.csv
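
For reference, the JSON that the data factory generates behind this dataset looks roughly like the sketch below, written here as a Python dict. The container (file system) name is a placeholder, since the steps above only specify the file name; adjust it to your account layout.

```python
# Rough sketch of the ADLS Gen2 delimited-text dataset (ds_copy_one_by_one).
# "raw" is a placeholder container (file system) name, not specified in the steps above.
ds_copy_one_by_one = {
    "name": "ds_copy_one_by_one",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {
            "referenceName": "ls_adls",
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",  # location type used for ADLS Gen2
                "fileSystem": "raw",            # placeholder container
                "fileName": "asl.csv",
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}
```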
Creating a Dataset from S3
- Dataset Name: ds_s3_nep
- Dataset Type: Delimited Text
- Linked Service Name: ls_awss3
- File Path: bucket name and path
- Authentication Type: Access Key
- Access Key ID: your AWS access key ID
- Secret Access Key: your AWS secret access key
- Service URL: S3 bucket path
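
The connection details above (access key, secret, service URL) actually live on the ls_awss3 linked service that the dataset references. As a rough sketch, written as Python dicts mirroring the JSON the UI produces, they look like this; the bucket, folder, file, and endpoint values are placeholders, and in practice the secret should come from Azure Key Vault rather than being pasted inline.

```python
# Sketch of the Amazon S3 linked service (ls_awss3) and delimited-text dataset (ds_s3_nep).
# Bucket, folder, file, and endpoint values are placeholders; keep the secret in Key Vault in real use.
ls_awss3 = {
    "name": "ls_awss3",
    "properties": {
        "type": "AmazonS3",
        "typeProperties": {
            "serviceUrl": "https://s3.amazonaws.com",   # placeholder service URL
            "accessKeyId": "<your-aws-access-key-id>",
            "secretAccessKey": {"type": "SecureString", "value": "<your-aws-secret-access-key>"},
        },
    },
}

ds_s3_nep = {
    "name": "ds_s3_nep",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": {"referenceName": "ls_awss3", "type": "LinkedServiceReference"},
        "typeProperties": {
            "location": {
                "type": "AmazonS3Location",
                "bucketName": "<your-bucket>",   # placeholder bucket
                "folderPath": "<your-folder>",   # placeholder path
                "fileName": "names.csv",         # placeholder file
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": True,
        },
    },
}
```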
Step 2: Testing Connections
- Navigate to Datasets → select ds_s3_nep and ds_copy_one_by_one
- Test the connection for both datasets to verify access
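
Beyond the Test connection button, you can sanity-check both stores from a small local script before building the data flow. This is only a sketch: the bucket, key, and container names are placeholders, and it assumes the boto3, azure-identity, and azure-storage-file-datalake packages are installed.

```python
# Quick local sanity check of both stores, outside of the data factory UI.
import boto3
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# S3 side: confirm the bucket/key is reachable with the same access key pair.
s3 = boto3.client(
    "s3",
    aws_access_key_id="<your-aws-access-key-id>",
    aws_secret_access_key="<your-aws-secret-access-key>",
)
s3.head_object(Bucket="<your-bucket>", Key="<your-folder>/names.csv")  # raises if missing

# ADLS Gen2 side: confirm the account and container ("raw" is a placeholder) are reachable.
adls = DataLakeServiceClient(
    account_url="https://divya22.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
fs = adls.get_file_system_client("raw")
print([p.name for p in fs.get_paths()])  # should list asl.csv among the paths
```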
Step 3: Creating an ETL Data Flow
- Data Flow Name: s3dataflow
- Add Data Flow: click the + symbol and choose Data flow
Step 3.1: Adding Sources
- Source 1 (S3 Dataset):
  - Output Stream Name: s3pathsources
  - Dataset: ds_s3_nep
  - Test the source connection
- Source 2 (ADLS Gen2 Dataset):
  - Output Stream Name: asl_Asls
  - Dataset: ds_copy_one_by_one
  - Enable Drifted Column Detection: checked
  - Infer Schema Projection: enabled
  - Data Preview: on
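
To reason about what the two source streams contain, it can help to load the same files locally. The sketch below uses pandas with the s3fs and adlfs filesystem plugins; the S3 bucket/key and the ADLS container name are placeholders, and the credentials are the same ones used by the linked services.

```python
import pandas as pd

# Local stand-ins for the two data flow sources (bucket/key and container are placeholders).
s3pathsources = pd.read_csv(
    "s3://<your-bucket>/<your-folder>/names.csv",       # placeholder S3 path
    storage_options={
        "key": "<your-aws-access-key-id>",
        "secret": "<your-aws-secret-access-key>",
    },
)

asl_Asls = pd.read_csv(
    "abfs://raw@divya22.dfs.core.windows.net/asl.csv",   # "raw" is a placeholder container
    storage_options={"account_name": "divya22", "account_key": "<your-storage-account-key>"},
)

print(s3pathsources.head())
print(asl_Asls.head())
```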
Step 4: Joining Two Sources
- Click the + symbol and select Join.
- Join Settings:
  - Join Name: join1
  - Left Stream: s3pathsources
  - Right Stream: asl_Asls
  - Join Type: Inner, Left, or Right join
  - Join Condition: name == name (the name column on both sides)
  - Enable Data Preview
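
The join above is equivalent to a SQL inner join on the name column. Continuing the local pandas sketch, assuming both frames have a name column as in the join condition:

```python
# Equivalent of the join1 transformation: inner join on the "name" column.
# Switch how="left" or how="right" to mirror the other join types.
join1 = s3pathsources.merge(asl_Asls, on="name", how="inner")
print(join1.head())
```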
Step 5: Adding a Filter Transformation
- Click the + symbol and select Filter.
- Filter Settings:
  - Incoming Stream: join1
  - Condition: age > 18 (the condition must be a Boolean expression)
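
The filter is just a Boolean expression evaluated against each joined row. In the local sketch, assuming an age column exists on the joined frame:

```python
# Equivalent of the filter transformation: keep only rows where age > 18.
filtered = join1[join1["age"] > 18]
print(filtered.shape)
```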
Step 6: Writing Data to ADLS Gen2 (Sink)
- Sink Dataset: ds_sinkadls
- File Path (Output): joinresult.csv
- Validate the schema before execution
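
The sink step lands the filtered rows in ADLS Gen2 as joinresult.csv. The local sketch does the same thing with pandas, reusing the placeholder container and account key from before:

```python
# Equivalent of the sink step: write the filtered result to ADLS Gen2 as joinresult.csv.
filtered.to_csv(
    "abfs://raw@divya22.dfs.core.windows.net/joinresult.csv",   # "raw" is a placeholder container
    index=False,
    storage_options={"account_name": "divya22", "account_key": "<your-storage-account-key>"},
)
```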
Step 7: Calling the Data Flow in a Pipeline
- Pipeline Name: s3dataprocessing
- Add Data Flow Activity: choose the s3dataflow created earlier
- Enable Debug Mode in the pipeline
- Monitor Data Preview for each transformation to observe data changes in real time
- Validate the schema and ensure proper data mapping
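
Behind the pipeline UI, the Execute Data Flow activity that wraps s3dataflow looks roughly like the sketch below (a Python dict mirroring the pipeline JSON); the activity name is a placeholder.

```python
# Rough sketch of the s3dataprocessing pipeline with an Execute Data Flow activity.
s3dataprocessing = {
    "name": "s3dataprocessing",
    "properties": {
        "activities": [
            {
                "name": "Run s3dataflow",   # placeholder activity name
                "type": "ExecuteDataFlow",
                "typeProperties": {
                    "dataFlow": {
                        "referenceName": "s3dataflow",
                        "type": "DataFlowReference",
                    }
                },
            }
        ]
    },
}
```

With Debug Mode on, the Data Preview tab on each transformation shows live rows while the flow runs, which makes it easy to verify the join and filter output before publishing.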
Conclusion
This ETL pipeline enables SQL-like data processing while providing real-time debugging, making it easier to track data flow and resolve issues dynamically. Using this approach ensures smooth data transformation and reliable outputs.