🌐End-to-End ETL Pipeline: MS SQL to MS SQL Using Azure Databricks
Understanding Databricks and Azure Databricks
1. Introduction
What is Databricks?
Databricks is a cloud-based platform optimized for big data processing and machine learning. It is built on Apache Spark, an open-source distributed computing system known for its speed and efficiency in handling large datasets.
What is Azure Databricks?
Azure Databricks is a Microsoft Azure cloud service that enhances Databricks by integrating it with Azure services. It provides enterprise-level scalability, security, and AI capabilities for data engineering, machine learning, and analytics.
2. Key Features
Features of Databricks:
- Unified Platform: Combines data engineering, machine learning, and analytics.
- Optimized Performance: Enhances Apache Spark for better speed.
- Scalability: Handles large-scale data efficiently.
- Multiple Editions:
  - Community Edition (Free): best for learning and small experiments.
  - Paid Version: offers enterprise support, additional features, and scalability.
Features of Azure Databricks:
Azure Databricks extends Databricks with Azure-specific enhancements:
- Seamless Cloud Integration: Connects Azure AI, Spark, and cloud services.
- Azure Components:
  - Azure Data Factory (ADF) for data orchestration.
  - Microsoft Fabric for unified analytics.
  - Azure Synapse, ADLS, and Key Vault for storage and security.
- Fully Managed Service: Unlike Databricks, Azure Databricks is a paid-only service.
3. Setting Up Azure Databricks
Creating an Azure Databricks Instance
- Search for "Azure Databricks" in the Azure portal and select Create.
- Provide workspace details:
  - Workspace Name: mydivya
  - Region: Central India
  - Pricing Tier: Standard
  - Keep other settings at their defaults.
- Deploy the workspace.
Setting Up a Cluster in Azure Databricks
- Create a new cluster.
- Choose between:
  - Single-node cluster (for learning and small experiments)
  - Multi-node cluster (for real-time and large-scale processing), for example:
    - Min Workers: 2
    - Max Workers: 8
    - Memory: 16 GB per worker
    - Auto-Termination: after 10 minutes of inactivity
- Install libraries on the cluster if dependencies are missing.
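The cluster settings above can also be expressed as a create-cluster payload for the Databricks Clusters API. The sketch below is illustrative only — the `spark_version` and `node_type_id` values are example placeholders, not taken from the original post:

```python
# Illustrative cluster spec in the shape of a Databricks Clusters API
# "create" payload. Runtime version and VM type are example values.
cluster_config = {
    "cluster_name": "etl-cluster",
    "spark_version": "13.3.x-scala2.12",  # example Databricks runtime
    "node_type_id": "Standard_DS3_v2",    # example Azure VM type
    "autoscale": {
        "min_workers": 2,                 # matches the settings above
        "max_workers": 8,
    },
    "autotermination_minutes": 10,        # shut down after 10 idle minutes
}
print(cluster_config["autoscale"])
```

Submitting this payload to the workspace's REST endpoint (or pasting the equivalent values into the cluster UI) produces the same autoscaling cluster described above.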
4. Connecting Azure Databricks with a Database
Steps to Connect Databricks with SQL Server
- Open SQL Workbench and connect to the database.
- Create a new database (avoid using master).
- Create tables and insert sample data:
CREATE TABLE EMP (
EMPNO NUMERIC(4) NOT NULL,
ENAME VARCHAR(10),
JOB VARCHAR(9),
MGR NUMERIC(4),
HIREDATE DATETIME,
SAL NUMERIC(7, 2),
COMM NUMERIC(7, 2),
DEPTNO NUMERIC(2)
);
INSERT INTO EMP VALUES
(7369, 'SMITH', 'CLERK', 7902, '1980-12-17', 800, NULL, 20),
(7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20', 1600, 300, 30);
CREATE TABLE DEPT (
DEPTNO NUMERIC(2),
DNAME VARCHAR(14),
LOC VARCHAR(13)
);
INSERT INTO DEPT VALUES (10, 'ACCOUNTING', 'NEW YORK'),
(20, 'RESEARCH', 'DALLAS'),
(30, 'SALES', 'CHICAGO'),
(40, 'OPERATIONS', 'BOSTON');
CREATE TABLE BONUS (
ENAME VARCHAR(10),
JOB VARCHAR(9),
SAL NUMERIC,
COMM NUMERIC
);
CREATE TABLE SALGRADE (
GRADE NUMERIC,
LOSAL NUMERIC,
HISAL NUMERIC
);
INSERT INTO SALGRADE VALUES (1, 700, 1200),
(2, 1201, 1400),
(3, 1401, 2000),
(4, 2001, 3000),
(5, 3001, 9999);
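The Spark snippets that follow reference a `host` variable holding the JDBC URL of the source database. For SQL Server, the URL typically has the shape below — the server and database names here are placeholders, not real values:

```python
# Placeholder SQL Server JDBC URL; substitute your own server and database.
# Recent Databricks runtimes include the Microsoft SQL Server JDBC driver.
host = (
    "jdbc:sqlserver://<your-server>.database.windows.net:1433;"
    "databaseName=<your-database>;"
    "encrypt=true;trustServerCertificate=false"
)
print(host)
```

In a real workspace the username and password would come from a secret scope (for example one backed by Azure Key Vault) rather than being hard-coded in the notebook.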
- Connect Azure Databricks to the database and read tables:
# "host" holds the JDBC URL of the source SQL Server database
qry1 = "(SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE') t"
df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", qry1) \
    .load()
df.show()
# Convert table names to a list
tab_list = [row[0] for row in df.collect()]
print(tab_list)
- Read each table and display data:
for t in tab_list:
    df = spark.read.format("jdbc") \
        .option("url", host) \
        .option("user", "chandra") \
        .option("password", "Mypassword.1") \
        .option("dbtable", t) \
        .load().na.drop()
    if not df.isEmpty():  # skip tables with no rows left after na.drop()
        df.show()
- Filter and query data:
qry = "(SELECT * FROM dbo.EMP WHERE SAL > 2000 AND JOB = 'MANAGER') hhh"
df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", qry) \
    .load()
df.show()
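The parenthesized subquery with an alias is what lets Spark's JDBC reader push the filter down to the database: any derived table is accepted as `dbtable`, so the WHERE clause runs on the server, not in Spark. That pattern can be wrapped in a small helper (a hypothetical function, not part of the original post):

```python
def pushdown_query(select_sql: str, alias: str = "q") -> str:
    """Wrap a SELECT so the JDBC 'dbtable' option treats it as a derived
    table; the database, not Spark, evaluates the WHERE clause."""
    return f"({select_sql}) {alias}"

qry = pushdown_query("SELECT * FROM dbo.EMP WHERE SAL > 2000 AND JOB = 'MANAGER'")
print(qry)  # (SELECT * FROM dbo.EMP WHERE SAL > 2000 AND JOB = 'MANAGER') q

# In the notebook, the wrapped string is used exactly like a table name:
# df = spark.read.format("jdbc").option("url", host) \
#          .option("dbtable", qry) \
#          .option("user", "chandra").option("password", "...").load()
```

On newer Spark versions, `.option("query", select_sql)` accepts the bare SELECT and adds the wrapper automatically, which avoids inventing throwaway aliases like `hhh`.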
5. Storing Transformed Data in the Target Database
After processing, save each table to the target database (here a MySQL instance on Amazon RDS) under a modified table name. Each table must be re-read from the source inside the loop; otherwise the same DataFrame would be written for every target table.
myhost = "jdbc:mysql://venumysql.cnay6k4i6xv6.ap-south-1.rds.amazonaws.com:3306/mysqldb"
for t in tab_list:
    df = spark.read.format("jdbc") \
        .option("url", host) \
        .option("user", "chandra") \
        .option("password", "Mypassword.1") \
        .option("dbtable", t) \
        .load()
    df.write.mode("append").format("jdbc") \
        .option("url", myhost) \
        .option("dbtable", t + "_mssql") \
        .option("user", "admin") \
        .option("password", "Mypassword.1") \
        .save()
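The copy loop can be factored into reusable helpers (hypothetical names, a sketch rather than the post's code). Only the name-mapping part runs standalone below, since the JDBC round trip needs a live cluster and databases:

```python
def target_name(table: str, suffix: str = "_mssql") -> str:
    """Map a source table name to its name in the target database."""
    return table + suffix

def copy_table(spark, src_url, dst_url, table, src_auth, dst_auth):
    """Read one table over JDBC from the source and append it to the target.
    src_auth / dst_auth are (user, password) pairs."""
    df = (spark.read.format("jdbc")
          .option("url", src_url)
          .option("dbtable", table)
          .option("user", src_auth[0]).option("password", src_auth[1])
          .load())
    (df.write.mode("append").format("jdbc")
        .option("url", dst_url)
        .option("dbtable", target_name(table))
        .option("user", dst_auth[0]).option("password", dst_auth[1])
        .save())

print([target_name(t) for t in ["EMP", "DEPT"]])
```

In production, the credential pairs would be fetched from a secret scope (for example Key Vault-backed) instead of appearing as literals in the notebook.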
6. Conclusion
Azure Databricks is a powerful cloud-based analytics platform that extends Databricks with Azure-specific enhancements. It unifies Spark, AI, and cloud computing, offering a seamless environment for big data processing. By setting up clusters, connecting to databases, and running queries, users can efficiently process and transform large datasets.