🌐End-to-End ETL Pipeline: MS SQL to MS SQL Using Azure Databricks

Understanding Databricks and Azure Databricks

1. Introduction

What is Databricks?

Databricks is a cloud-based platform optimized for big data processing and machine learning. It is built on Apache Spark, an open-source distributed computing system known for its speed and efficiency in handling large datasets.

What is Azure Databricks?

Azure Databricks is a Microsoft Azure cloud service that enhances Databricks by integrating it with Azure services. It provides enterprise-level scalability, security, and AI capabilities for data engineering, machine learning, and analytics.

2. Key Features

Features of Databricks:

  1. Unified Platform: Combines data engineering, machine learning, and analytics.
  2. Optimized Performance: Ships a tuned Spark runtime that runs faster than stock Apache Spark.
  3. Scalability: Handles large-scale data efficiently.
  4. Multiple Editions:
    • Community Edition (Free): Best for learning and small experiments.
    • Paid Version: Offers enterprise support, additional features, and scalability.

Features of Azure Databricks:

Azure Databricks extends Databricks with Azure-specific enhancements:

  1. Seamless Cloud Integration: Connects Spark workloads with Azure AI and other Azure services.
  2. Azure Components:
    • Azure Data Factory (ADF) for data orchestration.
    • Microsoft Fabric for unified analytics.
    • Azure Synapse, ADLS, and Key Vault for storage and security.
  3. Fully Managed Service: Unlike Databricks, which offers a free Community Edition, Azure Databricks is a paid-only, fully managed service.

3. Setting Up Azure Databricks

Creating an Azure Databricks Instance

  1. Search for "Azure Databricks" in the Azure portal.
  2. Provide workspace details:
    • Workspace Name: mydivya
    • Region: Central India
    • Pricing Tier: Standard
    • Keep other settings at their defaults
  3. Deploy the workspace.

Setting Up a Cluster in Azure Databricks

  1. Create a New Cluster.
  2. Choose between:
    • Single Node Cluster (for learning and small experiments)
    • Multi-Node Cluster (for production-scale workloads; a configuration like this can also be created via the REST API, as sketched after this list):
      • Min Workers: 2
      • Max Workers: 8
      • Memory: 16 GB
      • Auto-Termination: 10 minutes
  3. Add libraries (for example, JDBC driver jars) to the cluster if dependencies are missing.
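
The multi-node setup above can also be created programmatically. The sketch below uses the Databricks Clusters REST API (POST /api/2.0/clusters/create); the workspace URL, access token, cluster name, runtime version, and node type are placeholders, not values taken from this walkthrough.

import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder: generate one in User Settings

payload = {
    "cluster_name": "etl-cluster",                    # placeholder name
    "spark_version": "13.3.x-scala2.12",              # any available LTS runtime
    "node_type_id": "Standard_DS3_v2",                # a common Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 10,
}

resp = requests.post(
    WORKSPACE_URL + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + TOKEN},
    json=payload,
)
print(resp.json())  # returns the new cluster_id on success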

4. Connecting Azure Databricks with a Database

Steps to Connect Databricks with SQL Server

  1. Open SQL Workbench and connect to the database.
  2. Create a new database (avoid working in the system database master).
  3. Create tables and insert sample data:
CREATE TABLE EMP (
    EMPNO NUMERIC(4) NOT NULL,
    ENAME VARCHAR(10),
    JOB VARCHAR(9),
    MGR NUMERIC(4),
    HIREDATE DATETIME,
    SAL NUMERIC(7, 2),
    COMM NUMERIC(7, 2),
    DEPTNO NUMERIC(2)
);

INSERT INTO EMP VALUES
(7369, 'SMITH', 'CLERK', 7902, '1980-12-17', 800, NULL, 20),
(7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20', 1600, 300, 30);

CREATE TABLE DEPT (
    DEPTNO NUMERIC(2),
    DNAME VARCHAR(14),
    LOC VARCHAR(13)
);

INSERT INTO DEPT VALUES (10, 'ACCOUNTING', 'NEW YORK'),
                        (20, 'RESEARCH', 'DALLAS'),
                        (30, 'SALES', 'CHICAGO'),
                        (40, 'OPERATIONS', 'BOSTON');

CREATE TABLE BONUS (
    ENAME VARCHAR(10),
    JOB VARCHAR(9),
    SAL NUMERIC,
    COMM NUMERIC
);

CREATE TABLE SALGRADE (
    GRADE NUMERIC,
    LOSAL NUMERIC,
    HISAL NUMERIC
);

INSERT INTO SALGRADE VALUES (1, 700, 1200),
                            (2, 1201, 1400),
                            (3, 1401, 2000),
                            (4, 2001, 3000),
                            (5, 3001, 9999);
  4. Connect Azure Databricks to the database and list its tables:
# JDBC URL of the source SQL Server instance (placeholder values)
host = "jdbc:sqlserver://<server-name>:1433;databaseName=<database-name>"

# List all user tables in the database
qry1 = "(SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE') t"
df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", qry1) \
    .load()
df.show()

# Convert the table names to a Python list
tab_list = [row[0] for row in df.collect()]
print(tab_list)
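
Hardcoding credentials is fine for a quick demo, but in practice they belong in a Databricks secret scope (optionally backed by the Azure Key Vault mentioned earlier). A minimal sketch, assuming a scope named etl-scope with keys sql-user and sql-password has already been created:

# Hypothetical scope and key names; create them first with the
# Databricks CLI or a Key Vault-backed secret scope
user = dbutils.secrets.get(scope="etl-scope", key="sql-user")
password = dbutils.secrets.get(scope="etl-scope", key="sql-password")

df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "dbo.EMP") \
    .load()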
  5. Read each table and display data:
for t in tab_list:
    df = spark.read.format("jdbc") \
        .option("url", host) \
        .option("user", "chandra") \
        .option("password", "Mypassword.1") \
        .option("dbtable", t) \
        .load().na.drop()    # drop rows containing nulls
    if not df.isEmpty():     # skip tables left empty after the drop
        df.show()
  6. Filter and query data:
qry = "(SELECT * FROM dbo.EMP WHERE SAL > 2000 AND JOB = 'MANAGER') emp_filtered"
df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", qry) \
    .load()
df.show()
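
The subquery above pushes the filter down to SQL Server before any data moves. An equivalent approach, shown below as a sketch, reads dbo.EMP and filters on the Spark side; Spark can also push simple predicates like these through JDBC automatically.

from pyspark.sql import functions as F

emp = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", "dbo.EMP") \
    .load()

# Same condition expressed as a DataFrame filter
emp.filter((F.col("SAL") > 2000) & (F.col("JOB") == "MANAGER")).show()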

5. Storing Transformed Data in a Target Database

After processing, write each table back out over JDBC under a modified name. In this example the target is a MySQL instance on Amazon RDS; a SQL Server target works the same way with its JDBC URL swapped in.

myhost = "jdbc:mysql://venumysql.cnay6k4i6xv6.ap-south-1.rds.amazonaws.com:3306/mysqldb"

for t in tab_list:
    # Re-read each source table, then append it to the target
    # under the original name plus a "_mssql" suffix
    df = spark.read.format("jdbc") \
        .option("url", host) \
        .option("user", "chandra") \
        .option("password", "Mypassword.1") \
        .option("dbtable", t) \
        .load()
    df.write.mode("append").format("jdbc") \
        .option("url", myhost) \
        .option("dbtable", t + "_mssql") \
        .option("user", "admin") \
        .option("password", "Mypassword.1") \
        .save()
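
If the write fails with a "No suitable driver" error, the MySQL connector jar must be attached to the cluster (the add-libraries step above), and the driver class can be named explicitly. A sketch for a single table; the driver class shown is the standard one from MySQL Connector/J 8.x:

# Explicit driver class; requires the MySQL Connector/J library
# to be installed on the cluster
df.write.mode("append").format("jdbc") \
    .option("url", myhost) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "EMP_mssql") \
    .option("user", "admin") \
    .option("password", "Mypassword.1") \
    .save()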

6. Conclusion

Azure Databricks is a powerful cloud-based analytics platform that extends Databricks with Azure-specific enhancements. It unifies Spark, AI, and cloud computing, offering a seamless environment for big data processing. By setting up clusters, connecting to databases, and running queries, users can efficiently process and transform large datasets.
