🌐End-to-End ETL Pipeline: MS SQL to MS SQL Using Azure Databricks

Understanding Databricks and Azure Databricks

1. Introduction

What is Databricks?

Databricks is a cloud-based platform optimized for big data processing and machine learning. It is built on Apache Spark, an open-source distributed computing system known for its speed and efficiency in handling large datasets.

What is Azure Databricks?

Azure Databricks is a Microsoft Azure cloud service that enhances Databricks by integrating it with Azure services. It provides enterprise-level scalability, security, and AI capabilities for data engineering, machine learning, and analytics.

2. Key Features

Features of Databricks:

  1. Unified Platform: Combines data engineering, machine learning, and analytics.
  2. Optimized Performance: Ships a tuned Spark runtime that runs faster than stock Apache Spark.
  3. Scalability: Handles large-scale data efficiently.
  4. Multiple Editions:
    • Community Edition (Free): Best for learning and small experiments.
    • Paid Version: Offers enterprise support, additional features, and scalability.

Features of Azure Databricks:

Azure Databricks extends Databricks with Azure-specific enhancements:

  1. Seamless Cloud Integration: Connects Spark workloads with Azure AI and other Azure services.
  2. Azure Components:
    • Azure Data Factory (ADF) for data orchestration.
    • Microsoft Fabric for unified analytics.
    • Azure Synapse, ADLS, and Key Vault for storage and security.
  3. Fully Managed Service: Unlike Databricks, which offers a free Community Edition, Azure Databricks is a paid-only, fully managed service.

3. Setting Up Azure Databricks

Creating an Azure Databricks Instance

  1. Search for "Azure Databricks" in the Azure portal.
  2. Provide workspace details:
    • Workspace Name: mydivya
    • Region: Central India
    • Pricing Tier: Standard
    • Keep other settings at their defaults
  3. Deploy the workspace.

Setting Up a Cluster in Azure Databricks

  1. Create a New Cluster.
  2. Choose between:
    • Single Node Cluster (for learning and small experiments)
    • Multi-Node Cluster (for production-scale workloads; a configuration like this can also be created via the REST API, as sketched after this list):
      • Min Workers: 2
      • Max Workers: 8
      • Memory: 16 GB
      • Auto-Termination: 10 minutes
  3. Add libraries (for example, JDBC driver jars) to the cluster if dependencies are missing.
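
The multi-node setup above can also be created programmatically. The sketch below uses the Databricks Clusters REST API (POST /api/2.0/clusters/create); the workspace URL, access token, cluster name, runtime version, and node type are placeholders, not values taken from this walkthrough.

import requests

WORKSPACE_URL = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder: generate one in User Settings

payload = {
    "cluster_name": "etl-cluster",                    # placeholder name
    "spark_version": "13.3.x-scala2.12",              # any available LTS runtime
    "node_type_id": "Standard_DS3_v2",                # a common Azure VM size
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 10,
}

resp = requests.post(
    WORKSPACE_URL + "/api/2.0/clusters/create",
    headers={"Authorization": "Bearer " + TOKEN},
    json=payload,
)
print(resp.json())  # returns the new cluster_id on success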

4. Connecting Azure Databricks with a Database

Steps to Connect Databricks with SQL Server

  1. Open SQL Workbench and connect to the database.
  2. Create a new database (avoid working in the system database master).
  3. Create tables and insert sample data:
CREATE TABLE EMP (
    EMPNO NUMERIC(4) NOT NULL,
    ENAME VARCHAR(10),
    JOB VARCHAR(9),
    MGR NUMERIC(4),
    HIREDATE DATETIME,
    SAL NUMERIC(7, 2),
    COMM NUMERIC(7, 2),
    DEPTNO NUMERIC(2)
);

INSERT INTO EMP VALUES
(7369, 'SMITH', 'CLERK', 7902, '1980-12-17', 800, NULL, 20),
(7499, 'ALLEN', 'SALESMAN', 7698, '1981-02-20', 1600, 300, 30);

CREATE TABLE DEPT (
    DEPTNO NUMERIC(2),
    DNAME VARCHAR(14),
    LOC VARCHAR(13)
);

INSERT INTO DEPT VALUES (10, 'ACCOUNTING', 'NEW YORK'),
                        (20, 'RESEARCH', 'DALLAS'),
                        (30, 'SALES', 'CHICAGO'),
                        (40, 'OPERATIONS', 'BOSTON');

CREATE TABLE BONUS (
    ENAME VARCHAR(10),
    JOB VARCHAR(9),
    SAL NUMERIC,
    COMM NUMERIC
);

CREATE TABLE SALGRADE (
    GRADE NUMERIC,
    LOSAL NUMERIC,
    HISAL NUMERIC
);

INSERT INTO SALGRADE VALUES (1, 700, 1200),
                            (2, 1201, 1400),
                            (3, 1401, 2000),
                            (4, 2001, 3000),
                            (5, 3001, 9999);
  4. Connect Azure Databricks to the database and list its tables:
# JDBC URL of the source SQL Server instance (placeholder values)
host = "jdbc:sqlserver://<server-name>:1433;databaseName=<database-name>"

# List all user tables in the database
qry1 = "(SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE = 'BASE TABLE') t"
df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", qry1) \
    .load()
df.show()

# Convert the table names to a Python list
tab_list = [row[0] for row in df.collect()]
print(tab_list)
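
Hardcoding credentials is fine for a quick demo, but in practice they belong in a Databricks secret scope (optionally backed by the Azure Key Vault mentioned earlier). A minimal sketch, assuming a scope named etl-scope with keys sql-user and sql-password has already been created:

# Hypothetical scope and key names; create them first with the
# Databricks CLI or a Key Vault-backed secret scope
user = dbutils.secrets.get(scope="etl-scope", key="sql-user")
password = dbutils.secrets.get(scope="etl-scope", key="sql-password")

df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "dbo.EMP") \
    .load()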
  5. Read each table and display data:
for t in tab_list:
    df = spark.read.format("jdbc") \
        .option("url", host) \
        .option("user", "chandra") \
        .option("password", "Mypassword.1") \
        .option("dbtable", t) \
        .load().na.drop()    # drop rows containing nulls
    if not df.isEmpty():     # skip tables left empty after the drop
        df.show()
  6. Filter and query data:
qry = "(SELECT * FROM dbo.EMP WHERE SAL > 2000 AND JOB = 'MANAGER') emp_filtered"
df = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", qry) \
    .load()
df.show()
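
The subquery above pushes the filter down to SQL Server before any data moves. An equivalent approach, shown below as a sketch, reads dbo.EMP and filters on the Spark side; Spark can also push simple predicates like these through JDBC automatically.

from pyspark.sql import functions as F

emp = spark.read.format("jdbc") \
    .option("url", host) \
    .option("user", "chandra") \
    .option("password", "Mypassword.1") \
    .option("dbtable", "dbo.EMP") \
    .load()

# Same condition expressed as a DataFrame filter
emp.filter((F.col("SAL") > 2000) & (F.col("JOB") == "MANAGER")).show()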

5. Storing Transformed Data in a Target Database

After processing, write each table back out over JDBC under a modified name. In this example the target is a MySQL instance on Amazon RDS; a SQL Server target works the same way with its JDBC URL swapped in.

myhost = "jdbc:mysql://venumysql.cnay6k4i6xv6.ap-south-1.rds.amazonaws.com:3306/mysqldb"

for t in tab_list:
    # Re-read each source table, then append it to the target
    # under the original name plus a "_mssql" suffix
    df = spark.read.format("jdbc") \
        .option("url", host) \
        .option("user", "chandra") \
        .option("password", "Mypassword.1") \
        .option("dbtable", t) \
        .load()
    df.write.mode("append").format("jdbc") \
        .option("url", myhost) \
        .option("dbtable", t + "_mssql") \
        .option("user", "admin") \
        .option("password", "Mypassword.1") \
        .save()
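
If the write fails with a "No suitable driver" error, the MySQL connector jar must be attached to the cluster (the add-libraries step above), and the driver class can be named explicitly. A sketch for a single table; the driver class shown is the standard one from MySQL Connector/J 8.x:

# Explicit driver class; requires the MySQL Connector/J library
# to be installed on the cluster
df.write.mode("append").format("jdbc") \
    .option("url", myhost) \
    .option("driver", "com.mysql.cj.jdbc.Driver") \
    .option("dbtable", "EMP_mssql") \
    .option("user", "admin") \
    .option("password", "Mypassword.1") \
    .save()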

6. Conclusion

Azure Databricks is a powerful cloud-based analytics platform that extends Databricks with Azure-specific enhancements. It unifies Spark, AI, and cloud computing, offering a seamless environment for big data processing. By setting up clusters, connecting to databases, and running queries, users can efficiently process and transform large datasets.
