🌐Step-by-Step Guide to Creating and Accessing Azure Blob Storage in Databricks

Introduction

Azure Blob Storage is a scalable, cloud-based object storage solution for storing massive amounts of unstructured data. This blog walks you step by step through creating a container, generating access keys, mounting the storage in Databricks, and loading CSV files into an MSSQL database.


Step 1: Create an Azure Storage Account and Container

Before proceeding, you need an Azure Storage Account and a container.

1.1 Create a Storage Account

  1. Log in to the Azure Portal.
  2. Navigate to Storage Accounts.
  3. Click Create and provide the following details:
    • Subscription: Choose your subscription.
    • Resource Group: Create a new one or use an existing one.
    • Storage Account Name: venu28nov2023
    • Region: Select your preferred region.
    • Performance: Standard.
    • Replication: Locally Redundant Storage (LRS) for cost efficiency.
  4. Click Review + Create and wait for deployment.

1.2 Create a Container

  1. Go to your newly created storage account.
  2. Navigate to Containers under Data Storage.
  3. Click + Container, name it input, and select Private (no anonymous access).
  4. Click Create.
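
If you prefer to script these portal steps, the sketch below shows one possible approach using the azure-identity and azure-mgmt-storage Python packages (both assumed to be installed). The subscription ID, resource group, and region are placeholders, not values from this guide:

from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

# Placeholders: replace with your own subscription ID and resource group
credential = DefaultAzureCredential()
client = StorageManagementClient(credential, "<subscription-id>")

# Create the storage account (StorageV2, Standard performance, LRS replication)
poller = client.storage_accounts.begin_create(
    "my-resource-group",
    "venu28nov2023",
    {
        "location": "eastus",
        "kind": "StorageV2",
        "sku": {"name": "Standard_LRS"},
    },
)
account = poller.result()

# Create the "input" container; containers are private (no anonymous access) by default
client.blob_containers.create("my-resource-group", "venu28nov2023", "input", {})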

Step 2: Generate Access Keys

To access the storage account, retrieve the access key:

  1. Navigate to Storage Account > Access Keys.
  2. Copy one of the keys (Primary or Secondary).
    • Example:
      Accesskeys = "tnd12MEQycM6I4EWg9qjRv3k+pFZL5Op1XwQSE0z5i0gGzX8J0jvh8w7iqq4uO9yAc7RLHa3+cSo+ASt335JdQ=="
      
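The key can also be retrieved programmatically. A short sketch, reusing the management client from the Step 1 sketch (the resource group name is again a placeholder):

# List the account keys and take the primary one
keys = client.storage_accounts.list_keys("my-resource-group", "venu28nov2023")
Accesskeys = keys.keys[0].value
print(Accesskeys)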

Step 3: Mount Azure Blob Storage in Databricks

To mount the container in Databricks, run the following command:

storage_account_name = "venu28nov2023"
container_name = "input"
Accesskeys = "tnd12MEQycM6I4EWg9qjRv3k+pFZL5Op1XwQSE0z5i0gGzX8J0jvh8w7iqq4uO9yAc7RLHa3+cSo+ASt335JdQ=="

# Mount the blob container so it is accessible as a DBFS path
dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net",
    mount_point="/mnt/iotdata1",
    extra_configs={f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": Accesskeys}
)

This mounts the input container to /mnt/iotdata1 in Databricks.
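
To confirm the mount worked, list its contents. Also, hardcoding the key in a notebook is convenient but risky; a common alternative (sketched below with hypothetical scope and key names) is to read it from a Databricks secret scope:

# Verify the mount by listing the container contents
display(dbutils.fs.ls("/mnt/iotdata1"))

# Alternative sketch: pull the key from a secret scope instead of hardcoding it.
# "blob-scope" and "storage-key" are hypothetical names, not part of this guide.
Accesskeys = dbutils.secrets.get(scope="blob-scope", key="storage-key")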


Step 4: Process CSV Data

4.1 Define the Data Location

data="/mnt/wallmart/csvdata"

4.2 List and Filter Files

import re
import os

# List the mounted folder and keep only the us-500 CSV file(s)
files = dbutils.fs.ls(data)
files_paths = [f.path for f in files if "us-500" in f.path and f.path.endswith("us-500.csv")]
print(files_paths)

This filters CSV files that contain "us-500" in their name.

4.3 Define MSSQL Connection

host="jdbc:sqlserver://sampleserver111.database.windows.net:1433;databaseName=mssqldb;"
conf = {
    "url": host,
    "user": "chandra",
    "password": "Mypassword.1"
}
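
Before looping over the files, it can help to verify that the cluster can actually reach the database. A minimal sketch using the same conf dictionary (it assumes the server allows connections from your Databricks workspace):

# Connectivity check: run a trivial query through the JDBC connection
test_df = (spark.read.format("jdbc")
           .options(**conf)
           .option("query", "SELECT 1 AS ok")
           .load())
test_df.show()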

4.4 Read CSV and Write to SQL Table

for p in files_paths:
    # Derive a SQL-safe table name from the file name (alphanumeric characters only)
    filename = os.path.splitext(os.path.basename(p))[0]
    tab = re.sub(r'[^a-zA-Z0-9]', '', filename)
    print("Importing data from", p)
    print("Storing in MSSQL table:", tab)

    # Read the CSV with a header row and inferred column types, then write it over JDBC
    df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(p)
    df.write.mode("overwrite").format("jdbc").options(**conf).option("dbtable", tab).save()
    df.show()

This reads each CSV file and imports it into a dynamically named MSSQL table.
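
To sanity-check a load, you can read the table back over JDBC and compare row counts. A minimal sketch, meant to run inside (or right after) the loop above, using the tab and df variables it defines:

# Read the table back and compare its row count with the source DataFrame
check = (spark.read.format("jdbc")
         .options(**conf)
         .option("dbtable", tab)
         .load())
print(tab, "-> rows in SQL:", check.count(), "| rows in CSV:", df.count())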


Step 5: Mount Another Storage Account

To mount another storage account (storageaccountqqqq), use:

storage_account_name = "storageaccountqqqq"
container_name = "input"
Accesskeys = "+cT3+AStwi6v7Q=="

# Mount the second account's input container at /mnt/wallmart
dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net",
    mount_point="/mnt/wallmart",
    extra_configs={f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": Accesskeys}
)

This mounts the input container from the second storage account to /mnt/wallmart.
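
When juggling multiple mounts, it is handy to list what is already mounted and to unmount a path before re-mounting it, since dbutils.fs.mount fails if the mount point already exists:

# List all current mount points and their sources
for m in dbutils.fs.mounts():
    print(m.mountPoint, "->", m.source)

# Unmount a path if you need to re-mount it with different settings
dbutils.fs.unmount("/mnt/wallmart")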


Conclusion

In this blog, we covered:

  • Creating an Azure Storage Account and Blob Container.
  • Retrieving Access Keys.
  • Mounting Blob Storage in Databricks.
  • Listing and filtering CSV files.
  • Importing CSV data into an MSSQL database.

By following these steps, you can effectively store and manage large-scale datasets in Azure and Databricks.

Comments

  1. Should have also covered enabling the hierarchical namespace option while creating the storage account, so that directories can be nested inside the container.
