🌐 Step-by-Step Guide to Creating and Accessing Azure Blob Storage in Databricks
Introduction
Azure Blob Storage is a scalable, cloud-based object storage solution for storing massive amounts of unstructured data. This blog will guide you step by step on how to create a container, generate access keys, mount the storage in Databricks, and process CSV files into an MSSQL database.
Step 1: Create an Azure Storage Account and Container
Before proceeding, you need an Azure Storage Account and a container.
1.1 Create a Storage Account
- Log in to the Azure Portal.
- Navigate to Storage Accounts.
- Click Create and provide the following details:
- Subscription: Choose your subscription.
- Resource Group: Create a new one or use an existing one.
- Storage Account Name: venu28nov2023
- Region: Select your preferred region.
- Performance: Standard.
- Replication: Locally Redundant Storage (LRS) for cost efficiency.
- Click Review + Create and wait for deployment.
1.2 Create a Container
- Go to your newly created storage account.
- Navigate to Containers under Data Storage.
- Click + Container, name it input, and select Private (no anonymous access).
- Click Create.
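If you prefer to script this step instead of clicking through the portal, the azure-storage-blob Python package can create the container for you. A minimal sketch, assuming the package is installed and you already have the account access key from Step 2 (the account URL matches the storage account created above):
from azure.storage.blob import BlobServiceClient

# Connect with the storage account key and create the "input" container
service = BlobServiceClient(
    account_url="https://venu28nov2023.blob.core.windows.net",
    credential="<your-access-key>"
)
service.create_container("input")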
Step 2: Generate Access Keys
To access the storage account, retrieve the access key:
- Navigate to Storage Account > Access Keys.
- Copy one of the keys (Primary or Secondary).
- Example:
Accesskeys = "tnd12MEQycM6I4EWg9qjRv3k+pFZL5Op1XwQSE0z5i0gGzX8J0jvh8w7iqq4uO9yAc7RLHa3+cSo+ASt335JdQ=="
Step 3: Mount Azure Blob Storage in Databricks
To mount the container in Databricks, run the following command:
# Storage account, container, and access key from Steps 1 and 2
storage_account_name = "venu28nov2023"
container_name = "input"
Accesskeys = "tnd12MEQycM6I4EWg9qjRv3k+pFZL5Op1XwQSE0z5i0gGzX8J0jvh8w7iqq4uO9yAc7RLHa3+cSo+ASt335JdQ=="

# Mount the container under /mnt/iotdata1 using the account key
dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net",
    mount_point="/mnt/iotdata1",
    extra_configs={f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": Accesskeys}
)
This mounts the input container to /mnt/iotdata1 in Databricks.
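To confirm the mount worked, you can list the active mount points and the container's contents:
# Quick sanity check after mounting
display(dbutils.fs.mounts())             # all active mount points
display(dbutils.fs.ls("/mnt/iotdata1"))  # files inside the mounted container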
Step 4: Process CSV Data
4.1 Define the Data Location
data = "/mnt/wallmart/csvdata"
Note: /mnt/wallmart is the mount point for the second storage account, created in Step 5 below; the CSV files used here sit in its csvdata folder.
4.2 List and Filter Files
import re
import os
files = dbutils.fs.ls(data)
files_paths = [f.path for f in files if f.path.endswith("us-500.csv")]
print(files_paths)
This filters CSV files that contain "us-500" in their name.
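If you instead want every CSV in the folder, a small variation of the same list comprehension works:
# Pick up all CSV files in the mounted folder, not just us-500.csv
files_paths = [f.path for f in files if f.path.endswith(".csv")]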
4.3 Define MSSQL Connection
host="jdbc:sqlserver://sampleserver111.database.windows.net:1433;databaseName=mssqldb;"
conf = {
"url": host,
"user": "chandra",
"password": "Mypassword.1"
}
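Before running the import loop, you can check that the connection details are correct with a one-row test query. A small sketch using Spark's JDBC query option (available in Spark 2.4 and later):
# Optional connectivity check: run a trivial query against the database
test_df = (spark.read.format("jdbc")
           .options(**conf)
           .option("query", "SELECT 1 AS ok")
           .load())
test_df.show()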
4.4 Read CSV and Write to SQL Table
for p in files_paths:
    # Derive the table name from the file name, keeping only letters and digits (us-500.csv -> us500)
    filename = os.path.splitext(os.path.basename(p))[0]
    tab = re.sub(r'[^a-zA-Z0-9]', '', filename)
    print("Importing data from", p)
    print("Storing in MSSQL table:", tab)

    # Read the CSV with a header row and inferred column types
    df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(p)

    # Write the DataFrame to MSSQL, replacing the table if it already exists
    df.write.mode("overwrite").format("jdbc").options(**conf).option("dbtable", tab).save()
    df.show()
This reads each CSV file and imports it into a dynamically named MSSQL table.
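To verify the load, you can read one of the freshly written tables back through the same JDBC configuration. The table name below assumes the us-500.csv file, which the loop above maps to a table called us500:
# Read the imported table back and count its rows
check_df = (spark.read.format("jdbc")
            .options(**conf)
            .option("dbtable", "us500")
            .load())
print(check_df.count(), "rows imported into us500")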
Step 5: Mount Another Storage Account
To mount another storage account (storageaccountqqqq), use:
storage_account_name = "storageaccountqqqq"
container_name = "input"
Accesskeys = "+cT3+AStwi6v7Q=="

# Mount this container under /mnt/wallmart using its account key
dbutils.fs.mount(
    source=f"wasbs://{container_name}@{storage_account_name}.blob.core.windows.net",
    mount_point="/mnt/wallmart",
    extra_configs={f"fs.azure.account.key.{storage_account_name}.blob.core.windows.net": Accesskeys}
)
This mounts the input container from the second storage account to /mnt/wallmart.
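Note that dbutils.fs.mount fails if the mount point already exists, so when re-running the notebook it helps to unmount first. A small sketch:
# Unmount /mnt/wallmart if it is already mounted, so the mount command above can be re-run safely
if any(m.mountPoint == "/mnt/wallmart" for m in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/wallmart")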
Conclusion
In this blog, we covered:
- Creating an Azure Storage Account and Blob Container.
- Retrieving Access Keys.
- Mounting Blob Storage in Databricks.
- Listing and filtering CSV files.
- Importing CSV data into an MSSQL database.
By following these steps, you can effectively store and manage large-scale datasets in Azure and Databricks.