🔥 Masking Aadhar Card Numbers and Email Addresses in PySpark

Introduction

Data privacy and security are crucial when handling sensitive information like Aadhar card numbers and email addresses. In this blog, we will:

  • Mask Aadhar card numbers while showing only the last four digits.
  • Mask email addresses differently based on their domain:
    • gmail.com: Show only the first character and mask the rest.
    • yahoo.com: Reverse the username.
    • hotmail.com: Replace vowels in the username with *.
  • All these transformations will be performed using PySpark functions.

1. Masking Aadhar Card Numbers

Step-by-Step Explanation

  1. Remove special characters: Use regexp_replace to remove hyphens and spaces.
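  2. Mask the number: Display only the last four digits and replace the preceding digits with *. The sample values below are 16 digits long, so the code uses 12 asterisks; a real Aadhar number has 12 digits, which would leave 8 to mask.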
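
The two steps above can be sanity-checked outside Spark with a small plain-Python sketch. The helper name mask_id_number is hypothetical (it is not a PySpark function), and it assumes the input contains only digits, hyphens, and spaces. Unlike a fixed run of 12 asterisks, it derives the number of asterisks from the length of the cleaned value.

```python
import re


def mask_id_number(raw: str, visible: int = 4) -> str:
    """Hypothetical helper: strip separators, then star out all but the last `visible` digits."""
    # Step 1: remove hyphens and spaces (same pattern the Spark code gives to regexp_replace)
    digits = re.sub(r"[- ]", "", raw)
    # Step 2: hide everything except the trailing `visible` characters
    hidden = max(len(digits) - visible, 0)
    return "*" * hidden + digits[hidden:]


print(mask_id_number("1234-5678-9012-3456"))  # ************3456 (16-digit sample value)
print(mask_id_number("1234 5678 9012"))       # ********9012 (12-digit value)
```

Computing the mask length from the input keeps the result correct whether the identifier has 12 or 16 digits.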

PySpark Code

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, lit, substring, concat

# Initialize Spark Session
spark = SparkSession.builder.master("local").appName("MaskAadharEmail").getOrCreate()

# Sample Data
data = [("1234-5678-9012-3456", "Venu@gmail.com"),
        ("2345 6789 0123 4567", "Satyapriya@yahoo.com"),
        ("3456-7890-1234-5678", "Anita.singh@hotmail.com")]

columns = ["AadharCardNumber", "Email"]
df = spark.createDataFrame(data, columns)

# Mask Aadhar Card Number
df = df.withColumn("AadharCardNumber", regexp_replace(col("AadharCardNumber"), "[- ]", "")) \
       .withColumn("AadharCardNumberStar", concat(lit("*" * 12), substring(col("AadharCardNumber"), -4, 4)))

df.select("AadharCardNumberStar").show(truncate=False)

Output:

+--------------------+
|AadharCardNumberStar|
+--------------------+
|************3456    |
|************4567    |
|************5678    |
+--------------------+

2. Masking Email Addresses

Requirements:

  • gmail.com: Show the first character and mask the rest with *.
  • yahoo.com: Reverse the username.
  • hotmail.com: Replace vowels in the username with *.

Step-by-Step Explanation:

  1. Split email into username and domain.
  2. Apply conditional transformations based on the domain.
  3. Reconstruct the email address after masking.
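
As a rough reference for what these three steps should produce, here is a plain-Python sketch of the same rules. The function name mask_email is hypothetical and not part of PySpark. It returns None for any domain without a rule, which mirrors how a Spark when() chain without otherwise() yields null.

```python
import re
from typing import Optional


def mask_email(email: str) -> Optional[str]:
    """Hypothetical reference for the three domain rules described in this post."""
    # Step 1: split into username and domain
    username, _, domain = email.partition("@")

    # Step 2: pick the masking rule by domain (case-sensitive comparison)
    if domain == "gmail.com":
        # keep the first character, star out the rest
        masked = username[:1] + "*" * (len(username) - 1)
    elif domain == "yahoo.com":
        # reverse the username
        masked = username[::-1]
    elif domain == "hotmail.com":
        # replace every vowel with *
        masked = re.sub(r"[aeiouAEIOU]", "*", username)
    else:
        # no rule for this domain
        return None

    # Step 3: reconstruct the address
    return f"{masked}@{domain}"


print(mask_email("Venu@gmail.com"))           # V***@gmail.com
print(mask_email("Satyapriya@yahoo.com"))     # ayirpaytaS@yahoo.com
print(mask_email("Anita.singh@hotmail.com"))  # *n*t*.s*ngh@hotmail.com
```

Note that none of the rules change letter case, so the capital letters in the sample usernames survive in the masked values.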

PySpark Code

# col, concat, lit, substring, and regexp_replace are already imported above
from pyspark.sql.functions import split, when, reverse, expr

# Step 1: Split Email into Username and Domain
df = df.withColumn("Username", split(col("Email"), "@")[0]) \
       .withColumn("Domain", concat(lit("@"), split(col("Email"), "@")[1]))

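# Step 2: Apply masking based on domain
# (the comparisons are case-sensitive; a domain not listed here gets a null
# MaskedUsername, and therefore a null MaskedEmail, since there is no otherwise() branch)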
df = df.withColumn("MaskedUsername",
                   when(col("Domain") == "@gmail.com",
                        concat(substring(col("Username"), 1, 1), expr("repeat('*', length(Username)-1)")))
                   .when(col("Domain") == "@yahoo.com", reverse(col("Username")))
                   .when(col("Domain") == "@hotmail.com",
                        regexp_replace(col("Username"), "[aeiouAEIOU]", "*")))

# Step 3: Reconstruct the masked email
df = df.withColumn("MaskedEmail", concat(col("MaskedUsername"), col("Domain"))) \
       .drop("Username", "Domain", "MaskedUsername")

df.select("MaskedEmail").show(truncate=False)

Output:

+-----------------------+
|MaskedEmail            |
+-----------------------+
|V***@gmail.com         |
|ayirpaytaS@yahoo.com   |
|*n*t*.s*ngh@hotmail.com|
+-----------------------+

3. Understanding the Transformations

3.1 Gmail.com

  • Transformation: Show the first character and mask the rest.
  • Example: Venu@gmail.com → V***@gmail.com
  • Functions Used: substring(), expr("repeat('*', n)"), concat()

3.2 Yahoo.com

  • Transformation: Reverse the username.
  • Example: Satyapriya@yahoo.com → ayirpaytaS@yahoo.com
  • Function Used: reverse()

3.3 Hotmail.com

  • Transformation: Replace vowels with *.
  • Example: Anita.singh@hotmail.com → *n*t*.s*ngh@hotmail.com
  • Function Used: regexp_replace()

4. Key Functions Explained

Function          | Description
------------------+---------------------------------------------
regexp_replace()  | Replaces characters based on regex patterns.
substring()       | Extracts part of a string.
concat()          | Combines multiple columns or strings.
reverse()         | Reverses the characters in a string.
expr()            | Executes SQL expressions in PySpark.
split()           | Splits a string by a delimiter.

5. Final Output

Original Email           | Masked Email             | AadharCardNumberStar
-------------------------+--------------------------+---------------------
Venu@gmail.com           | V***@gmail.com           | ************3456
Satyapriya@yahoo.com     | ayirpaytaS@yahoo.com     | ************4567
Anita.singh@hotmail.com  | *n*t*.s*ngh@hotmail.com  | ************5678

6. Conclusion

In this blog, we learned how to mask sensitive data like Aadhar card numbers and email addresses using PySpark. Such masking techniques are essential in data engineering for ensuring data privacy and compliance with data protection regulations.

Key Takeaways:

  • Always sanitize and mask sensitive information in datasets.
  • PySpark provides powerful functions like regexp_replace(), reverse(), and concat() for string manipulations.
  • Domain-specific email masking lets you apply a different privacy rule to each domain, as business requirements dictate.

💡 Explore more such PySpark transformations for efficient and secure data processing!


