🔥 Masking Aadhar Card Numbers and Email Addresses in PySpark
Introduction
Data privacy and security are crucial when handling sensitive information like Aadhar card numbers and email addresses. In this blog, we will:
- Mask Aadhar card numbers, showing only the last four digits.
- Mask email addresses differently based on their domain:
  - gmail.com: Show only the first character and mask the rest.
  - yahoo.com: Reverse the username.
  - hotmail.com: Replace the vowels in the username with *.
All of these transformations use built-in PySpark functions.
1. Masking Aadhar Card Numbers
Step-by-Step Explanation
- Remove special characters: Use regexp_replace to remove hyphens and spaces.
- Mask the number: Display only the last four digits and replace the first 12 digits with *.
PySpark Code
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_replace, lit, substring, concat

# Initialize Spark Session
spark = SparkSession.builder.master("local").appName("MaskAadharEmail").getOrCreate()

# Sample Data
data = [("1234-5678-9012-3456", "Venu@gmail.com"),
        ("2345 6789 0123 4567", "Satyapriya@yahoo.com"),
        ("3456-7890-1234-5678", "Anita.singh@hotmail.com")]
columns = ["AadharCardNumber", "Email"]
df = spark.createDataFrame(data, columns)

# Mask Aadhar Card Number: strip hyphens and spaces, then keep only the last four digits
df = df.withColumn("AadharCardNumber", regexp_replace(col("AadharCardNumber"), "[- ]", "")) \
       .withColumn("AadharCardNumberStar", concat(lit("*" * 12), substring(col("AadharCardNumber"), -4, 4)))
df.select("AadharCardNumberStar").show(truncate=False)
Output:
+--------------------+
|AadharCardNumberStar|
+--------------------+
|************3456    |
|************4567    |
|************5678    |
+--------------------+
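To sanity-check the expected values without starting a Spark session, the same two steps can be mirrored in plain Python. This is only a sketch: `mask_aadhar` and its `visible` parameter are hypothetical names, not part of the PySpark code above. It also makes one different design choice: it derives the mask length from the number of digits instead of hard-coding 12 stars, since real Aadhaar numbers have 12 digits while the sample data here uses 16.

```python
import re


def mask_aadhar(raw: str, visible: int = 4) -> str:
    """Strip hyphens and spaces, then star out all but the last `visible` digits.

    Hypothetical helper for illustration; it mirrors the regexp_replace +
    concat/substring steps, but sizes the mask from the input length.
    """
    digits = re.sub(r"[- ]", "", raw)
    if not digits.isdigit() or len(digits) <= visible:
        # Deliberately avoid echoing the raw value in the error message.
        raise ValueError("value is not a digit string longer than the visible suffix")
    return "*" * (len(digits) - visible) + digits[-visible:]


if __name__ == "__main__":
    for raw in ["1234-5678-9012-3456", "2345 6789 0123 4567", "3456-7890-1234-5678"]:
        print(mask_aadhar(raw))
```

For the three sample rows this yields the same strings as the `AadharCardNumberStar` column, which makes it a convenient reference when writing unit tests for the Spark job.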
2. Masking Email Addresses
Requirements:
- gmail.com: Show the first character and mask the rest with *.
- yahoo.com: Reverse the username.
- hotmail.com: Replace the vowels in the username with *.
Step-by-Step Explanation:
- Split email into username and domain.
- Apply conditional transformations based on the domain.
- Reconstruct the email address after masking.
PySpark Code
from pyspark.sql.functions import col, concat, expr, lit, regexp_replace, reverse, split, substring, when

# Step 1: Split Email into Username and Domain
df = df.withColumn("Username", split(col("Email"), "@")[0]) \
       .withColumn("Domain", concat(lit("@"), split(col("Email"), "@")[1]))

# Step 2: Apply masking based on domain
# (there is no otherwise() branch, so any other domain yields a null MaskedUsername)
df = df.withColumn("MaskedUsername",
    when(col("Domain") == "@gmail.com",
         concat(substring(col("Username"), 1, 1), expr("repeat('*', length(Username)-1)")))
    .when(col("Domain") == "@yahoo.com", reverse(col("Username")))
    .when(col("Domain") == "@hotmail.com",
          regexp_replace(col("Username"), "[aeiouAEIOU]", "*")))

# Step 3: Reconstruct the masked email
df = df.withColumn("MaskedEmail", concat(col("MaskedUsername"), col("Domain"))) \
       .drop("Username", "Domain", "MaskedUsername")
df.select("MaskedEmail").show(truncate=False)
Output:
+-----------------------+
|MaskedEmail            |
+-----------------------+
|V***@gmail.com         |
|ayirpaytaS@yahoo.com   |
|*n*t*.s*ngh@hotmail.com|
+-----------------------+
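The three domain rules can also be traced by hand with a plain-Python mirror of the `when` chain. This is a minimal sketch, assuming each address contains exactly one "@"; `mask_email` is a hypothetical helper name. It returns `None` for any other domain, which corresponds to the null that a `when` chain without `otherwise()` produces in Spark.

```python
import re
from typing import Optional


def mask_email(email: str) -> Optional[str]:
    """Apply the per-domain masking rule to the username part of an email.

    Hypothetical helper for illustration; assumes exactly one "@" per address.
    """
    username, _, domain = email.partition("@")
    if domain == "gmail.com":
        # Keep the first character, star out the remaining ones.
        masked = username[:1] + "*" * (len(username) - 1)
    elif domain == "yahoo.com":
        # Reverse the username; character case is preserved.
        masked = username[::-1]
    elif domain == "hotmail.com":
        # Replace every vowel, upper or lower case, with a star.
        masked = re.sub(r"[aeiouAEIOU]", "*", username)
    else:
        # No rule for this domain, like a when() chain with no otherwise().
        return None
    return f"{masked}@{domain}"


if __name__ == "__main__":
    for email in ["Venu@gmail.com", "Satyapriya@yahoo.com", "Anita.singh@hotmail.com"]:
        print(mask_email(email))
```

Note that the domain comparison is case-sensitive, just as in the PySpark version, so an address such as `Venu@Gmail.com` would not match any rule.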
3. Understanding the Transformations
3.1 Gmail.com
- Transformation: Show the first character and mask the rest.
- Example: Venu@gmail.com → V***@gmail.com
- Functions Used: substring(), expr("repeat('*', n)"), concat()
3.2 Yahoo.com
- Transformation: Reverse the username.
- Example: Satyapriya@yahoo.com → ayirpaytaS@yahoo.com
- Function Used: reverse()
3.3 Hotmail.com
- Transformation: Replace vowels with *.
- Example: Anita.singh@hotmail.com → *n*t*.s*ngh@hotmail.com
- Function Used: regexp_replace()
4. Key Functions Explained
Function | Description |
---|---|
regexp_replace() | Replaces characters based on regex patterns. |
substring() | Extracts part of a string. |
concat() | Combines multiple columns or strings. |
reverse() | Reverses the characters in a string. |
expr() | Executes SQL expressions in PySpark. |
split() | Splits a string by a delimiter. |
5. Final Output
Original Email | Masked Email | AadharCardNumberStar |
---|---|---|
Venu@gmail.com | V***@gmail.com | ************3456 |
Satyapriya@yahoo.com | ayirpaytaS@yahoo.com | ************4567 |
Anita.singh@hotmail.com | *n*t*.s*ngh@hotmail.com | ************5678 |
6. Conclusion
In this blog, we learned how to mask sensitive data like Aadhar card numbers and email addresses using PySpark. Such masking techniques are essential in data engineering for ensuring data privacy and compliance with data protection regulations.
✅ Key Takeaways:
- Always sanitize and mask sensitive information in datasets.
- PySpark provides powerful functions like regexp_replace(), reverse(), and concat() for string manipulations.
- Domain-specific email masking ensures customized privacy based on business rules.
💡 Explore more such PySpark transformations for efficient and secure data processing!