🔥 Mastering PySpark: Step-by-Step Email Masking with Detailed Explanations
When dealing with sensitive data, it's crucial to mask personally identifiable information (PII) like email addresses. In this blog, we’ll explore a step-by-step breakdown of how to mask email usernames based on domain-specific logic using PySpark.
🚀 Importing Essential PySpark Functions
from pyspark.sql.functions import col, when, substring, concat, regexp_replace, reverse
🔍 Explanation:
- col: Refers to DataFrame columns in transformations.
- when: Implements conditional logic similar to SQL's CASE WHEN.
- substring: Extracts portions of a string based on a 1-based starting index and a length.
- concat: Concatenates multiple columns or string values into one.
- regexp_replace: Replaces characters in a string using regular expressions.
- reverse: Reverses the characters in a given string.
💻 The Core Transformation Code
df_final = df_step1.withColumn(
    "MaskedUsername",
    when(col("Domain") == "@gmail.com",
         concat(
             substring(col("Username"), 1, 1),  # Keep the first character
             regexp_replace(substring(col("Username"), 2, 100), ".", "*")  # Replace rest with '*'
         )
    )
    .when(col("Domain") == "@yahoo.com", reverse(col("Username")))  # Reverse the username
    .when(col("Domain") == "@hotmail.com",
          regexp_replace(col("Username"), "[aeiouAEIOU]", "*"))  # Mask vowels
    # No .otherwise(): any other domain gets a null MaskedUsername, and so a null MaskedEmail
).withColumn(
    "MaskedEmail",
    concat(col("MaskedUsername"), col("Domain"))  # Re-attach the original domain
).drop("Username", "Domain", "MaskedUsername")
df_final.show(truncate=False)
🔎 Detailed Line-by-Line Explanation
1️⃣ Adding the MaskedUsername Column
✔ For @gmail.com Domains:
when(col("Domain") == "@gmail.com",
     concat(
         substring(col("Username"), 1, 1),
         regexp_replace(substring(col("Username"), 2, 100), ".", "*")
     )
)
- Condition: Checks if the domain equals @gmail.com.
- substring(col("Username"), 1, 1): Extracts the first character of the username (1-based index).
- substring(col("Username"), 2, 100): Extracts the remaining characters starting from the second character. The number 100 is an arbitrarily large length chosen so that all subsequent characters are included; a username longer than 101 characters would be cut off.
- regexp_replace(..., ".", "*"): Replaces each character in the extracted substring with *, because the pattern . matches any single character.
- concat(...): Combines the first character with the masked substring.
✨ Example:
- Input: Venu@gmail.com
- MaskedUsername: V***
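To sanity-check the expected output without a Spark session, the Gmail rule can be mirrored in plain Python. This is only an illustrative sketch: mask_keep_first is a hypothetical helper name, and Python's re module stands in for Spark's regex engine, which treats this simple pattern the same way.

```python
import re

def mask_keep_first(username: str) -> str:
    """Plain-Python mirror of the Gmail rule (hypothetical helper, not a PySpark API)."""
    first = username[:1]    # like substring(col, 1, 1): the first character
    rest = username[1:101]  # like substring(col, 2, 100): up to 100 characters from position 2
    return first + re.sub(".", "*", rest)  # like regexp_replace(rest, ".", "*")

print(mask_keep_first("Venu"))  # V***
```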
✔ For @yahoo.com Domains:
.when(col("Domain") == "@yahoo.com", reverse(col("Username")))
- Condition: Checks if the domain equals @yahoo.com.
- reverse(col("Username")): Reverses the entire username string.
✨ Example:
- Input: Satyapriya@yahoo.com
- MaskedUsername: ayirpaytaS
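The reversal is easy to verify in plain Python. The sketch below uses a hypothetical helper, reverse_username, and also shows that applying the rule a second time restores the original username, which is worth keeping in mind when deciding how much protection this rule offers.

```python
def reverse_username(username: str) -> str:
    """Plain-Python mirror of reverse(col("Username")) (hypothetical helper)."""
    return username[::-1]

masked = reverse_username("Satyapriya")
print(masked)                    # ayirpaytaS
print(reverse_username(masked))  # Satyapriya
```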
✔ For @hotmail.com Domains:
.when(col("Domain") == "@hotmail.com",
      regexp_replace(col("Username"), "[aeiouAEIOU]", "*"))
- Condition: Checks if the domain equals @hotmail.com.
- regexp_replace(col("Username"), "[aeiouAEIOU]", "*"):
  - Uses a regular expression to find all vowels (a, e, i, o, u in both lowercase and uppercase).
  - Replaces each vowel with *.
✨ Example:
- Input: Anita.singh@hotmail.com
- MaskedUsername: *n*t*.s*ngh
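The vowel rule uses a plain character class, so the same pattern can be tried in Python before running it on a cluster. The helper name mask_vowels is hypothetical and exists only for this sketch.

```python
import re

def mask_vowels(username: str) -> str:
    """Plain-Python mirror of regexp_replace(col, "[aeiouAEIOU]", "*") (hypothetical helper)."""
    return re.sub("[aeiouAEIOU]", "*", username)

print(mask_vowels("Anita.singh"))  # *n*t*.s*ngh
```

Consonants, digits, and punctuation such as the dot are left untouched, which is why the masked value keeps its original length and shape.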
2️⃣ Creating the MaskedEmail Column
.withColumn(
    "MaskedEmail",
    concat(col("MaskedUsername"), col("Domain"))
)
- Purpose: Combines the masked username with the original domain to reconstruct the masked email address.
✨ Example:
- For Gmail: V***@gmail.com
- For Yahoo: ayirpaytaS@yahoo.com
- For Hotmail: *n*t*.s*ngh@hotmail.com
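Putting the three branches together, the whole when() chain plus the concat step can be approximated by a small rule table in plain Python. This is a sketch for reasoning about expected results, not the PySpark implementation: RULES and mask_email are hypothetical names, and returning None stands in for the null that Spark produces when no when() branch matches and there is no otherwise().

```python
import re
from typing import Optional

# Hypothetical rule table: one masking function per domain, mirroring the when() branches.
RULES = {
    "@gmail.com": lambda u: u[:1] + re.sub(".", "*", u[1:101]),
    "@yahoo.com": lambda u: u[::-1],
    "@hotmail.com": lambda u: re.sub("[aeiouAEIOU]", "*", u),
}

def mask_email(username: str, domain: str) -> Optional[str]:
    rule = RULES.get(domain)
    if rule is None:
        # Mirrors Spark: concat() with a null MaskedUsername yields a null MaskedEmail.
        return None
    return rule(username) + domain

samples = [
    ("Venu", "@gmail.com"),
    ("Satyapriya", "@yahoo.com"),
    ("Anita.singh", "@hotmail.com"),
    ("Someone", "@example.com"),  # illustrative domain with no rule
]
for user, dom in samples:
    print(mask_email(user, dom))
```

The last sample shows the case the original chain does not handle: a domain without a rule ends up with no masked email at all.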
3️⃣ Cleaning Up Unnecessary Columns
.drop("Username", "Domain", "MaskedUsername")
- Purpose: Removes intermediate columns that are no longer required, leaving only the essential information.
🎯 Final Output:
+-------------------+------------------------------+
|AadharCardNumber   |MaskedEmail                   |
+-------------------+------------------------------+
|1234-5678-9012-3456|V***@gmail.com                |
|2345 6789 0123 4567|ayirpaytaS@yahoo.com          |
|3456-7890-1234-5678|*n*t*.s*ngh@hotmail.com       |
+-------------------+------------------------------+
💡 Key Takeaways:
- The substring() function extracts specific parts of strings, where 100 is chosen as a safe upper limit.
- The regexp_replace() function handles powerful string manipulations like masking specific characters.
- Conditional masking is implemented efficiently using when(), making the transformations adaptable based on domain types.
- The reverse() function provides quick transformations for reversing strings when required.
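The fixed length of 100 works for realistic usernames, but it is still a magic number. One possible alternative, shown here only as a plain-Python sketch, is a single regular expression with a lookbehind that masks every character except the first, with no length limit. The helper name mask_after_first is hypothetical. Java's regex engine, which Spark uses, also supports this lookbehind syntax, so the same pattern should carry over to regexp_replace(col("Username"), "(?<=.).", "*"), but only the Python behavior is verified here.

```python
import re

def mask_after_first(username: str) -> str:
    """Mask every character except the first, without a fixed length (hypothetical helper)."""
    # (?<=.) is a lookbehind: a character matches only if some character precedes it,
    # so the first character is left alone and every later one becomes '*'.
    return re.sub(r"(?<=.).", "*", username)

print(mask_after_first("Venu"))  # V***
```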
✨ Final Thoughts
This approach to email masking using PySpark is highly adaptable: each domain gets its own rule, and supporting a new domain only takes another when() branch. By leveraging these transformations, you can obscure sensitive information while still maintaining the utility of the data for analytical purposes.