🔥 Mastering PySpark: Step-by-Step Email Masking with Detailed Explanations

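When dealing with sensitive data, it's crucial to mask personally identifiable information (PII) like email addresses. In this post, we'll walk step by step through masking email usernames with domain-specific logic in PySpark. The code assumes a DataFrame named df_step1 that already has separate Username and Domain columns, with the Domain value including the leading @ (for example, @gmail.com).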


🚀 Importing Essential PySpark Functions

from pyspark.sql.functions import col, when, substring, concat, regexp_replace, reverse

🔍 Explanation:

  • col: Refers to DataFrame columns in transformations.
  • when: Implements conditional logic similar to SQL's CASE WHEN.
  • substring: Extracts portions of a string based on starting index and length.
  • concat: Concatenates multiple columns or string values into one.
  • regexp_replace: Replaces characters in a string using regular expressions.
  • reverse: Reverses the characters in a given string.
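Of these functions, substring is the one most likely to trip up Python users: it takes a 1-based start position and a length, not a 0-based start and an end index. A minimal plain-Python sketch of that indexing, for intuition only. The helper name spark_style_substring is hypothetical, it is not part of PySpark, and it only covers positive start positions:

```python
def spark_style_substring(text: str, pos: int, length: int) -> str:
    """Hypothetical helper mimicking substring(col, pos, len) for pos >= 1."""
    if pos < 1:
        raise ValueError("this sketch only covers 1-based, positive positions")
    start = pos - 1                      # convert 1-based position to 0-based index
    return text[start:start + length]    # slicing stops at the end of the string


print(spark_style_substring("Venu", 1, 1))    # V
print(spark_style_substring("Venu", 2, 100))  # enu
```

The second call shows why an oversized length is harmless for short strings: the slice simply ends where the string does.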

💻 The Core Transformation Code

df_final = df_step1.withColumn(
    "MaskedUsername",
    when(col("Domain") == "@gmail.com",
         concat(
             substring(col("Username"), 1, 1),  # Keep the first character
             regexp_replace(substring(col("Username"), 2, 100), ".", "*")  # Replace rest with '*'
         )
    )
    .when(col("Domain") == "@yahoo.com", reverse(col("Username")))
    .when(col("Domain") == "@hotmail.com",
          regexp_replace(col("Username"), "[aeiouAEIOU]", "*"))
).withColumn(
    "MaskedEmail",
    concat(col("MaskedUsername"), col("Domain"))
).drop("Username", "Domain", "MaskedUsername")

df_final.show(truncate=False)

🔎 Detailed Line-by-Line Explanation

1️⃣ Adding the MaskedUsername Column

For @gmail.com Domains:
when(col("Domain") == "@gmail.com",
     concat(
         substring(col("Username"), 1, 1),
         regexp_replace(substring(col("Username"), 2, 100), ".", "*")
     )
)
  • Condition: Checks if the domain equals @gmail.com.
  • substring(col("Username"), 1, 1): Extracts the first character of the username (1-based index).
  • substring(col("Username"), 2, 100): Extracts up to 100 characters starting from the second character. The 100 is an arbitrary cap that covers typical usernames; a username longer than 101 characters would lose everything past that point, so raise the length if longer values are possible.
  • regexp_replace(..., ".", "*"): Replaces each character in the extracted substring with *.
  • concat(...): Combines the first character with the masked substring.
Example:
  • Input: Venu@gmail.com
  • MaskedUsername: V***
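To sanity-check the expected value without a Spark session, the Gmail rule can be mirrored in plain Python. This is only a sketch of the same logic, not PySpark code; the function name mask_gmail_username and the tail_length parameter are made up for illustration:

```python
import re


def mask_gmail_username(username: str, tail_length: int = 100) -> str:
    """Keep the first character and star out up to tail_length that follow."""
    first = username[:1]                   # like substring(Username, 1, 1)
    tail = username[1:1 + tail_length]     # like substring(Username, 2, 100)
    return first + re.sub(".", "*", tail)  # "." matches each remaining character


print(mask_gmail_username("Venu"))            # V***
print(len(mask_gmail_username("a" * 150)))    # 101
```

The second call shows the effect of the length cap: a 150-character username comes back as 101 characters, because everything after the cap is cut off rather than masked.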

For @yahoo.com Domains:
.when(col("Domain") == "@yahoo.com", reverse(col("Username")))
  • Condition: Checks if the domain equals @yahoo.com.
  • reverse(col("Username")): Reverses the entire username string.
Example:
  • Input: Satyapriya@yahoo.com
  • MaskedUsername: ayirpaytaS
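The same check for the Yahoo rule, again as a plain-Python mirror rather than Spark code (reverse_username is a hypothetical name). It also shows that applying the rule twice returns the original value:

```python
def reverse_username(username: str) -> str:
    """Reverse the characters, like reverse(col("Username"))."""
    return username[::-1]


masked = reverse_username("Satyapriya")
print(masked)                    # ayirpaytaS
print(reverse_username(masked))  # Satyapriya
```

Since the transformation is its own inverse, it hides the username only from a casual glance.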

For @hotmail.com Domains:
.when(col("Domain") == "@hotmail.com",
      regexp_replace(col("Username"), "[aeiouAEIOU]", "*"))
  • Condition: Checks if the domain equals @hotmail.com.
  • regexp_replace(col("Username"), "[aeiouAEIOU]", "*"):
    • Uses a regular expression to find all vowels (a, e, i, o, u in both lowercase and uppercase).
    • Replaces each vowel with *.
Example:
  • Input: Anita.singh@hotmail.com
  • MaskedUsername: *n*t*.s*ngh
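The vowel rule uses the same character class in Python's re module, so the expected value is easy to confirm outside Spark. A small sketch, with mask_vowels as a hypothetical name:

```python
import re

VOWELS = re.compile("[aeiouAEIOU]")  # same character class as the Spark call


def mask_vowels(username: str) -> str:
    """Replace every vowel with '*', leaving consonants and punctuation alone."""
    return VOWELS.sub("*", username)


print(mask_vowels("Anita.singh"))  # *n*t*.s*ngh
```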

2️⃣ Creating the MaskedEmail Column

.withColumn(
    "MaskedEmail",
    concat(col("MaskedUsername"), col("Domain"))
)
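  • Purpose: Combines the masked username with the original domain to reconstruct the masked email address. Because the when() chain has no otherwise() branch, a row whose domain matches none of the three conditions gets a null MaskedUsername, and concat() then returns a null MaskedEmail.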
Example:
  • For Gmail: V***@gmail.com
  • For Yahoo: ayirpaytaS@yahoo.com
  • For Hotmail: *n*t*.s*ngh@hotmail.com
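Putting the pieces together, here is a plain-Python sketch of the whole per-row logic, useful as a reference when writing unit tests for the Spark job. The names RULES and masked_email are assumptions for illustration. Returning None for an unlisted domain mirrors what Spark does here: a when() chain without otherwise() yields null, and concat() returns null when any input is null.

```python
import re
from typing import Callable, Dict, Optional

# One masking rule per domain, mirroring the three when() branches.
RULES: Dict[str, Callable[[str], str]] = {
    "@gmail.com": lambda u: u[:1] + re.sub(".", "*", u[1:101]),
    "@yahoo.com": lambda u: u[::-1],
    "@hotmail.com": lambda u: re.sub("[aeiouAEIOU]", "*", u),
}


def masked_email(username: str, domain: str) -> Optional[str]:
    """Return the masked email, or None when no rule matches the domain."""
    rule = RULES.get(domain)
    if rule is None:
        return None  # mirrors null MaskedUsername -> null MaskedEmail
    return rule(username) + domain


print(masked_email("Venu", "@gmail.com"))            # V***@gmail.com
print(masked_email("Anita.singh", "@hotmail.com"))   # *n*t*.s*ngh@hotmail.com
print(masked_email("someone", "@outlook.com"))       # None
```

If unmatched domains should keep some value instead of becoming null, the Spark version would need an explicit otherwise() branch.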

3️⃣ Cleaning Up Unnecessary Columns

.drop("Username", "Domain", "MaskedUsername")
  • Purpose: Removes intermediate columns that are no longer required, leaving only the essential information.

🎯 Final Output:

+-------------------+-----------------------+
|AadharCardNumber   |MaskedEmail            |
+-------------------+-----------------------+
|1234-5678-9012-3456|V***@gmail.com         |
|2345 6789 0123 4567|ayirpaytaS@yahoo.com   |
|3456-7890-1234-5678|*n*t*.s*ngh@hotmail.com|
+-------------------+-----------------------+

💡 Key Takeaways:

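  • The substring() function extracts specific parts of strings using a 1-based start position and a length; the 100 used here is an arbitrary cap, so usernames longer than 101 characters would be truncated.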
  • The regexp_replace() function handles powerful string manipulations like masking specific characters.
  • Conditional masking is implemented efficiently using when(), making the transformations adaptable based on domain types.
  • The reverse() function provides quick transformations for reversing strings when required.

Final Thoughts

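This approach to email masking with PySpark is compact and easy to adapt to new domains. The three rules differ in how much they hide, though: the Gmail rule keeps only the first character, the Hotmail rule leaves every consonant readable, and the Yahoo rule can be undone by reversing the string again. Pick the rule that matches how sensitive the data is, so the information stays protected while remaining useful for analysis.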

