🔥Spark Date Functions

Introduction

When reading CSV files, Apache Spark treats every column as a string by default, even if the file contains integers, dates, and other types. Adding option("inferSchema", "true") lets Spark detect proper data types while reading. However, even with schema inference, Spark typically recognizes a date column only when its values follow the yyyy-MM-dd pattern; anything else remains a string.

Converting Strings to Date Format

If your date data is not in the yyyy-MM-dd format, you must first convert it to a proper date type before performing operations such as date arithmetic or date differences.

Example Dataset

Download the dataset from:

Creating a DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, coalesce, to_date, date_add, date_sub, date_format, current_date, current_timestamp

# Initialize Spark Session
spark = SparkSession.builder.master("local[2]").appName("test").getOrCreate()

# Load data
data = "/FileStore/tables/donations-2.csv"
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(data)

# Show initial data
df.show()

Spark Date Functions

1. to_date() Function

The to_date() function converts a string column to Spark's DateType, optionally using an explicit format pattern.

Syntax:

to_date(date_column)
to_date(date_column, format)

Example:

res = df.withColumn("hiredate", to_date(col("dt"), "dd-MMM-yyyy"))

Handling Multiple Date Formats Using coalesce():

When a column mixes several date formats, you can try each format in turn and let coalesce() keep the first non-null result. Note that this helper is a plain Python function that builds a native column expression; it is not a Spark UDF, so it runs without UDF serialization overhead.

possible_formats = ["dd-MM-yyyy", "dd-MMM-yyyy", "dd/MMM/yyyy", "d-MMM-yyyy", 
                    "dd/MM/yyyy", "yyyy-MMM-d", "yyyy-MM-dd"]

def dateto(col_name, formats=possible_formats):
    return coalesce(*[to_date(col(col_name), f) for f in formats])

res = df.withColumn("hiredate", dateto("dt")).drop("dt")
res.show()
res.printSchema()

2. Adding and Subtracting Days (date_add, date_sub)

These functions add or subtract a specified number of days from a given date.

Syntax:

date_add(start_date, num_days)
date_sub(start_date, num_days)

Example:

res = (res.withColumn("probation_upto", date_add(col("hiredate"), 100))
          .withColumn("100daysbefore", date_add(col("hiredate"), -100))
          .withColumn("before100", date_sub(col("hiredate"), 100)))
res.show()

3. Formatting Dates (date_format)

The date_format() function renders a date as a string in the specified pattern.

Syntax:

date_format(date_column, format)

Example:

res = res.withColumn("formatted_date", date_format(col("hiredate"), "MMM/dd/yyyy"))
res.show()

4. Current Date and Timestamp (current_date(), current_timestamp())

These functions return the current date and timestamp.

Syntax:

current_date()
current_timestamp()

Example:

res = (res.withColumn("today", current_date())
          .withColumn("ts", current_timestamp()))
res.show()

Conclusion

Understanding and effectively using Spark's date functions is essential for processing time-based data efficiently. These functions help in converting strings to dates, shifting dates forward and backward, computing differences, and formatting dates for display in Spark DataFrames.

For further details, visit the official Spark documentation.
