🔥Spark Date Functions

Spark Date Functions

Introduction

Apache Spark, by default, considers all data as strings. Even if a dataset contains various data types like integers, dates, and characters, Spark initially treats everything as a string. To enable proper data type recognition, we use option("inferSchema","true") when reading CSV files. However, even with schema inference, Spark may not recognize date formats unless they follow the yyyy-MM-dd pattern.

Converting Strings to Date Format

If your date data is not in the yyyy-MM-dd format, you must convert it to the correct format to perform operations like date differences.

Example Dataset

Download the dataset from:

Creating a DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, coalesce, to_date, date_add, date_sub, date_format, current_date, current_timestamp

# Initialize Spark Session
spark = SparkSession.builder.master("local[2]").appName("test").getOrCreate()

# Load data
data = "/FileStore/tables/donations-2.csv"
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(data)

# Show initial data
df.show()

Spark Date Functions

1. to_date() Function

The to_date() function converts string-based dates to an actual date format.

Syntax:

to_date(date_column)
to_date(date_column, format)

Example:

res = df.withColumn("hiredate", to_date(col("dt"), "dd-MMM-yyyy"))

Handling Multiple Date Formats Using UDF:

possible_formats = ["dd-MM-yyyy", "dd-MMM-yyyy", "dd/MMM/yyyy", "d-MMM-yyyy", 
                    "dd/MM/yyyy", "yyyy-MMM-d", "yyyy-MM-dd"]

def dateto(col_name, formats=possible_formats):
    return coalesce(*[to_date(col(col_name), f) for f in formats])

res = df.withColumn("hiredate", dateto("dt")).drop("dt")
res.show()
res.printSchema()

2. Adding and Subtracting Days (date_add, date_sub)

These functions add or subtract a specified number of days from a given date.

Syntax:

date_add(start_date, num_days)
date_sub(start_date, num_days)

Example:

res = res.withColumn("prohibision_upto", date_add(col("hiredate"), 100))
         .withColumn("100daysbefore", date_add(col("hiredate"), -100))
         .withColumn("before100", date_sub(col("hiredate"), 100))
res.show()

3. Formatting Dates (date_format)

The date_format function converts a date into a specified format.

Syntax:

date_format(date_column, format)

Example:

res = res.withColumn("formatted_date", date_format(col("hiredate"), "MMM/dd/yyyy"))
res.show()

4. Current Date and Timestamp (current_date(), current_timestamp())

These functions return the current date and timestamp.

Syntax:

current_date()
current_timestamp()

Example:

res = res.withColumn("today", current_date())
         .withColumn("ts", current_timestamp())
res.show()

Conclusion

Understanding and effectively using Spark's date functions is essential for processing time-based data efficiently. These functions help in converting, formatting, calculating differences, and managing time zones in Spark DataFrames.

For further details, visit the official Spark documentation.

Comments