🔥Spark Date Functions

February 26, 2025

Spark Date Functions

Introduction

Apache Spark, by default, considers all data as strings. Even if a dataset contains various data types like integers, dates, and characters, Spark initially treats everything as a string. To enable proper data type recognition, we use `option("inferSchema","true")` when reading CSV files. However, even with schema inference, Spark may not recognize date formats unless they follow the `yyyy-MM-dd` pattern.

Converting Strings to Date Format

If your date data is not in the `yyyy-MM-dd` format, you must convert it to the correct format to perform operations like date differences.

Example Dataset

Download the dataset from:

Google Drive

Databricks Cloud

Creating a DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, coalesce, to_date, date_add, date_sub, date_format, current_date, current_timestamp

# Initialize Spark Session
spark = SparkSession.builder.master("local[2]").appName("test").getOrCreate()

# Load data
data = "/FileStore/tables/donations-2.csv"
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(data)

# Show initial data
df.show()

Spark Date Functions

1. `to_date()` Function

The `to_date()` function converts string-based dates to an actual date format.

Syntax:

to_date(date_column)
to_date(date_column, format)

Example:

res = df.withColumn("hiredate", to_date(col("dt"), "dd-MMM-yyyy"))

Handling Multiple Date Formats Using UDF:

possible_formats = ["dd-MM-yyyy", "dd-MMM-yyyy", "dd/MMM/yyyy", "d-MMM-yyyy", 
                    "dd/MM/yyyy", "yyyy-MMM-d", "yyyy-MM-dd"]

def dateto(col_name, formats=possible_formats):
    return coalesce(*[to_date(col(col_name), f) for f in formats])

res = df.withColumn("hiredate", dateto("dt")).drop("dt")
res.show()
res.printSchema()

2. Adding and Subtracting Days (`date_add`, `date_sub`)

These functions add or subtract a specified number of days from a given date.

Syntax:

date_add(start_date, num_days)
date_sub(start_date, num_days)

Example:

res = res.withColumn("prohibision_upto", date_add(col("hiredate"), 100))
         .withColumn("100daysbefore", date_add(col("hiredate"), -100))
         .withColumn("before100", date_sub(col("hiredate"), 100))
res.show()

3. Formatting Dates (`date_format`)

The `date_format` function converts a date into a specified format.

Syntax:

date_format(date_column, format)

Example:

res = res.withColumn("formatted_date", date_format(col("hiredate"), "MMM/dd/yyyy"))
res.show()

4. Current Date and Timestamp (`current_date()`, `current_timestamp()`)

These functions return the current date and timestamp.

Syntax:

current_date()
current_timestamp()

Example:

res = res.withColumn("today", current_date())
         .withColumn("ts", current_timestamp())
res.show()

Conclusion

Understanding and effectively using Spark's date functions is essential for processing time-based data efficiently. These functions help in converting, formatting, calculating differences, and managing time zones in Spark DataFrames.

For further details, visit the official Spark documentation.

🔥Spark Date Functions

Spark Date Functions

Introduction

Converting Strings to Date Format

If your date data is not in the yyyy-MM-dd format, you must convert it to the correct format to perform operations like date differences.

Example Dataset

Download the dataset from: Google Drive Databricks Cloud

Creating a DataFrame

Spark Date Functions

1. to_date() Function

The to_date() function converts string-based dates to an actual date format.

Syntax:

to_date(date_column) to_date(date_column, format)

Example:

res = df.withColumn("hiredate", to_date(col("dt"), "dd-MMM-yyyy"))

Handling Multiple Date Formats Using UDF:

2. Adding and Subtracting Days (date_add, date_sub)

These functions add or subtract a specified number of days from a given date.

Syntax:

date_add(start_date, num_days) date_sub(start_date, num_days)

Example:

res = res.withColumn("prohibision_upto", date_add(col("hiredate"), 100)) .withColumn("100daysbefore", date_add(col("hiredate"), -100)) .withColumn("before100", date_sub(col("hiredate"), 100)) res.show()

3. Formatting Dates (date_format)

The date_format function converts a date into a specified format.

Syntax:

date_format(date_column, format)

Example:

res = res.withColumn("formatted_date", date_format(col("hiredate"), "MMM/dd/yyyy")) res.show()

4. Current Date and Timestamp (current_date(), current_timestamp())

These functions return the current date and timestamp.

Syntax:

current_date() current_timestamp()

Example:

res = res.withColumn("today", current_date()) .withColumn("ts", current_timestamp()) res.show()

Conclusion

Understanding and effectively using Spark's date functions is essential for processing time-based data efficiently. These functions help in converting, formatting, calculating differences, and managing time zones in Spark DataFrames. For further details, visit the official Spark documentation.

Comments

Post a Comment

Popular posts from this blog

🔥Apache Spark Architecture with RDD & DAG

🌐Filtering and Copying Files Dynamically in Azure Data Factory (ADF)

🌐End-to-End ETL Pipeline: MS SQL to MS SQL Using Azure Databricks

If your date data is not in the `yyyy-MM-dd` format, you must convert it to the correct format to perform operations like date differences.

Download the dataset from:

Google Drive

Databricks Cloud

1. `to_date()` Function

The `to_date()` function converts string-based dates to an actual date format.

`to_date(date_column) to_date(date_column, format)`

`res = df.withColumn("hiredate", to_date(col("dt"), "dd-MMM-yyyy"))`

2. Adding and Subtracting Days (`date_add`, `date_sub`)

`date_add(start_date, num_days) date_sub(start_date, num_days)`

`res = res.withColumn("prohibision_upto", date_add(col("hiredate"), 100)) .withColumn("100daysbefore", date_add(col("hiredate"), -100)) .withColumn("before100", date_sub(col("hiredate"), 100)) res.show()`

3. Formatting Dates (`date_format`)

The `date_format` function converts a date into a specified format.

`date_format(date_column, format)`

`res = res.withColumn("formatted_date", date_format(col("hiredate"), "MMM/dd/yyyy")) res.show()`

4. Current Date and Timestamp (`current_date()`, `current_timestamp()`)

`current_date() current_timestamp()`

`res = res.withColumn("today", current_date()) .withColumn("ts", current_timestamp()) res.show()`

Understanding and effectively using Spark's date functions is essential for processing time-based data efficiently. These functions help in converting, formatting, calculating differences, and managing time zones in Spark DataFrames.

For further details, visit the official Spark documentation.