Spark Date Functions
Introduction
By default, Spark reads every CSV column as a string, even when the data contains integers, dates, or other types. To enable proper data type recognition, we use option("inferSchema", "true") when reading CSV files. However, even with schema inference, Spark only recognizes date values that follow the yyyy-MM-dd pattern; anything else remains a string.
Converting Strings to Date Format
If your date data is not in the yyyy-MM-dd format, you must convert it to a proper date type before you can perform operations such as date differences.
Example Dataset
Download the dataset from:
Creating a DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, coalesce, to_date, date_add, date_sub, date_format, current_date, current_timestamp
# Initialize Spark Session
spark = SparkSession.builder.master("local[2]").appName("test").getOrCreate()
# Load data
data = "/FileStore/tables/donations-2.csv"
df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load(data)
# Show initial data
df.show()
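To confirm how the columns were inferred, it helps to inspect the schema right after loading; a date column that is not in yyyy-MM-dd form (such as dt in this dataset) will still show up as a string. A minimal check:
# Inspect the inferred types; non-ISO date strings remain StringType
df.printSchema()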
Spark Date Functions
1. to_date() Function
The to_date() function converts string-based dates into an actual date type.
Syntax:
to_date(date_column)
to_date(date_column, format)
Example:
res = df.withColumn("hiredate", to_date(col("dt"), "dd-MMM-yyyy"))
Handling Multiple Date Formats Using coalesce():
possible_formats = ["dd-MM-yyyy", "dd-MMM-yyyy", "dd/MMM/yyyy", "d-MMM-yyyy",
                    "dd/MM/yyyy", "yyyy-MMM-d", "yyyy-MM-dd"]

def dateto(col_name, formats=possible_formats):
    # Try every format and keep the first successful (non-null) parse
    return coalesce(*[to_date(col(col_name), f) for f in formats])

res = df.withColumn("hiredate", dateto("dt")).drop("dt")
res.show()
res.printSchema()
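Once hiredate is a true date column, date arithmetic works as expected. As an illustrative sketch (datediff and months_between are standard Spark functions; the column names follow the running example):
from pyspark.sql.functions import datediff, months_between, current_date

# Days and (fractional) months elapsed since each hire date
res.withColumn("days_since_hire", datediff(current_date(), col("hiredate"))) \
   .withColumn("months_since_hire", months_between(current_date(), col("hiredate"))) \
   .show()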
2. Adding and Subtracting Days (date_add, date_sub)
These functions add or subtract a specified number of days from a given date.
Syntax:
date_add(start_date, num_days)
date_sub(start_date, num_days)
Example:
res = res.withColumn("prohibision_upto", date_add(col("hiredate"), 100))
.withColumn("100daysbefore", date_add(col("hiredate"), -100))
.withColumn("before100", date_sub(col("hiredate"), 100))
res.show()
res = res.withColumn("prohibision_upto", date_add(col("hiredate"), 100))
.withColumn("100daysbefore", date_add(col("hiredate"), -100))
.withColumn("before100", date_sub(col("hiredate"), 100))
res.show()
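The derived date columns can also be used directly in comparisons. A small sketch with a hypothetical cutoff date (Spark casts the string literal to a date for the comparison):
# Keep only rows whose 100-day window ended before the (hypothetical) cutoff
res.filter(col("probation_upto") < "2020-01-01").show()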
3. Formatting Dates (date_format)
The date_format() function converts a date into a string in the specified format.
Syntax:
date_format(date_column, format)
Example:
res = res.withColumn("formatted_date", date_format(col("hiredate"), "MMM/dd/yyyy"))
res.show()
res = res.withColumn("formatted_date", date_format(col("hiredate"), "MMM/dd/yyyy"))
res.show()
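date_format() accepts any Java-style datetime pattern, so the same column can be rendered in several ways. A brief sketch (the pattern strings are standard; the column names follow the running example):
# Year only, and the full weekday name, rendered as strings
res.withColumn("hire_year", date_format(col("hiredate"), "yyyy")) \
   .withColumn("hire_day_name", date_format(col("hiredate"), "EEEE")) \
   .show()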
4. Current Date and Timestamp (current_date(), current_timestamp())
These functions return the current date and timestamp.
Syntax:
current_date()
current_timestamp()
Example:
res = res.withColumn("today", current_date())
.withColumn("ts", current_timestamp())
res.show()
res = res.withColumn("today", current_date())
.withColumn("ts", current_timestamp())
res.show()
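current_timestamp() also pairs naturally with date_format() when a readable string is needed, for example as an audit column. A minimal sketch:
# Render the load timestamp as a human-readable string
res.withColumn("loaded_at", date_format(current_timestamp(), "dd-MMM-yyyy HH:mm:ss")).show()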
Conclusion
Understanding and effectively using Spark's date functions is essential for processing time-based data efficiently. The functions covered here help with converting strings to dates, adding and subtracting days, formatting dates, and retrieving the current date and timestamp in Spark DataFrames.
For further details, visit the official Spark documentation.