🔥PySpark Column Functions: A Comprehensive Guide

PySpark provides a powerful set of functions to manipulate and transform data efficiently. In this guide, we'll explore various column functions with examples and their outputs.
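
All the examples that follow operate on a small DataFrame named df. Its creation is not shown in the original examples, so here is a minimal setup that reproduces the outputs below (the values are inferred from those outputs):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Create (or reuse) a local Spark session
spark = SparkSession.builder.appName('column-functions-guide').getOrCreate()

# Sample data inferred from the example outputs in this guide
df = spark.createDataFrame(
    [(1, 'Alice'), (2, 'Bob'), (3, 'Alice'), (4, 'Charlie')],
    ['id', 'name']
)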

1. col() - Select a Column

from pyspark.sql import functions as F
df.select(F.col('id')).show()

Output:

+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
+---+

2. lit() - Add a Constant Column

df.withColumn('constant', F.lit(10)).show()

Output:

+---+-------+--------+
| id|   name|constant|
+---+-------+--------+
|  1|  Alice|      10|
|  2|    Bob|      10|
|  3|  Alice|      10|
|  4|Charlie|      10|
+---+-------+--------+

3. when() - Conditional Column

df.withColumn('is_adult', F.when(F.col('id') > 1, 'Yes').otherwise('No')).show()

Output:

+---+-------+--------+
| id|   name|is_adult|
+---+-------+--------+
|  1|  Alice|      No|
|  2|    Bob|     Yes|
|  3|  Alice|     Yes|
|  4|Charlie|     Yes|
+---+-------+--------+
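
Multiple conditions can also be chained before the final otherwise(). A minimal sketch on the same sample data (the 'id_band' column name and labels are purely illustrative):

df.withColumn(
    'id_band',
    F.when(F.col('id') == 1, 'low')      # first matching condition wins
     .when(F.col('id') <= 3, 'medium')
     .otherwise('high')
).show()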

4. isNull() - Filter Null Values

df.filter(F.col('name').isNull()).show()

Output:

+---+----+
| id|name|
+---+----+
+---+----+

# Empty result: none of the values in 'name' are null, so only the header is printed

5. isNotNull() - Filter Non-Null Values

df.filter(F.col('name').isNotNull()).show()

Output:

+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|  Alice|
|  4|Charlie|
+---+-------+

6. between() - Filter Within a Range

df.filter(F.col('id').between(2, 3)).show()

Output:

+---+-----+
| id| name|
+---+-----+
|  2|  Bob|
|  3|Alice|
+---+-----+

7. like() - Pattern Matching

df.filter(F.col('name').like('A%')).show()

Output:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  3|Alice|
+---+-----+

8. rlike() - Regular Expression Matching

df.filter(F.col('name').rlike('A.*')).show()

Output:

+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  3|Alice|
+---+-----+

9. alias() - Rename a Column

df.select(F.col('name').alias('user_name')).show()

Output:

+---------+
|user_name|
+---------+
|    Alice|
|      Bob|
|    Alice|
|  Charlie|
+---------+

10. cast() - Change Data Type

df.withColumn('id_str', F.col('id').cast('string')).show()

Output:

+---+-------+------+
| id|   name|id_str|
+---+-------+------+
|  1|  Alice|     1|
|  2|    Bob|     2|
|  3|  Alice|     3|
|  4|Charlie|     4|
+---+-------+------+

11. expr() - SQL Expressions

df.select(F.expr('id + 1')).show()

Output:

+--------+
|(id + 1)|
+--------+
|       2|
|       3|
|       4|
|       5|
+--------+

12. Aggregation Functions

count()

df.select(F.count('id')).show()

Output:

+---------+
|count(id)|
+---------+
|        4|
+---------+

countDistinct()

df.select(F.countDistinct('name')).show()

Output:

+--------------------+
|count(DISTINCT name)|
+--------------------+
|                   3|
+--------------------+

sum()

df.select(F.sum('id')).show()

Output:

+-------+
|sum(id)|
+-------+
|     10|
+-------+

avg()

df.select(F.avg('id')).show()

Output:

+-------+
|avg(id)|
+-------+
|    2.5|
+-------+

min() and max()

df.select(F.min('id'), F.max('id')).show()

Output:

+-------+-------+
|min(id)|max(id)|
+-------+-------+
|      1|      4|
+-------+-------+
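
Aggregate functions are most often paired with groupBy() to compute per-group results rather than a single global value. A minimal sketch on the same sample data (the aliases are just illustrative, and the row order of a grouped result is not guaranteed):

df.groupBy('name').agg(
    F.count('id').alias('cnt'),     # rows per name
    F.sum('id').alias('id_sum')     # sum of id per name
).show()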

13. Window Functions

rank()

from pyspark.sql.window import Window
windowSpec = Window.orderBy('id')
df.withColumn('rank', F.rank().over(windowSpec)).show()

Output:

+---+-------+----+
| id|   name|rank|
+---+-------+----+
|  1|  Alice|   1|
|  2|    Bob|   2|
|  3|  Alice|   3|
|  4|Charlie|   4|
+---+-------+----+
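
Because the window above has no partitioning, the whole DataFrame is ranked as one group (and Spark warns that all data is moved to a single partition). Adding partitionBy() restarts the ranking within each group; a minimal sketch on the same sample data (partitioning by 'name' is purely illustrative):

# Rank rows by id separately within each name group
windowSpec = Window.partitionBy('name').orderBy('id')
df.withColumn('rank_in_group', F.rank().over(windowSpec)).show()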

Conclusion

PySpark provides a rich set of functions to manipulate data efficiently. This guide covers some of the most commonly used column functions, helping you streamline your data transformation workflows. Keep experimenting with these functions to fully utilize PySpark's capabilities!
