Mastering PySpark: Top 100 Functions for Data Engineers

Apache Spark has revolutionized big data processing with its distributed computing capabilities. PySpark, the Python API for Spark, is widely used for data engineering, analysis, and machine learning. This post covers the top 100 PySpark functions every data engineer should know, along with a practice dataset for applying these concepts. Each section of the index below ends with a short illustrative snippet; most of them assume the practice DataFrame df built in the Practice Dataset section at the end of the post.


Index of PySpark Functions

1. Basic DataFrame Operations

  • Creating an Empty DataFrame
  • Converting RDD to DataFrame
  • Converting DataFrame to Pandas
  • Displaying Data (show())
  • Defining Schema (StructType & StructField)
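
A quick sketch of these basics (assuming a running SparkSession named spark, as created in the Practice Dataset section):

from pyspark.sql.types import StructType, StructField, StringType

# Create an empty DataFrame with an explicit schema
empty_schema = StructType([StructField("name", StringType(), True)])
empty_df = spark.createDataFrame([], empty_schema)

# Convert an RDD of tuples to a DataFrame, then to pandas
rdd = spark.sparkContext.parallelize([("Alice",), ("Bob",)])
rdd_df = rdd.toDF(["name"])
pandas_df = rdd_df.toPandas()

# Display the first rows
rdd_df.show()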

2. Column & Row Operations

  • Column Class
  • Selecting Columns (select())
  • Collecting Data (collect())
  • Adding & Modifying Columns (withColumn())
  • Renaming Columns (withColumnRenamed())
  • Filtering Data (where() & filter())
  • Dropping Columns & Duplicates (drop(), dropDuplicates())
  • Sorting Data (orderBy(), sort())
  • Grouping Data (groupBy())
  • Joining DataFrames (join())
  • Union Operations (union(), unionAll(), unionByName())
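
Here is a minimal example of chaining these operations on the practice DataFrame df (columns id, name, age, profession, salary):

from pyspark.sql.functions import col

# Select, derive, rename, filter, and sort columns
result = (df.select("name", "age", "salary")
            .withColumn("bonus", col("salary") * 0.10)
            .withColumnRenamed("salary", "base_salary")
            .filter(col("age") > 25)
            .orderBy(col("base_salary").desc()))

# Group by profession, then join the aggregate back onto the original rows
avg_by_profession = df.groupBy("profession").avg("salary")
joined = df.join(avg_by_profession, on="profession", how="inner")
result.show()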

3. User-Defined Functions (UDFs) & Transformations

  • Creating UDFs
  • Transforming Data (transform(), apply())
  • Mapping Functions (map(), flatMap())
  • Iterating Over Rows (foreach())
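
A short sketch of a UDF combined with transform() and map() on the practice df (DataFrame.transform() requires Spark 3.0+):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# A simple UDF that tags each salary as "high" or "low"
salary_band = udf(lambda s: "high" if s >= 5000 else "low", StringType())

# transform() chains a function that takes and returns a DataFrame
def add_band(frame):
    return frame.withColumn("band", salary_band(frame["salary"]))

banded = df.transform(add_band)

# map() works on the underlying RDD of Row objects
name_lengths = df.rdd.map(lambda row: (row["name"], len(row["name"]))).collect()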

4. Sampling & Null Handling

  • Random Sampling (sample(), sampleBy())
  • Handling Missing Values (fillna(), fill())
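
For example, on the practice df (which has no nulls, so fillna() is shown purely for syntax):

# Take roughly a 20% random sample, seeded for repeatability
sampled = df.sample(fraction=0.2, seed=42)

# Stratified sampling: a different fraction per profession
strata = df.sampleBy("profession", fractions={"Engineer": 0.5, "Doctor": 0.1}, seed=42)

# Replace nulls per column; df.na.fill() is the equivalent form
cleaned = df.fillna({"age": 0, "profession": "Unknown"})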

5. Pivoting & Partitioning

  • Pivoting Data (pivot())
  • Partitioning Data (partitionBy())
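
A brief sketch using the practice df (the output path is just an illustration):

# Pivot: one row per age, one column per profession, average salary as the value
pivoted = df.groupBy("age").pivot("profession").avg("salary")

# Partition the output files by profession when writing
df.write.mode("overwrite").partitionBy("profession").parquet("/tmp/people_by_profession")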

6. Map & Dictionary Operations

  • Using MapType for Key-Value Pairs
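
A self-contained sketch of a MapType column and how to read individual keys from it:

from pyspark.sql.types import StructType, StructField, StringType, MapType
from pyspark.sql.functions import col

# A column holding key-value pairs, declared with MapType
map_schema = StructType([
    StructField("name", StringType(), True),
    StructField("properties", MapType(StringType(), StringType()), True)
])
map_df = spark.createDataFrame([("Alice", {"dept": "IT", "city": "NYC"})], map_schema)

# Access individual keys with getItem()
map_df.select("name", col("properties").getItem("dept").alias("dept")).show()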

7. Inbuilt Functions

  • col(), lit(), when(), isNull(), isNotNull(), between(), like(), rlike(), alias(), cast(), expr()
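
Most of these show up in everyday select() and filter() calls; for example, on the practice df:

from pyspark.sql.functions import col, lit, when, expr

labeled = df.select(
    col("name"),
    col("age").cast("string").alias("age_str"),
    col("salary").between(3000.0, 6000.0).alias("mid_range"),
    when(col("salary") > 5000, lit("senior")).otherwise(lit("junior")).alias("level"),
    expr("salary * 1.1").alias("raised_salary")
)

# Pattern matching and null checks
filtered = df.filter(col("profession").like("%Engineer%") & col("name").isNotNull())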

8. Aggregate Functions

  • Count (count(), countDistinct(), approx_count_distinct())
  • Sum (sum(), sumDistinct())
  • Average (avg())
  • Min & Max (min(), max())
  • First & Last (first(), last())
  • Standard Deviation & Variance (stddev(), variance())
  • Collecting Data into Lists & Sets (collect_list(), collect_set())
  • Correlation & Covariance (corr(), covar_pop(), covar_samp())
  • Statistical Measures (kurtosis(), skewness(), approxQuantile())
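
Here is one agg() call over the practice df that exercises several of these at once (note that importing sum, min, and max shadows the Python built-ins inside this snippet):

from pyspark.sql.functions import count, countDistinct, sum, avg, min, max, stddev, collect_list

summary = df.groupBy("profession").agg(
    count("*").alias("headcount"),
    countDistinct("age").alias("distinct_ages"),
    sum("salary").alias("total_salary"),
    avg("salary").alias("avg_salary"),
    min("age").alias("youngest"),
    max("age").alias("oldest"),
    stddev("salary").alias("salary_stddev"),
    collect_list("name").alias("names")
)
summary.show(truncate=False)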

9. Window Functions

  • Rank-Based Functions (rank(), dense_rank(), ntile(), row_number())
  • Lead & Lag Functions (lead(), lag())
  • Percentile Functions (percent_rank())
  • Window Definitions (Window.partitionBy(), Window.orderBy())
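
A typical pattern is to define one window specification and reuse it across several functions, as in this sketch on the practice df:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lag, percent_rank

# Rank people by salary within each profession
w = Window.partitionBy("profession").orderBy(df["salary"].desc())

ranked = df.select(
    "name", "profession", "salary",
    row_number().over(w).alias("row_num"),
    rank().over(w).alias("rank"),
    dense_rank().over(w).alias("dense_rank"),
    lag("salary", 1).over(w).alias("prev_salary"),
    percent_rank().over(w).alias("pct_rank")
)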

10. String Functions

  • Concatenation (concat(), concat_ws())
  • String Length & Case (length(), lower(), upper())
  • Trimming & Padding (trim(), ltrim(), rtrim(), lpad(), rpad())
  • String Manipulation (reverse(), substring(), substr(), split(), regexp_extract(), regexp_replace())
  • Finding & Replacing Characters (instr(), translate())
  • Encoding & Decoding (encode(), decode())
  • Formatting & Capitalization (format_number(), initcap())
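
A quick tour of the common string helpers on the practice df:

from pyspark.sql.functions import concat_ws, upper, length, lpad, substring, regexp_replace, initcap

strings = df.select(
    concat_ws(" - ", "name", "profession").alias("label"),
    upper("name").alias("name_upper"),
    length("name").alias("name_len"),
    lpad("name", 15, "*").alias("padded"),
    substring("name", 1, 6).alias("prefix"),
    regexp_replace("name", "_", " ").alias("clean_name"),
    initcap("profession").alias("profession_title")
)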

11. Date & Time Functions

  • Current Date & Time (current_date(), current_timestamp())
  • Date Arithmetic (date_add(), date_sub())
  • Extracting Date Parts (year(), month(), dayofmonth(), dayofweek(), dayofyear(), hour(), minute(), second())
  • Formatting Dates (date_format(), trunc(), date_trunc())
  • Working with Timestamps (to_date(), to_timestamp(), from_unixtime(), unix_timestamp())
  • Month & Week Operations (add_months(), months_between(), weekofyear())
  • Finding Specific Days (last_day(), next_day())
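
Since the practice df has no date columns, this standalone sketch builds its dates from current_date():

from pyspark.sql.functions import (current_date, current_timestamp, date_add, year, month,
                                   date_format, to_date, last_day, lit)

dates = spark.range(1).select(
    current_date().alias("today"),
    current_timestamp().alias("now"),
    date_add(current_date(), 30).alias("plus_30_days"),
    year(current_date()).alias("yr"),
    month(current_date()).alias("mon"),
    date_format(current_date(), "yyyy-MM-dd").alias("formatted"),
    to_date(lit("2024-01-15"), "yyyy-MM-dd").alias("parsed"),
    last_day(current_date()).alias("month_end")
)
dates.show(truncate=False)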

12. Array Functions

  • Creating & Manipulating Arrays (array(), array_contains(), array_distinct(), array_intersect(), array_union())
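
A self-contained sketch with two array columns:

from pyspark.sql.functions import array_contains, array_distinct, array_intersect, array_union

arr_df = spark.createDataFrame(
    [(1, ["spark", "python"], ["python", "sql"])],
    ["id", "skills_a", "skills_b"]
)

arr_df.select(
    array_contains("skills_a", "spark").alias("knows_spark"),
    array_union("skills_a", "skills_b").alias("all_skills"),
    array_intersect("skills_a", "skills_b").alias("common_skills"),
    array_distinct(array_union("skills_a", "skills_b")).alias("unique_skills")
).show(truncate=False)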

Practice Dataset

To help you get hands-on experience with these functions, here is the sample dataset used in the snippets above: a DataFrame of 50 people with an id, name, age, profession, and salary.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Start (or reuse) a local SparkSession
spark = SparkSession.builder.appName("PySpark Practice").getOrCreate()

# 50 synthetic rows: id, name, age, profession, salary
data = [(i, f"Person_{i}", 20 + (i % 30), ["Engineer", "Doctor", "Teacher", "Artist", "Scientist"][i % 5], 3000.0 + (i * 100)) for i in range(1, 51)]

# Explicit schema so column types are predictable
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("profession", StringType(), True),
    StructField("salary", DoubleType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()

Conclusion

Mastering these PySpark functions will significantly improve your efficiency in handling large datasets. This guide, along with the practice dataset, will help you build confidence and proficiency in PySpark. Try implementing these functions and experiment with different scenarios!

Happy coding!
