🔥PySpark Column Functions: A Comprehensive Guide
PySpark provides a powerful set of functions to manipulate and transform data efficiently. In this guide, we'll explore various column functions with examples and their outputs.
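All of the examples below operate on the same small DataFrame of ids and names. The original setup code is not shown, so here is a minimal sketch reconstructed from the outputs in this guide (the SparkSession app name is illustrative):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local SparkSession; the app name is arbitrary.
spark = SparkSession.builder.appName('column-functions-guide').getOrCreate()

# Sample data matching the outputs shown throughout this guide.
df = spark.createDataFrame(
    [(1, 'Alice'), (2, 'Bob'), (3, 'Alice'), (4, 'Charlie')],
    ['id', 'name']
)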
1. col() - Select a Column
from pyspark.sql import functions as F
df.select(F.col('id')).show()
Output:
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
+---+
2. lit() - Add a Constant Column
df.withColumn('constant', F.lit(10)).show()
Output:
+---+-------+--------+
| id|   name|constant|
+---+-------+--------+
|  1|  Alice|      10|
|  2|    Bob|      10|
|  3|  Alice|      10|
|  4|Charlie|      10|
+---+-------+--------+
3. when() - Conditional Column
df.withColumn('is_adult', F.when(F.col('id') > 1, 'Yes').otherwise('No')).show()
Output:
+---+-------+--------+
| id|   name|is_adult|
+---+-------+--------+
|  1|  Alice|      No|
|  2|    Bob|     Yes|
|  3|  Alice|     Yes|
|  4|Charlie|     Yes|
+---+-------+--------+
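Multiple when() calls can also be chained before the final otherwise(), since each when() returns a Column. A minimal sketch using the same DataFrame (the 'group' column name and the conditions are illustrative):

df.withColumn(
    'group',
    F.when(F.col('id') == 1, 'first')
     .when(F.col('id') == 2, 'second')
     .otherwise('rest')
).show()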
4. isNull() - Filter Null Values
df.filter(F.col('name').isNull()).show()
Output:
+---+----+
| id|name|
+---+----+
+---+----+

(The result is empty because none of the values in 'name' are null, so only the header is printed.)
5. isNotNull() - Filter Non-Null Values
df.filter(F.col('name').isNotNull()).show()
Output:
+---+-------+
| id|   name|
+---+-------+
|  1|  Alice|
|  2|    Bob|
|  3|  Alice|
|  4|Charlie|
+---+-------+
6. between() - Filter Within a Range
df.filter(F.col('id').between(2, 3)).show()
Output:
+---+-----+
| id| name|
+---+-----+
|  2|  Bob|
|  3|Alice|
+---+-----+
7. like() - Pattern Matching
df.filter(F.col('name').like('A%')).show()
Output:
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  3|Alice|
+---+-----+
8. rlike() - Regular Expression Matching
df.filter(F.col('name').rlike('A.*')).show()
Output:
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  3|Alice|
+---+-----+
9. alias() - Rename a Column
df.select(F.col('name').alias('user_name')).show()
Output:
+---------+
|user_name|
+---------+
|    Alice|
|      Bob|
|    Alice|
|  Charlie|
+---------+
10. cast() - Change Data Type
df.withColumn('id_str', F.col('id').cast('string')).show()
Output:
+---+-------+------+
| id|   name|id_str|
+---+-------+------+
|  1|  Alice|     1|
|  2|    Bob|     2|
|  3|  Alice|     3|
|  4|Charlie|     4|
+---+-------+------+
11. expr() - SQL Expressions
df.select(F.expr('id + 1')).show()
Output:
+--------+
|(id + 1)|
+--------+
|       2|
|       3|
|       4|
|       5|
+--------+
12. Aggregation Functions
count()
df.select(F.count('id')).show()
Output:
+---------+
|count(id)|
+---------+
|        4|
+---------+
countDistinct()
df.select(F.countDistinct('name')).show()
Output:
+--------------------+
|count(DISTINCT name)|
+--------------------+
|                   3|
+--------------------+
sum()
df.select(F.sum('id')).show()
Output:
+-------+
|sum(id)|
+-------+
|     10|
+-------+
avg()
df.select(F.avg('id')).show()
Output:
+-------+
|avg(id)|
+-------+
|    2.5|
+-------+
min() and max()
df.select(F.min('id'), F.max('id')).show()
Output:
+-------+-------+
|min(id)|max(id)|
+-------+-------+
|      1|      4|
+-------+-------+
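The aggregations above can also be computed together in a single select(); a minimal sketch (the column aliases are illustrative):

df.select(
    F.count('id').alias('row_count'),
    F.countDistinct('name').alias('distinct_names'),
    F.sum('id').alias('id_sum'),
    F.avg('id').alias('id_avg')
).show()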
13. Window Functions
rank()
from pyspark.sql.window import Window
windowSpec = Window.orderBy('id')
df.withColumn('rank', F.rank().over(windowSpec)).show()
Output:
+---+-------+----+
| id|   name|rank|
+---+-------+----+
|  1|  Alice|   1|
|  2|    Bob|   2|
|  3|  Alice|   3|
|  4|Charlie|   4|
+---+-------+----+
Conclusion
PySpark provides a rich set of functions to manipulate data efficiently. This guide covers some of the most commonly used column functions, helping you streamline your data transformation workflows. Keep experimenting with these functions to fully utilize PySpark's capabilities!