Mastering PySpark: Top 100 Functions for Data Engineers
Apache Spark has revolutionized big data processing with its distributed computing capabilities. PySpark, the Python API for Spark, is widely used for data engineering, analysis, and machine learning. This blog will cover the top 100 PySpark functions every data engineer should know, along with a practice dataset to apply these concepts.
Index of PySpark Functions
1. Basic DataFrame Operations
- Creating an Empty DataFrame
- Converting RDD to DataFrame
- Converting DataFrame to Pandas
- Displaying Data (show())
- Defining Schema (StructType & StructField)
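A minimal, self-contained sketch of these basics (the app name and columns here are illustrative):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
spark = SparkSession.builder.appName("BasicOps").getOrCreate()
# Define a schema explicitly with StructType and StructField
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Create an empty DataFrame with that schema
empty_df = spark.createDataFrame([], schema)
# Convert an RDD to a DataFrame
rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
small_df = rdd.toDF(schema)
# Convert to pandas and display the data
pandas_df = small_df.toPandas()
small_df.show()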
2. Column & Row Operations
- Column Class
- Selecting Columns (select())
- Collecting Data (collect())
- Adding & Modifying Columns (withColumn())
- Renaming Columns (withColumnRenamed())
- Filtering Data (where() & filter())
- Dropping Columns & Duplicates (drop(), dropDuplicates())
- Sorting Data (orderBy(), sort())
- Grouping Data (groupBy())
- Joining DataFrames (join())
- Union Operations (union(), unionAll(), unionByName())
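A short sketch of the most common column and row operations, assuming the practice DataFrame df defined in the Practice Dataset section at the end of this post:
from pyspark.sql.functions import col
# Select and collect
df.select("name", "salary").show(5)
rows = df.collect()
# Add, modify, and rename columns
df2 = df.withColumn("bonus", col("salary") * 0.1).withColumnRenamed("profession", "job")
# Filter, drop, and sort
df2.filter(col("age") > 30).drop("bonus").orderBy(col("salary").desc()).show(5)
# Group, join, and union
df.groupBy("profession").count().show()
df.join(df2.select("id", "job"), on="id", how="inner").show(5)
df.union(df).dropDuplicates(["id"]).show(5)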
3. User-Defined Functions (UDFs) & Transformations
- Creating UDFs
- Transforming Data (transform(), apply())
- Mapping Functions (map(), flatMap())
- Iterating Over Rows (foreach())
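A quick sketch of UDFs and transformations using the same practice df (the salary threshold is illustrative):
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType
# A simple UDF that tags salaries as high or low
salary_band = udf(lambda s: "high" if s >= 5000 else "low", StringType())
df.withColumn("band", salary_band(col("salary"))).show(5)
# transform() chains reusable DataFrame functions
def add_bonus(input_df):
    return input_df.withColumn("bonus", col("salary") * 0.1)
df.transform(add_bonus).show(5)
# map()/flatMap() operate on the underlying RDD; foreach() runs a function per row
names = df.rdd.map(lambda row: row.name).collect()
df.foreach(lambda row: None)  # e.g. push each row to an external system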
4. Sampling & Null Handling
- Random Sampling (sample(), sampleBy())
- Handling Missing Values (fillna(), fill())
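A minimal sketch of sampling and null handling on the practice df (fractions and fill values are illustrative):
# Random sampling: roughly 20% of rows without replacement
sample_df = df.sample(withReplacement=False, fraction=0.2, seed=42)
# Stratified sampling by profession
strata = df.sampleBy("profession", fractions={"Engineer": 0.5, "Doctor": 0.2}, seed=42)
# Replace nulls: fillna() on the DataFrame, or fill() via the na interface
df.fillna({"salary": 0.0, "profession": "Unknown"}).show(5)
df.na.fill(0.0, subset=["salary"]).show(5)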
5. Pivoting & Partitioning
- Pivoting Data (pivot())
- Partitioning Data (partitionBy())
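A short sketch of pivoting and partitioned writes using the practice df (the output path is illustrative):
from pyspark.sql.functions import avg
# Pivot: average salary per profession, one column per profession
df.groupBy("age").pivot("profession").agg(avg("salary")).show(5)
# Partitioned write: one folder per profession value
df.write.mode("overwrite").partitionBy("profession").parquet("/tmp/people_by_profession")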
6. Map & Dictionary Operations
- Using MapType for Key-Value Pairs
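A minimal sketch of building and reading a map column on the practice df (the key names are illustrative):
from pyspark.sql.functions import create_map, lit, col
# Build a map column of key-value pairs and read values back by key
map_df = df.withColumn("attrs", create_map(lit("job"), col("profession"), lit("name"), col("name")))
map_df.select(col("attrs")["job"], col("attrs").getItem("name")).show(5)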
7. Inbuilt Functions
- col(), lit(), when(), isNull(), isNotNull(), between(), like(), rlike(), alias(), cast(), expr()
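A compact sketch combining several of these built-ins on the practice df (the literal values are illustrative):
from pyspark.sql.functions import col, lit, when, expr
df.select(
    col("name").alias("employee"),
    lit("2024").alias("snapshot_year"),
    when(col("age").between(25, 40), "mid").otherwise("other").alias("age_band"),
    col("salary").cast("int"),
    expr("salary * 12 as annual_salary")
).where(col("profession").like("%eer") & col("salary").isNotNull()).show(5)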
8. Aggregate Functions
- Count (count(), countDistinct(), approx_count_distinct())
- Sum (sum(), sumDistinct())
- Average (avg())
- Min & Max (min(), max())
- First & Last (first(), last())
- Standard Deviation & Variance (stddev(), variance())
- Collecting Data into Lists & Sets (collect_list(), collect_set())
- Correlation & Covariance (corr(), covar_pop(), covar_samp())
- Statistical Measures (kurtosis(), skewness(), approxQuantile())
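A short sketch of the main aggregates grouped by profession, plus the DataFrame-level statistics, on the practice df:
from pyspark.sql.functions import count, countDistinct, sum, avg, min, max, stddev, collect_list
df.groupBy("profession").agg(
    count("*").alias("rows"),
    countDistinct("age").alias("distinct_ages"),
    sum("salary").alias("total_salary"),
    avg("salary").alias("avg_salary"),
    min("salary"), max("salary"),
    stddev("salary").alias("salary_stddev"),
    collect_list("name").alias("names")
).show()
# DataFrame-level statistics
print(df.corr("age", "salary"))
print(df.approxQuantile("salary", [0.25, 0.5, 0.75], 0.01))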
9. Window Functions
- Rank-Based Functions (rank(), dense_rank(), ntile(), row_number())
- Lead & Lag Functions (lead(), lag())
- Percentile Functions (percent_rank())
- Window Specifications (Window.partitionBy(), Window.orderBy())
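A minimal sketch of ranking people by salary within each profession on the practice df:
from pyspark.sql import Window
from pyspark.sql.functions import rank, dense_rank, row_number, lag, lead, percent_rank, col
# Define the window: partition by profession, order by salary descending
w = Window.partitionBy("profession").orderBy(col("salary").desc())
df.select(
    "name", "profession", "salary",
    rank().over(w).alias("rank"),
    dense_rank().over(w).alias("dense_rank"),
    row_number().over(w).alias("row_number"),
    lag("salary", 1).over(w).alias("prev_salary"),
    lead("salary", 1).over(w).alias("next_salary"),
    percent_rank().over(w).alias("pct_rank")
).show(5)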
10. String Functions
- Concatenation (concat(), concat_ws())
- String Length & Case (length(), lower(), upper())
- Trimming & Padding (trim(), ltrim(), rtrim(), lpad(), rpad())
- String Manipulation (reverse(), substring(), substr(), split(), regexp_extract(), regexp_replace())
- Finding & Replacing Characters (instr(), translate())
- Encoding & Decoding (encode(), decode())
- Formatting & Capitalization (format_number(), initcap())
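A quick sketch of the string helpers applied to the practice df (padding lengths and patterns are illustrative):
from pyspark.sql.functions import concat_ws, upper, length, lpad, substring, split, regexp_replace, initcap, col
df.select(
    concat_ws(" - ", "name", "profession").alias("label"),
    upper(col("profession")),
    length(col("name")),
    lpad(col("name"), 12, "*"),
    substring(col("name"), 1, 6).alias("prefix"),
    split(col("name"), "_").alias("parts"),
    regexp_replace(col("name"), "Person", "Employee").alias("renamed"),
    initcap(col("profession"))
).show(5, truncate=False)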
11. Date & Time Functions
- Current Date & Time (current_date(), current_timestamp())
- Date Arithmetic (date_add(), date_sub())
- Extracting Date Parts (year(), month(), dayofmonth(), dayofweek(), dayofyear(), hour(), minute(), second())
- Formatting Dates (date_format(), trunc(), date_trunc())
- Working with Timestamps (to_date(), to_timestamp(), from_unixtime(), unix_timestamp())
- Month & Week Operations (add_months(), months_between(), weekofyear())
- Finding Specific Days (last_day(), next_day())
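A short sketch of the date and time functions; the practice df has no date column, so one is added for illustration:
from pyspark.sql.functions import current_date, current_timestamp, date_add, year, month, dayofmonth, date_format, to_date, months_between, last_day, lit, col
# Add an illustrative hire_date column
dated = df.withColumn("hire_date", to_date(lit("2023-06-15")))
dated.select(
    current_date(), current_timestamp(),
    date_add(col("hire_date"), 30).alias("probation_end"),
    year("hire_date"), month("hire_date"), dayofmonth("hire_date"),
    date_format("hire_date", "dd MMM yyyy").alias("pretty"),
    months_between(current_date(), col("hire_date")).alias("tenure_months"),
    last_day("hire_date")
).show(5, truncate=False)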
12. Array Functions
- Creating & Manipulating Arrays (array(), array_contains(), array_distinct(), array_intersect(), array_union())
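A minimal sketch of array columns on the practice df (the skill values are illustrative):
from pyspark.sql.functions import array, array_contains, array_distinct, array_union, col, lit
# Build an array column from literals and an existing column
arr_df = df.withColumn("skills", array(lit("python"), lit("sql"), col("profession")))
arr_df.select(
    array_contains(col("skills"), "sql").alias("knows_sql"),
    array_distinct(col("skills")).alias("unique_skills"),
    array_union(col("skills"), array(lit("spark"))).alias("all_skills")
).show(5, truncate=False)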
Practice Dataset
To help you get hands-on experience with these functions, here is a sample dataset:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType
# Start a local SparkSession
spark = SparkSession.builder.appName("PySpark Practice").getOrCreate()
# 50 synthetic people: id, name, age, profession, and salary
data = [(i, f"Person_{i}", 20 + (i % 30), ["Engineer", "Doctor", "Teacher", "Artist", "Scientist"][i % 5], 3000.0 + (i * 100)) for i in range(1, 51)]
# Explicit schema so the column types are predictable
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("profession", StringType(), True),
    StructField("salary", DoubleType(), True)
])
df = spark.createDataFrame(data, schema)
df.show()
Conclusion
Mastering these PySpark functions will significantly improve your efficiency in handling large datasets. This guide, along with the practice dataset, will help you build confidence and proficiency in PySpark. Try implementing these functions and experiment with different scenarios!
Happy coding!