Mastering PySpark: Top 100 Functions for Data Engineers

Apache Spark has revolutionized big data processing with its distributed computing capabilities. PySpark, the Python API for Spark, is widely used for data engineering, analysis, and machine learning. This post covers the top 100 PySpark functions every data engineer should know, along with a practice dataset for applying these concepts. Each section of the index below ends with a short illustrative snippet; most of them assume the practice DataFrame df built in the Practice Dataset section at the end of the post.


Index of PySpark Functions

1. Basic DataFrame Operations

  • Creating an Empty DataFrame
  • Converting RDD to DataFrame
  • Converting DataFrame to Pandas
  • Displaying Data (show())
  • Defining Schema (StructType & StructField)
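
A quick sketch of these basics (assuming a running SparkSession named spark, as created in the Practice Dataset section):

from pyspark.sql.types import StructType, StructField, StringType

# Create an empty DataFrame with an explicit schema
empty_schema = StructType([StructField("name", StringType(), True)])
empty_df = spark.createDataFrame([], empty_schema)

# Convert an RDD of tuples to a DataFrame, then to pandas
rdd = spark.sparkContext.parallelize([("Alice",), ("Bob",)])
rdd_df = rdd.toDF(["name"])
pandas_df = rdd_df.toPandas()

# Display the first rows
rdd_df.show()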

2. Column & Row Operations

  • Column Class
  • Selecting Columns (select())
  • Collecting Data (collect())
  • Adding & Modifying Columns (withColumn())
  • Renaming Columns (withColumnRenamed())
  • Filtering Data (where() & filter())
  • Dropping Columns & Duplicates (drop(), dropDuplicates())
  • Sorting Data (orderBy(), sort())
  • Grouping Data (groupBy())
  • Joining DataFrames (join())
  • Union Operations (union(), unionAll(), unionByName())
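
Here is a minimal example of chaining these operations on the practice DataFrame df (columns id, name, age, profession, salary):

from pyspark.sql.functions import col

# Select, derive, rename, filter, and sort columns
result = (df.select("name", "age", "salary")
            .withColumn("bonus", col("salary") * 0.10)
            .withColumnRenamed("salary", "base_salary")
            .filter(col("age") > 25)
            .orderBy(col("base_salary").desc()))

# Group by profession, then join the aggregate back onto the original rows
avg_by_profession = df.groupBy("profession").avg("salary")
joined = df.join(avg_by_profession, on="profession", how="inner")
result.show()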

3. User-Defined Functions (UDFs) & Transformations

  • Creating UDFs
  • Transforming Data (transform(), apply())
  • Mapping Functions (map(), flatMap())
  • Iterating Over Rows (foreach())
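
A short sketch of a UDF combined with transform() and map() on the practice df (DataFrame.transform() requires Spark 3.0+):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# A simple UDF that tags each salary as "high" or "low"
salary_band = udf(lambda s: "high" if s >= 5000 else "low", StringType())

# transform() chains a function that takes and returns a DataFrame
def add_band(frame):
    return frame.withColumn("band", salary_band(frame["salary"]))

banded = df.transform(add_band)

# map() works on the underlying RDD of Row objects
name_lengths = df.rdd.map(lambda row: (row["name"], len(row["name"]))).collect()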

4. Sampling & Null Handling

  • Random Sampling (sample(), sampleBy())
  • Handling Missing Values (fillna(), fill())
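
For example, on the practice df (which has no nulls, so fillna() is shown purely for syntax):

# Take roughly a 20% random sample, seeded for repeatability
sampled = df.sample(fraction=0.2, seed=42)

# Stratified sampling: a different fraction per profession
strata = df.sampleBy("profession", fractions={"Engineer": 0.5, "Doctor": 0.1}, seed=42)

# Replace nulls per column; df.na.fill() is the equivalent form
cleaned = df.fillna({"age": 0, "profession": "Unknown"})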

5. Pivoting & Partitioning

  • Pivoting Data (pivot())
  • Partitioning Data (partitionBy())
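
A brief sketch using the practice df (the output path is just an illustration):

# Pivot: one row per age, one column per profession, average salary as the value
pivoted = df.groupBy("age").pivot("profession").avg("salary")

# Partition the output files by profession when writing
df.write.mode("overwrite").partitionBy("profession").parquet("/tmp/people_by_profession")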

6. Map & Dictionary Operations

  • Using MapType for Key-Value Pairs
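
A self-contained sketch of a MapType column and how to read individual keys from it:

from pyspark.sql.types import StructType, StructField, StringType, MapType
from pyspark.sql.functions import col

# A column holding key-value pairs, declared with MapType
map_schema = StructType([
    StructField("name", StringType(), True),
    StructField("properties", MapType(StringType(), StringType()), True)
])
map_df = spark.createDataFrame([("Alice", {"dept": "IT", "city": "NYC"})], map_schema)

# Access individual keys with getItem()
map_df.select("name", col("properties").getItem("dept").alias("dept")).show()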

7. Inbuilt Functions

  • col(), lit(), when(), isNull(), isNotNull(), between(), like(), rlike(), alias(), cast(), expr()
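
Most of these show up in everyday select() and filter() calls; for example, on the practice df:

from pyspark.sql.functions import col, lit, when, expr

labeled = df.select(
    col("name"),
    col("age").cast("string").alias("age_str"),
    col("salary").between(3000.0, 6000.0).alias("mid_range"),
    when(col("salary") > 5000, lit("senior")).otherwise(lit("junior")).alias("level"),
    expr("salary * 1.1").alias("raised_salary")
)

# Pattern matching and null checks
filtered = df.filter(col("profession").like("%Engineer%") & col("name").isNotNull())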

8. Aggregate Functions

  • Count (count(), countDistinct(), approx_count_distinct())
  • Sum (sum(), sumDistinct())
  • Average (avg())
  • Min & Max (min(), max())
  • First & Last (first(), last())
  • Standard Deviation & Variance (stddev(), variance())
  • Collecting Data into Lists & Sets (collect_list(), collect_set())
  • Correlation & Covariance (corr(), covar_pop(), covar_samp())
  • Statistical Measures (kurtosis(), skewness(), approxQuantile())
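
Here is one agg() call over the practice df that exercises several of these at once (note that importing sum, min, and max shadows the Python built-ins inside this snippet):

from pyspark.sql.functions import count, countDistinct, sum, avg, min, max, stddev, collect_list

summary = df.groupBy("profession").agg(
    count("*").alias("headcount"),
    countDistinct("age").alias("distinct_ages"),
    sum("salary").alias("total_salary"),
    avg("salary").alias("avg_salary"),
    min("age").alias("youngest"),
    max("age").alias("oldest"),
    stddev("salary").alias("salary_stddev"),
    collect_list("name").alias("names")
)
summary.show(truncate=False)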

9. Window Functions

  • Rank-Based Functions (rank(), dense_rank(), ntile(), row_number())
  • Lead & Lag Functions (lead(), lag())
  • Percentile Functions (percent_rank())
  • Window Definitions (Window.partitionBy(), Window.orderBy())
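
A typical pattern is to define one window specification and reuse it across several functions, as in this sketch on the practice df:

from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, rank, dense_rank, lag, percent_rank

# Rank people by salary within each profession
w = Window.partitionBy("profession").orderBy(df["salary"].desc())

ranked = df.select(
    "name", "profession", "salary",
    row_number().over(w).alias("row_num"),
    rank().over(w).alias("rank"),
    dense_rank().over(w).alias("dense_rank"),
    lag("salary", 1).over(w).alias("prev_salary"),
    percent_rank().over(w).alias("pct_rank")
)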

10. String Functions

  • Concatenation (concat(), concat_ws())
  • String Length & Case (length(), lower(), upper())
  • Trimming & Padding (trim(), ltrim(), rtrim(), lpad(), rpad())
  • String Manipulation (reverse(), substring(), substr(), split(), regexp_extract(), regexp_replace())
  • Finding & Replacing Characters (instr(), translate())
  • Encoding & Decoding (encode(), decode())
  • Formatting & Capitalization (format_number(), initcap())
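
A quick tour of the common string helpers on the practice df:

from pyspark.sql.functions import concat_ws, upper, length, lpad, substring, regexp_replace, initcap

strings = df.select(
    concat_ws(" - ", "name", "profession").alias("label"),
    upper("name").alias("name_upper"),
    length("name").alias("name_len"),
    lpad("name", 15, "*").alias("padded"),
    substring("name", 1, 6).alias("prefix"),
    regexp_replace("name", "_", " ").alias("clean_name"),
    initcap("profession").alias("profession_title")
)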

11. Date & Time Functions

  • Current Date & Time (current_date(), current_timestamp())
  • Date Arithmetic (date_add(), date_sub())
  • Extracting Date Parts (year(), month(), dayofmonth(), dayofweek(), dayofyear(), hour(), minute(), second())
  • Formatting Dates (date_format(), trunc(), date_trunc())
  • Working with Timestamps (to_date(), to_timestamp(), from_unixtime(), unix_timestamp())
  • Month & Week Operations (add_months(), months_between(), weekofyear())
  • Finding Specific Days (last_day(), next_day())
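
Since the practice df has no date columns, this standalone sketch builds its dates from current_date():

from pyspark.sql.functions import (current_date, current_timestamp, date_add, year, month,
                                   date_format, to_date, last_day, lit)

dates = spark.range(1).select(
    current_date().alias("today"),
    current_timestamp().alias("now"),
    date_add(current_date(), 30).alias("plus_30_days"),
    year(current_date()).alias("yr"),
    month(current_date()).alias("mon"),
    date_format(current_date(), "yyyy-MM-dd").alias("formatted"),
    to_date(lit("2024-01-15"), "yyyy-MM-dd").alias("parsed"),
    last_day(current_date()).alias("month_end")
)
dates.show(truncate=False)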

12. Array Functions

  • Creating & Manipulating Arrays (array(), array_contains(), array_distinct(), array_intersect(), array_union())
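
A self-contained sketch with two array columns:

from pyspark.sql.functions import array_contains, array_distinct, array_intersect, array_union

arr_df = spark.createDataFrame(
    [(1, ["spark", "python"], ["python", "sql"])],
    ["id", "skills_a", "skills_b"]
)

arr_df.select(
    array_contains("skills_a", "spark").alias("knows_spark"),
    array_union("skills_a", "skills_b").alias("all_skills"),
    array_intersect("skills_a", "skills_b").alias("common_skills"),
    array_distinct(array_union("skills_a", "skills_b")).alias("unique_skills")
).show(truncate=False)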

Practice Dataset

To help you get hands-on experience with these functions, here is the sample dataset used in the snippets above: a DataFrame of 50 people with an id, name, age, profession, and salary.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Start (or reuse) a local SparkSession
spark = SparkSession.builder.appName("PySpark Practice").getOrCreate()

# 50 synthetic rows: id, name, age, profession, salary
data = [(i, f"Person_{i}", 20 + (i % 30), ["Engineer", "Doctor", "Teacher", "Artist", "Scientist"][i % 5], 3000.0 + (i * 100)) for i in range(1, 51)]

# Explicit schema so column types are predictable
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("profession", StringType(), True),
    StructField("salary", DoubleType(), True)
])

df = spark.createDataFrame(data, schema)
df.show()

Conclusion

Mastering these PySpark functions will significantly improve your efficiency in handling large datasets. This guide, along with the practice dataset, will help you build confidence and proficiency in PySpark. Try implementing these functions and experiment with different scenarios!

Happy coding!
