10 Essential PySpark Commands for Big Data Processing

Check out these 10 ways to leverage efficient distributed dataset processing by combining the strengths of Spark and Python libraries for data science.




 

PySpark unifies the best of two worlds: the simplicity of the Python language and its libraries, and the scalability of Apache Spark.

This article presents, through code examples, 10 handy PySpark commands for turbocharging big data processing pipelines in Python projects.

 

Initial Setup

 
To illustrate the use of these 10 commands, we first import the necessary libraries, initialize a Spark session, and load the publicly available penguins dataset, which describes penguin specimens from three species through a mix of numerical and categorical features. The dataset is initially read into a Pandas DataFrame.

# Install dependencies (notebook environments)
!pip install pyspark pandas

from pyspark.sql import SparkSession
import pandas as pd

# Start (or reuse) a local Spark session with a modest driver memory setting
spark = SparkSession.builder \
    .appName("PenguinAnalysis") \
    .config("spark.driver.memory", "2g") \
    .getOrCreate()

# Read the penguins dataset into a Pandas DataFrame
pdf = pd.read_csv("https://raw.githubusercontent.com/gakudo-ai/open-datasets/refs/heads/main/penguins.csv")
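
Before moving on, a quick optional check confirms that the session is up and the CSV loaded as expected (these are standard PySpark and Pandas attributes, nothing specific to this dataset):

# Optional sanity check: Spark version and the shape of the loaded dataset
print(spark.version)
print(pdf.shape)
print(pdf.columns.tolist())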

 

The 10 Essential PySpark Commands for Big Data Processing

 

1. Loading data into a Spark DataFrame

Spark DataFrames differ from Pandas DataFrames in their distributed nature, lazy evaluation, and immutability. Converting a Pandas DataFrame into a Spark DataFrame is as simple as calling the createDataFrame command.

df = spark.createDataFrame(pdf)

 

This data structure is better suited for parallel data processing under the hood, while maintaining many of the characteristics of regular DataFrames.
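
One practical consequence of lazy evaluation is worth seeing once: transformations such as select and filter only build an execution plan, and nothing runs until an action like count() or show() is called. A minimal sketch:

# Transformations are lazy: this line only defines a plan, no computation happens yet
heavy = df.filter(df.body_mass_g > 4000).select("species", "body_mass_g")

# Actions such as count() trigger the actual distributed computation
print(heavy.count())

# Small results can always be brought back to Pandas if needed
print(heavy.toPandas().head())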

 

2. Select and filter data

The select and filter functions select specific columns (features) in a Spark DataFrame and filter rows (instances) where a specified condition is true. This example selects and filters penguins from the Gentoo species weighing over 5kg.

df.select("species", "body_mass_g").filter(df.species == "Gentoo").filter(df.body_mass_g > 5000).show()

 

3. Group and aggregate data

Grouping data by categories is typically necessary to perform aggregations like computing average values or other statistics per category. This example shows how to use the groupBy and agg commands together to obtain summaries of average measurements per species.

df.groupBy("species").agg({"flipper_length_mm": "mean", "body_mass_g": "mean"}).show()

 

4. Window functions

Window functions perform calculations across rows related to the current row, such as ranking or running totals. This example shows how to create a window function that partitions the data into species and then ranks penguins by body mass within each species.

from pyspark.sql.window import Window
from pyspark.sql.functions import rank, col
windowSpec = Window.partitionBy("species").orderBy(col("body_mass_g").desc())
df.withColumn("mass_rank", rank().over(windowSpec)).show()

 

5. Join operations

Like SQL joins for merging tables, PySpark join operations combine two DataFrames based on specified columns they have in common, supporting several join types such as left, right, and outer joins. This example builds a DataFrame containing total counts per species, then applies a left join on the species "linkage column" to attach to every observation in the original dataset the count associated with its species.

species_stats = df.groupBy("species").count()
df.join(species_stats, "species", "left").show()
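
A small optional refinement: renaming the aggregated column before joining keeps the result self-explanatory and avoids clashes with other count-like columns:

# Rename the aggregated column before joining for a clearer result
species_counts = df.groupBy("species").count().withColumnRenamed("count", "species_count")
df.join(species_counts, on="species", how="left").show()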

 

6. User-defined functions

PySpark's udf enables the creation of user-defined functions: custom functions, often written as lambdas, that once defined can be used to apply complex transformations to columns in a DataFrame. This example defines and applies a user-defined function that maps penguin body mass into a new categorical variable describing each penguin as large or small.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
size_category = udf(lambda x: "Large" if x > 4500 else "Small", StringType())
df.withColumn("size_class", size_category(df.body_mass_g)).show()

 

7. Pivot tables

Pivot tables transform the categories within a column, such as sex, into multiple columns describing aggregations or summary statistics for each category. Below we create a pivot table that, for each species, calculates and displays the average bill length separately for each sex:

df.groupBy("species").pivot("sex").agg({"bill_length_mm": "avg"}).show()

 

8. Handling missing values

The na.fill and dropna functions can be used to impute missing values or to remove rows containing them. They can also be used jointly if we want to impute the missing values of one specific column and then drop any rows that still contain missing values in other columns:

df.na.fill({"sex": "Unknown"}).dropna().show()
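
Before choosing between imputation and dropping, it is worth counting how many values are actually missing in each column; one common pattern, which also treats NaN in float columns (as produced by the Pandas conversion) as missing:

from pyspark.sql.functions import col, count, isnan, when

# Count missing entries per column, treating both NULL and NaN as missing
exprs = []
for c, t in df.dtypes:
    missing = col(c).isNull()
    if t in ("float", "double"):
        missing = missing | isnan(col(c))
    exprs.append(count(when(missing, True)).alias(c))

df.select(exprs).show()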

 

9. Save a processed dataset

Of course, once we have applied some processing operations to the dataset, we can save it in various formats such as Parquet:

large_penguins = df.filter(df.body_mass_g > 4500)
large_penguins.write.mode("overwrite").parquet("large_penguins.parquet")
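
Reading the result back is symmetric via spark.read.parquet; for larger datasets, partitioning the output by a categorical column is a common option (sketched below with species and an illustrative output path):

# Read the saved dataset back
reloaded = spark.read.parquet("large_penguins.parquet")
reloaded.show(5)

# Optionally, partition the output by a categorical column (illustrative path)
large_penguins.write.mode("overwrite").partitionBy("species").parquet("large_penguins_by_species.parquet")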

 

10. Perform SQL queries

The last command in this list allows you to embed SQL queries into a PySpark command and run them on temporary views of Spark DataFrames. This approach combines the flexibility of SQL with PySpark's distributed processing.

df.createOrReplaceTempView("penguins")
spark.sql("""
    SELECT species, island, 
           AVG(body_mass_g) as avg_mass,
           AVG(flipper_length_mm) as avg_flipper
    FROM penguins 
    GROUP BY species, island
    ORDER BY avg_mass DESC
""").show()

 

We highly recommend trying out these 10 example commands one by one in a Jupyter or Google Colab notebook so that you can see the resulting outputs. The dataset is publicly hosted and can be loaded directly from the URL used in the setup code.
 
 

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.

