Prep #01: Efficiently Processing 300 Million Rows with PySpark
We’ve got a dataset with 300 M rows. What’s your approach to processing it in PySpark so it’s both readable and fast?
Tackling a 300-million-row dataset can feel daunting. You need code that your teammates can read, and pipelines that finish in minutes instead of hours. That's exactly where PySpark's DataFrame API shines. In this post, I'll walk you through the key techniques: explicit schemas, modular transformations, smart partitioning, and more, to build data pipelines that are both crystal-clear and lightning-fast.
Start with Schema & Storage
```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType

schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("event_time", TimestampType(), True),
    StructField("action", StringType(), True),
    StructField("value", IntegerType(), True),
])

df = (spark.read
      .schema(schema)
      .parquet("/data/events/"))
```

Why explicit schema? It skips the heavy "infer schema" phase and ensures data types are correct from the get-go.
Why Parquet? Columnar storage means Spark reads only the needed columns and leverages predicate pushdown.
Write Modular, Declarative Steps
Break your pipeline into named transformations. Example:
```python
from pyspark.sql.functions import broadcast

def filter_recent(df):
    return df.filter(df.event_time >= "2024-01-01")

def enrich_with_user(df, user_df):
    return df.join(broadcast(user_df), "user_id")

cleaned = filter_recent(df)
users = spark.read.json("/data/users/")
enriched = enrich_with_user(cleaned, users)
```

Benefit: Each function does one thing; you can unit-test them in isolation.
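To see why single-purpose steps pay off, here is a minimal plain-Python sketch (the `pipeline` helper and the list-of-dicts stand-ins are illustrative, not part of PySpark) of how named transformations compose and can be unit-tested without a cluster:

```python
from functools import reduce

def pipeline(data, *steps):
    """Apply named transformation steps in order, mirroring chained DataFrame calls."""
    return reduce(lambda acc, step: step(acc), steps, data)

# plain-Python stand-ins so each step can be tested in isolation
def drop_null_actions(rows):
    return [r for r in rows if r.get("action") is not None]

def keep_positive_values(rows):
    return [r for r in rows if r.get("value", 0) > 0]

rows = [
    {"action": "click", "value": 3},
    {"action": None, "value": 5},
    {"action": "view", "value": -1},
]
result = pipeline(rows, drop_null_actions, keep_positive_values)
# result == [{"action": "click", "value": 3}]
```

The same shape carries over to Spark: each step takes a DataFrame and returns a DataFrame, so steps can be swapped, reordered, and tested independently.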
Smart Partitioning
If your data is naturally keyed by date or region, read it as such:
```python
events = spark.read.parquet("/data/events/year=2024/month=04/")
```

When you need to rebalance before a heavy shuffle:

```python
balanced = enriched.repartition(200, "region")
```

And before writing the final output, if you want fewer files:

```python
final = balanced.coalesce(20)
final.write.mode("overwrite").parquet("/output/results/")
```

`repartition()` triggers a full shuffle to redistribute data; use it when upstream partitions are skewed or too few. `coalesce()` collapses partitions without a full shuffle; ideal when you just need to reduce file count.
Broadcast Joins for Tiny Tables
```python
from pyspark.sql.functions import broadcast

country_df = spark.read.csv("/data/countries.csv", header=True)
joined = cleaned.join(broadcast(country_df), on="country_code")
```

By broadcasting, Spark ships the small `country_df` to each executor and avoids a costly shuffle; massive savings when your lookup table is small.
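Spark can also broadcast automatically when a table falls under the size limit in `spark.sql.autoBroadcastJoinThreshold` (10 MB by default). A sketch of tuning it, assuming an active SparkSession named `spark`:

```python
# raise the auto-broadcast cutoff to 50 MB so slightly larger lookup tables still broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# or set it to -1 to disable auto-broadcast and rely on explicit broadcast() hints only
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```

Explicit `broadcast()` hints are still useful when Spark's size estimate for a table is off.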
Favor Built-Ins Over UDFs
Bad:
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def compute_length(s):
    return len(s)

df.withColumn("len", compute_length(df.action))
```

Better:

```python
from pyspark.sql.functions import length

df.withColumn("len", length("action"))
```

Built-ins like `length`, `substring`, and `date_add` are implemented in Scala and execute much faster than Python UDFs. If your logic truly isn't supported natively, reach for a Pandas UDF for vectorized execution.

Cache When It Counts
```python
filtered = cleaned.filter("value > 0").cache()
# multiple downstream operations reuse `filtered`
```

Only cache if you reuse a DataFrame more than once, and remember to `unpersist()` once you're done.
Monitor & Tune
Spark UI: Look at the “SQL” tab to see which stages dominate runtime.
Skew: If one partition takes much longer, try salting keys or increasing shuffle partitions.
spark.sql.shuffle.partitions: The default of 200 may not fit 300 M rows; experiment with 500, 1000, or more, depending on your cluster.
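As a concrete (illustrative) starting point for the tuning above, assuming an active SparkSession named `spark`:

```python
# bump shuffle parallelism before the heavy joins and aggregations run
spark.conf.set("spark.sql.shuffle.partitions", "1000")

# on Spark 3.x, adaptive query execution can coalesce shuffle partitions for you
spark.conf.set("spark.sql.adaptive.enabled", "true")
```

With adaptive query execution on, Spark resizes shuffle partitions at runtime, which softens the cost of picking a wrong static value.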
Real-World Edge Cases
Data Skew: If 90% of your data lands in one key, that executor becomes a hotspot. Consider adding a random “salt” to the join key and then removing it post-join.
Nulls & Corrupt Records: Always handle them early: either filter them out or route them to a "dirty data" table.
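To make the salting idea concrete, here is a minimal plain-Python sketch of the mechanics (the helper names and the salt factor `SALT_BUCKETS` are illustrative): the skewed side gets a random bucket suffix, and the small side is replicated once per bucket so the salted keys still match.

```python
import random

SALT_BUCKETS = 8  # illustrative salt factor; tune to the observed skew

def salt_key(key):
    """Spread one hot key across SALT_BUCKETS distinct join keys."""
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

def replicate_small_side(rows):
    """Emit each (key, value) once per salt bucket so every salted key finds a match."""
    return [(f"{key}#{i}", value) for key, value in rows for i in range(SALT_BUCKETS)]

salted = salt_key("hot_user")                      # e.g. "hot_user#3"
lookup = replicate_small_side([("hot_user", "US")])  # 8 entries, one per bucket
```

After the join, strip the `#n` suffix to recover the original key. In PySpark the same idea is expressed with `concat`, `lit`, and `rand` on the join columns.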
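The dirty-data routing can also be sketched in plain Python (the `split_dirty` helper and the required-field names are illustrative; in PySpark this would be two complementary `filter()` calls on null checks):

```python
def split_dirty(rows, required=("user_id", "event_time")):
    """Partition rows into (clean, dirty) based on required fields being present."""
    clean, dirty = [], []
    for row in rows:
        target = clean if all(row.get(k) is not None for k in required) else dirty
        target.append(row)
    return clean, dirty

rows = [
    {"user_id": "u1", "event_time": "2024-01-01T00:00:00"},
    {"user_id": None, "event_time": "2024-01-02T00:00:00"},
]
clean, dirty = split_dirty(rows)
# clean holds the first row; dirty holds the row with the null user_id
```

Writing the dirty bucket to its own table keeps bad records inspectable instead of silently dropped.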
Performance Win Example
On one pipeline, switching a join from standard to broadcast cut runtime from 45 minutes to 7 minutes—a 6× improvement. Similarly, replacing three Python UDFs with built-ins and a single Pandas UDF saved another 10 minutes.
Conclusion
By leaning on PySpark’s declarative API, explicit schemas, smart partitioning, and strategic use of caching and broadcast joins, you can process hundreds of millions of rows both readably and rapidly. Monitor your jobs in the Spark UI, tune shuffle settings, and always prefer built-in functions over UDFs. With this toolkit, scaling to big data becomes a matter of sound engineering, not guesswork.


