Prep #01: Efficiently Processing 300 Million Rows with PySpark
We’ve got a dataset with 300 M rows. What’s your approach to processing it in PySpark so it’s both readable and fast?
Tackling a 300-million-row dataset can feel daunting. You need code that your teammates can read, and pipelines that finish in minutes instead of hours. That's exactly where PySpark's DataFrame API shines. In this post, I'll walk you through the key techniques: explicit schemas, modular transformations, smart partitioning, and more, to build data pipelines that are both crystal-clear and lightning-fast.
Start with Schema & Storage
```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, IntegerType

schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("event_time", TimestampType(), True),
    StructField("action", StringType(), True),
    StructField("value", IntegerType(), True),
])

df = (spark.read
      .schema(schema)
      .parquet("/data/events/"))
```

Why explicit schema? It skips the heavy "infer schema" phase and ensures data types are correct from the get-go.
Why Parquet? Columnar storage means Spark reads only the needed columns and leverages predicate pushdown.
Write Modular, Declarative Steps
Break your pipeline into named transformations. Example:
```python
from pyspark.sql.functions import broadcast

def filter_recent(df):
    return df.filter(df.event_time >= "2024-01-01")

def enrich_with_user(df, user_df):
    return df.join(broadcast(user_df), "user_id")

cleaned = filter_recent(df)
users = spark.read.json("/data/users/")
enriched = enrich_with_user(cleaned, users)
```

Benefit: Each function does one thing; you can unit-test them in isolation.
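To see why single-purpose steps pay off, here is a minimal plain-Python sketch (the `pipeline` helper and the list-of-dicts stand-ins are illustrative, not part of PySpark) of how named transformations compose and can be unit-tested without a cluster:

```python
from functools import reduce

def pipeline(data, *steps):
    """Apply named transformation steps in order, mirroring chained DataFrame calls."""
    return reduce(lambda acc, step: step(acc), steps, data)

# plain-Python stand-ins so each step can be tested in isolation
def drop_null_actions(rows):
    return [r for r in rows if r.get("action") is not None]

def keep_positive_values(rows):
    return [r for r in rows if r.get("value", 0) > 0]

rows = [
    {"action": "click", "value": 3},
    {"action": None, "value": 5},
    {"action": "view", "value": -1},
]
result = pipeline(rows, drop_null_actions, keep_positive_values)
# result == [{"action": "click", "value": 3}]
```

The same shape carries over to Spark: each step takes a DataFrame and returns a DataFrame, so steps can be swapped, reordered, and tested independently.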
Smart Partitioning
If your data is naturally keyed by date or region, read it as such:
```python
events = spark.read.parquet("/data/events/year=2024/month=04/")
```

When you need to rebalance before a heavy shuffle:

```python
balanced = enriched.repartition(200, "region")
```

And before writing the final output, if you want fewer files:

```python
final = balanced.coalesce(20)
final.write.mode("overwrite").parquet("/output/results/")
```

`repartition()` triggers a full shuffle to redistribute data; use it when upstream partitions are skewed or too few. `coalesce()` collapses partitions without a full shuffle; ideal when you just need to reduce file count.
Broadcast Joins for Tiny Tables
```python
from pyspark.sql.functions import broadcast

country_df = spark.read.csv("/data/countries.csv", header=True)
joined = cleaned.join(broadcast(country_df), on="country_code")
```

By broadcasting, Spark ships the small `country_df` to each executor and avoids a costly shuffle; massive savings when your lookup table is small.
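Spark can also broadcast automatically when a table falls under the size limit in `spark.sql.autoBroadcastJoinThreshold` (10 MB by default). A sketch of tuning it, assuming an active SparkSession named `spark`:

```python
# raise the auto-broadcast cutoff to 50 MB so slightly larger lookup tables still broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))

# or set it to -1 to disable auto-broadcast and rely on explicit broadcast() hints only
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
```

Explicit `broadcast()` hints are still useful when Spark's size estimate for a table is off.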
Favor Built-Ins Over UDFs
Bad:
```python
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def compute_length(s):
    return len(s)

df.withColumn("len", compute_length(df.action))
```

Better:

```python
from pyspark.sql.functions import length

df.withColumn("len", length("action"))
```

Built-ins like `length`, `substring`, and `date_add` are implemented in Scala and execute much faster than Python UDFs. If your logic truly isn't supported natively, reach for a Pandas UDF for vectorized execution.

Cache When It Counts
```python
filtered = cleaned.filter("value > 0").cache()
# multiple downstream operations reuse `filtered`
```

Only cache if you reuse a DataFrame more than once, and remember to `unpersist()` once you're done.
Monitor & Tune
Spark UI: Look at the “SQL” tab to see which stages dominate runtime.
Skew: If one partition takes much longer, try salting keys or increasing shuffle partitions.
spark.sql.shuffle.partitions: The default of 200 may not fit 300 M rows; experiment with 500, 1000, or more, depending on your cluster.
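As a concrete (illustrative) starting point for the tuning above, assuming an active SparkSession named `spark`:

```python
# bump shuffle parallelism before the heavy joins and aggregations run
spark.conf.set("spark.sql.shuffle.partitions", "1000")

# on Spark 3.x, adaptive query execution can coalesce shuffle partitions for you
spark.conf.set("spark.sql.adaptive.enabled", "true")
```

With adaptive query execution on, Spark resizes shuffle partitions at runtime, which softens the cost of picking a wrong static value.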
Real-World Edge Cases
Data Skew: If 90% of your data lands in one key, that executor becomes a hotspot. Consider adding a random “salt” to the join key and then removing it post-join.
Nulls & Corrupt Records: Always handle them early: either filter them out or route them to a "dirty data" table.
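To make the salting idea concrete, here is a minimal plain-Python sketch of the mechanics (the helper names and the salt factor `SALT_BUCKETS` are illustrative): the skewed side gets a random bucket suffix, and the small side is replicated once per bucket so the salted keys still match.

```python
import random

SALT_BUCKETS = 8  # illustrative salt factor; tune to the observed skew

def salt_key(key):
    """Spread one hot key across SALT_BUCKETS distinct join keys."""
    return f"{key}#{random.randrange(SALT_BUCKETS)}"

def replicate_small_side(rows):
    """Emit each (key, value) once per salt bucket so every salted key finds a match."""
    return [(f"{key}#{i}", value) for key, value in rows for i in range(SALT_BUCKETS)]

salted = salt_key("hot_user")                      # e.g. "hot_user#3"
lookup = replicate_small_side([("hot_user", "US")])  # 8 entries, one per bucket
```

After the join, strip the `#n` suffix to recover the original key. In PySpark the same idea is expressed with `concat`, `lit`, and `rand` on the join columns.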
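The dirty-data routing can also be sketched in plain Python (the `split_dirty` helper and the required-field names are illustrative; in PySpark this would be two complementary `filter()` calls on null checks):

```python
def split_dirty(rows, required=("user_id", "event_time")):
    """Partition rows into (clean, dirty) based on required fields being present."""
    clean, dirty = [], []
    for row in rows:
        target = clean if all(row.get(k) is not None for k in required) else dirty
        target.append(row)
    return clean, dirty

rows = [
    {"user_id": "u1", "event_time": "2024-01-01T00:00:00"},
    {"user_id": None, "event_time": "2024-01-02T00:00:00"},
]
clean, dirty = split_dirty(rows)
# clean holds the first row; dirty holds the row with the null user_id
```

Writing the dirty bucket to its own table keeps bad records inspectable instead of silently dropped.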
Performance Win Example
On one pipeline, switching a join from standard to broadcast cut runtime from 45 minutes to 7 minutes—a 6× improvement. Similarly, replacing three Python UDFs with built-ins and a single Pandas UDF saved another 10 minutes.
Conclusion
By leaning on PySpark’s declarative API, explicit schemas, smart partitioning, and strategic use of caching and broadcast joins, you can process hundreds of millions of rows both readably and rapidly. Monitor your jobs in the Spark UI, tune shuffle settings, and always prefer built-in functions over UDFs. With this toolkit, scaling to big data becomes a matter of sound engineering, not guesswork.


