How Broadcast Joins Work in Spark (And When You Should Use Them)
Question: Explain how broadcast joins work in Spark.
When you’re dealing with big data in Spark, joins often become one of your biggest performance bottlenecks.
Every traditional join (like an inner join between two tables) requires a shuffle — where Spark moves massive chunks of data across the network to group related rows together. Shuffling is slow and expensive.
But what if you could skip all that messy data movement entirely?
That’s where broadcast joins come in.
What Is a Broadcast Join?
A broadcast join is a clever optimization where Spark copies a small table to every node in your cluster. This way, every worker node has a full copy of the smaller table, and can locally match its partitions of the large table without needing to shuffle any data across the network.
In simple terms:
Spark thinks: "This table is small enough — let me email a copy to all my friends instead of everyone meeting up in one place."
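The mechanics behind the analogy can be sketched in plain Python, with no Spark required: build a hash map from the small table once, give every "worker" a full copy, and let each partition of the large table probe it locally. All names here (`lookup`, `large_table_partitions`, and the sample rows) are illustrative, not Spark internals.

```python
# A minimal, Spark-free sketch of broadcast hash-join mechanics.
# The small table becomes a hash map; every "worker" gets a full copy
# and probes it against its own partition of the large table.

small_table = [("US", "United States"), ("DE", "Germany"), ("JP", "Japan")]

# Step 1: build the hash map once (Spark does this on the driver).
lookup = dict(small_table)  # country code -> country name

# Step 2: the large table lives as partitions spread across workers.
large_table_partitions = [
    [("alice", "US"), ("bob", "DE")],   # partition on worker 1
    [("carol", "JP"), ("dave", "US")],  # partition on worker 2
]

# Step 3: each worker joins its own partition locally -- no shuffle,
# because the full lookup table is already present on every worker.
def join_partition(partition, broadcast_lookup):
    return [
        (name, code, broadcast_lookup[code])
        for name, code in partition
        if code in broadcast_lookup  # inner-join semantics
    ]

result = [
    row
    for partition in large_table_partitions
    for row in join_partition(partition, lookup)
]
print(result)
```

The key property: no row of the large table ever moves. Only the small table travels, once per worker.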
Real-World Example
Imagine you have:
A Customers dataset with 100 million rows
A Country Codes lookup table with just 200 rows
You need to join these two datasets to add country names to customer profiles.
You could run a normal join:
customers.join(country_codes, customers.country_code == country_codes.code)
But this triggers a heavy shuffle.
Instead, by using a broadcast join, you tell Spark:
from pyspark.sql.functions import broadcast
customers.join(broadcast(country_codes), customers.country_code == country_codes.code)
Spark then broadcasts the small country_codes table to every node, and each partition of the customers table can perform the join locally without any shuffling at all.
How Spark Chooses to Broadcast Automatically
By default, Spark automatically attempts a broadcast join if the smaller table is under 10 MB (controlled by the spark.sql.autoBroadcastJoinThreshold setting).
If needed, you can adjust the threshold:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 20 * 1024 * 1024) # 20MB
Or, you can manually force a broadcast join using the broadcast() function, as in the earlier example.
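You can also inspect the current threshold, or disable automatic broadcasting entirely by setting it to -1. A small config fragment (assuming an existing SparkSession bound to `spark`):

```python
# Inspect the current threshold (the default is "10485760", i.e. 10 MB).
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

# Disable automatic broadcasting entirely -- handy when a table's
# on-disk size underestimates its in-memory footprint and Spark's
# automatic choice causes memory pressure.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```

With the threshold at -1, only explicit broadcast() hints will trigger a broadcast join.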
Advantages of Broadcast Joins
✅ No shuffle needed → Saves time and network I/O
✅ Massively faster for joins involving small tables
✅ Ideal for lookup tables (like countries, states, categories)
When NOT to Use Broadcast Joins
Be cautious about using broadcast joins in the following scenarios:
❗ Table is Too Large
If the "small" table isn’t actually small (say >10MB or so), broadcasting it can cause workers to run out of memory, leading to crashes.
❗ Data Skew in the Large Table
If the large table’s partitions are heavily imbalanced, a broadcast join won’t fix that: each partition is joined in place, so oversized partitions still take far longer than the rest and become stragglers.
❗ Memory Duplication Overhead
Every worker gets a full copy of the broadcasted table. If you have many nodes or many concurrent joins happening, the memory cost adds up fast.
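A quick back-of-envelope calculation makes the duplication cost concrete (the figures below are illustrative, not measurements):

```python
# Back-of-envelope cost of broadcasting: every executor holds a full
# copy of the broadcast table, so cluster-wide memory scales linearly
# with executor count. Figures are illustrative.

broadcast_size_mb = 100   # in-memory size of the "small" table
num_executors = 50        # executors in the cluster

total_cost_mb = broadcast_size_mb * num_executors
print(f"Cluster-wide copy cost: {total_cost_mb} MB")
```

A 100 MB table on 50 executors costs roughly 5 GB of cluster memory for that one broadcast, which is why the same table that broadcasts happily on a 5-node cluster can cause pressure on a 500-node one.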
Final Thoughts
Broadcast joins are one of the easiest and most powerful tricks you can use to supercharge your Spark jobs — if you know when to apply them.
Here’s a quick decision tree:
✅ Is one table small (<10MB or <100k rows)?
✅ Can all workers hold a copy in memory?
✅ Are you trying to speed up a join that’s bottlenecked by shuffles?
If yes to all, broadcast it.
Your Spark jobs (and your cloud bill) will thank you.