DE Prep
DELulu
🚀 Prep #24 - Slowly Changing Dimensions & Idempotency
0:00
-19:28

🚀 Prep #24 - Slowly Changing Dimensions & Idempotency

Week 1 - Day 2 of Zach's DE BootCamp (DataExpert.io)!

🚨 Data déjà vu: When yesterday’s job betrays today’s dashboards

You hit re-run, expecting a tidy backfill. Instead, numbers shift, exec reports disagree, and Slack fills with “What happened?!”
Nine times out of ten, the culprit is the same: non-idempotent pipelines coupled with poorly modeled Slowly Changing Dimensions (SCDs).

This recap of Day 2 in Zach Wilson’s DE BootCamp will save you the migraines he once suffered (one bug even drove him to quit Facebook). Let’s dig in. ☕️

Thanks for reading DE Prep! Subscribe for free to receive new posts and support my work.


1️⃣ The Real Villain: Non-Idempotency

“A healthy pipeline produces the same output no matter when—or how often—you run it.” – Zach

Break that rule and you invite:

  • Duplicate rows (think INSERT INTO without a TRUNCATE)

  • Ever-expanding time windows (WHERE date > yesterday)

  • Jobs that fire before all inputs land (missing partition sensors)

  • Downstream chaos: every child table inherits the flaw

Result? Silent failures, analyst distrust, endless reconciliation meetings—and maybe a resignation letter.


2️⃣ SCDs in Plain English

A Slowly Changing Dimension tracks attributes that drift over time—your device, your favorite food, even your home country.

Four flavors matter most:

  • Type 0 – The rock-solid facts (e.g., birthday). No time columns, fully idempotent.

  • Type 1 – “Just overwrite the old value.” Great for OLTP apps, a nightmare for analytics. Non-idempotent.

  • Type 2 – Keeps full history with start_date and end_date (or an is_current flag). Zach calls this “the gold standard” because it preserves truth and stays idempotent.

  • Type 3 – Stores only the original and current value. Loses history after the first change; barely better than Type 1.

Takeaway: Unless a value truly never changes (Type 0), reach for Type 2. Anything else courts future pain.


3️⃣ Loading Type 2 the Smart Way

  • Big-Bang Rebuild – Recompute the entire table each day. Simple to code, costly at scale.

  • Incremental Merge – Pull yesterday’s dimension, layer today’s changes, merge, done. Faster and friendlier to compute budgets.

Remember Zach’s mantra: “Not every pipeline needs to be a Ferrari. Optimize only when business value outweighs engineering time.”


4️⃣ Zach’s Facebook Horror Story

An upstream team chased data freshness by always pulling the “latest” partition.
During backfills, that partition flipped between “yesterday” and “today.”
Dashboards for fake-account detection contradicted each other.
Months of debugging later, Zach walked away from the company.

Moral: Shortcuts to lower latency are never worth the downstream chaos.


5️⃣ Battle-Tested Best Practices

  • Replace INSERT INTO with MERGE or INSERT OVERWRITE.

  • Bound every query with explicit start and end dates.

  • Gate DAGs on all required partitions.

  • Never rely on “latest” values in production loads.

  • Default to SCD Type 2 for anything that can change, even rarely.


Infographics


👉 Enjoyed this breakdown of Day 2 from Zach’s DE BootCamp? Follow DE Prep for detailed walkthroughs of each day — packed with real-world takeaways, modeling strategies, and performance tips you can actually use.

You can explore all BootCamp recaps and DE Preps in one place at DEtermined


Thanks for reading DE Prep! This post is public so feel free to share it.

Share

Discussion about this episode

User's avatar