🚨 Data déjà vu: When yesterday’s job betrays today’s dashboards
You hit re-run, expecting a tidy backfill. Instead, numbers shift, exec reports disagree, and Slack fills with “What happened?!”
Nine times out of ten, the culprit is the same: non-idempotent pipelines coupled with poorly modeled Slowly Changing Dimensions (SCDs).
This recap of Day 2 in Zach Wilson’s DE BootCamp will save you the migraines he once suffered (one bug even drove him to quit Facebook). Let’s dig in. ☕️
1️⃣ The Real Villain: Non-Idempotency
“A healthy pipeline produces the same output no matter when—or how often—you run it.” – Zach
Break that rule and you invite:
Duplicate rows (think
INSERT INTO
without aTRUNCATE
)Ever-expanding time windows (
WHERE date > yesterday
)Jobs that fire before all inputs land (missing partition sensors)
Downstream chaos: every child table inherits the flaw
Result? Silent failures, analyst distrust, endless reconciliation meetings—and maybe a resignation letter.
2️⃣ SCDs in Plain English
A Slowly Changing Dimension tracks attributes that drift over time—your device, your favorite food, even your home country.
Four flavors matter most:
Type 0 – The rock-solid facts (e.g., birthday). No time columns, fully idempotent.
Type 1 – “Just overwrite the old value.” Great for OLTP apps, a nightmare for analytics. Non-idempotent.
Type 2 – Keeps full history with
start_date
andend_date
(or anis_current
flag). Zach calls this “the gold standard” because it preserves truth and stays idempotent.Type 3 – Stores only the original and current value. Loses history after the first change; barely better than Type 1.
Takeaway: Unless a value truly never changes (Type 0), reach for Type 2. Anything else courts future pain.
3️⃣ Loading Type 2 the Smart Way
Big-Bang Rebuild – Recompute the entire table each day. Simple to code, costly at scale.
Incremental Merge – Pull yesterday’s dimension, layer today’s changes, merge, done. Faster and friendlier to compute budgets.
Remember Zach’s mantra: “Not every pipeline needs to be a Ferrari. Optimize only when business value outweighs engineering time.”
4️⃣ Zach’s Facebook Horror Story
An upstream team chased data freshness by always pulling the “latest” partition.
During backfills, that partition flipped between “yesterday” and “today.”
Dashboards for fake-account detection contradicted each other.
Months of debugging later, Zach walked away from the company.
Moral: Shortcuts to lower latency are never worth the downstream chaos.
5️⃣ Battle-Tested Best Practices
Replace
INSERT INTO
withMERGE
orINSERT OVERWRITE
.Bound every query with explicit start and end dates.
Gate DAGs on all required partitions.
Never rely on “latest” values in production loads.
Default to SCD Type 2 for anything that can change, even rarely.
Infographics
👉 Enjoyed this breakdown of Day 2 from Zach’s DE BootCamp? Follow DE Prep for detailed walkthroughs of each day — packed with real-world takeaways, modeling strategies, and performance tips you can actually use.
You can explore all BootCamp recaps and DE Preps in one place at DEtermined
Share this post