Prep #04 - What Happens if Your ETL Pipeline Fails Mid-Run? How to Recover Safely
Failures are inevitable. But data loss, duplication, and corruption? Those don’t have to be.
Today’s DE Prep dives into a deceptively simple but under-discussed interview question:
❓ “What happens if your ETL pipeline fails mid-run? How do you recover safely?”
Whether you're building data pipelines for a Fortune 500 or a startup, how you recover from failure often matters more than trying to prevent every failure.
Let’s explore how production-grade pipelines are designed to fail gracefully and recover reliably.
🧨 The Fallout of a Mid-Run Failure
When an ETL pipeline breaks halfway through execution, you’re left in a state of partial truth. Here’s what typically happens:
1. Partial Writes
Imagine a pipeline that ingests from Kafka → transforms in Spark → writes to Snowflake. If it fails during the write step, you might:
Have written only part of the day’s records
Inserted duplicates if retries aren’t handled correctly
Left the table in an inconsistent, partially loaded state
2. Downstream Chaos
If your dashboards, alerts, or ML models depend on the freshness of this pipeline:
They might silently operate on stale or incomplete data
This can lead to erroneous business decisions
3. Broken Dependencies
Multi-stage pipelines (e.g., raw → staging → curated → analytics) often pass watermarks downstream. A mid-run failure can cause:
Skewed or missing aggregations
Misaligned partitions
✅ Designing for Safe Recovery
Here’s how seasoned data engineers build ETL pipelines that can recover quickly and safely.
🔁 1. Idempotency is Non-Negotiable
Ensure your ETL operations are idempotent, meaning:
Reprocessing the same data does not cause duplication or corruption.
How?
Use upserts (MERGE) instead of blind INSERTs
Deduplicate using unique keys (e.g., UUIDs, event_id); a PySpark sketch of this follows the MERGE example below
Truncate-and-load for batch partitions
-- Snowflake example
MERGE INTO users u
USING staging_users s
ON u.user_id = s.user_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
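For the dedup-by-unique-key bullet above, here is a minimal PySpark sketch (the staging path, target table, and event_id column are illustrative, not from the original): keep one row per key and anti-join away anything already loaded, so rerunning the same batch appends nothing twice.
# Minimal PySpark dedup sketch; path, table, and column names are illustrative
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent_append").getOrCreate()

# Today's batch from a hypothetical staging location
batch = spark.read.parquet("s3://my-bucket/staging/events/")

# Keys that have already been loaded into the target table
existing = spark.table("analytics.events").select("event_id")

# Drop in-batch duplicates, then anti-join away anything already loaded,
# so a retried run cannot double-insert the same events
new_rows = (
    batch.dropDuplicates(["event_id"])
         .join(existing, on="event_id", how="left_anti")
)

new_rows.write.mode("append").saveAsTable("analytics.events")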
🧭 2. Track Watermarks
Store high watermarks (e.g., last processed timestamp, batch ID) to know where to resume. This ensures:
No overlap during retries
No missed records
# Example: resume from the stored watermark
# (metadata_store is a stand-in for your metadata table or key-value store)
from pyspark.sql.functions import col

last_timestamp = metadata_store.get("last_processed_ts")
new_data = df.filter(col("event_time") > last_timestamp)
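Just as important: advance the watermark only after the write succeeds, so a failed run resumes from the old value. A minimal continuation of the snippet above, where metadata_store.set and the analytics.events target are assumptions for illustration:
# Advance the watermark only after a successful write, so a mid-run failure
# leaves the old value in place and the next run picks up from it
from pyspark.sql.functions import max as spark_max

if not new_data.rdd.isEmpty():
    new_data.write.mode("append").saveAsTable("analytics.events")   # illustrative target
    new_max_ts = new_data.agg(spark_max("event_time")).collect()[0][0]
    metadata_store.set("last_processed_ts", new_max_ts)             # hypothetical store API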
🧱 3. Use Checkpoints or Intermediate Tables
Break your pipeline into discrete stages:
Raw ingestion → Staging → Final table
If something fails in the final write, staging data is preserved for debugging or retry
Tools like Airflow, Dagster, and Prefect support this natively.
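As a rough illustration of those discrete stages, here is a minimal Airflow 2.x sketch; the DAG id and the three callables are placeholders for your actual ingestion, transformation, and load logic:
# Minimal Airflow 2.x sketch of a staged pipeline; the callables are placeholders
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_raw():
    pass   # land source data in raw storage, unchanged

def build_staging():
    pass   # transform raw data into a staging table

def load_final():
    pass   # publish staging data to the final table

with DAG(
    dag_id="clickstream_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    raw = PythonOperator(task_id="ingest_raw", python_callable=ingest_raw)
    staging = PythonOperator(task_id="build_staging", python_callable=build_staging)
    final = PythonOperator(task_id="load_final", python_callable=load_final)

    # If load_final fails, the raw and staging outputs survive,
    # and only the failed task needs to be cleared and re-run
    raw >> staging >> final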
🕵️ 4. Use Transactions (When Possible)
If you’re working with databases that support atomic transactions, wrap your inserts or merges in transactions:
If it fails, the entire batch rolls back
Ensures all-or-nothing behavior
BEGIN;
INSERT INTO sales SELECT * FROM temp_sales;
-- Only commit if everything worked
COMMIT;
If you’re using a distributed engine like Spark or Hive, multi-statement transactions are harder to come by. That’s where partition overwrite strategies help, as in the sketch below.
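A hedged PySpark sketch of that idea: with dynamic partition overwrite, a rerun replaces the affected partition instead of appending duplicates. The paths and table names are illustrative, and the target table is assumed to already exist and be partitioned by dt.
# Dynamic partition overwrite: a rerun rewrites the same partition, not the whole table
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_overwrite").getOrCreate()

# Only overwrite the partitions present in this batch
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Illustrative staging path; keep the dt column so Spark knows which partition to replace
daily = (
    spark.read.parquet("s3://my-bucket/staging/sales/")
         .where("dt = '2024-01-01'")
)

# Rerunning this job replaces the day's partition rather than duplicating it
daily.write.mode("overwrite").insertInto("analytics.sales")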
🚨 5. Add Monitoring and Alerting
Don’t wait for users to report broken dashboards. Add:
Task-level monitoring (e.g., Airflow sensors)
Data freshness checks (e.g., Great Expectations or dbt tests)
Slack or email alerts on failure
if job_failed:
    send_alert("ETL job failed at step X with error: ...")
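If Airflow is your orchestrator, a failure callback is a common place to hook this in. A hedged sketch, where send_alert is the same hypothetical helper as above:
# Hedged Airflow sketch: report every failed task with context
# (send_alert is the same hypothetical helper as above)
def notify_on_failure(context):
    ti = context["task_instance"]
    send_alert(
        f"ETL task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}: "
        f"{context.get('exception')}"
    )

# Attach it to every task in the DAG via default_args
default_args = {"on_failure_callback": notify_on_failure}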
🔁 6. Use Smart Retry Logic
Not all failures deserve a retry. Categorize them:
Transient errors (e.g., timeout, 500s) → Retry with exponential backoff
Data bugs or schema mismatches → Don’t retry, notify the team for manual intervention
Most orchestrators support this kind of customization out of the box; a framework-free sketch of the same split follows.
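To make that split concrete, here is a minimal sketch; the exception classes and the callables you pass in are placeholders for whatever your job actually raises and runs:
# Minimal retry sketch: back off on transient errors, fail fast on data bugs
# (TransientError and SchemaMismatchError are placeholder exception classes)
import time

class TransientError(Exception):
    pass

class SchemaMismatchError(Exception):
    pass

def run_with_retries(run_step, notify, max_attempts=4, base_delay=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return run_step()
        except TransientError:
            if attempt == max_attempts:
                raise                                     # give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))   # 5s, 10s, 20s, ...
        except SchemaMismatchError as err:
            notify(f"Schema mismatch, not retrying: {err}")
            raise                                         # needs a human, not a retry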
🛠 Real-World Scenario
Let’s say you're loading clickstream data to BigQuery via a daily batch job.
Step 1: Data lands in GCS (staging)
✅ Durable, repeatable, and unaffected by downstream failures
Step 2: Transformations applied via dbt
⛔️ Job fails due to schema change in source
Recovery Path:
The GCS staging file is untouched
Dev is notified via Slack
Schema fix is deployed, dbt run is retried
Thanks to dedup logic, rerunning is safe
📌 TL;DR – Failures Will Happen. Plan for Them.
To recover safely when ETL jobs fail:
✅ Make pipelines idempotent
✅ Use watermarks and audit logs
✅ Write to staging before final sinks
✅ Wrap writes in transactions (when possible)
✅ Monitor and alert with context
✅ Retry only when it makes sense
This mindset separates reliable pipelines from fragile ones and shows interviewers that you think like a production engineer.