Prep #04 - What Happens if Your ETL Pipeline Fails Mid-Run? How to Recover Safely
Failures are inevitable. But data loss, duplication, and corruption? Those don’t have to be.
Today’s DE Prep dives into a deceptively simple but under-discussed interview question:
❓ “What happens if your ETL pipeline fails mid-run? How do you recover safely?”
Whether you're building data pipelines for a Fortune 500 or a startup, how you recover from failure often matters more than trying to prevent every failure.
Let’s explore how production-grade pipelines are designed to fail gracefully and recover reliably.
🧨 The Fallout of a Mid-Run Failure
When an ETL pipeline breaks halfway through execution, you’re left in a state of partial truth. Here’s what typically happens:
1. Partial Writes
Imagine a pipeline that ingests from Kafka → transforms in Spark → writes to Snowflake. If it fails during the write step, you might:
Have written only part of the day’s records
Inserted duplicates if retries aren’t handled correctly
Left the table in an inconsistent, partially loaded state
2. Downstream Chaos
If your dashboards, alerts, or ML models depend on the freshness of this pipeline:
They might silently operate on stale or incomplete data
This can lead to erroneous business decisions
3. Broken Dependencies
Multi-stage pipelines (e.g., raw → staging → curated → analytics) often pass watermarks downstream. A mid-run failure can cause:
Skewed or missing aggregations
Misaligned partitions
✅ Designing for Safe Recovery
Here’s how seasoned data engineers build ETL pipelines that can recover quickly and safely.
🔁 1. Idempotency is Non-Negotiable
Ensure your ETL operations are idempotent, meaning:
Reprocessing the same data does not cause duplication or corruption.
How?
Use upserts (MERGE) instead of blind INSERTs
Deduplicate using unique keys (e.g., UUIDs, event_id); a PySpark sketch of this follows the MERGE example below
Truncate-and-load for batch partitions
-- Snowflake example
MERGE INTO users u
USING staging_users s
ON u.user_id = s.user_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...
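For the dedup-by-unique-key bullet above, here is a minimal PySpark sketch (the staging path, target table, and event_id column are illustrative, not from the original): keep one row per key and anti-join away anything already loaded, so rerunning the same batch appends nothing twice.
# Minimal PySpark dedup sketch; path, table, and column names are illustrative
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idempotent_append").getOrCreate()

# Today's batch from a hypothetical staging location
batch = spark.read.parquet("s3://my-bucket/staging/events/")

# Keys that have already been loaded into the target table
existing = spark.table("analytics.events").select("event_id")

# Drop in-batch duplicates, then anti-join away anything already loaded,
# so a retried run cannot double-insert the same events
new_rows = (
    batch.dropDuplicates(["event_id"])
         .join(existing, on="event_id", how="left_anti")
)

new_rows.write.mode("append").saveAsTable("analytics.events")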
🧭 2. Track Watermarks
Store high watermarks (e.g., last processed timestamp, batch ID) to know where to resume. This ensures:
No overlap during retries
No missed records
# Example: resume from the stored watermark
# (metadata_store is a stand-in for your metadata table or key-value store)
from pyspark.sql.functions import col

last_timestamp = metadata_store.get("last_processed_ts")
new_data = df.filter(col("event_time") > last_timestamp)
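Just as important: advance the watermark only after the write succeeds, so a failed run resumes from the old value. A minimal continuation of the snippet above, where metadata_store.set and the analytics.events target are assumptions for illustration:
# Advance the watermark only after a successful write, so a mid-run failure
# leaves the old value in place and the next run picks up from it
from pyspark.sql.functions import max as spark_max

if not new_data.rdd.isEmpty():
    new_data.write.mode("append").saveAsTable("analytics.events")   # illustrative target
    new_max_ts = new_data.agg(spark_max("event_time")).collect()[0][0]
    metadata_store.set("last_processed_ts", new_max_ts)             # hypothetical store API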
🧱 3. Use Checkpoints or Intermediate Tables
Break your pipeline into discrete stages:
Raw ingestion → Staging → Final table
If something fails in the final write, staging data is preserved for debugging or retry
Tools like Airflow, Dagster, and Prefect support this natively.
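As a rough illustration of those discrete stages, here is a minimal Airflow 2.x sketch; the DAG id and the three callables are placeholders for your actual ingestion, transformation, and load logic:
# Minimal Airflow 2.x sketch of a staged pipeline; the callables are placeholders
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_raw():
    pass   # land source data in raw storage, unchanged

def build_staging():
    pass   # transform raw data into a staging table

def load_final():
    pass   # publish staging data to the final table

with DAG(
    dag_id="clickstream_daily",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    raw = PythonOperator(task_id="ingest_raw", python_callable=ingest_raw)
    staging = PythonOperator(task_id="build_staging", python_callable=build_staging)
    final = PythonOperator(task_id="load_final", python_callable=load_final)

    # If load_final fails, the raw and staging outputs survive,
    # and only the failed task needs to be cleared and re-run
    raw >> staging >> final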
🕵️ 4. Use Transactions (When Possible)
If you’re working with databases that support atomic transactions, wrap your inserts or merges in transactions:
If it fails, the entire batch rolls back
Ensures all-or-nothing behavior
BEGIN;
INSERT INTO sales SELECT * FROM temp_sales;
-- Only commit if everything worked
COMMIT;
If you’re using a distributed engine like Spark or Hive, multi-statement transactions are harder to come by. That’s where partition overwrite strategies help, as in the sketch below.
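A hedged PySpark sketch of that idea: with dynamic partition overwrite, a rerun replaces the affected partition instead of appending duplicates. The paths and table names are illustrative, and the target table is assumed to already exist and be partitioned by dt.
# Dynamic partition overwrite: a rerun rewrites the same partition, not the whole table
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition_overwrite").getOrCreate()

# Only overwrite the partitions present in this batch
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# Illustrative staging path; keep the dt column so Spark knows which partition to replace
daily = (
    spark.read.parquet("s3://my-bucket/staging/sales/")
         .where("dt = '2024-01-01'")
)

# Rerunning this job replaces the day's partition rather than duplicating it
daily.write.mode("overwrite").insertInto("analytics.sales")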
🚨 5. Add Monitoring and Alerting
Don’t wait for users to report broken dashboards. Add:
Task-level monitoring (e.g., Airflow sensors)
Data freshness checks (e.g., Great Expectations or dbt tests)
Slack or email alerts on failure
if job_failed:
    send_alert("ETL job failed at step X with error: ...")
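If Airflow is your orchestrator, a failure callback is a common place to hook this in. A hedged sketch, where send_alert is the same hypothetical helper as above:
# Hedged Airflow sketch: report every failed task with context
# (send_alert is the same hypothetical helper as above)
def notify_on_failure(context):
    ti = context["task_instance"]
    send_alert(
        f"ETL task {ti.task_id} in DAG {ti.dag_id} failed on {context['ds']}: "
        f"{context.get('exception')}"
    )

# Attach it to every task in the DAG via default_args
default_args = {"on_failure_callback": notify_on_failure}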
🔁 6. Use Smart Retry Logic
Not all failures deserve a retry. Categorize them:
Transient errors (e.g., timeout, 500s) → Retry with exponential backoff
Data bugs or schema mismatches → Don’t retry, notify the team for manual intervention
Most orchestrators support this kind of customization out of the box; a framework-free sketch of the same split follows.
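To make that split concrete, here is a minimal sketch; the exception classes and the callables you pass in are placeholders for whatever your job actually raises and runs:
# Minimal retry sketch: back off on transient errors, fail fast on data bugs
# (TransientError and SchemaMismatchError are placeholder exception classes)
import time

class TransientError(Exception):
    pass

class SchemaMismatchError(Exception):
    pass

def run_with_retries(run_step, notify, max_attempts=4, base_delay=5):
    for attempt in range(1, max_attempts + 1):
        try:
            return run_step()
        except TransientError:
            if attempt == max_attempts:
                raise                                     # give up after the last attempt
            time.sleep(base_delay * 2 ** (attempt - 1))   # 5s, 10s, 20s, ...
        except SchemaMismatchError as err:
            notify(f"Schema mismatch, not retrying: {err}")
            raise                                         # needs a human, not a retry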
🛠 Real-World Scenario
Let’s say you're loading clickstream data to BigQuery via a daily batch job.
Step 1: Data lands in GCS (staging)
✅ Durable, repeatable, and unaffected by downstream failures
Step 2: Transformations applied via dbt
⛔️ Job fails due to schema change in source
Recovery Path:
The GCS staging file is untouched
Dev is notified via Slack
Schema fix is deployed, dbt run is retried
Thanks to dedup logic, rerunning is safe
📌 TL;DR – Failures Will Happen. Plan for Them.
To recover safely when ETL jobs fail:
✅ Make pipelines idempotent
✅ Use watermarks and audit logs
✅ Write to staging before final sinks
✅ Wrap writes in transactions (when possible)
✅ Monitor and alert with context
✅ Retry only when it makes sense
This mindset separates reliable pipelines from fragile ones and shows interviewers that you think like a production engineer.