Prep #03 - Handling Schema Evolution When Upstream Changes Happen
What’s your approach to schema evolution when changes happen in the source data?
When building modern data pipelines, one reality you can’t avoid is upstream schema changes. Columns get renamed, data types mutate, new fields pop up out of nowhere - and your clean, carefully orchestrated pipelines suddenly start throwing cryptic errors at 2 a.m.
So, how do experienced data engineers handle schema evolution gracefully?
Let me walk you through a real-world approach that could save you from countless mid-run disasters and brittle workflows.
🔍 Step 1: Detect Schema Changes Early
Before diving into any transformation, validate the schema of incoming data:
Schema Registry (e.g., Confluent) - to enforce contracts with Kafka topics or upstream producers.
Parquet/Avro schema metadata - for self-describing file formats, where the embedded schema can be read and checked cheaply.
Custom diffing logic - comparing new batch metadata against previous versions (for example, in S3 or GCS ingestion jobs); a minimal sketch follows below.
great_expectations or dbt tests - to assert schema expectations declaratively.
You don’t want to catch schema changes after your jobs fail. Catch them early and alert proactively.
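For the custom diffing route mentioned above, a minimal sketch could look like this (the column names, types, and alerting hook are illustrative assumptions, not tied to any specific library):

# Compare the current batch's columns/types against the previous batch's.
def diff_schemas(previous: dict, current: dict) -> dict:
    added = {c: t for c, t in current.items() if c not in previous}
    removed = {c: t for c, t in previous.items() if c not in current}
    retyped = {c: (previous[c], current[c])
               for c in current if c in previous and current[c] != previous[c]}
    return {"added": added, "removed": removed, "retyped": retyped}

previous = {"user_name": "string", "event_ts": "timestamp"}  # recorded for the last batch
current = {"username": "string", "event_ts": "string", "event_source": "string"}  # new batch

drift = diff_schemas(previous, current)
if drift["removed"] or drift["retyped"]:
    # Swap this for your real alerting channel (Slack, PagerDuty, email, ...)
    raise ValueError(f"Breaking schema change detected: {drift}")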
🧠 Step 2: Understand the Type of Schema Change
Not all schema changes are equal. Classify them like this:
Additive changes (e.g., new columns): Usually safe. Downstream consumers just need to ignore or handle them.
Destructive changes (e.g., dropped or renamed columns): Dangerous. Can break logic or make metrics invalid.
Type changes (e.g., string → int): Risky. Can cause type casting errors or silent failures.
You need to respond differently to each category.
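Building on the diff sketch from Step 1, you can tag each batch with a risk level and route it accordingly (the bucket names here are just one way to slice it):

def classify_change(drift: dict) -> str:
    if drift["removed"]:
        return "destructive"   # dropped/renamed columns: block the run, page someone
    if drift["retyped"]:
        return "type_change"   # e.g., string -> int: quarantine the batch for review
    if drift["added"]:
        return "additive"      # new columns: usually safe to let through
    return "no_change"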
🔧 Step 3: Build Forward-Compatible Pipelines
This is where design matters. Some key patterns to rely on:
✅ Use Safe Access Patterns
When working with semi-structured data (JSON, nested events), avoid direct column references:
# Instead of this (raises a KeyError if the column is gone):
df["user_name"]
# Do this (returns None when the column is missing):
df.get("user_name")
This avoids errors if the field disappears in a newer version. (The .get() pattern works on Python dicts and pandas DataFrames; for Spark DataFrames, check df.columns instead, as in the real-world example later on.)
✅ Enable Schema Evolution in Your Storage Layer
For example, in Delta Lake:
# Let new columns merge into the table schema at write time
df.write.format("delta") \
    .option("mergeSchema", "true") \
    .mode("append") \
    .save("/delta/events")
Or in BigQuery, enable schema update options during load jobs.
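With the google-cloud-bigquery client, that roughly means setting schema_update_options on the load job config (a sketch only - the bucket, dataset, and table names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    # Allow the load job to add new nullable columns to the table
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",   # placeholder source
    "my_project.analytics.events",       # placeholder destination table
    job_config=job_config,
)
load_job.result()  # wait for the load to finish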
✅ Maintain a Schema Mapping Layer
Instead of hardcoding logic per schema, use a mapping layer to normalize column names/types across versions.
# A config dict can help abstract version differences:
# map each version's source column names to the canonical name.
column_mapping = {
    "v1": {"user_name": "user"},
    "v2": {"username": "user"},
}

def normalize(df, version):
    # Rename every source column listed for this version to its canonical name
    for source_col, canonical_col in column_mapping[version].items():
        df = df.withColumnRenamed(source_col, canonical_col)
    return df
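For example, assuming raw_df and detected_version come from earlier steps of the job (both are hypothetical names):

normalized_df = normalize(raw_df, version=detected_version)  # e.g., v2's 'username' becomes 'user'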
This future-proofs your transformations.
📬 Step 4: Establish Data Contracts (When Possible)
In real teams, upstream schema changes should never be a surprise. The best protection? A data contract:
Shared Slack channels with producers.
Versioned schemas in Git (a minimal contract check is sketched after this list).
Agreements on change windows and deprecation timelines.
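If the contract lives as a versioned schema file in Git, even a tiny check inside the pipeline pays off (the file path and JSON layout below are assumptions, not a standard):

import json

# contracts/events.json is reviewed via pull requests, e.g. {"user": "string", "event_ts": "timestamp"}
with open("contracts/events.json") as f:
    contract = json.load(f)

incoming = dict(df.dtypes)   # PySpark: df.dtypes is a list of (column, type) pairs

missing = [col for col in contract if col not in incoming]
if missing:
    raise ValueError(f"Contract violation - missing columns: {missing}")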
But… let’s be real: many teams don’t have this luxury. So it also pays to include:
Schema validation jobs (cron + alerts)
Fail-safe defaults (e.g., default values for missing fields)
Lineage tracking (to see what broke where)
✅ Step 5: Testing & CI/CD
Finally, test everything like a real product:
dbt tests for column existence and types.
Great Expectations for row-level checks and type enforcement (a minimal sketch follows this list).
CI/CD pipelines that fail if schema changes aren't accounted for.
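A minimal Great Expectations sketch in the classic pandas-dataset style (the exact API differs across GE versions, so treat this as an illustration):

import great_expectations as ge

# Wrap an ordinary pandas DataFrame so expectations can run against it
ge_df = ge.from_pandas(pandas_df)

ge_df.expect_column_to_exist("user")
ge_df.expect_column_values_to_not_be_null("event_ts")

results = ge_df.validate()   # evaluates every expectation recorded above
if not results["success"]:
    raise ValueError("Schema expectations failed - do not promote this batch")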
Real-World Example in Spark
Here’s how I would handle JSON ingestion with schema changes in PySpark:
from pyspark.sql.functions import lit

# Spark's JSON reader infers a schema across all the files it reads
df = spark.read.json("s3://logs/2025/")

# Safely handle a new field that may or may not exist
if "event_source" in df.columns:
    df = df.withColumn("event_source", df["event_source"].cast("string"))
else:
    df = df.withColumn("event_source", lit("unknown"))
This ensures your ETL doesn’t crash due to a missing field and allows analysts to still work with consistent tables.
TL;DR
When upstream schema changes happen:
Detect them early via metadata checks or schema registry.
Classify changes to understand risk level.
Design forward-compatible pipelines that don’t break when new fields arrive.
Work with contracts if possible; use fail-safes if not.
Validate everything in staging before it hits production.
Schema evolution isn’t scary if your pipelines are built for change.
🧠 Pro Tip
Always log and version the schema that each batch was processed with. It’ll save you hours of debugging when a dashboard starts showing “0 users” out of nowhere.
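One way to do that in PySpark - the audit table name and columns below are illustrative, not a convention:

from datetime import datetime, timezone

def log_batch_schema(spark, df, batch_id, log_table="audit.batch_schemas"):
    # df.schema.json() serializes the exact Spark schema this batch was read with
    row = [(batch_id,
            datetime.now(timezone.utc).isoformat(),
            df.schema.json())]
    schema_log = spark.createDataFrame(row, ["batch_id", "processed_at", "schema_json"])
    schema_log.write.mode("append").saveAsTable(log_table)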