
🚀 Prep #23 - Data Modeling: Complex Data Types and Cumulation

Day 1 of Zach's DE BootCamp (DataExpert.io)!

Welcome to Day 1 of Zach’s Data Engineering BootCamp — and wow, what a journey already. We started with the basics of data modeling and ended up diving into cumulative table design, advanced compression techniques, and how Spark joins can make or break your pipelines.

Whether you're wrangling transactional data or designing warehouse-ready datasets, this post captures everything we covered on Day 1 — plus practical lessons and real-world examples.

Thanks for reading DE Prep! Subscribe for free to receive new posts and support my work.



🧭 The Data Modeling Continuum

Zach introduced a powerful mental model for thinking about how data is structured at different layers of a system. He broke it down into four main types of data modeling, each tailored to a specific purpose:

  • OLTP (Online Transaction Processing): Used by application engineers. Prioritizes fast writes and strict normalization. Common in production databases like Postgres or MySQL. Think: user signups, transactions, or booking logs.

  • Master Data: The “golden record” maintained by data engineers. This is a cleaned, consistent, and unified version of key entities, like users, listings, or products. Master tables are often shared across systems and pipelines.

  • OLAP (Online Analytical Processing): Built for data analysts and scientists. These tables are flattened and denormalized to allow fast filtering, slicing, and aggregations — ideal for dashboards and reporting tools.

  • Metrics Layer: The most distilled version of your data. Fully aggregated and precomputed KPIs like “average revenue per user” or “total active users last 7 days.”
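
To make that last hop concrete, here's a minimal sketch of a metrics-layer query sitting on top of an OLAP-style fact table (Postgres-flavored; the fct_user_activity table and its columns are made up for illustration):

```sql
-- Hypothetical OLAP fact table: fct_user_activity(user_id, activity_date, revenue)
-- The metrics layer just precomputes aggregates like these and stores the results.
SELECT
    COUNT(DISTINCT user_id)                               AS active_users_last_7_days,
    SUM(revenue) / NULLIF(COUNT(DISTINCT user_id), 0)     AS avg_revenue_per_user
FROM fct_user_activity
WHERE activity_date >= CURRENT_DATE - 7;
```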

Zach shared an example from his time at Airbnb: starting with 40 transactional tables (hosts, listings, prices, availability), he built a master data table that made querying pricing and availability fast and simple. That table then fed OLAP cubes and metrics dashboards used by business and product teams.


📈 Cumulative Table Design — Tracking History the Smart Way

Cumulative tables are designed to hold the entire historical state of an entity in one place. They’re especially powerful for user analytics, retention tracking, and behavioral trends.

Here’s how it works:

  • Every day, you take yesterday’s cumulative table and join it with today’s snapshot.

  • Use a full outer join so you keep both users who were seen yesterday but don’t appear today and brand-new users who show up today for the first time.

  • Use coalesce() to merge the latest known values.

  • This lets you build history incrementally — one row per user, with columns like last_seen or an array of the last 30 days of activity.

user_id | last_seen | last_30_days_activity
--------|-----------|-----------------------
123     | 2025-06-16| [0, 1, 1, 0, 1, 0, 1]
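
In SQL, one daily increment of this pattern looks roughly like the sketch below (Postgres-flavored; the users_cumulative and users_daily tables, their columns, and the partition-by-snapshot_date layout are all assumptions for illustration):

```sql
-- One daily step of a cumulative table: yesterday's cumulative partition
-- full-outer-joined with today's snapshot. Names are illustrative.
INSERT INTO users_cumulative
SELECT
    COALESCE(y.user_id, t.user_id)                        AS user_id,
    COALESCE(t.activity_date, y.last_seen)                AS last_seen,
    -- prepend today's activity flag (1/0) and keep at most 30 days of history
    (ARRAY[CASE WHEN t.user_id IS NOT NULL THEN 1 ELSE 0 END]
        || COALESCE(y.last_30_days_activity, ARRAY[]::int[]))[1:30]
                                                          AS last_30_days_activity,
    DATE '2025-06-16'                                     AS snapshot_date
FROM (SELECT * FROM users_cumulative WHERE snapshot_date  = DATE '2025-06-15') y
FULL OUTER JOIN
     (SELECT * FROM users_daily      WHERE activity_date  = DATE '2025-06-16') t
  ON y.user_id = t.user_id;
```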

This pattern powers growth accounting at scale — think:

  • new users (not seen yesterday, seen today)

  • resurrected users (seen after a gap)

  • churned users (inactive today, active before)
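
A growth-accounting rollup on top of that table might look like this sketch (the active_today, active_yesterday, and first_seen columns are assumed fields, not something from the lecture):

```sql
-- Classify each user into a growth-accounting state for one snapshot date.
SELECT
    CASE
        WHEN active_today = 1 AND first_seen = snapshot_date THEN 'new'         -- first time ever
        WHEN active_today = 1 AND active_yesterday = 0       THEN 'resurrected' -- back after a gap
        WHEN active_today = 0 AND active_yesterday = 1       THEN 'churned'     -- went quiet today
        WHEN active_today = 1 AND active_yesterday = 1       THEN 'retained'
        ELSE 'stale'
    END      AS growth_state,
    COUNT(*) AS users
FROM users_cumulative
WHERE snapshot_date = DATE '2025-06-16'
GROUP BY 1;
```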

This technique is what powered Facebook’s legendary Dim_All_Users table — a single source of truth used by over 10,000 downstream pipelines.

But cumulative tables come with trade-offs:

  • Backfills must be sequential. You can’t parallelize a year’s worth of data — you must go day by day.

  • You need to handle PII and deletions proactively by filtering out hibernated or deleted users.

Still, when done right, cumulative tables unlock huge performance wins and historical depth with minimal query effort.


⚖️ The Compactness vs. Usability Tradeoff

Zach explained that data tables live along a spectrum:

  • Highly usable tables are clean, flat, and easy to GROUP BY. These are built for analysts and BI tools.

  • Highly compact tables are optimized for storage and speed, like compressed blobs or binary formats. Great for APIs and mobile apps, but useless for querying.

  • The middle ground lies with data engineers. Here, we lean into complex types like struct, map, and array — they help us retain structure, reduce duplication, and still enable reasonable query performance.

At Airbnb, the app shipped a compressed version of the listing availability calendar to save on I/O. But for internal analysis, Zach had to decode that data into a readable format so analysts could actually query it.


🛠 Struct, Map, and Array — Choosing the Right Complex Type

Each complex type has its own use case:

  • Structs group related fields together under a single column. They’re like nested rows, and each field can have a different data type. Great for fixed schemas like user profiles or nested metadata.

  • Maps are flexible key-value stores, but all values must be the same data type. Ideal for storing dynamic attributes like user tags or feature flags.

  • Arrays are ordered lists of items with the same type. Perfect for modeling time-series data like login streaks, activity counts, or purchase sequences.

You can even combine these — for example, an array<struct<date, status, price>> is a common pattern in master data where we need to track daily snapshots.
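
Here's what that can look like as a table definition (Spark SQL-flavored DDL; the table and field names are made up to mirror the Airbnb example):

```sql
-- Spark SQL / Hive-style complex types; names are illustrative only.
CREATE TABLE listing_master (
    listing_id  BIGINT,
    -- struct: a fixed set of named fields, each with its own type
    host        STRUCT<host_id: BIGINT, is_superhost: BOOLEAN, host_since: DATE>,
    -- map: flexible keys, but every value shares one type
    amenities   MAP<STRING, BOOLEAN>,
    -- array of structs: one element per nightly snapshot of the calendar
    nights      ARRAY<STRUCT<night: DATE, status: STRING, price: DOUBLE>>
)
USING PARQUET;
```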


🕰 Modeling Temporal Dimensions — To Explode or Not to Explode?

At Airbnb, Zach faced a classic challenge: every listing had 365 future nights. Multiply that by 6 million listings and you’re suddenly managing 2 billion rows.

The question was: should they explode the data and keep one row per listing-night? Or keep things compact using nested arrays?

Exploding is easier for joins, but it inflates row counts and hurts performance. Nesting is more efficient, but harder to query.

That’s where Parquet compression — specifically Run-Length Encoding (RLE) — comes in.


📦 Run-Length Encoding (RLE) and the Cost of a Shuffle

Here’s the magic: if a column has the same value for multiple rows (e.g., listing_id for 365 rows), Parquet stores it just once, plus a count. That’s RLE — and it makes your data smaller and faster.

But there’s a catch.

When Spark performs a join, it shuffles the data. And that shuffle breaks the sort order, which means RLE can’t kick in, and suddenly your data is 5x or 10x bigger.

Zach’s advice: don’t explode early. Instead:

  • Keep data nested (e.g., one row per listing, with an array of nights).

  • Perform your join on the outer entity (e.g., listing_id).

  • Only explode after the join is done — this keeps your sorting intact and compression effective.
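
Putting that advice together, a Spark SQL sketch of the "join first, explode later" pattern might look like this (listing_master is the nested table from the earlier sketch; listing_dim and its market column are assumptions):

```sql
-- Join at the listing grain while `nights` is still a nested array (one row per
-- listing), so the shuffle moves compact rows instead of 365x-exploded ones.
WITH joined AS (
    SELECT
        l.listing_id,
        d.market,          -- whatever we needed from the other table
        l.nights           -- still ARRAY<STRUCT<night, status, price>>
    FROM listing_master l
    JOIN listing_dim d
      ON l.listing_id = d.listing_id
)
-- Only now fan out to one row per listing-night, after the join is done.
SELECT
    j.listing_id,
    j.market,
    n.night,
    n.status,
    n.price
FROM joined j
LATERAL VIEW EXPLODE(j.nights) exploded AS n;
```

If you then write the exploded result sorted on listing_id (for example with Spark SQL's SORT BY, which sorts within each partition), the repeated listing_ids sit next to each other again and run-length encoding can do its job.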


🧠 Final Takeaways from Day 1

  • Design for your audience. OLTP for apps, Master Data for pipelines, OLAP for analysts, and Metrics for the business.

  • Cumulative tables are gold for historical tracking, but you must build them sequentially and manage PII carefully.

  • Structs, maps, and arrays are powerful tools when used with intention.

  • Compression isn’t automatic — it depends on ordering. Use Parquet + RLE smartly.

  • Spark shuffles are great for parallelism, but terrible for preserving sort order — plan accordingly.


👉 Enjoyed this breakdown of Day 1 from Zach’s DE BootCamp? Follow DE Prep for detailed walkthroughs of each day — packed with real-world takeaways, modeling strategies, and performance tips you can actually use.

You can explore all BootCamp recaps and DE Preps in one place at DEtermined.


Thanks for reading DE Prep! This post is public so feel free to share it.
