Welcome back to the Data Modeling deep dive! I’m starting afresh after losing consistency almost four months back. Apologies for that, and I promise to deliver frequent posts on Zach’s Community DE bootcamp from here on!
Days 1 and 2 covered fundamental concepts like the Data Modeling Continuum, complex data types like Map and Array, and the challenges of cumulative tables.
For Day 3, we shift our focus from entity persistence to advanced analytical constructs and relationships, revealing how seemingly simple counting problems can derail analytics and how a shared schema can manage hundreds of disparate sources.
1. The Counting Conundrum: Additive vs. Non-Additive Dimensions
A fundamental challenge in dimensional modeling is aggregation: can you trust your subtotals? That question is at the heart of additive versus non-additive dimensions, a deceptively tricky concept that ultimately boils down to how you count things.
A dimension is additive over a specific time window if and only if each entity can take on only one value of that dimension at a time within that window. In practice: if you can sum up all the subtotals and reliably get the correct grand total, the dimension is additive.
When Subtotals Lie
The risk arises when an entity (like a user or a driver) can simultaneously belong to multiple dimensional values within the reporting time frame.
• Additive Example (Safe): If you are aggregating miles driven by Civic drivers versus Corolla drivers, summing the subtotals works. This is because a single person cannot drive a mile in two separate cars at the same time.
• Non-Additive Example (Danger Zone): If you try to count the number of Honda drivers, summing Civic drivers plus Corolla drivers will double-count anyone who drives both cars. The summation fails because a single driver can drive two different cars within the same day, i.e., within the reporting window.
The Pain of Non-Additivity
When a dimension is non-additive, you cannot rely on partial aggregations, especially for count-based metrics and ratios (whose numerators and denominators are themselves counts).
• The Fix: You are forced to go all the way back to the row-level data and perform a resource-intensive COUNT DISTINCT over the entire dataset to get the grand total. You must go down a layer because you do not know how much overlap there is between the buckets (like Civic and Corolla drivers).
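To make the overlap concrete, here is a minimal Python sketch (the driver IDs are made up) contrasting the naive sum of subtotals with the distinct count you actually need:

```python
# Minimal sketch with made-up driver IDs: why Civic + Corolla subtotals
# over-count distinct Honda drivers.

civic_drivers = set(range(0, 120))     # 120 people drove a Civic this month
corolla_drivers = set(range(90, 170))  # 80 people drove a Corolla this month

subtotal_sum = len(civic_drivers) + len(corolla_drivers)  # 200 -- the "additive" answer
distinct_honda = len(civic_drivers | corolla_drivers)     # 170 -- the COUNT DISTINCT answer

print(subtotal_sum, distinct_honda)  # 200 170 -- the 30 drivers of both cars were double-counted
```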
A Rule of Thumb
To determine additivity, ask yourself: can a user take on two of these dimensional values at the same time in a given day? If they can (like using both an iPhone and an Android phone, two values of the same device dimension), the dimension is non-additive and you must handle the aggregation differently. This was such a major problem at Facebook that frameworks (e.g., Milky Way) were built specifically to handle non-additive aggregations efficiently.
2. Unlocking Hidden Power: The Strategic Use of Enums
Enumerations (enums) restrict a field to a fixed, predefined list of possible values. Enums are a very powerful tool for robust data architecture, providing advantages you don’t get with plain strings.
Why Enums Are a Data Quality Superpower:
1. Built-in Data Quality: If an incoming value doesn’t fit the enum definition during the casting process, the pipeline usually fails, automatically keeping bad data out of your warehouse.
2. Built-in Documentation: Enums provide clear documentation, showing users all possible values for a field.
3. Static Fields and Efficiency: Enums can carry related static fields (metadata). For example, in unit economics, an enum for line items included a static field defining whether that item was a “Revenue” or a “Cost” (e.g., fees are revenue, coupons are cost). These static fields can be shipped directly with your code, offering a more efficient way to bring in metadata than a traditional broadcast join.
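As a rough sketch of that third point (assuming a Python pipeline; the line items and categories are illustrative stand-ins, not Airbnb’s actual framework), an enum can carry its static metadata directly in code:

```python
from enum import Enum


class LineItem(Enum):
    """Illustrative unit-economics line items; each member carries static metadata."""
    FEE = ("fee", "revenue")
    COUPON = ("coupon", "cost")
    INSURANCE = ("insurance", "cost")
    TAX = ("tax", "cost")

    def __init__(self, code: str, category: str):
        self.code = code
        self.category = category  # static field shipped with the code


# The metadata travels with the enum itself -- no broadcast join needed.
print(LineItem.FEE.category)     # revenue
print(LineItem.COUPON.category)  # cost
```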
The 50-Value Rule
While powerful, don’t overuse them. A good rule of thumb is: If a dimension has more than 50 possible values, it probably shouldn’t be an enum. Modeling Country as an enum is generally discouraged because it involves around 200 values.
Enums and Partitioning
Enums are excellent for sub-partitions. At Facebook, when processing notifications, the large dataset was partitioned not just by date but also by the enumerated channel value (e.g., SMS, email, push). If you have the exhaustive list of all possible values for a partition, you can build a pipeline that covers everything and processes data more effectively in parallel.
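A tiny sketch of what that enables (the channel names come from the example above; the processing function is a placeholder):

```python
from enum import Enum


class Channel(Enum):
    SMS = "sms"
    EMAIL = "email"
    PUSH = "push"


def process_notifications(ds: str, channel: Channel) -> None:
    """Placeholder for the per-channel job; in practice, one task per enum value."""
    print(f"processing ds={ds}, channel={channel.value}")


# Because the enum is exhaustive, looping over it covers every sub-partition,
# and each channel's slice can run as its own parallel task.
for channel in Channel:
    process_notifications("2025-01-01", channel)
```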
3. Mastering Complexity: The “Little Book of Pipelines” Pattern
What happens when you have a massive ETL job covering many topics, one that pushes the variety “V” of big data (e.g., 50 upstream datasets)?
Zach came up with a design he calls the Little Book of Pipelines, which leverages enums to manage large-scale data integration. He used this pattern at Netflix for asset inventory and at Airbnb for unit economics.
1. Group by Enumeration: Group disparate upstream data sets (like fees, coupons, insurance, taxes, and infrastructure costs) into a single, enumerated group.
2. Shared Schema: Each source function maps its raw data to a single shared schema. This pattern works for the family of apps at Facebook (Oculus, Instagram, WhatsApp) and for unit economics at Airbnb.
3. Customized DQ: The “Little Book” stores the enumerated values and their customized data quality (DQ) checks. DQ checks need to be custom because what is anomalous for fees might not be anomalous for coupons.
4. Flexible Maintenance: If a new source is needed, you add a new value to the enum, define its source function, and the rest of the cohesive pipeline structure works.
This pattern dramatically increases pipeline quality and provides built-in documentation, as analysts can query the enumeration table to see all possible partitions.
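Here is a minimal sketch of the pattern, assuming a Python pipeline; the source names, shared schema, and checks are illustrative stand-ins, not the actual Netflix or Airbnb code:

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Source(Enum):  # the "little book": the exhaustive, enumerated list of inputs
    FEES = "fees"
    COUPONS = "coupons"
    INSURANCE = "insurance"
    TAXES = "taxes"
    INFRA_COST = "infra_cost"


@dataclass
class SharedRow:  # the single shared schema every source function maps into
    source: Source
    entity_id: str
    amount: float
    properties: Dict[str, str] = field(default_factory=dict)


# Each source gets its own extraction function and its own DQ check,
# because what is anomalous for fees is not anomalous for coupons.
SOURCE_FUNCTIONS: Dict[Source, Callable[[str], List[SharedRow]]] = {}
DQ_CHECKS: Dict[Source, Callable[[List[SharedRow]], bool]] = {}


def register(source: Source, extract, dq_check) -> None:
    """Adding a new upstream = one new enum value, one extract fn, one DQ check."""
    SOURCE_FUNCTIONS[source] = extract
    DQ_CHECKS[source] = dq_check


register(
    Source.FEES,
    extract=lambda ds: [SharedRow(Source.FEES, "listing_1", 12.5)],
    dq_check=lambda rows: all(r.amount >= 0 for r in rows),  # fees should never be negative
)
```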
4. Navigating the Flexible Schema Tradeoff
To successfully implement the shared schema required by the “Little Book” pattern, data engineers must embrace a flexible schema. You want to avoid creating a table with 500 columns where most are NULL.
A flexible schema leverages complex data types, primarily the Map type; Map<string, string> is especially powerful because it is agnostic about which keys and values land in it.
The Benefits of Maps:
• Easy Expansion: If new columns appear, they can simply be added as keys into the map without running ALTER TABLE.
• Managing Sparsity: You avoid creating numerous NULL columns; if a property doesn’t exist for a row, it simply isn’t in the map.
• Other Properties: Maps are often used for an other_properties column, a dumping ground for data that won’t be queried frequently, preventing complex modeling efforts for low-value columns.
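A minimal sketch of such a schema, assuming Spark (the column names are illustrative):

```python
from pyspark.sql.types import MapType, StringType, StructField, StructType

# A handful of stable columns plus an agnostic map<string, string>,
# instead of hundreds of mostly-NULL columns.
listing_schema = StructType([
    StructField("listing_id", StringType(), nullable=False),
    StructField("ds", StringType(), nullable=False),
    StructField("other_properties", MapType(StringType(), StringType()), nullable=True),
])

# A new upstream attribute is just a new key -- no ALTER TABLE required.
row = {
    "listing_id": "l1",
    "ds": "2025-01-01",
    "other_properties": {"has_pool": "true", "wifi_speed_mbps": "250"},
}
```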
The Compression Cost
The major drawback of relying on flexible schemas and Map data types is compression. Maps are generally the worst for compression, even worse than JSON. This is because the column header (the key) must be stored in every single row as data, rather than being stored once as part of the schema definition.
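A quick way to see it (toy data): the map layout repeats its keys in every row, while real columns name them once in the schema and keep the values contiguous, which is where run-length and dictionary encodings do their best work.

```python
# Same three rows, two layouts.

# As map<string, string>: the keys "device" and "os" are stored in every row.
map_rows = [
    {"device": "iphone", "os": "ios"},
    {"device": "pixel", "os": "android"},
    {"device": "galaxy", "os": "android"},
]

# As real columns: the names live once in the schema; the values sit together,
# which is what columnar formats compress well.
columnar = {
    "device": ["iphone", "pixel", "galaxy"],
    "os": ["ios", "android", "android"],
}
```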
5. Beyond Entities: Graph Data Modeling
Graph data modeling represents a significant shift from dimensional or relational modeling: it is relationship-focused rather than entity-focused. Graph modeling truly shines when you are trying to understand how things are connected, and the schema tends to be very flexible.
The Universal Graph Schema
Every graph database, regardless of the implementation, relies on two fundamental components with standardized schemas:
1. The Vertex (Node) Schema: The schema for an entity in a graph has three components:
◦ Identifier: The unique ID (e.g., a player’s name or user ID).
◦ Type: The enumerated type of the node (e.g., player, server, application, or Instagram user).
◦ Properties: A flexible data type (like a map) containing attributes (e.g., height, weight, draft year).
2. The Edge (Relationship) Schema: Edges connect two nodes and represent the relationship between them.
◦ Subject Identifier & Subject Type: The subject is the entity doing the thing.
◦ Object Identifier & Object Type: The object is the thing the action is being done on.
◦ Edge Type: A verb describing the relationship (e.g., plays on, plays with, plays against, or is a).
◦ Properties: Attributes of the relationship itself (e.g., how many years a player was on a team).
By shifting focus from entities to relationships (like mapping a WhatsApp user to an Instagram user because they are the same person), you enable complex graph traversal queries that are typically messy and inefficient using relational joins.
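A minimal sketch of those two schemas in Python (the dataclass names and example values are illustrative):

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Vertex:
    identifier: str    # e.g. a player's name or user ID
    vertex_type: str   # enumerated: "player", "team", "server", ...
    properties: Dict[str, str] = field(default_factory=dict)  # height, weight, draft year, ...


@dataclass
class Edge:
    subject_identifier: str   # the entity doing the thing
    subject_type: str
    object_identifier: str    # the thing the action is being done on
    object_type: str
    edge_type: str            # a verb: "plays_on", "plays_against", "is_a", ...
    properties: Dict[str, str] = field(default_factory=dict)  # e.g. years on the team


# Example: a player vertex and a "plays_on" edge to a team vertex.
lebron = Vertex("lebron_james", "player", {"height": "206cm", "draft_year": "2003"})
plays_on = Edge("lebron_james", "player", "los_angeles_lakers", "team", "plays_on",
                {"since": "2018"})
```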
👉 Enjoyed this breakdown of Day 3 from Zach’s DE BootCamp? Follow DE Prep for detailed walkthroughs - packed with real-world takeaways, modeling strategies, and performance tips you can actually use.