🧠 Prep #10 - Designing a Real-Time Data Warehouse That Thinks in Milliseconds
How would you design a data warehouse for real-time analytics?
Real-time analytics is no longer reserved for fintech and fraud detection - it’s becoming standard in e-commerce, gaming, healthtech, and B2B SaaS. But traditional data warehouses weren’t built for this level of speed.
So, how do we bridge the gap?
Let’s design a modern data warehouse that serves up real-time insights without falling apart.
🏗️ Step 1: Pick Your Architecture - Lambda or Kappa?
You need to decide how to combine historical and real-time data. Two battle-tested blueprints:
✅ Lambda Architecture
Batch Layer: Stores the master dataset. Think Parquet files on S3.
Speed Layer: Processes recent data in real time via Kafka + Flink.
Serving Layer: Combines both for querying.
Historical Data            Real-Time Events
       │                          │
[Batch Processor]          [Stream Processor]
       │                          │
       └────────────┬─────────────┘
                    │
                    ▼
         Merge in Serving Layer
                    │
         Query from Warehouse
✅ Kappa Architecture
Skip batch entirely. Just stream everything and replay when needed.
Kafka (immutable log of truth)
              │
Stream Processor (Flink / Spark)
              │
Real-time sink to warehouse + OLAP
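The "replay when needed" part is just re-reading the Kafka log from the beginning with fresh consumer state. A minimal sketch with kafka-python, assuming the lovable-events topic used later in this post; process() is a hypothetical stand-in for your normal pipeline logic:
# Example: Replaying the event log in a Kappa setup (kafka-python)
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'lovable-events',
    bootstrap_servers='localhost:9092',
    group_id='analytics-replay',     # fresh group id, so no committed offsets
    auto_offset_reset='earliest',    # start from the beginning of the log
    enable_auto_commit=False,
)
for message in consumer:
    process(message.value)           # hypothetical: same transform as the live path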
💡 Rule of thumb: Use Lambda when you must reconcile history. Use Kappa if your events are complete and clean.
⚙️ Step 2: Ingest Events with Kafka
Whether it’s clickstreams, IoT, or transaction logs, you’ll want Apache Kafka (or Kinesis) as your ingestion backbone.
# Example: Kafka consumer source for PyFlink
from pyflink.common.serialization import SimpleStringSchema
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors import FlinkKafkaConsumer
env = StreamExecutionEnvironment.get_execution_environment()
kafka_props = {'bootstrap.servers': 'localhost:9092', 'group.id': 'analytics'}
consumer = FlinkKafkaConsumer('lovable-events', SimpleStringSchema(), kafka_props)
stream = env.add_source(consumer)  # wrap the consumer in a DataStream source
Pipe this stream to your processor for enrichment and filtering.
🔄 Step 3: Transform in Real Time (Flink or Spark)
Real-time data is messy. You’ll want to clean, enrich, and aggregate it before it hits your warehouse.
# Example: Enrich user events with session data in PyFlink
import json
events = stream.map(lambda e: json.loads(e))
# SessionAssigner is a custom KeyedProcessFunction that attaches session IDs
with_sessions = events.key_by(lambda e: e["user_id"]).process(SessionAssigner())
You can also calculate rolling averages, fraud signals, etc., using window functions.
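For instance, a rolling per-user event count over a sliding window might look like this; a sketch that builds on the events stream above and assumes a PyFlink version with DataStream window support:
# Example: 5-minute rolling event count per user, updated every minute
from pyflink.common.time import Time
from pyflink.datastream.window import SlidingProcessingTimeWindows

rolling_counts = (
    events
    .map(lambda e: (e["user_id"], 1))
    .key_by(lambda kv: kv[0])
    .window(SlidingProcessingTimeWindows.of(Time.minutes(5), Time.minutes(1)))
    .reduce(lambda a, b: (a[0], a[1] + b[1]))  # sum the per-event counts in the window
)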
🧊 Step 4: Stream It Into Your Warehouse
Warehouses now support streaming ingestion. Here’s how:
Snowflake → Snowpipe
BigQuery → streaming inserts (Storage Write API)
Redshift → Kinesis Data Firehose
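For BigQuery, a streaming insert from Python looks roughly like this; the table name and row fields are placeholders, and the Storage Write API is the newer, higher-throughput path:
# Example: BigQuery streaming insert via the Python client
from google.cloud import bigquery

client = bigquery.Client()
rows = [{"event_id": "evt-123", "user_id": "u-42", "event_type": "share"}]
errors = client.insert_rows_json("my-project.analytics.user_events", rows)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")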
-- Example: Snowflake MERGE for upserts
MERGE INTO user_sessions t
USING (SELECT * FROM stream_data) s
ON t.session_id = s.session_id
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT (...);
🧠 Pro tip: Always deduplicate using event IDs or timestamps.
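One way to do this in the stream itself, before anything reaches the warehouse: keep a small piece of keyed state per event ID and drop repeats. A PyFlink sketch, assuming each event carries an event_id field (in production you would also add a state TTL so this state doesn't grow forever):
# Example: Drop duplicate events by event_id using keyed state in PyFlink
from pyflink.common.typeinfo import Types
from pyflink.datastream.functions import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ValueStateDescriptor

class DedupByEventId(KeyedProcessFunction):
    def open(self, runtime_context: RuntimeContext):
        # one boolean flag per event_id key; add a StateTtlConfig in production
        self.seen = runtime_context.get_state(ValueStateDescriptor("seen", Types.BOOLEAN()))

    def process_element(self, event, ctx):
        if self.seen.value() is None:   # first time we see this event_id
            self.seen.update(True)
            yield event                 # duplicates are silently dropped

deduped = with_sessions.key_by(lambda e: e["event_id"]).process(DedupByEventId())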
🧱 Step 5: Use Materialized Views for Low-Latency Dashboards
If your dashboard queries are slow, cache the results:
CREATE MATERIALIZED VIEW daily_signups AS
SELECT signup_date, COUNT(*) AS signups
FROM users_stream
GROUP BY signup_date;
Schedule the view to refresh every 5 minutes, or rely on your warehouse's auto-refresh if it offers one.
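If you schedule it yourself, this is the general shape; a sketch using psycopg2 against a Redshift/Postgres-style warehouse, with a placeholder connection string (in practice this would live in your orchestrator or a warehouse task, not a bare loop):
# Example: Periodically refresh the materialized view from a small worker
import time
import psycopg2

conn = psycopg2.connect("host=warehouse.example.com dbname=analytics user=etl")  # placeholder DSN
conn.autocommit = True

while True:
    with conn.cursor() as cur:
        cur.execute("REFRESH MATERIALIZED VIEW daily_signups;")
    time.sleep(300)  # every 5 minutes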
📦 Step 6: Store Raw Data in Object Storage
Keep everything, just not in the warehouse.
# Example: Write the raw event stream to S3 as Parquet (Spark Structured Streaming)
s3_path = "s3a://lovable-data/raw/"
(df.writeStream
   .format("parquet")
   .option("checkpointLocation", "s3a://lovable-data/checkpoints/raw/")  # required for the streaming file sink
   .start(s3_path))
This lets you replay, audit, or reprocess whenever needed.
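Reprocessing is then just a batch read over the same bucket. A sketch assuming the Spark session and paths from above; the filter is a hypothetical example of backfill logic:
# Example: Replay raw Parquet from S3 for a backfill
raw = spark.read.parquet("s3a://lovable-data/raw/")
backfill = raw.filter(raw.event_type == "share")  # hypothetical fix-up logic
backfill.write.format("parquet").mode("overwrite").save("s3a://lovable-data/backfill/")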
📊 Real-World Example
Let’s say we’re building analytics for a product like Lovable:
Event Sources: User actions (likes, shares), App performance logs
Stream Layer: Kafka
Processing Layer: Apache Flink with sessionization and enrichment
Warehouse: Snowflake (via Snowpipe)
OLAP Store (optional): Apache Druid for sub-second filtering
Dashboards: Superset or Metabase powered by materialized views
With this setup, PMs can see live feature adoption trends as they happen.
🔍 Monitoring and Observability
Set up alerts for:
Kafka consumer lag
Flink job failures
Streaming ingestion latency (via Snowflake's load history)
Use Grafana + Prometheus or Datadog to stay ahead of issues.
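Consumer lag is usually scraped into Prometheus by an exporter, but you can also spot-check it directly. A sketch with kafka-python, reusing the analytics group and lovable-events topic from earlier:
# Example: Spot-check Kafka consumer lag per partition
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers='localhost:9092', group_id='analytics')
partitions = [TopicPartition('lovable-events', p)
              for p in consumer.partitions_for_topic('lovable-events')]
end_offsets = consumer.end_offsets(partitions)
for tp in partitions:
    committed = consumer.committed(tp) or 0  # None if nothing committed yet
    print(f"partition {tp.partition}: lag = {end_offsets[tp] - committed}")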
💰 Bonus: Control Costs
Real-time is powerful, but it can get expensive. A few ways to keep costs in check:
Pre-aggregate metrics upstream
Filter events before ingestion
Archive old raw data to cold storage (S3 Glacier, GCS Nearline)
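The archiving piece can be a one-time lifecycle rule on the raw bucket. A boto3 sketch, assuming the lovable-data bucket from earlier and a 90-day cutoff (tune the prefix and age to your retention policy):
# Example: Transition raw data older than 90 days to S3 Glacier
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="lovable-data",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-raw-events",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)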
✅ Final Checklist
- Kafka or Kinesis for ingestion
- Flink/Spark for stream processing
- Real-time ingestion into the warehouse
- Upsert logic to avoid duplicates
- Materialized views for performance
- Object storage for replay and retention
🧠 TL;DR
A real-time data warehouse is not one tool - it’s a system.
You need streaming ingestion, transformation, and smart querying to make it all work.
When built right, you’ll go from batchy reports to live product insights in minutes - not days.