Archived

Data Pipeline Migration to Real-Time

Data Engineer · 2022 · 1.5 years · 20 people · 3 min read

Migrated batch data pipelines to real-time streaming, reducing data freshness from 24 hours to under 5 minutes

Overview

Led the transformation of legacy batch ETL pipelines to a modern real-time streaming architecture, enabling near-instant analytics and ML model updates

Problem

Business decisions were based on day-old data. The nightly batch jobs took 6+ hours to complete, frequently failed, and couldn't scale with growing data volumes. Data scientists waited 24+ hours to see the impact of model changes.

Constraints

  • Must process 500GB+ of daily data
  • Cannot lose data during migration
  • Existing downstream consumers must continue working
  • Team unfamiliar with streaming technologies

Approach

Implemented a hybrid architecture using Apache Flink for stream processing while maintaining batch capabilities for historical reprocessing. Used the Kappa architecture pattern, in which streaming is the primary path and batch views are derived by replaying the stream.
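
As a rough illustration of that pattern, here is a minimal PyFlink Table API sketch in which a Kafka topic is the single source of truth and a streaming job maintains a continuously updated aggregate. The topic, fields, windowing, and the print sink are placeholders, not the production job.

```python
# Minimal PyFlink Table API sketch of the streaming-first (Kappa) path:
# events land in Kafka, a streaming job maintains the aggregates, and the
# same topic can be replayed for historical reprocessing.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: the Kafka topic acts as the source of truth for both real-time
# consumers and replays.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink: a stand-in for the lakehouse table; 'print' keeps the sketch runnable.
t_env.execute_sql("""
    CREATE TABLE revenue_5m (
        window_start TIMESTAMP(3),
        window_end   TIMESTAMP(3),
        revenue      DOUBLE
    ) WITH (
        'connector' = 'print'
    )
""")

# Continuously updated 5-minute revenue windows instead of a nightly batch job.
t_env.execute_sql("""
    INSERT INTO revenue_5m
    SELECT window_start, window_end, SUM(amount) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
    GROUP BY window_start, window_end
""").wait()
```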

Key Decisions

Choose Apache Flink over Spark Streaming

Reasoning:

Flink's true streaming model (vs Spark's micro-batch) provides lower latency and better exactly-once semantics. The savepoint mechanism enables safe deployments and schema evolution.
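
For illustration, a hedged sketch of the savepoint-driven redeploy flow this refers to, using Flink's REST API. The JobManager host, job id, and S3 path are placeholders.

```python
# Hedged sketch of a savepoint-based redeploy driven through Flink's REST API;
# the JobManager host, job id, and S3 path are placeholders.
import time

import requests

FLINK = "http://flink-jobmanager:8081"
JOB_ID = "<job-id>"  # the running job to snapshot

# Trigger a savepoint: a consistent snapshot of the job's state.
resp = requests.post(
    f"{FLINK}/jobs/{JOB_ID}/savepoints",
    json={"target-directory": "s3://my-bucket/flink/savepoints", "cancel-job": False},
)
trigger_id = resp.json()["request-id"]

# Poll until it completes; the new job version is then started from this path
# (e.g. `flink run --fromSavepoint <location> ...`), preserving state across
# deployments and state-schema changes.
while True:
    status = requests.get(f"{FLINK}/jobs/{JOB_ID}/savepoints/{trigger_id}").json()
    if status["status"]["id"] == "COMPLETED":
        print("Savepoint written to:", status["operation"]["location"])
        break
    time.sleep(2)
```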

Alternatives considered:
  • Apache Spark Structured Streaming
  • Apache Kafka Streams
  • AWS Kinesis

Implement schema registry for data contracts

Reasoning:

Schema evolution is inevitable. A central schema registry with compatibility checks prevents breaking changes from propagating through the pipeline. Enables safe, independent evolution of producers and consumers.
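
As a sketch of what enforcing the contract at produce time can look like, assuming Confluent's Schema Registry and the confluent-kafka Python client; the topic, schema, and URLs are placeholders. A record that violates the registered schema fails here, before it can reach downstream consumers.

```python
# Hedged sketch: serialize records against a schema held in the registry so a
# contract-violating payload fails at produce time. Names and URLs are placeholders.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

ORDER_SCHEMA = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount",   "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
serialize = AvroSerializer(registry, ORDER_SCHEMA)
producer = Producer({"bootstrap.servers": "kafka:9092"})

record = {"order_id": "o-123", "amount": 42.5}
producer.produce(
    topic="orders",
    value=serialize(record, SerializationContext("orders", MessageField.VALUE)),
)
producer.flush()
```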

Alternatives considered:
  • Schema embedded in messages
  • Schema-less JSON

Use Delta Lake for the data lakehouse

Reasoning:

Delta Lake provides ACID transactions on data lake storage, enabling reliable upserts and time travel. The unified batch and streaming support aligns with our hybrid architecture.
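
A minimal sketch of the upsert and time-travel patterns this enables, using the Delta Lake Spark API; the table paths, join key, and version number are placeholders rather than the production job.

```python
# Hedged sketch of Delta Lake upserts (MERGE) and time travel on the lakehouse.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Late or corrected records staged by the streaming pipeline (placeholder path).
updates = spark.read.parquet("s3://my-bucket/staging/order_updates/")

# ACID MERGE: upsert into the lakehouse table without rewriting it wholesale.
orders = DeltaTable.forPath(spark, "s3://my-bucket/lakehouse/orders")
(
    orders.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read an earlier table version, e.g. to reproduce a training set.
snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 42)
    .load("s3://my-bucket/lakehouse/orders")
)
```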

Tech Stack

  • Apache Flink
  • Apache Kafka
  • Delta Lake
  • AWS S3
  • Apache Airflow
  • dbt
  • Python
  • Kubernetes

Result & Impact

  • Data Freshness: under 5 minutes (down from 24 hours)
  • Pipeline Reliability: 99.9% (up from 85%)
  • Processing Cost: 40% reduction
  • Data Team Productivity: 50% more experiments per quarter

Real-time data has transformed how the business operates. Marketing can see campaign performance immediately. Fraud detection models update continuously. The data team spends time on analysis instead of babysitting pipelines.

Learnings

  • Streaming adds operational complexity—invest in observability and alerting
  • Exactly-once semantics are hard—understand the guarantees your tools actually provide
  • Schema management is critical—treat schemas as code with versioning and reviews
  • Hybrid batch+streaming is pragmatic—pure streaming isn't always necessary or cost-effective

Migration Strategy

We couldn’t switch everything to streaming overnight. Instead, we ran batch and streaming pipelines in parallel, comparing outputs to validate correctness. This “shadow mode” ran for 4 weeks before we trusted the streaming pipeline enough to make it primary.

The key insight was treating the Kafka topic as the source of truth. Both real-time and batch consumers read from the same stream, ensuring consistency.
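
A simplified sketch of the kind of shadow-mode check we relied on, comparing per-day aggregates from the two pipelines; the paths, column names, and 0.5% tolerance are illustrative.

```python
# Simplified sketch of a shadow-mode check: compare per-day aggregates from
# the legacy batch pipeline and the new streaming pipeline and fail loudly if
# they diverge beyond a small tolerance.
import pandas as pd

batch = pd.read_parquet("s3://my-bucket/batch/daily_revenue/")
stream = pd.read_parquet("s3://my-bucket/streaming/daily_revenue/")

merged = batch.merge(stream, on="date", suffixes=("_batch", "_stream"))
merged["abs_diff"] = (merged["revenue_batch"] - merged["revenue_stream"]).abs()
merged["pct_diff"] = merged["abs_diff"] / merged["revenue_batch"].abs().clip(lower=1e-9)

mismatches = merged[merged["pct_diff"] > 0.005]
if not mismatches.empty:
    print(mismatches[["date", "revenue_batch", "revenue_stream", "pct_diff"]])
    raise SystemExit("Shadow-mode validation failed: pipelines diverged")
print("Shadow-mode validation passed for all days")
```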

Operational Challenges

Flink’s stateful processing required careful capacity planning. Checkpointing to S3 added latency during backpressure situations. We spent significant time tuning checkpoint intervals and state backend configurations.
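
As an illustration of the knobs involved, a hedged PyFlink sketch (assuming a recent Flink release, roughly 1.14+); the intervals and paths are placeholders, not our tuned production values.

```python
# Hedged sketch of checkpoint and state-backend tuning (assumes Flink 1.14+).
from pyflink.datastream import (
    CheckpointingMode,
    EmbeddedRocksDBStateBackend,
    StreamExecutionEnvironment,
)

env = StreamExecutionEnvironment.get_execution_environment()

# Keep large keyed state in RocksDB rather than on the JVM heap.
env.set_state_backend(EmbeddedRocksDBStateBackend())

env.enable_checkpointing(120_000, CheckpointingMode.EXACTLY_ONCE)
checkpoints = env.get_checkpoint_config()

# Give the job room to drain backpressure between checkpoints, and bound how
# long a single checkpoint may take before it is abandoned.
checkpoints.set_min_pause_between_checkpoints(60_000)
checkpoints.set_checkpoint_timeout(600_000)
checkpoints.set_checkpoint_storage_dir("s3://my-bucket/flink/checkpoints")
```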

The schema registry became a critical piece of infrastructure. We implemented CI/CD checks that validate schema compatibility before allowing merges, preventing breaking changes from reaching production.
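
A minimal sketch of such a CI gate against the Schema Registry compatibility endpoint; the subject name, registry URL, and schema path are placeholders.

```python
# Minimal sketch of the CI gate: ask the Schema Registry whether the proposed
# schema is compatible with the latest registered version before the change
# can merge.
import sys

import requests

REGISTRY = "http://schema-registry:8081"
SUBJECT = "orders-value"

with open("schemas/order.avsc") as f:
    proposed_schema = f.read()

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": proposed_schema},
)
resp.raise_for_status()

if not resp.json().get("is_compatible", False):
    print(f"Schema change for {SUBJECT} is not compatible with the latest version",
          file=sys.stderr)
    sys.exit(1)
print(f"Schema change for {SUBJECT} passed the compatibility check")
```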