Archived

Data Pipeline Migration to Real-Time

Data Engineer · 2022 · 1.5 years · 20 people · 3 min read

Migrated batch data pipelines to real-time streaming, reducing data freshness from 24 hours to under 5 minutes

Overview

Led the transformation of legacy batch ETL pipelines to a modern real-time streaming architecture, enabling near-instant analytics and ML model updates

Problem

Business decisions were based on day-old data. The nightly batch jobs took 6+ hours to complete, frequently failed, and couldn't scale with growing data volumes. Data scientists waited 24+ hours to see the impact of model changes.

Constraints

  • Must process 500GB+ of daily data
  • Cannot lose data during migration
  • Existing downstream consumers must continue working
  • Team unfamiliar with streaming technologies

Approach

Implemented a hybrid architecture using Apache Flink for stream processing while maintaining batch capabilities for historical reprocessing. Used the Kappa architecture pattern, in which streaming is the primary path and batch views are derived by replaying the stream.
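
As a rough illustration of that pattern, here is a minimal PyFlink Table API sketch in which a Kafka topic is the single source of truth and a streaming job maintains a continuously updated aggregate. The topic, fields, windowing, and the print sink are placeholders, not the production job.

```python
# Minimal PyFlink Table API sketch of the streaming-first (Kappa) path:
# events land in Kafka, a streaming job maintains the aggregates, and the
# same topic can be replayed for historical reprocessing.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: the Kafka topic acts as the source of truth for both real-time
# consumers and replays.
t_env.execute_sql("""
    CREATE TABLE orders (
        order_id STRING,
        amount   DOUBLE,
        ts       TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '30' SECOND
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'orders',
        'properties.bootstrap.servers' = 'kafka:9092',
        'scan.startup.mode' = 'earliest-offset',
        'format' = 'json'
    )
""")

# Sink: a stand-in for the lakehouse table; 'print' keeps the sketch runnable.
t_env.execute_sql("""
    CREATE TABLE revenue_5m (
        window_start TIMESTAMP(3),
        window_end   TIMESTAMP(3),
        revenue      DOUBLE
    ) WITH (
        'connector' = 'print'
    )
""")

# Continuously updated 5-minute revenue windows instead of a nightly batch job.
t_env.execute_sql("""
    INSERT INTO revenue_5m
    SELECT window_start, window_end, SUM(amount) AS revenue
    FROM TABLE(TUMBLE(TABLE orders, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
    GROUP BY window_start, window_end
""").wait()
```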

Key Decisions

Choose Apache Flink over Spark Streaming

Reasoning:

Flink's true streaming model (vs Spark's micro-batch) provides lower latency and better exactly-once semantics. The savepoint mechanism enables safe deployments and schema evolution.
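
For illustration, a hedged sketch of the savepoint-driven redeploy flow this refers to, using Flink's REST API. The JobManager host, job id, and S3 path are placeholders.

```python
# Hedged sketch of a savepoint-based redeploy driven through Flink's REST API;
# the JobManager host, job id, and S3 path are placeholders.
import time

import requests

FLINK = "http://flink-jobmanager:8081"
JOB_ID = "<job-id>"  # the running job to snapshot

# Trigger a savepoint: a consistent snapshot of the job's state.
resp = requests.post(
    f"{FLINK}/jobs/{JOB_ID}/savepoints",
    json={"target-directory": "s3://my-bucket/flink/savepoints", "cancel-job": False},
)
trigger_id = resp.json()["request-id"]

# Poll until it completes; the new job version is then started from this path
# (e.g. `flink run --fromSavepoint <location> ...`), preserving state across
# deployments and state-schema changes.
while True:
    status = requests.get(f"{FLINK}/jobs/{JOB_ID}/savepoints/{trigger_id}").json()
    if status["status"]["id"] == "COMPLETED":
        print("Savepoint written to:", status["operation"]["location"])
        break
    time.sleep(2)
```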

Alternatives considered:
  • Apache Spark Structured Streaming
  • Apache Kafka Streams
  • AWS Kinesis

Implement schema registry for data contracts

Reasoning:

Schema evolution is inevitable. A central schema registry with compatibility checks prevents breaking changes from propagating through the pipeline. Enables safe, independent evolution of producers and consumers.
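
As a sketch of what enforcing the contract at produce time can look like, assuming Confluent's Schema Registry and the confluent-kafka Python client; the topic, schema, and URLs are placeholders. A record that violates the registered schema fails here, before it can reach downstream consumers.

```python
# Hedged sketch: serialize records against a schema held in the registry so a
# contract-violating payload fails at produce time. Names and URLs are placeholders.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

ORDER_SCHEMA = """
{
  "type": "record",
  "name": "Order",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount",   "type": "double"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})
serialize = AvroSerializer(registry, ORDER_SCHEMA)
producer = Producer({"bootstrap.servers": "kafka:9092"})

record = {"order_id": "o-123", "amount": 42.5}
producer.produce(
    topic="orders",
    value=serialize(record, SerializationContext("orders", MessageField.VALUE)),
)
producer.flush()
```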

Alternatives considered:
  • Schema embedded in messages
  • Schema-less JSON

Use Delta Lake for the data lakehouse

Reasoning:

Delta Lake provides ACID transactions on data lake storage, enabling reliable upserts and time travel. The unified batch and streaming support aligns with our hybrid architecture.
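
A minimal sketch of the upsert and time-travel patterns this enables, using the Delta Lake Spark API; the table paths, join key, and version number are placeholders rather than the production job.

```python
# Hedged sketch of Delta Lake upserts (MERGE) and time travel on the lakehouse.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Late or corrected records staged by the streaming pipeline (placeholder path).
updates = spark.read.parquet("s3://my-bucket/staging/order_updates/")

# ACID MERGE: upsert into the lakehouse table without rewriting it wholesale.
orders = DeltaTable.forPath(spark, "s3://my-bucket/lakehouse/orders")
(
    orders.alias("t")
    .merge(updates.alias("s"), "t.order_id = s.order_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Time travel: read an earlier table version, e.g. to reproduce a training set.
snapshot = (
    spark.read.format("delta")
    .option("versionAsOf", 42)
    .load("s3://my-bucket/lakehouse/orders")
)
```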

Tech Stack

  • Apache Flink
  • Apache Kafka
  • Delta Lake
  • AWS S3
  • Apache Airflow
  • dbt
  • Python
  • Kubernetes

Result & Impact

  • Data Freshness: under 5 minutes (down from 24 hours)
  • Pipeline Reliability: 99.9% (up from 85%)
  • Processing Cost: 40% reduction
  • Data Team Productivity: 50% more experiments per quarter

Real-time data has transformed how the business operates. Marketing can see campaign performance immediately. Fraud detection models update continuously. The data team spends time on analysis instead of babysitting pipelines.

Learnings

  • Streaming adds operational complexity—invest in observability and alerting
  • Exactly-once semantics are hard—understand the guarantees your tools actually provide
  • Schema management is critical—treat schemas as code with versioning and reviews
  • Hybrid batch+streaming is pragmatic—pure streaming isn't always necessary or cost-effective

Migration Strategy

We couldn’t switch everything to streaming overnight. Instead, we ran batch and streaming pipelines in parallel, comparing outputs to validate correctness. This “shadow mode” ran for 4 weeks before we trusted the streaming pipeline enough to make it primary.

The key insight was treating the Kafka topic as the source of truth. Both real-time and batch consumers read from the same stream, ensuring consistency.
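
A simplified sketch of the kind of shadow-mode check we relied on, comparing per-day aggregates from the two pipelines; the paths, column names, and 0.5% tolerance are illustrative.

```python
# Simplified sketch of a shadow-mode check: compare per-day aggregates from
# the legacy batch pipeline and the new streaming pipeline and fail loudly if
# they diverge beyond a small tolerance.
import pandas as pd

batch = pd.read_parquet("s3://my-bucket/batch/daily_revenue/")
stream = pd.read_parquet("s3://my-bucket/streaming/daily_revenue/")

merged = batch.merge(stream, on="date", suffixes=("_batch", "_stream"))
merged["abs_diff"] = (merged["revenue_batch"] - merged["revenue_stream"]).abs()
merged["pct_diff"] = merged["abs_diff"] / merged["revenue_batch"].abs().clip(lower=1e-9)

mismatches = merged[merged["pct_diff"] > 0.005]
if not mismatches.empty:
    print(mismatches[["date", "revenue_batch", "revenue_stream", "pct_diff"]])
    raise SystemExit("Shadow-mode validation failed: pipelines diverged")
print("Shadow-mode validation passed for all days")
```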

Operational Challenges

Flink’s stateful processing required careful capacity planning. Checkpointing to S3 added latency during backpressure situations. We spent significant time tuning checkpoint intervals and state backend configurations.
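
As an illustration of the knobs involved, a hedged PyFlink sketch (assuming a recent Flink release, roughly 1.14+); the intervals and paths are placeholders, not our tuned production values.

```python
# Hedged sketch of checkpoint and state-backend tuning (assumes Flink 1.14+).
from pyflink.datastream import (
    CheckpointingMode,
    EmbeddedRocksDBStateBackend,
    StreamExecutionEnvironment,
)

env = StreamExecutionEnvironment.get_execution_environment()

# Keep large keyed state in RocksDB rather than on the JVM heap.
env.set_state_backend(EmbeddedRocksDBStateBackend())

env.enable_checkpointing(120_000, CheckpointingMode.EXACTLY_ONCE)
checkpoints = env.get_checkpoint_config()

# Give the job room to drain backpressure between checkpoints, and bound how
# long a single checkpoint may take before it is abandoned.
checkpoints.set_min_pause_between_checkpoints(60_000)
checkpoints.set_checkpoint_timeout(600_000)
checkpoints.set_checkpoint_storage_dir("s3://my-bucket/flink/checkpoints")
```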

The schema registry became a critical piece of infrastructure. We implemented CI/CD checks that validate schema compatibility before allowing merges, preventing breaking changes from reaching production.
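
A minimal sketch of such a CI gate against the Schema Registry compatibility endpoint; the subject name, registry URL, and schema path are placeholders.

```python
# Minimal sketch of the CI gate: ask the Schema Registry whether the proposed
# schema is compatible with the latest registered version before the change
# can merge.
import sys

import requests

REGISTRY = "http://schema-registry:8081"
SUBJECT = "orders-value"

with open("schemas/order.avsc") as f:
    proposed_schema = f.read()

resp = requests.post(
    f"{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": proposed_schema},
)
resp.raise_for_status()

if not resp.json().get("is_compatible", False):
    print(f"Schema change for {SUBJECT} is not compatible with the latest version",
          file=sys.stderr)
    sys.exit(1)
print(f"Schema change for {SUBJECT} passed the compatibility check")
```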