Real-time Streaming with Apache Kafka and Flink: Architecture for 2026¶
Real-time data processing is no longer a luxury. In 2026, it is a fundamental requirement for fraud detection, trading systems, IoT telemetry, and customer experience personalization. Apache Kafka and Apache Flink form the backbone of most production streaming architectures.
Why Kafka + Flink?¶
Kafka acts as a distributed commit log — a reliable, scalable event store with ordering guarantees. Flink is a stateful stream processor with exactly-once semantics and sub-second latency.
Together, they cover the entire pipeline:
- Ingest → Kafka (producers, connectors, CDC)
- Process → Flink (transformations, aggregations, windowing, pattern detection)
- Serve → Kafka → consumers (databases, APIs, dashboards, alerting)
Reference Architecture¶
┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌────────────┐
│ Producers │───▶│ Kafka │───▶│ Flink │───▶│ Sink/DB │
│ (API, IoT, │ │ (Topics, │ │ (Jobs, │ │ (Postgres,│
│ CDC, App) │ │ Partitions)│ │ Windows, │ │ Redis, │
└─────────────┘ └─────────────┘ │ State) │ │ S3, API) │
└──────────────┘ └────────────┘
Kafka Cluster — Best Practices 2026¶
Sizing:
- Minimum 3 brokers for production (5+ for high-throughput)
- KRaft mode (no ZooKeeper) — production-ready since Kafka 3.5+
- Tiered Storage for cost-effective retention (hot/warm/cold)
Topic design:
- Partition count = expected throughput / partition throughput (~10 MB/s per partition)
- Replication factor 3 (min.insync.replicas=2)
- Compacted topics for state/lookup data
- Naming convention: {domain}.{entity}.{version} (e.g., trading.orders.v2)
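The partition-count rule of thumb above can be sketched as a quick sizing calculation. The ~10 MB/s per-partition figure is this article's guideline, not a universal constant, and the class name is illustrative:

```java
// Partition sizing sketch: partitions = ceil(target throughput / per-partition
// throughput), never below a floor (e.g. broker count) so load spreads out.
public class PartitionSizing {
    static int requiredPartitions(double targetMBps, double perPartitionMBps, int minPartitions) {
        int raw = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(raw, minPartitions);
    }

    public static void main(String[] args) {
        // 250 MB/s target at ~10 MB/s per partition, at least 3 (one per broker)
        System.out.println(requiredPartitions(250.0, 10.0, 3)); // 25
    }
}
```

Remember that the count you pick is sticky: partitions can be added later, but that reshuffles key-to-partition assignment, so it pays to size with headroom up front.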
Schema management:
- Confluent Schema Registry (Avro/Protobuf)
- Schema evolution = backward + forward compatible
- Never change a field type, only add optional fields
Recommended producer configuration:
acks: all
linger.ms: 5
batch.size: 65536
compression.type: lz4
max.in.flight.requests.per.connection: 5
enable.idempotence: true
Flink Jobs — Production Patterns¶
Windowing strategies:
// Tumbling window — fixed intervals (e.g. 1-minute aggregations)
stream
.keyBy(event -> event.getSymbol())
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new VWAPAggregator());
// Sliding window — overlapping windows (5 min window, 1 min slide)
stream
.keyBy(event -> event.getSymbol())
.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
.process(new MovingAverageFunction());
// Session window — dynamic windows driven by activity gaps
stream
.keyBy(event -> event.getUserId())
.window(EventTimeSessionWindows.withGap(Time.minutes(30)))
.process(new SessionAnalyzer());
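The `VWAPAggregator` referenced above is assumed to implement Flink's `AggregateFunction`; stripped of the framework types, its core arithmetic is just a running price-volume sum. A stdlib-only sketch:

```java
// VWAP (volume-weighted average price) accumulator: the same add/getResult
// shape Flink's AggregateFunction expects, minus the framework types.
public class VwapAccumulator {
    private double priceVolumeSum = 0.0; // running Σ price × volume
    private double volumeSum = 0.0;      // running Σ volume

    public void add(double price, double volume) {
        priceVolumeSum += price * volume;
        volumeSum += volume;
    }

    public double getResult() {
        return volumeSum == 0.0 ? 0.0 : priceVolumeSum / volumeSum;
    }
}
```

Because the accumulator is just two doubles, it also merges cheaply across window panes, which is what makes it a good fit for `aggregate()` rather than the heavier `process()`.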
Watermarks and late data:
- Watermark = progress indicator for event time
- BoundedOutOfOrdernessWatermarks with a tolerance of 5-30 seconds
- Late data → side output for later processing
- Never rely on processing time for business logic
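The logic behind Flink's `BoundedOutOfOrdernessWatermarks` reduces to tracking the highest event timestamp seen and emitting it minus the tolerance; a stdlib-only sketch of that idea (simplified — Flink's actual generator subtracts an extra millisecond):

```java
// Bounded-out-of-orderness watermark logic: the watermark trails the highest
// event timestamp seen by a fixed tolerance, so events up to that much late
// are still assigned to their window.
public class BoundedOutOfOrderness {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampMs = Long.MIN_VALUE;

    public BoundedOutOfOrderness(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    public void onEvent(long eventTimestampMs) {
        maxTimestampMs = Math.max(maxTimestampMs, eventTimestampMs);
    }

    public long currentWatermark() {
        return maxTimestampMs - maxOutOfOrdernessMs;
    }

    /** An event is late if its timestamp is at or behind the current watermark. */
    public boolean isLate(long eventTimestampMs) {
        return eventTimestampMs <= currentWatermark();
    }
}
```

Note that out-of-order events never move the watermark backwards; once it passes a window's end, that window fires and anything later goes to the side output.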
State management:
- RocksDB state backend for large state (TB+)
- Incremental checkpointing (interval 1-5 minutes)
- State TTL for automatic cleanup of old keys
- Savepoints before every deployment
Enterprise Use Cases¶
1. Fraud Detection (< 100ms latency)¶
Transaction → Kafka → Flink CEP (Complex Event Processing)
                          │
                          ├── Pattern: 5+ transactions > 10K CZK in 10 min
                          ├── Pattern: new device + high amount
                          └── Pattern: geographic anomaly (2 countries in 5 min)
                          │
                          ▼
                 Alert → Block → Review
Flink CEP (Complex Event Processing) enables declarative definition of fraud patterns directly in code. Real-world systems combine rule-based CEP with an ML model (feature store populated by Flink, real-time inference).
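As an illustration of the first pattern (5+ transactions over 10K CZK within 10 minutes), here is a stdlib-only sliding-count check that a keyed operator might run per card. This sketches the rule itself, not the Flink CEP pattern API, and the class name is illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-count fraud rule: flag when N or more large transactions land
// inside a rolling time window for the same key (e.g. one card number).
public class VelocityRule {
    private final double amountThreshold;
    private final long windowMs;
    private final int countThreshold;
    private final Deque<Long> largeTxTimestamps = new ArrayDeque<>();

    public VelocityRule(double amountThreshold, long windowMs, int countThreshold) {
        this.amountThreshold = amountThreshold;
        this.windowMs = windowMs;
        this.countThreshold = countThreshold;
    }

    /** Returns true if this transaction completes the fraud pattern. */
    public boolean onTransaction(long timestampMs, double amount) {
        if (amount <= amountThreshold) return false;
        largeTxTimestamps.addLast(timestampMs);
        // Evict large transactions that fell out of the rolling window.
        while (timestampMs - largeTxTimestamps.peekFirst() > windowMs) {
            largeTxTimestamps.removeFirst();
        }
        return largeTxTimestamps.size() >= countThreshold;
    }
}
```

In a real job this deque would live in Flink keyed state with a TTL, so abandoned cards do not accumulate state forever.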
2. Trading & Market Data¶
Exchange feeds → Kafka (partitioned by symbol)
        │
        ▼
   Flink Jobs:
   ├── VWAP calculation (1s/5s/1m windows)
   ├── Order book reconstruction
   ├── Spread monitoring + alerting
   ├── Regime detection (vol clustering)
   └── Signal generation → Order Management System
Key metrics:
- End-to-end latency: < 10ms (intra-DC)
- Throughput: 1M+ events/sec per topic
- State: order book per symbol (~100KB), gigabytes in total
3. IoT Telemetry & Predictive Maintenance¶
Thousands of sensors → Kafka (MQTT bridge) → Flink:
- Anomaly detection on sensor data
- Rolling statistics (avg, p95, p99 per device)
- Predictive models (feature extraction → ML serving)
- Alert escalation (warning → critical → shutdown)
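The rolling statistics mentioned above (avg, p95, p99 per device) can be sketched over a bounded sample buffer; a production job would keep one of these per device key in Flink state. A minimal stdlib-only version using nearest-rank percentiles:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rolling per-device statistics over the last N readings: average and
// percentiles computed from a bounded sample buffer.
public class RollingStats {
    private final int capacity;
    private final Deque<Double> samples = new ArrayDeque<>();

    public RollingStats(int capacity) { this.capacity = capacity; }

    public void add(double value) {
        samples.addLast(value);
        if (samples.size() > capacity) samples.removeFirst(); // drop oldest reading
    }

    public double average() {
        return samples.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Nearest-rank percentile, e.g. percentile(95) for p95. */
    public double percentile(double p) {
        double[] sorted = samples.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Sorting on every query is fine for small per-device buffers; for high-cardinality fleets, a quantile sketch (e.g. t-digest) keeps the state per key constant.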
4. Real-time Personalization¶
E-commerce / content platforms:
- User clickstream → Kafka → Flink → feature update
- Session-level features (pages viewed, time on site, cart value)
- Real-time recommendation model update
- A/B test metrics in real time
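A minimal sketch of the session-level features listed above (pages viewed, time on site, cart value), as the kind of record a Flink keyed process function would accumulate in state and emit when the session window closes. All names here are illustrative:

```java
// Per-user session feature accumulator: updated on each clickstream event,
// emitted as a feature row when the session ends.
public class SessionFeatures {
    private int pagesViewed = 0;
    private double cartValue = 0.0;
    private long firstEventMs = -1;
    private long lastEventMs = -1;

    public void onPageView(long timestampMs) {
        pagesViewed++;
        touch(timestampMs);
    }

    public void onAddToCart(long timestampMs, double itemPrice) {
        cartValue += itemPrice;
        touch(timestampMs);
    }

    private void touch(long timestampMs) {
        if (firstEventMs < 0) firstEventMs = timestampMs;
        lastEventMs = timestampMs;
    }

    public int getPagesViewed() { return pagesViewed; }
    public double getCartValue() { return cartValue; }
    public long timeOnSiteMs() { return firstEventMs < 0 ? 0 : lastEventMs - firstEventMs; }
}
```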
Monitoring & Observability¶
Kafka Metrics (must-have)¶
| Metric | Alert threshold | Meaning |
|---|---|---|
| UnderReplicatedPartitions | > 0 | Data at risk |
| ActiveControllerCount | ≠ 1 | Split brain |
| ConsumerGroupLag | > 10K (depends) | Processing falling behind |
| RequestQueueSize | > 100 | Broker overloaded |
| LogFlushLatency | p99 > 100ms | Disk bottleneck |
Flink Metrics¶
| Metric | Alert threshold | Meaning |
|---|---|---|
| lastCheckpointDuration | > 60s | State too large or slow disk |
| numRecordsOutPerSecond | drop > 50% | Processing stall |
| busyTimeMsPerSecond | > 900 | Operator saturated |
| currentInputWatermark | drift > 5min | Late data issue |
Stack: Prometheus + Grafana + custom dashboards. Use the JMX Exporter for broker metrics and Kafka Exporter for consumer lag; Flink ships a native Prometheus metrics reporter.
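For example, the consumer-lag threshold from the Kafka table above could be expressed as a Prometheus alerting rule. The metric name `kafka_consumergroup_lag` is what Kafka Exporter exposes; the threshold, labels, and rule name are illustrative:

```yaml
groups:
  - name: kafka-streaming
    rules:
      - alert: ConsumerGroupLagHigh
        # kafka_consumergroup_lag is exposed by Kafka Exporter
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lagging on {{ $labels.topic }}"
```

The `for: 5m` clause suppresses alerts on short lag spikes; tune both the threshold and the duration per topic, since "acceptable lag" depends on throughput.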
Deployment & Operations¶
Kubernetes deployment¶
# Kafka on Kubernetes via the Strimzi operator
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-cluster
spec:
  kafka:
    version: 3.7.0
    replicas: 5
    config:
      auto.create.topics.enable: false
      min.insync.replicas: 2
      log.retention.hours: 168
    storage:
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
  # KRaft mode: no zookeeper section (recent Strimzi versions enable KRaft
  # via the strimzi.io/kraft annotation plus KafkaNodePool resources)
# Flink on Kubernetes via the Flink Kubernetes Operator
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: trading-processor
spec:
  image: core/flink-trading:1.18
  flinkVersion: v1_18
  jobManager:
    resource:
      memory: "4096m"
      cpu: 2
  taskManager:
    resource:
      memory: "8192m"
      cpu: 4
    replicas: 3
  job:
    jarURI: local:///opt/flink/jobs/trading.jar
    parallelism: 12
    upgradeMode: savepoint
CI/CD for Streaming Jobs¶
- Schema validation — CI checks schema compatibility
- Integration tests — Testcontainers with embedded Kafka + Flink MiniCluster
- Canary deployment — new job reads from production topic, writes to shadow topic
- Savepoint → upgrade — Flink savepoint, deploy new version, resume
- Rollback plan — keep previous savepoint, revert if metrics degrade
Alternatives and When to Consider Them¶
| Technology | When to use instead of Kafka+Flink |
|---|---|
| Redpanda | Drop-in Kafka replacement, lower latency, simpler ops |
| Apache Pulsar | Multi-tenancy, geo-replication, native tiered storage |
| Materialize | SQL-first streaming (Postgres wire protocol) |
| RisingWave | Cloud-native streaming DB, SQL interface |
| Kafka Streams | Simple transformations without a separate cluster |
Conclusion¶
Kafka + Flink remain the de facto standard for enterprise streaming in 2026. KRaft mode has eliminated the ZooKeeper dependency, Tiered Storage reduces costs, and Flink 1.18+ brings better exactly-once semantics and faster checkpointing.
Keys to success:
- Schema-first design — schemas are the contract between teams
- Observability from the start — not as an afterthought
- State management — the biggest source of bugs, invest in testing
- Capacity planning — Kafka partition count can be increased but never reduced without recreating the topic
CORE SYSTEMS helps companies design and operate streaming architectures from proof-of-concept to production deployments handling millions of events per second.
Need help designing a streaming architecture? Contact us for a consultation.