Real-time Streaming with Apache Kafka and Flink: Architecture for 2026¶
Real-time data processing is no longer a luxury. In 2026, it is a fundamental requirement for fraud detection, trading systems, IoT telemetry, and customer experience personalization. Apache Kafka and Apache Flink form the backbone of most production streaming architectures.
Why Kafka + Flink?¶
Kafka acts as a distributed commit log — a reliable, scalable event store with ordering guarantees. Flink is a stateful stream processor with exactly-once semantics and sub-second latency.
Together, they cover the entire pipeline:
- Ingest → Kafka (producers, connectors, CDC)
- Process → Flink (transformations, aggregations, windowing, pattern detection)
- Serve → Kafka → consumers (databases, APIs, dashboards, alerting)
Reference Architecture¶
┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌────────────┐
│ Producers │───▶│ Kafka │───▶│ Flink │───▶│ Sink/DB │
│ (API, IoT, │ │ (Topics, │ │ (Jobs, │ │ (Postgres,│
│ CDC, App) │ │ Partitions)│ │ Windows, │ │ Redis, │
└─────────────┘ └─────────────┘ │ State) │ │ S3, API) │
└──────────────┘ └────────────┘
Kafka Cluster — Best Practices 2026¶
Sizing:
- Minimum 3 brokers for production (5+ for high-throughput)
- KRaft mode (no ZooKeeper) — production-ready since Kafka 3.5+
- Tiered Storage for cost-effective retention (hot/warm/cold)
Topic design:
- Partition count = expected throughput / partition throughput (~10 MB/s per partition)
- Replication factor 3 (min.insync.replicas=2)
- Compacted topics for state/lookup data
- Naming convention: {domain}.{entity}.{version} (e.g., trading.orders.v2)
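The partition-count rule of thumb above can be sketched as a quick sizing calculation. The ~10 MB/s per-partition figure is this article's guideline, not a universal constant, and the class name is illustrative:

```java
// Partition sizing sketch: partitions = ceil(target throughput / per-partition
// throughput), never below a floor (e.g. broker count) so load spreads out.
public class PartitionSizing {
    static int requiredPartitions(double targetMBps, double perPartitionMBps, int minPartitions) {
        int raw = (int) Math.ceil(targetMBps / perPartitionMBps);
        return Math.max(raw, minPartitions);
    }

    public static void main(String[] args) {
        // 250 MB/s target at ~10 MB/s per partition, at least 3 (one per broker)
        System.out.println(requiredPartitions(250.0, 10.0, 3)); // 25
    }
}
```

Remember that the count you pick is sticky: partitions can be added later, but that reshuffles key-to-partition assignment, so it pays to size with headroom up front.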
Schema management:
- Confluent Schema Registry (Avro/Protobuf)
- Schema evolution = backward + forward compatible
- Never change a field type, only add optional fields
Recommended producer configuration:
acks: all
linger.ms: 5
batch.size: 65536
compression.type: lz4
max.in.flight.requests.per.connection: 5
enable.idempotence: true
Flink Jobs — Production Patterns¶
Windowing strategies:
// Tumbling window — fixed intervals (e.g. 1-minute aggregations)
stream
.keyBy(event -> event.getSymbol())
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new VWAPAggregator());
// Sliding window — overlapping windows (5 min window, 1 min slide)
stream
.keyBy(event -> event.getSymbol())
.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
.process(new MovingAverageFunction());
// Session window — dynamic windows driven by activity gaps
stream
.keyBy(event -> event.getUserId())
.window(EventTimeSessionWindows.withGap(Time.minutes(30)))
.process(new SessionAnalyzer());
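The `VWAPAggregator` referenced above is assumed to implement Flink's `AggregateFunction`; stripped of the framework types, its core arithmetic is just a running price-volume sum. A stdlib-only sketch:

```java
// VWAP (volume-weighted average price) accumulator: the same add/getResult
// shape Flink's AggregateFunction expects, minus the framework types.
public class VwapAccumulator {
    private double priceVolumeSum = 0.0; // running Σ price × volume
    private double volumeSum = 0.0;      // running Σ volume

    public void add(double price, double volume) {
        priceVolumeSum += price * volume;
        volumeSum += volume;
    }

    public double getResult() {
        return volumeSum == 0.0 ? 0.0 : priceVolumeSum / volumeSum;
    }
}
```

Because the accumulator is just two doubles, it also merges cheaply across window panes, which is what makes it a good fit for `aggregate()` rather than the heavier `process()`.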
Watermarks and late data:
- Watermark = progress indicator for event time
- BoundedOutOfOrdernessWatermarks with a tolerance of 5-30 seconds
- Late data → side output for later processing
- Never rely on processing time for business logic
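The logic behind Flink's `BoundedOutOfOrdernessWatermarks` reduces to tracking the highest event timestamp seen and emitting it minus the tolerance; a stdlib-only sketch of that idea (simplified — Flink's actual generator subtracts an extra millisecond):

```java
// Bounded-out-of-orderness watermark logic: the watermark trails the highest
// event timestamp seen by a fixed tolerance, so events up to that much late
// are still assigned to their window.
public class BoundedOutOfOrderness {
    private final long maxOutOfOrdernessMs;
    private long maxTimestampMs = Long.MIN_VALUE;

    public BoundedOutOfOrderness(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
    }

    public void onEvent(long eventTimestampMs) {
        maxTimestampMs = Math.max(maxTimestampMs, eventTimestampMs);
    }

    public long currentWatermark() {
        return maxTimestampMs - maxOutOfOrdernessMs;
    }

    /** An event is late if its timestamp is at or behind the current watermark. */
    public boolean isLate(long eventTimestampMs) {
        return eventTimestampMs <= currentWatermark();
    }
}
```

Note that out-of-order events never move the watermark backwards; once it passes a window's end, that window fires and anything later goes to the side output.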
State management:
- RocksDB state backend for large state (TB+)
- Incremental checkpointing (interval 1-5 minutes)
- State TTL for automatic cleanup of old keys
- Savepoints before every deployment
Enterprise Use Cases¶
1. Fraud Detection (< 100ms latency)¶
Transaction → Kafka → Flink CEP (Complex Event Processing)
                          │
                          ├── Pattern: 5+ transactions > 10K CZK in 10 min
                          ├── Pattern: new device + high amount
                          └── Pattern: geographic anomaly (2 countries in 5 min)
                          │
                          ▼
                 Alert → Block → Review
Flink CEP (Complex Event Processing) enables declarative definition of fraud patterns directly in code. Real-world systems combine rule-based CEP with an ML model (feature store populated by Flink, real-time inference).
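As an illustration of the first pattern (5+ transactions over 10K CZK within 10 minutes), here is a stdlib-only sliding-count check that a keyed operator might run per card. This sketches the rule itself, not the Flink CEP pattern API, and the class name is illustrative:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sliding-count fraud rule: flag when N or more large transactions land
// inside a rolling time window for the same key (e.g. one card number).
public class VelocityRule {
    private final double amountThreshold;
    private final long windowMs;
    private final int countThreshold;
    private final Deque<Long> largeTxTimestamps = new ArrayDeque<>();

    public VelocityRule(double amountThreshold, long windowMs, int countThreshold) {
        this.amountThreshold = amountThreshold;
        this.windowMs = windowMs;
        this.countThreshold = countThreshold;
    }

    /** Returns true if this transaction completes the fraud pattern. */
    public boolean onTransaction(long timestampMs, double amount) {
        if (amount <= amountThreshold) return false;
        largeTxTimestamps.addLast(timestampMs);
        // Evict large transactions that fell out of the rolling window.
        while (timestampMs - largeTxTimestamps.peekFirst() > windowMs) {
            largeTxTimestamps.removeFirst();
        }
        return largeTxTimestamps.size() >= countThreshold;
    }
}
```

In a real job this deque would live in Flink keyed state with a TTL, so abandoned cards do not accumulate state forever.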
2. Trading & Market Data¶
Exchange feeds → Kafka (partitioned by symbol)
        │
        ▼
   Flink Jobs:
   ├── VWAP calculation (1s/5s/1m windows)
   ├── Order book reconstruction
   ├── Spread monitoring + alerting
   ├── Regime detection (vol clustering)
   └── Signal generation → Order Management System
Key metrics:
- End-to-end latency: < 10ms (intra-DC)
- Throughput: 1M+ events/sec per topic
- State: order book per symbol (~100KB), gigabytes in total
3. IoT Telemetry & Predictive Maintenance¶
Thousands of sensors → Kafka (MQTT bridge) → Flink:
- Anomaly detection on sensor data
- Rolling statistics (avg, p95, p99 per device)
- Predictive models (feature extraction → ML serving)
- Alert escalation (warning → critical → shutdown)
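The rolling statistics mentioned above (avg, p95, p99 per device) can be sketched over a bounded sample buffer; a production job would keep one of these per device key in Flink state. A minimal stdlib-only version using nearest-rank percentiles:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Rolling per-device statistics over the last N readings: average and
// percentiles computed from a bounded sample buffer.
public class RollingStats {
    private final int capacity;
    private final Deque<Double> samples = new ArrayDeque<>();

    public RollingStats(int capacity) { this.capacity = capacity; }

    public void add(double value) {
        samples.addLast(value);
        if (samples.size() > capacity) samples.removeFirst(); // drop oldest reading
    }

    public double average() {
        return samples.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    /** Nearest-rank percentile, e.g. percentile(95) for p95. */
    public double percentile(double p) {
        double[] sorted = samples.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Sorting on every query is fine for small per-device buffers; for high-cardinality fleets, a quantile sketch (e.g. t-digest) keeps the state per key constant.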
4. Real-time Personalization¶
E-commerce / content platforms:
- User clickstream → Kafka → Flink → feature update
- Session-level features (pages viewed, time on site, cart value)
- Real-time recommendation model update
- A/B test metrics in real time
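A minimal sketch of the session-level features listed above (pages viewed, time on site, cart value), as the kind of record a Flink keyed process function would accumulate in state and emit when the session window closes. All names here are illustrative:

```java
// Per-user session feature accumulator: updated on each clickstream event,
// emitted as a feature row when the session ends.
public class SessionFeatures {
    private int pagesViewed = 0;
    private double cartValue = 0.0;
    private long firstEventMs = -1;
    private long lastEventMs = -1;

    public void onPageView(long timestampMs) {
        pagesViewed++;
        touch(timestampMs);
    }

    public void onAddToCart(long timestampMs, double itemPrice) {
        cartValue += itemPrice;
        touch(timestampMs);
    }

    private void touch(long timestampMs) {
        if (firstEventMs < 0) firstEventMs = timestampMs;
        lastEventMs = timestampMs;
    }

    public int getPagesViewed() { return pagesViewed; }
    public double getCartValue() { return cartValue; }
    public long timeOnSiteMs() { return firstEventMs < 0 ? 0 : lastEventMs - firstEventMs; }
}
```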
Monitoring & Observability¶
Kafka Metrics (must-have)¶
| Metric | Alert threshold | Meaning |
|---|---|---|
| UnderReplicatedPartitions | > 0 | Data at risk |
| ActiveControllerCount | ≠ 1 | Split brain |
| ConsumerGroupLag | > 10K (depends) | Processing falling behind |
| RequestQueueSize | > 100 | Broker overloaded |
| LogFlushLatency | p99 > 100ms | Disk bottleneck |
Flink Metrics¶
| Metric | Alert threshold | Meaning |
|---|---|---|
| lastCheckpointDuration | > 60s | State too large or slow disk |
| numRecordsOutPerSecond | drop > 50% | Processing stall |
| busyTimeMsPerSecond | > 900 | Operator saturated |
| currentInputWatermark | drift > 5min | Late data issue |
Stack: Prometheus + Grafana + custom dashboards. Use the JMX Exporter for broker metrics and Kafka Exporter for consumer lag; Flink ships a native Prometheus metrics reporter.
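For example, the consumer-lag threshold from the Kafka table above could be expressed as a Prometheus alerting rule. The metric name `kafka_consumergroup_lag` is what Kafka Exporter exposes; the threshold, labels, and rule name are illustrative:

```yaml
groups:
  - name: kafka-streaming
    rules:
      - alert: ConsumerGroupLagHigh
        # kafka_consumergroup_lag is exposed by Kafka Exporter
        expr: sum(kafka_consumergroup_lag) by (consumergroup, topic) > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.consumergroup }} lagging on {{ $labels.topic }}"
```

The `for: 5m` clause suppresses alerts on short lag spikes; tune both the threshold and the duration per topic, since "acceptable lag" depends on throughput.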
Deployment & Operations¶
Kubernetes deployment¶
# Kafka on Kubernetes via the Strimzi operator
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-cluster
spec:
  kafka:
    version: 3.7.0
    replicas: 5
    config:
      auto.create.topics.enable: false
      min.insync.replicas: 2
      log.retention.hours: 168
    storage:
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
  # KRaft mode: no zookeeper section (recent Strimzi versions enable KRaft
  # via the strimzi.io/kraft annotation plus KafkaNodePool resources)
# Flink on Kubernetes via the Flink Kubernetes Operator
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: trading-processor
spec:
  image: core/flink-trading:1.18
  flinkVersion: v1_18
  jobManager:
    resource:
      memory: "4096m"
      cpu: 2
  taskManager:
    resource:
      memory: "8192m"
      cpu: 4
    replicas: 3
  job:
    jarURI: local:///opt/flink/jobs/trading.jar
    parallelism: 12
    upgradeMode: savepoint
CI/CD for Streaming Jobs¶
- Schema validation — CI checks schema compatibility
- Integration tests — Testcontainers with embedded Kafka + Flink MiniCluster
- Canary deployment — new job reads from production topic, writes to shadow topic
- Savepoint → upgrade — Flink savepoint, deploy new version, resume
- Rollback plan — keep previous savepoint, revert if metrics degrade
Alternatives and When to Consider Them¶
| Technology | When to use instead of Kafka+Flink |
|---|---|
| Redpanda | Drop-in Kafka replacement, lower latency, simpler ops |
| Apache Pulsar | Multi-tenancy, geo-replication, native tiered storage |
| Materialize | SQL-first streaming (Postgres wire protocol) |
| RisingWave | Cloud-native streaming DB, SQL interface |
| Kafka Streams | Simple transformations without a separate cluster |
Conclusion¶
Kafka + Flink remain the de facto standard for enterprise streaming in 2026. KRaft mode has eliminated the ZooKeeper dependency, Tiered Storage reduces costs, and Flink 1.18+ brings better exactly-once semantics and faster checkpointing.
Keys to success:
- Schema-first design — schemas are the contract between teams
- Observability from the start — not as an afterthought
- State management — the biggest source of bugs, invest in testing
- Capacity planning — Kafka partition count can be increased but never reduced without recreating the topic
CORE SYSTEMS helps companies design and operate streaming architectures from proof-of-concept to production deployments handling millions of events per second.
Need help designing a streaming architecture? Contact us for a consultation.