Real-time Streaming with Apache Kafka and Flink: Architecture for 2026

20. 02. 2026 · 5 min read · CORE SYSTEMS · data
Real-time data processing is no longer a luxury. In 2026, it is a fundamental requirement for fraud detection, trading systems, IoT telemetry, and customer experience personalization. Apache Kafka and Apache Flink form the backbone of most production streaming architectures.

Kafka acts as a distributed commit log — a reliable, scalable event store with ordering guarantees. Flink is a stateful stream processor with exactly-once semantics and sub-second latency.

Together, they cover the entire pipeline:

  • Ingest → Kafka (producers, connectors, CDC)
  • Process → Flink (transformations, aggregations, windowing, pattern detection)
  • Serve → Kafka → consumers (databases, APIs, dashboards, alerting)

Reference Architecture

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│  Producers  │───▶│    Kafka    │───▶│    Flink    │───▶│   Sink/DB   │
│  (API, IoT, │    │  (Topics,   │    │  (Jobs,     │    │  (Postgres, │
│   CDC, App) │    │ Partitions) │    │  Windows,   │    │  Redis, S3, │
└─────────────┘    └─────────────┘    │  State)     │    │  API)       │
                                      └─────────────┘    └─────────────┘

Kafka Cluster — Best Practices 2026

Sizing:

  • Minimum 3 brokers for production (5+ for high-throughput clusters)
  • KRaft mode (no ZooKeeper) — production-ready since Kafka 3.3 for new clusters
  • Tiered Storage for cost-effective retention (hot/warm/cold)

Topic design:

  • Partition count = expected throughput / partition throughput (~10 MB/s per partition)
  • Replication factor 3 (min.insync.replicas=2)
  • Compacted topics for state/lookup data
  • Naming convention: {domain}.{entity}.{version} (e.g., trading.orders.v2)
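As a quick sanity check on the partition-count rule, the sizing arithmetic can be sketched in plain Java. The ~10 MB/s per-partition figure and the headroom factor are assumptions you should replace with your own benchmarks:

```java
// Hypothetical sizing helper: partitions needed for a target throughput,
// assuming a measured per-partition throughput plus a headroom multiplier
// for growth and rebalancing. Benchmark both numbers on your own hardware.
public class PartitionSizing {

    static int partitionsFor(double targetMBps, double perPartitionMBps, int headroomFactor) {
        // base partitions needed to carry the target load
        int base = (int) Math.ceil(targetMBps / perPartitionMBps);
        // multiply by headroom; never go below one partition
        return Math.max(base * headroomFactor, 1);
    }

    public static void main(String[] args) {
        // 250 MB/s target, ~10 MB/s per partition, 2x headroom
        System.out.println(partitionsFor(250.0, 10.0, 2)); // prints 50
    }
}
```

Oversizing slightly is the safer direction, since partitions can be added later but never removed.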

Schema management:

  • Confluent Schema Registry (Avro/Protobuf)
  • Schema evolution = backward + forward compatible
  • Never change a field type; only add optional fields
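As an illustration of a compatible change, a hypothetical Avro value schema for a topic like trading.orders.v2 might evolve by adding a nullable field with a default — the field names here are invented for the example:

```json
{
  "type": "record",
  "name": "Order",
  "namespace": "trading.orders",
  "fields": [
    {"name": "orderId", "type": "string"},
    {"name": "symbol", "type": "string"},
    {"name": "price", "type": "double"},
    {"name": "venue", "type": ["null", "string"], "default": null}
  ]
}
```

Because `venue` is optional with a default, old readers ignore it and new readers can decode old records, keeping the evolution both backward and forward compatible.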

Producer configuration (exactly-once friendly):
acks: all
linger.ms: 5
batch.size: 65536
compression.type: lz4
max.in.flight.requests.per.connection: 5
enable.idempotence: true

Flink — Best Practices 2026

Windowing strategies:

// Tumbling window — fixed intervals (e.g., 1-minute aggregations)
stream
    .keyBy(event -> event.getSymbol())
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .aggregate(new VWAPAggregator());

// Sliding window — overlapping windows (5 min window, 1 min slide)
stream
    .keyBy(event -> event.getSymbol())
    .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.minutes(1)))
    .process(new MovingAverageFunction());

// Session window — dynamic windows based on activity
stream
    .keyBy(event -> event.getUserId())
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .process(new SessionAnalyzer());

Watermarks and late data:

  • Watermark = progress indicator for event time
  • BoundedOutOfOrdernessWatermarks with a tolerance of 5-30 seconds
  • Late data → side output for later processing
  • Never rely on processing time for business logic
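The bounded out-of-orderness rule can be sketched without the Flink API. This mirrors the arithmetic inside Flink's BoundedOutOfOrdernessWatermarks: the watermark trails the highest event timestamp seen by the configured bound, and anything at or below the watermark counts as late:

```java
// Minimal sketch of the bounded out-of-orderness idea, independent of
// the Flink API. The watermark lags the max observed event timestamp by
// a fixed tolerance; events at or below the watermark are "late".
public class BoundedWatermark {
    private final long maxOutOfOrdernessMs;
    private long maxTimestamp;

    BoundedWatermark(long maxOutOfOrdernessMs) {
        this.maxOutOfOrdernessMs = maxOutOfOrdernessMs;
        // start low enough that the initial watermark does not underflow
        this.maxTimestamp = Long.MIN_VALUE + maxOutOfOrdernessMs + 1;
    }

    void onEvent(long eventTimestampMs) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestampMs);
    }

    long currentWatermark() {
        return maxTimestamp - maxOutOfOrdernessMs - 1;
    }

    boolean isLate(long eventTimestampMs) {
        return eventTimestampMs <= currentWatermark();
    }

    public static void main(String[] args) {
        BoundedWatermark wm = new BoundedWatermark(5_000); // 5s tolerance
        wm.onEvent(100_000);
        System.out.println(wm.currentWatermark()); // prints 94999
    }
}
```

A larger tolerance admits more out-of-order data at the cost of higher end-to-end latency, which is why the 5-30 second range above is a tuning decision, not a constant.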

State management:

  • RocksDB state backend for large state (TB+)
  • Incremental checkpointing (interval 1-5 minutes)
  • State TTL for automatic cleanup of old keys
  • Savepoints before every deployment
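A minimal flink-conf.yaml fragment along these lines might look as follows — option names are valid as of Flink 1.18, and the interval values are illustrative, so verify both against your Flink version:

```yaml
# flink-conf.yaml fragment (Flink 1.18 option names; values illustrative)
state.backend: rocksdb
state.backend.incremental: true
execution.checkpointing.interval: 2min
execution.checkpointing.mode: EXACTLY_ONCE
execution.checkpointing.min-pause: 30s
```

State TTL is configured per state descriptor in code rather than globally, which is easy to miss when auditing cleanup behavior.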

Enterprise Use Cases

1. Fraud Detection (< 100ms latency)

Transaction → Kafka → Flink CEP (Complex Event Processing)
                         │
                         ├── Pattern: 5+ transactions > 10K CZK in 10 min
                         ├── Pattern: new device + high amount
                         └── Pattern: geographic anomaly (2 countries in 5 min)
                         │
                         ▼
                    Alert → Block → Review

Flink CEP (Complex Event Processing) enables declarative definition of fraud patterns directly in code. Real-world systems combine rule-based CEP with an ML model (feature store populated by Flink, real-time inference).
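The first pattern above reduces to a sliding-window count. A plain-Java sketch of that rule (per-account state, thresholds taken from the diagram) shows what Flink CEP tracks declaratively; the class and method names are invented for the example:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the "5+ transactions > 10K CZK in 10 min" rule.
// One instance holds the state for one account — in Flink this would be
// keyed state, and CEP would express the pattern declaratively.
public class VelocityRule {
    private static final long WINDOW_MS = 10 * 60 * 1000;   // 10 minutes
    private static final double AMOUNT_THRESHOLD = 10_000;  // CZK
    private static final int COUNT_THRESHOLD = 5;

    private final Deque<Long> highValueTimestamps = new ArrayDeque<>();

    /** Returns true if this transaction completes the fraud pattern. */
    boolean onTransaction(long timestampMs, double amountCzk) {
        if (amountCzk <= AMOUNT_THRESHOLD) return false;
        highValueTimestamps.addLast(timestampMs);
        // evict high-value events older than the 10-minute window
        while (timestampMs - highValueTimestamps.peekFirst() > WINDOW_MS) {
            highValueTimestamps.removeFirst();
        }
        return highValueTimestamps.size() >= COUNT_THRESHOLD;
    }
}
```

The deque is exactly the state a keyed Flink operator would checkpoint, which is why state size and TTL matter so much for fraud workloads.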

2. Trading & Market Data

Exchange feeds → Kafka (partitioned by symbol)
                    │
                    ▼
              Flink Jobs:
              ├── VWAP calculation (1s/5s/1m windows)
              ├── Order book reconstruction
              ├── Spread monitoring + alerting
              ├── Regime detection (vol clustering)
              └── Signal generation → Order Management System

Key metrics:

  • End-to-end latency: < 10ms (intra-DC)
  • Throughput: 1M+ events/sec per topic
  • State: order book per symbol (~100KB), gigabytes in total
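The VWAP jobs above accumulate two running sums per window — sum(price × volume) and sum(volume). A minimal sketch of what a hypothetical VWAPAggregator computes:

```java
// Minimal VWAP (volume-weighted average price) accumulator, mirroring
// what a windowed aggregator would hold: two running sums per window.
public class Vwap {
    private double priceVolumeSum = 0;
    private double volumeSum = 0;

    void add(double price, double volume) {
        priceVolumeSum += price * volume;
        volumeSum += volume;
    }

    double value() {
        // undefined until at least one trade arrives
        return volumeSum == 0 ? Double.NaN : priceVolumeSum / volumeSum;
    }

    public static void main(String[] args) {
        Vwap vwap = new Vwap();
        vwap.add(100.0, 10);  // 10 units at 100
        vwap.add(110.0, 30);  // 30 units at 110
        System.out.println(vwap.value()); // prints 107.5
    }
}
```

Because the state per window is just two doubles, VWAP is cheap even at 1M+ events/sec; the order-book reconstruction job is where the ~100KB-per-symbol state comes from.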

3. IoT Telemetry & Predictive Maintenance

Thousands of sensors → Kafka (MQTT bridge) → Flink:

  • Anomaly detection on sensor data
  • Rolling statistics (avg, p95, p99 per device)
  • Predictive models (feature extraction → ML serving)
  • Alert escalation (warning → critical → shutdown)
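The rolling statistics amount to per-device keyed state. A plain-Java sketch over the last N readings — the nearest-rank percentile method and the class name are assumptions for illustration:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of per-device rolling statistics (avg and p95 over the last N
// readings) — the kind of state a keyed Flink operator holds per device.
public class RollingStats {
    private final int capacity;
    private final Deque<Double> window = new ArrayDeque<>();

    RollingStats(int capacity) { this.capacity = capacity; }

    void add(double reading) {
        window.addLast(reading);
        if (window.size() > capacity) window.removeFirst(); // keep last N
    }

    double avg() {
        return window.stream().mapToDouble(Double::doubleValue).average().orElse(Double.NaN);
    }

    double p95() {
        // nearest-rank percentile over a sorted copy of the window
        double[] sorted = window.stream().mapToDouble(Double::doubleValue).sorted().toArray();
        int idx = (int) Math.ceil(0.95 * sorted.length) - 1;
        return sorted[Math.max(idx, 0)];
    }
}
```

Sorting per query is fine for small windows; at high rates a streaming quantile sketch (e.g., t-digest) is the usual substitute.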

4. Real-time Personalization

E-commerce / content platforms:

  • User clickstream → Kafka → Flink → feature update
  • Session-level features (pages viewed, time on site, cart value)
  • Real-time recommendation model update
  • A/B test metrics in real time

Monitoring & Observability

Kafka Metrics (must-have)

Metric                      Alert threshold              Meaning
UnderReplicatedPartitions   > 0                          Data at risk
ActiveControllerCount       ≠ 1                          Split brain
ConsumerGroupLag            > 10K (workload-dependent)   Processing falling behind
RequestQueueSize            > 100                        Broker overloaded
LogFlushLatency p99         > 100ms                      Disk bottleneck

Flink Metrics (must-have)

Metric                      Alert threshold              Meaning
lastCheckpointDuration      > 60s                        State too large or slow disk
numRecordsOutPerSecond      drop > 50%                   Processing stall
busyTimeMsPerSecond         > 900                        Operator saturated
currentInputWatermark       drift > 5min                 Late data issue

Stack: Prometheus + Grafana + custom dashboards. Prometheus JMX Exporter for broker JMX metrics, Kafka Exporter for consumer group lag; Flink ships a native Prometheus metrics reporter.

Deployment & Operations

Kubernetes deployment

# Kafka on Kubernetes via the Strimzi operator
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: production-cluster
spec:
  kafka:
    version: 3.7.0
    replicas: 5
    config:
      auto.create.topics.enable: false
      min.insync.replicas: 2
      log.retention.hours: 168
    storage:
      type: persistent-claim
      size: 500Gi
      class: fast-ssd
  # KRaft mode: the zookeeper section is omitted entirely; annotate the
  # Kafka resource with strimzi.io/kraft: enabled (with node pools,
  # broker replicas and storage move to KafkaNodePool resources)
---
# Flink on Kubernetes via the Flink Kubernetes Operator
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: trading-processor
spec:
  image: core/flink-trading:1.18
  flinkVersion: v1_18
  jobManager:
    resource:
      memory: "4096m"
      cpu: 2
  taskManager:
    resource:
      memory: "8192m"
      cpu: 4
    replicas: 3
  job:
    jarURI: local:///opt/flink/jobs/trading.jar
    parallelism: 12
    upgradeMode: savepoint

CI/CD for Streaming Jobs

  1. Schema validation — CI checks schema compatibility
  2. Integration tests — Testcontainers with embedded Kafka + Flink MiniCluster
  3. Canary deployment — new job reads from production topic, writes to shadow topic
  4. Savepoint → upgrade — Flink savepoint, deploy new version, resume
  5. Rollback plan — keep previous savepoint, revert if metrics degrade

Alternatives and When to Consider Them

Technology      When to use instead of Kafka+Flink
Redpanda        Drop-in Kafka replacement, lower latency, simpler ops
Apache Pulsar   Multi-tenancy, geo-replication, native tiered storage
Materialize     SQL-first streaming (Postgres wire protocol)
RisingWave      Cloud-native streaming DB, SQL interface
Kafka Streams   Simple transformations without a separate cluster

Conclusion

Kafka + Flink remain the de facto standard for enterprise streaming in 2026. KRaft mode has eliminated the ZooKeeper dependency, Tiered Storage reduces costs, and Flink 1.18+ brings better exactly-once semantics and faster checkpointing.

Keys to success:

  • Schema-first design — schemas are the contract between teams
  • Observability from the start — not as an afterthought
  • State management — the biggest source of bugs; invest in testing
  • Capacity planning — partition count can be increased but never reduced, and increasing it breaks key-based ordering

CORE SYSTEMS helps companies design and operate streaming architectures from proof-of-concept to production deployments handling millions of events per second.


Need help designing a streaming architecture? Contact us for a consultation.

Tags: kafka, flink, streaming, real-time, data-engineering, event-driven