
QA, Testing & Observability

Quality is a process. Not a sprint at the end.

We test AI as a system: accuracy, robustness, safety, regression behaviour. Observability tells you WHY, not just THAT.

Test Automation

Unit, integration, e2e tests. CI pipeline runs on every commit. Automated regression in minutes.

Manual regression testing is the most expensive way to slow down development. A QA team that clicks through the same scenarios for 3 days before every release is a bottleneck. Test automation moves regression into the CI pipeline — it runs on every commit, results in minutes.

Test pyramid in practice: Unit tests (70%) — fast, isolated, hundreds per second. Integration tests (20%) — API contracts, database operations, message brokers. E2E tests (10%) — critical business flows across the full stack. The ratio is not dogma, but direction — more unit, fewer e2e.
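The base of the pyramid can be sketched as plain pytest-style tests; `calculate_discount` is a hypothetical pure function standing in for real business logic:

```python
# Hypothetical example: a pure function gets fast, isolated unit tests;
# the API endpoint wrapping it would be covered at the integration layer.

def calculate_discount(order_total: float, loyalty_years: int) -> float:
    """Hypothetical pricing rule: 1% per loyalty year, capped at 10%."""
    rate = min(loyalty_years, 10) / 100
    return round(order_total * rate, 2)

# Unit tests: no I/O, no setup, hundreds run per second (pytest-style).
def test_discount_grows_with_loyalty():
    assert calculate_discount(100.0, 3) == 3.0

def test_discount_is_capped():
    assert calculate_discount(100.0, 25) == 10.0

test_discount_grows_with_loyalty()
test_discount_is_capped()
```

In a real codebase pytest discovers and runs the `test_*` functions; they are invoked directly here only to keep the sketch self-contained.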

E2E framework: Playwright for web (multi-browser, auto-waiting, network interception). Detox for React Native, XCTest/Espresso for native. Page Object Model for maintainability. Visual regression testing (Percy, Chromatic) for UI changes.

CI integration: Tests run in GitHub Actions / GitLab CI on every push. Parallelisation for speed — 500 tests in 5 minutes, not 45. Flaky test detection and quarantine — an unstable test doesn’t block the pipeline but generates an alert. Test report with coverage trends.
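Flaky-test quarantine can look like this; the window size and threshold are our illustrative assumptions, not fixed numbers:

```python
# Sketch of quarantine logic: a test whose recent runs mix passes and
# failures is flaky. It keeps running and raises an alert, but it no
# longer blocks the pipeline. Window and threshold are illustrative.

def is_flaky(recent_results: list[bool], window: int = 20,
             threshold: float = 0.1) -> bool:
    """Flag a test whose failure rate in the recent window is unstable:
    above the noise threshold, but not failing every single run."""
    runs = recent_results[-window:]
    failure_rate = runs.count(False) / len(runs)
    return 0 < failure_rate < 1 and failure_rate >= threshold

# Failing 3 of the last 20 runs means quarantine; failing every run
# means the test is simply broken and must block the merge.
assert is_flaky([True] * 17 + [False] * 3) is True
assert is_flaky([False] * 20) is False
```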

Contract testing: Pact for API contracts between frontend and backend, between microservices. The provider verifies the contract in its own CI — a breaking change is caught before merge, not in integration. Consumer-driven contracts for independent teams.
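The idea behind consumer-driven contracts, reduced to a sketch (this is NOT the real Pact API; Pact generates and verifies contract files for you):

```python
# Conceptual sketch only: the consumer declares the fields it relies on,
# and the provider's CI checks its real response shape against them.

CONSUMER_CONTRACT = {
    "endpoint": "/orders/{id}",                      # hypothetical endpoint
    "required_fields": {"id": int, "status": str, "total": float},
}

def provider_verifies(response: dict, contract: dict) -> list[str]:
    """Return a list of contract violations; an empty list means it holds."""
    violations = []
    for field, expected_type in contract["required_fields"].items():
        if field not in response:
            violations.append(f"missing field: {field}")
        elif not isinstance(response[field], expected_type):
            violations.append(f"wrong type for {field}")
    return violations

# Renaming 'total' to 'amount' is a breaking change, caught in the
# provider's own CI before merge, not during integration testing.
assert provider_verifies({"id": 1, "status": "paid", "total": 9.9},
                         CONSUMER_CONTRACT) == []
assert provider_verifies({"id": 1, "status": "paid", "amount": 9.9},
                         CONSUMER_CONTRACT) == ["missing field: total"]
```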


Observability Stack

Metrics, logs, traces. Grafana, Prometheus, Loki, Jaeger. You see what is happening and why.

Monitoring tells you THAT there is a problem. Observability tells you WHY. With monitoring you know that the API is slow. With observability you see the specific trace: the request passed through 6 services, the bottleneck is a query in order-service that takes 8s due to a missing index. Fix in 5 minutes instead of 5 hours.

Three pillars: Metrics (Prometheus) — numerical time-series, alerting on SLOs. Logs (Loki, Elasticsearch) — structured events with context. Traces (Jaeger, Tempo) — the path of a request through a distributed system. Three pillars connected — from an alert you click to the relevant trace, from the trace to logs.
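The glue between the pillars is correlation: every log line carries the trace id of its request. A minimal sketch (field names are our assumption; in practice OpenTelemetry standardises them):

```python
# Each structured log line carries the trace_id, so from a trace in
# Jaeger/Tempo you can jump straight to the matching logs in Loki.
import json
import uuid

def log_event(trace_id: str, level: str, message: str, **context) -> str:
    """Emit one structured (JSON) log line correlated to a trace."""
    return json.dumps({"trace_id": trace_id, "level": level,
                       "message": message, **context})

trace_id = uuid.uuid4().hex
line = log_event(trace_id, "ERROR", "order query timed out",
                 service="order-service", duration_ms=8000)

# The log backend can now filter on trace_id: all logs of one request.
assert json.loads(line)["trace_id"] == trace_id
```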

OpenTelemetry as standard: Vendor-neutral instrumentation. One SDK, export to any backend (Grafana stack, Datadog, New Relic). Auto-instrumentation for popular frameworks (.NET, Java, Python, Node.js). Custom spans for business logic — you see not just HTTP requests but also “processing the order took 3.2s”.
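What a custom business span does, illustrated with a stdlib-only stand-in (in production you would use the OpenTelemetry SDK's tracer and `start_as_current_span`, not this sketch):

```python
# Minimal illustration of a custom span: time a named unit of business
# logic so the trace shows "process-order took X s", not just HTTP calls.
import time
from contextlib import contextmanager

SPANS: list[tuple[str, float]] = []  # stand-in for an exporter backend

@contextmanager
def span(name: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        SPANS.append((name, time.perf_counter() - start))

with span("process-order"):
    time.sleep(0.01)  # stands in for the real business logic

name, duration = SPANS[0]
assert name == "process-order" and duration >= 0.009
```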

Dashboards and alerting: Grafana dashboards for SRE (SLO burn rate, error budget), for the dev team (deployment frequency, lead time, MTTR), for business (conversions, revenue, active users). Alerting on symptoms (SLO violation), not causes (CPU > 80%). PagerDuty/OpsGenie integration with escalation.

Costs under control: Observability data grows fast. Sampling strategies (head-based, tail-based) for traces. Log levels and retention policies. Metrics with appropriate granularity. Typically 60-80% savings versus “log everything”.
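Head-based sampling in a nutshell: the keep/drop decision is made once, at the start of the trace, deterministically from the trace id, so every service on the request path decides the same way. A sketch with an illustrative 10% rate:

```python
# Hash the trace id into a 0-99 bucket; traces below the sample rate are
# kept. Deterministic, so all services agree without coordination.
import hashlib

def keep_trace(trace_id: str, sample_rate_percent: int = 10) -> bool:
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest(), 16) % 100
    return bucket < sample_rate_percent

kept = sum(keep_trace(f"trace-{i}") for i in range(10_000))
assert 800 < kept < 1200           # roughly 10% of traces survive
assert keep_trace("trace-42") == keep_trace("trace-42")  # stable per trace
```

Tail-based sampling decides after the trace completes (keep all errors and slow traces), which is more useful but requires buffering, which is why it is usually done in a collector.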


AI Evaluations

Precision, recall, safety scoring. LLM evaluation, drift detection, A/B model testing.

An AI model without evaluations is a black box in production. Is it working? Maybe. Better than last week? You don’t know. Safely? You hope. AI evaluations introduce measurability — you know exactly how the model performs, where it fails and when it degrades.

LLM evaluations: Precision, recall, faithfulness (hallucination rate), relevance, safety scoring. Evaluation datasets specific to your domain — not generic benchmarks, but real queries from your users. Automated evaluations via LLM-as-judge (GPT-4 rating production model responses) and human-in-the-loop.
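Precision and recall, the two workhorse metrics, computed on a toy retrieval result (the document ids are stand-ins; in practice the reference set comes from reviewed real user queries):

```python
# Offline eval scoring sketch: compare what the model returned against
# a labelled reference set from the domain dataset.

def precision_recall(predicted: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: how much of what the model returned is correct.
    Recall: how much of what is correct the model returned."""
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Model retrieved 4 documents, 3 actually relevant, out of 5 relevant total.
p, r = precision_recall({"d1", "d2", "d3", "d9"}, {"d1", "d2", "d3", "d4", "d5"})
assert (p, r) == (0.75, 0.6)
```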

Drift detection: Model quality changes over time — the distribution of input data shifts, user behaviour changes, the world changes. Monitoring of key metrics with alerting: if precision drops by 5%, you get an alert. Sliding window analysis for detecting gradual degradation.
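A sliding-window drift check can be sketched like this; the 5% relative drop mirrors the text, the 7-day windows are our illustrative assumption:

```python
# Compare the mean precision of the most recent window against the
# window before it; alert on a relative drop above the threshold.

def drift_alert(daily_precision: list[float], window: int = 7,
                max_drop: float = 0.05) -> bool:
    """True when the latest window is >5% (relative) below the previous one."""
    if len(daily_precision) < 2 * window:
        return False  # not enough history yet
    previous = sum(daily_precision[-2 * window:-window]) / window
    recent = sum(daily_precision[-window:]) / window
    return (previous - recent) / previous > max_drop

stable = [0.90] * 14
degrading = [0.90] * 7 + [0.82] * 7   # gradual quality loss
assert drift_alert(stable) is False
assert drift_alert(degrading) is True
```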

A/B model testing: New model vs. existing. Traffic split 50/50, measuring business metrics (conversion, user satisfaction, task completion) and technical metrics (latency, cost per request). Statistical significance before a decision — not “seems better”, but “is better with p < 0.05”.
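The significance check behind "is better with p < 0.05" is, for conversion rates, a two-proportion z-test. A stdlib-only sketch (for production decisions a statistics library such as scipy is the safer choice):

```python
# Two-sided p-value for the difference between two conversion rates.
import math

def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Normal CDF via erf; doubled for the two-sided tail.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# New model: 570/5000 conversions vs 500/5000 for the current one.
p = two_proportion_p_value(570, 5000, 500, 5000)
assert p < 0.05  # the improvement is statistically significant
```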

Evaluation pipeline: Automated evaluations in CI/CD — a new model must pass the eval suite before deploying to production. Quality gate: if precision < 0.85 or safety score < 0.95, deploy is halted. Regression testing — a new model must not perform worse in any category.
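The quality gate itself is a few lines of thresholds-as-code; the metric names and floors mirror the text:

```python
# Eval quality gate evaluated in CI: any metric below its floor halts
# the deploy and reports why.

GATE = {"precision": 0.85, "safety_score": 0.95}

def gate_passes(eval_results: dict[str, float]) -> tuple[bool, list[str]]:
    """Return (passed, reasons). Missing metrics count as failures."""
    failures = [f"{metric} {eval_results.get(metric, 0.0):.2f} < {floor}"
                for metric, floor in GATE.items()
                if eval_results.get(metric, 0.0) < floor]
    return (not failures, failures)

ok, _ = gate_passes({"precision": 0.91, "safety_score": 0.97})
blocked, reasons = gate_passes({"precision": 0.91, "safety_score": 0.93})
assert ok is True and blocked is False
assert reasons == ["safety_score 0.93 < 0.95"]
```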

Tooling: LangSmith, Ragas, custom eval frameworks. Eval datasets versioned in Git. Results in Grafana dashboards alongside infrastructure metrics. One view of the health of the entire AI system.


Performance & Load Testing

k6, Gatling, JMeter. You know how much the system can handle before your customers find out.

Customers are the worst load testing tool. When you find out about a performance problem from Twitter, it is too late. Load testing reveals system limits in a controlled environment — you know exactly where the bottleneck is and how much headroom you have.

Types of tests: Load test (expected traffic), stress test (2-3× expected), spike test (sudden surge), soak test (constant load for 24-72h for memory leaks and connection pool exhaustion). Each type reveals a different problem. We don’t just “throw 1000 users at it” — we simulate real behaviour patterns.

k6 as primary tool: JavaScript scripts, CI/CD integration, Grafana dashboards. Scripts versioned in Git alongside application code. Thresholds defined as code — the test fails if P95 latency > 200ms or error rate > 1%. Distributed load from multiple regions for global applications.
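The threshold logic ("fail if P95 latency > 200ms or error rate > 1%") expressed as a Python sketch, e.g. for post-processing exported results; k6 itself evaluates thresholds natively in its JavaScript configuration:

```python
# Nearest-rank P95 plus a pass/fail decision mirroring the thresholds
# from the text. Limits (200ms, 1%) are the illustrative ones above.
import math

def p95(latencies_ms: list[float]) -> float:
    ordered = sorted(latencies_ms)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]

def thresholds_pass(latencies_ms: list[float], errors: int, requests: int) -> bool:
    return p95(latencies_ms) <= 200 and errors / requests <= 0.01

fast_run = [50.0] * 90 + [180.0] * 10   # slow tail still under 200ms
slow_run = [50.0] * 90 + [900.0] * 10   # 10% of requests at 900ms
assert thresholds_pass(fast_run, errors=3, requests=1000) is True
assert thresholds_pass(slow_run, errors=3, requests=1000) is False
```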

Profiling and bottleneck analysis: A load test is just the beginning. The important thing is understanding WHY the system doesn’t reach its target. APM profiling (async profiler, dotTrace), database query analysis (slow query log, execution plans), resource monitoring (CPU, memory, network, disk I/O). We identify the top 3 bottlenecks and fix them.

Baseline and trending: We compare every release against a baseline. Automatic performance regression detection in CI. Trend dashboard — P95 latency grows by 5ms with every release, in 6 months it will be a problem. Better to address it now.

Capacity planning: From load tests we extrapolate: how many users can we handle on current infrastructure? What does it cost to scale 2×? 10×? Data-driven infrastructure decisions, not guesses.


Incident Response

Runbooks, on-call processes, blameless post-mortems. The same errors don't happen twice.

Incidents happen. What matters is what you do afterwards. Organisations without an incident response process improvise under stress. Result: long MTTR, poor communication, repeated mistakes. We build processes that work on Sunday night.

Severity framework: SEV1 (business impact, customers affected) → immediate escalation, war room, 15-min status updates. SEV2 (degraded performance) → on-call responds within 30 min. SEV3 (minor issue) → resolved in business hours. SEV4 (cosmetic) → backlog. Clear rules, no debates about severity during an incident.
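The framework becomes useful precisely because it can be expressed as deterministic rules, e.g. as input for PagerDuty routing. A sketch (field names and the classification inputs are our simplification):

```python
# Severity policy as data, classification as a deterministic function:
# no debates about severity while the incident is running.

SEVERITY_POLICY = {
    "SEV1": {"escalate": "war-room",       "response_min": 0,    "status_updates_min": 15},
    "SEV2": {"escalate": "on-call",        "response_min": 30,   "status_updates_min": None},
    "SEV3": {"escalate": "business-hours", "response_min": None, "status_updates_min": None},
    "SEV4": {"escalate": "backlog",        "response_min": None, "status_updates_min": None},
}

def classify(customers_affected: bool, degraded: bool, cosmetic: bool) -> str:
    if customers_affected:
        return "SEV1"
    if degraded:
        return "SEV2"
    return "SEV4" if cosmetic else "SEV3"

assert classify(customers_affected=True, degraded=False, cosmetic=False) == "SEV1"
assert SEVERITY_POLICY[classify(False, True, False)]["response_min"] == 30
```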

Runbooks: Step-by-step procedures for the top 15-20 incidents. “API returning 500” → check health endpoints → check database connectivity → check recent deployments → rollback if needed. A runbook is not an essay — it is a checklist. Updated after every post-mortem.

On-call: Rotation (typically weekly), primary + secondary on-call. PagerDuty/OpsGenie with intelligent routing. Escalation matrix — if primary doesn’t respond within 5 minutes, secondary is notified. Compensation for on-call — people who wake up at night deserve recognition.

Blameless post-mortem: Within 48 hours of SEV1/SEV2. Incident timeline, root cause, contributing factors, action items with owners and deadlines. No “whose fault is it” — instead “what do we change so this doesn’t happen again”. Sharing learnings across the organisation. Post-mortem database as a knowledge base.

Chaos engineering: Controlled failure injection in production. Shutting down an instance, increasing latency, simulating network partition. Verifying that failover and degradation mechanisms work. Netflix-style Game Days quarterly.


Quality Gates

Automatic quality checks in CI/CD. Deploy is halted when quality falls below standard.

A quality gate is an automatic guardian. Code that doesn’t meet the quality standard doesn’t make it to production. No exceptions, no “I’ll deploy it and fix it later”. The gate is unforgiving but fair — the rules are clear and known in advance.

Static analysis: SonarQube / SonarCloud for code quality (code smells, duplication, complexity), security (OWASP Top 10, CWE), coverage. Quality profiles per project — different standards for new code vs. legacy. New code must have coverage > 80%, zero critical issues. Gradual tightening for existing codebases.

Security gates: Dependency scanning (Snyk, Dependabot) — known CVEs in dependencies block deploy. Container image scanning (Trivy) — vulnerable base images. SAST (static application security testing) integrated into CI. Secrets detection (GitLeaks) — no credentials in code.

Performance gates: Automated load test in CI (subset, 5 minutes). If P95 latency increases >10% compared to baseline, deploy is halted. Bundle size check for frontend — a new dependency must not add more than 50KB without an explicit review. Lighthouse score for web performance.

Deployment gates: Canary deployment with automatic evaluation. Metrics (error rate, latency) compared to baseline. If degradation > threshold, automatic rollback. Progressive delivery — gate at every step (5% → 25% → 50% → 100%).
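The automatic evaluation step can be sketched as a comparison of canary metrics against the stable baseline; the 1.5x degradation threshold here is illustrative:

```python
# Canary gate: promote to the next traffic step, or roll back when
# error rate or P95 latency degrades beyond the threshold.

def canary_decision(baseline: dict, canary: dict, threshold: float = 1.5) -> str:
    """Return 'promote' or 'rollback' based on relative degradation."""
    if canary["error_rate"] > baseline["error_rate"] * threshold:
        return "rollback"
    if canary["p95_ms"] > baseline["p95_ms"] * threshold:
        return "rollback"
    return "promote"

baseline = {"error_rate": 0.004, "p95_ms": 120}
healthy_canary = {"error_rate": 0.005, "p95_ms": 130}
broken_canary = {"error_rate": 0.020, "p95_ms": 125}
assert canary_decision(baseline, healthy_canary) == "promote"
assert canary_decision(baseline, broken_canary) == "rollback"
```

The same check runs at every traffic step (5% → 25% → 50% → 100%), so a bad release never reaches the majority of users.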

Culture: Quality gates only work if the team embraces them. It is not a management tool for control — it is a safety net for developers. The gate should catch what code review misses. False positive rate below 5% — otherwise the team starts ignoring gates.

Observability vs Monitoring

Monitoring tells you THAT there is a problem. Observability tells you WHY. Observability is the ability to understand what is happening inside a system — from logs, metrics and tracing.

Example from practice: With monitoring you know the API is slow. With observability you see the specific trace: a query on the orders table takes 8s due to a missing index added in yesterday's deploy. The fix takes 5 minutes instead of 5 hours.
  • Three pillars: metrics, logs, traces
  • SLO/SLI defined for critical services
  • Alerting on symptoms, not causes
  • Runbooks for the top 10 incidents
  • 95%+ test coverage
  • MTTD (mean time to detect) under 30 minutes
  • MTTR (mean time to resolve) under 4 hours
  • 0 critical bugs per quarter

How we do it

  1. Quality Assessment — We evaluate current testing processes, coverage and observability stack.
  2. Strategy & tooling — We design the testing pyramid, select tools and define SLOs/SLIs.
  3. Test automation — We implement automated tests — unit, integration, E2E and performance.
  4. Observability stack — We deploy monitoring, logging, tracing and alerting for the production environment.
  5. Continuous improvement — Regular reviews of quality metrics, expanding coverage and optimising the pipeline.

When it is time to address quality

Typical situations

  1. Tests only manual — QA clicks through before every release. Regressions are caught in production.
  2. Production is a black box — When it crashes, we search for hours. We log things but don’t know what to look for.
  3. AI in production without evals — The model runs but we don’t know if it’s degrading.
  4. Post-mortem = blame game — Searching for the culprit instead of the cause. The same errors repeat.

Quality Lifecycle

We build quality as a continuous process:

  1. Quality Assessment — Where are we today? Audit of tests, observability, incident processes.
  2. Strategy & Tooling — What to test, how, with what. Quality metrics and SLO/SLI.
  3. Implementation — Test automation, observability stack, runbooks. Hands-on delivery.
  4. Integration into CI/CD — Quality gates in the pipeline. Automatic checks.
  5. Continuous learning — Post-mortems, trend analysis, process improvement.

Stack

Jest, Cypress, Playwright, k6, Gatling, OpenTelemetry, Grafana, Prometheus, Loki, Jaeger, Elasticsearch, Kibana, Datadog, PagerDuty, OpsGenie, SonarQube, pytest, LangSmith, Ragas.

Frequently asked questions

Where should we start with test automation?

Start where it hurts most. Identify critical business flows and write e2e tests. Then add integration tests for the API. You don't need 100% coverage from day one.

Is test automation worth the investment?

The initial investment is higher, but the ROI arrives within 3-6 months. A manual QA team clicking through regression tests costs more and is slower.

What are AI evals?

Systematic measurement of AI model quality — precision, recall, safety. Detection of degradation over time. Without evals you don't know whether your agent is performing better or worse than last week.

How long does it take to set up observability?

Basic monitoring with alerting in 2-4 weeks. Full observability stack (metrics + logs + traces + dashboards) in 6-8 weeks.

Do you have a project?

Let's talk about it.

Book a meeting