Disaster Recovery as Code (DRaC) is an approach that moves the entire disaster recovery strategy into version-controlled code. No manual runbooks, no outdated wiki pages — everything is automated, testable, and reproducible.
Why Traditional Disaster Recovery Fails¶
Most Czech companies have a DR plan. On paper. In SharePoint. Last updated two years ago. And nobody has ever tested it end-to-end.
Statistics speak clearly: according to a 2025 Gartner survey, 76% of DR tests fail on the first attempt. Reason? Documentation doesn’t match reality. Infrastructure has changed since the last review. People who wrote the plan are no longer with the company. Passwords have expired. Certificates are invalid.
Traditional DR suffers from a fundamental problem: it separates infrastructure code from recovery code. Your Terraform manages production, but DR procedures live in Confluence. These two worlds gradually diverge — and when disaster strikes, you discover this at the worst possible moment.
The Cost of Downtime in 2026¶
For a mid-sized Czech company with online operations, an hour of downtime costs CZK 150,000–500,000. For a bank or e-commerce platform, significantly more. And that doesn’t count reputational damage, regulatory fines (NIS2, DORA), or loss of customer trust.
The NIS2 directive, effective since October 2024, explicitly requires a tested recovery plan and the ability to demonstrate its functionality. A paper plan is no longer sufficient.
What Is Disaster Recovery as Code¶
DRaC applies Infrastructure as Code principles to the entire disaster recovery lifecycle:
- Declarative definition — DR strategy described in code (Terraform, Pulumi, Crossplane)
- Versioning — every change to the DR plan is a commit in Git with a review process
- Automated testing — DR tests run regularly in the CI/CD pipeline
- Idempotence — running the DR process multiple times leads to the same result
- Documentation as code — runbooks generated from code, always up to date
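As a toy illustration of the idempotence principle, a recovery step can be written to converge on a desired state and report whether anything changed, so re-running it is always safe. The `ensure_dns_record` function below is hypothetical plain Python, not a real SDK call:

```python
# Toy illustration of idempotence (hypothetical function, no real DNS SDK):
# the operation converges to a desired state and reports whether it changed
# anything, so running it once or ten times ends in the same state.

def ensure_dns_record(zone: dict, name: str, ip: str) -> bool:
    """Make zone[name] equal to ip; return True only if a change was made."""
    if zone.get(name) == ip:
        return False  # already in the desired state, nothing to do
    zone[name] = ip   # converge to the desired state
    return True

zone = {"@": "10.0.0.1"}
assert ensure_dns_record(zone, "@", "10.0.0.2") is True   # first run changes state
assert ensure_dns_record(zone, "@", "10.0.0.2") is False  # rerun is a safe no-op
assert zone == {"@": "10.0.0.2"}
```

This is the same contract Terraform gives you: running `apply` twice produces the same infrastructure.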
DRaC Architecture¶
The entire system consists of several layers:
Layer 1: Infrastructure Definition. Terraform or Pulumi modules defining both production and DR infrastructure. The key is that both environments share the same modules — they differ only in parameters (region, sizing, activation).
Layer 2: Data Replication. Asynchronous or synchronous data replication between primary and DR site. For databases: native replication (PostgreSQL streaming, MySQL Group Replication). For storage: cross-region replication (Azure GRS, AWS S3 CRR). For Kubernetes: Velero or Kasten K10.
Layer 3: Failover Orchestration. Automated failover process — DNS switching, traffic routing, database promotion, application startup sequence. Implemented as a pipeline (GitHub Actions, GitLab CI, Azure DevOps).
Layer 4: Validation & Monitoring. Automated smoke tests after failover. Health checks. Alerting. RPO/RTO metrics.
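Layer 4 can start as small as a script. The sketch below, using only the Python standard library and hypothetical endpoint paths, probes health endpoints after failover and aggregates pass/fail:

```python
import urllib.request
import urllib.error

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True iff the endpoint answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError, OSError):
        return False

def run_smoke_tests(base_url: str, paths: list[str]) -> dict[str, bool]:
    """Probe each path once; the failover pipeline fails if any check fails."""
    return {path: check_health(base_url + path) for path in paths}

def all_healthy(results: dict[str, bool]) -> bool:
    """An empty result set counts as unhealthy: no evidence is not evidence."""
    return bool(results) and all(results.values())
```

In a pipeline, `all_healthy(run_smoke_tests(...))` deciding the job's exit code is all the orchestration needed.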
Practical Implementation with Terraform¶
Dual-region Setup on Azure¶
The foundation is a Terraform workspace with two environments. Production runs in West Europe, DR standby in North Europe:
variable "is_dr_active" {
  type    = bool
  default = false
}

variable "region" {
  type = string
}

resource "azurerm_resource_group" "main" {
  name     = "rg-app-${var.region}"
  location = var.region
}

resource "azurerm_app_service_plan" "main" {
  name                = "asp-${var.region}"
  location            = var.region
  resource_group_name = azurerm_resource_group.main.name

  sku {
    tier     = var.is_dr_active ? "Standard" : "Basic"
    size     = var.is_dr_active ? "S2" : "B1"
    capacity = var.is_dr_active ? 3 : 1
  }
}
The DR environment runs on minimal resources (warm standby). Upon activation, it automatically scales to production level. Savings: 60–80% of costs compared to hot standby, with failover under 15 minutes.
Database failover¶
For Azure SQL Database or PostgreSQL Flexible Server:
resource "azurerm_postgresql_flexible_server" "primary" {
  name                = "psql-primary"
  resource_group_name = azurerm_resource_group.main.name # RG defined in the previous example
  location            = "westeurope"
  version             = "16"
  sku_name            = "GP_Standard_D4s_v3"

  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2"
  }
}

resource "azurerm_postgresql_flexible_server" "replica" {
  name                = "psql-replica"
  resource_group_name = azurerm_resource_group.main.name
  location            = "northeurope"
  create_mode         = "Replica"
  source_server_id    = azurerm_postgresql_flexible_server.primary.id
  sku_name            = "GP_Standard_D2s_v3"
}
A read replica runs in the secondary region; during failover it is promoted to primary. RPO depends on replication lag, typically under 5 seconds for asynchronous replication within the Azure backbone.
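That lag can be monitored against the RPO budget with a trivial check. The sketch below assumes the replica-side timestamp is read with PostgreSQL's `pg_last_xact_replay_timestamp()`; the surrounding helper functions are illustrative:

```python
from datetime import datetime, timedelta, timezone

RPO_BUDGET = timedelta(seconds=30)  # illustrative async-replication target

def replication_lag(primary_now: datetime, last_replayed: datetime) -> timedelta:
    """Lag between the primary's clock and the replica's last replayed commit.
    On a PostgreSQL replica the second value comes from:
    SELECT pg_last_xact_replay_timestamp();"""
    return primary_now - last_replayed

def within_rpo(lag: timedelta, budget: timedelta = RPO_BUDGET) -> bool:
    """True if the current lag stays inside the RPO budget."""
    return lag <= budget

now = datetime(2026, 2, 16, 3, 0, 5, tzinfo=timezone.utc)
replayed = datetime(2026, 2, 16, 3, 0, 2, tzinfo=timezone.utc)
assert within_rpo(replication_lag(now, replayed))   # 3 s lag, within 30 s budget
assert not within_rpo(timedelta(seconds=45))        # 45 s lag would breach it
```

Wiring this check into alerting turns the RPO from a document promise into a continuously measured metric.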
Runbooks as Code¶
A traditional runbook is a Word document with steps like “Log in to server X and run command Y.” A DRaC runbook is executable code:
# .github/workflows/dr-failover.yml
name: DR Failover

on:
  workflow_dispatch:
    inputs:
      reason:
        description: 'Failover reason'
        required: true
      confirm:
        description: 'Confirm: FAILOVER'
        required: true

jobs:
  pre-check:
    runs-on: ubuntu-latest
    outputs:
      start_time: ${{ steps.clock.outputs.start_time }}
    steps:
      - name: Validate confirmation
        run: |
          if [ "${{ inputs.confirm }}" != "FAILOVER" ]; then
            echo "Failover not confirmed. Terminating."
            exit 1
          fi

      - name: Record start time
        id: clock
        run: echo "start_time=$(date +%s)" >> "$GITHUB_OUTPUT"

      - name: Check DR site health
        run: |
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://dr.app.example.com/health)
          if [ "$STATUS" != "200" ]; then
            echo "DR site is not healthy. Status: $STATUS"
            exit 1
          fi

      - uses: actions/checkout@v4

      - name: Snapshot current state
        run: terraform -chdir=infra/production output -json > /tmp/pre-failover-state.json

  failover:
    needs: pre-check
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Promote DR database
        run: |
          az postgres flexible-server replica stop-replication \
            --resource-group rg-app-northeurope \
            --name psql-replica

      - name: Scale DR infrastructure
        run: |
          cd infra/dr
          terraform apply -var="is_dr_active=true" -auto-approve

      - name: Update DNS
        run: |
          az network dns record-set a update \
            --resource-group rg-dns \
            --zone-name app.example.com \
            --name "@" \
            --set "ARecords[0].ipv4Address=$DR_IP"

      - name: Smoke tests
        run: |
          sleep 60  # DNS propagation
          ./scripts/smoke-tests.sh https://app.example.com

      - name: Notify team
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -d '{"text":"🚨 DR Failover completed. Reason: ${{ inputs.reason }}"}'

  post-failover:
    needs: [pre-check, failover]
    runs-on: ubuntu-latest
    steps:
      - name: Validate RPO/RTO
        run: |
          START_TIME=${{ needs.pre-check.outputs.start_time }}
          END_TIME=$(date +%s)
          RTO=$((END_TIME - START_TIME))
          echo "RTO: ${RTO}s"
          if [ $RTO -gt 900 ]; then
            echo "⚠️ RTO exceeded 15min target"
          fi
Every step is codified; the only human intervention is confirming the failover.
RPO and RTO — How to Realistically Achieve Them¶
Recovery Point Objective (RPO)¶
RPO defines how much data you can afford to lose. It depends on the replication strategy:
| Strategy | RPO | Cost | Note |
|---|---|---|---|
| Synchronous replication | ~0 | High | Latency penalty, suitable for finance |
| Asynchronous replication | 1–30s | Medium | Sweet spot for most applications |
| Periodic snapshots | 1–24h | Low | Acceptable for non-critical data |
| Backup & Restore | 24h+ | Lowest | Only for archival data |
Recommendation for Czech companies: Asynchronous replication with RPO under 30 seconds covers 90% of use cases. Synchronous replication only for financial transactions and regulatory-sensitive data.
Recovery Time Objective (RTO)¶
RTO defines how quickly you must be back online:
- Hot standby (RTO < 5 min): DR environment runs at full capacity, you just switch traffic. Expensive — you pay for 2x infrastructure.
- Warm standby (RTO 5–30 min): DR environment runs on minimal resources, scales up on failover. Best price/performance ratio.
- Pilot light (RTO 30–120 min): Only data layer replicated, compute is provisioned on failover.
- Cold standby (RTO 2–24h): Only backups. Infrastructure is built from scratch.
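To make the trade-off concrete, a hypothetical helper can map a business RTO target to the cheapest tier from the list above, using the upper ends of the stated ranges as worst-case values:

```python
# Hypothetical helper mapping a business RTO target (minutes) to the cheapest
# standby tier that still meets it; worst-case RTOs come from the list above.
STRATEGIES = [                  # ordered cheapest first
    ("cold standby", 24 * 60),  # backups only, rebuild from scratch
    ("pilot light", 120),       # data replicated, compute provisioned on failover
    ("warm standby", 30),       # minimal compute running, scaled up on failover
    ("hot standby", 5),         # full capacity, traffic switch only
]

def cheapest_strategy(rto_target_min: int) -> str:
    """Return the least expensive tier whose worst-case RTO fits the target."""
    for name, worst_rto_min in STRATEGIES:
        if worst_rto_min <= rto_target_min:
            return name
    raise ValueError("no standby tier meets this RTO target")

assert cheapest_strategy(24 * 60) == "cold standby"
assert cheapest_strategy(60) == "warm standby"
assert cheapest_strategy(15) == "hot standby"
```

Reading the table this way makes it obvious why warm standby wins for most workloads: it is the cheapest tier that still satisfies typical sub-hour RTO targets.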
Automated DR Tests¶
A DR plan that isn’t tested is not a DR plan. DRaC enables automated testing without impacting production:
Chaos Engineering for DR¶
# dr-test-weekly.yml
name: Weekly DR Test

on:
  schedule:
    - cron: '0 3 * * 0'  # Every Sunday at 3:00 AM

jobs:
  dr-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Provision test environment
        run: |
          cd infra/dr-test
          terraform apply -auto-approve

      - name: Restore from latest backup
        run: |
          LATEST_BACKUP=$(az backup item show --query "properties.lastRecoveryPoint" -o tsv)
          az backup restore --restore-mode AlternateLocation \
            --target-resource-group rg-dr-test \
            --recovery-point $LATEST_BACKUP

      - name: Run application tests
        run: ./scripts/full-integration-tests.sh https://dr-test.internal

      - name: Validate data integrity
        run: |
          PROD_COUNT=$(psql $PROD_DB -c "SELECT count(*) FROM orders" -t)
          DR_COUNT=$(psql $DR_DB -c "SELECT count(*) FROM orders" -t)
          DIFF=$((PROD_COUNT - DR_COUNT))
          echo "Data difference: $DIFF records (RPO indicator)"

      - name: Cleanup
        if: always()
        run: terraform -chdir=infra/dr-test destroy -auto-approve

      - name: Report
        run: |
          echo "DR Test Results:" >> $GITHUB_STEP_SUMMARY
          echo "- Restore time: ${RESTORE_TIME}s" >> $GITHUB_STEP_SUMMARY
          echo "- Data integrity: ${DATA_DIFF} records delta" >> $GITHUB_STEP_SUMMARY
          echo "- Integration tests: ${TEST_RESULT}" >> $GITHUB_STEP_SUMMARY
Every week, the entire DR process is automatically tested. Without human intervention. Results go to a dashboard, and an alert is sent on failure.
Multi-cloud DR Strategy¶
For companies with regulatory requirements (banks, public sector), multi-cloud DR may be necessary — primary operations on Azure, DR on AWS, or vice versa:
Crossplane for Multi-cloud Orchestration¶
Crossplane enables defining infrastructure in a cloud-agnostic way. The manifest below is a simplified illustration of the concept rather than an exact current Crossplane schema:
apiVersion: compute.crossplane.io/v1alpha1
kind: Workload
metadata:
  name: app-dr
spec:
  primary:
    provider: azure
    region: westeurope
    resources:
      compute: 4vCPU/16GB
      storage: 500GB-SSD
  failover:
    provider: aws
    region: eu-central-1
    resources:
      compute: 4vCPU/16GB
      storage: 500GB-SSD
  trigger:
    healthCheck:
      endpoint: https://app.example.com/health
      interval: 30s
      threshold: 3
Advantage: Vendor lock-in protection. If Azure has a regional outage (it happened in January 2023), the AWS DR site takes over.
Disadvantage: Complexity. You must maintain application compatibility with both clouds. We recommend only for Tier 0 services (core banking, critical infrastructure).
Kubernetes-native DR with Velero¶
For companies running on Kubernetes, Velero is the de facto standard for backup and DR:
# Install Velero with Azure provider
velero install \
--provider azure \
--plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \
--bucket velero-backups \
--backup-location-config resourceGroup=rg-velero,storageAccount=stvelero \
--snapshot-location-config resourceGroup=rg-velero
# Regular backup
velero schedule create daily-backup \
--schedule="0 2 * * *" \
--include-namespaces production \
--ttl 720h
# DR restore to a different cluster
velero restore create --from-backup daily-backup-20260216 \
--namespace-mappings production:dr-production
Velero backs up:
- Kubernetes resources (Deployments, Services, ConfigMaps, Secrets)
- Persistent Volumes (Azure Disk snapshots, EBS snapshots)
- Custom Resources (CRDs, operators)
Restore to a DR cluster typically takes 5–15 minutes depending on PV size.
Costs — How Much Does DRaC Cost¶
Typical calculation for a mid-sized Czech company (50 employees, online product):
| Item | Monthly Cost | Note |
|---|---|---|
| Warm standby infra | CZK 15,000–30,000 | 20–30% of production costs |
| Cross-region replication | CZK 2,000–5,000 | Data transfer + storage |
| CI/CD for DR tests | CZK 500–2,000 | GitHub Actions minutes |
| Monitoring & alerting | CZK 3,000–8,000 | Datadog/Grafana Cloud |
| Total | CZK 20,500–45,000 | |
Compare that with the cost of a single hour of downtime (CZK 150,000–500,000). DR pays for itself with the first incident.
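A back-of-envelope check with the mid-range figures above (illustrative arithmetic only):

```python
# Illustrative break-even arithmetic using the mid-range figures above (CZK).
dr_monthly_cost = 32_750           # midpoint of CZK 20,500–45,000 per month
downtime_cost_per_hour = 325_000   # midpoint of CZK 150,000–500,000 per hour

annual_dr_cost = dr_monthly_cost * 12
break_even_hours = annual_dr_cost / downtime_cost_per_hour

print(f"Annual DRaC cost: CZK {annual_dr_cost:,}")
print(f"Break-even: {break_even_hours:.1f} hours of avoided downtime per year")
```

Avoiding little more than one hour of downtime per year already covers the annual spend.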
Cost Optimization¶
- Spot instances for DR compute — warm standby doesn’t have to be on-demand
- Lifecycle policies — older snapshots automatically to cold storage
- Shared DR infrastructure — multiple applications share one DR cluster
- Scale-to-zero — DR compute is provisioned only during tests or failover
Regulatory Context — NIS2 and DORA¶
NIS2 (Network and Information Security Directive 2)¶
Effective since October 2024 for all essential and important entities in the EU. It requires:
- Business Continuity Plan
- Regular testing of DR procedures
- Incident notification within 24 hours (first warning) and 72 hours (full report)
- Demonstrable recovery capability
DRaC covers NIS2 requirements naturally — automated tests generate auditable records, code in Git provides change history, and pipeline logs prove regular testing.
DORA (Digital Operational Resilience Act)¶
For the financial sector. Since January 2025, it requires:
- Threat-Led Penetration Testing (TLPT) including DR scenarios
- ICT Risk Management framework with tested DR
- Reporting major incidents within 4 hours
Common Mistakes and How to Avoid Them¶
1. “We test DR once a year.” An annual test is an audit, not a DR strategy. With DRaC, you test every week automatically. Cost: minimal. Certainty: maximum.
2. “We have geo-replication, we’re safe.” Geo-replication protects against hardware failure. It does not protect against ransomware (it replicates encrypted data too), configuration errors (it replicates broken config too), or logical data corruption (it replicates corrupted data too). You need point-in-time recovery alongside replication.
3. “DR is the responsibility of IT operations.” DR is the responsibility of the entire team. Developers must write applications that support graceful failover: connection retry, circuit breakers, idempotent operations. DevOps sets up the infrastructure. Business defines RPO/RTO.
4. “We’ll restore from backup.” A backup without a verified restore process is just data on a disk. DRaC includes automatic restore testing: every backup is validated by actually restoring a test environment from it.
5. “DNS failover is enough.” DNS propagation takes minutes to hours (TTL). For RTO under 5 minutes, you need anycast routing or a global-level load balancer (Azure Front Door, AWS Global Accelerator, Cloudflare).
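The application-side patterns from mistake 3 (connection retry, circuit breaker) can be sketched minimally. The snippet below is a deliberately simplified illustration, not production code:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: after `threshold` consecutive failures the circuit
    opens and further calls fail fast instead of hammering a dead dependency."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.failures = 0

    def call(self, fn, *args, **kwargs):
        if self.failures >= self.threshold:
            raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            raise
        self.failures = 0  # any success closes the circuit again
        return result

def retry(fn, attempts: int = 3, base_delay: float = 0.1):
    """Call fn with exponential backoff; re-raise the last error on exhaustion."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# A flaky dependency that recovers on the third call:
calls = {"n": 0}
def flaky_query():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("replica not ready")
    return "ok"

assert retry(flaky_query, attempts=3, base_delay=0.0) == "ok"
```

Real deployments would use a library for this, but the contract is the point: during a failover window, clients must ride out transient errors instead of cascading them.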
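The TTL arithmetic behind mistake 5 is simple: in the worst case a resolver keeps serving the stale record for one full TTL after the change, on top of detection and update time (illustrative numbers):

```python
def worst_case_dns_failover_s(ttl_s: int, detect_s: int, update_s: int) -> int:
    """Worst case: detection time + time to push the record change + resolvers
    serving the stale answer for up to one full TTL."""
    return detect_s + update_s + ttl_s

# With a common 300 s TTL, 90 s of detection and 30 s to update the record:
assert worst_case_dns_failover_s(ttl_s=300, detect_s=90, update_s=30) == 420  # 7 min
```

Seven minutes in this example, before counting client-side caches, which is why sub-5-minute RTO targets need global load balancing rather than DNS alone.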
Implementation Plan — 12 Weeks¶
Weeks 1–2: Assessment¶
- Inventory of applications and data flows
- Classification by criticality (Tier 0/1/2/3)
- RPO/RTO definition per tier
- Gap analysis of existing DR
Weeks 3–4: Architecture¶
- DR strategy selection per tier
- Terraform module design
- Replication strategy for databases and storage
- Cost estimation
Weeks 5–8: Implementation¶
- Terraform modules for DR infrastructure
- Replication setup and validation
- Failover pipeline (GitHub Actions / Azure DevOps)
- Smoke tests and health checks
Weeks 9–10: Testing¶
- Full DR test (failover + failback)
- Performance tests of DR environment
- RPO/RTO validation against targets
- Security review of DR access
Weeks 11–12: Operationalization¶
- Automated weekly test in CI/CD
- Documentation generated from code
- On-call runbook integration (PagerDuty, OpsGenie)
- Team training
Conclusion¶
Disaster Recovery as Code is not a luxury — it is a necessity for every company that depends on digital services. And in 2026, in the era of NIS2 and DORA, it is a regulatory requirement.
Key principles:
- Everything in code — no manual runbooks, no wiki pages
- Test regularly — automatically, every week, no exceptions
- Warm standby is the sweet spot — 20–30% of costs, RTO under 15 minutes
- RPO and RTO are defined by business, not IT — start with a conversation about value
- DR is a continuous process — not a one-time project
Czech companies that implement DRaC today will have a measurable competitive advantage: lower risk, faster recovery, simpler compliance, and better sleep for the entire team.
Need help with implementation? CORE SYSTEMS offers DR Assessment — a 2-week analysis of your infrastructure with a concrete implementation plan. Contact us for a free consultation.