
Disaster Recovery as Code — Automated Cloud Disaster Recovery

12. 11. 2025 · Updated: 27. 03. 2026 · 11 min read · CORE SYSTEMS · cloud

Disaster Recovery as Code (DRaC) is an approach that transfers the entire disaster recovery strategy into version-controlled code. No manual runbooks, no outdated wiki pages — everything is automated, testable, and reproducible.

Why Traditional Disaster Recovery Fails

Most Czech companies have a DR plan. On paper. In SharePoint. Last updated two years ago. And nobody has ever tested it end-to-end.

The statistics are clear: according to a 2025 Gartner survey, 76% of DR tests fail on the first attempt. Why? The documentation doesn’t match reality. The infrastructure has changed since the last review. The people who wrote the plan are no longer with the company. Passwords have expired. Certificates are invalid.

Traditional DR suffers from a fundamental problem: it separates infrastructure code from recovery code. Your Terraform manages production, but DR procedures live in Confluence. These two worlds gradually diverge — and when disaster strikes, you discover this at the worst possible moment.

The Cost of Downtime in 2026

For a mid-sized Czech company with online operations, an hour of downtime costs CZK 150,000–500,000. For a bank or e-commerce platform, significantly more. And that doesn’t count reputational damage, regulatory fines (NIS2, DORA), or loss of customer trust.

The NIS2 directive, effective since October 2024, explicitly requires a tested recovery plan and the ability to demonstrate its functionality. A paper plan is no longer sufficient.

What Is Disaster Recovery as Code

DRaC applies Infrastructure as Code principles to the entire disaster recovery lifecycle:

  1. Declarative definition — DR strategy described in code (Terraform, Pulumi, Crossplane)
  2. Versioning — every change to the DR plan is a commit in Git with a review process
  3. Automated testing — DR tests run regularly in the CI/CD pipeline
  4. Idempotence — running the DR process multiple times leads to the same result
  5. Documentation as code — runbooks generated from code, always up to date
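Principle 5 in practice can be a small script in the repository that regenerates the runbook on every merge. A minimal sketch, with hypothetical hard-coded values; in a real pipeline they would come from `terraform output -raw <name>`:

```shell
#!/bin/sh
# Sketch: render an always-current runbook from infrastructure outputs.
# Values are stubbed here so the sketch is self-contained; in CI they
# would be fetched with `terraform output -raw`.
DR_REGION="northeurope"
DR_ENDPOINT="https://dr.app.example.com"

cat > runbook.md <<EOF
# DR Runbook (generated)
- DR region: $DR_REGION
- DR health endpoint: $DR_ENDPOINT
- Failover: trigger the "DR Failover" workflow with a reason and FAILOVER confirmation
EOF
cat runbook.md
```

Commit the script next to the Terraform code and the runbook can never drift from the infrastructure it describes.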

DRaC Architecture

The entire system consists of several layers:

Layer 1: Infrastructure Definition
Terraform or Pulumi modules defining both production and DR infrastructure. The key is that both environments share the same modules — they differ only in parameters (region, sizing, activation).

Layer 2: Data Replication
Asynchronous or synchronous data replication between the primary and DR site. For databases: native replication (PostgreSQL streaming, MySQL Group Replication). For storage: cross-region replication (Azure GRS, AWS S3 CRR). For Kubernetes: Velero or Kasten K10.

Layer 3: Failover Orchestration
An automated failover process — DNS switching, traffic routing, database promotion, application startup sequencing. Implemented as a pipeline (GitHub Actions, GitLab CI, Azure DevOps).

Layer 4: Validation & Monitoring
Automated smoke tests after failover. Health checks. Alerting. RPO/RTO metrics.

Practical Implementation with Terraform

Dual-region Setup on Azure

The foundation is a Terraform workspace with two environments. Production runs in West Europe, DR standby in North Europe:

# variables.tf — shared parameters for the production and DR environments
variable "is_dr_active" {
  type    = bool
  default = false
}

variable "region" {
  type = string
}

resource "azurerm_resource_group" "main" {
  name     = "rg-app-${var.region}"
  location = var.region
}

resource "azurerm_app_service_plan" "main" {
  name                = "asp-${var.region}"
  location            = var.region
  resource_group_name = azurerm_resource_group.main.name

  sku {
    tier     = var.is_dr_active ? "Standard" : "Basic"
    size     = var.is_dr_active ? "S2" : "B1"
    capacity = var.is_dr_active ? 3 : 1
  }
}

The DR environment runs on minimal resources (warm standby). Upon activation, it automatically scales to production level. Savings: 60–80% of costs compared to hot standby, with failover under 15 minutes.
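Both environments instantiating the same shared module, as described above, might look like this sketch (the module path and file layout are illustrative):

```hcl
# prod/main.tf — primary site, always at full size
module "app" {
  source       = "../modules/app"   # hypothetical shared module
  region       = "westeurope"
  is_dr_active = true               # production always runs at full capacity
}

# dr/main.tf — warm standby, scaled down until failover
module "app" {
  source       = "../modules/app"
  region       = "northeurope"
  is_dr_active = false              # flipped to true by the failover pipeline
}
```

Because both root configurations point at the same module, the two environments cannot silently diverge — any change is applied to both or neither.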

Database failover

For Azure SQL Database or PostgreSQL Flexible Server:

# Excerpt — administrator credentials and storage settings omitted
resource "azurerm_postgresql_flexible_server" "primary" {
  name                = "psql-primary"
  resource_group_name = azurerm_resource_group.main.name
  location            = "westeurope"
  sku_name            = "GP_Standard_D4s_v3"

  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2"
  }
}

resource "azurerm_postgresql_flexible_server" "replica" {
  name                = "psql-replica"
  resource_group_name = azurerm_resource_group.main.name
  location            = "northeurope"
  create_mode         = "Replica"
  source_server_id    = azurerm_postgresql_flexible_server.primary.id
  sku_name            = "GP_Standard_D2s_v3"
}

Read replica in the secondary region. During failover, it gets promoted to primary server. RPO depends on replication lag — typically under 5 seconds for asynchronous replication within the Azure backbone.
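Before an automated failover promotes the replica, it is worth gating on the current replication lag; otherwise you silently accept a worse RPO than planned. A minimal sketch with a stubbed lag value (on a real replica the number would come from PostgreSQL’s `pg_last_xact_replay_timestamp()`):

```shell
#!/bin/sh
# RPO guard: refuse automated failover if replication lag exceeds the target.
# On a real replica you would read the lag with something like:
#   LAG_SECONDS=$(psql "$REPLICA_DB" -t -A -c \
#     "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())")
LAG_SECONDS=${LAG_SECONDS:-3}   # stubbed so the sketch runs standalone
MAX_RPO=30                      # RPO target in seconds

if [ "${LAG_SECONDS%.*}" -gt "$MAX_RPO" ]; then
  echo "Lag ${LAG_SECONDS}s exceeds RPO target ${MAX_RPO}s - manual decision required"
  exit 1
fi
echo "Lag ${LAG_SECONDS}s within RPO target (${MAX_RPO}s)"
```

A check like this belongs in the pre-check job of the failover pipeline: automation proceeds only when the data loss is within the agreed budget.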

Runbooks as Code

A traditional runbook is a Word document with steps like “Log in to server X and run command Y.” A DRaC runbook is executable code:

# .github/workflows/dr-failover.yml
name: DR Failover
on:
  workflow_dispatch:
    inputs:
      reason:
        description: 'Failover reason'
        required: true
      confirm:
        description: 'Confirm: FAILOVER'
        required: true

jobs:
  pre-check:
    runs-on: ubuntu-latest
    outputs:
      start_time: ${{ steps.start.outputs.start_time }}
    steps:
      - name: Record start time
        id: start
        run: echo "start_time=$(date +%s)" >> "$GITHUB_OUTPUT"

      - name: Validate confirmation
        run: |
          if [ "${{ inputs.confirm }}" != "FAILOVER" ]; then
            echo "Failover not confirmed. Terminating."
            exit 1
          fi

      - name: Check DR site health
        run: |
          STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://dr.app.example.com/health)
          if [ "$STATUS" != "200" ]; then
            echo "DR site is not healthy. Status: $STATUS"
            exit 1
          fi

      - name: Snapshot current state
        run: terraform -chdir=infra/production output -json > /tmp/pre-failover-state.json

  failover:
    needs: pre-check
    runs-on: ubuntu-latest
    steps:
      - name: Promote DR database
        run: |
          az postgres flexible-server replica stop-replication \
            --resource-group rg-app-northeurope \
            --name psql-replica

      - name: Scale DR infrastructure
        run: |
          cd infra/dr
          terraform apply -var="is_dr_active=true" -auto-approve

      - name: Update DNS
        run: |
          az network dns record-set a update \
            --resource-group rg-dns \
            --zone-name app.example.com \
            --name "@" \
            --set "ARecords[0].ipv4Address=$DR_IP"

      - name: Smoke tests
        run: |
          sleep 60  # DNS propagation
          ./scripts/smoke-tests.sh https://app.example.com

      - name: Notify team
        run: |
          curl -X POST "$SLACK_WEBHOOK" \
            -d '{"text":"🚨 DR Failover completed. Reason: ${{ inputs.reason }}"}'

  post-failover:
    needs: [pre-check, failover]
    runs-on: ubuntu-latest
    steps:
      - name: Validate RPO/RTO
        run: |
          START_TIME=${{ needs.pre-check.outputs.start_time }}
          END_TIME=$(date +%s)
          RTO=$((END_TIME - START_TIME))
          echo "RTO: ${RTO}s"
          if [ $RTO -gt 900 ]; then
            echo "⚠️ RTO exceeded 15min target"
          fi

Every step is automated. Human intervention is needed only for confirming the failover — the rest runs automatically.

RPO and RTO — How to Realistically Achieve Them

Recovery Point Objective (RPO)

RPO defines how much data you can afford to lose. It depends on the replication strategy:

| Strategy | RPO | Cost | Note |
| --- | --- | --- | --- |
| Synchronous replication | ~0 | High | Latency penalty; suitable for finance |
| Asynchronous replication | 1–30 s | Medium | Sweet spot for most applications |
| Periodic snapshots | 1–24 h | Low | Acceptable for non-critical data |
| Backup & restore | 24 h+ | Lowest | Only for archival data |

Recommendation for Czech companies: Asynchronous replication with RPO under 30 seconds covers 90% of use cases. Synchronous replication only for financial transactions and regulatory-sensitive data.

Recovery Time Objective (RTO)

RTO defines how quickly you must be back online:

  • Hot standby (RTO < 5 min): DR environment runs at full capacity, you just switch traffic. Expensive — you pay for 2x infrastructure.
  • Warm standby (RTO 5–30 min): DR environment runs on minimal resources, scales up on failover. Best price/performance ratio.
  • Pilot light (RTO 30–120 min): Only data layer replicated, compute is provisioned on failover.
  • Cold standby (RTO 2–24h): Only backups. Infrastructure is built from scratch.
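For warm standby, it helps to write the RTO target down as an explicit per-phase budget that a pipeline can check after every test failover. The phase durations below are purely illustrative:

```shell
#!/bin/sh
# Warm-standby RTO budget (illustrative phase durations, in seconds).
DETECT=120    # outage detected and failover confirmed
PROMOTE=60    # database replica promotion
SCALE=300     # terraform apply scales up DR compute
DNS=180       # traffic cutover including TTL expiry
SMOKE=60      # post-failover smoke tests

RTO=$((DETECT + PROMOTE + SCALE + DNS + SMOKE))
echo "Estimated RTO: ${RTO}s (~$((RTO / 60)) min)"
if [ "$RTO" -le 900 ]; then
  echo "Within the 15-minute warm-standby target"
fi
```

Replacing the stubbed phase values with measured timings from each weekly DR test turns the budget into a regression check: any phase that drifts shows up immediately.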

Automated DR Tests

A DR plan that isn’t tested is not a DR plan. DRaC enables automated testing without impacting production:

Chaos Engineering for DR

# dr-test-weekly.yml
name: Weekly DR Test
on:
  schedule:
    - cron: '0 3 * * 0'  # Every Sunday at 3:00 AM

jobs:
  dr-test:
    runs-on: ubuntu-latest
    steps:
      - name: Provision test environment
        run: |
          cd infra/dr-test
          terraform apply -auto-approve

      - name: Restore from latest backup
        run: |
          LATEST_BACKUP=$(az backup item show --query "properties.lastRecoveryPoint" -o tsv)
          az backup restore --restore-mode AlternateLocation \
            --target-resource-group rg-dr-test \
            --recovery-point $LATEST_BACKUP

      - name: Run application tests
        run: ./scripts/full-integration-tests.sh https://dr-test.internal

      - name: Validate data integrity
        run: |
          PROD_COUNT=$(psql $PROD_DB -c "SELECT count(*) FROM orders" -t)
          DR_COUNT=$(psql $DR_DB -c "SELECT count(*) FROM orders" -t)
          DIFF=$((PROD_COUNT - DR_COUNT))
          echo "Data difference: $DIFF records (RPO indicator)"

      - name: Cleanup
        if: always()
        run: terraform -chdir=infra/dr-test destroy -auto-approve

      - name: Report
        run: |
          echo "DR Test Results:" >> $GITHUB_STEP_SUMMARY
          echo "- Restore time: ${RESTORE_TIME}s" >> $GITHUB_STEP_SUMMARY
          echo "- Data integrity: ${DATA_DIFF} records delta" >> $GITHUB_STEP_SUMMARY
          echo "- Integration tests: ${TEST_RESULT}" >> $GITHUB_STEP_SUMMARY

Every week, the entire DR process is automatically tested. Without human intervention. Results go to a dashboard, and an alert is sent on failure.

Multi-cloud DR Strategy

For companies with regulatory requirements (banks, public sector), multi-cloud DR may be necessary — primary operations on Azure, DR on AWS, or vice versa:

Crossplane for Multi-cloud Orchestration

Crossplane enables defining infrastructure in a cloud-agnostic way. The manifest below is illustrative; in practice you define the schema yourself as a composite resource (XRD) with a Composition per provider:

apiVersion: compute.crossplane.io/v1alpha1
kind: Workload
metadata:
  name: app-dr
spec:
  primary:
    provider: azure
    region: westeurope
    resources:
      compute: 4vCPU/16GB
      storage: 500GB-SSD
  failover:
    provider: aws
    region: eu-central-1
    resources:
      compute: 4vCPU/16GB
      storage: 500GB-SSD
    trigger:
      healthCheck:
        endpoint: https://app.example.com/health
        interval: 30s
        threshold: 3

Advantage: protection against vendor lock-in and single-provider failure. If Azure suffers a regional outage (as happened in January 2023), the AWS DR site takes over.

Disadvantage: Complexity. You must maintain application compatibility with both clouds. We recommend only for Tier 0 services (core banking, critical infrastructure).

Kubernetes-native DR with Velero

For companies running on Kubernetes, Velero is the de facto standard for backup and DR:

# Install Velero with Azure provider
velero install \
  --provider azure \
  --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 \
  --bucket velero-backups \
  --backup-location-config resourceGroup=rg-velero,storageAccount=stvelero \
  --snapshot-location-config resourceGroup=rg-velero

# Regular backup
velero schedule create daily-backup \
  --schedule="0 2 * * *" \
  --include-namespaces production \
  --ttl 720h

# DR restore to a different cluster
velero restore create --from-backup daily-backup-20260216 \
  --namespace-mappings production:dr-production

Velero backs up:

  • Kubernetes resources (Deployments, Services, ConfigMaps, Secrets)
  • Persistent Volumes (Azure Disk snapshots, EBS snapshots)
  • Custom Resources (CRDs, operators)

Restore to a DR cluster typically takes 5–15 minutes depending on PV size.

Costs — How Much Does DRaC Cost

Typical calculation for a mid-sized Czech company (50 employees, online product):

| Item | Monthly cost | Note |
| --- | --- | --- |
| Warm standby infra | CZK 15,000–30,000 | 20–30% of production costs |
| Cross-region replication | CZK 2,000–5,000 | Data transfer + storage |
| CI/CD for DR tests | CZK 500–2,000 | GitHub Actions minutes |
| Monitoring & alerting | CZK 3,000–8,000 | Datadog/Grafana Cloud |
| **Total** | **CZK 20,500–45,000** | |

Compare that with the cost of a single hour of downtime (CZK 150,000–500,000). DR pays for itself with the first incident.
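Using the hourly downtime figure from the beginning of the article, the break-even works out as a few lines of arithmetic (illustrative mid-range values):

```shell
#!/bin/sh
# Break-even sketch: yearly DRaC spend vs. cost of avoided downtime (CZK).
DR_MONTHLY=45000        # upper bound of the monthly total above
OUTAGE_PER_HOUR=300000  # mid-range hourly downtime cost for a mid-sized company

YEARLY_DR=$((DR_MONTHLY * 12))
# Ceiling division: hours of avoided downtime needed per year to break even
BREAK_EVEN_HOURS=$(( (YEARLY_DR + OUTAGE_PER_HOUR - 1) / OUTAGE_PER_HOUR ))
echo "Yearly DR spend: ${YEARLY_DR} CZK"
echo "Break-even: avoiding ~${BREAK_EVEN_HOURS} hour(s) of downtime per year"
```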

Cost Optimization

  • Spot instances for DR compute — warm standby doesn’t have to be on-demand
  • Lifecycle policies — older snapshots automatically to cold storage
  • Shared DR infrastructure — multiple applications share one DR cluster
  • Scale-to-zero — DR compute is provisioned only during tests or failover

Regulatory Context — NIS2 and DORA

NIS2 (Network and Information Security Directive 2)

Effective since October 2024 for all essential and important entities in the EU. It requires:

  • Business Continuity Plan
  • Regular testing of DR procedures
  • Incident notification: early warning within 24 hours, incident notification within 72 hours, final report within one month
  • Demonstrable recovery capability

DRaC covers NIS2 requirements naturally — automated tests generate auditable records, code in Git provides change history, and pipeline logs prove regular testing.

DORA (Digital Operational Resilience Act)

For the financial sector. Since January 2025, it requires:

  • Threat-Led Penetration Testing (TLPT) including DR scenarios
  • ICT Risk Management framework with tested DR
  • Reporting major incidents within 4 hours

Common Mistakes and How to Avoid Them

1. “We test DR once a year” An annual test is an audit, not a DR strategy. With DRaC, you test every week automatically. Costs: minimal. Certainty: maximum.

2. “We have geo-replication, we’re safe” Geo-replication protects against hardware failure. It does not protect against: ransomware (replicates encrypted data too), configuration errors (replicates broken config too), logical data corruption (replicates corrupted data too). You need point-in-time recovery alongside replication.

3. “DR is the responsibility of IT operations” DR is the responsibility of the entire team. Developers must write applications that support graceful failover — connection retry, circuit breaker, idempotent operations. DevOps sets up infrastructure. Business defines RPO/RTO.

4. “We’ll restore from backup” A backup without a verified restore process is just data on a disk. DRaC includes automatic restore testing — every backup is validated by actually restoring a test environment from it.

5. “DNS failover is enough” DNS propagation takes minutes to hours (TTL). For RTO under 5 minutes, you need anycast routing or a load balancer at the global level (Azure Front Door, AWS Global Accelerator, Cloudflare).
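The arithmetic behind that warning: detection time plus a full TTL expiry can consume the entire RTO budget on its own (illustrative values):

```shell
#!/bin/sh
# Why DNS-only failover breaks a 5-minute RTO: worst case is roughly
# detection time plus a full TTL expiry in resolver caches.
TTL=300      # a common default record TTL, in seconds
DETECT=180   # health-check detection + decision time

WORST_CASE=$((DETECT + TTL))
echo "Worst-case DNS-only cutover: ${WORST_CASE}s"
if [ "$WORST_CASE" -gt 300 ]; then
  echo "Exceeds a 5-minute RTO - use anycast or a global load balancer"
fi
```

And that is the optimistic case: some resolvers and client libraries cache records well beyond the published TTL.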

Implementation Plan — 12 Weeks

Weeks 1–2: Assessment

  • Inventory of applications and data flows
  • Classification by criticality (Tier 0/1/2/3)
  • RPO/RTO definition per tier
  • Gap analysis of existing DR

Weeks 3–4: Architecture

  • DR strategy selection per tier
  • Terraform module design
  • Replication strategy for databases and storage
  • Cost estimation

Weeks 5–8: Implementation

  • Terraform modules for DR infrastructure
  • Replication setup and validation
  • Failover pipeline (GitHub Actions / Azure DevOps)
  • Smoke tests and health checks

Weeks 9–10: Testing

  • Full DR test (failover + failback)
  • Performance tests of DR environment
  • RPO/RTO validation against targets
  • Security review of DR access

Weeks 11–12: Operationalization

  • Automated weekly test in CI/CD
  • Documentation generated from code
  • On-call runbook integration (PagerDuty, OpsGenie)
  • Team training

Conclusion

Disaster Recovery as Code is not a luxury — it is a necessity for every company that depends on digital services. And in 2026, in the era of NIS2 and DORA, it is a regulatory requirement.

Key principles:

  • Everything in code — no manual runbooks, no wiki pages
  • Test regularly — automatically, every week, no exceptions
  • Warm standby is the sweet spot — 20–30% of costs, RTO under 15 minutes
  • RPO and RTO are defined by business, not IT — start with a conversation about value
  • DR is a continuous process — not a one-time project

Czech companies that implement DRaC today will have a measurable competitive advantage: lower risk, faster recovery, simpler compliance, and better sleep for the entire team.

Need help with implementation? CORE SYSTEMS offers DR Assessment — a 2-week analysis of your infrastructure with a concrete implementation plan. Contact us for a free consultation.

Tags: disaster-recovery, infrastructure-as-code, cloud, azure, aws, resilience, automation