SRE: Toil Reduction
How to identify and eliminate toil: measuring it, automating repetitive tasks, and strategies for reducing it.
What is Toil
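Toil is work that is manual, repetitive, and automatable, and that produces no lasting value. Google's SRE guidance recommends capping toil at 50% of each engineer's time, leaving the rest for engineering work. Toil has these characteristics: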
- Manual: requires human intervention
- Repetitive: done over and over again
- Automatable: a machine could handle it
- Tactical: reactive and interrupt-driven, with no strategic value
- Linearly growing: the workload grows in step with the number of services
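To make the 50% ceiling concrete, here is a minimal sketch that computes the toil share from a time log. The `TimeEntry` record format and the sample week are hypothetical; a real team would pull these figures from its own ticketing or time-tracking data.

```python
from dataclasses import dataclass

TOIL_BUDGET = 0.5  # the 50% ceiling mentioned above


@dataclass
class TimeEntry:
    """One logged block of work (hypothetical record format)."""
    task: str
    hours: float
    is_toil: bool


def toil_fraction(entries: list[TimeEntry]) -> float:
    """Return the share of logged hours that was spent on toil."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


def over_budget(entries: list[TimeEntry], budget: float = TOIL_BUDGET) -> bool:
    """True when toil exceeds the allowed share of working time."""
    return toil_fraction(entries) > budget


# Made-up sample week: 20 of 40 logged hours are toil.
week = [
    TimeEntry("restart stuck worker by hand", 6.0, True),
    TimeEntry("renew certificates manually", 4.0, True),
    TimeEntry("build deployment pipeline", 20.0, False),
    TimeEntry("on-call ticket triage", 10.0, True),
]
print(f"toil share: {toil_fraction(week):.0%}, over budget: {over_budget(week)}")
```

The sample week sits exactly at the ceiling, so it is not flagged; any additional manual work would push it over.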
Identifying Toil
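Measure toil systematically, for example by tracking the time spent on tickets, interrupts, and recurring manual tasks. Typical toil tasks and their usual remedies: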
- TLS/SSL certificate renewal → automate with cert-manager
- Database backup verification → CronJob + alerting
- User provisioning → SCIM/SSO
- Deployment rollback → automatic rollback via GitOps
- Log investigation → better alerting and structured logging
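As an illustration of the backup verification item, the following sketch shows the kind of check a scheduled job could run. It only confirms that the newest backup is recent and not suspiciously small; it does not prove the backup can be restored. The `*.dump` naming, the age and size thresholds, and the function name are all assumptions made for the example.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

MAX_AGE_HOURS = 26      # assumed: daily backups plus some slack
MIN_SIZE_BYTES = 1024   # assumed: anything smaller is treated as truncated


def check_latest_backup(backup_dir: Path,
                        max_age_hours: float = MAX_AGE_HOURS,
                        min_size_bytes: int = MIN_SIZE_BYTES,
                        now: Optional[float] = None) -> list[str]:
    """Return a list of problems with the newest backup; empty means healthy."""
    now = time.time() if now is None else now
    files = [p for p in backup_dir.glob("*.dump") if p.is_file()]
    if not files:
        return [f"no *.dump files found in {backup_dir}"]

    # Only the most recently modified file is inspected.
    latest = max(files, key=lambda p: p.stat().st_mtime)
    stat = latest.stat()
    problems = []
    age_hours = (now - stat.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{latest.name} is {age_hours:.1f} h old")
    if stat.st_size < min_size_bytes:
        problems.append(f"{latest.name} is only {stat.st_size} bytes")
    return problems


# Demo against a throwaway directory with one hypothetical backup file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "db-nightly.dump").write_bytes(b"x" * 4096)
    fresh = check_latest_backup(Path(tmp))
    # Simulate the same file being checked two days later.
    stale = check_latest_backup(Path(tmp), now=time.time() + 48 * 3600)

print("fresh backup problems:", fresh)
print("stale backup problems:", stale)
```

In a scheduled job, a non-empty result would typically be turned into a non-zero exit code so that the scheduler marks the run as failed and the alerting can key on that.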
Automation Strategy
- Elimination: ask whether the task is needed at all
- Automation: a script, CronJob, or operator
- Self-service: platform engineering, an internal developer portal
- Standardization: templates, golden paths
Prioritize by impact: frequency × time per occurrence × number of people involved.
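The prioritization formula can be sketched as a small ranking script. The `ToilTask` fields and all figures below are invented for illustration; the only part taken from the text is the product of frequency, time, and number of people.

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    """A recurring manual task (all figures below are made-up examples)."""
    name: str
    runs_per_month: int      # frequency
    minutes_per_run: float   # time per occurrence
    people_involved: int     # number of people

    @property
    def person_minutes_per_month(self) -> float:
        # frequency x time x number of people
        return self.runs_per_month * self.minutes_per_run * self.people_involved


def rank_by_cost(tasks: list[ToilTask]) -> list[ToilTask]:
    """Sort tasks so the most expensive toil comes first."""
    return sorted(tasks, key=lambda t: t.person_minutes_per_month, reverse=True)


tasks = [
    ToilTask("manual certificate renewal", 2, 45, 1),
    ToilTask("user provisioning", 30, 10, 2),
    ToilTask("deployment rollback", 4, 60, 3),
]
ranked = rank_by_cost(tasks)
for t in ranked:
    print(f"{t.name}: {t.person_minutes_per_month:.0f} person-minutes/month")
```

With these sample numbers, a rare but long, multi-person task (deployment rollback) outranks a frequent but quick one (user provisioning), which is the kind of trade-off the product is meant to surface.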
Summary
Toil reduction is a key SRE discipline. Measure toil, prioritize by impact, and automate systematically.