A community resource  ·  v1.0

Reclaim SRE

Site Reliability Engineering has a meaning. Somewhere along the way, it became a job title handed out to on-call support staff. Let's fix that.

The problem

You might be an SRE in name only

Every week, engineers post something like this:

"So I have role of SRE but 80% of our tasks are support and 20% are responding to alerts... What can I do to be an SRE or apply for an SRE job. I have almost 3 years of experience."

- r/sre, recurring post, every week

This engineer isn't confused about SRE. Their company is. They've been handed a title that means something and a job that means something else.

This isn't a small problem. The misuse of "SRE" is widespread enough that it's actively harming practitioners' careers, poisoning job searches, and letting organizations avoid doing the hard work of actually improving reliability.


The definition

What SRE actually is

SRE was defined by Google's Ben Treynor Sloss in 2003 and documented publicly in 2016. The core idea is simple and specific: software engineers applying engineering discipline to operations problems. The goal is not to add more humans to react faster. It is to eliminate the conditions that make reliability fragile in the first place.

Key texts: the Google SRE Book (free online) and the SRE Workbook (free online).


The original rules

Ben Treynor Sloss's founding practices

These are the practices Treynor Sloss articulated when he invented SRE at Google. They are specific, structural, and deliberately hard to fake.

Source: Ben Treynor Sloss, Google - 2003/2016
01. Hire only coders. SREs are software engineers first. Operations is the domain; coding is the discipline.
02. Have an SLA for your service. Reliability must be defined before it can be managed.
03. Measure and report performance against the SLA. Unmeasured reliability is just a feeling.
04. Use error budgets and gate launches on them. When reliability is at risk, feature work yields: launches pause, reliability improvements move to the top of the roadmap, or both. The budget is the arbiter.
05. Have a common staffing pool for SRE and developers. Engineers move between roles. This keeps SRE from becoming a silo and keeps developers accountable for what they ship.
06. Have excess ops work flow to the dev team. When SREs are overloaded, developers absorb the overflow. This creates a direct incentive for developers to write more operable software.
07. Cap SRE operational load at 50%. If more than half of an SRE's time is reactive ops work, something is broken. The other half must be engineering.
08. Share 5% of ops work with the dev team. Developers should always have skin in the operational game.
09. On-call teams need at least 8 people (one location) or 6 people per site (multiple locations). Understaffed on-call is not SRE. It's burnout with a title.
10. Aim for a maximum of two events per on-call shift. More than two means the system is not in a healthy state and engineering investment is required.
11. Do a postmortem for every event. Every incident is data. Discarding it is a choice to keep failing the same way.
12. Postmortems are blameless. They focus on process and technology, not people. Blame forecloses learning.
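
Several of these rules are numeric thresholds, so a team can check itself against them mechanically. The following is a minimal sketch of such a check, not an official tool: the `TeamSnapshot` fields and the `audit` function are hypothetical names invented for illustration, and it assumes you already track ops hours and per-shift event counts somewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TeamSnapshot:
    """Hypothetical summary of one team over one reporting period."""
    ops_hours: float             # hours spent on reactive operational work
    total_hours: float           # all working hours in the same period
    oncall_sites: int            # number of locations sharing the rotation
    oncall_people_per_site: int  # engineers in the rotation at each site
    events_per_shift: List[int]  # incident count for each on-call shift


def audit(team: TeamSnapshot) -> List[str]:
    """Return one finding for each numeric rule (07, 09, 10) that is violated."""
    if team.total_hours <= 0:
        raise ValueError("total_hours must be positive")

    findings = []

    # Rule 07: operational load is capped at 50% of an SRE's time.
    ops_share = team.ops_hours / team.total_hours
    if ops_share > 0.5:
        findings.append(f"ops load is {ops_share:.0%}, above the 50% cap")

    # Rule 09: at least 8 people at a single site, or 6 per site across sites.
    minimum = 8 if team.oncall_sites == 1 else 6
    if team.oncall_people_per_site < minimum:
        findings.append(
            f"on-call has {team.oncall_people_per_site} people per site, "
            f"below the minimum of {minimum}"
        )

    # Rule 10: aim for at most two events per on-call shift.
    busy_shifts = sum(1 for count in team.events_per_shift if count > 2)
    if busy_shifts:
        findings.append(
            f"{busy_shifts} of {len(team.events_per_shift)} shifts "
            f"had more than two events"
        )

    return findings


# Example: a single-site team of four doing mostly reactive work.
snapshot = TeamSnapshot(
    ops_hours=120,
    total_hours=160,
    oncall_sites=1,
    oncall_people_per_site=4,
    events_per_shift=[1, 4, 3, 0],
)
for finding in audit(snapshot):
    print(finding)
```

An empty result does not prove a team is doing SRE, since the structural rules (staffing pool, overflow to developers, blameless postmortems) cannot be read off a spreadsheet. A non-empty result is a concrete place to start the conversation.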

The practical test

Five practices for any team

Google's rules were written for Google's scale. Most organizations aren't Google-shaped. These five practices translate the spirit of SRE into something any engineering team can implement and audit against, regardless of size.

Operational responsibility is shared and managed via automation

The team is incentivized to eliminate sources of toil, not accommodate them. Operational burden flows back to developers when SREs are overloaded, creating shared accountability for reliability.

Customer success is quantitatively measured

Service Level Objectives provide a clear, shared picture of how the team's actions affect production. Reliability is not a feeling: it can be measured, and measurements inform decisions.

Error budgets inform work prioritization

If production no longer performs at the level required for customer success, reliability improvements move to the top of the roadmap, which may mean pausing feature launches, deprioritizing other work, or both. Conversely, a healthy error budget is permission to take more risk. Both signals matter.
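
The arithmetic behind an error budget is short enough to show. This is a sketch under assumptions: it uses a simple request-based availability SLO, the function name `error_budget_report` and its return keys are made up for illustration, and the "budget exhausted means reliability work comes first" policy is one reasonable reading of the paragraph above rather than a prescribed rule.

```python
def error_budget_report(slo_target: float, total_requests: int,
                        failed_requests: int) -> dict:
    """Summarize a request-based error budget for one SLO window.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1 - slo_target) * total_requests

    # Measured reliability: the fraction of requests that succeeded.
    sli = 1 - failed_requests / total_requests

    # Fraction of the budget still unspent; negative means it is overspent.
    budget_remaining = 1 - failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
        # Illustrative policy: an exhausted budget puts reliability work first.
        "reliability_first": budget_remaining <= 0,
    }


# Example: a 99.9% SLO over one million requests tolerates about 1,000 failures.
healthy = error_budget_report(0.999, 1_000_000, 250)
overspent = error_budget_report(0.999, 1_000_000, 1_500)
print(healthy["reliability_first"], overspent["reliability_first"])
```

With 250 failures roughly three quarters of the budget remains, which is the "permission to take more risk" case. With 1,500 failures the budget is overspent by about half, which is the case where feature work yields.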

The team learns from failure in a blameless way

There is sufficient psychological safety to speak openly about the contributing factors to an incident. Postmortems produce systemic change, not scapegoats and not shrugs.

On-call rotations are properly staffed and humane

24/7 production coverage exists without destroying the team's wellbeing. On-call load is tracked, capped, and treated as an engineering problem to be solved, not a permanent condition to be endured.


The line

SRE vs. not SRE

These aren't edge cases. If your day-to-day looks like the "This is not SRE" list, you are not doing SRE, regardless of your title.

This is SRE
  • Defined SLOs with error budgets
  • Writing code to reduce toil
  • On-call capped and properly staffed
  • Authority to freeze deploys on reliability grounds
  • Blameless postmortems that drive systemic change
  • Developers share operational responsibility
  • Reliability targets that drive real decisions
This is not SRE
  • Tier-1 or tier-2 product support
  • Ticket triage and customer troubleshooting
  • Alert babysitting with no remediation authority
  • No SLOs, no error budgets
  • Postmortems that produce no action items
  • Being the "ops person" developers throw work to
  • On-call burnout treated as normal

What to do about it

If you're a practitioner
Name the gap

Use this page. Show it to your manager. Ask: "What are our SLOs?" If there aren't any, you have your diagnosis. Then decide: push for change, or move on.

If you're hiring
Be honest in the JD

If the role is primarily support and monitoring, call it Platform Operations or Production Support. Mislabeling wastes everyone's time and burns out the engineers you hire.

If you're a leader
Audit before you hire

Before posting an SRE role, answer three questions: Do we have SLOs? Can SREs freeze releases? Is toil tracked? Build the foundation before you hire for it.

If you're job searching
Interview the interviewer

Ask: "What are your SLOs?" and "Has an SRE ever blocked a deploy for reliability reasons?" The answers tell you whether SRE is real here or just a title.


This is a living document. The goal is clarity, not gatekeeping; there are many paths to good reliability engineering. But words mean things, and right now the word "SRE" is doing no one any favors.

Share it. Link it. Translate it. Use it however you want.