SLI/SLO/SLA?
The language of reliability
WHY, WHAT, HOW
Alex Ewerlöf
Senior Staff Engineer
Volvo Cars
alexewerlof.com
What does reliability mean?
Availability
Scalability
Predictability
Resilience
Performance
Uptime
Security
Trustability
Accessibility
Usability
Reasonability
Traceability
Dependability
Stability
Safety
Consistency
Accuracy
Tolerance
Why do we care?
Shift the conversation:
Opportunity loss → Seize opportunity
Financial loss → Improve revenue
Reputation loss → Earn trust
Incident panic → Controlled risk
You have a service?
Then you have a service level whether you acknowledge and/or communicate it or not!
Service level is about formulating and communicating expectations to reduce confusion, miscommunication, disappointment, frustration, and unnecessary incidents.
SLI: Service Level Indicator
The metric indicating how your consumers perceive the reliability level of your service: the percentage of good in a given period.
SLO: Service Level Objective
Where you want your SLI to be. Guides the optimization efforts. For example, 99.9%.
SLA: Service Level Agreement
A legal layer on top of the SLO towards paying stakeholders. Breaching it leads to punishment, legal action, or compensation.
SLI vs SLO
If SLI is the ruler, SLO is the actual number we want to achieve
SLI: measures reliability
SLO: specifies our objective
Metrics aggregate multiple variables
Metric = Variable 1 + Variable 2 + Variable 3 + Variable 4 + Variable 5
The parts you control & can be accountable for
SLI vs SLO
If SLI is the needle in the gauge, SLO is the marker that we want to be above.
SLO: specifies our objective
SLI: Measures reliability
Handshake at the boundary
[Diagram: My dependencies 🤝 My Service 🤝 My Service Consumers 🤝 Their Service Consumers]
💪 Prepare
⏬🤝 Share the risk
🔊⏫ Negotiate
SLI
SLI = (Good / Valid) x 100
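The ratio above can be written as a small function. This is a minimal sketch for illustration only: the name `sli_percent` and its validation rules are assumptions, not something the slides prescribe.

```python
def sli_percent(good: int, valid: int) -> float:
    """Return the SLI: the percentage of valid items (events or time slots) that were good."""
    if valid <= 0:
        # With nothing valid to count, the ratio is undefined.
        raise ValueError("valid must be positive")
    if not 0 <= good <= valid:
        raise ValueError("good must be between 0 and valid")
    return good / valid * 100


# Hypothetical numbers: 999 good out of 1000 valid.
print(sli_percent(good=999, valid=1000))
```

Rejecting `valid == 0` is a design choice in this sketch; some teams prefer to treat an empty period as 100% instead.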
SLI: Decisions
You need to make these decisions to define an SLI:
SLI = (Good / Valid) x 100
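Good: Upper bound
[Chart: Value over Time. Values below the upper threshold of good are good; values above it are bad.]
Good: Lower bound
[Chart: Value over Time. Values above the lower threshold of good are good; values below it are bad.]
Good: Range bound
[Chart: Value over Time. Values between the lower and the upper threshold of good are good; values outside that range are bad.]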
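The three ways of bounding "good" can be sketched as one classifier with optional thresholds. The function name `is_good`, the sample values, and the choice to treat a value exactly on a threshold as bad are all assumptions made for this example.

```python
from typing import Optional


def is_good(value: float,
            lower: Optional[float] = None,
            upper: Optional[float] = None) -> bool:
    """Classify one measurement against optional thresholds of good.

    Assumption: thresholds are exclusive, so a value exactly on a threshold is bad.
    """
    if lower is not None and value <= lower:
        return False
    if upper is not None and value >= upper:
        return False
    return True


# Upper bound only (for example, a latency in ms).
print(is_good(1200, upper=2000))          # True
print(is_good(3400, upper=2000))          # False
# Lower bound only (hypothetical metric where higher is better).
print(is_good(40, lower=50))              # False
# Range bound.
print(is_good(21.5, lower=18, upper=25))  # True
```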
Time based
[Chart: Value over Time with the threshold of good; the periods beyond the threshold are bad time (downtime).]
Event based
[Chart: Events over Time; bad events are counted against valid events.]
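The difference between the two flavors is only in what gets counted: minutes or events. The sketch below assumes hypothetical function names and made-up sample numbers; it is one way to express the idea, not a prescribed implementation.

```python
def time_based_sli(window_minutes: float, bad_minutes: float) -> float:
    """Percentage of the window that was NOT bad time (downtime)."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    return (window_minutes - bad_minutes) / window_minutes * 100


def event_based_sli(valid_events: int, bad_events: int) -> float:
    """Percentage of valid events that were NOT bad."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return (valid_events - bad_events) / valid_events * 100


# Hypothetical numbers: a 30-day window with 90 bad minutes,
# and 20,000 valid events of which 15 were bad.
window = 30 * 24 * 60  # minutes in 30 days
print(time_based_sli(window, bad_minutes=90))
print(event_based_sli(valid_events=20_000, bad_events=15))
```

A time-based SLI weighs every minute equally, while an event-based SLI weighs every request equally, so the same incident can look very different depending on traffic during the bad period.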
Window length
[Timeline with ticks at 30 days ago, 14 days ago, and Now. With Compliance Period Length = 14 days, count events in the past 14 days; with Compliance Period Length = 30 days, count events in the past 30 days.]
💥 That massive incident that burned the entire error budget!
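The effect of the window length can be sketched with a rolling window over timestamped events. Everything here is illustrative: the function name `windowed_sli`, the event shape `(timestamp, is_good)`, and the sample data, in which a hypothetical incident happened 20 days before `now`.

```python
from datetime import datetime, timedelta


def windowed_sli(events, now, window_days):
    """SLI over the events whose timestamp falls within the past `window_days`.

    `events` is a list of (timestamp, is_good) tuples. Returns None if the
    window contains no events.
    """
    start = now - timedelta(days=window_days)
    in_window = [is_good for ts, is_good in events if start < ts <= now]
    if not in_window:
        return None
    return sum(in_window) / len(in_window) * 100


now = datetime(2024, 1, 31)
# One good event per day for the past 30 days...
events = [(now - timedelta(days=d, hours=1), True) for d in range(30)]
# ...plus five bad events from an incident 20 days ago.
events += [(now - timedelta(days=20), False)] * 5

print(windowed_sli(events, now, window_days=14))  # 100.0: the incident is outside the window
print(windowed_sli(events, now, window_days=30))  # below 100: the incident still counts
```

A shorter compliance period forgets incidents sooner; a longer one keeps them in the SLI for longer.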
SLI Example
Let’s imagine two systems. They can be a frontend and a backend, two microservices, or a backend and a database.
System A depends on responses from System B.
SLI Example
Let’s assume that the analysis of System A shows that System B should return a response in less than 2000ms.
[Diagram: System A → System B, with the response time < 2000ms]
SLI Example
Any request that takes more than 2000ms to get a response is considered bad.
[Diagram: System B responds to System A in 3400 ms, 1990 ms, 2020 ms, 5100 ms, 1200 ms, 1002 ms, 900 ms, 6000 ms, …]
SLI Example
[Chart: response times (6000, 5100, 2020, 3400, 1999, 1200, 1002, 900, …) plotted against the threshold of good across the compliance period]
SLI Example
If the response time for 992300 out of 1000000 requests in that window is below 2000ms:
SLI = (992300 / 1000000) x 100 = 99.2%
992300: number of good interactions
1000000: total number of interactions
99.2%: Service Level Indicator value
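The same arithmetic can be checked in a few lines. This is a sketch under assumptions: the function name `latency_sli` is hypothetical, the short sample reuses the eight response times listed earlier, and a response is counted as good only when it is strictly below the threshold.

```python
THRESHOLD_MS = 2000  # threshold of good from the example


def latency_sli(latencies_ms, threshold_ms=THRESHOLD_MS):
    """Percentage of responses that were faster than the threshold of good."""
    if not latencies_ms:
        raise ValueError("need at least one response time")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms) * 100


# The eight sample response times from the example.
sample = [3400, 1990, 2020, 5100, 1200, 1002, 900, 6000]
print(latency_sli(sample))  # 50.0: four of the eight samples are below 2000 ms

# The full window: 992300 good out of 1000000 requests.
print(round(992_300 / 1_000_000 * 100, 1))  # 99.2
```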
Is it good?
Is 99.2% good? Is it bad?
We don’t know!
We need another piece of information that indicates our objective: the commitment we have made to our consumers.
SLO
Service Level Objective
SLO
If SLI is the ruler, SLO is the actual number we want to achieve
SLI: measures reliability
SLO: specifies our objective
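Once both numbers exist, the comparison is mechanical. The sketch below uses the 99.2% SLI from the worked example and the 99.9% example objective; the function names are hypothetical, and the error budget formula (the budget is whatever the SLO leaves below 100%) is a common convention assumed here rather than something these slides define.

```python
def meets_slo(sli: float, slo: float) -> bool:
    """True when the measured SLI is at or above the objective."""
    return sli >= slo


def error_budget_consumed(sli: float, slo: float) -> float:
    """Fraction of the error budget used; 1.0 means the budget is fully spent.

    Assumption: the error budget is (100 - SLO) percentage points.
    """
    if not 0 <= slo < 100:
        raise ValueError("slo must be in [0, 100)")
    return (100 - sli) / (100 - slo)


print(meets_slo(sli=99.2, slo=99.9))              # False
print(error_budget_consumed(sli=99.2, slo=99.9))  # about 8, i.e. the budget is overspent
```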
SLO
“Get away with”
“An SLO should define the lowest level of reliability that you can get away with for each service.”
— Jay Judkowitz and Mark Carter (Google PMs)
SLO decisions
SLO Window
SLO nine-notation
SLO: every 9, 10x cost
Note: For every 9 we add, the system is 10x more reliable, but its cost also increases roughly tenfold. You are going to need more redundancy, better monitoring, more automated tests, and a better tech/hardware stack, and sometimes you need to rewrite the whole system to achieve higher reliability. All of that may also imply a higher headcount.
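To see what each extra 9 means in practice, the sketch below converts a number of nines into an SLO percentage and into the bad time it allows. It assumes a time-based SLI and a 30-day window; the function names are hypothetical, and the cost side of the trade-off is not modeled.

```python
def slo_percent(nines: int) -> float:
    """SLO for a given number of nines, e.g. 3 nines is roughly 99.9%."""
    return 100 * (1 - 10 ** -nines)


def allowed_bad_minutes(nines: int, window_days: int = 30) -> float:
    """Bad minutes the SLO tolerates per window, assuming a time-based SLI."""
    return window_days * 24 * 60 * 10 ** -nines


for n in range(1, 6):
    print(f"{n} nines: SLO {slo_percent(n):.3f}%, "
          f"about {allowed_bad_minutes(n):.2f} bad minutes per 30 days")
```

Each additional 9 divides the tolerated bad time by ten, which is why the effort needed to stay within it grows so quickly.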
SLA
Service Level Agreement
SLA
SLI measures reliability: what to measure?
SLO sets expectations: what to aim for?
SLA ties that expectation to legal consequences: what happens if you don’t?
SLA = SLI + SLO + law
SLI workshop
Intro (45-60m)
Workshop (90-180m)
What is SLI/SLO/SLA and what’s expected of teams?
Risks (10-20m)
Metrics (20-30m)
SLIs (10-20m)
SLOs (10-20m)
(10m break)
Building the mental model and tools to set service levels and own them.
Find meaningful SLIs. Commit to reasonable SLOs.
Define Service (10-20m)
Thank you
Polestar employees get free access to my book and newsletter via this link: