1 of 46

SLI/SLO/SLA?

The language of reliability

WHY, WHAT, HOW

Alex Ewerlöf

Senior Staff Engineer

Volvo Cars

alexewerlof.com

2 of 46

What does reliability mean?

Availability

Scalability

Predictability

Resilience

Performance

Uptime

Security

Trustability

Accessibility

Usability

Reasonability

Traceability

Dependability

Stability

Safety

Consistency

Accuracy

Tolerance

3 of 46

Why do we care?

Opportunity loss → Seize opportunity

Financial loss → Improve revenue

Reputation loss → Earn trust

Incident panic → Controlled risk

4 of 46

Shift the conversation

5 of 46

You have a service?

Then you have a service level whether you acknowledge and/or communicate it or not!

Service level is about formulating and communicating the expectations to reduce confusion, miscommunication, disappointment, frustration, and unnecessary incidents.

6 of 46

SLI: Service Level Indicator

The metric indicating how your consumers perceive the reliability of your service. Percentage of good in a given period.

SLO: Service Level Objective

Where you want your SLI to be. Guides the optimization efforts. For example, 99.9%.

SLA: Service Level Agreement

Legal layer on top of SLO towards paying stakeholders. Breaching it leads to punishment, legal action, or compensation.

7 of 46

SLI vs SLO

If SLI is the ruler, SLO is the actual number we want to achieve

SLI: measures reliability

SLO: specifies our objective

8 of 46

Metrics aggregate multiple variables

Metric = Variable 1 + Variable 2 + Variable 3 + Variable 4 + Variable 5

The parts you control & can be accountable for

9 of 46

SLI vs SLO

If SLI is the needle in the gauge, SLO is the marker that we want to be above.

SLO: specifies our objective

SLI: Measures reliability

10 of 46

Handshake at the boundary

[Diagram: My Service Consumers 🤝 My Service 🤝 My dependencies. There is a handshake at every boundary: one with the consumers and one with each of the three dependencies.]

11 of 46

My Service

My

Service Consumers

Their

Service Consumers

⏬🤝

12 of 46

💪

Prepare

My Service

My

Service Consumers

Their

Service Consumers

⏬🤝

13 of 46

My Service

⏬🤝

My

Service Consumers

Their

Service Consumers

⏬🤝

Share the risk

14 of 46

My Service

🔊⏫

My

Service Consumers

Their

Service Consumers

Negotiate

15 of 46

SLI

  • These are just key metrics that indicate the reliability of a service
  • Percentage of good in a given time period

SLI = (Good / Valid) × 100
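The ratio above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `sli` and its parameters `good` and `valid` are hypothetical, chosen to mirror the terms on the slide.

```python
def sli(good: int, valid: int) -> float:
    """Return the SLI: the percentage of valid events that were good."""
    if valid <= 0:
        raise ValueError("SLI is undefined when there are no valid events")
    if not 0 <= good <= valid:
        raise ValueError("good must be between 0 and valid")
    # Multiply before dividing to keep the floating-point result tidy.
    return 100 * good / valid


print(sli(good=999, valid=1000))  # 99.9
```

Events that are not valid (out of scope for optimization) are excluded from both counts before the function is called.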

16 of 46

SLI: Decisions

You need to make these decisions to define an SLI:

  1. What does good look like? e.g. threshold, error code, ...
  2. What is a valid scope for optimization? e.g. all requests with a specific query param or header
  3. Why does a particular metric represent how reliability is perceived?
  4. Where and how do we measure?
  5. Time-based or event-based?

SLI = (Good / Valid) × 100

17 of 46

SLI = (Good / Valid) × 100

18 of 46

Good: Upper bound

[Chart: Value over Time. Values below the upper threshold of good are good; values above it are bad.]

19 of 46

Good: Lower bound

[Chart: Value over Time. Values above the lower threshold of good are good; values below it are bad.]

20 of 46

Good: Range bound

[Chart: Value over Time. Values between the lower and upper thresholds of good are good; values outside that range are bad.]
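The three ways of bounding "good" (upper, lower, range) can be expressed with one small classifier. This is a sketch under assumptions: the name `is_good`, the optional `lower`/`upper` parameters, the choice of which boundary is inclusive, and the sample values are all illustrative and not taken from the slides.

```python
from typing import Optional


def is_good(value: float,
            lower: Optional[float] = None,
            upper: Optional[float] = None) -> bool:
    """Classify one measurement against optional thresholds of good.

    Assumption: the lower threshold is inclusive and the upper one is
    exclusive. Pick whatever matches your own definition of good.
    """
    if lower is not None and value < lower:
        return False
    if upper is not None and value >= upper:
        return False
    return True


# Upper bound only, lower bound only, and a range (hypothetical values).
print(is_good(1990, upper=2000))        # True
print(is_good(40, lower=50))            # False
print(is_good(21, lower=18, upper=27))  # True
```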

21 of 46

Time based

[Chart: Value over Time with a threshold of good. The time spent on the bad side of the threshold is bad time.]

22 of 46

Time based

Downtime

23 of 46

Event based

[Chart: Events over Time. Bad events are counted against all valid events.]
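The difference between the two flavors is only in what gets counted: stretches of time or individual events. A minimal sketch, assuming hypothetical function names and made-up numbers:

```python
from datetime import timedelta


def time_based_sli(window: timedelta, bad_time: timedelta) -> float:
    """Percentage of the window that was not bad time (e.g. downtime)."""
    if window <= timedelta(0):
        raise ValueError("window must be positive")
    # Dividing one timedelta by another yields a plain float ratio.
    return 100 * ((window - bad_time) / window)


def event_based_sli(valid_events: int, bad_events: int) -> float:
    """Percentage of valid events that were not bad."""
    if valid_events <= 0:
        raise ValueError("valid_events must be positive")
    return 100 * (valid_events - bad_events) / valid_events


# Hypothetical numbers: 3 hours of bad time in a 30-day window,
# and 50 bad events out of 10000 valid ones.
print(round(time_based_sli(timedelta(days=30), timedelta(hours=3)), 3))
print(event_based_sli(valid_events=10_000, bad_events=50))
```

A time-based SLI treats a quiet hour and a busy hour the same, while an event-based SLI weighs busy periods more heavily because they contain more events.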

24 of 46

Window length

[Timeline: 30 days ago, 14 days ago, Now]

Compliance Period Length = 14 days: count events in the past 14 days

Compliance Period Length = 30 days: count events in the past 30 days

💥 That massive incident that burned the entire error budget!
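To see why the window length matters, here is a small sketch that computes an event-based SLI over the past N days. Everything in it is an assumption for illustration: the `windowed_sli` helper, the event tuples, the dates, and an incident placed 20 days before "now" so that it falls inside the 30-day window but outside the 14-day one.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

Event = Tuple[datetime, bool]  # (timestamp, was the event good?)


def windowed_sli(events: Iterable[Event], now: datetime,
                 window_days: int) -> Optional[float]:
    """SLI over the events that happened in the past `window_days` days."""
    start = now - timedelta(days=window_days)
    outcomes = [good for ts, good in events if start < ts <= now]
    if not outcomes:
        return None  # no valid events in the window
    return 100 * sum(outcomes) / len(outcomes)


now = datetime(2024, 1, 31)
events = (
    [(now - timedelta(days=20), False)] * 50    # hypothetical incident
    + [(now - timedelta(days=5), True)] * 950   # normal traffic since then
)
print(windowed_sli(events, now, window_days=14))  # 100.0
print(windowed_sli(events, now, window_days=30))  # 95.0
```

The shorter window has already "forgotten" the incident, while the longer one still carries it.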

25 of 46

SLI Example

Let’s imagine two systems. They can be a front-end and backend, or two microservices, or a backend and database.

Depends on responses from

System A

System B

26 of 46

SLI Example

Let’s assume that the analysis of System A shows that System B should return a response in less than 2000ms

The response time should be < 2000ms

System A

System B

27 of 46

SLI Example

Any request that takes 2000ms or longer to get a response is considered bad

3400 ms

1990 ms

2020 ms

5100 ms

1200 ms

1002 ms

900 ms

6000 ms

System A

System B
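Applying the 2000ms threshold to the eight response times on this slide gives a tiny worked example. The variable names are illustrative; the numbers and the threshold come from the slide.

```python
THRESHOLD_MS = 2000  # from the slide: the response time should be < 2000ms

# Response times shown between System A and System B on the slide.
response_times_ms = [3400, 1990, 2020, 5100, 1200, 1002, 900, 6000]

good = [t for t in response_times_ms if t < THRESHOLD_MS]
bad = [t for t in response_times_ms if t >= THRESHOLD_MS]

sli_percent = 100 * len(good) / len(response_times_ms)

print(good)         # [1990, 1200, 1002, 900]
print(bad)          # [3400, 2020, 5100, 6000]
print(sli_percent)  # 50.0
```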

28 of 46

SLI Example

  • SLI is calculated for a given window of time (also known as “compliance period”). For example: 5 minutes or 1 week.
  • The length of the window depends on the type of commitment we’re making towards our consumers

[Chart: response times in ms (6000, 5100, 2020, 3400, 1999, 1200, 1002, 900) plotted against the threshold of good within one compliance period]

29 of 46

SLI Example

If the response time for 992300 out of 1000000 requests in that window is below 2000ms:

SLI = (992300 / 1000000) × 100 = 99.23% ≈ 99.2%

Number of good interactions: 992300

Total number of interactions: 1000000

Service Level Indicator value: 99.2%

30 of 46

Is it good?

Is 99.2% good? Is it bad?

We don’t know!

We need another piece of information which indicates our objective; the commitment we have made towards our consumers.

31 of 46

SLO

Service Level Objective

32 of 46

SLO

If SLI is the ruler, SLO is the actual number we want to achieve

SLI: measures reliability

SLO: specifies our objective

33 of 46

SLO

  • SLO is our reliability target
  • SLI tells us what reliability means. SLO tells us what reliability level we aim for.
  • Since SLI is 0-100, SLO is also 0-100
  • SLO sets the objective for the services that have SLI
  • Once you know your goals you should measure your progress to achieve them
  • What gets measured gets done
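Comparing a measured SLI against the SLO is a one-line check. A minimal sketch, assuming a hypothetical `meets_slo` helper; the 99.2 figure is the SLI from the earlier example, while the 99.0 and 99.9 objectives are made up for illustration.

```python
def meets_slo(sli_percent: float, slo_percent: float) -> bool:
    """True when the measured SLI is at or above the objective."""
    for name, value in (("SLI", sli_percent), ("SLO", slo_percent)):
        if not 0 <= value <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return sli_percent >= slo_percent


print(meets_slo(99.2, slo_percent=99.0))  # True
print(meets_slo(99.2, slo_percent=99.9))  # False
```

The same SLI value can be a success or a breach; only the objective tells which.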

34 of 46

“Get away with”

“An SLO should define the lowest level of reliability that you can get away with for each service.”

— Jay Judkowitz and Mark Carter (Google PMs)

35 of 46

SLO decisions

  • The objective (0-100): usually 99+
  • Thresholds (if SLI is parameterized)
  • Measurement window
    • Length
    • Rolling/Calendar
  • Who sets it
    • Bottom-up
    • Top-down
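The rolling/calendar decision changes where the measurement window starts. The sketch below is one possible reading, not a standard definition: the `window_start` helper is hypothetical, and it assumes that "calendar" means the current calendar month.

```python
from datetime import datetime, timedelta


def window_start(now: datetime, kind: str, length_days: int = 30) -> datetime:
    """Return the start of the measurement window that ends at `now`.

    Assumption: a "calendar" window means the current calendar month,
    in which case `length_days` is ignored.
    """
    if kind == "rolling":
        return now - timedelta(days=length_days)
    if kind == "calendar":
        return now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown window kind: {kind!r}")


now = datetime(2024, 3, 15, 12, 0)
print(window_start(now, "rolling"))   # 2024-02-14 12:00:00
print(window_start(now, "calendar"))  # 2024-03-01 00:00:00
```

A rolling window always looks back the same distance, while a calendar window resets at a fixed boundary.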

36 of 46

SLO Window

  • One of the core ideas in SLI is the time span or “compliance period” (also known as Window)
  • We recognize that errors are normal, so instead of blame and panic we shift the conversation to: how much error is tolerable?
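"How much error is tolerable?" can be turned into a number: the error budget for the window. A sketch under assumptions: the helper names are hypothetical, the 99% objective is made up, and the 7700 bad events are the 1000000 − 992300 requests from the earlier SLI example.

```python
def error_budget(slo_percent: float, valid_events: int) -> float:
    """Number of bad events the SLO tolerates in one compliance period."""
    if not 0 <= slo_percent < 100:
        raise ValueError("SLO must be in [0, 100) to leave an error budget")
    return (100 - slo_percent) * valid_events / 100


def budget_consumed(bad_events: int, slo_percent: float,
                    valid_events: int) -> float:
    """Fraction of the error budget already used (1.0 = fully burned)."""
    return bad_events / error_budget(slo_percent, valid_events)


print(error_budget(99, valid_events=1_000_000))           # 10000.0
print(budget_consumed(7_700, 99, valid_events=1_000_000)) # 0.77
```

With a 99% objective the service may fail 10000 of a million requests in the window, and 7700 failures use up 77% of that allowance.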

37 of 46

SLO nine-notation

  • SLOs typically have lots of 9’s
  • An alternative way to speak about them is:
    • “2-nines” is another way to say 99%
    • “3-nines” is 99.9%
    • “4-nines” is 99.99%
    • “5-nines” is 99.999% (realm of highly available systems)
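The nine-notation maps to percentages mechanically. A small sketch, assuming a hypothetical `nines_to_percent` helper; `Decimal` is used only to avoid floating-point noise in the printed values.

```python
from decimal import Decimal


def nines_to_percent(nines: int) -> Decimal:
    """Convert nine-notation to a percentage, e.g. 3 -> 99.9."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    return Decimal(100) - Decimal(100) / Decimal(10) ** nines


for n in range(2, 6):
    print(f"{n}-nines = {nines_to_percent(n)}%")
```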

38 of 46

SLO: every 9, 10x cost

For every 9 we add, the system is 10x more reliable, but its cost also increases roughly tenfold. You are going to need more redundancy, better monitoring, more automated tests, and a better tech/hardware stack, and sometimes you need to rewrite the whole system to achieve higher reliability. All of that may also imply a higher headcount.

39 of 46

40 of 46

Note

  • We are not saying the system cannot be 100% available. In a given month, it may be!
  • We are saying we don’t commit to 100%
  • The key difference is commitment and buying that wiggle room to enable experiments and prepare the consumers for when things inevitably do go wrong.

41 of 46

SLA

Service Level Agreement

42 of 46

SLA

  • It is a legal agreement
  • The commitment we make towards external customers
  • The commitment our 3rd parties make to us (e.g. our cloud providers)
  • Has legal consequences (being sued or having to credit the customer)

43 of 46

SLI measures reliability

SLO sets expectations

SLA ties that expectation to

legal consequences

44 of 46

SLA = SLI + SLO + law

SLI: What to measure?

SLO: What to aim for?

Law: What happens if you don’t?

45 of 46

SLI workshop

Intro (45-60m)

Workshop (90-180m)

What is SLI/SLO/SLA and what’s expected of teams?

Risks (10-20m)

Metrics (20-30m)

SLIs (10-20m)

SLOs (10-20m)

(10m break)

Building the mental model and tools to set service levels and own them.

Find meaningful SLIs

Commit to reasonable SLOs

Define Service (10-20m)

46 of 46

Thank you

Polestar employees get free access to my book and newsletter via this link:

https://blog.alexewerlof.com/fef4c0f8