1 of 92

A practical guide to monitoring and alerting with time series at scale

Velocity SC 2016

Jamie Wilkinson <jaq@google.com>

Site Reliability Engineering, Google

2 of 92

SRE

Site Reliability Engineering in Storage Infrastructure at Google

“SRE is what happens when you ask a software engineer to design an operations function.”

3 of 92

4 of 92

SRE

Ultimately responsible for the reliability of google.com

Less than 50% time spent on operations,

More than 50% on engineering reliability and automation

5 of 92

SRE

Ultimately responsible for the reliability of google.com

Less than 50% time spent on operations,

More than 50% on engineering reliability and automation

SRE = DevOps Engineer

6 of 92

SRE

Ultimately responsible for the reliability of google.com

Monitoring is one of many tools to achieve that goal.

7 of 92

8 of 92

What is “monitoring”?

  • incident response
  • performance analysis
  • capacity planning
  • failure detection

9 of 92

proximity in time

measurement granularity

performance analysis

capacity planning

incident response

failure detection

10 of 92

proximity in time

measurement granularity

performance analysis

capacity planning

incident response

failure detection

11 of 92

Alerting on thresholds

12 of 92

Alert when the beer supply is low

13 of 92

Alert when beer supply low

ALERT BarneyWorriedAboutBeerSupply
  IF cases - 1 - 1 == 1
  ANNOTATIONS {
    summary = "Hey Homer, I'm worried about the beer supply.",
    description = "After this case, and the next case, there's only one case left! Yeah yeah, Oh Barney's right. Yeah, let's get some more beer.. yeah.. hey, what about some beer, yeah Barney's right…",
  }

14 of 92

Disk full alert

Alert when 90% full

Different filesystems have different sizes

10% of 2TB is 200GB

False positive!

Alert on absolute space, < 500MB

Arbitrary number

Different workloads with different needs: 500MB might not be enough warning

15 of 92

Disk full alert

More general alert based on human interactions:

How long before the disk is full?

and

How long will it take for a human to remediate a full disk?
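Those two questions combine into a single predictive rule. A sketch in Prometheus's alerting syntax, assuming a node-exporter-style free-space gauge (the metric name `node_filesystem_free` and the four-hour remediation window are illustrative):

```
ALERT DiskWillFillSoon
  IF predict_linear(node_filesystem_free[1h], 4 * 3600) < 0
  FOR 5m
  ANNOTATIONS {
    summary = "Filesystem predicted to be full within 4h",
  }
```

`predict_linear()` fits a linear regression to the last hour of samples and extrapolates four hours ahead, so the alert fires only when the disk would fill before a human could respond.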

16 of 92

CALCULUS

😱

17 of 92

Alerting on rates of change

18 of 92

Dennis Hopper's Alert

19 of 92

Dennis Hopper's Alert

ALERT BombArmed

IF speed_mph >= 50

ANNOTATIONS {

summary = “Pop quiz, hotshot!”

}

ALERT EXPLODE

IF max_over_time(ALERTS{alertname="BombArmed",alertstate="firing"}[1d]) > 0 and speed_mph < 50

20 of 92

Keanu's Alert

21 of 92

22 of 92

Keanu's alert

23 of 92

Keanu's alert

ALERT StartSavingTheBus

IF (v - 50)/a <= ${threshold}
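One concrete reading of that rule, assuming `v` is a gauge `speed_mph`, the deceleration `a` is estimated from the last 30 seconds of samples via `deriv()`, and a made-up 60-second value stands in for `${threshold}`:

```
ALERT StartSavingTheBus
  IF (speed_mph - 50) / abs(deriv(speed_mph[30s])) <= 60
```

i.e. page when, at the current rate of deceleration, fewer than 60 seconds remain before the bus drops below 50 mph.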

24 of 92

Why does #monitoringsuck?

TL;DR:

when the cost of maintenance is too high

  • to improve the quality of alerts
  • to improve exploratory tools

25 of 92

Why does X ∀ X ∈ {Ops} suck?

the cost of maintenance must scale sublinearly with the growth of the service

26 of 92

Callback!

Less than 50% time spent on operations,

More than 50% on engineering reliability and automation

27 of 92

(graph: cost over time; the cost of “ops work” must grow more slowly than service size, e.g. queries, storage footprint, cores used)

28 of 92

Automate yourself out of a job

  • Homogeneity, Configuration Management
  • Abstractions, Higher level languages
  • Convenient interfaces in tools
    • scriptable
    • Service Oriented Architectures

29 of 92

30 of 92

31 of 92

32 of 92

Borgmon is

  • a metric collector
  • a timeseries database
  • an interactive query tool
  • a programmable calculator for discrete multidimensional vector streams
  • nearly identical to Prometheus

33 of 92

34 of 92

http://prometheus.io

35 of 92

How it works

  • Dynamically discover target addresses
  • Scrape /metrics pages
    • evenly distributed load across targets
  • Evaluate rulesets mapped to targets
    • vector arithmetic
  • Send alerts
  • Record to Timeseries Database (TSDB)

36 of 92

Process Overview

(diagram: Prometheus scrapes the /metrics page of monitored task 0 over HTTP; a browser queries Prometheus)

37 of 92

Service Discovery

(diagram: Prometheus discovers monitored tasks 0, 1, 2 via ZK, etcd, consul, etc., and scrapes each task's /metrics over HTTP; a browser queries Prometheus)

38 of 92

Alert Notifications

(diagram: Prometheus scrapes monitored tasks 0, 1, 2 and sends alerts as key-value pairs to the Alertmanager, which notifies via email, PagerDuty, Slack, etc.)

39 of 92

Long-term storage

(diagram: as before, with Prometheus also recording samples to a long-term Timeseries Database (TSDB))

40 of 92

Global & other monitoring

(diagram: another Prometheus, e.g. a global one, reads from this Prometheus; scraping, the TSDB, and Alertmanager notifications continue as before)

41 of 92

Sprinkle some shards on it

(diagram: Prometheus scraper shards each scrape a subset of monitored tasks 0 through 2000; an upper-level Prometheus aggregates from the shards, records to the TSDB, and sends alerts via the Alertmanager)

42 of 92

alert design

43 of 92

SLAs, SLOs, SLIs

  • SLI → Indicator: a measurement
    • 99.9th percentile response latency
  • SLO → Objective: a goal
    • below 5ms, over a 10 minute interval
  • SLA → Agreement: economic incentives
    • or we get paged
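That SLI/SLO pair translates directly into a paging rule. A sketch assuming latency is exported as a Prometheus histogram (the metric name `request_duration_seconds` is illustrative):

```
ALERT LatencySLOViolation
  IF histogram_quantile(0.999,
       sum by (le) (rate(request_duration_seconds_bucket[10m]))) > 0.005
```

This pages only when the 99.9th percentile over the 10-minute window exceeds 5ms, i.e. when the objective itself is violated.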

44 of 92

Clients provision against SLO

45 of 92

“My Philosophy on Alerting”

Rob Ewaschuk

  • Every time my pager goes off, I should be able to react with a sense of urgency. I can only do this a few times a day before I get fatigued.
  • Every page should be actionable; simply noting "this paged again" is not an action.
  • Every page should require intelligence to deal with: no robotic, scriptable responses.

46 of 92

“Alerts” don’t have to page you

Alerts that do page should indicate violations of SLO.

“Undisciplined” alerts can still aid diagnosis: let them fire, but route them nowhere except a debugging console.

  • disk fullness
  • task crashes
  • backend slowness
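One way to express such a non-paging alert, assuming Alertmanager is configured to route on a `severity` label (the label name and value are conventions of this sketch, not builtins, and the metric name is illustrative):

```
ALERT TaskCrashLooping
  IF rate(task_restarts_total[5m]) > 0
  LABELS {
    severity = "console",
  }
```

The rule fires and shows up on the console, but the routing configuration never sends it to a pager.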

47 of 92

instrumentation

enough blabber, let’s build something

48 of 92

Prometheus Client API

import "github.com/prometheus/client_golang/prometheus"

var request_count = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "requests", Help: "total requests"})

func HandleRequest(…) {
    request_count.Add(1)
    …
}

49 of 92

/metrics handlers can be plain text

# HELP requests total requests
# TYPE requests counter
requests 20056

# HELP errors total errors served
# TYPE errors counter
errors{code="400"} 2027
errors{code="500"} 824

(also supports a binary format)

50 of 92

Tips for effective metric setup

  • Let Prometheus aggregate for you
  • Prefer numbers over text
  • Avoid timestamps
  • Initialize variables on program start 

51 of 92

Tips for effective metric setup

Let Prometheus aggregate for you

  • Fine:  

queries_total 1234.0

errors_total  12.0

  • Questionable:   

qps 18.0

error_rate 0.2

 

 Why?

52 of 92

Timeseries Have Types

Counter: monotonically nondecreasing

“monotonic”: preserves the order, i.e. only goes up

“nondecreasing”: it can stay flat

53 of 92

Timeseries Have Types

Gauge: everything else... not monotonic

54 of 92

Counters FTW

(graph: a climbing counter, sampled every Δt)

55 of 92

Counters FTW

no loss of meaning after sampling

(graph: the samples, Δt apart, still capture the counter's total increase)

56 of 92

Gauges FTL

(graph: a spiky gauge, sampled every Δt)

57 of 92

Gauges FTL

lose spike events shorter than sampling interval

(graph: the samples miss spikes that fall between them)
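The contrast can be made concrete with a counter. Suppose `requests` is scraped once a minute and two successive samples read 1200 and 1260:

```
rate(requests[1m])   # ≈ (1260 - 1200) / 60s = 1 request/s
```

The 60 requests are counted even if they all arrived in one burst between the two scrapes, whereas a self-reported `qps` gauge sampled at the same two instants could read 0 both times and lose the burst entirely.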

58 of 92

Tips for effective metric setup

Prefer numbers over strings

  • Helpful:

exceptions{name="GksSqlPermissionsException"} 142

  • Less Helpful:

last_exception GksSqlPermissionsException

Why?

59 of 92

Tips for effective metric setup

Avoid timestamps

  • Usually unnecessary:  

last_sync_time 1209770045.0

  • Usually better: 

sync_total 534.0

Why?
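Because the counter version needs no clock arithmetic. Comparing `time() - last_sync_time` against a threshold breaks when clocks skew or the exporter restarts; with a counter you just ask whether syncs stopped happening:

```
ALERT SyncStalled
  IF rate(sync_total[30m]) == 0
```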

60 of 92

collection

61 of 92

Configuring Prometheus

prometheus.yml

[targets, etc]

rule files

(DSL)

62 of 92

Configuring Prometheus

prometheus.yml:

global:
  scrape_interval: 1m
  labels:          # Added to all targets
    zone: us-east

rule_files:
  [ - <filepath> ... ]

scrape_configs:
  [ - <scrape_config> ... ]

63 of 92

Finding Targets

scrape_configs:
  - job_name: "smtp"
    static_configs:
      - targets:
        - 'mail.example.com:3903'

  - job_name: "barserver"
    file_sd_configs:
      - [json filenames generated by, e.g. puppet]

  - job_name: "webserver"
    dns_sd_configs:
      - names:     # DNS SRV lookup
        - web.example.com

  - job_name: "fooserver"
    consul_sd_configs: []   # autodiscovery from consul queries

64 of 92

Labels & Vectors

65 of 92

Data Storage Requirements

  • A 'service' can consist of:
    • multiple processes running many operations
    • multiple machines
    • multiple datacenters
  • The solution needs to:
    • Keep high-dimension data organized
    • Allow various aggregation types (max, average, percentile)
    • Allow flexible querying and slicing of data (by machine, by datacenter, by error type, etc)

66 of 92

The timeseries arena

  • Data is stored in one global database in memory (checkpointed to disk)
  • Each data point has the form: (timestamp, value)
  • Data points are stored in chronological lists called timeseries.
  • Each timeseries is named by a set of unique labels, of the form name=value
  • Timeseries data can be queried via a variable reference (a specification of labels and values).  
    • The result is a vector or matrix.

67 of 92

Structure of timeseries

(diagram: each timeseries is keyed by a set of labels (label1, label2, label3, label4, …) and holds a chronological list of (timestamp, value) points)

68 of 92

Variables and Labels

Labels come from

  • the target’s name: job, instance
  • the target’s exported metrics
  • the configuration: labels, relabels
  • the processing rules

69 of 92

Variables and labels

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

70 of 92

Variables and labels

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

errors{job="web"}

71 of 92

Variables and labels

errors{job="web",zone="us-west"}

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

72 of 92

Single-valued Vector

errors{job="web",zone="us-east",instance="server01:8000",code="500"}

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

73 of 92

rule evaluation

74 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

requests{instance="localhost:8001",job="web"} 21235 21244

requests{instance="localhost:8005",job="web"} 21211 21222

task:requests:rate10s{instance="localhost:8007",job="web"} 8.777777777777779

task:requests:rate10s{instance="localhost:8009",job="web"} 10.222222222222223

75 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“variable reference”

76 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“range expression”

77 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“function”

78 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“recorded variable”

79 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“level”

80 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“operation”

81 of 92

recording rules

task:requests:sum =

sum(requests{job="web"})

“operation”

82 of 92

recording rules

task:requests:delta10s =

delta(requests{job="web"}[10s])

“operation”

83 of 92

recording rules

task:errors:delta10s =

delta(errors{job="web"}[10s])

“name”

84 of 92

aggregation based on topology

task:requests:rate10s =

rate(requests{job="web"}[10s])

dc:requests:rate10s =

sum by (job,zone)(

task:requests:rate10s)

global:requests:rate10s =

sum by (job)(dc:requests:rate10s)
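Each level of the aggregation is a natural place to alert. For example, a global traffic-drop rule can compare against the same time yesterday (the 30% drop threshold is an arbitrary illustration):

```
ALERT GlobalTrafficDrop
  IF global:requests:rate10s < 0.7 * (global:requests:rate10s offset 1d)
```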

85 of 92

aggregation based on topology

task:requests:rate10s =

rate(requests{job="web"}[10s])

dc:requests:rate10s =

sum by (job,zone)(

task:requests:rate10s)

global:requests:rate10s =

sum by (job)(dc:requests:rate10s)

86 of 92

relations based on schema

dc:errors:ratio_rate10s =

sum by (job)(dc:errors:rate10s)

/ on (job)

dc:requests:rate10s

87 of 92

relations based on schema

dc:errors:ratio_rate10s =

sum by (job)(dc:errors:rate10s)

/ on (job)

dc:requests:rate10s

88 of 92

relations based on schema

dc:errors:ratio_rate10s =

dc:errors:rate10s

/ on (job) group_left(code)

dc:requests:rate10s
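The resulting ratio is already normalized by traffic, so a single threshold works at any scale. A paging rule on the SLO might look like this (the 0.1% objective and 10-minute window are made-up examples):

```
ALERT ErrorRatioSLOViolation
  IF dc:errors:ratio_rate10s > 0.001
  FOR 10m
```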

89 of 92

Demo

90 of 92

Recap

  • Use “higher level abstractions” to lower cost of maintenance
  • Use metrics, not checks, to get Big Data
  • Design alerts based on Service Objectives

91 of 92

Fin

jaq@google.com

http://prometheus.io

http://github.com/jaqx0r/blts

“My Philosophy on Alerting”

“Achieving Rapid Response Times in Large Online Services”

Prometheus (2012) Poster © 20th Century Fox

92 of 92

Question Time