1 of 92

A practical guide to monitoring and alerting with time series at scale

Velocity SC 2016

Jamie Wilkinson <jaq@google.com>

Site Reliability Engineering, Google

2 of 92

SRE

Site Reliability Engineering in Storage Infrastructure at Google

“SRE is what happens when you ask a software engineer to design an operations function.”

3 of 92

4 of 92

SRE

Ultimately responsible for the reliability of google.com

Less than 50% time spent on operations,

More than 50% on engineering reliability and automation

5 of 92

SRE

Ultimately responsible for the reliability of google.com

Less than 50% time spent on operations,

More than 50% on engineering reliability and automation

SRE = DevOps Engineer

6 of 92

SRE

Ultimately responsible for the reliability of google.com

Monitoring is one of many tools to achieve that goal.

7 of 92

8 of 92

What is “monitoring”?

  • incident response
  • performance analysis
  • capacity planning
  • failure detection

9 of 92

proximity in time

measurement granularity

performance analysis

capacity planning

incident response

failure detection

10 of 92

proximity in time

measurement granularity

performance analysis

capacity planning

incident response

failure detection

11 of 92

Alerting on thresholds

12 of 92

Alert when the beer supply is low

13 of 92

Alert when beer supply low

ALERT BarneyWorriedAboutBeerSupply
  IF cases - 1 - 1 == 1
  ANNOTATIONS {
    summary = "Hey Homer, I'm worried about the beer supply.",
    description = "After this case, and the next case, there's only one case left! Yeah yeah, Oh Barney's right. Yeah, let's get some more beer.. yeah.. hey, what about some beer, yeah Barney's right…",
  }

14 of 92

Disk full alert

Alert when 90% full

Different filesystems have different sizes

10% of 2TB is 200GB

False positive!

Alert on absolute space, < 500MB

Arbitrary number

Different workloads with different needs: 500MB might not be enough warning

15 of 92

Disk full alert

More general alert based on human interactions:

How long before the disk is full?

and

How long will it take for a human to remediate a full disk?
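Those two questions combine into a single predictive rule. A sketch in Prometheus's alerting syntax, assuming a node-exporter-style free-space gauge (the metric name `node_filesystem_free` and the four-hour remediation window are illustrative):

```
ALERT DiskWillFillSoon
  IF predict_linear(node_filesystem_free[1h], 4 * 3600) < 0
  FOR 5m
  ANNOTATIONS {
    summary = "Filesystem predicted to be full within 4h",
  }
```

`predict_linear()` fits a linear regression to the last hour of samples and extrapolates four hours ahead, so the alert fires only when the disk would fill before a human could respond.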

16 of 92

CALCULUS

😱

17 of 92

Alerting on rates of change

18 of 92

Dennis Hopper's Alert

19 of 92

Dennis Hopper's Alert

ALERT BombArmed

IF speed_mph >= 50

ANNOTATIONS {

summary = “Pop quiz, hotshot!”

}

ALERT EXPLODE

IF max_over_time(ALERTS{alertname="BombArmed",alertstate="firing"}[1d]) > 0 and speed_mph < 50

20 of 92

Keanu's Alert

21 of 92

22 of 92

Keanu's alert

23 of 92

Keanu's alert

ALERT StartSavingTheBus

IF (v - 50)/a <= ${threshold}
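One concrete reading of that rule, assuming `v` is a gauge `speed_mph`, the deceleration `a` is estimated from the last 30 seconds of samples via `deriv()`, and a made-up 60-second value stands in for `${threshold}`:

```
ALERT StartSavingTheBus
  IF (speed_mph - 50) / abs(deriv(speed_mph[30s])) <= 60
```

i.e. page when, at the current rate of deceleration, fewer than 60 seconds remain before the bus drops below 50 mph.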

24 of 92

Why does #monitoringsuck?

TL;DR:

when the cost of maintenance is too high

  • to improve the quality of alerts
  • to improve exploratory tools

25 of 92

Why does X ∀ X ∈ {Ops} suck?

the cost of maintenance must scale sublinearly with the growth of the service

26 of 92

Callback!

Less than 50% time spent on operations,

More than 50% on engineering reliability and automation

27 of 92

(graph: cost over time; the cost of “ops work” must grow more slowly than service size, e.g. queries, storage footprint, cores used)

28 of 92

Automate yourself out of a job

  • Homogeneity, Configuration Management
  • Abstractions, Higher level languages
  • Convenient interfaces in tools
    • scriptable
    • Service Oriented Architectures

29 of 92

30 of 92

31 of 92

32 of 92

Borgmon is

  • a metric collector
  • a timeseries database
  • an interactive query tool
  • a programmable calculator for discrete multidimensional vector streams
  • nearly identical to Prometheus

33 of 92

34 of 92

http://prometheus.io

35 of 92

How it works

  • Dynamically discover target addresses
  • Scrape /metrics pages
    • evenly distributed load across targets
  • Evaluate rulesets mapped to targets
    • vector arithmetic
  • Send alerts
  • Record to Timeseries Database (TSDB)

36 of 92

Process Overview

(diagram: Prometheus scrapes the /metrics page of monitored task 0 over HTTP; a browser queries Prometheus)

37 of 92

Service Discovery

(diagram: Prometheus discovers monitored tasks 0, 1, 2 via ZK, etcd, consul, etc., and scrapes each task's /metrics over HTTP; a browser queries Prometheus)

38 of 92

Alert Notifications

(diagram: Prometheus scrapes monitored tasks 0, 1, 2 and sends alerts as key-value pairs to the Alertmanager, which notifies via email, PagerDuty, Slack, etc.)

39 of 92

Long-term storage

(diagram: as before, with Prometheus also recording samples to a long-term Timeseries Database (TSDB))

40 of 92

Global & other monitoring

(diagram: another Prometheus, e.g. a global one, reads from this Prometheus; scraping, the TSDB, and Alertmanager notifications continue as before)

41 of 92

Sprinkle some shards on it

(diagram: Prometheus scraper shards each scrape a subset of monitored tasks 0 through 2000; an upper-level Prometheus aggregates from the shards, records to the TSDB, and sends alerts via the Alertmanager)

42 of 92

alert design

43 of 92

SLAs, SLOs, SLIs

  • SLI → Indicator: a measurement
    • 99.9th percentile response latency
  • SLO → Objective: a goal
    • below 5ms, over a 10 minute interval
  • SLA → Agreement: economic incentives
    • or we get paged
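That SLI/SLO pair translates directly into a paging rule. A sketch assuming latency is exported as a Prometheus histogram (the metric name `request_duration_seconds` is illustrative):

```
ALERT LatencySLOViolation
  IF histogram_quantile(0.999,
       sum by (le) (rate(request_duration_seconds_bucket[10m]))) > 0.005
```

This pages only when the 99.9th percentile over the 10-minute window exceeds 5ms, i.e. when the objective itself is violated.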

44 of 92

Clients provision against SLO

45 of 92

“My Philosophy on Alerting”

Rob Ewaschuk

  • Every time my pager goes off, I should be able to react with a sense of urgency. I can only do this a few times a day before I get fatigued.
  • Every page should be actionable; simply noting "this paged again" is not an action.
  • Every page should require intelligence to deal with: no robotic, scriptable responses.

46 of 92

“Alerts” don’t have to page you

Alerts that do page should indicate violations of SLO.

“Undisciplined” alerts can still aid diagnosis: let them fire, but route them nowhere except a debugging console.

  • disk fullness
  • task crashes
  • backend slowness
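One way to express such a non-paging alert, assuming Alertmanager is configured to route on a `severity` label (the label name and value are conventions of this sketch, not builtins, and the metric name is illustrative):

```
ALERT TaskCrashLooping
  IF rate(task_restarts_total[5m]) > 0
  LABELS {
    severity = "console",
  }
```

The rule fires and shows up on the console, but the routing configuration never sends it to a pager.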

47 of 92

instrumentation

enough blabber, let’s build something

48 of 92

Prometheus Client API

import "github.com/prometheus/client_golang/prometheus"

var request_count = prometheus.NewCounter(prometheus.CounterOpts{
    Name: "requests", Help: "total requests"})

func HandleRequest(…) {
    request_count.Add(1)
    …
}

49 of 92

/metrics handlers can be plain text

# HELP requests total requests
# TYPE requests counter
requests 20056

# HELP errors total errors served
# TYPE errors counter
errors{code="400"} 2027
errors{code="500"} 824

(also supports a binary format)

50 of 92

Tips for effective metric setup

  • Let Prometheus aggregate for you
  • Prefer numbers over text
  • Avoid timestamps
  • Initialize variables on program start 

51 of 92

Tips for effective metric setup

Let Prometheus aggregate for you

  • Fine:  

queries_total 1234.0

errors_total  12.0

  • Questionable:   

qps 18.0

error_rate 0.2

 

 Why?

52 of 92

Timeseries Have Types

Counter: monotonically nondecreasing

“monotonic”: preserves the order, i.e. only goes up

“nondecreasing”: it can stay flat

53 of 92

Timeseries Have Types

Gauge: everything else... not monotonic

54 of 92

Counters FTW

(graph: a climbing counter, sampled every Δt)

55 of 92

Counters FTW

no loss of meaning after sampling

(graph: the samples, Δt apart, still capture the counter's total increase)

56 of 92

Gauges FTL

(graph: a spiky gauge, sampled every Δt)

57 of 92

Gauges FTL

lose spike events shorter than sampling interval

(graph: the samples miss spikes that fall between them)
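The contrast can be made concrete with a counter. Suppose `requests` is scraped once a minute and two successive samples read 1200 and 1260:

```
rate(requests[1m])   # ≈ (1260 - 1200) / 60s = 1 request/s
```

The 60 requests are counted even if they all arrived in one burst between the two scrapes, whereas a self-reported `qps` gauge sampled at the same two instants could read 0 both times and lose the burst entirely.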

58 of 92

Tips for effective metric setup

Prefer numbers over strings

  • Helpful:

exceptions{name="GksSqlPermissionsException"} 142

  • Less Helpful:

last_exception GksSqlPermissionsException

Why?

59 of 92

Tips for effective metric setup

Avoid timestamps

  • Usually unnecessary:  

last_sync_time 1209770045.0

  • Usually better: 

sync_total 534.0

Why?
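Because the counter version needs no clock arithmetic. Comparing `time() - last_sync_time` against a threshold breaks when clocks skew or the exporter restarts; with a counter you just ask whether syncs stopped happening:

```
ALERT SyncStalled
  IF rate(sync_total[30m]) == 0
```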

60 of 92

collection

61 of 92

Configuring Prometheus

prometheus.yml

[targets, etc]

rule files

(DSL)

62 of 92

Configuring Prometheus

prometheus.yml:

global:
  scrape_interval: 1m
  labels:          # Added to all targets
    zone: us-east

rule_files:
  [ - <filepath> ... ]

scrape_configs:
  [ - <scrape_config> ... ]

63 of 92

Finding Targets

scrape_configs:
  - job_name: "smtp"
    static_configs:
      - targets:
        - 'mail.example.com:3903'

  - job_name: "barserver"
    file_sd_configs:
      - [json filenames generated by, e.g. puppet]

  - job_name: "webserver"
    dns_sd_configs:
      - names:     # DNS SRV lookup
        - web.example.com

  - job_name: "fooserver"
    consul_sd_configs: []   # autodiscovery from consul queries

64 of 92

Labels & Vectors

65 of 92

Data Storage Requirements

  • A 'service' can consist of:
    • multiple processes running many operations
    • multiple machines
    • multiple datacenters
  • The solution needs to:
    • Keep high-dimension data organized
    • Allow various aggregation types (max, average, percentile)
    • Allow flexible querying and slicing of data (by machine, by datacenter, by error type, etc)

66 of 92

The timeseries arena

  • Data is stored in one global database in memory (checkpointed to disk)
  • Each data point has the form: (timestamp, value)
  • Data points are stored in chronological lists called timeseries.
  • Each timeseries is named by a set of unique labels, of the form name=value
  • Timeseries data can be queried via a variable reference (a specification of labels and values).  
    • The result is a vector or matrix.

67 of 92

Structure of timeseries

(diagram: each timeseries is keyed by a set of labels (label1, label2, label3, label4, …) and holds a chronological list of (timestamp, value) points)

68 of 92

Variables and Labels

Labels come from

  • the target’s name: job, instance
  • the target’s exported metrics
  • the configuration: labels, relabels
  • the processing rules

69 of 92

Variables and labels

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

70 of 92

Variables and labels

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

errors{job="web"}

71 of 92

Variables and labels

errors{job="web",zone="us-west"}

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

72 of 92

Single-valued Vector

errors{job="web",zone="us-east",instance="server01:8000",code="500"}

{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424

73 of 92

rule evaluation

74 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

requests{instance="localhost:8001",job="web"} 21235 21244

requests{instance="localhost:8005",job="web"} 21211 21222

task:requests:rate10s{instance="localhost:8007",job="web"} 8.777777777777779

task:requests:rate10s{instance="localhost:8009",job="web"} 10.222222222222223

75 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“variable reference”

76 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“range expression”

77 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“function”

78 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“recorded variable”

79 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“level”

80 of 92

recording rules

task:requests:rate10s =

rate(requests{job="web"}[10s])

“operation”

81 of 92

recording rules

task:requests:sum =

sum(requests{job="web"})

“operation”

82 of 92

recording rules

task:requests:delta10s =

delta(requests{job="web"}[10s])

“operation”

83 of 92

recording rules

task:errors:delta10s =

delta(errors{job="web"}[10s])

“name”

84 of 92

aggregation based on topology

task:requests:rate10s =

rate(requests{job="web"}[10s])

dc:requests:rate10s =

sum by (job,zone)(

task:requests:rate10s)

global:requests:rate10s =

sum by (job)(dc:requests:rate10s)
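Each level of the aggregation is a natural place to alert. For example, a global traffic-drop rule can compare against the same time yesterday (the 30% drop threshold is an arbitrary illustration):

```
ALERT GlobalTrafficDrop
  IF global:requests:rate10s < 0.7 * (global:requests:rate10s offset 1d)
```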

85 of 92

aggregation based on topology

task:requests:rate10s =

rate(requests{job="web"}[10s])

dc:requests:rate10s =

sum by (job,zone)(

task:requests:rate10s)

global:requests:rate10s =

sum by (job)(dc:requests:rate10s)

86 of 92

relations based on schema

dc:errors:ratio_rate10s =

sum by (job)(dc:errors:rate10s)

/ on (job)

dc:requests:rate10s

87 of 92

relations based on schema

dc:errors:ratio_rate10s =

sum by (job)(dc:errors:rate10s)

/ on (job)

dc:requests:rate10s

88 of 92

relations based on schema

dc:errors:ratio_rate10s =

dc:errors:rate10s

/ on (job) group_left(code)

dc:requests:rate10s
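The resulting ratio is already normalized by traffic, so a single threshold works at any scale. A paging rule on the SLO might look like this (the 0.1% objective and 10-minute window are made-up examples):

```
ALERT ErrorRatioSLOViolation
  IF dc:errors:ratio_rate10s > 0.001
  FOR 10m
```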

89 of 92

Demo

90 of 92

Recap

  • Use “higher level abstractions” to lower cost of maintenance
  • Use metrics, not checks, to get Big Data
  • Design alerts based on Service Objectives

91 of 92

Fin

jaq@google.com

http://prometheus.io

http://github.com/jaqx0r/blts

“My Philosophy on Alerting”

“Achieving Rapid Response Times in Large Online Services”

Prometheus (2012) Poster © 20th Century Fox

92 of 92

Question Time