A practical guide to monitoring and alerting with time series at scale
SRE
Site Reliability Engineering in Storage Infrastructure at Google
“SRE is what happens when you ask a software engineer to design an operations function.”
SRE
Ultimately responsible for the reliability of google.com
Less than 50% time spent on operations,
More than 50% on engineering reliability and automation
SRE = DevOps Engineer
SRE
Ultimately responsible for the reliability of google.com
Monitoring is one of many tools to achieve that goal.
What is “monitoring”?
proximity in time
measurement granularity
performance analysis
capacity planning
incident response
failure detection
Alerting on thresholds
Alert when the beer supply is low
Alert when beer supply low
ALERT BarneyWorriedAboutBeerSupply
IF cases - 1 - 1 == 1
ANNOTATIONS {
summary = "Hey Homer, I'm worried about the beer supply."
description = "After this case, and the next case, there's only one case left! Yeah yeah, oh, Barney's right. Yeah, let's get some more beer... yeah... hey, what about some beer? Yeah, Barney's right..."
}
Disk full alert
Alert when 90% full
Different filesystems have different sizes
10% of 2TB is 200GB
False positive!
Alert on absolute space, < 500MB
Arbitrary number
Different workloads with different needs: 500MB might not be enough warning
Disk full alert
More general alert based on human interactions:
How long before the disk is full?
and
How long will it take for a human to remediate a full disk?
CALCULUS
😱
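No need to panic: Prometheus can do the calculus for us. predict_linear() extrapolates a series' recent trend, so the two human questions above collapse into one rule. A minimal sketch, assuming node_exporter-style filesystem metrics; the metric name and the four-hour horizon are illustrative:

ALERT DiskWillFillSoon
IF predict_linear(node_filesystem_free[1h], 4 * 3600) < 0
FOR 5m
ANNOTATIONS {
summary = "disk predicted to be full within 4 hours"
}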
Alerting on rates of change
Dennis Hopper's Alert
ALERT BombArmed
IF speed_mph >= 50
ANNOTATIONS {
summary = "Pop quiz, hotshot!"
}
ALERT EXPLODE
IF max_over_time(ALERTS{alertname="BombArmed", alertstate="firing"}[1d]) > 0 and on() speed_mph < 50
Keanu's Alert
ALERT StartSavingTheBus
IF (v - 50)/a <= ${threshold}
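The same idea in concrete PromQL, assuming speed_mph is an exported gauge: deriv() estimates the gauge's per-second slope, so the expression approximates the seconds left until the bus drops below 50 mph. The 60-second threshold is illustrative:

ALERT StartSavingTheBus
IF (50 - speed_mph) / deriv(speed_mph[30s]) <= 60
and (deriv(speed_mph[30s]) < 0)   # only meaningful while decelerating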
Why does #monitoringsuck?
TL;DR:
when the cost of maintenance is too high
Why does X ∀ X ∈ {Ops} suck?
the cost of maintenance must scale sublinearly with the growth of the service
Callback!
Less than 50% time spent on operations,
More than 50% on engineering reliability and automation
service size: e.g. queries, storage footprint, cores used
[Chart: "ops work" cost over time, growing with service size; it must grow sublinearly]
Automate yourself out of a job
Borgmon is, approximately, Prometheus:
http://prometheus.io
How it works
Process Overview
[Diagram: Prometheus scrapes /metrics over http from monitored task 0; a browser queries Prometheus.]
Service Discovery
[Diagram: service discovery (ZK, etcd, consul, etc.) tells Prometheus which tasks to scrape; monitored tasks 0-2 expose /metrics over http.]
Alert Notifications
[Diagram: Prometheus scrapes monitored tasks 0-2; alerts, as key-value pairs, go to the Alertmanager, which notifies via email, PagerDuty, Slack, etc.]
Long-term storage
[Diagram: as above, with Prometheus also writing samples to a TSDB for long-term storage.]
Global & other monitoring
[Diagram: as above, with other Prometheus servers (e.g. a global one) scraping aggregates from this one.]
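One way to wire up the global layer, assuming Prometheus federation: the global server scrapes aggregate series from each lower-level server's /federate endpoint. The target address and match pattern here are illustrative:

scrape_configs:
- job_name: "federate"
  honor_labels: true
  metrics_path: "/federate"
  params:
    "match[]":
    - '{__name__=~"dc:.*"}'     # pull only the dc-level aggregates
  static_configs:
  - targets:
    - "dc1-prometheus:9090"     # illustrative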
Sprinkle some shards on it
[Diagram: scraper shards split monitored tasks 0-2000 between them; a global Prometheus aggregates from the shards, with the Alertmanager, TSDB, and notification channels as before.]
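One way to implement the scraper shards, assuming relabel_configs with the hashmod action; the job and shard count are illustrative. Each shard keeps only the targets that hash into its bucket:

scrape_configs:
- job_name: "web"
  dns_sd_configs:
  - names:
    - web.example.com
  relabel_configs:
  - source_labels: [__address__]
    modulus: 4                  # total number of scraper shards
    target_label: __tmp_hash
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: "0"                  # this shard keeps bucket 0
    action: keep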
alert design
SLAs, SLOs, SLIs
Clients provision against SLO
Jeff Dean, “A Reliable Whole From Unreliable Parts”
“Achieving Rapid Response Times in Large Online Services”
http://research.google.com/people/jeff/Berkeley-Latency-Mar2012.pdf
“My Philosophy on Alerting”
Rob Ewaschuk
“Alerts” don’t have to page you
Alerts that do page should indicate violations of SLO.
“Undisciplined” alerts can be used to help diagnosis by firing but being routed nowhere, displayed on a debugging console.
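Putting that together, a page should fire only when the SLO is at risk. A minimal sketch, assuming the dc:errors:ratio_rate10s recording rule defined later in this deck and an illustrative 99.9% availability SLO:

ALERT ErrorBudgetBurn
IF dc:errors:ratio_rate10s > 0.001
FOR 5m
ANNOTATIONS {
summary = "error ratio is violating the 99.9% SLO"
}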
instrumentation
enough blabber, let’s build something
Prometheus Client API
import "github.com/prometheus/client_golang/prometheus"

var request_count = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "requests", Help: "total requests"})

func init() {
	// register so the counter is exported on /metrics
	prometheus.MustRegister(request_count)
}

func HandleRequest … {
…
request_count.Inc() // equivalent to request_count.Add(1)
…
/metrics handlers can be plain text
# HELP requests total requests
# TYPE requests counter
requests 20056

# HELP errors total errors served
# TYPE errors counter
errors{code="400"} 2027
errors{code="500"} 824
(also supports a binary format)
Tips for effective metric setup
Let Prometheus aggregate for you
Export these:
queries_total 1234.0
errors_total 12.0
not these:
qps 18.0
error_rate 0.2
Why?
Timeseries Have Types
Counter: monotonically nondecreasing
"preserves the order" i.e. UP
"nondecreasing" can be flat
Timeseries Have Types
Gauge: everything else... not monotonic
Counters FTW
no loss of meaning after sampling
[Graph: a counter sampled at interval Δt still captures everything that happened between samples]
Gauges FTL
lose spike events shorter than the sampling interval
[Graph: a gauge sampled at interval Δt misses spikes between samples]
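This is why the earlier tip said to export queries_total rather than qps: Prometheus can derive the rate at query time, over any window, for example:

# qps, derived from the counter at query time
rate(queries_total[5m])

# error ratio, likewise
rate(errors_total[5m]) / rate(queries_total[5m])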
Tips for effective metric setup
Prefer numbers over strings
exceptions{name="GksSqlPermissionsException"} 142
not:
last_exception GksSqlPermissionsException
Why?
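Numbers with labels can be rated and aggregated across the fleet; a string value cannot. An illustrative query over the exceptions counter above:

# top five exception types by rate of occurrence
topk(5, sum by (name)(rate(exceptions[10m])))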
Tips for effective metric setup
Avoid timestamps
Instead of:
last_sync_time 1209770045.0
export:
sync_total 534.0
Why?
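One reason: with the counter, staleness falls out of a server-side query, with no client clock skew to worry about. A minimal sketch, reusing the illustrative sync_total counter:

ALERT SyncStalled
IF rate(sync_total[15m]) == 0
FOR 15m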
collection
Configuring Prometheus
prometheus.yml
[targets, etc]
rule files
(DSL)
Configuring Prometheus
prometheus.yml:
global:
  scrape_interval: 1m
  labels:              # added to all targets
    zone: us-east
rule_files:
  [ - <filepath> ... ]
scrape_configs:
  [ - <scrape_config> ... ]
Finding Targets
scrape_configs:
- job_name: "smtp"
  static_configs:
  - targets:
    - "mail.example.com:3903"
- job_name: "barserver"
  file_sd_configs:
  - [json filenames generated by, e.g., puppet]
- job_name: "webserver"
  dns_sd_configs:
  - names:             # DNS SRV lookup
    - web.example.com
- job_name: "fooserver"
  consul_sd_configs:   # autodiscovery from consul queries
  - server: "consul.example.com:8500"   # illustrative server address
Labels & Vectors
Data Storage Requirements
The timeseries arena
Structure of timeseries
[Diagram: each timeseries is identified by a set of labels: label1, label2, label3, label4, ...]
Variables and Labels
Labels come from the target's own instrumentation (e.g. code) and from the scrape config (e.g. job, instance, zone)
Variables and labels
{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424
Variables and labels
{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424
errors{job="web"}
Variables and labels
errors{job="web",zone="us-west"}
{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424
Single-valued Vector
errors{job="web",zone="us-east",
instance="server01:8000",code="500"}
{var="errors",job="web",instance="server01:8000",zone="us-east",code="500"} 16
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 0
{var="errors",job="web",instance="server01:8080",zone="us-east",code="500"} 12
{var="errors",job="web",instance="server01:8080",zone="us-west",code="500"} 10
{var="errors",job="web",instance="server02:8080",zone="us-west",code="500"} 10
{var="requests",job="web",instance="server01:8080",zone="us-east"} 50456
{var="requests",job="web",instance="server01:8080",zone="us-west"} 12432
{var="requests",job="web",instance="server02:8080",zone="us-west"} 43424
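Because queries return vectors, one expression can summarize the whole fleet; an illustrative aggregation over the series above:

# total errors per zone, collapsing instance and code
sum by (zone)(errors{job="web"})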
rule evaluation
recording rules
task:requests:rate10s =
rate(requests{job="web"}[10s])
requests{instance="localhost:8001",job="web"} 21235 21244
requests{instance="localhost:8005",job="web"} 21211 21222
→
task:requests:rate10s{instance="localhost:8001",job="web"} 8.777777777777779
task:requests:rate10s{instance="localhost:8005",job="web"} 10.222222222222223
recording rules
task:requests:rate10s =
rate(requests{job="web"}[10s])
“variable reference”
recording rules
task:requests:rate10s =
rate(requests{job="web"}[10s])
“range expression”
recording rules
task:requests:rate10s =
rate(requests{job="web"}[10s])
“function”
recording rules
task:requests:rate10s =
rate(requests{job="web"}[10s])
“recorded variable”
recording rules
task:requests:rate10s =
rate(requests{job="web"}[10s])
“level”
recording rules
task:requests:rate10s =
rate(requests{job="web"}[10s])
“operation”
recording rules
task:requests:sum =
sum(requests{job="web"})
“operation”
recording rules
task:requests:delta10s =
delta(requests{job="web"}[10s])
“operation”
recording rules
task:errors:delta10s =
delta(errors{job="web"}[10s])
“name”
aggregation based on topology
task:requests:rate10s =
rate(requests{job="web"}[10s])
dc:requests:rate10s =
sum by (job,zone)(
task:requests:rate10s)
global:requests:rate10s =
sum by (job)(dc:requests:rate10s)
relations based on schema
dc:errors:ratio_rate10s =
sum by (job)(dc:errors:rate10s)
/ on (job)
dc:requests:rate10s
relations based on schema
dc:errors:ratio_rate10s =
dc:errors:rate10s
/ on (job) group_left(code)
dc:requests:rate10s
# group_left: many-to-one match; each errors series (per code)
# is divided by the single matching requests rate for its job
Demo
Recap
Fin
“My Philosophy on Alerting”
“Achieving Rapid Response Times in Large Online Services”
Prometheus (2012) Poster © 20th Century Fox
Question Time