1 of 67

building twitter’s next-gen

ALERTING SYSTEM

2 of 67

dan sotolongo

@sortalongo

megan kanne

@megankanne

justin nguyen

@justanguyen

#talk-megan-justin-dan

3 of 67

OBSERVABILITY AT TWITTER

scalable

robust

realtime

estwo on flickr

4 of 67

5 of 67

6 of 67

Collection Agent 1

Ingestion Service

Storage (Manhattan)

Timeseries DB Query Engine

Temporal Indexing Service

Alerting Service

Visualization Service

Alerting Cmd Line Tools

Service 1

Timeseries Cmd Line Tools

Collection Agent n

Service n

….

7 of 67

OLD ALERTING SYSTEM

successes and challenges

8 of 67

metrics written per minute

300M (Nov 2013)

4.3B (June 2016)

14x

9 of 67

25K alerts per minute

3M alert monitors per minute

10 of 67

making alerts

11 of 67

alert >

rule >

monitor
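
One plausible reading of this slide is a containment hierarchy: an alert holds rules, and each rule expands into monitors. The sketch below models that reading only; the class and field names (`Alert`, `Rule`, `Monitor`, `target`, `query`, `threshold`) are hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Monitor:
    """One concrete check (hypothetical: a rule applied to one target)."""
    target: str


@dataclass
class Rule:
    """A condition over a query; field names are illustrative assumptions."""
    query: str
    threshold: float
    monitors: List[Monitor] = field(default_factory=list)


@dataclass
class Alert:
    """Top-level object that groups rules."""
    name: str
    rules: List[Rule] = field(default_factory=list)

    def monitor_count(self) -> int:
        # Total monitors across every rule in this alert.
        return sum(len(rule.monitors) for rule in self.rules)
```

Under this reading, a small number of alerts can expand into a much larger number of monitors, which would fit the gap between the two per-minute figures on slide 9.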

12 of 67

the good:

  • Simple config language
  • Large body of existing examples
  • Easy to write, commit, and upload new alerts

the bad:

13 of 67

14 of 67

15 of 67

16 of 67

alerts

dashboards

17 of 67

alerts

dashboards

18 of 67

being on call

19 of 67

Zone1

Zone2

20 of 67

Zone1

Zone2

21 of 67

22 of 67

23 of 67

THE SOLUTION

improve configurations

reduce loss of visibility

taspicsvns on flickr

24 of 67

making alerts: simplicity

25 of 67

chart

alert

26 of 67

27 of 67

server-side validator

follows best practices

queries not expensive

...
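
A minimal sketch of what a server-side validator of this kind could look like, assuming a config carries a query cost estimate plus the runbook and contact fields shown on slide 36. `AlertConfig`, `estimated_series`, and the `MAX_SERIES` budget are hypothetical names and values, not the real system's.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical cost budget: the most timeseries a single alert query may touch.
MAX_SERIES = 1000


@dataclass
class AlertConfig:
    name: str
    query: str
    estimated_series: int  # assumed to come from a query planner
    runbook_url: str = ""
    contact: str = ""


def validate(cfg: AlertConfig) -> List[str]:
    """Return a list of human-readable problems; an empty list means accepted."""
    errors = []
    # Best-practice checks: every alert should tell the on-call what to do
    # and whom to reach.
    if not cfg.runbook_url:
        errors.append(f"{cfg.name}: missing runbook")
    if not cfg.contact:
        errors.append(f"{cfg.name}: missing contact")
    # Cost check: reject queries that would be too expensive to run every minute.
    if cfg.estimated_series > MAX_SERIES:
        errors.append(
            f"{cfg.name}: query touches {cfg.estimated_series} series "
            f"(limit {MAX_SERIES})"
        )
    return errors
```

Returning every problem at once, rather than failing on the first, lets a user fix a config in one round trip.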

28 of 67

on call: reliability

29 of 67

old

zone 1

zone 2

new

zone 1

zone 2

30 of 67

t0

node 1

state: ok

timestamp: t0

t1

node 1

t2

node 2

state: ok

last evaluation: t0

t1

t2

evaluate t1 & t2
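
The timeline above appears to show node 2 taking over at t2, reading the stored state (ok, last evaluation t0), and evaluating the intervals that were missed. A minimal sketch of that catch-up step, assuming integer timestamps and a shared key-value store; `take_over`, `pending_intervals`, and the store keys are hypothetical names.

```python
from typing import Any, Callable, Dict, List


def pending_intervals(last_evaluation: int, now: int) -> List[int]:
    """Timestamps after the last recorded evaluation, up to and including now."""
    return list(range(last_evaluation + 1, now + 1))


def take_over(
    store: Dict[str, Any],
    now: int,
    evaluate: Callable[[int], str],
) -> List[int]:
    """Evaluate every interval missed since the stored last evaluation.

    `store` stands in for persistent shared state; `evaluate` maps a
    timestamp to an alert state such as "ok".
    """
    evaluated = []
    for t in pending_intervals(store["last_evaluation"], now):
        store["state"] = evaluate(t)
        # Persist progress after each interval so another takeover
        # would resume from here rather than from the beginning.
        store["last_evaluation"] = t
        evaluated.append(t)
    return evaluated
```

With the slide's values, a store holding last evaluation 0 and a takeover at time 2 evaluates intervals 1 and 2.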

31 of 67

30%

time to detect in minutes

old

new

2.5

1.75
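
The 30% figure follows directly from the two detection times on this slide; the variable names below are illustrative.

```python
old_minutes = 2.5   # time to detect, old system
new_minutes = 1.75  # time to detect, new system

# Relative reduction in time to detect.
reduction = (old_minutes - new_minutes) / old_minutes
print(f"{reduction:.0%}")  # prints 30%
```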

32 of 67

alerting service

alerting service

alert scheduler

alert runner

start

alert source

zookeeper

configs

configs

timeseries db

alerting api

alert source

zone 1

zone 2

timeseries db

storage (Manhattan)

current state

eval

snooze

react

record

history recorder

notifier

stop?

alert evaluator

snoozed?

...

balancer

shards

shards

balancer
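
One way to read the runner portion of this diagram is as a loop: evaluate, check whether the alert is snoozed, react (notify), record history, then check whether to stop. The sketch below follows that reading only. The function and parameter names are hypothetical, and the actual ordering of steps in the real service may differ.

```python
from typing import Callable


def run_alert(
    evaluate: Callable[[], str],
    is_snoozed: Callable[[], bool],
    notify: Callable[[str], None],
    record: Callable[[str], None],
    should_stop: Callable[[], bool],
) -> None:
    """Run one alert until told to stop."""
    while True:
        state = evaluate()      # "eval": query the timeseries db
        if not is_snoozed():    # "snoozed?": suppress reactions while snoozed
            notify(state)       # "react": hand the state to the notifier
        record(state)           # "record": history is kept even when snoozed
        if should_stop():       # "stop?": e.g. the shard moved to another node
            break
```

Passing each step in as a callable keeps the loop itself free of storage and notification details, which makes it easy to exercise in isolation.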

33 of 67

human reasoning

34 of 67

INTEGRATION

TESTING

bring together signals

postsumptio on flickr

35 of 67

CONTEXT

global context

(twitter)

peer context

(dependencies)

local context

(changes in my system)

36 of 67

runbook

contact

37 of 67

EMPOWER

HUMANS

elaine_macc on flickr

38 of 67

LESSONS LEARNED

seldonscott on flickr

39 of 67

distributed systems

Requirements

  • Availability
  • Scale

Challenges

  • Consistency
  • Structural complexity
  • Reasoning about time

40 of 67

alerting system distribution

Work is split into shards

  • Shard by ruleset
  • Reuse existing tools:
    • Zookeeper
    • Manhattan (similar to Riak, Cassandra)
    • Aurora/Mesos
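
A minimal sketch of sharding by ruleset, assuming a stable hash of a ruleset identifier picks the shard. The talk does not say how rulesets are mapped to shards, so the hashing scheme and the name `shard_for` are assumptions for illustration.

```python
import hashlib


def shard_for(ruleset_id: str, num_shards: int) -> int:
    """Map a ruleset to a shard index that is stable across processes.

    Uses SHA-256 rather than Python's built-in hash(), which is
    randomized per process for strings.
    """
    digest = hashlib.sha256(ruleset_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Because the whole ruleset lands on one shard, all of its rules can be evaluated together without coordinating across shards.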

41 of 67

distributed systems design

42 of 67

engineering principles

  • Abstract cleanly
  • Minimize coordination
  • Parallelize work
  • Distribute load evenly
  • Do it fast!

43 of 67

44 of 67

(small)

45 of 67

46 of 67

47 of 67

48 of 67

END

49 of 67

sharding rulesets

[diagram: one Ruleset, two Rules, six Fanouts]

50 of 67

51 of 67

sharding rulesets

[diagram: two Rulesets, each with two Rules and six Fanouts]
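
Since rulesets differ in how many fanouts they produce, spreading them evenly probably needs to account for weight, not just count. The sketch below is a greedy balancer that assigns the heaviest rulesets first to the least-loaded shard. It is an illustrative technique, not the balancer described in the talk, and `balance` and its parameters are hypothetical.

```python
import heapq
from typing import Dict, List


def balance(ruleset_fanouts: Dict[str, int], num_shards: int) -> List[List[str]]:
    """Assign rulesets to shards, weighting each ruleset by its fanout count."""
    shards: List[List[str]] = [[] for _ in range(num_shards)]
    # Min-heap of (current load, shard index); the lightest shard pops first.
    heap = [(0, i) for i in range(num_shards)]
    heapq.heapify(heap)
    # Heaviest rulesets first; ties broken by name for deterministic output.
    ordered = sorted(ruleset_fanouts.items(), key=lambda kv: (-kv[1], kv[0]))
    for name, weight in ordered:
        load, i = heapq.heappop(heap)
        shards[i].append(name)
        heapq.heappush(heap, (load + weight, i))
    return shards
```

For example, four rulesets with fanout counts 6, 6, 3, and 3 split across two shards end up with a load of 9 on each.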

52 of 67

distributed systems design

?

53 of 67

So many vows…

No matter what you do,

you’re forsaking one vow or the other.

54 of 67

users

Support

Collaboration

55 of 67

user support

Sisyphus, Marcell Jankovics

56 of 67

user support: front line

Interaction points

  • UI
  • Command line tools
  • Configuration libraries

Helping out

  • Comprehensive validation
  • Helpful usage messages
  • Extensive documentation of user-visible code

57 of 67

user support: second line

User guides

  • Getting started
  • Core concepts
  • Best practices
  • FAQ
  • API Docs

Documentation lives with code (in a monorepo)

58 of 67

user support: third line

59 of 67

user collaboration

Migrations:

  • Some happy
  • Others…

60 of 67

61 of 67

user collaboration

NEW!

62 of 67

user collaboration

NEW!

63 of 67

user collaboration

NEW!

64 of 67

Peter Trevelyan, Shifting Lines

65 of 67

thanks to

66 of 67

Ian Brown

Jonathan Cao

Hao Huang

Aras Saulys

Ning Wang

Si Wang

Mike Moreno

Caitie McCaffrey

Anthony Asta

JC Martin

Ryan O’Neill

Steven Parkes

Jacob Reiff

Yann Ramin

Michael Suzuki

Franklin Hu

Cory Watson

67 of 67

QUESTIONS?

if this sounds cool, come talk to us:

Justin Nguyen, @jnguyen, @justanguyen

Megan Kanne, @megan, @megankanne

Dan Sotolongo, @sortalongo_, @sortalongo

#talk-megan-justin-dan