1 of 62

Building Reliable Distributed Systems

Craft Conference, June 2, 2022

Budapest

2 of 62

Loren Sands-Ramshaw

Author of The GraphQL Guide

Full-stack developer

Language Runtime Engineer

@ Temporal.io

3 of 62

Distributed systems
Reliable distributed systems
Why use Temporal
How to use Temporal

5 of 62

github.com/lorensr/distributed-systems

6 of 62

The TL;DR Intro

7 of 62

Temporal is the open source runtime for managing distributed application state at scale.

What is Temporal?

8 of 62

Why?

The most valuable, mission-critical workloads in any software company are long-running and tie together multiple services.

9 of 62

System Requirements

Because this work is complex:

You want to easily model dynamic asynchronous logic...
...and reuse, test, version and migrate it.

Because this work relies on unreliable systems:

You want to standardize timeouts and retries.
You want to offer "reliability on rails" to every team.

Because this work is so important:

You must never drop any work.
You must log all progress.
You must be able to scale it up without replatforming.

Orchestration

Event Sourcing

Workflows as Code

10 of 62

Status Quo

choreography

Temporal

orchestration

Commands

Queries

Cloud/Platform

App Devs

12 of 62

Microservices Death Star

17 of 62

Choreography vs Orchestration

https://theburningmonk.com/2020/08/choreography-vs-orchestration-in-the-land-of-serverless/
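To make the contrast concrete, here is a minimal sketch of orchestration in plain TypeScript. It is not the Temporal SDK: the `Step` type, the `orchestrate` function, and the undo-on-failure behavior are all assumptions made for illustration. The point is only that one function owns the whole sequence and its failure handling, whereas in choreography each service reacts to the previous service's event and no single place describes the flow.

```typescript
// Hypothetical step shape; in a real system each step would call a separate service.
type Step = { name: string; run: () => void; undo: () => void };

// Orchestration: a single function owns the order of steps and the failure handling.
// Returns true if every step succeeded, false if the sequence was rolled back.
function orchestrate(steps: Step[], log: string[]): boolean {
  const completed: Step[] = [];
  for (const step of steps) {
    try {
      step.run();
      log.push(`done:${step.name}`);
      completed.push(step);
    } catch {
      log.push(`failed:${step.name}`);
      // Undo completed steps in reverse order so the system ends in a consistent state.
      for (const done of completed.reverse()) {
        done.undo();
        log.push(`undone:${done.name}`);
      }
      return false;
    }
  }
  return true;
}
```

Because the flow lives in one place, adding a step or changing the rollback order is a local code change rather than a change to several event subscriptions.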

18 of 62

Programming Model

19 of 62

Programming Model

22 of 62

Timeouts and Retries
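As a rough illustration of what "standardized" timeouts and retries involve, the sketch below implements a per-attempt timeout and exponential backoff in plain TypeScript. The `RetryPolicy` fields and all function names are assumptions made for this example and are not the Temporal SDK's API; a platform that provides this out of the box removes the need to hand-write it per service.

```typescript
// Hypothetical retry policy; field names are illustrative only.
interface RetryPolicy {
  initialIntervalMs: number;  // delay before the first retry
  backoffCoefficient: number; // multiplier applied after each failed attempt
  maximumIntervalMs: number;  // cap on any single delay
  maximumAttempts: number;    // total attempts, including the first
}

// Delay to wait before each retry: one entry per retry, so maximumAttempts - 1 entries.
function retryDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  let delay = policy.initialIntervalMs;
  for (let attempt = 1; attempt < policy.maximumAttempts; attempt++) {
    delays.push(Math.min(delay, policy.maximumIntervalMs));
    delay *= policy.backoffCoefficient;
  }
  return delays;
}

// Rejects if `work` does not settle within `timeoutMs`.
function withTimeout<T>(work: Promise<T>, timeoutMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${timeoutMs} ms`)),
      timeoutMs,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Runs `task` with a per-attempt timeout, retrying according to the policy.
// Rethrows the last error once all attempts are used up.
async function runWithRetries<T>(
  task: () => Promise<T>,
  policy: RetryPolicy,
  timeoutMs: number,
): Promise<T> {
  const delays = retryDelays(policy);
  let lastError: unknown;
  for (let attempt = 0; attempt < policy.maximumAttempts; attempt++) {
    try {
      return await withTimeout(task(), timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt >= delays.length) break; // no retries left
      const delay = delays[attempt];
      await new Promise<void>((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Keeping the delay schedule in a pure function (`retryDelays`) makes the policy easy to unit-test without waiting on real timers.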

25 of 62

Code demo

github.com/temporalio/samples-typescript

26 of 62

Workflow APIs

27 of 62

Monthly Billing
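The shape of a monthly billing workflow can be sketched as an ordinary loop: wait one period, charge, repeat until cancelled. The version below is plain TypeScript with injected dependencies so that it runs without any SDK. `BillingDeps`, `monthlyBilling`, and the fixed 30-day period are assumptions for illustration, not the code shown in the demo; in a durable runtime the `sleep` would be a timer that survives process restarts.

```typescript
// Hypothetical dependencies, injected so the sketch runs without any SDK.
interface BillingDeps {
  sleep: (ms: number) => Promise<void>; // stands in for a durable timer
  charge: (customerId: string, amountCents: number) => Promise<void>;
  isCancelled: () => boolean; // stands in for a cancellation signal
}

// Simplification: a fixed 30-day billing period.
const MONTH_MS = 30 * 24 * 60 * 60 * 1000;

// Charges once per period until cancelled or until maxPeriods is reached.
// Returns the number of successful charges.
async function monthlyBilling(
  customerId: string,
  amountCents: number,
  maxPeriods: number,
  deps: BillingDeps,
): Promise<number> {
  let charged = 0;
  for (let period = 0; period < maxPeriods; period++) {
    await deps.sleep(MONTH_MS);
    if (deps.isCancelled()) break; // stop before charging a cancelled customer
    await deps.charge(customerId, amountCents);
    charged++;
  }
  return charged;
}
```

In a test, `sleep` can resolve immediately and `charge` can record calls, so a year of billing logic is exercised in milliseconds.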

28 of 62

Signals and Queries
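The distinction between the two can be sketched with a toy in-memory object: a signal is a message that may change workflow state, and a query is a read-only view of that state. The `ApprovalWorkflow` class and its method names below are hypothetical and written in plain TypeScript; they are not the Temporal SDK's signal and query APIs.

```typescript
type ApprovalStatus = "pending" | "approved" | "rejected";

// Toy in-memory "workflow"; class and method names are illustrative only.
class ApprovalWorkflow {
  private status: ApprovalStatus = "pending";
  private readonly history: string[] = [];

  // Signal: a message from outside that may change workflow state.
  signal(name: "approve" | "reject", by: string): void {
    if (this.status !== "pending") return; // ignore signals after a decision
    this.status = name === "approve" ? "approved" : "rejected";
    this.history.push(`${name} by ${by}`);
  }

  // Query: a read-only view of current state; it never mutates anything.
  query(name: "status"): ApprovalStatus;
  query(name: "history"): readonly string[];
  query(name: "status" | "history"): ApprovalStatus | readonly string[] {
    // Return a copy of the history so callers cannot modify internal state.
    return name === "status" ? this.status : [...this.history];
  }
}
```

A human-in-the-loop approval would send `signal("approve", ...)` when a reviewer clicks a button, while a dashboard would call `query("status")` as often as it likes without affecting the workflow.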

30 of 62

Core APIs
Advanced APIs

Workflow APIs (Timers, Signals, Queries, Child/External WFs, continueAsNew, SideEffects)
Activity APIs (Retries, Timeouts, Heartbeating, Cancellation)
Visibility APIs
Performance APIs (Local Activities)

Security

mTLS (AuthN)
Authorizer (AuthZ)
DataConverter
Namespaces

Maintenance

Testing
Versioning & Replay

Production

Logging
Monitoring/Metrics

Experimental

Archival
Multi-Cluster

31 of 62

Server

DevTools

SDKs

tctl CLI

33 of 62

Status Quo

choreography

Temporal

orchestration

Commands

Queries

Cloud/Platform

App Devs

34 of 62

Outcomes

More reliable

Fail to execute/drop data less often: from 1 production incident a week to ~0
When parts of the application do fail, always recover to a consistent state

More productive

40-60% fewer lines of code and infra when writing features
DistSys/Orchestration concerns outsourced to Temporal

Easier to operate

Temporal consolidates errors, lets you make fixes without downtime
Event sourced system is highly observable by default
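The "observable by default" and "recover to a consistent state" points both follow from event sourcing: progress is stored as an append-only log, and current state is derived by replaying it. The sketch below shows the idea in plain TypeScript. The event shapes and the `replay` and `nextStep` functions are assumptions for illustration and do not reflect Temporal's actual event history format.

```typescript
// Hypothetical progress events appended to a log as work proceeds.
type ProgressEvent =
  | { type: "StepCompleted"; step: string; result: string }
  | { type: "StepFailed"; step: string; error: string };

interface Progress {
  completed: Map<string, string>; // step name -> recorded result
  failures: number;               // how many failed attempts were logged
}

// State is never stored directly; it is rebuilt by folding over the log.
function replay(log: readonly ProgressEvent[]): Progress {
  const progress: Progress = { completed: new Map<string, string>(), failures: 0 };
  for (const event of log) {
    if (event.type === "StepCompleted") {
      progress.completed.set(event.step, event.result);
    } else {
      progress.failures++;
    }
  }
  return progress;
}

// After a crash, resume at the first planned step that has no completion event.
function nextStep(plan: readonly string[], log: readonly ProgressEvent[]): string | undefined {
  const { completed } = replay(log);
  return plan.find((step) => !completed.has(step));
}
```

The same log that drives recovery also serves as an audit trail: every completed step and every failure is visible without adding separate logging.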

35 of 62

Business Transactions

Needs

Handling subscriptions, installment payments, and communications reliably
Integrating multiple payment systems & ecommerce backends
Detecting/triaging suspicious activity

Coinbase Quotes

“Temporal maintains the high level of reliability offered by the homegrown system while also providing tremendous amounts of visibility into running processes.”

https://docs.temporal.io/blog/reliable-crypto-transactions-at-coinbase

36 of 62

Long Running Processes

Needs

Human-in-the-loop approval/triage
Expert labeling of ML metadata
Customer loyalty program - reward points over indefinite time
Customer engagement and threat detection

Checkr Quotes

“All new data sources incorporated into our background checks are now done via Temporal, and more specifically by choice of the engineering team working on it.”
“[Temporal] allows us to share workflow components with different teams.”

https://docs.temporal.io/blog/how-temporal-simplified-checkr-workflows

37 of 62

Data Pipelines

Needs

Machine learning training
Data aggregation & analytics
ETL between databases & warehouses

Descript Quotes

"We had one incident every week just on the transcription workflow because it was too complicated to maintain... we were afraid of doing any changes in that code path."

38 of 62

Infrastructure Provisioning

Needs

Likely intermittent failures
Polling for quick response (not cronjob)
Complex dynamic logic
Guarantee strong lock on specific resource

Examples

CI/CD services (e.g. Uber, Vercel)
Managed deployments (e.g. automated management, migration, and recovery of MySQL, Elasticsearch, Apache Cassandra, HashiCorp Consul)
Kubernetes provisioning (e.g. Banzai Cloud)

40 of 62

Temporal’s tenth rule

Any sufficiently complex distributed system contains an ad-hoc, undocumented, unscalable, and unreliable implementation of half of Temporal.

41 of 62

Use Temporal

Important work (reliability is important to you)

Don’t use Temporal

Not important (it’s fine if it fails occasionally)
Need very low latency

44 of 62

Cloud

temporal.io/cloud

Temporal Cloud is a fully managed cloud offering of Temporal Server. Why Cloud?

Updates: Automatic updates with latest releases
Experience: We have the most years of experience operating Temporal in production.
Scale: Our design partners are multi-billion dollar publicly listed companies.
Dependencies: No more managing dependencies like Elasticsearch or Cassandra.
Support: Dedicated channels and SLA for support and product feedback.

“The frequency of production incidents has declined from once-a-week to virtually zero.”

“Temporal Cloud provides Snap a highly reliable and scalable foundation for Snap Stories, enabling us to deliver the amazing global experience our users have come to expect.”

45 of 62

Engine
Task Queue
Timers
Transactions

49 of 62

Workflow Engine

Engine
Task Queue
Timers
Transactions
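One way to picture how these pieces might fit together is a toy engine in which timers, once due, place tasks on a queue that workers poll. The sketch below is plain TypeScript with a virtual clock. The `ToyEngine` class is hypothetical and says nothing about how Temporal Server is implemented; in particular, a real engine would need the "timer fired" and "task enqueued" updates to commit in one transaction against durable storage, which this in-memory version only imitates by doing both inside a single method call.

```typescript
interface PendingTimer {
  fireAt: number; // virtual time at which the timer is due
  task: string;   // task to enqueue when the timer fires
}

// Toy in-memory engine; names and structure are illustrative only.
class ToyEngine {
  private now = 0;
  private timers: PendingTimer[] = [];
  private readonly taskQueue: string[] = [];

  // Schedule `task` to be enqueued after `delayMs` of virtual time.
  startTimer(delayMs: number, task: string): void {
    this.timers.push({ fireAt: this.now + delayMs, task });
  }

  // Advance virtual time. Due timers are removed and their tasks enqueued
  // in firing order, all within this one call.
  advance(ms: number): void {
    this.now += ms;
    const due = this.timers
      .filter((t) => t.fireAt <= this.now)
      .sort((a, b) => a.fireAt - b.fireAt);
    this.timers = this.timers.filter((t) => t.fireAt > this.now);
    for (const timer of due) this.taskQueue.push(timer.task);
  }

  // Worker side: take the next task, or undefined if the queue is empty.
  poll(): string | undefined {
    return this.taskQueue.shift();
  }
}
```

Using a virtual clock keeps the example deterministic: tests advance time explicitly instead of waiting on real timers.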

50 of 62

Now… Scale This!

Multiple Hosts
Multiple Stores
Multiple Clusters
Batch Operations

51 of 62

CAP Theorem

A Temporal cluster is eventually available and highly consistent

Availability loss doesn't result in data loss, but in increased latency.
If persistence nodes are down, your Workflows will not progress, but the data will still be highly consistent.

Network failures are prevented from reaching the application level.

The optional multi-cluster replication feature greatly increases system availability.

Temporal as a Distributed System

+ Replication

52 of 62

Thank You

temporal.io/ts

temporal.io/subscribe

temporal.io/youtube

temporal.io/slack

lorensr.me

@lorendsr

loren@temporal.io

53 of 62

How would you write Uber?

Search
Pricing
Matching
Pickup
Dropoff
Rating
Tipping
Payment
Email

54 of 62

How would you write Uber?

Search
Pricing
Matching
Pickup
Dropoff
Rating
Tipping
Payment
Email

Cancellation

Change of Route

Driver Lost

Uber Pool

Refunds

56 of 62

React Code Organization

Components

https://www.freecodecamp.org/news/react-introduction-for-people-who-know-just-enough-jquery-to-get-by-2019-version-28a4b4316d1a/

61 of 62

What React did for Frontend Programming

Deterministic Renders
Local State (useState)
Reducing Boilerplate
Composition (Child Components)
Side Effects (useEffect)
Memoization (useMemo)
Normalization (Synthetic Events)
Devtools
Central Scheduler
