1 of 62

Building Reliable Distributed Systems

Craft Conference, June 2, 2022

Budapest

2 of 62

Loren Sands-Ramshaw

Author of The GraphQL Guide

Full-stack developer

Language Runtime Engineer

@ Temporal.io

3 of 62

  • Distributed systems
  • Reliable distributed systems
  • Why use Temporal
  • How to use Temporal

4 of 62

5 of 62

6 of 62

The TL;DR Intro

7 of 62

Temporal is the open source runtime for managing distributed application state at scale.

What is Temporal?

8 of 62

Why?

The most valuable, mission-critical workloads in any software company are long-running and tie together multiple services.

9 of 62

System Requirements

  • Because this work is complex:
    • You want to easily model dynamic asynchronous logic...
    • ...and reuse, test, version and migrate it.
  • Because this work relies on unreliable systems:
    • You want to standardize timeouts and retries.
    • You want to offer "reliability on rails" to every team.
  • Because this work is so important:
    • You must never drop any work.
    • You must log all progress.
    • You must be able to scale it up without replatforming.

Orchestration

Event Sourcing

Workflows as Code
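To make "standardize timeouts and retries" concrete, here is a minimal TypeScript sketch of a single retry policy shared by every call to an unreliable system. It is plain TypeScript, not the Temporal SDK: the `RetryPolicy` shape and the `withTimeout` and `runWithRetries` helpers are hypothetical names chosen for illustration.

```typescript
// Hypothetical policy shape for illustration; not the Temporal SDK's types.
interface RetryPolicy {
  maxAttempts: number;        // total attempts, including the first
  initialIntervalMs: number;  // wait before the second attempt
  backoffCoefficient: number; // multiplier applied to the wait after each failure
  timeoutMs: number;          // per-attempt timeout
}

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms,
    );
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); },
    );
  });
}

// Run `task` under the policy: each attempt is time-boxed, and failures
// are retried with exponential backoff until attempts are exhausted.
async function runWithRetries<T>(
  task: () => Promise<T>,
  policy: RetryPolicy,
): Promise<T> {
  let delayMs = policy.initialIntervalMs;
  let lastError: unknown;
  for (let attempt = 1; attempt <= policy.maxAttempts; attempt++) {
    try {
      return await withTimeout(task(), policy.timeoutMs);
    } catch (err) {
      lastError = err;
      if (attempt < policy.maxAttempts) {
        await sleep(delayMs);
        delayMs *= policy.backoffCoefficient;
      }
    }
  }
  throw lastError;
}
```

A helper like this only survives as long as the process running it. The requirement that you "must never drop any work" is what pushes the retry state out of process memory and into an orchestrator.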

10 of 62

Status Quo

choreography

Temporal

orchestration

Commands

Queries

Cloud/Platform

App Devs

11 of 62

12 of 62

Microservices Death Star

13 of 62

14 of 62

15 of 62

16 of 62

17 of 62

Choreography vs Orchestration

18 of 62

Programming Model

19 of 62

Programming Model

20 of 62

Workflow

21 of 62

Activity

22 of 62

Timeouts and Retries

23 of 62

24 of 62

25 of 62

26 of 62

Workflow APIs

27 of 62

Monthly Billing

28 of 62

Signals and Queries

29 of 62

30 of 62

  • Core APIs
  • Advanced APIs
    • Workflow APIs (Timers, Signals, Queries, Child/External WFs, continueAsNew, SideEffects)
    • Activity APIs (Retries, Timeouts, Heartbeating, Cancellation)
    • Visibility APIs
    • Performance APIs (Local Activities)
  • Security
    • mTLS (AuthN)
    • Authorizer (AuthZ)
    • DataConverter
    • Namespaces
  • Maintenance
    • Testing
    • Versioning & Replay
  • Production
    • Logging
    • Monitoring/Metrics
  • Experimental
    • Archival
    • Multi-Cluster

31 of 62

Server

DevTools

SDKs

tctl CLI

32 of 62

33 of 62

Status Quo

choreography

Temporal

orchestration

Commands

Queries

Cloud/Platform

App Devs

34 of 62

Outcomes

  • More reliable
    • Fail to execute/drop data less often: from 1 production incident a week to ~0
    • When parts of the application do fail, always recover to a consistent state
  • More productive
    • 40-60% fewer lines of code and infra when writing features
    • DistSys/Orchestration concerns outsourced to Temporal
  • Easier to operate
    • Temporal consolidates errors, lets you make fixes without downtime
    • Event sourced system is highly observable by default
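The observability claim follows from how event sourcing works: state is never stored directly but is rebuilt by replaying an append-only log of events, so the full history of a workflow is always available for inspection. The sketch below illustrates the idea in plain TypeScript. The event types, `OrderState`, `apply`, and `replay` are hypothetical names for a toy order process and are not Temporal's actual event schema.

```typescript
// Hypothetical event types for a toy order process; not Temporal's event schema.
type WorkflowEvent =
  | { type: "OrderPlaced"; orderId: string; amountCents: number }
  | { type: "ActivityFailed"; activity: string; attempt: number }
  | { type: "PaymentCharged"; amountCents: number }
  | { type: "OrderShipped" };

interface OrderState {
  status: "none" | "placed" | "paid" | "shipped";
  paidCents: number;
  failures: number;
}

const initialState: OrderState = { status: "none", paidCents: 0, failures: 0 };

// Pure function: given the previous state and one event, return the next state.
function apply(state: OrderState, event: WorkflowEvent): OrderState {
  switch (event.type) {
    case "OrderPlaced":
      return { ...state, status: "placed" };
    case "ActivityFailed":
      return { ...state, failures: state.failures + 1 };
    case "PaymentCharged":
      return {
        ...state,
        status: "paid",
        paidCents: state.paidCents + event.amountCents,
      };
    case "OrderShipped":
      return { ...state, status: "shipped" };
  }
}

// Rebuild the current state from scratch by folding over the event history.
function replay(history: readonly WorkflowEvent[]): OrderState {
  return history.reduce(apply, initialState);
}

const history: WorkflowEvent[] = [
  { type: "OrderPlaced", orderId: "order-1", amountCents: 1999 },
  { type: "ActivityFailed", activity: "chargeCard", attempt: 1 },
  { type: "PaymentCharged", amountCents: 1999 },
  { type: "OrderShipped" },
];

const currentState = replay(history);
```

Because `apply` is deterministic, replaying the same history after a crash yields the same state, and replaying a prefix of the history shows what the state was at any earlier point.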

35 of 62

Business Transactions

Needs

  • Handling subscriptions, installment payments, communications reliably
  • Integrating multiple payment systems & ecommerce backends
  • Detecting/Triaging suspicious activity

Coinbase Quotes

“Temporal maintains the high level of reliability offered by the homegrown system while also providing tremendous amounts of visibility into running processes.”

36 of 62

Long Running Processes

Needs

  • Human-in-the-loop approval/triage
  • Expert labeling of ML metadata
  • Customer loyalty program - reward points over indefinite time
  • Customer engagement and threat detection

Checkr Quotes

  • “All new data sources incorporated into our background checks are now done via Temporal, and more specifically by choice of the engineering team working on it.”
  • “[Temporal] allows us to share workflow components with different teams.”

37 of 62

Data Pipelines

Needs

  • Machine learning training
  • Data aggregation & analytics
  • ETL between databases & warehouses

Descript Quotes

"We had one incident every week just on the transcription workflow because it was too complicated to maintain... we were afraid of doing any changes in that code path."

38 of 62

Infrastructure Provisioning

Needs

  • Likely intermittent failures
  • Polling for quick response (not cronjob)
  • Complex dynamic logic
  • Guarantee strong lock on specific resource

Examples

  • CI/CD services (e.g. Uber, Vercel)
  • Managed Deployments (e.g. automated management, migration, and recovery of MySQL, Elasticsearch, Apache Cassandra, HashiCorp Consul)
  • Kubernetes provisioning (e.g. Banzai Cloud)

39 of 62

40 of 62

Temporal’s tenth rule

Any sufficiently complex distributed system contains an ad-hoc, undocumented, unscalable, and unreliable implementation of half of Temporal.

41 of 62

Use Temporal

  • Important work (reliability is important to you)

Don’t use Temporal

  • Not important (it’s fine if it fails occasionally)
  • Need very low latency

42 of 62

Team

43 of 62

Community

44 of 62

Cloud

Temporal Cloud is a fully managed cloud offering of Temporal Server. Why Cloud?

  • Updates: Automatic updates with latest releases
  • Experience: We have the most years of experience operating Temporal in production.
  • Scale: Our design partners are multi-billion dollar publicly listed companies.
  • Dependencies: No more managing dependencies like Elasticsearch or Cassandra.
  • Support: Dedicated channels and SLA for support and product feedback.

“The frequency of production incidents has declined from once-a-week to virtually zero.”

“Temporal Cloud provides Snap a highly reliable and scalable foundation for Snap Stories, enabling us to deliver the amazing global experience our users have come to expect.”

45 of 62

  • Engine
  • Task Queue
  • Timers
  • Transactions

46 of 62

  • Engine
  • Task Queue
  • Timers
  • Transactions

47 of 62

  • Engine
  • Task Queue
  • Timers
  • Transactions

48 of 62

  • Engine
  • Task Queue
  • Timers
  • Transactions

49 of 62

Workflow Engine

  • Engine
  • Task Queue
  • Timers
  • Transactions

50 of 62

Now… Scale This!

  • Multiple Hosts
  • Multiple Stores
  • Multiple Clusters
  • Batch Operations

51 of 62

CAP Theorem

A Temporal cluster is eventually available and highly consistent

  • Availability loss doesn't result in data loss, but in increased latency.
  • If persistence nodes are down, your Workflows will not progress, but the data will still be highly consistent.

Network failures are prevented from reaching the application level.

The optional multi-cluster replication feature greatly increases system availability.

Temporal as a Distributed System

+ Replication

52 of 62

Thank You

temporal.io/ts

temporal.io/subscribe

temporal.io/youtube

temporal.io/slack

lorensr.me

@lorendsr

loren@temporal.io

53 of 62

How would you write Uber?

    • Search
    • Pricing
    • Matching
    • Pickup
    • Dropoff
    • Rating
    • Tipping
    • Payment
    • Email

54 of 62

How would you write Uber?

    • Search
    • Pricing
    • Matching
    • Pickup
    • Dropoff
    • Rating
    • Tipping
    • Payment
    • Email

Cancellation

Change of Route

Driver Lost

Uber Pool

Refunds

55 of 62

56 of 62

React Code Organization

=

Components

57 of 62

58 of 62

59 of 62

60 of 62

61 of 62

What React did for Frontend Programming

  • Deterministic Renders
  • Local State (useState)
  • Reducing Boilerplate
  • Composition (Child Components)
  • Side Effects (useEffect)
  • Memoization (useMemo)
  • Normalization (Synthetic Events)
  • Devtools
  • Central Scheduler

62 of 62