Microservices Gone Wrong

A journey of scale, pain, and misery

Anthony Ferrara - PHP Barcelona 2019

Microservices Gone Wrong

  • Background
  • Architecture
  • Infrastructure
  • Local Dev Experience
  • Getting It Running
  • Dealing With Change
  • Lessons Learned

Background:

The Team

A Growing Engineering Team

  • Grew from 5 to 23 in the prior year
  • 10 PHP Engineers, a few with Go experience
  • 4 Front End Engineers
  • 3 Data Engineers
  • 3 Dedicated DevOps Engineers (1 SRE, 2 Ops Engineers)
  • 2 Test Engineers

Background:

The Legacy System

A legacy system existed

  • Laravel on top of Zend Framework 1 on top of CodeIgniter on top of
    PHP-4 era code, with some Symfony and Cake mixed in
  • A Laravel microservice for permissions
  • 600+ Database Tables
  • 500,000+ lines of code
  • 1000+ cron jobs
    • Many that “corrected” the database every 30-60 minutes...

Background:

The Business Need

Strong business requirements existed

  • Demanded faster iterations of new features
  • Demanded significant refactoring of existing features
  • Demanded more scale
  • Demanded higher quality product, with fewer bugs
  • Had just ended a 9-month “product freeze” focused solely on technical debt

To Refactor?

Or To Rebuild?

The Refactor/Rebuild Test:

  • List every “assumption” the application makes (business rules)
    • Example: “An email address uniquely identifies a user”
  • Cross out each assumption that is local to only one part of code
    • Example: “If nothing outside the login system cares that email address is unique…”
  • Ask Product how many of the remaining assumptions it wants to break/change
    • Example: “Need to enable multiple users with the same (or no) email address”
  • If few: Refactor
  • If many: you’re not rebuilding, you’re building a new product...

We Decided

To Build A Second Product

And Throw Away The Existing Application After Migration

Main Drivers Of Decision:

  • Existing core assumptions the application was built on are no longer valid
  • Sheer unmaintainability of the existing code
  • Current Product Functionality Was Struggling To Scale

Microservices Gone Wrong

  • Background
  • Architecture
  • Infrastructure
  • Local Dev Experience
  • Getting It Running
  • Dealing With Change
  • Lessons Learned

The Architecture For V2

[Diagram: Frontend Server, API Gateway, Domain Services (each with its own DB), Async Services, and Meta Services, connected via RabbitMQ]

API Gateway

  • Using Tyk.io
  • Configurable API Management
    • Configurable Backends
    • Handles Versioning
    • Transforms Inputs
  • OAuth2 Termination
  • Rate Limiting & Quotas
  • Can Create Mock APIs
  • Supports Custom Middleware
  • Generates Request IDs

Public

https://api.example.com

  • OAuth 2 Bearer Tokens
  • Slowly Evolving REST API
  • Rate Limits
  • Unaware Of Individual Services

Private

https://service.internal

  • Custom Headers Indicating User And Role (see the sketch after this list)
  • Fast Evolving RPC APIs
  • Direct Service To Service
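As a minimal sketch of what the “custom headers” approach implies for a service on the private side (the header names X-User-Id and X-User-Role are invented for illustration; the talk doesn’t name them): Tyk terminates OAuth2 publicly, and internal services simply trust the identity headers the gateway injects.

    <?php
    // Hypothetical sketch: the header names X-User-Id / X-User-Role are invented.
    // The idea: Tyk terminates OAuth2 on the public side and forwards identity as
    // plain headers on the private side, so internal services never re-check tokens.

    function identityFromGatewayHeaders(array $headers): array
    {
        if (!isset($headers['X-User-Id'], $headers['X-User-Role'])) {
            // On the private network the gateway is expected to always set these.
            throw new RuntimeException('Missing identity headers from the API gateway');
        }

        return [
            'user_id' => $headers['X-User-Id'],
            'role'    => $headers['X-User-Role'],
        ];
    }

    $identity = identityFromGatewayHeaders(getallheaders());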

The Architecture For V2

[Diagram: the same components (API Gateway, Frontend Server, Domain/Async/Meta Services with their DBs), now with an Archive attached to RabbitMQ]

Rabbit MQ

  • Used as pseudo Event-Sourcing
  • All events are archived to S3
    • Not queryable by services
    • Tooling to “replay” events for ETL and Recovery
  • Each service keeps a local state
    • They listen to events to update it

“Event”

  • Sent on RabbitMQ
  • Used to indicate system state change
  • JSON

“RPC”

  • Sent on RabbitMQ
  • Used to request a big process to happen async
  • JSON
  • Replies are sent
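A minimal sketch of what publishing one of these JSON messages might look like with php-amqplib (the exchange name, routing key, and event shape are assumptions for illustration; the talk doesn’t specify them):

    <?php
    // Sketch using php-amqplib; the exchange, routing key, and event shape are
    // assumptions for illustration, not the actual conventions from the talk.
    require 'vendor/autoload.php';

    use PhpAmqpLib\Connection\AMQPStreamConnection;
    use PhpAmqpLib\Message\AMQPMessage;

    $connection = new AMQPStreamConnection('rabbitmq.internal', 5672, 'guest', 'guest');
    $channel    = $connection->channel();

    // An “Event” announces a state change; interested services update their local
    // state from it, and an archive consumer writes every event to S3 for replay.
    $event = json_encode([
        'type'        => 'lesson.updated',   // hypothetical event name
        'occurred_at' => date(DATE_ATOM),
        'payload'     => ['lesson_id' => 42, 'title' => 'Intro to Microservices'],
    ]);

    $channel->basic_publish(
        new AMQPMessage($event, ['content_type' => 'application/json', 'delivery_mode' => 2]),
        'events',          // exchange (hypothetical)
        'lesson.updated'   // routing key (hypothetical)
    );

    $channel->close();
    $connection->close();

RPC messages would presumably follow the same shape, plus a reply-to queue and a correlation ID so the reply can be routed back to the caller.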

The Design For V2

[Diagram: the same V2 layout, highlighting the three kinds of services]

Domain Service

  • Model business entities and domains
  • Have their own persistence
  • Communicate over HTTP


Async Service

  • Batch jobs / long-running jobs
  • Usually transformations
  • Communicate over RPC calls

Meta Service

  • Fills the gap between the domain and what a frontend / user needs
  • Aggregates other services to produce a singular result
  • Communicate over HTTP

Our Domain

[Diagram: six domain services (Asset, Content, Lesson, Lesson History, Assignment, User), each owning its corresponding entity]

Question: How would you get everything needed to show an assignment to the frontend?

Assignment Meta Service
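A rough sketch of what that meta service does (hostnames and endpoints below are invented; this is not the team’s actual code): fetch the assignment, then fan out to the other domain services over HTTP and stitch the results into one payload for the frontend, e.g. with Guzzle:

    <?php
    // Sketch of a meta-service aggregation with Guzzle; hostnames and paths are invented.
    require 'vendor/autoload.php';

    use GuzzleHttp\Client;
    use GuzzleHttp\Promise\Utils;

    function assignmentView(string $assignmentId): array
    {
        $http = new Client(['timeout' => 2.0]);

        // The assignment is fetched first: it tells us which lesson and user to load.
        $assignment = json_decode(
            (string) $http->get("http://assignments.internal/assignments/{$assignmentId}")->getBody(),
            true
        );

        // Fan the remaining calls out concurrently and wait for all of them.
        $responses = Utils::unwrap([
            'lesson'  => $http->getAsync("http://lessons.internal/lessons/{$assignment['lesson_id']}"),
            'user'    => $http->getAsync("http://users.internal/users/{$assignment['user_id']}"),
            'history' => $http->getAsync("http://lesson-history.internal/history?assignment={$assignmentId}"),
        ]);

        return [
            'assignment' => $assignment,
            'lesson'     => json_decode((string) $responses['lesson']->getBody(), true),
            'user'       => json_decode((string) $responses['user']->getBody(), true),
            'history'    => json_decode((string) $responses['history']->getBody(), true),
        ];
    }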

Question 2: How would you get a list of Lessons ordered by the Content Author’s name?

Microservices Gone Wrong

  • Background
  • Architecture
  • Infrastructure
  • Local Dev Experience
  • Getting It Running
  • Dealing With Change
  • Lessons Learned

Application Infrastructure

[Diagram of the Mesos stack: servers from AWS, DO, or GCE as the hardware layer; Mesos as the “kernel”; on top, the Mesos SDK (Java, Python, C++, Go), Marathon (services REST API) running Dockerized apps, Chronos (batch REST API) for recurring jobs (ETL, backups, scheduled jobs), and frameworks such as Spark, Hadoop, and Storm]

Network Infrastructure

[Diagram: an external ELB (services.example.com, identity.example.com) and an internal ELB (content.internal, auth.internal, lessons.internal, search.internal, users.internal) route to HAProxy/Marathon-LB instances, split into GROUP=external and GROUP=internal by Marathon, which in turn balance across the Tyk gateways and service containers running on the Mesos cluster nodes]

Logging Infrastructure

[Diagram: containers on each Mesos host are tailed by LogSpout via the Docker daemon; logs and metrics flow through Kafka and the Zipkin/StatsD collectors into Zipkin (distributed tracing, via Scribe), DataDog (metrics & log analytics), and S3 (log archive)]

Automation & Configuration (Planned & Partially Implemented): service.json

{
  "name": "users",
  "service_dependencies": [],
  "datastores": [
    {"type": "postgresql", "name": "DB"}
  ],
  "health": {
    "type": "http",
    "port": 80,
    "path": "/healthz"
  },
  "service": {"...": "..."},
  "deployment": {
    "min_instances": 3,
    "memory_limit": "512m",
    "cpu_limit": 2
  }
}

  • "name": creates the domain name, as well as the log prefix
  • "service_dependencies": controls spin-up order
  • "datastores": creates databases and adds ENV var configs
  • "health": configures health checks for the service
  • "service": configures the public service in Tyk
  • "deployment": tells Marathon about resources

Does this look familiar?

It’s basically what Kubernetes is today...
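As a hedged reconstruction (not the team’s actual tooling) of the kind of glue service.json was meant to drive, here is roughly how its deployment and health blocks could map onto a standard Marathon app definition:

    <?php
    // Sketch: translate service.json into a Marathon app payload.
    // The mapping is a reconstruction for illustration; only standard Marathon
    // fields (id, instances, cpus, mem, container, healthChecks, labels) are used.

    function marathonAppFromServiceJson(array $service, string $dockerImage): array
    {
        return [
            'id'        => '/' . $service['name'],
            'instances' => $service['deployment']['min_instances'],
            'cpus'      => $service['deployment']['cpu_limit'],
            'mem'       => (int) $service['deployment']['memory_limit'], // "512m" -> 512
            'container' => [
                'type'   => 'DOCKER',
                'docker' => ['image' => $dockerImage, 'network' => 'BRIDGE'],
            ],
            'healthChecks' => [[
                'protocol'  => strtoupper($service['health']['type']),   // HTTP
                'path'      => $service['health']['path'],               // /healthz
                'portIndex' => 0,
            ]],
            // Marathon-LB group label, as in the network diagram above.
            'labels' => ['HAPROXY_GROUP' => 'internal'],
        ];
    }

    $service = json_decode(file_get_contents('service.json'), true);
    $app     = marathonAppFromServiceJson($service, 'registry.internal/users:1.2.3');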

Microservices Gone Wrong

  • Background
  • Architecture
  • Infrastructure
  • Local Dev Experience
  • Getting It Running
  • Dealing With Change
  • Lessons Learned

What Was Built:

(What We Intended To Happen)

  • Command Line Tool would parse service.json and create a dynamic docker-compose file (see the sketch after this list)
  • Docker compose would build and launch all service dependencies
  • Service being worked on would mount to the local filesystem to allow for fast edits without redeploys
  • All work would happen in a production-like configured environment
  • Migrations happened automatically
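A sketch of what that compose-generation step might have looked like (my reconstruction under the assumptions above, using Symfony’s Yaml component; not the actual tool): the service under development gets a bind mount for fast edits, and each declared datastore becomes a dependent container plus an ENV var.

    <?php
    // Sketch of the planned local-dev tool: turn service.json into a docker-compose file.
    // This is a reconstruction for illustration, not the team's real implementation.
    require 'vendor/autoload.php';

    use Symfony\Component\Yaml\Yaml;

    $service = json_decode(file_get_contents('service.json'), true);
    $name    = $service['name'];

    $compose = ['version' => '3', 'services' => []];

    // The service being worked on mounts the local checkout for fast edits.
    $compose['services'][$name] = [
        'build'       => '.',
        'volumes'     => ['./:/var/www/html'],
        'environment' => [],
        'depends_on'  => [],
    ];

    // Each declared datastore becomes a container plus an ENV var pointing at it.
    foreach ($service['datastores'] as $store) {
        if ($store['type'] === 'postgresql') {
            $dbService = "{$name}-{$store['name']}";
            $compose['services'][$dbService] = ['image' => 'postgres:11'];
            $compose['services'][$name]['environment'][] = strtoupper($store['name']) . "_HOST={$dbService}";
            $compose['services'][$name]['depends_on'][]  = $dbService;
        }
    }

    file_put_contents('docker-compose.yml', Yaml::dump($compose, 4));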

What Was Used:

(What Actually Happened)

  • Each engineer ran their own service natively on their machine
  • When they needed another service as a dependency, they would ask another engineer for help getting it running on their machine
  • Dev environments were nothing like production environments
  • Migrations were usually forgotten about

Why?

The automated local-environment tool was unreliable, difficult to use, and excruciatingly SLOW.

But most of all, it was “someone else’s problem”: devs weren’t responsible for the tool, and so few committed to it.

Microservices Gone Wrong

  • Background
  • Architecture
  • Infrastructure
  • Local Dev Experience
  • Getting It Running
  • Dealing With Change
  • Lessons Learned

It took a MONTH

from the first few services being “good to go” to having them functioning in production

What Failed?

Why was it so hard to get running?

  • ENV differences took a few days to sort out
  • Infra was built to an idealized service behavior
  • Services were built to idealized infra behavior
  • Services were built in isolation and not explicitly to interact with each other
  • Engineers had moved on to other services by the time infra challenges were sorted
  • APIs were evolving rapidly, services not as rapidly

Learning How To Run It

  • Most deployments required high levels of coordination
    • Each service had a separate Git repository, using SemVer and GitFlow
    • Since each service was deployed separately, coordinating releases and “working configurations” became tedious
  • RabbitMQ events were missing data or structure
    • As services evolved, events changed, so anything replaying events had to stay backwards compatible with the old event formats
    • There was no mechanism to test or enforce that compatibility
    • We investigated switching to Avro
  • Getting a local/staging environment to a known state was exceedingly challenging
    • Each service had its own DB, so there was no easy way to get into a known state
    • We worked on a tool that took a YAML state file and configured services to match it
  • Lack of circuit breakers led to cascading failures that were hard to detect and debug (see the sketch after this list)
    • Any service failure would cascade and take down the services that depended on it, and not gracefully
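For that last point, a minimal illustration of the missing piece: a circuit breaker that fails fast once a dependency has clearly gone down, instead of letting every caller hang and drag its own dependents down with it. Thresholds, timing, and the in-memory state here are arbitrary choices for the sketch.

    <?php
    // Illustrative circuit breaker; thresholds, timing, and in-memory storage are arbitrary.

    final class CircuitBreaker
    {
        private int $failures = 0;
        private int $openedAt = 0;

        public function __construct(
            private int $failureThreshold = 5,
            private int $cooldownSeconds  = 30,
        ) {}

        /** @param callable(): mixed $call */
        public function call(callable $call)
        {
            if ($this->isOpen()) {
                // Fail fast instead of waiting on a dependency we already know is down.
                throw new RuntimeException('Circuit open: dependency unavailable');
            }

            try {
                $result = $call();
                $this->failures = 0;   // success closes the breaker again
                return $result;
            } catch (Throwable $e) {
                if (++$this->failures >= $this->failureThreshold) {
                    $this->openedAt = time();
                }
                throw $e;
            }
        }

        private function isOpen(): bool
        {
            if ($this->failures < $this->failureThreshold) {
                return false;
            }
            // Half-open after the cooldown: allow a single probe call through.
            if (time() - $this->openedAt >= $this->cooldownSeconds) {
                $this->failures = $this->failureThreshold - 1;
                return false;
            }
            return true;
        }
    }

Usage would be wrapping every cross-service call, e.g. $breaker->call(fn () => $http->get('http://lessons.internal/lessons/42'));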

Microservices Gone Wrong

  • Background
  • Architecture
  • Infrastructure
  • Local Dev Experience
  • Getting It Running
  • Dealing With Change
  • Lessons Learned

The Product Was Changing

As It Was Being Built...

Partially due to feedback from clients as we got it in front of them

Partially due to involvement of Executive stakeholders

Partially due to us simply not getting it right the first time

Our Original Domain

[Diagram: a content tree of Program → Topics → Lessons → Cards → Assets]

Our Original Domain

[Diagram: the same Program → Topic → Lesson → Card → Asset tree, now with Milestones added to the hierarchy]

Our Domain

[Diagram: the six domain services (Asset, Content, Lesson, Lesson History, Assignment, User) plus the Assignment Meta Service]

How would you refactor this to change a hierarchy?

Simple refactoring or ETL operations became major coordinated surgery.

Microservices Gone Wrong

  • Background
  • Architecture
  • Infrastructure
  • Local Dev Experience
  • Getting It Running
  • Dealing With Change
  • Lessons Learned

Service Calls Are Unreliable

Many Orders Of Magnitude Less Reliable Than A Method Call...

How often do you expect a method call to fail randomly?

1 in 1,000,000,000?

1 in 1,000,000,000,000,000,000?

1 in ∞? (practically speaking)

How about network calls?

99.999% uptime = 1 in 100,000
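To make the compounding concrete (numbers are illustrative): if serving one request requires n network calls that each succeed with probability p, the request succeeds with probability p^n, so per-call “five nines” erodes quickly as services get smaller and call chains get longer.

    <?php
    // Illustrative math only: success probability of a request that needs n network calls.
    $perCall = 0.99999;                 // 99.999% per call
    foreach ([1, 10, 50, 100] as $n) {
        printf("%3d calls: %.4f%% success\n", $n, 100 * ($perCall ** $n));
    }
    // 1 call ~ 99.999%, 100 calls ~ 99.9%: the failure rate grows roughly linearly with n.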

Therefore: How small should you build your services?

What We Should Have Done

[Diagram: just three coarser services (Asset, User, Lesson) covering the same entities: Users, Lesson, Content, Asset, Assignment, Lesson History]

Lessons Learned

  • Don’t Do Microservices
    • Unless you have a dedicated tooling and automation team (SRE, Ops, etc)

  • Start With Big Services
    • It’s easier to split a large service than to stitch together two small ones

  • Automate Everything
    • Spin-Up, Deployment, Migration, Backup, State Restoration, etc

  • Don’t Plan For Failure, Live It
    • Failure modes should be built first, tested first, and relied upon.

  • Define SLOs Early
    • Explicitly define the business objectives of each service and system before building, and codify them as metrics and tested expectations.

Managing complexity is vital to long-term success.

It Is Too Easy To Create A System So Complicated That You Cannot Understand It...

And if you can’t understand it, how can you hope to run and maintain it?

Fin

Anthony Ferrara

@ircmaxell

me@ircmaxell.com
https://blog.ircmaxell.com
