1 of 26

Deploying resilient microservices on Google Cloud

Michael Mekuleyi – CTO, Sendme.ng

[Ogbomoso]

2 of 26

Who am I?

  • Mobile Software Engineer
  • CTO, Sendme.ng
  • ex-Engineering Manager, Fyyne Inc
  • Lover of food and everything good
  • FinSec Enthusiast
  • Linkedin – linkedin.com/in/monarene
  • Twitter - @monnarene

3 of 26

Table of Contents

  1. What is resilience?
  2. Understanding Docker and Kubernetes
  3. Understanding resilience
  4. Maintaining resilience
  5. Chaos Engineering

4 of 26

What is resilience?

5 of 26

What is resilience?

To understand resilience, you must understand scale and availability.

  1. How do we scale web applications?
  2. What are the instruments of scale?
  3. How does scale affect availability?

6 of 26

What is resilience?

The basis of resilience is adaptability.

  • The ability to recover quickly from difficulty
  • The ability of a substance to spring back into shape
  • Sustaining conditions that help adapt to unpredictable events

7 of 26

Understanding Docker and Kubernetes

8 of 26

“It is not working on my laptop”

“It is working on my laptop”

“We will not give the customer your laptop”

9 of 26

While Docker is a container runtime, Kubernetes is a platform for running and managing containers from many container runtimes.

10 of 26

Google Kubernetes Engine (GKE) is used to implement Kubernetes orchestration on Google Cloud.

11 of 26

Understanding resilience

12 of 26

Understanding resilience

To manage resilience, you must understand mature traffic management: traffic control and traffic splitting with the Kubernetes API.

We will walk through several use cases and their solutions, covering both traffic control and traffic splitting.

13 of 26

Traffic control (sometimes called traffic routing or traffic shaping) refers to the act of controlling where traffic goes and how it gets there.

14 of 26

Understanding resilience (traffic control)

Use case: I want to protect services from receiving too many requests.

Solution: Rate limiting.

Rate limiting restricts the number of requests a user can make in a given time period. A request can be something as simple as a GET for the homepage of a website or a POST on a login form. When under a DDoS attack, for example, you can use rate limiting to limit the incoming request rate to a value typical of real users.
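The idea can be sketched as a token bucket, one common rate-limiting algorithm. This is a minimal illustration only – real gateways (e.g. NGINX) implement this through configuration rather than application code, and the `rate` and `capacity` parameters here are hypothetical:

```python
import time

class TokenBucket:
    """Allow up to `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False   # over the limit: reject (e.g. with HTTP 429)
```

A gateway applying this per client IP would admit bursts of up to `capacity` requests, then throttle to `rate` requests per second.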

15 of 26

Understanding resilience (traffic control)

Use case: I want to avoid cascading failures

Solution: Circuit breaking

Circuit breakers prevent cascading failure by monitoring for service failures. When the number of failed requests to a service exceeds a preset threshold, the circuit breaker trips and starts returning an error response to clients as soon as the requests arrive, effectively throttling traffic away from the service.
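A rough in-process sketch of the pattern (service meshes such as Istio provide this declaratively; the threshold and timeout values here are illustrative):

```python
import time

class CircuitBreaker:
    """Trip after `threshold` consecutive failures; fail fast until `reset_after` elapses."""

    def __init__(self, threshold: int = 3, reset_after: float = 30.0):
        self.threshold = threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                # Circuit open: return an error immediately, sparing the service.
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # half-open: let one trial request through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```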

16 of 26

Traffic splitting (sometimes called traffic testing) is a subcategory of traffic control and refers to the act of controlling the proportion of incoming traffic directed to different versions of a backend app running simultaneously in an environment (usually the current production version and an updated version).

17 of 26

Understanding resilience (traffic splitting)

Use case: I’m ready to test a new version in production

Solution: Debug routing

Debug routing lets you deploy a new version publicly yet “hide” it from actual users by allowing only certain users to access it, based on Layer 7 attributes such as a session cookie, session ID, or group ID. For example, you can allow access only to users who have an admin session cookie: their requests are routed to the new version while everyone else continues on the stable version.
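The routing decision itself is small. A sketch, assuming a hypothetical `session=admin` cookie value marks the test group (the version names are illustrative):

```python
def route(headers: dict) -> str:
    """Send requests carrying the admin session cookie to the new version;
    everyone else stays on the stable version."""
    cookie = headers.get("Cookie", "")
    if "session=admin" in cookie:
        return "app-v2"   # hidden new version
    return "app-v1"       # stable version
```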

18 of 26

19 of 26

Understanding resilience (traffic splitting)

Use case: I need to make sure my new version is stable.

Solution: Canary deployment

A typical canary deployment starts with a high share (say, 99%) of your users on the stable version and moves a tiny group (the other 1%) to the new version. If the new version fails – for example, by crashing or returning errors to clients – you can immediately move the test group back to the stable version. If it succeeds, you can switch users from the stable version to the new one, either all at once or (as is more common) in a gradual, controlled migration.
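Per request, the 99/1 split reduces to a weighted random choice. A minimal sketch (real platforms, e.g. a service mesh on GKE, express the weights declaratively rather than in code):

```python
import random

def pick_version(canary_percent: float, rng=random.random) -> str:
    """Route roughly canary_percent% of requests to the canary, the rest to stable."""
    return "canary" if rng() * 100 < canary_percent else "stable"
```

Rolling out gradually is then just raising `canary_percent` in steps (1 → 10 → 50 → 100) while watching error rates.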

20 of 26

21 of 26

Understanding resilience (traffic splitting)

Use case: I want to move my users to a new version without downtime.

Solution: Blue‑green deployment

Blue-green deployments greatly reduce, or even eliminate, downtime for upgrades. Simply keep the old version (blue) in production while simultaneously deploying the new version (green) alongside it in the same production environment.
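Conceptually, the cut-over is an atomic flip of a pointer between two fully deployed environments (in Kubernetes this is often done by changing a Service's label selector). A toy sketch with hypothetical version labels:

```python
class BlueGreenRouter:
    """Both environments stay deployed; only the `live` pointer moves."""

    def __init__(self):
        self.environments = {"blue": "v1", "green": "v2"}
        self.live = "blue"

    def handle(self, request: str) -> str:
        return f"{self.environments[self.live]} served {request}"

    def cut_over(self):
        # Instant switch – and the old environment stays warm for rollback.
        self.live = "green" if self.live == "blue" else "blue"
```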

22 of 26

23 of 26

Maintaining resilience

24 of 26

Maintaining Resilience

  • Define a steady state
  • Run outages as tests: inject chaos perturbations (say, a CPU hog at the pod level) using a tool like LitmusChaos
  • Observe and interpret results: note how the application performs during the chaos period
  • Identify the most vulnerable microservices during this exercise, recording infrastructure CPU, memory, network, and I/O metrics
  • Improve deployment parameters – the number of pods for some services, CPU/memory min/max values, etc.

25 of 26

Maintaining Resilience

  • Improve app config parameters – DB connection pool sizes, HTTP retry counts, timeouts, etc.
  • Improve application code, design, or architecture – replacing sync calls with async, exception handling, etc.
  • Implement chaos engineering with a dedicated tool. The main resources exercised in chaos engineering are CPU, memory, network, and I/O; inducing CPU and memory starvation, network outages, and storage failures are primary use cases. Good examples are LitmusChaos (https://litmuschaos.io/) and Chaos Monkey (https://netflix.github.io/chaosmonkey/)
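At their core, these tools inject faults and watch what breaks. A toy sketch of the idea (not how LitmusChaos or Chaos Monkey are actually invoked – they operate at the infrastructure level, not in-process):

```python
import random

def with_chaos(fn, failure_rate: float = 0.2, rng=random.random):
    """Wrap a call so it randomly raises, simulating an injected outage."""
    def wrapped(*args, **kwargs):
        if rng() < failure_rate:
            raise TimeoutError("injected fault")
        return fn(*args, **kwargs)
    return wrapped
```

Running your test suite against wrapped dependencies quickly reveals which services lack retries, timeouts, or fallbacks.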

26 of 26

The End

Connect with me on Twitter: monnarene