1 of 32

Study on

High Availability & Fault Tolerance Application

Feb 20, 2023

Norman Kong Koon Kit and Michal Aibin

Edge Computing, Cloud Computing, and Big Data

International Conference on Computing, Networking and Communications (ICNC 2023)

2 of 32

Contents

Motivation

Problem Overview

Approach & Result

Practical Implementation

Conclusion

3 of 32

Motivation

4 of 32

Motivation

  • E-commerce and remote working have been the new normal since the pandemic.

  • The e-commerce market grew roughly fourfold, from $1,336 billion in 2014, and is expected to reach $7,391 billion over 10 years.

  • Remote job postings kept increasing, from about 30,000 in 2018 to about 60,000 in 2021.


5 of 32

Motivation

As more and more critical systems, such as online payment and air traffic control, are put online,

it is crucial to design applications with high availability and fault tolerance; otherwise, the consequences will be disastrous.

In this research, we take a deep dive into application design to improve overall availability and fault tolerance.

6 of 32

Problem Overview

7 of 32

Problem Overview


The term “Availability” is defined as “the degree to which a system is functioning and is accessible to deliver its services during a given time interval”.

Maximum downtime percentage over a given period

Source: Service Availability: Principles and Practice (Maria Toeroe, Francis Tam, 2012)

High Availability

8 of 32

Problem Overview

  • Availability is a measure of the percentage of time that an application is running properly, i.e.

Availability = Uptime / (Uptime + Downtime)

To further break down the formula, we define uptime as the period during which the application serves requests correctly, and downtime as the period during which it does not.
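As a quick numerical sketch of this definition (the function name and the example figures are illustrative, not taken from the paper):

```javascript
// Availability = uptime / (uptime + downtime), in any consistent time unit.
function availability(uptime, downtime) {
  return uptime / (uptime + downtime);
}

// "Four nines": 9999 s up for every 1 s down.
console.log(availability(9999, 1)); // 0.9999

// Downtime budget implied by a target: a 99.99% SLA over a 30-day
// month allows (1 - 0.9999) * 30 * 24 * 3600, roughly 259 s of downtime.
console.log((1 - 0.9999) * 30 * 24 * 3600);
```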

9 of 32

Problem Overview

MTBF = Mean Time Between Failures
MTTR = Mean Time To Restore/Recover

Availability = MTBF / (MTBF + MTTR)

To improve availability, we can either increase MTBF or decrease MTTR.

In this research, we focus on decreasing MTTR as much as possible.
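A minimal sketch of the MTBF/MTTR form of the formula; the 720-hour MTBF and the two MTTR values below are illustrative assumptions, chosen only to show how shrinking MTTR lifts availability:

```javascript
// Availability = MTBF / (MTBF + MTTR), both in the same time unit.
function availabilityMtbf(mtbfHours, mttrHours) {
  return mtbfHours / (mtbfHours + mttrHours);
}

// Assumed: one failure a month (MTBF = 720 h).
console.log(availabilityMtbf(720, 1));    // ~0.99861 (1 h to recover)
console.log(availabilityMtbf(720, 0.01)); // ~0.99999 (36 s to recover)
```

Holding MTBF fixed, cutting MTTR a hundredfold moves this system from roughly two nines to roughly five.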

10 of 32

Problem Overview

MTTR (Mean Time to Restore) can be further broken down into:

  • Mean Time to Detect (MTD) the failure
  • Mean Time to Fix (MTF) the failure

This research focuses on minimizing the MTD (Mean Time to Detect).

11 of 32

Problem Overview

Minimize the Time to Detect as much as possible

12 of 32

Industrial Practice

To ensure the underlying application is up and running, a typical load balancer performs a “pull-based” health check (probe) periodically, with a timeout and retries before declaring a node unhealthy; in the worst case this takes up to 14 seconds in AWS.
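The worst case can be estimated with a simple model; the 5 s probe interval, 3 s timeout, and 3 retries below are illustrative assumptions chosen to land near the 14-second figure, not actual AWS defaults:

```javascript
// Worst case for a pull-based probe: the node dies just after passing a
// check, so detection waits one full probe interval, and then `retries`
// probes must each time out before the node is declared unhealthy.
function worstCaseDetectMs(intervalMs, timeoutMs, retries) {
  return intervalMs + retries * timeoutMs;
}

console.log(worstCaseDetectMs(5000, 3000, 3)); // 14000
```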

13 of 32

How can we improve it ?

14 of 32

The Approach

15 of 32

The Approach

Instead of a “pull-based” health check, we propose a pluggable “Resource Agent”:

  • Push-based health check
  • Persistent connection

1) The Resource Agent keeps a persistent connection to the “Load Agent”.

2) Node information (CPU / memory / network IO) is sent periodically.

3) The “Load Balancer” runs a “Resource Awareness Routing Algorithm” to dispatch client requests.

Note: When a worker node crashes, this persistent connection is lost, so the “Load Agent” updates the Load Balancer with minimal delay.

16 of 32

Minimize "Time to Detect"

Instead of a pull-based health check, we use a push-based check over a persistent connection to improve the time to detect.

Traditional Approach

New Approach

17 of 32

Components in this Proposal

There are 4 major components involved:

  1. Load Agent

Waits for worker nodes to connect and updates node information in the cache.

  2. Resource Agent (embedded in each worker node)

Establishes a persistent connection to the Load Agent and pushes its status periodically.

  3. Load Balancer

Runs the “Resource Awareness Routing Algorithm” to dispatch requests to worker nodes.

  4. Recovery Agent

Performs recovery actions when the Load Agent detects an outage.

Note: The Recovery Agent minimizes the MTF (Mean Time to Fix) rather than the MTD.

18 of 32

Resource Agent

Load Agent

19 of 32


Recovery Agent

Load Balancer

20 of 32

Resource Awareness Routing Algorithm

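The slide's algorithm details did not survive extraction, so the following is a hypothetical reconstruction of a resource-aware least-load policy; the 0.5/0.3/0.2 weights and the field names are assumptions, not the paper's actual parameters:

```javascript
// Resource-aware routing sketch: score each healthy node by a weighted
// mix of CPU, memory, and network IO utilization; lowest score wins.
function pickNode(nodes) {
  const score = (n) => 0.5 * n.cpu + 0.3 * n.mem + 0.2 * n.netIo;
  return nodes
    .filter((n) => n.healthy) // skip nodes the Load Agent marked down
    .reduce((best, n) => (score(n) < score(best) ? n : best));
}

const nodes = [
  { name: 'worker-1', cpu: 0.80, mem: 0.60, netIo: 0.50, healthy: true },
  { name: 'worker-2', cpu: 0.20, mem: 0.30, netIo: 0.10, healthy: true },
  { name: 'worker-3', cpu: 0.10, mem: 0.10, netIo: 0.05, healthy: false },
];
console.log(pickNode(nodes).name); // 'worker-2'
```

Note that worker-3 has the lowest load but is excluded because its persistent connection is down; the push-based status feed and the routing policy work together.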

21 of 32

22 of 32

Our Sample Implementation

  • Applications are implemented in NodeJS, with WebSocket as the persistent connection
  • Redis is used as the in-memory datastore and message-queue broker
  • All components are deployed as Docker containers, except the Recovery Agent
  • The Recovery Agent executes docker commands to perform recovery actions

23 of 32

The Result

24 of 32

Test Result 1 - Log Analysis

In our proposed scenario, we ran 1,000 system-crash simulations; the average “Time to Detect” is

24.5 milliseconds

Recall that the AWS worst-case scenario is 14 seconds, roughly 570 times slower!

25 of 32

Test Result 1 - Log Analysis

MTD ≈ 37 ms

MTF ≈ 2.8 s

26 of 32

Test Result 2 - Compare overall SLA

  1. Deployed a similar application to AWS EC2
  2. Compared it with the proposed framework

To simulate a system crash, a “kill switch” is implemented that crashes the application after a pre-defined duration:

  #   Kill Switch Frequency
  1   every 5 minutes of execution
  2   every 10 minutes of execution
  3   every 20 minutes of execution

27 of 32

Test Result 2 - Compare overall SLA

                     Duration (s)   Uptime (s)   Downtime (s)   SLA (%)     Fail Count
  AWS - 5 min          2105.901      2020.234       85.667      95.93204      2,080
  AWS - 10 min         2203.670      2131.404       72.266      96.72067      1,518
  AWS - 20 min         2089.374      2048.838       40.536      98.05992      1,021
  Propose - 5 min      2943.944      2936.602        7.343      99.75058      3,990
  Propose - 10 min     2970.906      2970.883        0.023      99.99923        600
  Propose - 20 min     3005.539      3004.165        1.374      99.95429        308

Disclaimer: The AWS Load Balancer runs on virtual machines while our simulation runs on Docker; the recovery time for a VM is much higher than for a Docker container.

28 of 32

Test Result 2 - Compare overall SLA

Meets the High Availability SLA!

29 of 32

Practical Implementation

30 of 32

Practical Implementation

Since the push-based mechanism with a persistent connection consumes more CPU resources, this framework is best suited to mission-critical applications such as:

  1. Securities trading systems
  2. Banking systems
  3. Air traffic control systems

31 of 32

Conclusion

32 of 32

Conclusion

  • High availability has been one of the biggest challenges in application design

  • Depending on the use case, various techniques can improve service availability

  • This research proposes a push-based mechanism with a persistent connection to reduce the “Time to Detect”, so that the overall Service Level Agreement can be improved
