1 of 32

Study on

High Availability & Fault Tolerance Application

Feb 20, 2023

Norman Kong Koon Kit and Michal Aibin

Edge Computing, Cloud Computing, and Big Data

International Conference on Computing, Networking and Communications (ICNC 2023)

2 of 32

Contents

Motivation

Problem Overview

Approach & Result

Practical Implementation

Conclusion

3 of 32

Motivation

4 of 32

Motivation

  • E-commerce and remote working have been the new normal since the pandemic.

  • The e-commerce market grew roughly fourfold, from $1,336 billion in 2014, and is expected to reach $7,391 billion over 10 years.

  • Remote job postings kept increasing, from about 30,000 in 2018 to about 60,000 in 2021.


5 of 32

Motivation

As more and more critical systems, such as online payment and air traffic control, are put online,

it is crucial to design applications with high availability and fault tolerance; otherwise, the consequences will be disastrous.

In this research, we take a deep dive into application design to improve overall availability and fault tolerance.

6 of 32

Problem Overview

7 of 32

Problem Overview


The term “Availability” is defined as “the degree to which a system is functioning and is accessible to deliver its services during a given time interval”.

Maximum downtime percentage over a given period

Source: Service Availability: Principles and Practice (Maria Toeroe, Francis Tam, 2012)

High Availability

8 of 32

Problem Overview

  • Availability is a measure of the percentage of time that an application is running properly, i.e.

Availability = Uptime / (Uptime + Downtime)

To further break down the formula, we define uptime as the period during which the application serves requests correctly, and downtime as the period during which it does not.
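As a quick numerical sketch of this definition (the function name and the example figures are illustrative, not taken from the paper):

```javascript
// Availability = uptime / (uptime + downtime), in any consistent time unit.
function availability(uptime, downtime) {
  return uptime / (uptime + downtime);
}

// "Four nines": 9999 s up for every 1 s down.
console.log(availability(9999, 1)); // 0.9999

// Downtime budget implied by a target: a 99.99% SLA over a 30-day
// month allows (1 - 0.9999) * 30 * 24 * 3600, roughly 259 s of downtime.
console.log((1 - 0.9999) * 30 * 24 * 3600);
```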

9 of 32

Problem Overview

MTBF = Mean Time Between Failures
MTTR = Mean Time To Restore/Recover

Availability = MTBF / (MTBF + MTTR)

To improve availability, we can either increase MTBF or decrease MTTR.

In this research, we focus on decreasing MTTR as much as possible.
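A minimal sketch of the MTBF/MTTR form of the formula; the 720-hour MTBF and the two MTTR values below are illustrative assumptions, chosen only to show how shrinking MTTR lifts availability:

```javascript
// Availability = MTBF / (MTBF + MTTR), both in the same time unit.
function availabilityMtbf(mtbfHours, mttrHours) {
  return mtbfHours / (mtbfHours + mttrHours);
}

// Assumed: one failure a month (MTBF = 720 h).
console.log(availabilityMtbf(720, 1));    // ~0.99861 (1 h to recover)
console.log(availabilityMtbf(720, 0.01)); // ~0.99999 (36 s to recover)
```

Holding MTBF fixed, cutting MTTR a hundredfold moves this system from roughly two nines to roughly five.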

10 of 32

Problem Overview

MTTR (Mean Time to Restore) can be further broken down into:

  • Mean Time to Detect (MTD) the failure
  • Mean Time to Fix (MTF) the failure

This research focuses on minimizing the MTD (Mean Time to Detect).

11 of 32

Problem Overview

Minimize the Time to Detect as much as possible

12 of 32

Industrial Practice

To ensure the underlying application is up and running, a typical load balancer performs a “pull-based” health check (probe) periodically, with a timeout and retries before declaring a node unhealthy; in the worst case this takes up to 14 seconds in AWS.
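The worst case can be estimated with a simple model; the 5 s probe interval, 3 s timeout, and 3 retries below are illustrative assumptions chosen to land near the 14-second figure, not actual AWS defaults:

```javascript
// Worst case for a pull-based probe: the node dies just after passing a
// check, so detection waits one full probe interval, and then `retries`
// probes must each time out before the node is declared unhealthy.
function worstCaseDetectMs(intervalMs, timeoutMs, retries) {
  return intervalMs + retries * timeoutMs;
}

console.log(worstCaseDetectMs(5000, 3000, 3)); // 14000
```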

13 of 32

How can we improve it ?

14 of 32

The Approach

15 of 32

The Approach

Instead of a “pull-based” health check, we propose a pluggable “Resource Agent”:

  • Push-based health check
  • Persistent connection

1) The Resource Agent keeps a persistent connection to the “Load Agent”.

2) Node information (CPU / memory / network IO) is sent periodically.

3) The “Load Balancer” runs a “Resource Awareness Routing Algorithm” to dispatch client requests.

Note: When a worker node crashes, this persistent connection is lost, so the “Load Agent” updates the Load Balancer with minimal delay.

16 of 32

Minimize "Time to Detect"

Instead of a pull-based health check, we use a push-based check over a persistent connection to improve the time to detect.

Traditional Approach

New Approach

17 of 32

Components in this Proposal

There are 4 major components involved:

  1. Load Agent

Waits for worker nodes to connect and updates node information in the cache.

  2. Resource Agent (embedded in each worker node)

Establishes a persistent connection to the Load Agent and pushes its status periodically.

  3. Load Balancer

Runs the “Resource Awareness Routing Algorithm” to dispatch requests to worker nodes.

  4. Recovery Agent

Performs recovery actions when the Load Agent detects an outage.

Note: The Recovery Agent minimizes the MTF (Mean Time to Fix) rather than the MTD.

18 of 32

Resource Agent

Load Agent

19 of 32


Recovery Agent

Load Balancer

20 of 32

Resource Awareness Routing Algorithm

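The slide's algorithm details did not survive extraction, so the following is a hypothetical reconstruction of a resource-aware least-load policy; the 0.5/0.3/0.2 weights and the field names are assumptions, not the paper's actual parameters:

```javascript
// Resource-aware routing sketch: score each healthy node by a weighted
// mix of CPU, memory, and network IO utilization; lowest score wins.
function pickNode(nodes) {
  const score = (n) => 0.5 * n.cpu + 0.3 * n.mem + 0.2 * n.netIo;
  return nodes
    .filter((n) => n.healthy) // skip nodes the Load Agent marked down
    .reduce((best, n) => (score(n) < score(best) ? n : best));
}

const nodes = [
  { name: 'worker-1', cpu: 0.80, mem: 0.60, netIo: 0.50, healthy: true },
  { name: 'worker-2', cpu: 0.20, mem: 0.30, netIo: 0.10, healthy: true },
  { name: 'worker-3', cpu: 0.10, mem: 0.10, netIo: 0.05, healthy: false },
];
console.log(pickNode(nodes).name); // 'worker-2'
```

Note that worker-3 has the lowest load but is excluded because its persistent connection is down; the push-based status feed and the routing policy work together.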

21 of 32

22 of 32

Our Sample Implementation

  • Applications are implemented in NodeJS, with WebSocket as the persistent connection
  • Redis is used as the in-memory datastore and message-queue broker
  • All components are deployed as Docker containers, except the Recovery Agent
  • The Recovery Agent executes docker commands to perform recovery actions

23 of 32

The Result

24 of 32

Test Result 1 - Log Analysis

In our proposed scenario, we ran 1,000 system-crash simulations; the average “Time to Detect” is

24.5 milliseconds

Recall that the AWS worst-case scenario is 14 seconds, roughly 570 times slower!

25 of 32

Test Result 1 - Log Analysis

MTD ≈ 37 ms

MTF ≈ 2.8 s

26 of 32

Test Result 2 - Compare overall SLA

  1. Deployed a similar application to AWS EC2
  2. Compared it with the proposed framework

To simulate a system crash, a “kill switch” is implemented that crashes the application after a pre-defined duration:

  #   Kill Switch Frequency
  1   every 5 minutes of execution
  2   every 10 minutes of execution
  3   every 20 minutes of execution

27 of 32

Test Result 2 - Compare overall SLA

                     Duration (s)   Uptime (s)   Downtime (s)   SLA (%)     Fail Count
  AWS - 5 min          2105.901      2020.234       85.667      95.93204      2,080
  AWS - 10 min         2203.670      2131.404       72.266      96.72067      1,518
  AWS - 20 min         2089.374      2048.838       40.536      98.05992      1,021
  Propose - 5 min      2943.944      2936.602        7.343      99.75058      3,990
  Propose - 10 min     2970.906      2970.883        0.023      99.99923        600
  Propose - 20 min     3005.539      3004.165        1.374      99.95429        308

Disclaimer: The AWS Load Balancer runs on virtual machines while our simulation runs on Docker; the recovery time for a VM is much higher than for a Docker container.

28 of 32

Test Result 2 - Compare overall SLA

Meets the High Availability SLA!

29 of 32

Practical Implementation

30 of 32

Practical Implementation

Since the push-based mechanism with a persistent connection consumes more CPU resources, this framework is best suited to mission-critical applications such as:

  1. Securities trading systems
  2. Banking systems
  3. Air traffic control systems

31 of 32

Conclusion

32 of 32

Conclusion

  • High availability has been one of the biggest challenges in application design

  • Depending on the use case, various techniques can improve service availability

  • This research proposes a push-based mechanism with a persistent connection to reduce the “Time to Detect”, so that the overall Service Level Agreement can be improved
