Study on
High Availability & Fault Tolerance Application
Feb 20, 2023
Norman Kong Koon Kit and Michal Aibin
Edge Computing, Cloud Computing, and Big Data
International Conference on Computing, Networking and Communications (ICNC 2023)
Contents
Problem Overview
Approach & Result
Practical Implementation
Conclusion
Motivation
As more and more critical systems, such as online payment and air traffic control, are put online, it is crucial to design applications for High Availability and Fault Tolerance; otherwise, the consequences can be disastrous.
In this research, we perform a deep dive into application design to improve overall Availability and Fault Tolerance.
Problem Overview
The term “Availability” is defined as “the degree to which a system is functioning and is accessible to deliver its services during a given time interval”
⇒ equivalently, a maximum permitted downtime percentage over a given period
Source : Service Availability: Principles and Practice (Maria Toeroe and Francis Tam, 2012)
High Availability
To break the formula down further, we define uptime and downtime as :
MTBF = Mean Time Between Failures
MTTR = Mean Time To Restore/Recover
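With these definitions, availability takes the standard form used in the availability literature:

```latex
\text{Availability}
  = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}
  = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}
```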
To improve availability, we can either ↑ MTBF or ↓ MTTR
In this research, we focus on ↓ MTTR as much as possible
MTTR (Mean Time to Restore) can be further broken down into MTD (Mean Time to Detect) and MTF (Mean Time to Fix)
This research will focus on minimizing the MTD
Minimize the MTD as much as possible
Industrial Practice
To ensure the underlying application is up and running, a typical Load Balancer periodically performs a “pull-based” health check (probe), with a timeout and retries before declaring a node unhealthy; in the worst case this takes up to 14 seconds on AWS
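The pull-based worst case can be sketched as follows. The probe interval, timeout, and unhealthy threshold below are illustrative values only (not necessarily the exact AWS configuration), chosen to reproduce the 14-second worst case mentioned above:

```python
def worst_case_detection(interval_s: float, timeout_s: float,
                         unhealthy_threshold: int) -> float:
    # A node that crashes right after passing a probe is only declared
    # unhealthy after `unhealthy_threshold` consecutive failed probes;
    # probes fire every `interval_s`, and the last one may take up to
    # `timeout_s` to time out.
    return unhealthy_threshold * interval_s + timeout_s

print(worst_case_detection(5.0, 4.0, 2))  # -> 14.0
```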
How can we improve it?
The Approach
Instead of a “pull-based” health check, we propose a pluggable “Resource Agent”
1) The agent keeps a persistent connection to the “Load Agent”
2) Node information (CPU / Memory / Network IO) is pushed periodically
3) The “Load Balancer” runs a “Resource Awareness Routing Algorithm” to dispatch client requests.
Note : When a worker node crashes, this persistent connection is lost, so the “Load Agent” can update the Load Balancer with minimal delay.
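The Resource Agent side of steps 1) and 2) could look like the following sketch; the host, port, and status payload are hypothetical:

```python
import json
import socket
import time

def node_status() -> bytes:
    # Hypothetical payload; a real agent would read CPU / memory /
    # network-IO counters from the OS.
    status = {"cpu": 0.42, "mem": 0.63, "net_io": 1024}
    return (json.dumps(status) + "\n").encode()

def run_agent(load_agent_host: str, port: int, interval_s: float = 1.0) -> None:
    # A single persistent TCP connection. If the worker crashes, the OS
    # tears the connection down, so the Load Agent sees EOF almost
    # immediately instead of waiting for a probe to time out.
    with socket.create_connection((load_agent_host, port)) as conn:
        while True:
            conn.sendall(node_status())
            time.sleep(interval_s)
```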
Minimize "Time to Detect"
Instead of a pull-based health check, we use a push-based model over a persistent connection to improve the time to detect
Traditional Approach
New Approach
Components in this Proposal
There are 4 major components involved :
Load Agent : waits for worker nodes to connect and updates node information in the cache
Resource Agent : establishes a persistent connection to the Load Agent and pushes its status periodically
Load Balancer : runs the “Resource Awareness Routing Algorithm” to dispatch requests to worker nodes
Recovery Agent : performs recovery actions once the Load Agent detects an outage
Note : The Recovery Agent minimizes the MTF (Mean Time to Fix) rather than the MTD
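A minimal sketch of a recovery action, assuming the workers run as Docker containers (as in our simulation); the container name and the restart-only policy are illustrative:

```python
import subprocess

def recover(container: str) -> list[str]:
    # Hypothetical recovery action: restart the crashed worker's
    # container to shorten the MTF. Returning the command keeps this
    # sketch testable; a real agent would execute it.
    cmd = ["docker", "restart", container]
    # subprocess.run(cmd, check=True)  # enable in a real deployment
    return cmd
```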
Resource Agent
Load Agent
Recovery Agent
Load Balancer
Resource Awareness Routing Algorithm
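One possible form of such a rule, as an illustration only (the weights and metric names below are hypothetical): dispatch each request to the worker with the lowest weighted load score computed from the pushed metrics.

```python
def route(workers: dict[str, dict[str, float]]) -> str:
    # Hypothetical weights: dispatch to the worker with the lowest
    # weighted utilisation score (all metrics normalised to [0, 1]).
    def score(s: dict[str, float]) -> float:
        return 0.5 * s["cpu"] + 0.3 * s["mem"] + 0.2 * s["net"]
    return min(workers, key=lambda w: score(workers[w]))

workers = {
    "worker-a": {"cpu": 0.80, "mem": 0.40, "net": 0.10},  # score 0.54
    "worker-b": {"cpu": 0.20, "mem": 0.30, "net": 0.20},  # score 0.23
}
print(route(workers))  # -> worker-b
```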
Our Sample Implementation
The Result
Test Result 1 - Log Analysis
In our proposed scenario, we ran 1,000 system-crash simulations; the average “time to detect” is
24.5 milliseconds
Recall that the AWS worst-case scenario is 14 seconds, which makes our approach about 571 times faster!
MTD ~ 37ms
MTF ~ 2.8s
Test Result 2 - Compare overall SLA
To simulate system crashes, a “Kill Switch” is implemented to crash the worker after a pre-defined duration :
# | Kill Switch Frequency |
1 | 5 minutes of execution |
2 | 10 minutes of execution |
3 | 20 minutes of execution |
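The kill switch can be sketched as follows (a hypothetical helper; `os._exit` skips all cleanup, approximating a hard crash):

```python
import os
import threading

def arm_kill_switch(delay_s: float) -> threading.Timer:
    # After `delay_s` seconds, terminate the process with os._exit,
    # which performs no cleanup, simulating a hard crash.
    timer = threading.Timer(delay_s, os._exit, args=(1,))
    timer.daemon = True
    timer.start()
    return timer
```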
| Scenario | Duration(s) | Up Time(s) | Downtime(s) | SLA(%) | Fail Count |
| AWS - 5 min | 2105.901 | 2020.234 | 85.667 | 95.93204 | 2,080 |
| AWS - 10 min | 2203.67 | 2131.404 | 72.266 | 96.72067 | 1,518 |
| AWS - 20 min | 2089.374 | 2048.838 | 40.536 | 98.05992 | 1,021 |
| Proposed - 5 min | 2943.944 | 2936.602 | 7.343 | 99.75058 | 3,990 |
| Proposed - 10 min | 2970.906 | 2970.883 | 0.023 | 99.99923 | 600 |
| Proposed - 20 min | 3005.539 | 3004.165 | 1.374 | 99.95429 | 308 |
Disclaimer : the AWS Load Balancer runs on virtual machines while our simulation runs on Docker; recovery time on a VM is much higher than on Docker
Meets the High Availability SLA!
Practical Implementation
Since the push-based + persistent-connection mechanism consumes more CPU, this framework is best suited for mission-critical applications such as online payment and air traffic control systems
Conclusion