Troubleshooting And Workaround In Kubernetes
瑞嘉軟體 Jack Kuo
Jack Kuo
瑞嘉軟體 CTO Department SRE Team
Current Focus Areas
Speaker
Agenda
About Us
SRE Team Scope Of Responsibility
Infra Team Scope Of Responsibility
SRE Team Scope Of Responsibility
Not Only Troubleshooting
But Also Workaround
Case 1
Kubernetes Pod Restart
Kubernetes Pod Restart - CrashLoopBackOff
So, as I was saying... what about that... unknown state?
Kubernetes Pod Restart - ThreadPool
.NET ThreadPool
Kubernetes Pod Restart - Warmup Setting
The Vicious Cycle Of Latency, High CPU Consumption And HPA Scaling
minReadySeconds ≠ min Ready Seconds
Please Tell Me When You Are Ready
return non-200 to Kubernetes
return 200 to Kubernetes
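The two responses above can be sketched as a small readiness endpoint. This is a minimal illustration, assuming the warmup signal is exposed over HTTP and checked by a Kubernetes readinessProbe (which treats any status from 200 to 399 as success); the `/ready` path, port 8080, and the `Warmup` class are hypothetical names, not taken from the slides.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class Warmup:
    """Tracks whether the application has finished warming up."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        """Call this once caches, connections, thread pools, etc. are warm."""
        self._ready.set()

    def status_code(self) -> int:
        # Non-200 (503 here) tells Kubernetes "not ready yet"; 200 means ready.
        return 200 if self._ready.is_set() else 503


def make_handler(warmup: Warmup, path: str = "/ready"):
    """Build a handler for the (hypothetical) readiness path."""

    class ReadyHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            code = warmup.status_code() if self.path == path else 404
            self.send_response(code)
            self.end_headers()

        def log_message(self, *args) -> None:
            # Keep probe traffic out of the application log.
            pass

    return ReadyHandler


def serve(warmup: Warmup, port: int = 8080) -> HTTPServer:
    """Start the probe endpoint in a background thread and return the server."""
    server = HTTPServer(("", port), make_handler(warmup))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

The application would call `serve(warmup)` at startup and `warmup.mark_ready()` only after its own warmup work finishes, so the Pod receives traffic when it says it is ready rather than after a fixed delay.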
Case 2
Client Gets 504 Timeout
Bottom-Up
A Chain Is Only As Strong As Its Weakest Link
Client Gets 504 Timeout - Network Troubleshooting
Pod to Pod (diagram across Node 1, Node 2, Node 3)
Service to Pod (diagram across Node 1, Node 2, Node 3)
Ingress to Service (diagram across Node 1, Node 2, Node 3)
curl -v -H "Host: <Host>" http://<Ingress-Nginx-Controller IP>:<Ingress-Nginx-Controller Port>/<Path>
HAProxy to Ingress-Nginx-Controller (diagram across Node 1, Node 2, Node 3)
curl -v -H "Host: <Host>" http://<HAProxy IP>:<HAProxy Port>/<Path>
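The bottom-up walk through the four hops can be sketched as a loop that stops at the first hop that fails. This is only an illustration: `http_ok` mirrors the curl commands above by sending an explicit Host header, and the hop names, function names, and the idea of a per-hop check callable are assumptions made for the example.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence

# Hops ordered bottom-up, following the slides above.
HOPS = (
    "Pod to Pod",
    "Service to Pod",
    "Ingress to Service",
    "HAProxy to Ingress-Nginx-Controller",
)


def http_ok(url: str, host: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a non-error status for the given Host."""
    request = urllib.request.Request(url, headers={"Host": host})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers HTTP 4xx/5xx, connection refused, and timeouts.
        return False


def first_broken_hop(
    check: Callable[[str], bool], hops: Sequence[str] = HOPS
) -> Optional[str]:
    """Run the check for each hop from the bottom up; return the first failure."""
    for hop in hops:
        if not check(hop):
            return hop
    return None
```

A caller would supply a `check` function that maps each hop name to the matching probe, for example `http_ok` against the Ingress-Nginx-Controller address for the third hop.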
This seems a bit tedious and troublesome
Client Gets 504 Timeout - Prometheus Metrics
Ingress Request Volume By Status
Ingress Percentile Response Time
Endpoint RPS
Requests Currently In Progress By Endpoint
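The first two panels could be backed by PromQL along these lines. This is a sketch that assumes the dashboards read the metrics exported by ingress-nginx (`nginx_ingress_controller_requests` and `nginx_ingress_controller_request_duration_seconds_bucket`); the slides do not show the actual queries, and the helper names and the 2m window are illustrative. The endpoint-level panels depend on application metrics and are left out.

```python
def request_volume_by_status(ingress: str, window: str = "2m") -> str:
    """PromQL for requests per second through one Ingress, split by status code."""
    selector = f'{{ingress="{ingress}"}}'
    return (
        "sum by (status) "
        f"(rate(nginx_ingress_controller_requests{selector}[{window}]))"
    )


def percentile_response_time(
    ingress: str, quantile: float = 0.99, window: str = "2m"
) -> str:
    """PromQL for a response-time percentile of one Ingress, from histogram buckets."""
    selector = f'{{ingress="{ingress}"}}'
    return (
        f"histogram_quantile({quantile}, sum by (le) "
        "(rate(nginx_ingress_controller_request_duration_seconds_bucket"
        f"{selector}[{window}])))"
    )


if __name__ == "__main__":
    # "api" is a hypothetical Ingress name.
    print(request_volume_by_status("api"))
    print(percentile_response_time("api"))
```

Filtering the first query on `status="504"` would narrow the panel to the timeouts the client reports.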
Workaround
Separate Kubernetes Deployment By Ingress Host And Path
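One way to picture this workaround is an Ingress whose paths point at different Services, each backed by its own Deployment, so a slow endpoint cannot exhaust the Pods serving everything else. The sketch below builds such a manifest as a Python dict; the host, paths, Service names, and the `nginx` ingress class are hypothetical, not taken from the slides.

```python
import json
from typing import Dict


def build_ingress(name: str, host: str, routes: Dict[str, str], port: int = 80) -> dict:
    """Build a networking.k8s.io/v1 Ingress that routes each path to its own Service."""
    paths = [
        {
            "path": path,
            "pathType": "Prefix",
            "backend": {"service": {"name": service, "port": {"number": port}}},
        }
        for path, service in routes.items()
    ]
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "ingressClassName": "nginx",  # assumption: ingress-nginx is the controller
            "rules": [{"host": host, "http": {"paths": paths}}],
        },
    }


if __name__ == "__main__":
    # Hypothetical split: the slow /report endpoint gets a dedicated Deployment.
    manifest = build_ingress(
        "api",
        "api.example.com",
        {"/report": "api-report", "/": "api-default"},
    )
    print(json.dumps(manifest, indent=2))
```

Each Service would select a separate Deployment, which can then be scaled and resourced independently.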
Case 3
P99 Latency Is High
P99 Latency Is A Leading Indicator Of Problems
P99 Latency Is High - Resource Issue
Resource Competition
worker node resource
Requests: guaranteed resource for container
Limits: max resource container can use
available resource for container
Resource Competition !!!
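The arithmetic behind the diagram can be worked through with a small example. The scheduler places Pods by their requests, so the sum of limits on a node may exceed what the node actually has, and containers bursting toward their limits then compete. The node size and container figures below are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Container:
    name: str
    cpu_request_m: int  # guaranteed CPU, in millicores
    cpu_limit_m: int    # maximum CPU the container can use, in millicores


def node_cpu_summary(allocatable_m: int, containers: List[Container]) -> Dict[str, object]:
    """Compare summed requests and limits against the node's allocatable CPU."""
    requested = sum(c.cpu_request_m for c in containers)
    limited = sum(c.cpu_limit_m for c in containers)
    return {
        "requested_m": requested,
        "limit_m": limited,
        # Scheduling only needs the requests to fit.
        "schedulable": requested <= allocatable_m,
        # Anything above allocatable is CPU the containers can only fight over.
        "overcommitted_m": max(0, limited - allocatable_m),
    }


if __name__ == "__main__":
    # Hypothetical 4-core node with three identical containers.
    containers = [Container(f"app-{i}", 1000, 2000) for i in range(3)]
    print(node_cpu_summary(4000, containers))
```

With these numbers all three containers fit by request (3000m of 4000m), yet their limits add up to 6000m, leaving 2000m of potential contention.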
Pod Anti-Affinity
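A Pod anti-affinity rule that spreads replicas of one application across worker nodes could look like the following. It is a sketch of the standard `podAntiAffinity` fields expressed as a Python dict; the `app` label key and the choice between the preferred and required forms are assumptions, since the slides do not show the actual manifest.

```python
def anti_affinity(app_label: str, required: bool = False) -> dict:
    """Build an affinity block that keeps Pods with the same app label on different nodes."""
    term = {
        "labelSelector": {"matchLabels": {"app": app_label}},
        # One Pod per hostname, i.e. per worker node.
        "topologyKey": "kubernetes.io/hostname",
    }
    if required:
        # Hard rule: a Pod stays Pending if no node satisfies it.
        return {
            "podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [term]
            }
        }
    # Soft rule: the scheduler tries to spread but may still co-locate.
    return {
        "podAntiAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {"weight": 100, "podAffinityTerm": term}
            ]
        }
    }
```

The returned dict belongs under `spec.template.spec.affinity` of a Deployment.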
Memory Leak
Workaround
A Cronjob To Detect Memory Leak
Schedule: */2 * * * *
Calculate Memory Usage (Prometheus API)
If Memory Usage > Memory Target: Restart Deployment, Notify To Slack
If Memory Usage < Memory Target: no restart
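The flow in the diagram can be sketched as one function that the CronJob runs every two minutes. This is a minimal illustration under stated assumptions: memory usage comes from a Prometheus instant query (`/api/v1/query`), and the restart and Slack steps are passed in as callables because the slides do not show how they are implemented. All function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable


def parse_instant_value(payload: dict) -> float:
    """Extract the first sample from a Prometheus instant-query response."""
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no samples")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return float(result[0]["value"][1])


def query_prometheus(base_url: str, promql: str, timeout: float = 10.0) -> float:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_instant_value(json.load(response))


def check_memory(
    usage: float,
    target: float,
    restart: Callable[[], None],
    notify: Callable[[str], None],
) -> bool:
    """Restart and notify when usage exceeds the target; return True if it did."""
    if usage > target:
        restart()
        notify(f"memory usage {usage:.0f} exceeded target {target:.0f}; deployment restarted")
        return True
    return False
```

In practice `restart` might wrap `kubectl rollout restart deployment/<name>` and `notify` might post to a Slack incoming webhook; both are left abstract here.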
P99 Latency Is High - Inconsistent Performance
Pod RPS
Sticky Session Setting
nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_actual_ip"
User-Agent
X-Forwarded-For
X-Actual-IP
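Hashing on a request header pins each client to one Pod, which can leave Pod RPS uneven when a few clients send most of the traffic. The sketch below shows that effect with a plain hash-modulo picker; it is a simplification and not the algorithm ingress-nginx actually uses, and the client IPs and Pod names are made up.

```python
import hashlib
from typing import Dict, Iterable, Sequence


def pick_pod(key: str, pods: Sequence[str]) -> str:
    """Map a hash key (e.g. the X-Actual-IP value) to one Pod, deterministically."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]


def requests_per_pod(keys: Iterable[str], pods: Sequence[str]) -> Dict[str, int]:
    """Count how many requests each Pod receives for a stream of hash keys."""
    counts = {pod: 0 for pod in pods}
    for key in keys:
        counts[pick_pod(key, pods)] += 1
    return counts


if __name__ == "__main__":
    pods = ["pod-a", "pod-b", "pod-c"]
    # One heavy client sends 90 of 100 requests.
    traffic = ["10.0.0.1"] * 90 + ["10.0.0.2"] * 10
    print(requests_per_pod(traffic, pods))
```

Whichever Pod the heavy client hashes to receives at least 90 of the 100 requests, regardless of how many replicas exist.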
Inconsistent Pod Memory Resource Usage
.NET Runtime Bug On AMD Machines
Inconsistent Pod CPU Resource Usage
VM Host Issue
Performance Drops Once CPU Exceeds The Tipping Point
Workaround
Dummy Pod
worker node resource
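A Dummy Pod of this kind could be declared roughly as follows. The sketch assumes the Dummy Pod is a placeholder that does no work and only holds CPU and memory requests on a worker node, so the scheduler places fewer real workloads there and node CPU stays below the tipping point; the slides do not show the manifest, and the names, sizes, pause image tag, and node selector are illustrative.

```python
from typing import Optional


def dummy_pod(name: str, cpu: str, memory: str, node: Optional[str] = None) -> dict:
    """Build a Pod manifest that reserves resources without consuming them."""
    resources = {"cpu": cpu, "memory": memory}
    pod = {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"role": "dummy"}},
        "spec": {
            "containers": [
                {
                    "name": "pause",
                    # The pause container sleeps forever and uses almost nothing.
                    "image": "registry.k8s.io/pause:3.9",
                    "resources": {
                        "requests": dict(resources),
                        "limits": dict(resources),
                    },
                }
            ]
        },
    }
    if node is not None:
        # Pin the placeholder to the worker node that needs headroom.
        pod["spec"]["nodeSelector"] = {"kubernetes.io/hostname": node}
    return pod
```

For example, `dummy_pod("dummy-worker-1", "2", "4Gi", node="worker-1")` would hold two cores and 4Gi on a hypothetical node named `worker-1`.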
Workaround Doesn’t Mean That The Problem Is Solved
Takeaways