1 of 53

Troubleshooting And Workarounds In Kubernetes

瑞嘉軟體 Jack Kuo


2 of 53

3 of 53

Jack Kuo

瑞嘉軟體 CTO Department SRE Team

    • Senior SRE
    • Speaker

Current focus areas

    • Agile
    • DevOps
    • Cloud Native

Speaker

4 of 53

Agenda


  • About Us
  • Case 1: Kubernetes Pod Restart
    • CrashLoopBackOff
    • ThreadPool
    • Warmup Setting
  • Case 2: Client Get 504 Timeout
    • Network Troubleshooting
    • Prometheus Metrics
  • Case 3: P99 Latency Is High
    • Resource Issue
    • Inconsistent Performance
  • Takeaways

5 of 53

About Us

What the SRE Team covers

[Diagram: the Infra Team's coverage vs. the SRE Team's coverage]

6 of 53


Not Just Troubleshooting

7 of 53


But Also Workarounds

8 of 53


9 of 53


10 of 53


Case 1

Kubernetes Pod Restart

11 of 53

Kubernetes Pod Restart - CrashLoopBackOff


12 of 53

Kubernetes Pod Restart - CrashLoopBackOff


13 of 53

Kubernetes Pod Restart - CrashLoopBackOff


  • Application Issue
    • Code Error
    • Environment Config Error
    • Dependent Applications Not Ready
  • Resource Issue
    • CPU
    • Memory
  • Health Check (see the probe sketch below)
    • Startup Probe
    • Liveness Probe
    • Readiness Probe
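
A minimal probe sketch for reference; the paths, port, and timings are assumptions for illustration, not the deck's actual values:

containers:
  - name: app                   # hypothetical container
    image: myapp:latest         # hypothetical image
    startupProbe:               # gives a slow-starting app time before liveness kicks in
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30      # up to 30 × 10s = 5 minutes to start
      periodSeconds: 10
    livenessProbe:              # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:             # removes the pod from Service endpoints until it can serve
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5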

14 of 53


So, as I was saying... what about those unknown states?

15 of 53

Kubernetes Pod Restart - ThreadPool


16 of 53

Kubernetes Pod Restart - ThreadPool


.NET ThreadPool

  • ThreadPool.SetMaxThreads
    • Bigger is not necessarily better
    • Context switching is expensive
  • ThreadPool.SetMinThreads (see the sketch below)
    • The live thread count is not always at or above MinThreads
    • MinThreads is the number of threads the pool will create on demand, without throttling delay
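
A minimal .NET sketch of raising MinThreads so a traffic burst does not wait on the pool's thread-injection throttle; the 200/200 values are assumptions for illustration, not recommendations:

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // The defaults are derived from the processor count.
        ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
        Console.WriteLine($"Default MinThreads: worker={workerMin}, io={ioMin}");

        // Threads up to MinThreads are created on demand without delay;
        // beyond MinThreads, the pool throttles new thread injection.
        ThreadPool.SetMinThreads(200, 200);
    }
}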

17 of 53

Kubernetes Pod Restart - Warmup Setting


The vicious cycle of latency, high CPU consumption, and HPA scaling

18 of 53

Kubernetes Pod Restart - Warmup Setting


minReadySeconds ≠ min Ready Seconds
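
That is, minReadySeconds does not delay a pod's readiness or hold traffic back for warmup; it only tells the Deployment controller how long a new pod must stay Ready before the rollout counts it as available. A minimal sketch (names and values are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  minReadySeconds: 30   # a new pod must remain Ready for 30s before the rollout proceeds;
                        # it still receives traffic as soon as its readinessProbe passes
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:latest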

19 of 53


Please Tell Me When You Are Ready

20 of 53

Kubernetes Pod Restart - Warmup Setting


Return non-200 to Kubernetes while warming up; return 200 to Kubernetes once warmup completes.
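
A minimal ASP.NET Core (.NET 6+) sketch of that idea; the /ready path and the warmup work are assumptions:

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var warmedUp = false;

// Warm up in the background: JIT hot paths, prime caches, open connections.
_ = Task.Run(async () =>
{
    await Task.Delay(TimeSpan.FromSeconds(10));  // stand-in for real warmup work
    warmedUp = true;
});

// Readiness probe target: non-200 while warming up, 200 once warm, so
// Kubernetes only routes traffic to pods that are actually ready to serve.
app.MapGet("/ready", () => warmedUp
    ? Results.Ok("warm")
    : Results.StatusCode(503));

app.Run();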

21 of 53


Case 2

Client Get 504 Timeout

22 of 53

Client Get 504 Timeout


  • Application Issue
    • The application is too busy to handle client requests
    • A dependent application has a problem
  • Network Issue
    • CDN
    • Load Balancer
    • Kubernetes Ingress
    • Kubernetes Service
    • Kubernetes Application Pod

Bottom-Up

23 of 53


A Chain Is Only As Strong As Its Weakest Link

24 of 53

Client Get 504 Timeout - Network Troubleshooting

Pod to Pod

[Diagram: testing connectivity directly from pod to pod across Node 1, Node 2, and Node 3]
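
A hedged sketch of the check, assuming curl is available in the pod image:

kubectl exec -it <Pod Name> -- curl -v http://<Target Pod IP>:<Target Port>/<Path>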

25 of 53

Client Get 504 Timeout - Network Troubleshooting

Service to Pod

[Diagram: testing from the Service to its backing pods across Node 1, Node 2, and Node 3]
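
A hedged sketch of the check, run from any node or pod that can reach the cluster network:

curl -v http://<Service ClusterIP>:<Service Port>/<Path>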

26 of 53

Client Get 504 Timeout - Network Troubleshooting

Ingress to Service

[Diagram: testing from the Ingress-Nginx-Controller to the Service across Node 1, Node 2, and Node 3]

curl -v -H "Host: <Host>" http://<Ingress-Nginx-Controller IP>:<Ingress-Nginx-Controller Port>/<Path>

27 of 53

Client Get 504 Timeout - Network Troubleshooting

HAProxy to Ingress-Nginx-Controller

[Diagram: testing from HAProxy to the Ingress-Nginx-Controller across Node 1, Node 2, and Node 3]

curl -v -H "Host: <Host>" http://<HAProxy IP>:<HAProxy Port>/<Path>

28 of 53


This is getting rather long-winded and tedious.

29 of 53

Client Get 504 Timeout - Prometheus Metrics


Ingress Request Volume By Status
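
Assuming ingress-nginx's metrics are scraped by Prometheus, a query along these lines backs this panel (the ingress name is a placeholder):

sum(rate(nginx_ingress_controller_requests{ingress="myapp"}[5m])) by (status)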

30 of 53

Client Get 504 Timeout - Prometheus Metrics


Ingress Percentile Response Time
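
A sketch of the corresponding percentile query, using ingress-nginx's request-duration histogram:

histogram_quantile(0.99,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{ingress="myapp"}[5m])) by (le))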

31 of 53

Client Get 504 Timeout - Prometheus Metrics


Endpoint RPS
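
Assuming the application exports per-endpoint request counters (the metric and label names below follow prometheus-net's defaults and are an assumption here):

sum(rate(http_requests_received_total[1m])) by (controller, action)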

32 of 53

Client Get 504 Timeout - Prometheus Metrics


Requests Currently In Progress By Endpoint
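
And a sketch for in-flight requests, under the same assumption about the exported metrics:

sum(http_requests_in_progress) by (controller, action)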

33 of 53

Workaround


Separate Kubernetes Deployment By Ingress Host And Path
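
A hedged sketch of the idea: give a hot host/path its own Service and Deployment so its traffic cannot starve everything else (all names are assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /heavy          # the hot path gets its own Deployment behind this Service
            pathType: Prefix
            backend:
              service:
                name: myapp-heavy
                port:
                  number: 80
          - path: /               # everything else stays on the main Deployment
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80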

34 of 53

Workaround


Separate Kubernetes Deployment By Ingress Host And Path

35 of 53


Case 3

P99 Latency Is High

36 of 53


P99 Latency Is A Leading Indicator Of Problems

37 of 53

P99 Latency Is High


  • Application Issue
  • Resource Issue
    • Resource Competition
    • Memory Leak
  • Performance Inconsistency
    • Sticky Session Setting
    • VM Host Issue

38 of 53

P99 Latency Is High - Resource Issue


Resource Competition

[Diagram: worker node resources. For each container, Requests is the guaranteed resource and Limits is the maximum it can use; the span between Requests and Limits is only "available", not guaranteed. When containers on the same node all burst above their Requests, they compete for that shared headroom: Resource Competition!]

39 of 53

P99 Latency Is High - Resource Issue


Pod Anti-Affinity
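
A minimal sketch of the setting, assuming pods labeled app: myapp: spread replicas across nodes so they stop competing for the same worker node's resources. Pod template snippet:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp                         # hypothetical label
        topologyKey: kubernetes.io/hostname    # at most one such pod per node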

40 of 53

P99 Latency Is High - Resource Issue


Memory Leak

41 of 53

Workaround


A Cronjob To Detect Memory Leaks

[Flow: every 2 minutes (*/2 * * * *), calculate memory usage via the Prometheus API; if Memory Usage > Memory Target, restart the Deployment and notify Slack; if Memory Usage < Memory Target, do nothing]
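
A hedged shell sketch of that flow; the Prometheus address, query, memory target, and Slack webhook are all assumptions:

#!/bin/sh
# Runs every 2 minutes via cron: */2 * * * *
USED=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{pod=~"myapp-.*"})' \
  | jq -r '.data.result[0].value[1]')
TARGET=$((2 * 1024 * 1024 * 1024))   # assumed 2 GiB memory target

if [ "${USED%.*}" -gt "$TARGET" ]; then
  kubectl rollout restart deployment/myapp
  curl -s -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
    -d '{"text":"myapp memory exceeded target; deployment restarted"}'
fi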

42 of 53

Workaround


A Cronjob To Detect Memory Leak

43 of 53

P99 Latency Is High - Inconsistent Performance


Pod RPS

44 of 53

P99 Latency Is High - Inconsistent Performance


Sticky Session Setting

nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_actual_ip"

Hashing the upstream by a request header (candidates here: User-Agent, X-Forwarded-For, X-Actual-IP) pins every request with the same header value to the same pod, which skews per-pod RPS.

45 of 53

P99 Latency Is High - Inconsistent Performance


Inconsistent Pod Memory Resource Usage

46 of 53

P99 Latency Is High - Inconsistent Performance


.NET Runtime Bug On AMD Machines

47 of 53

P99 Latency Is High - Inconsistent Performance


Inconsistent Pod CPU Resource Usage

48 of 53

P99 Latency Is High - Inconsistent Performance


VM Host Issue

49 of 53


When CPU Usage Exceeds The Tipping Point, Performance Drops

50 of 53

Workaround


Dummy Pod

51 of 53

Workaround


Dummy Pod

[Diagram: a Dummy Pod's resource requests reserve part of the worker node's resources]
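
A hedged sketch of a dummy pod: a pause container whose requests reserve headroom so the scheduler cannot pack the node past its CPU tipping point, and deleting it instantly frees guaranteed capacity (names and sizes are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dummy-pod
  template:
    metadata:
      labels:
        app: dummy-pod
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only the requests matter
          resources:
            requests:
              cpu: "1"       # reserved headroom on the node
              memory: 1Gi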

52 of 53


Workaround Doesn’t Mean That The Problem Is Solved

53 of 53

Takeaways


  • Warmup Setting With Kubernetes Health Check

  • Prometheus Metrics Are Helpful For Network Troubleshooting

  • Same Pod But Inconsistent Performance In Kubernetes

  • Workaround Doesn’t Mean That The Problem Is Solved