1 of 53

Troubleshooting And Workarounds In Kubernetes

瑞嘉軟體 Jack Kuo


2 of 53

3 of 53

Jack Kuo

瑞嘉軟體 CTO Department SRE Team

    • Senior SRE
    • Speaker

Current focus areas

    • Agile
    • DevOps
    • Cloud Native

Speaker

4 of 53

Agenda


  • About Us
  • Case 1: Kubernetes Pod Restart
    • CrashLoopBackOff
    • ThreadPool
    • Warmup Setting
  • Case 2: Client Get 504 Timeout
    • Network Troubleshooting
    • Prometheus Metrics
  • Case 3: P99 Latency Is High
    • Resource Issue
    • Inconsistent Performance
  • Takeaways

5 of 53

About Us

What the SRE Team covers

[Diagram: the Infra Team's coverage vs. the SRE Team's coverage]

6 of 53


Not Just Troubleshooting

7 of 53


But Also Workarounds

8 of 53


9 of 53


10 of 53


Case 1

Kubernetes Pod Restart

11 of 53

Kubernetes Pod Restart - CrashLoopBackOff


12 of 53

Kubernetes Pod Restart - CrashLoopBackOff


13 of 53

Kubernetes Pod Restart - CrashLoopBackOff


  • Application Issue
    • Code Error
    • Environment Config Error
    • Dependent Applications Not Ready
  • Resource Issue
    • CPU
    • Memory
  • Health Check (see the probe sketch below)
    • Startup Probe
    • Liveness Probe
    • Readiness Probe
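
A minimal probe sketch for reference; the paths, port, and timings are assumptions for illustration, not the deck's actual values:

containers:
  - name: app                   # hypothetical container
    image: myapp:latest         # hypothetical image
    startupProbe:               # gives a slow-starting app time before liveness kicks in
      httpGet:
        path: /healthz
        port: 8080
      failureThreshold: 30      # up to 30 × 10s = 5 minutes to start
      periodSeconds: 10
    livenessProbe:              # restarts the container if it stops responding
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
    readinessProbe:             # removes the pod from Service endpoints until it can serve
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5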

14 of 53


So, as I was saying... what about those unknown states?

15 of 53

Kubernetes Pod Restart - ThreadPool


16 of 53

Kubernetes Pod Restart - ThreadPool


.NET ThreadPool

  • ThreadPool.SetMaxThreads
    • Bigger is not necessarily better
    • Context switching is expensive
  • ThreadPool.SetMinThreads (see the sketch below)
    • The live thread count is not always at or above MinThreads
    • MinThreads is the number of threads the pool will create on demand, without throttling delay
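
A minimal .NET sketch of raising MinThreads so a traffic burst does not wait on the pool's thread-injection throttle; the 200/200 values are assumptions for illustration, not recommendations:

using System;
using System.Threading;

class Program
{
    static void Main()
    {
        // The defaults are derived from the processor count.
        ThreadPool.GetMinThreads(out int workerMin, out int ioMin);
        Console.WriteLine($"Default MinThreads: worker={workerMin}, io={ioMin}");

        // Threads up to MinThreads are created on demand without delay;
        // beyond MinThreads, the pool throttles new thread injection.
        ThreadPool.SetMinThreads(200, 200);
    }
}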

17 of 53

Kubernetes Pod Restart - Warmup Setting


The vicious cycle of latency, high CPU consumption, and HPA scaling

18 of 53

Kubernetes Pod Restart - Warmup Setting


minReadySeconds ≠ min Ready Seconds
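
That is, minReadySeconds does not delay a pod's readiness or hold traffic back for warmup; it only tells the Deployment controller how long a new pod must stay Ready before the rollout counts it as available. A minimal sketch (names and values are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  minReadySeconds: 30   # a new pod must remain Ready for 30s before the rollout proceeds;
                        # it still receives traffic as soon as its readinessProbe passes
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: app
          image: myapp:latest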

19 of 53


Please Tell Me When You Are Ready

20 of 53

Kubernetes Pod Restart - Warmup Setting


Return non-200 to Kubernetes while warming up; return 200 to Kubernetes once warmup completes.
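
A minimal ASP.NET Core (.NET 6+) sketch of that idea; the /ready path and the warmup work are assumptions:

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

var warmedUp = false;

// Warm up in the background: JIT hot paths, prime caches, open connections.
_ = Task.Run(async () =>
{
    await Task.Delay(TimeSpan.FromSeconds(10));  // stand-in for real warmup work
    warmedUp = true;
});

// Readiness probe target: non-200 while warming up, 200 once warm, so
// Kubernetes only routes traffic to pods that are actually ready to serve.
app.MapGet("/ready", () => warmedUp
    ? Results.Ok("warm")
    : Results.StatusCode(503));

app.Run();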

21 of 53


Case 2

Client Get 504 Timeout

22 of 53

Client Get 504 Timeout


  • Application Issue
    • The application is too busy to handle client requests
    • A dependent application has a problem
  • Network Issue
    • CDN
    • Load Balancer
    • Kubernetes Ingress
    • Kubernetes Service
    • Kubernetes Application Pod

Bottom-Up

23 of 53


A Chain Is Only As Strong As Its Weakest Link

24 of 53

Client Get 504 Timeout - Network Troubleshooting

Pod to Pod

[Diagram: testing connectivity directly from pod to pod across Node 1, Node 2, and Node 3]
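
A hedged sketch of the check, assuming curl is available in the pod image:

kubectl exec -it <Pod Name> -- curl -v http://<Target Pod IP>:<Target Port>/<Path>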

25 of 53

Client Get 504 Timeout - Network Troubleshooting

Service to Pod

[Diagram: testing from the Service to its backing pods across Node 1, Node 2, and Node 3]
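
A hedged sketch of the check, run from any node or pod that can reach the cluster network:

curl -v http://<Service ClusterIP>:<Service Port>/<Path>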

26 of 53

Client Get 504 Timeout - Network Troubleshooting

Ingress to Service

[Diagram: testing from the Ingress-Nginx-Controller to the Service across Node 1, Node 2, and Node 3]

curl -v -H "Host: <Host>" http://<Ingress-Nginx-Controller IP>:<Ingress-Nginx-Controller Port>/<Path>

27 of 53

Client Get 504 Timeout - Network Troubleshooting

HAProxy to Ingress-Nginx-Controller

[Diagram: testing from HAProxy to the Ingress-Nginx-Controller across Node 1, Node 2, and Node 3]

curl -v -H "Host: <Host>" http://<HAProxy IP>:<HAProxy Port>/<Path>

28 of 53


This is getting rather long-winded and tedious.

29 of 53

Client Get 504 Timeout - Prometheus Metrics


Ingress Request Volume By Status
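
Assuming ingress-nginx's metrics are scraped by Prometheus, a query along these lines backs this panel (the ingress name is a placeholder):

sum(rate(nginx_ingress_controller_requests{ingress="myapp"}[5m])) by (status)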

30 of 53

Client Get 504 Timeout - Prometheus Metrics


Ingress Percentile Response Time
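
A sketch of the corresponding percentile query, using ingress-nginx's request-duration histogram:

histogram_quantile(0.99,
  sum(rate(nginx_ingress_controller_request_duration_seconds_bucket{ingress="myapp"}[5m])) by (le))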

31 of 53

Client Get 504 Timeout - Prometheus Metrics


Endpoint RPS
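
Assuming the application exports per-endpoint request counters (the metric and label names below follow prometheus-net's defaults and are an assumption here):

sum(rate(http_requests_received_total[1m])) by (controller, action)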

32 of 53

Client Get 504 Timeout - Prometheus Metrics


Requests Currently In Progress By Endpoint
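
And a sketch for in-flight requests, under the same assumption about the exported metrics:

sum(http_requests_in_progress) by (controller, action)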

33 of 53

Workaround


Separate Kubernetes Deployment By Ingress Host And Path
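
A hedged sketch of the idea: give a hot host/path its own Service and Deployment so its traffic cannot starve everything else (all names are assumptions):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /heavy          # the hot path gets its own Deployment behind this Service
            pathType: Prefix
            backend:
              service:
                name: myapp-heavy
                port:
                  number: 80
          - path: /               # everything else stays on the main Deployment
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80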

34 of 53

Workaround


Separate Kubernetes Deployment By Ingress Host And Path

35 of 53


Case 3

P99 Latency Is High

36 of 53


P99 Latency Is A Leading Indicator Of Problems

37 of 53

P99 Latency Is High


  • Application Issue
  • Resource Issue
    • Resource Competition
    • Memory Leak
  • Performance Inconsistency
    • Sticky Session Setting
    • VM Host Issue

38 of 53

P99 Latency Is High - Resource Issue


Resource Competition

[Diagram: worker node resources. For each container, Requests is the guaranteed resource and Limits is the maximum it can use; the span between Requests and Limits is only "available", not guaranteed. When containers on the same node all burst above their Requests, they compete for that shared headroom: Resource Competition!]

39 of 53

P99 Latency Is High - Resource Issue


Pod Anti-Affinity
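
A minimal sketch of the setting, assuming pods labeled app: myapp: spread replicas across nodes so they stop competing for the same worker node's resources. Pod template snippet:

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: myapp                         # hypothetical label
        topologyKey: kubernetes.io/hostname    # at most one such pod per node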

40 of 53

P99 Latency Is High - Resource Issue


Memory Leak

41 of 53

Workaround


A Cronjob To Detect Memory Leaks

[Flow: every 2 minutes (*/2 * * * *), calculate memory usage via the Prometheus API; if Memory Usage > Memory Target, restart the Deployment and notify Slack; if Memory Usage < Memory Target, do nothing]
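
A hedged shell sketch of that flow; the Prometheus address, query, memory target, and Slack webhook are all assumptions:

#!/bin/sh
# Runs every 2 minutes via cron: */2 * * * *
USED=$(curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(container_memory_working_set_bytes{pod=~"myapp-.*"})' \
  | jq -r '.data.result[0].value[1]')
TARGET=$((2 * 1024 * 1024 * 1024))   # assumed 2 GiB memory target

if [ "${USED%.*}" -gt "$TARGET" ]; then
  kubectl rollout restart deployment/myapp
  curl -s -X POST "$SLACK_WEBHOOK_URL" -H 'Content-Type: application/json' \
    -d '{"text":"myapp memory exceeded target; deployment restarted"}'
fi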

42 of 53

Workaround


A Cronjob To Detect Memory Leak

43 of 53

P99 Latency Is High - Inconsistent Performance


Pod RPS

44 of 53

P99 Latency Is High - Inconsistent Performance


Sticky Session Setting

nginx.ingress.kubernetes.io/upstream-hash-by: "$http_x_actual_ip"

Hashing the upstream by a request header (candidates here: User-Agent, X-Forwarded-For, X-Actual-IP) pins every request with the same header value to the same pod, which skews per-pod RPS.

45 of 53

P99 Latency Is High - Inconsistent Performance


Inconsistent Pod Memory Resource Usage

46 of 53

P99 Latency Is High - Inconsistent Performance


.NET Runtime Bug On AMD Machines

47 of 53

P99 Latency Is High - Inconsistent Performance


Inconsistent Pod CPU Resource Usage

48 of 53

P99 Latency Is High - Inconsistent Performance


VM Host Issue

49 of 53


When CPU Usage Exceeds The Tipping Point, Performance Drops

50 of 53

Workaround


Dummy Pod

51 of 53

Workaround


Dummy Pod

[Diagram: a Dummy Pod's resource requests reserve part of the worker node's resources]
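
A hedged sketch of a dummy pod: a pause container whose requests reserve headroom so the scheduler cannot pack the node past its CPU tipping point, and deleting it instantly frees guaranteed capacity (names and sizes are assumptions):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: dummy-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: dummy-pod
  template:
    metadata:
      labels:
        app: dummy-pod
    spec:
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only the requests matter
          resources:
            requests:
              cpu: "1"       # reserved headroom on the node
              memory: 1Gi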

52 of 53


Workaround Doesn’t Mean That The Problem Is Solved

53 of 53

Takeaways


  • Warmup Setting With Kubernetes Health Check

  • Prometheus Metrics Are Helpful For Network Troubleshooting

  • Same Pod But Inconsistent Performance In Kubernetes

  • Workaround Doesn’t Mean That The Problem Is Solved