1 of 29

Quark Hibernate Container

Yulin Sun, Shaobao Feng

2 of 29

Serverless Container Requirements / Challenges

  1. Low-latency, on-demand user request handling:
    1. Cold startup
    2. Warm startup
  2. High-density deployment:
    • Low memory usage
    • Low CPU usage
  3. Challenges:
    • Memory: a warmed-up container consumes the same memory as a running service
    • CPU: a warmed-up container may consume CPU even when there is no user request

3 of 29

Hibernate Container

  • Hibernation starts after the application finishes initialization
  • Application memory is swapped out to disk
  • Startup (sketched below)
    • Triggered by a user request (HTTP) or a control-plane request
    • Memory is loaded back from disk on demand via page faults
  • Resource consumption
    • Memory: sandbox memory only
    • CPU: none
    • Disk: application-memory swap file
  • Startup latency: low
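
A minimal sketch of the hibernate / wake-up control described above, assuming the sandbox can simply be paused with SIGSTOP and resumed with SIGCONT as the later state-machine slides show. The use of the nix crate and a plain child process standing in for the sandbox are assumptions for illustration, not Quark's actual code.

    // Sketch only: pause/resume a stand-in "sandbox" process with signals.
    // Requires the `nix` crate (an assumption); run on Linux.
    use nix::sys::signal::{kill, Signal};
    use nix::unistd::Pid;
    use std::process::Command;

    fn main() -> nix::Result<()> {
        // Stand-in for a warmed-up sandbox process.
        let child = Command::new("sleep").arg("60").spawn().expect("spawn failed");
        let pid = Pid::from_raw(child.id() as i32);

        // Warm-up -> Hibernate: stop the process; its application memory can now
        // be swapped out to the on-disk swap file.
        kill(pid, Signal::SIGSTOP)?;

        // Hibernate -> Running: a user request (or control-plane request) arrives;
        // resume the process and let page faults pull memory back from disk.
        kill(pid, Signal::SIGCONT)?;
        Ok(())
    }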

| State | Memory Consumption | CPU Usage | Startup Latency | Deployment Density |
|---|---|---|---|---|
| Warm-up | Sandbox + user application | Unknown | Zero | Low |
| Running | Sandbox + user application | Yes | N/A | N/A |
| Hibernate | Sandbox only (~5 MB) | Zero | Low | High |
| Init | Zero | Zero | High | N/A |

4 of 29

Demo/Performance

Memory Usage (RSS)

| | NodeJs | Nginx |
|---|---|---|
| Warm-up | 77.779 MB | 33.872 MB |
| Hibernate | 13.664 MB | 14.396 MB |
| Wakeup from Hibernate | 37.328 MB | 15.512 MB |
| Private Mem | 4.9 MB | 5.3 MB |

Shared memory by mapping (KB)

| Mapping | NodeJs | Nginx |
|---|---|---|
| qkernel.bin | 2456 | 2340 |
| quark | 412 | 412 |
| | 3652 | 4048 |
| | 324 | 268 |
| | 52 | 52 |
| | 288 | 252 |
| | 24 | 24 |
| | 68 | 68 |
| | 24 | 24 |
| libgcc_s.so.1 | 12 | 12 |
| | 0 | 64 |
| | 136 | 136 |
| | 976 | 1028 |
| | 184 | 172 |
| | 140 | 140 |
| | 32 | 32 |
| total | 8780 | 9072 |

Startup latency

| | | NodeJs | Nginx |
|---|---|---|---|
| Warm-up Container | 1st Req | 9 ~ 13 ms | 2 ~ 3 ms |
| | other Req | 1 ~ 5 ms | 0.4 ~ 4 ms |
| Hibernate Container (no PageCache) | 1st Req | 12 ~ 65 ms | 10 ~ 13 ms |
| | other Req | 1 ~ 5 ms | 0.4 ~ 4 ms |
| Hibernate Container (with PageCache) | 1st Req | 4 ~ 20 ms | 1 ~ 2 ms |
| | other Req | 1 ~ 5 ms | 0.4 ~ 4 ms |
| WorkSet | | 2256 pages | 154 pages |

  • Resident Set Size (RSS)
    • Private Mem
    • Shared Mem
  • Memory usage with N instances: N × Private Mem + 1 × Shared Mem

5 of 29

Node.js deployment density limit

Machine Spec

  • CPU: 100 CPU cores
  • Memory: 300 GB

Application

  • Memory: 128 MB
  • CPU: 0.5 core

Running Containers

  • Count: 100 / 0.5 = 200
  • Memory: 200 × 0.2 GB = 40 GB

Maximum Hibernated Containers

  • Count: (300 - 40) GB × 1000 / 5 MB ≈ 52K containers
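
The same arithmetic, written out as a throwaway sketch (the 0.5 core / ~0.2 GB per running container and ~5 MB per hibernated sandbox are the figures from this slide):

    fn main() {
        // Machine spec from the slide
        let total_cores = 100.0_f64;
        let total_mem_gb = 300.0_f64;

        // Per running container: 0.5 core and ~0.2 GB
        let running = total_cores / 0.5;        // 200 running containers
        let running_mem_gb = running * 0.2;     // 40 GB

        // Each hibernated container keeps only ~5 MB of sandbox memory resident
        let hibernated = (total_mem_gb - running_mem_gb) * 1000.0 / 5.0;

        println!("running = {running}, hibernated ~= {hibernated}"); // 200, 52000
    }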

6 of 29

Next Steps

  1. I/O improvements
    1. Batched page loading
    2. NVMe direct I/O
  2. Checkpoint / delta page-set hibernation
  3. WASM Node.js

Goal: not a replacement for warm startup, but another option:

M × warm-started + N × hibernated containers

7 of 29

A New Serverless Application Deployment Mode

kubectl create -f app.yaml

8 of 29

A Knative AutoScale Example

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "100"
        autoscaling.knative.dev/max-hibernate: "18"
        autoscaling.knative.dev/min-hibernate: "3"
    spec:
      containers:
        - image: gcr.io/knative-samples/autoscale-go:0.1

9 of 29

Manual Hibernation of Pods

  • kubectl hibernate pod-xxxx
  • kubectl wakeup pod-xxxx

10 of 29

The Challenges for Kubernetes

  • With thousands of pods per node, the LIST calls to kube-apiserver become too heavy when all nodes restart simultaneously.
  • A hibernated pod releases its CPU/memory; how should the scheduler place pods based on real-time resource usage?
  • A hibernated pod may be woken up at any time; how do we guarantee no OOM on the node? (based on a watermark? see the sketch below)
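
One possible shape for the watermark idea in the last bullet, purely as an illustration; the function name and inputs are hypothetical and this is not an existing Kubernetes or Quark API.

    /// Hypothetical node-agent check: only wake a hibernated pod if the node
    /// would still stay above a free-memory high watermark afterwards.
    fn can_wake(node_free_bytes: u64, pod_workset_bytes: u64, watermark_bytes: u64) -> bool {
        node_free_bytes.saturating_sub(pod_workset_bytes) > watermark_bytes
    }

    fn main() {
        // 2 GiB free, ~37 MiB working set (NodeJs wake-up RSS from the demo slide),
        // keep at least 1 GiB free as the watermark.
        println!("wake allowed: {}", can_wake(2 << 30, 37 << 20, 1 << 30));
    }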

11 of 29

[State diagram: Init →(Cold Start)→ Warm-up →(User Request)→ Running →(Request Finish)→ Warm-up; Warm-up →(SIGSTOP)→ Hibernate →(User Request / SIGCONT)→ Running]
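
Read as code, the lifecycle in the diagram looks roughly like this (a sketch that mirrors the slide's states and transition labels; it is not Quark's implementation):

    // State machine sketch for the lifecycle shown on this slide.
    #[derive(Clone, Copy, Debug)]
    enum State { Init, WarmUp, Running, Hibernate }

    #[derive(Clone, Copy, Debug)]
    enum Event { ColdStart, UserRequest, RequestFinish, SigStop, SigCont }

    fn next(state: State, event: Event) -> State {
        use Event::*;
        use State::*;
        match (state, event) {
            (Init, ColdStart) => WarmUp,
            (WarmUp, UserRequest) => Running,
            (Running, RequestFinish) => WarmUp,
            (WarmUp, SigStop) => Hibernate,
            // A user request SIGCONTs the hibernated sandbox and serves the request.
            (Hibernate, UserRequest) | (Hibernate, SigCont) => Running,
            (s, _) => s, // other events do not change the state
        }
    }

    fn main() {
        let mut s = State::Init;
        for e in [Event::ColdStart, Event::SigStop, Event::UserRequest] {
            s = next(s, e);
            println!("{:?} -> {:?}", e, s);
        }
    }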

12 of 29

[Architecture diagram: a Front End and a Controller talk to the Node Agent on each Worker Node over a message bus; requests, control messages, and control data flow to the containers managed by the Node Agent.]

13 of 29

[Architecture diagram: Guest Applications run on the QKernel, which manages the Page Tables and the Bitmap Allocator; the Swapping Mgr and Mem Reclaim Mgr work with QVisor on top of the Host Linux Kernel. A Swap File backs full swap-out and page-fault swap-in; a REAP File backs batch swap-in.]
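
A very rough sketch of the swap-out / page-fault swap-in idea from the diagram, using an in-memory Vec to stand in for the on-disk swap file. The SwapFile type and its methods are illustrative assumptions; the real swap manager additionally does REAP batch swap-in and memory reclaim.

    use std::collections::HashMap;

    const PAGE_SIZE: usize = 4096;

    /// Stand-in for the on-disk swap file: guest page address -> offset into `data`.
    struct SwapFile {
        offsets: HashMap<u64, usize>,
        data: Vec<u8>,
    }

    impl SwapFile {
        /// Full swap-out: append the page and remember where it went.
        fn swap_out(&mut self, guest_addr: u64, page: &[u8; PAGE_SIZE]) {
            self.offsets.insert(guest_addr, self.data.len());
            self.data.extend_from_slice(page);
        }

        /// Page-fault swap-in: copy the page back on first access after wake-up.
        fn swap_in(&self, guest_addr: u64, page: &mut [u8; PAGE_SIZE]) -> bool {
            match self.offsets.get(&guest_addr) {
                Some(&off) => {
                    page.copy_from_slice(&self.data[off..off + PAGE_SIZE]);
                    true
                }
                None => false,
            }
        }
    }

    fn main() {
        let mut swap = SwapFile { offsets: HashMap::new(), data: Vec::new() };
        let page = [0xABu8; PAGE_SIZE];
        swap.swap_out(0x7f00_0000_0000, &page);

        let mut restored = [0u8; PAGE_SIZE];
        assert!(swap.swap_in(0x7f00_0000_0000, &mut restored));
        assert_eq!(restored[0], 0xAB);
    }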

14 of 29

[State diagram (extended): states Init, Warm, Running, Hibernate, and Hibernate Running; transitions labeled Cold Start, User Request, Request Finish, Wake-up, SIGSTOP, and SIGCONT, with numbered steps ① through ⑨.]

15 of 29

Bitmap Page Allocator

[Diagram: the allocator keeps a Head/Tail linked list of 4 MB memory blocks. Each 4 MB block starts with a 4 KB control page followed by 4 KB data pages #1 through #1023. The control page holds a Next pointer, a free-page bitmap (an L1 u64 whose bits summarize an L2 array of 16 u64 words covering bits 0 ~ 1023), and a refcount array of AtomicU16 entries #0 through #1023.]
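
The control-page layout above maps naturally onto roughly the following Rust shapes, plus a two-level free-page lookup. This is a layout sketch under the sizes on the slide (4 MB block = 1 control page + 1023 data pages), with illustrative names rather than Quark's source.

    #![allow(dead_code)]
    use std::sync::atomic::{AtomicU16, AtomicU64};

    const PAGES_PER_BLOCK: usize = 1024; // 4 MB / 4 KB; page #0 is the control page

    /// Two-level free-page bitmap: one L1 u64 where each bit says whether the
    /// corresponding L2 u64 word (64 pages) still has a free page.
    struct FreePageBitmap {
        l1: AtomicU64,
        l2: [AtomicU64; PAGES_PER_BLOCK / 64], // 16 words covering bits 0..1023
    }

    /// Control page at the start of every 4 MB memory block.
    struct ControlPage {
        next: AtomicU64,                      // link to the next 4 MB block
        free_pages: FreePageBitmap,           // free/used state of data pages #1..#1023
        refcnt: [AtomicU16; PAGES_PER_BLOCK], // per-page reference counts
    }

    /// The allocator itself keeps a head/tail linked list of 4 MB blocks.
    struct BitmapPageAllocator {
        head: AtomicU64,
        tail: AtomicU64,
    }

    /// Illustrative, non-atomic two-level lookup of a free page index.
    fn find_free_page(l1: u64, l2: &[u64; 16]) -> Option<usize> {
        if l1 == 0 {
            return None;
        }
        let word = l1.trailing_zeros() as usize;      // first L2 word with free pages
        let bit = l2[word].trailing_zeros() as usize; // first free page in that word
        Some(word * 64 + bit)
    }

    fn main() {
        let mut l2 = [0u64; 16];
        l2[2] = 0b1000; // page 2 * 64 + 3 = 131 is free
        let l1 = 1u64 << 2;
        assert_eq!(find_free_page(l1, &l2), Some(131));
    }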

16 of 29

[Architecture comparison diagram: (1) a Linux container (cgroups, namespaces) runs container applications as Linux processes directly on the Linux host kernel; (2) a Linux virtual machine runs container applications on a Linux guest kernel under QEMU / Firecracker and KVM; (3) Quark runs guest applications on the QKernel guest kernel (memory management, process management, network stack, virtual file system) with QVisor as the virtual machine monitor (VMM) over KVM. A system-call virtualization layer separates the guest's Linux system calls from Quark system calls.]

17 of 29

TSoR Cluster

[Diagram: a TSoR cluster of Nodes #1 through #4 connected by RDMA connections. Each cluster node runs an RDMA Service, a TSoR Gateway, and Quark Pods #1 through #N (each with a TSoR Client) on top of an RDMA NIC; TCP ingress/egress traffic to the external network passes through the gateway, and the Cluster Orchestration System acts as the orchestration control plane.]

18 of 29

Kubernetes Cluster

[Diagram: a Kubernetes cluster with Node 1 (Pod#1, Pod#2) and Node 2 (Pod#3) plus external traffic; flows numbered ① through ⑥.]

19 of 29

RDMA Service

[Diagram: RDMA Service internals. In each Quark pod, the cloud-native application (guest user space) sits on the system-call virtualization layer, and the pod's TSoR Client (guest kernel space) shares an SHM region of data buffers with the node's RDMA Service. The RDMA Service contains the RDMA Srv Client Mgr, the RDMA Connection Mgr, and a TSoR Control Plane Agent connected to the orchestration control plane, and drives the RNIC through SQ/CQ queues; the TSoR Gateway attaches through its own TSoR Client.]

20 of 29

RDMA Channel

[Diagram: an RDMA connection between the RDMA Services / RDMA Connection Mgrs on Node#1 and Node#2 carries one control RDMA channel and multiple data RDMA channels; each channel pairs a write ring buffer on one node with a read ring buffer on the other.]
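
A sketch of the per-channel ring buffer pairing in the diagram: the sender fills its local write ring, and the RDMA service moves the bytes into the peer's read ring. Here both rings are modeled as plain in-process buffers; the real channels live in registered memory and are driven by RDMA writes, and all names are illustrative.

    /// Minimal byte ring buffer with monotonically increasing head/tail indices.
    struct RingBuffer {
        buf: Vec<u8>,
        head: usize, // next byte to consume
        tail: usize, // next byte to produce
    }

    impl RingBuffer {
        fn new(capacity: usize) -> Self {
            Self { buf: vec![0; capacity], head: 0, tail: 0 }
        }
        fn free_space(&self) -> usize {
            self.buf.len() - (self.tail - self.head)
        }
        /// Producer side (the write ring on the sending node).
        fn produce(&mut self, data: &[u8]) -> usize {
            let n = data.len().min(self.free_space());
            let cap = self.buf.len();
            for &b in &data[..n] {
                self.buf[self.tail % cap] = b;
                self.tail += 1;
            }
            n
        }
        /// Consumer side (the read ring on the receiving node).
        fn consume(&mut self, out: &mut [u8]) -> usize {
            let n = out.len().min(self.tail - self.head);
            let cap = self.buf.len();
            for o in out[..n].iter_mut() {
                *o = self.buf[self.head % cap];
                self.head += 1;
            }
            n
        }
    }

    fn main() {
        // The RDMA service would copy produced bytes from Node#1's write ring
        // into Node#2's read ring with an RDMA write; here we reuse one buffer.
        let mut ring = RingBuffer::new(16);
        assert_eq!(ring.produce(b"hello"), 5);
        let mut out = [0u8; 8];
        let n = ring.consume(&mut out);
        assert_eq!(&out[..n], b"hello");
    }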

21 of 29

TSoR Gateway

[Diagram: TSoR Gateway. A Gateway Control Plane Agent (connected to the orchestration control plane) and ingress/egress TCP layers bridge external TCP ingress/egress traffic to TSoR ingress/egress traffic, which reaches Quark pods through the gateway's TSoR Client, the node's RDMA Service, and the RDMA NIC.]

22 of 29

[Diagram: Cluster Node#1 (subnet 192.168.0.0/24) runs an RDMA Service whose clients are Client1 for Pod#1 (192.168.0.1), Client2 for Pod#2 (192.168.0.2), and Client3/Client4 for the TSoR Egress and Ingress Gateways (external IP 202.21.11.5). RDMA Conn#1/#2/#3 connect it to Cluster Node#2 (192.168.1.0/24, Pod#3 at 192.168.1.5), Cluster Node#3 (192.168.2.0/24, Pod#4 at 192.168.2.8), and Cluster Node#4 (192.168.3.0/24, Pod#5 at 192.168.3.6).]

TSoR Route Table (Node#1)

| Dst | Interface |
|---|---|
| 192.168.0.1 | Client#1 (Local) |
| 192.168.0.2 | Client#2 (Local) |
| 192.168.1.0/24 | RDMA Conn#1 |
| 192.168.2.0/24 | RDMA Conn#2 |
| 192.168.3.0/24 | RDMA Conn#3 |
| 10.5.0.0/16 | Cluster IP handler |
| * | Client#4 (Egress) |

TSoR Cluster IP Table

| Service | Cluster IP | Pod IPs |
|---|---|---|
| svc1 | 10.5.6.8:546 | 192.168.1.5:80, 192.168.2.8:80, 192.168.3.6:80 |

TSoR Ingress Gateway Table

| External EP | Internal EP |
|---|---|
| 202.21.11.5:80 | 10.5.6.8:546 |
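
The route table above boils down to a small lookup. Here is an illustrative version for Node#1 with the slide's values hard-coded; the Interface enum and route function are assumptions for the sketch, not TSoR's API.

    use std::net::Ipv4Addr;

    /// Where the TSoR client sends a packet, per the route table on this slide.
    enum Interface {
        Local(&'static str),   // a pod on the same node
        RdmaConn(u32),         // an RDMA connection to another node
        ClusterIpHandler,      // cluster-IP load balancing
        EgressGateway,         // default: leave the cluster through the gateway
    }

    fn in_subnet(ip: Ipv4Addr, net: Ipv4Addr, prefix_len: u32) -> bool {
        let mask = u32::MAX << (32 - prefix_len);
        (u32::from(ip) & mask) == (u32::from(net) & mask)
    }

    fn route(dst: Ipv4Addr) -> Interface {
        if dst == Ipv4Addr::new(192, 168, 0, 1) { return Interface::Local("Client#1"); }
        if dst == Ipv4Addr::new(192, 168, 0, 2) { return Interface::Local("Client#2"); }
        if in_subnet(dst, Ipv4Addr::new(192, 168, 1, 0), 24) { return Interface::RdmaConn(1); }
        if in_subnet(dst, Ipv4Addr::new(192, 168, 2, 0), 24) { return Interface::RdmaConn(2); }
        if in_subnet(dst, Ipv4Addr::new(192, 168, 3, 0), 24) { return Interface::RdmaConn(3); }
        if in_subnet(dst, Ipv4Addr::new(10, 5, 0, 0), 16) { return Interface::ClusterIpHandler; }
        Interface::EgressGateway
    }

    fn main() {
        // e.g. Pod#1 (192.168.0.1) sending to Pod#3 (192.168.1.5) goes over RDMA Conn#1.
        for dst in [
            Ipv4Addr::new(192, 168, 0, 2),
            Ipv4Addr::new(192, 168, 1, 5),
            Ipv4Addr::new(10, 5, 6, 8),
            Ipv4Addr::new(8, 8, 8, 8),
        ] {
            match route(dst) {
                Interface::Local(c) => println!("{dst}: deliver locally to {c}"),
                Interface::RdmaConn(n) => println!("{dst}: send over RDMA Conn#{n}"),
                Interface::ClusterIpHandler => println!("{dst}: cluster-IP load balancing"),
                Interface::EgressGateway => println!("{dst}: leave via the egress gateway"),
            }
        }
    }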

23 of 29

24 of 29

25 of 29

26 of 29


[Diagram: TCP-over-RDMA data path. A container's write(socket, buf, N) goes through SysWrite into the write buffer of the shared region between the TSoR client (TSoR_cli) and the RDMA service (rdma_svc); a request is pushed onto the submit queue. The RDMA service pops the request, posts it to the RDMA QP send queue (SQ), and transfers the payload with RDMA Write with Immediate. On the peer node, the RDMA service handles the completion (CQ), places the data into the read buffer of that node's shared region, reports it through the complete queue, and the receiving container's read(socket, buf, N) consumes it via SysRead.]
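
A toy rendering of that write path, with in-process queues standing in for the shared-memory submit/complete queues and the RDMA QP; every name here is illustrative.

    use std::collections::VecDeque;

    #[derive(Debug)]
    enum Request {
        /// "The write buffer of socket `sock` has `len` new bytes; RDMA-write them."
        Write { sock: i32, len: usize },
    }

    struct TsorClient {
        write_buf: Vec<u8>,              // per-socket write buffer in the shared region
        submit_queue: VecDeque<Request>, // stand-in for the shared submit queue
    }

    impl TsorClient {
        /// What the virtualized write(2) does: copy into the shared write buffer
        /// and enqueue a request for the RDMA service.
        fn sys_write(&mut self, sock: i32, data: &[u8]) -> usize {
            self.write_buf.extend_from_slice(data);
            self.submit_queue.push_back(Request::Write { sock, len: data.len() });
            data.len()
        }
    }

    /// What the RDMA service does: pop requests and (here, pretend to) post an
    /// RDMA Write-with-Immediate that lands in the peer's read buffer.
    fn rdma_service_poll(client: &mut TsorClient) {
        while let Some(Request::Write { sock, len }) = client.submit_queue.pop_front() {
            println!("RDMA write-imm: socket {sock}, {len} bytes");
            client.write_buf.drain(..len);
        }
    }

    fn main() {
        let mut cli = TsorClient { write_buf: Vec::new(), submit_queue: VecDeque::new() };
        cli.sys_write(3, b"GET / HTTP/1.1\r\n\r\n");
        rdma_service_poll(&mut cli);
    }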

27 of 29

[Diagram: two nodes, each running a process inside a sandbox, connected by a TCP connection.]

28 of 29

  • Scalability
  • Multi-tenant

[Diagram: requests to http://projectid.xxx/ enter through Ingress and an API Gateway, and a Function Dispatcher routes them to Container#1 / Container#2; a Resource Manager, per-node Node Agent, Event Queue, and Dispatcher coordinate the containers.]

29 of 29

[Diagram: multi-region service connectivity. In Region 1, Pod A reaches Service B through Endpoint B (Cluster IP 10.5.8.5:80) whose members are 192.168.0.1:80, 192.168.2.3:80, and 192.168.5.2:128; the last member points at the Egress gateway, and DNS resolves PodB.local to 10.5.8.5. In Region 2, the Ingress gateway (EIP 202.1.2.3:90) forwards to Endpoint B (Cluster IP 10.5.6.7:8080) with members 192.168.0.1:8080 and 192.168.82.3:8080, and DNS resolves PodB.local to 10.5.6.7. Region 3 exposes another Ingress gateway at EIP 68.5.4.3:507 in front of its Pod B replicas. Service definitions are synchronized across regions by a Global K8S Mgr and an Adaptor through an object store (S3/Redis) per VPC.]

Egress Mapping (Region 1)

| Port | External EP |
|---|---|
| 128 | 202.1.2.3:90 |
| 129 | 68.5.4.3:507 |
| .. | .. |

Ingress Mapping (Region 2)

| Port | Internal EP |
|---|---|
| 90 | 10.5.6.7:8080 |