1 of 29

Quark Hibernate Container

Yulin Sun, Shaobao Feng

2 of 29

Serverless Container Requirements / Challenges

  1. Low-latency, on-demand user request handling:
    1. Cold startup
    2. Warm startup
  2. High-density deployment:
    • Low memory usage
    • Low CPU usage
  3. Challenges:
    • Memory: a warmed-up container consumes the same memory as a running service
    • CPU: a warmed-up container may consume CPU even when there is no user request

3 of 29

Hibernate Container

  • Hibernation starts after the application finishes initialization
  • Application memory is swapped out to disk
  • Startup (sketched below)
    • Triggered by a user request (HTTP) or a control-plane request
    • Memory is loaded back from disk on demand via page faults
  • Resource consumption
    • Memory: sandbox memory only
    • CPU: none
    • Disk: application-memory swap file
  • Startup latency: low
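
A minimal sketch of the hibernate / wake-up control described above, assuming the sandbox can simply be paused with SIGSTOP and resumed with SIGCONT as the later state-machine slides show. The use of the nix crate and a plain child process standing in for the sandbox are assumptions for illustration, not Quark's actual code.

    // Sketch only: pause/resume a stand-in "sandbox" process with signals.
    // Requires the `nix` crate (an assumption); run on Linux.
    use nix::sys::signal::{kill, Signal};
    use nix::unistd::Pid;
    use std::process::Command;

    fn main() -> nix::Result<()> {
        // Stand-in for a warmed-up sandbox process.
        let child = Command::new("sleep").arg("60").spawn().expect("spawn failed");
        let pid = Pid::from_raw(child.id() as i32);

        // Warm-up -> Hibernate: stop the process; its application memory can now
        // be swapped out to the on-disk swap file.
        kill(pid, Signal::SIGSTOP)?;

        // Hibernate -> Running: a user request (or control-plane request) arrives;
        // resume the process and let page faults pull memory back from disk.
        kill(pid, Signal::SIGCONT)?;
        Ok(())
    }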

| State | Memory Consumption | CPU Usage | Startup Latency | Deployment Density |
|---|---|---|---|---|
| Warm-up | Sandbox + user application | Unknown | Zero | Low |
| Running | Sandbox + user application | Yes | N/A | N/A |
| Hibernate | Sandbox only (~5 MB) | Zero | Low | High |
| Init | Zero | Zero | High | N/A |

4 of 29

Demo/Performance

Memory Usage (RSS)

| | NodeJs | Nginx |
|---|---|---|
| Warm-up | 77.779 MB | 33.872 MB |
| Hibernate | 13.664 MB | 14.396 MB |
| Wakeup from Hibernate | 37.328 MB | 15.512 MB |
| Private Mem | 4.9 MB | 5.3 MB |

Shared memory by mapping (KB)

| Mapping | NodeJs | Nginx |
|---|---|---|
| qkernel.bin | 2456 | 2340 |
| quark | 412 | 412 |
| | 3652 | 4048 |
| | 324 | 268 |
| | 52 | 52 |
| | 288 | 252 |
| | 24 | 24 |
| | 68 | 68 |
| | 24 | 24 |
| libgcc_s.so.1 | 12 | 12 |
| | 0 | 64 |
| | 136 | 136 |
| | 976 | 1028 |
| | 184 | 172 |
| | 140 | 140 |
| | 32 | 32 |
| total | 8780 | 9072 |

Startup latency

| | | NodeJs | Nginx |
|---|---|---|---|
| Warm-up Container | 1st Req | 9 ~ 13 ms | 2 ~ 3 ms |
| | other Req | 1 ~ 5 ms | 0.4 ~ 4 ms |
| Hibernate Container (no PageCache) | 1st Req | 12 ~ 65 ms | 10 ~ 13 ms |
| | other Req | 1 ~ 5 ms | 0.4 ~ 4 ms |
| Hibernate Container (with PageCache) | 1st Req | 4 ~ 20 ms | 1 ~ 2 ms |
| | other Req | 1 ~ 5 ms | 0.4 ~ 4 ms |
| WorkSet | | 2256 pages | 154 pages |

  • Resident Set Size (RSS)
    • Private Mem
    • Shared Mem
  • Memory usage with N instances: N × Private Mem + 1 × Shared Mem

5 of 29

Node.js deployment density limit

Machine Spec

  • CPU: 100 CPU cores
  • Memory: 300 GB

Application

  • Memory: 128 MB
  • CPU: 0.5 core

Running Containers

  • Count: 100 / 0.5 = 200
  • Memory: 200 × 0.2 GB = 40 GB

Maximum Hibernated Containers

  • Count: (300 - 40) GB × 1000 / 5 MB ≈ 52K containers
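
The same arithmetic, written out as a throwaway sketch (the 0.5 core / ~0.2 GB per running container and ~5 MB per hibernated sandbox are the figures from this slide):

    fn main() {
        // Machine spec from the slide
        let total_cores = 100.0_f64;
        let total_mem_gb = 300.0_f64;

        // Per running container: 0.5 core and ~0.2 GB
        let running = total_cores / 0.5;        // 200 running containers
        let running_mem_gb = running * 0.2;     // 40 GB

        // Each hibernated container keeps only ~5 MB of sandbox memory resident
        let hibernated = (total_mem_gb - running_mem_gb) * 1000.0 / 5.0;

        println!("running = {running}, hibernated ~= {hibernated}"); // 200, 52000
    }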

6 of 29

Next Steps

  1. I/O improvements
    1. Batched page loading
    2. NVMe direct I/O
  2. Checkpoint / delta page-set hibernation
  3. WASM Node.js

Goal: not a replacement for warm startup, but another option:

M × warm-started + N × hibernated containers

7 of 29

A New Serverless Application Deployment Mode

kubectl create -f app.yaml

8 of 29

A Knative AutoScale Example

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: autoscale-go
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: concurrency
        autoscaling.knative.dev/target: "10"
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "100"
        autoscaling.knative.dev/max-hibernate: "18"
        autoscaling.knative.dev/min-hibernate: "3"
    spec:
      containers:
        - image: gcr.io/knative-samples/autoscale-go:0.1

9 of 29

Manual Hibernation of Pods

  • kubectl hibernate pod-xxxx
  • kubectl wakeup pod-xxxx

10 of 29

The Challenges for Kubernetes

  • With thousands of pods per node, the LIST calls to kube-apiserver become too heavy when all nodes restart simultaneously.
  • A hibernated pod releases its CPU/memory; how should the scheduler place pods based on real-time resource usage?
  • A hibernated pod may be woken up at any time; how do we guarantee no OOM on the node? (based on a watermark? see the sketch below)
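
One possible shape for the watermark idea in the last bullet, purely as an illustration; the function name and inputs are hypothetical and this is not an existing Kubernetes or Quark API.

    /// Hypothetical node-agent check: only wake a hibernated pod if the node
    /// would still stay above a free-memory high watermark afterwards.
    fn can_wake(node_free_bytes: u64, pod_workset_bytes: u64, watermark_bytes: u64) -> bool {
        node_free_bytes.saturating_sub(pod_workset_bytes) > watermark_bytes
    }

    fn main() {
        // 2 GiB free, ~37 MiB working set (NodeJs wake-up RSS from the demo slide),
        // keep at least 1 GiB free as the watermark.
        println!("wake allowed: {}", can_wake(2 << 30, 37 << 20, 1 << 30));
    }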

11 of 29

[State diagram: Init →(Cold Start)→ Warm-up →(User Request)→ Running →(Request Finish)→ Warm-up; Warm-up →(SIGSTOP)→ Hibernate →(User Request / SIGCONT)→ Running]
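
Read as code, the lifecycle in the diagram looks roughly like this (a sketch that mirrors the slide's states and transition labels; it is not Quark's implementation):

    // State machine sketch for the lifecycle shown on this slide.
    #[derive(Clone, Copy, Debug)]
    enum State { Init, WarmUp, Running, Hibernate }

    #[derive(Clone, Copy, Debug)]
    enum Event { ColdStart, UserRequest, RequestFinish, SigStop, SigCont }

    fn next(state: State, event: Event) -> State {
        use Event::*;
        use State::*;
        match (state, event) {
            (Init, ColdStart) => WarmUp,
            (WarmUp, UserRequest) => Running,
            (Running, RequestFinish) => WarmUp,
            (WarmUp, SigStop) => Hibernate,
            // A user request SIGCONTs the hibernated sandbox and serves the request.
            (Hibernate, UserRequest) | (Hibernate, SigCont) => Running,
            (s, _) => s, // other events do not change the state
        }
    }

    fn main() {
        let mut s = State::Init;
        for e in [Event::ColdStart, Event::SigStop, Event::UserRequest] {
            s = next(s, e);
            println!("{:?} -> {:?}", e, s);
        }
    }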

12 of 29

[Architecture diagram: a Front End and a Controller talk to the Node Agent on each Worker Node over a message bus; requests, control messages, and control data flow to the containers managed by the Node Agent.]

13 of 29

[Architecture diagram: Guest Applications run on the QKernel, which manages the Page Tables and the Bitmap Allocator; the Swapping Mgr and Mem Reclaim Mgr work with QVisor on top of the Host Linux Kernel. A Swap File backs full swap-out and page-fault swap-in; a REAP File backs batch swap-in.]
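
A very rough sketch of the swap-out / page-fault swap-in idea from the diagram, using an in-memory Vec to stand in for the on-disk swap file. The SwapFile type and its methods are illustrative assumptions; the real swap manager additionally does REAP batch swap-in and memory reclaim.

    use std::collections::HashMap;

    const PAGE_SIZE: usize = 4096;

    /// Stand-in for the on-disk swap file: guest page address -> offset into `data`.
    struct SwapFile {
        offsets: HashMap<u64, usize>,
        data: Vec<u8>,
    }

    impl SwapFile {
        /// Full swap-out: append the page and remember where it went.
        fn swap_out(&mut self, guest_addr: u64, page: &[u8; PAGE_SIZE]) {
            self.offsets.insert(guest_addr, self.data.len());
            self.data.extend_from_slice(page);
        }

        /// Page-fault swap-in: copy the page back on first access after wake-up.
        fn swap_in(&self, guest_addr: u64, page: &mut [u8; PAGE_SIZE]) -> bool {
            match self.offsets.get(&guest_addr) {
                Some(&off) => {
                    page.copy_from_slice(&self.data[off..off + PAGE_SIZE]);
                    true
                }
                None => false,
            }
        }
    }

    fn main() {
        let mut swap = SwapFile { offsets: HashMap::new(), data: Vec::new() };
        let page = [0xABu8; PAGE_SIZE];
        swap.swap_out(0x7f00_0000_0000, &page);

        let mut restored = [0u8; PAGE_SIZE];
        assert!(swap.swap_in(0x7f00_0000_0000, &mut restored));
        assert_eq!(restored[0], 0xAB);
    }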

14 of 29

[State diagram (extended): states Init, Warm, Running, Hibernate, and Hibernate Running; transitions labeled Cold Start, User Request, Request Finish, Wake-up, SIGSTOP, and SIGCONT, with numbered steps ① through ⑨.]

15 of 29

Bitmap Page Allocator

[Diagram: the allocator keeps a Head/Tail linked list of 4 MB memory blocks. Each 4 MB block starts with a 4 KB control page followed by 4 KB data pages #1 through #1023. The control page holds a Next pointer, a free-page bitmap (an L1 u64 whose bits summarize an L2 array of 16 u64 words covering bits 0 ~ 1023), and a refcount array of AtomicU16 entries #0 through #1023.]
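
The control-page layout above maps naturally onto roughly the following Rust shapes, plus a two-level free-page lookup. This is a layout sketch under the sizes on the slide (4 MB block = 1 control page + 1023 data pages), with illustrative names rather than Quark's source.

    #![allow(dead_code)]
    use std::sync::atomic::{AtomicU16, AtomicU64};

    const PAGES_PER_BLOCK: usize = 1024; // 4 MB / 4 KB; page #0 is the control page

    /// Two-level free-page bitmap: one L1 u64 where each bit says whether the
    /// corresponding L2 u64 word (64 pages) still has a free page.
    struct FreePageBitmap {
        l1: AtomicU64,
        l2: [AtomicU64; PAGES_PER_BLOCK / 64], // 16 words covering bits 0..1023
    }

    /// Control page at the start of every 4 MB memory block.
    struct ControlPage {
        next: AtomicU64,                      // link to the next 4 MB block
        free_pages: FreePageBitmap,           // free/used state of data pages #1..#1023
        refcnt: [AtomicU16; PAGES_PER_BLOCK], // per-page reference counts
    }

    /// The allocator itself keeps a head/tail linked list of 4 MB blocks.
    struct BitmapPageAllocator {
        head: AtomicU64,
        tail: AtomicU64,
    }

    /// Illustrative, non-atomic two-level lookup of a free page index.
    fn find_free_page(l1: u64, l2: &[u64; 16]) -> Option<usize> {
        if l1 == 0 {
            return None;
        }
        let word = l1.trailing_zeros() as usize;      // first L2 word with free pages
        let bit = l2[word].trailing_zeros() as usize; // first free page in that word
        Some(word * 64 + bit)
    }

    fn main() {
        let mut l2 = [0u64; 16];
        l2[2] = 0b1000; // page 2 * 64 + 3 = 131 is free
        let l1 = 1u64 << 2;
        assert_eq!(find_free_page(l1, &l2), Some(131));
    }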

16 of 29

[Architecture comparison diagram: (1) a Linux container (cgroups, namespaces) runs container applications as Linux processes directly on the Linux host kernel; (2) a Linux virtual machine runs container applications on a Linux guest kernel under QEMU / Firecracker and KVM; (3) Quark runs guest applications on the QKernel guest kernel (memory management, process management, network stack, virtual file system) with QVisor as the virtual machine monitor (VMM) over KVM. A system-call virtualization layer separates the guest's Linux system calls from Quark system calls.]

17 of 29

TSoR Cluster

[Diagram: a TSoR cluster of Nodes #1 through #4 connected by RDMA connections. Each cluster node runs an RDMA Service, a TSoR Gateway, and Quark Pods #1 through #N (each with a TSoR Client) on top of an RDMA NIC; TCP ingress/egress traffic to the external network passes through the gateway, and the Cluster Orchestration System acts as the orchestration control plane.]

18 of 29

Kubernetes Cluster

[Diagram: a Kubernetes cluster with Node 1 (Pod#1, Pod#2) and Node 2 (Pod#3) plus external traffic; flows numbered ① through ⑥.]

19 of 29

RDMA Service

[Diagram: RDMA Service internals. In each Quark pod, the cloud-native application (guest user space) sits on the system-call virtualization layer, and the pod's TSoR Client (guest kernel space) shares an SHM region of data buffers with the node's RDMA Service. The RDMA Service contains the RDMA Srv Client Mgr, the RDMA Connection Mgr, and a TSoR Control Plane Agent connected to the orchestration control plane, and drives the RNIC through SQ/CQ queues; the TSoR Gateway attaches through its own TSoR Client.]

20 of 29

RDMA Channel

[Diagram: an RDMA connection between the RDMA Services / RDMA Connection Mgrs on Node#1 and Node#2 carries one control RDMA channel and multiple data RDMA channels; each channel pairs a write ring buffer on one node with a read ring buffer on the other.]
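
A sketch of the per-channel ring buffer pairing in the diagram: the sender fills its local write ring, and the RDMA service moves the bytes into the peer's read ring. Here both rings are modeled as plain in-process buffers; the real channels live in registered memory and are driven by RDMA writes, and all names are illustrative.

    /// Minimal byte ring buffer with monotonically increasing head/tail indices.
    struct RingBuffer {
        buf: Vec<u8>,
        head: usize, // next byte to consume
        tail: usize, // next byte to produce
    }

    impl RingBuffer {
        fn new(capacity: usize) -> Self {
            Self { buf: vec![0; capacity], head: 0, tail: 0 }
        }
        fn free_space(&self) -> usize {
            self.buf.len() - (self.tail - self.head)
        }
        /// Producer side (the write ring on the sending node).
        fn produce(&mut self, data: &[u8]) -> usize {
            let n = data.len().min(self.free_space());
            let cap = self.buf.len();
            for &b in &data[..n] {
                self.buf[self.tail % cap] = b;
                self.tail += 1;
            }
            n
        }
        /// Consumer side (the read ring on the receiving node).
        fn consume(&mut self, out: &mut [u8]) -> usize {
            let n = out.len().min(self.tail - self.head);
            let cap = self.buf.len();
            for o in out[..n].iter_mut() {
                *o = self.buf[self.head % cap];
                self.head += 1;
            }
            n
        }
    }

    fn main() {
        // The RDMA service would copy produced bytes from Node#1's write ring
        // into Node#2's read ring with an RDMA write; here we reuse one buffer.
        let mut ring = RingBuffer::new(16);
        assert_eq!(ring.produce(b"hello"), 5);
        let mut out = [0u8; 8];
        let n = ring.consume(&mut out);
        assert_eq!(&out[..n], b"hello");
    }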

21 of 29

TSoR Gateway

[Diagram: TSoR Gateway. A Gateway Control Plane Agent (connected to the orchestration control plane) and ingress/egress TCP layers bridge external TCP ingress/egress traffic to TSoR ingress/egress traffic, which reaches Quark pods through the gateway's TSoR Client, the node's RDMA Service, and the RDMA NIC.]

22 of 29

[Diagram: Cluster Node#1 (subnet 192.168.0.0/24) runs an RDMA Service whose clients are Client1 for Pod#1 (192.168.0.1), Client2 for Pod#2 (192.168.0.2), and Client3/Client4 for the TSoR Egress and Ingress Gateways (external IP 202.21.11.5). RDMA Conn#1/#2/#3 connect it to Cluster Node#2 (192.168.1.0/24, Pod#3 at 192.168.1.5), Cluster Node#3 (192.168.2.0/24, Pod#4 at 192.168.2.8), and Cluster Node#4 (192.168.3.0/24, Pod#5 at 192.168.3.6).]

TSoR Route Table (Node#1)

| Dst | Interface |
|---|---|
| 192.168.0.1 | Client#1 (Local) |
| 192.168.0.2 | Client#2 (Local) |
| 192.168.1.0/24 | RDMA Conn#1 |
| 192.168.2.0/24 | RDMA Conn#2 |
| 192.168.3.0/24 | RDMA Conn#3 |
| 10.5.0.0/16 | Cluster IP handler |
| * | Client#4 (Egress) |

TSoR Cluster IP Table

| Service | Cluster IP | Pod IPs |
|---|---|---|
| svc1 | 10.5.6.8:546 | 192.168.1.5:80, 192.168.2.8:80, 192.168.3.6:80 |

TSoR Ingress Gateway Table

| External EP | Internal EP |
|---|---|
| 202.21.11.5:80 | 10.5.6.8:546 |
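
The route table above boils down to a small lookup. Here is an illustrative version for Node#1 with the slide's values hard-coded; the Interface enum and route function are assumptions for the sketch, not TSoR's API.

    use std::net::Ipv4Addr;

    /// Where the TSoR client sends a packet, per the route table on this slide.
    enum Interface {
        Local(&'static str),   // a pod on the same node
        RdmaConn(u32),         // an RDMA connection to another node
        ClusterIpHandler,      // cluster-IP load balancing
        EgressGateway,         // default: leave the cluster through the gateway
    }

    fn in_subnet(ip: Ipv4Addr, net: Ipv4Addr, prefix_len: u32) -> bool {
        let mask = u32::MAX << (32 - prefix_len);
        (u32::from(ip) & mask) == (u32::from(net) & mask)
    }

    fn route(dst: Ipv4Addr) -> Interface {
        if dst == Ipv4Addr::new(192, 168, 0, 1) { return Interface::Local("Client#1"); }
        if dst == Ipv4Addr::new(192, 168, 0, 2) { return Interface::Local("Client#2"); }
        if in_subnet(dst, Ipv4Addr::new(192, 168, 1, 0), 24) { return Interface::RdmaConn(1); }
        if in_subnet(dst, Ipv4Addr::new(192, 168, 2, 0), 24) { return Interface::RdmaConn(2); }
        if in_subnet(dst, Ipv4Addr::new(192, 168, 3, 0), 24) { return Interface::RdmaConn(3); }
        if in_subnet(dst, Ipv4Addr::new(10, 5, 0, 0), 16) { return Interface::ClusterIpHandler; }
        Interface::EgressGateway
    }

    fn main() {
        // e.g. Pod#1 (192.168.0.1) sending to Pod#3 (192.168.1.5) goes over RDMA Conn#1.
        for dst in [
            Ipv4Addr::new(192, 168, 0, 2),
            Ipv4Addr::new(192, 168, 1, 5),
            Ipv4Addr::new(10, 5, 6, 8),
            Ipv4Addr::new(8, 8, 8, 8),
        ] {
            match route(dst) {
                Interface::Local(c) => println!("{dst}: deliver locally to {c}"),
                Interface::RdmaConn(n) => println!("{dst}: send over RDMA Conn#{n}"),
                Interface::ClusterIpHandler => println!("{dst}: cluster-IP load balancing"),
                Interface::EgressGateway => println!("{dst}: leave via the egress gateway"),
            }
        }
    }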

23 of 29

24 of 29

25 of 29

26 of 29


[Diagram: TCP-over-RDMA data path. A container's write(socket, buf, N) goes through SysWrite into the write buffer of the shared region between the TSoR client (TSoR_cli) and the RDMA service (rdma_svc); a request is pushed onto the submit queue. The RDMA service pops the request, posts it to the RDMA QP send queue (SQ), and transfers the payload with RDMA Write with Immediate. On the peer node, the RDMA service handles the completion (CQ), places the data into the read buffer of that node's shared region, reports it through the complete queue, and the receiving container's read(socket, buf, N) consumes it via SysRead.]
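
A toy rendering of that write path, with in-process queues standing in for the shared-memory submit/complete queues and the RDMA QP; every name here is illustrative.

    use std::collections::VecDeque;

    #[derive(Debug)]
    enum Request {
        /// "The write buffer of socket `sock` has `len` new bytes; RDMA-write them."
        Write { sock: i32, len: usize },
    }

    struct TsorClient {
        write_buf: Vec<u8>,              // per-socket write buffer in the shared region
        submit_queue: VecDeque<Request>, // stand-in for the shared submit queue
    }

    impl TsorClient {
        /// What the virtualized write(2) does: copy into the shared write buffer
        /// and enqueue a request for the RDMA service.
        fn sys_write(&mut self, sock: i32, data: &[u8]) -> usize {
            self.write_buf.extend_from_slice(data);
            self.submit_queue.push_back(Request::Write { sock, len: data.len() });
            data.len()
        }
    }

    /// What the RDMA service does: pop requests and (here, pretend to) post an
    /// RDMA Write-with-Immediate that lands in the peer's read buffer.
    fn rdma_service_poll(client: &mut TsorClient) {
        while let Some(Request::Write { sock, len }) = client.submit_queue.pop_front() {
            println!("RDMA write-imm: socket {sock}, {len} bytes");
            client.write_buf.drain(..len);
        }
    }

    fn main() {
        let mut cli = TsorClient { write_buf: Vec::new(), submit_queue: VecDeque::new() };
        cli.sys_write(3, b"GET / HTTP/1.1\r\n\r\n");
        rdma_service_poll(&mut cli);
    }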

27 of 29

[Diagram: two nodes, each running a process inside a sandbox, connected by a TCP connection.]

28 of 29

  • Scalability
  • Multi-tenant

[Diagram: requests to http://projectid.xxx/ enter through Ingress and an API Gateway, and a Function Dispatcher routes them to Container#1 / Container#2; a Resource Manager, per-node Node Agent, Event Queue, and Dispatcher coordinate the containers.]

29 of 29

[Diagram: multi-region service connectivity. In Region 1, Pod A reaches Service B through Endpoint B (Cluster IP 10.5.8.5:80) whose members are 192.168.0.1:80, 192.168.2.3:80, and 192.168.5.2:128; the last member points at the Egress gateway, and DNS resolves PodB.local to 10.5.8.5. In Region 2, the Ingress gateway (EIP 202.1.2.3:90) forwards to Endpoint B (Cluster IP 10.5.6.7:8080) with members 192.168.0.1:8080 and 192.168.82.3:8080, and DNS resolves PodB.local to 10.5.6.7. Region 3 exposes another Ingress gateway at EIP 68.5.4.3:507 in front of its Pod B replicas. Service definitions are synchronized across regions by a Global K8S Mgr and an Adaptor through an object store (S3/Redis) per VPC.]

Egress Mapping (Region 1)

| Port | External EP |
|---|---|
| 128 | 202.1.2.3:90 |
| 129 | 68.5.4.3:507 |
| .. | .. |

Ingress Mapping (Region 2)

| Port | Internal EP |
|---|---|
| 90 | 10.5.6.7:8080 |