
Alpha Version IPVS Load Balancing Mode in Kubernetes

Summary

There has been much discussion about enabling IPVS as an in-cluster service load balancing mode.  IPVS performs better than iptables and supports more sophisticated load balancing algorithms (least load, least connections, locality, weighted), as well as other useful features (e.g. health checking, retries, etc.[a][b][c]).

 

This page summarizes what is expected in the alpha version of IPVS load balancing support, including Kubernetes user-facing behavior changes, build and deployment changes, design considerations, and the planned test validation.


Kubernetes behavior change

Changes to kube-proxy startup parameters

Parameter: --proxy-mode

In addition to the existing userspace and iptables modes, IPVS mode is configured via --proxy-mode=ipvs.  In the alpha version, kube-proxy implicitly uses IPVS NAT mode.
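
As a minimal sketch, kube-proxy would be started as shown below; any remaining flags are assumed to be unchanged from an existing iptables-mode deployment.

        # illustrative: switch kube-proxy to IPVS mode at startup
        kube-proxy --proxy-mode=ipvs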

 

Parameter: --ipvs-scheduler

A new kube-proxy parameter, --ipvs-scheduler, will be added to specify the IPVS load balancing algorithm.  The supported values are listed below.  If the parameter is not configured, rr is the default; if it is configured with an unsupported value, kube-proxy will exit with an error message.

-          rr: round-robin

-          lc: least connection

-          dh: destination hashing

-          sh: source hashing

-          sed: shortest expected delay

-          nq: never queue

For more details, refer to http://kb.linuxvirtualserver.org/wiki/Ipvsadm
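
For illustration, the sketch below selects the least-connection scheduler and shows the ipvsadm command a comparable hand-built virtual service would use; the scheduler value maps to ipvsadm's -s option, and the address and port are placeholders.

        # select the least-connection scheduler in kube-proxy
        kube-proxy --proxy-mode=ipvs --ipvs-scheduler=lc
        # hand-configured equivalent for a single virtual service (illustrative VIP:port)
        ipvsadm -A -t 10.0.0.10:80 -s lc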

In the future, we can implement a per-service scheduler (potentially via an annotation), which would take priority over and override this value.

Parameter: --cleanup-ipvs

Similar to the --cleanup-iptables parameter, if set to true, kube-proxy will clean up the IPVS configuration and the iptables rules created in IPVS mode.
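
A rough usage sketch; ipvsadm -C is shown only as the manual equivalent of flushing the IPVS virtual server table.

        # ask kube-proxy to clean up the IPVS configuration and related iptables rules
        kube-proxy --cleanup-ipvs=true
        # roughly equivalent manual flush of the IPVS virtual server table
        ipvsadm -C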

Change to build

The IPVS implementation relies on the seesaw library, which in turn uses libnl, a library implemented in C.  We will vendor the seesaw source code into Kubernetes.  The build machine needs libnl installed as a prerequisite.

Some existing build scripts need to be updated to install libnl; this is beyond the scope of the alpha version.

To install libnl on build machine:

apt-get install libnl-dev

apt-get install libnl-genl-3-dev

An alternative is to use the libnetwork library, which is implemented in Go; however, new APIs would need to be added to libnetwork. This is a backup plan that requires more effort and time, and it will likely not make it into Kubernetes 1.7.

https://godoc.org/github.com/docker/libnetwork/ipvs

Change to deployment

IPVS kernel module installation is outside the scope of Kubernetes; it is assumed that IPVS is installed before running Kubernetes.  When kube-proxy starts in IPVS proxy mode, it will validate whether IPVS is installed on the node; if it is not installed, kube-proxy will exit with an error message.

Some of the existing deployment scripts might need to be updated to install ipvsadm; this is beyond the scope of the alpha version.

        apt-get install ipvsadm
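
The commands below sketch the kind of validation kube-proxy (or an operator) could run; the exact list of kernel modules depends on the configured scheduler, so it is illustrative only.

        # check whether the IPVS kernel modules are loaded
        lsmod | grep ^ip_vs
        # load the core module plus the module for the configured scheduler (rr shown here)
        modprobe -a ip_vs ip_vs_rr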

 

Design considerations

IPVS setup and network topology

IPVS replaces iptables as the load balancer; readers of this proposal are assumed to be familiar with the iptables load balancer mode. In IPVS mode, we will create a dummy interface and assign all service VIPs to it. In the alpha version, we will implicitly use IPVS NAT mode.
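
The commands below sketch the intended setup for a single service; the dummy interface name, VIP, and pod IPs are placeholders, and the actual implementation would configure this programmatically rather than by shelling out.

        # create a dummy interface and bind the service VIP to it
        ip link add kube-ipvs0 type dummy
        ip addr add 10.0.0.10/32 dev kube-ipvs0
        # create the IPVS virtual service and add pod IPs as NAT (masquerading) real servers
        ipvsadm -A -t 10.0.0.10:80 -s rr
        ipvsadm -a -t 10.0.0.10:80 -r 10.244.1.5:8080 -m
        ipvsadm -a -t 10.0.0.10:80 -r 10.244.2.7:8080 -m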

 

Reverse traffic (SNAT)

In iptables mode, if --masquerade-all is specified, SNAT is performed on all traffic so that reverse traffic returns through the proxy. In IPVS mode, the same SNAT is required for reverse traffic.

Here are the current options for SNAT in IPVS mode:

  • If --cluster-cidr and --masquerade-all are both specified, kube-proxy will add an SNAT iptables rule (sketched below).
  • If flannel is used as the network overlay, flannel automatically adds the SNAT iptables rule.
  • Apply an IPVS patch to make IPVS do full SNAT. (We had a private patch; no public patch information is available yet. If anyone knows of one, please add details here.)
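
The first option amounts to a masquerade rule of roughly the following shape; the cluster CIDR and the comment are placeholders, and the exact match conditions are an implementation detail.

        # masquerade traffic that did not originate inside the cluster CIDR so that
        # replies from the real servers return through the IPVS director
        iptables -t nat -A POSTROUTING ! -s 10.244.0.0/16 -m comment --comment "IPVS SNAT (illustrative)" -j MASQUERADE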

Session affinity

If session affinity is specified in the service spec, IPVS will be configured with persistence. As in iptables mode, the persistence timeout is 180 seconds.
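
An illustrative ipvsadm equivalent of a persistent virtual service; the VIP, port, and real server are placeholders, and 180 is the timeout in seconds stated above.

        # -p enables IPVS persistence for the virtual service, here with a 180-second timeout
        ipvsadm -A -t 10.0.0.10:80 -s rr -p 180
        ipvsadm -a -t 10.0.0.10:80 -r 10.244.1.5:8080 -m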

Port range support

Since the API server is not aware of the IPVS/iptables load balancer implementation, we will let users deploy services with any port range.

In kube-proxy, we will split a port range into individual ports to configure IPVS.
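
For example, a service exposing ports 8000-8002 on an (illustrative) VIP would be configured as three separate IPVS virtual services.

        # one virtual service per port in the range
        ipvsadm -A -t 10.0.0.10:8000 -s rr
        ipvsadm -A -t 10.0.0.10:8001 -s rr
        ipvsadm -A -t 10.0.0.10:8002 -s rr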

Node Port support

We will scan all node IPs; for each node IP, we will configure IPVS to use NodeIP:NodePort as the virtual service address and the pod IPs as the real servers.
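
A sketch for a single node IP and NodePort; all addresses and ports are placeholders.

        # expose NodeIP:NodePort as an IPVS virtual service
        ipvsadm -A -t 192.168.1.10:30080 -s rr
        # add each backing pod as a NAT (masquerading) real server
        ipvsadm -a -t 192.168.1.10:30080 -r 10.244.1.5:8080 -m
        ipvsadm -a -t 192.168.1.10:30080 -r 10.244.2.7:8080 -m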

 

External IP support[d][e]

We will add a DNAT iptables rule to redirect traffic to the service IP.  The service IP is then load balanced through IPVS.  For services with the annotation service.beta.kubernetes.io/external-traffic=OnlyLocal, IPVS will only use local pods as real servers.
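
Roughly, the redirect would look like the rule below; the external IP, port, and service IP are placeholders, and the exact chain and match conditions are an implementation detail.

        # redirect traffic for the external IP to the service IP, which IPVS then load balances
        iptables -t nat -A PREROUTING -d 1.2.3.4/32 -p tcp --dport 80 -j DNAT --to-destination 10.0.0.10:80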

Test validation

-          Functionality tests: all of the traffic paths below should be reachable

o    Traffic accessing service IP

  container -> serviceIP -> container (same host)

  container -> serviceIP -> container (cross host)

  container -> serviceIP -> container (same container)

  host -> serviceIP -> container (same host)

  host -> serviceIP -> container (cross host)

o    Access service via NodePort

o    Access service via external IP

o    Traffic between container and host (not via service IP)

  container -> container (same host)

  container -> container (cross host)

  container -> container (same container)

  host -> container (same host)

  host -> container (cross host)

  container -> host (same host)

  container -> host (cross host)

-          Test a service with ServiceAffinity=ClientIP. Validate that IPVS configures persistence for the service.

-          Pass the existing E2E tests

-          Test with flannel; network plugins other than flannel will not be tested.

 

[a] Can you add more details about how health checking and retries are done?

Will every worker node perform health checks on application pods to avoid a network split-brain?

[b] A scalability concern was raised about enabling this for all services by default; alternatively, we could enable it for specific services on demand (possibly via a service annotation). However, that would not be in the scope of the alpha version.

[c] Another implementation: https://github.com/cloudnativelabs/kube-router

[d] Perhaps add a section on the strategy to ensure reverse-path traffic hits the IPVS director. In other words, for what traffic will the IPVS proxier do SNAT/masquerade, and how will the existing '--masquerade-all' and '--cluster-cidr' flags be honoured?

[e] Nice question; details have been added in the reverse traffic (SNAT) section.
