RFC: Loadbalancer v2

https://github.com/cilium/cilium/issues/7106

Note

In this document we consider only IPv4, for simplicity. However, all examples and solutions are trivially applicable to IPv6.

Background

Currently, the loadbalancing mechanism of Cilium is based on the following BPF maps (shown here in simplified form):

LB4_SERVICES_MAP: (address, dport, slave) -> (target, port, count, rev_nat_index)

Reverse NAT map: (rev_nat_index) -> (address, port)

CT_MAP_TCP4 / CT_MAP_ANY4: (daddr, saddr, dport, sport) -> (rev_nat_index, slave)
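
For illustration, the simplified key/value tuples above could be expressed roughly as the following C structs. The field names and widths here are assumptions chosen to match the tuples, not the exact datapath definitions:

    /* Illustrative layouts only; field names and widths are approximations
     * of the simplified tuples above, not the exact Cilium definitions. */
    #include <linux/types.h>

    struct lb4_key {                /* LB4_SERVICES_MAP key */
        __be32 address;             /* service address */
        __be16 dport;               /* service port */
        __u16  slave;               /* 0 = placeholder, 1..count = backend slot */
    };

    struct lb4_service {            /* LB4_SERVICES_MAP value */
        __be32 target;              /* backend address (unused in the placeholder) */
        __be16 port;                /* backend port (unused in the placeholder) */
        __u16  count;               /* number of backends (placeholder only) */
        __u16  rev_nat_index;       /* index into the reverse NAT map */
    };

    struct lb4_reverse_nat {        /* reverse NAT map value */
        __be32 address;
        __be16 port;
    };

    struct ipv4_ct_tuple {          /* CT_MAP_TCP4 / CT_MAP_ANY4 key (simplified) */
        __be32 daddr, saddr;
        __be16 dport, sport;
    };

    struct ct_entry {               /* CT value fields relevant to the LB (simplified) */
        __u16 rev_nat_index;
        __u16 slave;                /* backend slot chosen for this flow */
    };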

In LB4_SERVICES_MAP, an entry with slave=0 is a placeholder whose count field holds the number of backends for the given service. E.g.:
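
(illustrative entries; only the 10.0.0.2 entry at slave=2 is taken from the removal example below, the other addresses are made up, and the placeholder's target and port are shown as zeroes)

(1.1.1.1, 80, 0) -> (0.0.0.0, 0, 3, 12)
(1.1.1.1, 80, 1) -> (10.0.0.1, 8080, 0, 12)
(1.1.1.1, 80, 2) -> (10.0.0.2, 8080, 0, 12)
(1.1.1.1, 80, 3) -> (10.0.0.3, 8080, 0, 12)

Here the slave=0 entry carries count=3, and slaves 1..3 point to the individual backends.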

The backend selection for the first packet of a flow is based on a packet hash, while for subsequent packets the backend is determined from a lookup in CT_MAP_ANY4 or CT_MAP_TCP4 (depending on the protocol). After the CT lookup, an additional lookup in LB4_SERVICES_MAP is required to get the address of the backend. So, a service address together with a port and a slave id uniquely identifies a backend, and slave numbers have to be consecutive (i.e. no gaps) for a given service.
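
A rough sketch of this selection flow, written as C-style pseudocode on top of the illustrative structs above (the helper functions and the packet type are hypothetical, not the actual datapath code):

    /* Hypothetical sketch of backend selection; not the actual datapath code.
     * ct_lookup(), lb4_lookup(), packet_hash() and ct_create() stand in for
     * the respective map lookups and helpers. */
    struct lb4_service *lb4_select_backend(struct packet *pkt,
                                           __be32 addr, __be16 dport)
    {
        struct ct_entry *ct = ct_lookup(pkt);   /* CT_MAP_TCP4 or CT_MAP_ANY4 */
        __u16 slave;

        if (ct) {
            /* Subsequent packet: reuse the slot recorded in the CT entry. */
            slave = ct->slave;
        } else {
            /* First packet: the slave=0 placeholder provides the backend
             * count, and the packet hash picks a slot in 1..count. */
            struct lb4_service *svc = lb4_lookup(addr, dport, 0);

            if (!svc || svc->count == 0)
                return NULL;
            slave = (packet_hash(pkt) % svc->count) + 1;
            ct_create(pkt, slave);              /* remember the choice */
        }

        /* A second LB4_SERVICES_MAP lookup resolves (addr, dport, slave)
         * to the actual backend address and port. */
        return lb4_lookup(addr, dport, slave);
    }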

Problem

Duplicate backend entries

Due to the way we uniquely identify a backend, we cannot simply remove a backend's entry from LB4_SERVICES_MAP when the backend itself goes away; doing so would break the backend counting mechanism.

From the example above, consider the removal of the 10.0.0.2:8080 backend. If we remove the (1.1.1.1,80,2) -> (10.0.0.2, 8080, 0, 12) entry, the selection of a backend for a first packet becomes invalid, because the count in the placeholder no longer matches the remaining entries:
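
(continuing the illustrative entries from above)

(1.1.1.1, 80, 0) -> (0.0.0.0, 0, 3, 12)
(1.1.1.1, 80, 1) -> (10.0.0.1, 8080, 0, 12)
(1.1.1.1, 80, 3) -> (10.0.0.3, 8080, 0, 12)

The placeholder still advertises count=3, so a packet hash that picks slave=2 finds no entry at all.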

If we instead keep the stale slave=2 entry and only update the count to 2, then a) the backend with slave=3 will never be selected, and b) the stale slave=2 entry, which still points to the removed backend, can be selected.

The current solution to the removal problem is to overwrite the removed slot with a duplicate of one of the remaining backend entries. E.g.:
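
(again continuing the illustrative entries; the removed slot is overwritten with a copy of the slave=1 backend)

(1.1.1.1, 80, 0) -> (0.0.0.0, 0, 3, 12)
(1.1.1.1, 80, 1) -> (10.0.0.1, 8080, 0, 12)
(1.1.1.1, 80, 2) -> (10.0.0.1, 8080, 0, 12)
(1.1.1.1, 80, 3) -> (10.0.0.3, 8080, 0, 12)

The count stays at 3, but 10.0.0.1 now occupies two of the three slots.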

This makes the LB4_SERVICES_MAP table effectively append-only (except for the case when we remove the backend with the highest slave number), which in dynamic deployments might lead to high memory consumption. It also introduces unfairness, as the duplicated backend receives a disproportionate share of new flows.

Support for different protocols

The current implementation can only differentiate between TCP and non-TCP services. In the case of Kubernetes, a non-TCP service can be either UDP or SCTP. So, a UDP service and an SCTP service with the same destination port will clash, and thus they won't both work.

Solution

A proposed solution by @tgraf is to introduce the following LB4_BACKEND_MAP:

(backend_id) -> (target, port, proto)

And to change the LB4_SERVICES_MAP, CT_MAP_ANY4 and CT_MAP_TCP4 maps accordingly:

LB4_SERVICES_MAP: (address, dport, slave) -> (count | backend_id, rev_nat_index)

CT_MAP_TCP4 / CT_MAP_ANY4: (daddr, saddr, dport, sport, proto) -> (rev_nat_index | backend_id)
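
In the same illustrative C style as before (field names and widths are assumptions, not the final datapath definitions), the proposed layouts could look like this:

    /* Illustrative v2 layouts; names and field widths are assumptions. */
    #include <linux/types.h>

    struct lb4_backend_key {        /* LB4_BACKEND_MAP key */
        __u32 backend_id;           /* globally unique backend ID */
    };

    struct lb4_backend {            /* LB4_BACKEND_MAP value */
        __be32 target;              /* backend address */
        __be16 port;                /* backend port */
        __u8   proto;               /* IPPROTO_TCP, IPPROTO_UDP, ... */
        __u8   pad;
    };

    struct lb4_service_v2 {         /* LB4_SERVICES_MAP value */
        union {
            __u32 count;            /* slave == 0: number of backends */
            __u32 backend_id;       /* slave  > 0: key into LB4_BACKEND_MAP */
        };
        __u16 rev_nat_index;
        __u16 pad;
    };

    struct ipv4_ct_tuple_v2 {       /* CT map key, now including the protocol */
        __be32 daddr, saddr;
        __be16 dport, sport;
        __u8   proto;
        __u8   pad[3];
    };

The services key keeps the (address, dport, slave) shape from before; only the value changes.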

The backend removal operation would consist of three steps:

  1. Decrement the count value in the placeholder entry in LB4_SERVICES_MAP.
  2. Swap the to-be-removed backend entry with the last entry of the same service.
  3. Remove the backend from LB4_BACKEND_MAP.

E.g., after the removal of the 10.0.0.2:8080 backend, LB4_SERVICES_MAP would contain the following entries:
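
(using hypothetical backend IDs 101-103 for the three backends from the earlier example, where 10.0.0.2:8080 had ID 102)

(1.1.1.1, 80, 0) -> (count=2, 12)
(1.1.1.1, 80, 1) -> (backend_id=101, 12)
(1.1.1.1, 80, 2) -> (backend_id=103, 12)

The placeholder count has been decremented to 2, the former slave=3 entry has been swapped into slot 2, and backend 102 has been deleted from LB4_BACKEND_MAP.

A rough userspace sketch of the three steps, using the libbpf map helpers and reusing the illustrative struct layouts from above (the map fds, the slot/ID arguments and error handling are all glossed over):

    /* Hypothetical sketch of the removal steps; not the actual Cilium code. */
    #include <bpf/bpf.h>            /* bpf_map_{lookup,update,delete}_elem() */

    int lb4_remove_backend(int services_fd, int backends_fd,
                           struct lb4_key svc, __u16 slot, __u32 backend_id)
    {
        struct lb4_key key = svc;
        struct lb4_service_v2 placeholder, last;

        /* 1. Decrement the count in the slave=0 placeholder entry. */
        key.slave = 0;
        bpf_map_lookup_elem(services_fd, &key, &placeholder);
        placeholder.count--;
        bpf_map_update_elem(services_fd, &key, &placeholder, BPF_ANY);

        /* 2. Swap the removed slot with the last slot of the service
         *    (the old last slot is now count + 1), then drop the last slot. */
        key.slave = placeholder.count + 1;
        bpf_map_lookup_elem(services_fd, &key, &last);
        bpf_map_delete_elem(services_fd, &key);
        if (slot != placeholder.count + 1) {
            key.slave = slot;
            bpf_map_update_elem(services_fd, &key, &last, BPF_ANY);
        }

        /* 3. Remove the backend itself from LB4_BACKEND_MAP. */
        bpf_map_delete_elem(backends_fd, &backend_id);
        return 0;
    }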

To generate a backend ID, which has to be globally unique, we can either rely on the KVStore or hash the target+port+proto combination. If we choose the former, an additional map will be required for a reverse backend ID lookup (to avoid a full table scan when removing a backend).
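
If the hashing approach were chosen, a minimal sketch could look like the following (FNV-1a is used purely for illustration; a real implementation would still have to deal with hash collisions and with reserving special ID values):

    /* Illustrative only: derive a backend ID by hashing target+port+proto. */
    #include <stdint.h>
    #include <string.h>

    static uint32_t backend_id_hash(uint32_t target, uint16_t port, uint8_t proto)
    {
        uint8_t buf[7];
        uint32_t h = 2166136261u;           /* FNV-1a offset basis */

        memcpy(buf, &target, sizeof(target));
        memcpy(buf + 4, &port, sizeof(port));
        buf[6] = proto;

        for (size_t i = 0; i < sizeof(buf); i++) {
            h ^= buf[i];
            h *= 16777619u;                 /* FNV-1a prime */
        }
        return h;
    }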

As we are going to change the related maps anyway, we can also introduce the protocol field, which would require a further change to the maps:

LB4_SERVICES_MAP: (address, dport, proto, slave) -> (count | backend_id, rev_nat_index)

Upgrading

Upgrading any existing deployment of Cilium to loadbalancer v2 would break existing flows. To avoid this, we can fully enable the feature only after two releases.

For users who do not care about the disruption, we can add a flag in 1.5.0 that enables loadbalancer v2 immediately.