In this document we consider IPv4 only, for simplicity. However, all examples and solutions apply equally to IPv6.
Currently, the load-balancing mechanism of Cilium is based on the following simplified BPF maps:
LB4_SERVICES_MAP:          (address, dport, slave)      -> (target, port, count, rev_nat_index)
reverse NAT map:           (rev_nat_index)              -> (address, port)
CT_MAP_ANY4 / CT_MAP_TCP4: (daddr, saddr, dport, sport) -> (rev_nat_index, slave)
In LB4_SERVICES_MAP, an entry with slave=0 is a placeholder whose count field holds the number of backends for the given service. E.g. a service 220.127.116.11:80 with three backends:

(220.127.116.11, 80, 0) -> (0.0.0.0, 0, 3, 12)
(220.127.116.11, 80, 1) -> (10.0.0.1, 8080, 0, 12)
(220.127.116.11, 80, 2) -> (10.0.0.2, 8080, 0, 12)
(220.127.116.11, 80, 3) -> (10.0.0.3, 8080, 0, 12)
The backend selection for the first packet of a flow is based on the packet hash, while for subsequent packets the selected backend is determined by a lookup in CT_MAP_ANY4 or CT_MAP_TCP4 (depending on the protocol). After that lookup, an additional lookup in LB4_SERVICES_MAP is required to get the address of the backend. So, as we can see, a service address with a port number and a slave id uniquely identifies a backend, and slave numbers have to be consecutive (i.e. no gaps) for a given service.
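The selection logic above can be sketched in a few lines. The following is an illustrative Python model, not the actual BPF code: map names and field layouts follow the simplified signatures above, while the hash-to-slave formula and all helper names are assumptions for illustration.

```python
# Illustrative model of the current datapath lookups. Field layouts follow
# the simplified map signatures; everything else is hypothetical.

LB4_SERVICES_MAP = {
    # (address, dport, slave) -> (target, port, count, rev_nat_index)
    ("220.127.116.11", 80, 0): ("0.0.0.0", 0, 3, 12),   # slave=0: placeholder, count=3
    ("220.127.116.11", 80, 1): ("10.0.0.1", 8080, 0, 12),
    ("220.127.116.11", 80, 2): ("10.0.0.2", 8080, 0, 12),
    ("220.127.116.11", 80, 3): ("10.0.0.3", 8080, 0, 12),
}

CT_MAP_TCP4 = {}  # (daddr, saddr, dport, sport) -> (rev_nat_index, slave)

def select_backend(daddr, saddr, dport, sport, packet_hash):
    ct_key = (daddr, saddr, dport, sport)
    entry = CT_MAP_TCP4.get(ct_key)
    if entry is None:
        # First packet: derive a slave from the packet hash. Slaves are
        # 1..count, so they must be consecutive for this to be valid.
        _, _, count, rev_nat = LB4_SERVICES_MAP[(daddr, dport, 0)]
        slave = (packet_hash % count) + 1
        CT_MAP_TCP4[ct_key] = (rev_nat, slave)
    else:
        _, slave = entry
    # Either way, a second LB4_SERVICES_MAP lookup yields the backend.
    target, port, _, _ = LB4_SERVICES_MAP[(daddr, dport, slave)]
    return target, port
```

Note that both the first packet and all subsequent packets end with a LB4_SERVICES_MAP lookup keyed by (address, dport, slave), which is exactly why a slave slot can never silently disappear.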
Due to the way we uniquely identify a backend, we cannot simply remove a backend's entry from LB4_SERVICES_MAP after the backend has been removed; otherwise, the backend counting mechanism would break.
From the example above, consider the removal of the 10.0.0.2:8080 backend. If we remove the (220.127.116.11, 80, 2) -> (10.0.0.2, 8080, 0, 12) entry, then first-packet backend selection becomes invalid in the following table due to the count value, as the hash can still pick the now-missing slave=2:

(220.127.116.11, 80, 0) -> (0.0.0.0, 0, 3, 12)
(220.127.116.11, 80, 1) -> (10.0.0.1, 8080, 0, 12)
(220.127.116.11, 80, 3) -> (10.0.0.3, 8080, 0, 12)
If we instead only update the count to 2, then a) the backend with slave=3 will never be selected, and b) the invalid backend with slave=2 can still be selected.
The current solution to the removal problem is to overwrite the removed entry with a duplicate of an existing backend's entry. E.g.:

(220.127.116.11, 80, 0) -> (0.0.0.0, 0, 3, 12)
(220.127.116.11, 80, 1) -> (10.0.0.1, 8080, 0, 12)
(220.127.116.11, 80, 2) -> (10.0.0.1, 8080, 0, 12)
(220.127.116.11, 80, 3) -> (10.0.0.3, 8080, 0, 12)
This makes the LB4_SERVICES_MAP table effectively append-only (except when removing the backend with the highest slave number), which in dynamic deployments can lead to high memory consumption. It also introduces unfairness: a backend that fills the holes left by removed backends receives a disproportionate share of new flows.
The current implementation can only differentiate between TCP and non-TCP services. In Kubernetes, a non-TCP service can be either UDP or SCTP, so a UDP service and an SCTP service with the same destination port will clash and therefore not work.
A proposed solution by @tgraf is to introduce the following LB4_BACKEND_MAP:
(backend_id) -> (target, port, proto)
And to change the LB4_SERVICES_MAP, CT_MAP_ANY4 and CT_MAP_TCP4 maps accordingly:
LB4_SERVICES_MAP:          (address, dport, slave)             -> (count | backend_id, rev_nat_index)
CT_MAP_ANY4 / CT_MAP_TCP4: (daddr, saddr, dport, sport, proto) -> (rev_nat_index | backend_id)
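With this layout, an established flow references a stable backend_id instead of a slave number, and the backend address is resolved through LB4_BACKEND_MAP. A hypothetical Python model of the new lookup path (field layouts follow the proposed signatures; IDs and helper names are illustrative):

```python
# Illustrative model of the proposed v2 lookup path.

LB4_BACKEND_MAP = {
    # backend_id -> (target, port, proto)
    1: ("10.0.0.1", 8080, "TCP"),
    2: ("10.0.0.2", 8080, "TCP"),
    3: ("10.0.0.3", 8080, "TCP"),
}

LB4_SERVICES_MAP = {
    # (address, dport, slave) -> (count | backend_id, rev_nat_index)
    ("220.127.116.11", 80, 0): (3, 12),  # slave=0: count
    ("220.127.116.11", 80, 1): (1, 12),  # slave>0: backend_id
    ("220.127.116.11", 80, 2): (2, 12),
    ("220.127.116.11", 80, 3): (3, 12),
}

CT_MAP_TCP4 = {}  # (daddr, saddr, dport, sport, proto) -> backend_id

def select_backend(daddr, saddr, dport, sport, proto, packet_hash):
    ct_key = (daddr, saddr, dport, sport, proto)
    backend_id = CT_MAP_TCP4.get(ct_key)
    if backend_id is None:
        # First packet: hash-based slave selection, as before.
        count, _ = LB4_SERVICES_MAP[(daddr, dport, 0)]
        slave = (packet_hash % count) + 1
        backend_id, _ = LB4_SERVICES_MAP[(daddr, dport, slave)]
        CT_MAP_TCP4[ct_key] = backend_id
    # Established flows no longer depend on slave numbering at all.
    target, port, _ = LB4_BACKEND_MAP[backend_id]
    return target, port
```

The key property is that only first packets touch the slave numbering; everything in the CT map survives slave renumbering.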
The backend removal operation would consist of three steps:
1. Overwrite the removed backend's entry in LB4_SERVICES_MAP with the entry of the highest slave number, so that slave numbers stay consecutive.
2. Remove the entry with the highest slave number and decrement count in the slave=0 entry.
3. Remove the backend from LB4_BACKEND_MAP.
E.g. after the removal of 10.0.0.2:8080 (backend_id=2 in this example), LB4_SERVICES_MAP would contain the following entries:

(220.127.116.11, 80, 0) -> (2, 12)
(220.127.116.11, 80, 1) -> (1, 12)
(220.127.116.11, 80, 2) -> (3, 12)
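The removal operation could be sketched as follows: fill the hole left by the removed slave with the highest-numbered entry, decrement the count, and delete the backend itself. Established flows through the removed backend's old slave number are unaffected, because CT entries reference the backend_id. This is a hypothetical Python model; the function and variable names are illustrative.

```python
def remove_backend(services, backends, address, dport, backend_id):
    """Remove backend_id from a service without leaving a slave gap."""
    count, rev_nat = services[(address, dport, 0)]
    # Find the slave slot currently pointing at the backend.
    slave = next(s for s in range(1, count + 1)
                 if services[(address, dport, s)][0] == backend_id)
    # 1) Fill the hole with the highest-numbered slave entry.
    services[(address, dport, slave)] = services[(address, dport, count)]
    # 2) Drop the last slot and decrement count in the slave=0 entry.
    del services[(address, dport, count)]
    services[(address, dport, 0)] = (count - 1, rev_nat)
    # 3) Remove the backend itself.
    del backends[backend_id]

# Example: the service from earlier with backend IDs 1..3; remove ID 2.
SVC = "220.127.116.11"
services = {(SVC, 80, 0): (3, 12), (SVC, 80, 1): (1, 12),
            (SVC, 80, 2): (2, 12), (SVC, 80, 3): (3, 12)}
backends = {1: ("10.0.0.1", 8080, "TCP"),
            2: ("10.0.0.2", 8080, "TCP"),
            3: ("10.0.0.3", 8080, "TCP")}
remove_backend(services, backends, SVC, 80, 2)
```

Removing the backend with the highest slave number is just the degenerate case where the hole-filling step overwrites the entry with itself.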
To generate a backend ID, which has to be globally unique, we can either rely on the KVStore or hash the target+port+proto combination. If we choose the former, an additional map will be required for reverse backend ID lookups (to avoid a full table scan when removing a backend).
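The hashing variant could look like the following sketch. The encoding and the 32-bit truncation are assumptions for illustration; a truncated hash can collide, so a real implementation would still need collision handling.

```python
import hashlib
import socket
import struct

def backend_id(target, port, proto):
    """Derive a hypothetical 32-bit backend ID from target+port+proto."""
    proto_num = {"TCP": 6, "UDP": 17, "SCTP": 132}[proto]
    # Encode IPv4 address (4 bytes) + port (2 bytes) + protocol (1 byte).
    raw = socket.inet_aton(target) + struct.pack("!HB", port, proto_num)
    # Truncate a cryptographic hash to 32 bits. Collisions are unlikely
    # but possible and must be handled out of band.
    return struct.unpack("!I", hashlib.sha256(raw).digest()[:4])[0]
```

The upside is that the ID is derivable from the backend itself, so no reverse-lookup map is needed; the downside is the residual collision risk that the KVStore approach avoids.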
Since we are going to change the related maps anyway, we can also introduce a protocol field, which would require a further change to LB4_SERVICES_MAP:
(address, dport, proto, slave) -> (count | backend_id, rev_nat_index)
Upgrading an existing deployment of Cilium to loadbalancer v2 would break existing flows. To avoid this, we can roll the feature out gradually and fully enable it only after two releases.
For users who do not care about the disruption, we can add a flag in 1.5.0 which enables loadbalancer v2 immediately.