Kubernetes @ CERN
Kubernetes Service
On Demand Cluster as a Service
Multiple versions, custom features
Integration with CERN networking,
identity, security, storage, …
Used by multiple applications with resources in 513 across most sectors:
IT (Registry, GitLab + CI, MONIT, SWAN, Kubeflow/ML, SSO, CS/IP1, …)
IT / FHR (EDH, EDMS, Phonebook, AIS, egroups, Learning Hub, …)
RCS (ATLAS Rucio, CMSWeb, InspireHEP, HEPData, …)
2
2016
2017
2018
2019
2020
2021
2022
Pilot Service
Kubernetes, Swarm
No obvious around this time which orchestrator would win. Scalability and performance tests.
Investigations
“Production” Service
Kubernetes, Swarm, Mesos
Integration with the CERN and WLCG environments. Certificates, Storage.
20
122
276
342
330
334
411
1098
1093
1817
2560
2780
700
CephFS, Manila, CVMFS
CSI everywhere. Developed initial drivers for CephFS and Manila. Improvements for CVMFS.
Work done fully upstream, taken later by multiple other companies.
GitOps and Secret Management, Dissemination. Helm and Flux/ArgoCD.
HA improvements with Node Groups, Cluster Auto Scaling, Auto Healing, LBaaS for serviceType: LB.
Dissemination, Security
Container Webinars on infrastructure and use cases, popular elsewhere as well.
Re-thinking with Security Team security aspects of containerized deployments, policies, best practices.
Clusters
Nodes
Disaster Recovery
Volume Snapshots, Automated Backups and Restore.
Round one and multiple items to work together with the community.
High Availability, GitOps, Public Cloud
Integration: Storage
Physics Data - EOS, initially eosxd with hostPath mount, then eosxd-csi
General Purpose backed by CEPH - CSI plugins: manila, cinder, cephfs
Software distribution - csi-cvmfs POSIX read-only via FUSE
Other, TN - csi-driver-nfs for NetAPP or custom NFS servers
4
Integration: Networking & Load Balancing
Networking
Calico is the default CNI
Cilium is opt-in, attractive for cluster-mesh, hybrid deployments
Load Balancing
Ingresses with DNS alias
LoadBalancer via OpenStack/Octavia
5
Integration: Certificates
Let’s Encrypt, self-service with cert-manager.io
HTTP-01 challenge, LE allowed in perimeter firewall (upon request)
DNS-01 challenge, cert-manager and CERN DNS integration
CERN CA, custom daemonset
Host Certificate per node
6
Integration: Monitoring
Metrics
Upstream kube-prometheus-stack
WIP aggregation to central infrastructure
Logging
Fluentd daemonset with http plugin, pushing to central gateway
Transition to fluentbit
7
Operations: Registry
8
Operations: ArgoCD
Central repo (private) for all applications
Each application set manage in public�repo eg registry
Integration with private vault for secrets
Alerts via grafana to mail mattermost, telegram
9
Work in progress
BC/DR
Multi-cluster / cluster mesh
Improved audit logging with Falco
Adoption of CAPI for cluster lifecycle
SBOM integration for containers
10
Links
https://gitlab.cern.ch/kubernetes/magnum
https://gitlab.cern.ch/kubernetes/automation/releases/cern-magnum
https://gitlab.cern.ch/kubernetes/networking/
https://gitlab.cern.ch/kubernetes/security/
11
Q & A
12