1 of 12

Kubernetes @ CERN

Spyros Trigazis, IT-PW

Meeting with Université de Lausanne

https://indico.cern.ch/event/1417842/

2 of 12

Kubernetes Service

On Demand Cluster as a Service

Multiple versions, custom features

Integration with CERN networking, identity, security, storage, …

Used by multiple applications, with resources in building 513 (the CERN data centre), across most sectors:

IT (Registry, GitLab + CI, MONIT, SWAN, Kubeflow/ML, SSO, CS/IP1, …)

IT / FHR (EDH, EDMS, Phonebook, AIS, egroups, Learning Hub, …)

RCS (ATLAS Rucio, CMSWeb, InspireHEP, HEPData, …)

3 of 12

Timeline, 2016 to 2022

Investigations

Kubernetes, Swarm, Mesos

Not obvious around this time which orchestrator would win. Scalability and performance tests.

Pilot Service

Kubernetes, Swarm

“Production” Service

Integration with the CERN and WLCG environments. Certificates, Storage.

CephFS, Manila, CVMFS

CSI everywhere. Developed initial drivers for CephFS and Manila. Improvements for CVMFS.

Work done fully upstream, later taken up by multiple other companies.

High Availability, GitOps, Public Cloud

GitOps and Secret Management, Dissemination. Helm and Flux/ArgoCD.

HA improvements with Node Groups, Cluster Auto Scaling, Auto Healing, LBaaS for Services of type LoadBalancer.

Dissemination, Security

Container Webinars on infrastructure and use cases, popular elsewhere as well.

Re-thinking with the Security Team the security aspects of containerized deployments: policies, best practices.

Disaster Recovery

Volume Snapshots, Automated Backups and Restore.

Round one, with multiple items to work on together with the community.

Chart (series: Clusters, Nodes), 2016 to 2022. Data labels: 20, 122, 276, 342, 330, 334, 411, 1098, 1093, 1817, 2560, 2780, 700

4 of 12

Integration: Storage

Physics Data - EOS, initially eosxd with hostPath mount, then eosxd-csi

General Purpose backed by Ceph - CSI plugins: manila, cinder, cephfs

Software distribution - csi-cvmfs POSIX read-only via FUSE

Other, TN - csi-driver-nfs for NetApp or custom NFS servers
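As an illustration of how a workload might request one of the CSI-backed volumes listed above, the sketch below builds a PersistentVolumeClaim manifest in Python and prints it as JSON (kubectl accepts JSON as well as YAML). The storage class name `example-cephfs`, the claim name, and the size are placeholders, not the values used at CERN.

```python
import json


def make_pvc(name: str, storage_class: str, size: str, shared: bool = True) -> dict:
    """Build a PersistentVolumeClaim manifest as a plain dict."""
    # Shared file systems (e.g. CephFS) typically support ReadWriteMany;
    # block volumes (e.g. Cinder) are typically ReadWriteOnce.
    access_mode = "ReadWriteMany" if shared else "ReadWriteOnce"
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name},
        "spec": {
            "accessModes": [access_mode],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }


if __name__ == "__main__":
    # "example-cephfs" is a hypothetical storage class name.
    print(json.dumps(make_pvc("shared-data", "example-cephfs", "10Gi"), indent=2))
```

The printed manifest could be piped to `kubectl apply -f -`; which storage classes exist depends on the cluster.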

5 of 12

Integration: Networking & Load Balancing

Networking

Calico is the default CNI

Cilium is opt-in, attractive for cluster-mesh, hybrid deployments

Load Balancing

Ingresses with DNS alias

LoadBalancer via OpenStack/Octavia
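The two load-balancing options above correspond to two standard Kubernetes objects. The sketch below builds a minimal example of each in Python: an Ingress routing a host name to a Service, and a Service of type LoadBalancer (which, on an OpenStack-backed cluster, the cloud provider would typically satisfy with an Octavia load balancer). The host name `myapp.example.org`, the object names, and the ports are hypothetical.

```python
import json


def make_ingress(name: str, host: str, service: str, port: int) -> dict:
    """Ingress that routes all paths for `host` to `service:port`."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "Ingress",
        "metadata": {"name": name},
        "spec": {
            "rules": [{
                "host": host,  # expected to resolve to the ingress nodes, e.g. via a DNS alias
                "http": {"paths": [{
                    "path": "/",
                    "pathType": "Prefix",
                    "backend": {"service": {"name": service, "port": {"number": port}}},
                }]},
            }],
        },
    }


def make_lb_service(name: str, app_label: str, port: int, target_port: int) -> dict:
    """Service of type LoadBalancer selecting pods labelled app=<app_label>."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": app_label},
            "ports": [{"port": port, "targetPort": target_port, "protocol": "TCP"}],
        },
    }


if __name__ == "__main__":
    # All names below are placeholders.
    print(json.dumps(make_ingress("myapp", "myapp.example.org", "myapp", 80), indent=2))
    print(json.dumps(make_lb_service("myapp-lb", "myapp", 443, 8443), indent=2))
```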

6 of 12

Integration: Certificates

Let’s Encrypt, self-service with cert-manager.io

HTTP-01 challenge, LE allowed in perimeter firewall (upon request)

DNS-01 challenge, cert-manager and CERN DNS integration

CERN CA, custom daemonset

Host Certificate per node
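For the Let's Encrypt path with the HTTP-01 challenge, a cert-manager setup usually consists of an issuer plus one Certificate per host name. The sketch below builds both objects in Python. It assumes an ingress controller of class `nginx`; the contact e-mail, host name, and secret names are placeholders, and the DNS-01 and CERN CA paths are not covered.

```python
import json

# Let's Encrypt production ACME endpoint.
LETSENCRYPT_PROD = "https://acme-v02.api.letsencrypt.org/directory"


def make_cluster_issuer(name: str, email: str, ingress_class: str = "nginx") -> dict:
    """cert-manager ClusterIssuer using ACME with an HTTP-01 solver."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "ClusterIssuer",
        "metadata": {"name": name},
        "spec": {
            "acme": {
                "server": LETSENCRYPT_PROD,
                "email": email,
                # Secret in which cert-manager stores the ACME account key.
                "privateKeySecretRef": {"name": f"{name}-account-key"},
                "solvers": [{"http01": {"ingress": {"class": ingress_class}}}],
            },
        },
    }


def make_certificate(name: str, host: str, issuer: str) -> dict:
    """Certificate for one DNS name, stored in the Secret `<name>-tls`."""
    return {
        "apiVersion": "cert-manager.io/v1",
        "kind": "Certificate",
        "metadata": {"name": name},
        "spec": {
            "secretName": f"{name}-tls",
            "dnsNames": [host],
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
        },
    }


if __name__ == "__main__":
    # Placeholder values throughout.
    print(json.dumps(make_cluster_issuer("letsencrypt", "admin@example.org"), indent=2))
    print(json.dumps(make_certificate("myapp", "myapp.example.org", "letsencrypt"), indent=2))
```

HTTP-01 requires that Let's Encrypt can reach the host over HTTP from outside, which is why the firewall opening mentioned above matters.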

7 of 12

Integration: Monitoring

Metrics

Upstream kube-prometheus-stack

WIP aggregation to central infrastructure

Logging

Fluentd daemonset with http plugin, pushing to central gateway

Transition to fluentbit
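With kube-prometheus-stack, applications are usually scraped by declaring a ServiceMonitor, a custom resource of the Prometheus Operator. The sketch below builds one in Python. The label `release`, whose value is the Helm release name, is what a default install of the chart typically uses to select ServiceMonitors; that label value, the app label, and the port name here are assumptions for illustration.

```python
import json


def make_service_monitor(name: str, app_label: str, port_name: str,
                         release_label: str, interval: str = "30s") -> dict:
    """ServiceMonitor scraping the named port of Services labelled app=<app_label>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            # Label assumed to match the Prometheus serviceMonitorSelector.
            "labels": {"release": release_label},
        },
        "spec": {
            "selector": {"matchLabels": {"app": app_label}},
            "endpoints": [{"port": port_name, "interval": interval}],
        },
    }


if __name__ == "__main__":
    # "kube-prometheus-stack" as release name and "metrics" as port name are placeholders.
    sm = make_service_monitor("myapp", "myapp", "metrics", "kube-prometheus-stack")
    print(json.dumps(sm, indent=2))
```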

8 of 12

Operations: Registry

Based on goharbor

Integration with gitlab-ci

Data in S3, backed by Ceph

Vulnerability scanning via trivy.dev

9 of 12

Operations: ArgoCD

Central repo (private) for all applications

Each application set managed in a public repo, e.g. registry

Integration with private vault for secrets

Alerts via Grafana to mail, Mattermost, Telegram
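To make the repository layout above more concrete, the sketch below builds an Argo CD Application in Python: one object that points Argo CD at a Git repository path and keeps the target namespace in sync with it. The repository URL, path, and namespaces are hypothetical, and the automated sync policy is one possible choice rather than a description of the actual setup.

```python
import json


def make_argocd_application(name: str, repo_url: str, path: str,
                            dest_namespace: str, revision: str = "HEAD") -> dict:
    """Argo CD Application syncing `path` of `repo_url` into `dest_namespace`."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {"repoURL": repo_url, "targetRevision": revision, "path": path},
            "destination": {
                # In-cluster API server address as seen by Argo CD.
                "server": "https://kubernetes.default.svc",
                "namespace": dest_namespace,
            },
            # Automated sync: remove resources deleted from Git, revert manual drift.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    # Placeholder repository and path.
    app = make_argocd_application(
        "registry", "https://gitlab.example.org/apps/registry.git", "deploy", "registry"
    )
    print(json.dumps(app, indent=2))
```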

10 of 12

Work in progress

BC/DR

Multi-cluster / cluster mesh

Improved audit logging with Falco

Adoption of CAPI for cluster lifecycle

SBOM integration for containers

11 of 12

Links

12 of 12

Q & A
