Criticality assessment of CERN-IT services for ATLAS. Each service is scored twice: once for WLCG and Operations, and once for Other Collaboration Needs. In both cases, Criticality = Urgency × Impact.

| Service | Urgency (WLCG/Ops) | Impact (WLCG/Ops) | Criticality (WLCG/Ops) | Urgency (Other) | Impact (Other) | Criticality (Other) | Reason | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *.docs.cern.ch | 4 | 1 | 4 | 4 | 4 | 16 | CORAL/COOL, Recast, ITk, and various other documentation is kept here. No mission-critical documentation. Service aspects are connected to web infrastructure; mkdocs and other software cannot have "outages". | Covered elsewhere |
| Acron service | 7 | 10 | 70 | 7 | 10 | 70 | Used in GPN-related services for DCS, low priority there. Used internally in physics analysis, information protection, and documentation services. At Tier 0, part of the "data collection" for monitoring, renewal of X509 proxies, etc. is done through cron jobs (acrontab service). | Not a (CERN-IT) service |
| ActiveMQ | 1 | 7 | 7 | 1 | 7 | 7 | Required by Rucio and the monitoring infrastructure. | Covered elsewhere AND not a service |
| AFS | 7 | 10 | 70 | 10 | 10 | 100 | Tier 0 control processes rely on AFS/Kerberos; assuming Kerberos authentication lives here and not under some other authentication item. Offline, all home directories on lxplus are on AFS. All physics results are hosted on AFS. Many major websites are hosted on AFS. | Not used by ATLAS |
| AFS Web Hosting | 1 | 1 | 1 | 7 | 10 | 70 | All physics results are hosted on AFS-based webpages. Many other key collaboration websites are hosted on AFS. Most of these websites are moving to webeos in the next 6-10 months, so criticality will decay over that time. | |
| Agile Infrastructure cloud services | 10 | 10 | 100 | 4 | 7 | 28 | OpenStack, GitOps, Puppet, containers (Kubernetes), load balancing (Load-Balancer-as-a-Service, LBaaS). Tier 0 runs on OpenStack VMs. Cloud resources are also needed by users for some software development. Proper functioning of Rucio for ATLAS depends on all these services. | |
| ATCN network at Point 1 | 10 | 10 | 100 | 1 | 1 | 1 | Not all elements of the network are equally important, of course. Covered by an SLA. | |
| ATLAS Windows Terminal Cluster | 10 | 7 | 70 | 1 | 1 | 1 | cerntsatldcs.cern.ch: dedicated instance of WTS; remote operation of DSS and log-in to Point 1. | |
| ActiveDirectory Authentication | 10 | 10 | 100 | 10 | 10 | 100 | Required for e-groups-based authentication; blocks SSO login when down. Kerberos is covered under AFS; SSO is covered separately. | |
| Batch service | 7 | 7 | 49 | 7 | 10 | 70 | HTCondor. Required for physics analysis activities. | |
| BDII | 0 | 0 | 0 | 0 | 0 | 0 | Grid information service: no longer used by ATLAS as far as we know. | |
| BOINC Service | 4 | 1 | 4 | 1 | 1 | 1 | Used for ATLAS@Home, as well as for backfilling some grid sites and/or HPCs; provides a modest amount of computing to the collaboration. Not relied on for critical samples/processes. | |
| Campus Network | | | | | | | | Covered by GPN and External Connection |
| CE | 7 | 7 | 49 | 0 | 0 | 0 | If this is "Compute Element", it impacts Tier 0 data-taking, but it is not really a CERN-IT service, or its availability impact is covered elsewhere. | |
| Ceph | 10 | 10 | 100 | 4 | 7 | 28 | Network-mounted storage. Cinder volumes are used in Rucio for accounting and reporting; the service could be replaced with other network-mounted storage accessible in Kubernetes. Also used in PanDA, AMI, ARC control, and BOINC, and for shared storage for some (limited) software installations. | |
| CERNbox | 1 | 1 | 1 | 1 | 4 | 4 | Has become an important service for file sharing, etc., for both analysis and documents. Alternatives are available, including Dropbox and Google Drive (both of which are used extensively). Connects to storage infrastructure in complex ways (what is "EOS" and what is "CERNbox" is not trivial to delineate). | |
| Certificate Authority Service | 1 | 1 | 1 | 1 | 4 | 4 | For issuing of certificates only; the service is used heavily during business hours, but outages are not a major operational issue. The use of these certificates is covered in other (more critical) items. | |
| CodiMD | 4 | 4 | 16 | 4 | 4 | 16 | Used by some groups (particularly in ADC/DDM/CSops) for meeting notes, presentations, rolling minutes, etc. No critical documentation in CodiMD. | |
| Configuration Management | | | | | | | | Covered in BDII, CRIC, AI Cloud Services, and LANDB |
| CRIC | 7 | 7 | 49 | 1 | 4 | 4 | Used extensively by Grid infrastructure and configuration. For the general collaboration, outages are a more minor problem (used more for cross-checking, accounting, etc.). | |
| CTA | 7 | 7 | 49 | 4 | 4 | 16 | Required for T/DAQ, Tier-0 and Rucio operations, and for local CAT physics analysis activities. The CTA backup of RAW files is required for SFOs to clear the original copy on SFO disk, but this can be overruled in case of CTA problems (the replica on EOS is sufficient). | |
| CVMFS Stratum-0 | 7 | 10 | 70 | 7 | 10 | 70 | Tier 0 and Grid infrastructure rely on CVMFS for software distribution. For the Grid, a failing Stratum-0 blocks deployment but should allow general activities to continue by relying on Stratum-1. User analysis relies heavily on CVMFS as well. | |
| CVMFS Stratum-1 | 7 | 4 | 28 | 7 | 4 | 28 | Tier 0 and Grid infrastructure rely on CVMFS for software distribution. For the Grid, a failing Stratum-0 blocks deployment but should allow general activities to continue by relying on Stratum-1. User analysis relies heavily on CVMFS as well. Fail-over to Stratum-0 should be possible for CERN-based operations. | |
| Data management clients (GFAL, Davix) | 1 | 7 | 7 | 1 | 7 | 7 | Required for all ATLAS storage interactions. These are software, not services, so it is not clear to us how they could have an "outage". | |
| DB-on-Demand | 7 | 10 | 70 | 7 | 7 | 49 | DBoD services are used extensively, both for operations and for user-owned tools. | |
| DCS Data visualization service | 4 | 4 | 16 | 1 | 4 | 4 | Used for diagnostics (CSOPS service machines). Not extensively used by the broader collaboration except for offline checks. | |
| Dedicated batch | 7 | 10 | 70 | 1 | 1 | 1 | Required for Tier-0 operations. The broader collaboration uses standard lxbatch resources. | |
| Development, Deployment, Distribution | | | | | | | | Not clear what "service" this refers to; also covered in other areas |
| DFS | 1 | 1 | 1 | 1 | 1 | 1 | Use limited mostly to administration, and even then limited. | |
| DIM | | | | | | | | Not currently used within ATLAS; also not a service but a software library |
| Discourse | 0 | 0 | 0 | 1 | 4 | 4 | Limited use offline within ATLAS; no usage online. | |
| Documentation browsing | | | | | | | | Covered in *.docs.cern.ch |
| Drupal | | | | | | | | Service outages all covered in other areas; Drupal itself is software, not a service |
| Eduroam | 1 | 1 | 1 | 1 | 4 | 4 | Alternative network infrastructure is available (CERN network). Useful for visitors physically at CERN. | |
| ElasticSearch / OpenSearch | | | | | | | | Covered under Monit |
| Email Service | 10 | 10 | 100 | 10 | 10 | 100 | Critical for all aspects of the collaboration. | |
| EOS | 7 | 10 | 70 | 7 | 10 | 70 | Required for TDAQ, Tier-0 and Rucio operations, physics analysis activities (including local group), database storage, etc. Backs some websites as well. | |
| EOS Web | 1 | 1 | 1 | 4 | 4 | 16 | Destination for many AFS-based websites, so urgency for the rest of the collaboration will steadily increase through 2024 until this reaches the current level of AFS web hosting (and AFS web hosting goes to zero). | |
| External Network Access to CERN | 10 | 10 | 100 | 10 | 10 | 100 | Required for many activities, including remote Tier 0 interventions, monitoring of the system, many physics analysis activities, etc. | |
| FILER | 0 | 0 | 0 | 0 | 0 | 0 | Not currently used within ATLAS. | |
| Frontier and Squid | 7 | 10 | 70 | 4 | 4 | 16 | Used for conditions access at the Tier 0 and across the Grid. Use outside of the Tier 0 and Grid is more limited (testing, development, etc.). | |
| FTS | 10 | 10 | 100 | 4 | 4 | 16 | Required by Rucio for all ATLAS data transfers; use outside of Grid operations is more limited. | |
| GGUS | 10 | 10 | 100 | 1 | 1 | 1 | Critical ticketing service for WLCG operations (including critical tickets and outages). Usage outside of WLCG/Grid operations is more limited. | |
| GitLab | 7 | 4 | 28 | 10 | 10 | 100 | Used as a repository and bug/feature-tracking tool. Required by a broad range of groups and activities, including Rucio for Kubernetes/GitOps, and for fixing issues that might arise during data-taking and in operations (Athena and DAQ software). Urgent patches can be deployed without GitLab in case of major outages. An outage essentially blocks worldwide development, including paper writing. | |
| GitLab CI/CD | 4 | 4 | 16 | 4 | 4 | 16 | Important for supporting testing and deployment in many different repositories. Not as critical as GitLab itself, because outages cause delays but do not completely stop work. | |
| Global xrootd redirector | 0 | 0 | 0 | 0 | 0 | 0 | Not currently used within ATLAS. | |
| Hadoop / Spark | 1 | 7 | 7 | 1 | 4 | 4 | Required for DDM operational reporting and overview reports for management. Usage for other activities is currently more limited and not time-critical. | |
| HammerCloud | 10 | 10 | 100 | 0 | 0 | 0 | Criticality depends on the outage: the wrong outage can set all sites offline worldwide and halt Grid production, while a simple outage in which the service is unavailable but takes no action would have quite low criticality. No usage outside of Grid production. | |
| HPC at CERN | 0 | 0 | 0 | 0 | 0 | 0 | Not currently used within ATLAS for any significant activities that we are aware of. | |
| HTCondor | | | | | | | | Covered under the lxbatch infrastructure; also, HTCondor itself is software, not a service, and so cannot have an "outage" |
| IAM | 7 | 10 | 70 | 4 | 4 | 16 | Criticality will increase as IAM takes over fully from VOMS. Users are currently mostly unaware of which infrastructure has been replaced and which still relies on VOMS, except that the web interfaces are largely still VOMS-based. | |
| Indico | 1 | 1 | 1 | 10 | 10 | 100 | Meeting organisation, both for data-taking and offline activities. Critical to the day-to-day life of the collaboration; any interruption is fatal to most ongoing activities. | |
| Inspire | 1 | 1 | 1 | 1 | 4 | 4 | Mostly paper hosting, jobs, etc.; not important for online/Grid work at all, but outages are inconvenient for collaboration life. | |
| JIRA | 7 | 4 | 28 | 7 | 10 | 70 | Required for certain Trigger workflows and documentation. The primary means of interaction for offline software development; used by many physics analysis groups to organise their work. | |
| Kubernetes / K8s | 10 | 10 | 100 | 10 | 10 | 100 | Required for Rucio and parts of PanDA. Critical both for data-taking and for all parts of production and users' daily activity. | |
| LANDB | 4 | 4 | 16 | 1 | 4 | 4 | In case of downtime, we would not be able to implement changes in the computing infrastructure in P1, nor to follow changes on the EOS/CTA side. In case of a severe outage, we would tell CERN IT not to touch any services that require our re-configuration of P1 infrastructure until the outage is resolved. For non-operational issues, as long as nothing changes and authentication is not affected, this is not a major concern (changes can be postponed until the service has returned). | |
| LHC-OPN / LHC-ONE / GPN | 7 | 10 | 70 | 10 | 10 | 100 | A complete campus network outage would be a major problem; limited network outages can be worked around. WiFi infrastructure is included in this item. Point 1 is partly insulated during data-taking thanks to the technical network, but that only preserves the data-taking itself, not expert interaction, outside monitoring, data transfers, Tier 0 operations, etc. A network outage would take out services at CERN, and so would also be critical to Grid operations. | |
| Linux support | | | | | | | | Not a CERN-IT service that can have an "outage" |
| Lxplus | 7 | 7 | 49 | 7 | 7 | 49 | Used heavily for analysis, coding, debugging, paper writing, file sharing, backing some acrontab jobs, submitting to lxbatch, etc. | |
| MatterMost | 4 | 7 | 28 | 7 | 7 | 49 | Used heavily for daily communication. Not critical for operations (other communication channels exist), but MatterMost is used extensively enough that outages are very inconvenient. A backup MatterMost instance is critical in case of catastrophic outages. | |
| Mobile Phone (CERNphone) | 10 | 10 | 100 | 10 | 10 | 100 | Required to reach the ACR and experts. Other communication mechanisms exist, but this could present a major issue in an emergency (e.g. a fire, if the fire brigade cannot be reached). For outages with more limited impact, criticality might be lower. | |
| Monit | 7 | 7 | 49 | 1 | 4 | 4 | Required for DDM daily operations, including dedicated ATLAS instances of OpenSearch, Grafana, and Elasticsearch. Used less extensively, and with less criticality, in other areas of the collaboration. | |
| MyProxy | 7 | 4 | 28 | 1 | 4 | 4 | For issuing of certificates only; the service is used heavily during business hours, but outages are not a major operational issue. The use of these certificates is critical to WLCG operations, but running jobs would be able to continue. | |
| OpenShift | | | | | | | | Covered by "Agile Infrastructure cloud services" |
| OpenStack | | | | | | | | Covered by "Agile Infrastructure cloud services" |
| Oracle offline (inc. streaming) | 10 | 10 | 100 | 10 | 10 | 100 | Required for T/DAQ, Tier-0 (both control processes and monitoring) and Rucio operations. Oracle online-to-offline replication is the essential basis for offline processing. Also hosts authorship, paper workflow, membership, and shift information, and many other services critical to the collaboration. LANDB is also required (but listed separately). | |
| Oracle online | 10 | 10 | 100 | 1 | 4 | 4 | Essential for data-taking; data loss may be unavoidable if it is not functional (e.g. it is used for the SFO/Tier-0 handshake). By definition not as important away from data-taking, except when copying data to the offline databases. | |
| Px-CC network | 7 | 10 | 70 | 1 | 1 | 1 | Required for T/DAQ and Tier-0 operations. The SFO local disk buffer should allow ~3 days of operation if the network fails. Not clear whether this also includes ATCN and Spectrum (listed separately). By definition not as important away from data-taking. | |
| Remote Access (aka ssh) | | | | | | | | Covered by external network access and GPN availability |
| ROOT / Geant4 | | | | | | | | Neither of these is a CERN-IT service |
| Rucio | | | | | | | | Not a CERN-IT service |
| SiteMon | | | | | | | | Covered by Monit |
| S3 Storage | 1 | 1 | 1 | 1 | 1 | 1 | No production use. Rucio and PanDA use S3 buckets for other purposes, but these do not hold critical data that would be problematic during an outage. | |
| ServiceNow / Ticketing | 4 | 4 | 16 | 7 | 7 | 49 | Critical WLCG/operations tickets can be covered by GGUS tickets and alarm tickets. ServiceNow is still key for reporting major issues to CERN IT, but other communication channels exist in case of a specific outage (e.g. MatterMost or phone). General users are not as aware of these other mechanisms, and rely on ServiceNow for both ticketing and Knowledge Base functionality for some services. | |
| Software License Servers | 1 | 1 | 1 | 1 | 1 | 1 | The only identified use within ATLAS is Coverity, which is not a critical service and is only occasionally used within the experiment. If RHEL9 is validated on start-up against a software license server in the future, this will become highly critical should significant resources move to RHEL9. | |
| Spectrum | 10 | 4 | 40 | 1 | 1 | 1 | Provides monitoring of the ATLAS Technical and Control Network (ATCN) and sends notifications. By definition not as critical for the rest of the collaboration. | |
| SSO | 10 | 4 | 40 | 10 | 10 | 100 | While to first order data-taking is not directly impacted by SSO, it is important for all web services associated with P1 when they are accessed remotely. In comparison, SSO is critical to the day-to-day life of the collaboration, and any interruption is fatal to most ongoing activities. | |
| Twiki | 7 | 4 | 28 | 10 | 10 | 100 | Enormous amount of online and offline documentation. Critical service during normal working hours. | |
| Video conf (i.e. Zoom) | 7 | 7 | 49 | 10 | 10 | 100 | Critical service during normal working hours, both offline and online. Used in the control room for communication with remote experts (though other communication mechanisms generally exist, including phones). | |
| VOMS | 7 | 10 | 70 | 7 | 10 | 70 | Critical service during normal working hours, both offline and online. The certificate issuing service is covered separately. | |
| WAU / WSSA | 1 | 4 | 4 | 1 | 1 | 1 | ATLAS-specific monitoring, particularly Monit, is more heavily used for both compute and storage accounting. Not clear how this connects to the rest of the infrastructure (e.g. whether it underlies some other accounting tools or is separate). | |
| Windows terminal service | 1 | 4 | 4 | 1 | 4 | 4 | Not extensively used for any critical systems; used to some degree for administrative purposes. | |
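The paired criticality scores throughout the table are consistent with a simple product rule (e.g. 7 × 10 = 70, 4 × 4 = 16). A minimal sketch of that scoring, assuming only what the numbers themselves suggest:

```python
# Criticality scoring as implied by the table: each service gets an
# urgency and an impact score (the values 0, 1, 4, 7, and 10 appear),
# and criticality is their product.

def criticality(urgency: int, impact: int) -> int:
    """Combine urgency and impact into a single criticality score."""
    return urgency * impact

# Spot checks against rows of the table:
assert criticality(7, 10) == 70    # e.g. EOS (WLCG and Operations)
assert criticality(4, 4) == 16     # e.g. CodiMD
assert criticality(10, 10) == 100  # e.g. Email Service
```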
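The Acron entry notes that Tier 0 X509 proxy renewal runs through acrontab jobs. As an illustration only (the schedule, target host, and command are hypothetical, not taken from the source; acron entries are assumed to follow the standard crontab fields plus a target host):

```text
# Hypothetical acrontab entry: renew an ATLAS VOMS proxy every 6 hours
# on lxplus.cern.ch. Fields: min hour day month weekday host command.
0 */6 * * * lxplus.cern.ch voms-proxy-init -voms atlas
```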