Criticality assessment of CERN-IT services for ATLAS. Each service is scored twice: once for WLCG and Operations, and once for Other Collaboration Needs. In both cases, Criticality = Urgency × Impact.

| Service | Urgency (WLCG/Ops) | Impact (WLCG/Ops) | Criticality (WLCG/Ops) | Urgency (Other) | Impact (Other) | Criticality (Other) | Reason | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| *.docs.cern.ch | 4 | 1 | 4 | 4 | 4 | 16 | CORAL/COOL, Recast, ITk, and various other documentation is kept here. No mission-critical documentation. Service aspects are connected to web infrastructure; mkdocs and other software cannot have "outages". | Covered elsewhere |
| Acron service | 7 | 10 | 70 | 7 | 10 | 70 | Used in GPN-related services for DCS, low priority there. Used internally in physics analysis, information protection, and documentation services. At Tier 0, part of the "data collection" for monitoring, renewal of X509 proxies, etc. is done through cron jobs (acrontab service). | Not a (CERN-IT) service |
| ActiveMQ | 1 | 7 | 7 | 1 | 7 | 7 | Required by Rucio and the monitoring infrastructure. | Covered elsewhere AND not a service |
| AFS | 7 | 10 | 70 | 10 | 10 | 100 | Tier 0 control processes rely on AFS/Kerberos; assuming Kerberos authentication lives here and not under some other authentication item. Offline, all home directories on lxplus are on AFS. All physics results are hosted on AFS. Many major websites are hosted on AFS. | Not used by ATLAS |
| AFS Web Hosting | 1 | 1 | 1 | 7 | 10 | 70 | All physics results are hosted on AFS-based webpages. Many other key collaboration websites are hosted on AFS. Most of these websites are moving to webeos in the next 6-10 months, so criticality will decay over that time. | |
| Agile Infrastructure cloud services | 10 | 10 | 100 | 4 | 7 | 28 | OpenStack, GitOps, Puppet, containers (Kubernetes), load balancing (Load-Balancer-as-a-Service, LBaaS). Tier 0 runs on OpenStack VMs. Cloud resources are also needed by users for some software development. Proper functioning of Rucio for ATLAS depends on all these services. | |
| ATCN network at Point 1 | 10 | 10 | 100 | 1 | 1 | 1 | Not all elements of the network are equally important, of course. Covered by an SLA. | |
| ATLAS Windows Terminal Cluster | 10 | 7 | 70 | 1 | 1 | 1 | cerntsatldcs.cern.ch: dedicated instance of WTS; remote operation of DSS and log-in to Point 1. | |
| ActiveDirectory Authentication | 10 | 10 | 100 | 10 | 10 | 100 | Required for e-groups-based authentication; blocks SSO login when down. Kerberos is covered under AFS; SSO is covered separately. | |
| Batch service | 7 | 7 | 49 | 7 | 10 | 70 | HTCondor. Required for physics analysis activities. | |
| BDII | 0 | 0 | 0 | 0 | 0 | 0 | Grid information service: no longer used by ATLAS as far as we know. | |
| BOINC Service | 4 | 1 | 4 | 1 | 1 | 1 | Used for ATLAS@Home, as well as for backfilling some grid sites and/or HPCs; provides a modest amount of computing to the collaboration. Not relied on for critical samples/processes. | |
| Campus Network | | | | | | | | Covered by GPN and External Connection |
| CE | 7 | 7 | 49 | 0 | 0 | 0 | If this is "Compute Element", it impacts Tier 0 data-taking, but it is not really a CERN-IT service, or its availability impact is covered elsewhere. | |
| Ceph | 10 | 10 | 100 | 4 | 7 | 28 | Network-mounted storage. Cinder volumes are used in Rucio for accounting and reporting; the service could be replaced with other network-mounted storage accessible in Kubernetes. Also used in PanDA, AMI, ARC control, and BOINC, and for shared storage for some (limited) software installations. | |
| CERNbox | 1 | 1 | 1 | 1 | 4 | 4 | Has become an important service for file sharing, etc., for both analysis and documents. Alternatives are available, including Dropbox and Google Drive (both of which are used extensively). Connects to storage infrastructure in complex ways (what is "EOS" and what is "CERNbox" is not trivial to delineate). | |
| Certificate Authority Service | 1 | 1 | 1 | 1 | 4 | 4 | For issuing of certificates only; the service is used heavily during business hours, but outages are not a major operational issue. The use of these certificates is covered in other (more critical) items. | |
| CodiMD | 4 | 4 | 16 | 4 | 4 | 16 | Used by some groups (particularly in ADC/DDM/CSops) for meeting notes, presentations, rolling minutes, etc. No critical documentation in CodiMD. | |
| Configuration Management | | | | | | | | Covered in BDII, CRIC, AI Cloud Services, and LANDB |
| CRIC | 7 | 7 | 49 | 1 | 4 | 4 | Used extensively by Grid infrastructure and configuration. For the general collaboration, outages are a more minor problem (used more for cross-checking, accounting, etc.). | |
| CTA | 7 | 7 | 49 | 4 | 4 | 16 | Required for T/DAQ, Tier-0 and Rucio operations, and for local CAT physics analysis activities. The CTA backup of RAW files is required for SFOs to clear the original copy on SFO disk, but this can be overruled in case of CTA problems (the replica on EOS is sufficient). | |
| CVMFS Stratum-0 | 7 | 10 | 70 | 7 | 10 | 70 | Tier 0 and Grid infrastructure rely on CVMFS for software distribution. For the Grid, a failing Stratum-0 blocks deployment but should allow general activities to continue by relying on Stratum-1. User analysis relies heavily on CVMFS as well. | |
| CVMFS Stratum-1 | 7 | 4 | 28 | 7 | 4 | 28 | Tier 0 and Grid infrastructure rely on CVMFS for software distribution. For the Grid, a failing Stratum-0 blocks deployment but should allow general activities to continue by relying on Stratum-1. User analysis relies heavily on CVMFS as well. Fail-over to Stratum-0 should be possible for CERN-based operations. | |
| Data management clients (GFAL, Davix) | 1 | 7 | 7 | 1 | 7 | 7 | Required for all ATLAS storage interactions. These are software, not services, so it is not clear to us how they could have an "outage". | |
| DB-on-Demand | 7 | 10 | 70 | 7 | 7 | 49 | DBoD services are used extensively, both for operations and for user-owned tools. | |
| DCS Data visualization service | 4 | 4 | 16 | 1 | 4 | 4 | Used for diagnostics (CSOPS service machines). Not extensively used by the broader collaboration except for offline checks. | |
| Dedicated batch | 7 | 10 | 70 | 1 | 1 | 1 | Required for Tier-0 operations. The broader collaboration uses standard lxbatch resources. | |
| Development, Deployment, Distribution | | | | | | | | Not clear what "service" this refers to; also covered in other areas |
| DFS | 1 | 1 | 1 | 1 | 1 | 1 | Use limited mostly to administration, and even then limited. | |
| DIM | | | | | | | | Not currently used within ATLAS; also not a service but a software library |
| Discourse | 0 | 0 | 0 | 1 | 4 | 4 | Limited use offline within ATLAS; no usage online. | |
| Documentation browsing | | | | | | | | Covered in *.docs.cern.ch |
| Drupal | | | | | | | | Service outages all covered in other areas; Drupal itself is software, not a service |
| Eduroam | 1 | 1 | 1 | 1 | 4 | 4 | Alternative network infrastructure is available (CERN network). Useful for visitors physically at CERN. | |
| ElasticSearch / OpenSearch | | | | | | | | Covered under Monit |
| Email Service | 10 | 10 | 100 | 10 | 10 | 100 | Critical for all aspects of the collaboration. | |
| EOS | 7 | 10 | 70 | 7 | 10 | 70 | Required for TDAQ, Tier-0 and Rucio operations, physics analysis activities (including local group), database storage, etc. Backs some websites as well. | |
| EOS Web | 1 | 1 | 1 | 4 | 4 | 16 | Destination for many AFS-based websites, so urgency for the rest of the collaboration will steadily increase through 2024 until this reaches the current level of AFS web hosting (and AFS web hosting goes to zero). | |
| External Network Access to CERN | 10 | 10 | 100 | 10 | 10 | 100 | Required for many activities, including remote Tier 0 interventions, monitoring of the system, many physics analysis activities, etc. | |
| FILER | 0 | 0 | 0 | 0 | 0 | 0 | Not currently used within ATLAS. | |
| Frontier and Squid | 7 | 10 | 70 | 4 | 4 | 16 | Used for conditions access at the Tier 0 and across the Grid. Use outside of the Tier 0 and Grid is more limited (testing, development, etc.). | |
| FTS | 10 | 10 | 100 | 4 | 4 | 16 | Required by Rucio for all ATLAS data transfers; use outside of Grid operations is more limited. | |
| GGUS | 10 | 10 | 100 | 1 | 1 | 1 | Critical ticketing service for WLCG operations (including critical tickets and outages). Usage outside of WLCG/Grid operations is more limited. | |
| GitLab | 7 | 4 | 28 | 10 | 10 | 100 | Used as a repository and bug/feature-tracking tool. Required by a broad range of groups and activities, including Rucio for Kubernetes/GitOps, and for fixing issues that might arise during data-taking and in operations (Athena and DAQ software). Urgent patches can be deployed without GitLab in case of major outages. An outage essentially blocks worldwide development, including paper writing. | |
| GitLab CI/CD | 4 | 4 | 16 | 4 | 4 | 16 | Important for supporting testing and deployment in many different repositories. Not as critical as GitLab itself, because outages cause delays but do not completely stop work. | |
| Global xrootd redirector | 0 | 0 | 0 | 0 | 0 | 0 | Not currently used within ATLAS. | |
| Hadoop / Spark | 1 | 7 | 7 | 1 | 4 | 4 | Required for DDM operational reporting and overview reports for management. Usage for other activities is currently more limited and not time-critical. | |
| HammerCloud | 10 | 10 | 100 | 0 | 0 | 0 | Criticality depends on the outage: the wrong outage can set all sites offline worldwide and halt Grid production, while a simple outage in which the service is unavailable but takes no action would have quite low criticality. No usage outside of Grid production. | |
| HPC at CERN | 0 | 0 | 0 | 0 | 0 | 0 | Not currently used within ATLAS for any significant activities that we are aware of. | |
| HTCondor | | | | | | | | Covered under the lxbatch infrastructure; also, HTCondor itself is software, not a service, and so cannot have an "outage" |
| IAM | 7 | 10 | 70 | 4 | 4 | 16 | Criticality will increase as IAM takes over fully from VOMS. Users are currently mostly unaware of which infrastructure has been replaced and which still relies on VOMS, except that the web interfaces are largely still VOMS-based. | |
| Indico | 1 | 1 | 1 | 10 | 10 | 100 | Meeting organisation, both for data-taking and offline activities. Critical to the day-to-day life of the collaboration; any interruption is fatal to most ongoing activities. | |
| Inspire | 1 | 1 | 1 | 1 | 4 | 4 | Mostly paper hosting, jobs, etc.; not important for online/Grid work at all, but outages are inconvenient for collaboration life. | |
| JIRA | 7 | 4 | 28 | 7 | 10 | 70 | Required for certain Trigger workflows and documentation. The primary means of interaction for offline software development; used by many physics analysis groups to organise their work. | |
| Kubernetes / K8s | 10 | 10 | 100 | 10 | 10 | 100 | Required for Rucio and parts of PanDA. Critical both for data-taking and for all parts of production and users' daily activity. | |
| LANDB | 4 | 4 | 16 | 1 | 4 | 4 | In case of downtime, we would not be able to implement changes in the computing infrastructure in P1, nor to follow changes on the EOS/CTA side. In case of a severe outage, we would tell CERN IT not to touch any services that require our re-configuration of P1 infrastructure until the outage is resolved. For non-operational issues, as long as nothing changes and authentication is not affected, this is not a major concern (changes can be postponed until the service has returned). | |
| LHC-OPN / LHC-ONE / GPN | 7 | 10 | 70 | 10 | 10 | 100 | A complete campus network outage would be a major problem; limited network outages can be worked around. WiFi infrastructure is included in this item. Point 1 is partly insulated during data-taking thanks to the technical network, but that only preserves the data-taking itself, not expert interaction, outside monitoring, data transfers, Tier 0 operations, etc. A network outage would take out services at CERN, and so would also be critical to Grid operations. | |
| Linux support | | | | | | | | Not a CERN-IT service that can have an "outage" |
| Lxplus | 7 | 7 | 49 | 7 | 7 | 49 | Used heavily for analysis, coding, debugging, paper writing, file sharing, backing some acrontab jobs, submitting to lxbatch, etc. | |
| MatterMost | 4 | 7 | 28 | 7 | 7 | 49 | Used heavily for daily communication. Not critical for operations (other communication channels exist), but MatterMost is used extensively enough that outages are very inconvenient. A backup MatterMost instance is critical in case of catastrophic outages. | |
| Mobile Phone (CERNphone) | 10 | 10 | 100 | 10 | 10 | 100 | Required to reach the ACR and experts. Other communication mechanisms exist, but this could present a major issue in an emergency (e.g. a fire, if the fire brigade cannot be reached). For outages with more limited impact, criticality might be lower. | |
| Monit | 7 | 7 | 49 | 1 | 4 | 4 | Required for DDM daily operations, including dedicated ATLAS instances of OpenSearch, Grafana, and Elasticsearch. Used less extensively, and with less criticality, in other areas of the collaboration. | |
| MyProxy | 7 | 4 | 28 | 1 | 4 | 4 | For issuing of certificates only; the service is used heavily during business hours, but outages are not a major operational issue. The use of these certificates is critical to WLCG operations, but running jobs would be able to continue. | |
| OpenShift | | | | | | | | Covered by "Agile Infrastructure cloud services" |
| OpenStack | | | | | | | | Covered by "Agile Infrastructure cloud services" |
| Oracle offline (inc. streaming) | 10 | 10 | 100 | 10 | 10 | 100 | Required for T/DAQ, Tier-0 (both control processes and monitoring) and Rucio operations. Oracle online-to-offline replication is the essential basis for offline processing. Also hosts authorship, paper workflow, membership, and shift information, and many other services critical to the collaboration. LANDB is also required (but listed separately). | |
| Oracle online | 10 | 10 | 100 | 1 | 4 | 4 | Essential for data-taking; data loss may be unavoidable if it is not functional (e.g. it is used for the SFO/Tier-0 handshake). By definition not as important away from data-taking, except when copying data to the offline databases. | |
| Px-CC network | 7 | 10 | 70 | 1 | 1 | 1 | Required for T/DAQ and Tier-0 operations. The SFO local disk buffer should allow ~3 days of operation if the network fails. Not clear whether this also includes ATCN and Spectrum (listed separately). By definition not as important away from data-taking. | |
| Remote Access (aka ssh) | | | | | | | | Covered by external network access and GPN availability |
| ROOT / Geant4 | | | | | | | | Neither of these is a CERN-IT service |
| Rucio | | | | | | | | Not a CERN-IT service |
| SiteMon | | | | | | | | Covered by Monit |
| S3 Storage | 1 | 1 | 1 | 1 | 1 | 1 | No production use. Rucio and PanDA use S3 buckets for other purposes, but these do not hold critical data that would be problematic during an outage. | |
| ServiceNow / Ticketing | 4 | 4 | 16 | 7 | 7 | 49 | Critical WLCG/operations tickets can be covered by GGUS tickets and alarm tickets. ServiceNow is still key for reporting major issues to CERN IT, but other communication channels exist in case of a specific outage (e.g. MatterMost or phone). General users are not as aware of these other mechanisms, and rely on ServiceNow for both ticketing and Knowledge Base functionality for some services. | |
| Software License Servers | 1 | 1 | 1 | 1 | 1 | 1 | The only identified use within ATLAS is Coverity, which is not a critical service and is only occasionally used within the experiment. If RHEL9 is validated on start-up against a software license server in the future, this will become highly critical should significant resources move to RHEL9. | |
| Spectrum | 10 | 4 | 40 | 1 | 1 | 1 | Provides monitoring of the ATLAS Technical and Control Network (ATCN) and sends notifications. By definition not as critical for the rest of the collaboration. | |
| SSO | 10 | 4 | 40 | 10 | 10 | 100 | While to first order data-taking is not directly impacted by SSO, it is important for all web services associated with P1 when they are accessed remotely. In comparison, SSO is critical to the day-to-day life of the collaboration, and any interruption is fatal to most ongoing activities. | |
| Twiki | 7 | 4 | 28 | 10 | 10 | 100 | Enormous amount of online and offline documentation. Critical service during normal working hours. | |
| Video conf (i.e. Zoom) | 7 | 7 | 49 | 10 | 10 | 100 | Critical service during normal working hours, both offline and online. Used in the control room for communication with remote experts (though other communication mechanisms generally exist, including phones). | |
| VOMS | 7 | 10 | 70 | 7 | 10 | 70 | Critical service during normal working hours, both offline and online. The certificate issuing service is covered separately. | |
| WAU / WSSA | 1 | 4 | 4 | 1 | 1 | 1 | ATLAS-specific monitoring, particularly Monit, is more heavily used for both compute and storage accounting. Not clear how this connects to the rest of the infrastructure (e.g. whether it underlies some other accounting tools or is separate). | |
| Windows terminal service | 1 | 4 | 4 | 1 | 4 | 4 | Not extensively used for any critical systems; used to some degree for administrative purposes. | |
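The paired criticality scores throughout the table are consistent with a simple product rule (e.g. 7 × 10 = 70, 4 × 4 = 16). A minimal sketch of that scoring, assuming only what the numbers themselves suggest:

```python
# Criticality scoring as implied by the table: each service gets an
# urgency and an impact score (the values 0, 1, 4, 7, and 10 appear),
# and criticality is their product.

def criticality(urgency: int, impact: int) -> int:
    """Combine urgency and impact into a single criticality score."""
    return urgency * impact

# Spot checks against rows of the table:
assert criticality(7, 10) == 70    # e.g. EOS (WLCG and Operations)
assert criticality(4, 4) == 16     # e.g. CodiMD
assert criticality(10, 10) == 100  # e.g. Email Service
```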
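The Acron entry notes that Tier 0 X509 proxy renewal runs through acrontab jobs. As an illustration only (the schedule, target host, and command are hypothetical, not taken from the source; acron entries are assumed to follow the standard crontab fields plus a target host):

```text
# Hypothetical acrontab entry: renew an ATLAS VOMS proxy every 6 hours
# on lxplus.cern.ch. Fields: min hour day month weekday host command.
0 */6 * * * lxplus.cern.ch voms-proxy-init -voms atlas
```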