1 of 20

WLCG Experiments Test Framework (ETF)

Marian Babik, CERN

HEPiX IPv6 WG meeting

2 of 20

Motivation

Current Service Availability Monitoring (SAM) structure

  • WLCG Experiments Test Framework (ETF)
      • WLCG testing middleware (running so-called SAM tests)
      • Active testing of site services and reporting back to SAM3/MONIT
      • Common to all experiments
      • Main source for WLCG Availability/Reliability Reports (different from EGI/ARGO)
  • SAM3/MONIT
      • Aggregation (via custom algos), visualisation and reporting
      • Support for multiple sources of metrics (e.g. ALICE storage tests, ATLAS ASAP)
  • A generic test framework remains fundamental for WLCG monitoring
    • Keeping track of site availability/reliability
    • Running deployment campaigns (IPv6, HTTP, etc.)
    • Provides a means of isolation when debugging site/experiment issues
      • Middleware bugs, site setup/configuration, latency-sensitive issues/timeouts, etc.
    • Contributing to the operational toolchain of the experiments


HEPiX IPv6 Meeting

3 of 20

Overview


Generic test middleware based on open source

  • Checkmk, Nagios core and Messaging (ActiveMQ)

Focuses on functional testing (atomic)

  • Direct job submissions, worker node env. testing
  • Core storage operations
  • Remote API testing and/or network testing (ping/ICMP)

~ 150 sites, 1200 hosts monitored

~ 10 metrics/host

~ 1M metrics/day
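
The three figures above can be cross-checked with a little arithmetic. The sketch below assumes that "metrics/day" means individual test results per day and that results are spread evenly over all host/metric pairs; both are assumptions, not statements from the slides.

```python
# Back-of-envelope check of the approximate figures quoted above.
hosts = 1200
metrics_per_host = 10
results_per_day = 1_000_000

# Number of distinct host/metric pairs being tested.
distinct_metrics = hosts * metrics_per_host

# Assumption: results are spread evenly over all pairs.
runs_per_metric_per_day = results_per_day / distinct_metrics
mean_interval_min = 24 * 60 / runs_per_metric_per_day

print(f"{distinct_metrics} host/metric pairs")
print(f"~{runs_per_metric_per_day:.0f} results per pair per day")
print(f"~{mean_interval_min:.0f} minutes between results on average")
```

Under these assumptions each test would run roughly every quarter of an hour, which is consistent with a configurable per-test schedule.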

High-level functional testing

Plugins conforming to Nagios standard

Configurable schedule for test execution

Checkmk dashboard to show results
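
As an illustration of what "conforming to the Nagios standard" means in practice: a plugin reports its result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a single status line, optionally followed by performance data after a `|`. The sketch below is plain Python and does not use python-nap; the latency metric, thresholds and function names are hypothetical.

```python
# Exit codes defined by the Nagios plugin guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATUS_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(latency_ms, warn_ms=500.0, crit_ms=2000.0):
    """Map a measured latency onto a status code (thresholds are illustrative)."""
    if latency_ms is None:
        return UNKNOWN
    if latency_ms >= crit_ms:
        return CRITICAL
    if latency_ms >= warn_ms:
        return WARNING
    return OK


def format_output(status, latency_ms, warn_ms=500.0, crit_ms=2000.0):
    """Build the single status line; text after '|' is optional performance data."""
    if latency_ms is None:
        return "UNKNOWN - no measurement available"
    return (f"{STATUS_NAMES[status]} - latency {latency_ms:.0f} ms"
            f" | latency={latency_ms:.0f}ms;{warn_ms:.0f};{crit_ms:.0f}")


def main(latency_ms):
    """Print the status line and return the exit code for the caller to use."""
    status = evaluate(latency_ms)
    print(format_output(status, latency_ms))
    return status

# A real plugin would time an actual operation and finish with:
#   sys.exit(main(measured_latency_ms))
```

Because the contract is only an exit code plus one line of text, the same plugin can be scheduled by Nagios or Checkmk or run by hand when debugging.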


4 of 20

Architecture

ETF Core Framework

  • Frontend API, configuration, scheduling, alerts

Plugins (probes/tests)

  • Range of available plugins to test a broad range of services
  • Contributed by experiments, PTs, TFs and open source projects (Checkmk), etc.
  • Python library to help write plugins (python-nap)

MQ Stream for publishing results

Job Submission Framework (JESS)

  • Framework to write job submission plugins (submit/manage jobs, retrieve worker node results, etc.)

Worker Node Framework (WN-FM)

  • Micro-scheduler to run tests on the WNs (configure and execute WN tests, collect results)


5 of 20

Deployment and Operations

Experiment instances @CERN (IPv4-only/IPv6-only in QA, IPv4-only in PROD)

perfSONAR infrastructure monitoring @OSG

ETF now runs in containers and is integrated with GitLab CI

  • Each experiment has its own container/image and GitLab repository
    • Full control over packages and versions to be deployed
  • ETF can be deployed in the experiment-specific environment if needed
  • Faster development cycle - changes propagated to QA upon each commit
    • Each commit triggers container rebuild and deployment to QA, one-click deploy to prod
  • Simplified deployment - auto-deployed directly from GitLab
    • Easy to rollback


6 of 20

Plugins/Tests


                Plugins                           Users/Experiments           Maintained by
--------------  --------------------------------  --------------------------  ----------------
Job Submission  CREAM, ARC, HTCONDOR-CE; JESS**   LHCb, ALICE, ATLAS, CMS     ETF
Worker Nodes    ATLAS (3), CMS (11), LHCb (7)     ATLAS, CMS, LHCb            ATLAS, CMS, LHCb
Storage         GFAL2 (SRM, gsiftp, XRoot, HTTP)  ATLAS                       ATLAS
                GFAL2 (SRM)                       CMS                         CMS
                XRoot**                           CMS                         CMS
                HTTPs/WebDAV**                    HTTP TF*                    HTTP TF*
Network         perfSONAR infrastructure**        WLCG Network Throughput WG  OSG, WLCG

** Uses new library for writing plugins (python-nap)
* Probe is still supported by GFAL2 team


7 of 20

Summary

  • ETF is a container-based application combining open source software with a set of frameworks and APIs to provide a flexible testing suite
  • Easy to extend, re-locate and support new experiments and technologies
  • Supported as part of the CERN IT Monitoring Stack
  • Currently deployed at CERN for all four experiments
    • Supporting IPv4-only and IPv6-only monitoring
    • Experiment contacts have access in case they need to debug and/or follow up on issues
    • Central instance provides a site-level view (one place to see results from all experiments)
  • Additional deployment at OSG for perfSONAR infrastructure monitoring
    • Strong interest from other communities to have this available as a generic tool
  • MONIT reporting ready for IPv6
    • We’re able to aggregate IPv6 results and compute either IPv6-only or combined IPv6/IPv4 profiles
    • Experiments need to add IPv6 metrics to their A/R profile
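
To make the idea of IPv6-only versus combined IPv6/IPv4 profiles concrete, here is a minimal sketch of a profile-based aggregation. It is not the MONIT algorithm, which the slides do not describe; the metric names, the profile definitions and the "worst status wins" rule are all assumptions chosen for illustration.

```python
# Nagios-style status codes. Assumption: a larger value is treated as worse,
# so UNKNOWN counts as the worst outcome in this sketch.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical metric results for one service endpoint:
# (metric name, IP family, status).
results = [
    ("storage-read", "ipv4", OK),
    ("storage-write", "ipv4", OK),
    ("storage-read", "ipv6", OK),
    ("storage-write", "ipv6", CRITICAL),
]

# Hypothetical profiles: which IP families count towards the aggregate status.
PROFILES = {
    "ipv4-only": {"ipv4"},
    "ipv6-only": {"ipv6"},
    "ipv6/ipv4": {"ipv4", "ipv6"},
}


def profile_status(results, families):
    """Worst status among the metrics whose IP family belongs to the profile."""
    selected = [status for _, family, status in results if family in families]
    return max(selected) if selected else UNKNOWN


summary = {name: profile_status(results, fams) for name, fams in PROFILES.items()}
print(summary)
```

In this example the endpoint looks healthy under an IPv4-only profile but fails under any profile that includes the IPv6 metrics, so results only affect availability/reliability figures once the IPv6 metrics are part of the profile.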


8 of 20

Questions?


9 of 20

WLCG Networks: Update on Monitoring and Analytics

perfSONAR Update

S. McKee (1), B. Bockelman (2), R. Gardner (3), I. Vukotic (3), M. Babik (4), D. Weitzel (5), M. Zvada (5), E. F. Hernandez (6)

(1) University of Michigan, (2) Morgridge Institute for Research, (3) University of Chicago, (4) CERN, (5) University of Nebraska, (6) UCSD

10 of 20

perfSONAR News

perfSONAR 4.2 was released (4.2.2 is the latest release)

  • New features and plugins
      • GridFTP plug-in - Significant interest from NRP community and others.
      • Test schedule pre-emption - Easier for manual tests to get a slot on busy hosts
      • Additional pSConfig utilities - Continuing to make meshes easier to build and manage through command-line and graphical interface
      • Lookup Service improvements - Bulk renewals and record signing
  • pScheduler adds preemptive scheduling support
    • Retires BWCTL - still installed but no longer configured
    • pScheduler requires port 443 to be open to all (potential) testing nodes
  • Docker support (for “testpoint” deployment)
  • SL6 no longer supported
    • Our recommendation: reinstall with CentOS7 ASAP; don’t worry about saving data
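
The port 443 requirement above is easy to verify from a peer host before tests are scheduled. The following is a minimal sketch using only the Python standard library; it checks TCP reachability, not whether pScheduler itself is answering, and the hostname in the usage note is a placeholder.

```python
import socket


def tcp_port_open(host, port=443, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection tries every address the name resolves to,
        # so it covers both IPv4 and IPv6 endpoints.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `tcp_port_open("ps-latency.example.org")` (hypothetical host) returns False when a firewall drops or rejects the connection.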


11 of 20

perfSONAR deployment


261 Active perfSONAR instances

- 207 production endpoints

- 173 running 4.2; 138 on 4.2.1 (latest)

- T1/T2 coverage

- Continuously testing over 5000 links

- Testing coordinated and managed from central place

- Dedicated latency and bandwidth nodes at each site

- Open platform - tests can be scheduled by anyone who participates in our network and runs perfSONAR


12 of 20

Platform Overview

  • Collects, stores, configures and transports all network metrics
    • Distributed deployment - operated in collaboration
  • All perfSONAR metrics are available via API, live stream or directly on the analytical platforms
    • Complementary network metrics such as ESNet, LHCOPN traffic also via same channels


[Figure: platform diagram. Components shown: Collector (NEW), Store (long-term), Store (short-term), pS Monitoring, pS Configuration, Tape, Experiments, MONIT-GRAFANA, pS Dashboard]


13 of 20

MONIT perfSONAR IPv6 dashboard


14 of 20

MONIT perfSONAR IPv6 dashboard


15 of 20

Network Analytics Activities

During the spring of 2019 we engaged a group of students to work on analysis and visualization of our network metrics

  • Machine learning: At Chicago we have Sushant Bansal (Master’s student)
  • Path Analysis: At Michigan we have Manjari Trivedi (Undergraduate), Yuan Li (Graduate student; graduated)
  • R&E Network Analytics: In Bulgaria we have Petya Vasileva (PhD student)


Prototype path display using network metrics from ES

The students have worked independently over summer 2019 learning about the data we have and the analytics platform itself

For Fall 2019, the goal was to clean up and annotate the path information by filtering out bad or incomplete traceroute measurements, and then to analyze, organize and display path information together with the corresponding network metrics such as packet loss, throughput or delay
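
The clean-up step can be pictured with a small sketch. The record layout (a dict with `src`, `dest` and a `hops` list in which `None` marks a hop that did not answer) and the strict completeness rule are assumptions for illustration, not the format stored in Elasticsearch or the criteria used by the students.

```python
def is_complete(record):
    """Keep a measurement only if every hop answered and the last hop is the destination."""
    hops = record.get("hops") or []
    if not hops or any(hop is None for hop in hops):
        return False
    return hops[-1] == record["dest"]


def clean(records):
    """Drop bad or incomplete traceroute measurements."""
    return [r for r in records if is_complete(r)]


# Hypothetical measurements using documentation address ranges.
records = [
    {"src": "192.0.2.1", "dest": "198.51.100.7",
     "hops": ["192.0.2.254", "203.0.113.9", "198.51.100.7"]},   # complete
    {"src": "192.0.2.1", "dest": "198.51.100.7",
     "hops": ["192.0.2.254", None, "198.51.100.7"]},            # silent hop
    {"src": "192.0.2.1", "dest": "198.51.100.7",
     "hops": ["192.0.2.254", "203.0.113.9"]},                   # never reached dest
    {"src": "192.0.2.1", "dest": "198.51.100.7", "hops": []},   # empty
]

print(len(clean(records)))
```

A looser rule, for instance tolerating a single silent hop, may be preferable in practice, since many routers do not answer traceroute probes.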


16 of 20

Collaboration with MEPhI on Network Visualization

The SAND project is collaborating with MEPhI (Moscow Engineering Physics Institute) on network path visualization


Containerized Version running at UC https://perfsonar.uc.ssl-hep.org/graph/viewer

Application being Updated

  • Rebuilding the Django server that serves as the API for Elasticsearch
    • Due to Cross-Origin Resource Sharing (CORS) and security reasons
  • Moving the front-end part of the application from jQuery to ReactJS
    • Allows heavy modules to be replaced with light React components
  • Resolving issues related to unique path identification
    • Makes it possible to monitor whole paths rather than single path records
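
One way to think about unique path identification is to derive a stable key from the ordered hop sequence, so that repeated measurements of the same route share one identifier. The sketch below is an assumption about how such a key could be built; the slides do not say how the application identifies paths, and the function name and key layout are hypothetical.

```python
import hashlib


def path_id(src, dest, hops):
    """Derive a stable identifier from the ordered hop sequence between src and dest."""
    key = "|".join([src, dest, *hops])
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


# Two measurements of the same route and one that takes a different middle hop.
a = path_id("192.0.2.1", "198.51.100.7", ["192.0.2.254", "203.0.113.9", "198.51.100.7"])
b = path_id("192.0.2.1", "198.51.100.7", ["192.0.2.254", "203.0.113.9", "198.51.100.7"])
c = path_id("192.0.2.1", "198.51.100.7", ["192.0.2.254", "203.0.113.77", "198.51.100.7"])

print(a == b, a == c)
```

Grouping records by such an identifier lets a dashboard follow a path over time and flag the moment the route between two sites changes.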


17 of 20

Platform Use

  • WLCG and OSG operations
    • Baseline testing and interactive debugging for incidents reported via support unit
    • Regular reports at the WLCG operations coordination and WLCG weekly operations
    • Providing Grafana dashboards that help visualise the metrics
  • Analytical studies and diagnostics
    • Providing derived metrics that will make it easier to diagnose issues
    • Alerting and notifications
    • Advanced visualisation of network paths
  • Cloud testing - HNSciCloud - testing commercial cloud providers
    • perfSONAR part of the standard benchmarking tools, developed as part of EU OCRE
  • Collaboration with GridPP and CCSTAR communities
    • Common platform for configuration, collection and storage of network measurements
  • HEPiX IPv6 WG - now testing bandwidth and paths over IPv6
  • Collaboration with other science domains deploying perfSONAR


18 of 20

Plans

  • Providing new features that will make it easier to integrate with other communities
  • EU projects ARCHIVER and ESCAPE plan to use our infrastructure
  • ALICE, StashCache and CC* communities will continue to evolve
  • 100G and experimental testing - planning 100G testbed and evaluations
    • If interested please join perfsonar-100g mailing list (subscription link is in the references)
  • Working closely with the SAND (https://sand-ci.org/ ) project on analytics
    • SAND plans to keep hardening the data pipeline while continuing to extract value from the various network metrics being gathered
    • The students associated with the project are cleaning, annotating and using the data to provide tools that help us better understand our networks and localize problems
    • In its final year, the SAND project aims to provide near real-time network problem identification, path correlation and problem localization through analytics and visualization; these goals are well aligned with OSG and WLCG needs


19 of 20

Summary

  • OSG, in collaboration with WLCG, is operating a comprehensive network monitoring platform
  • Platform has been used in a wide range of activities from core OSG/WLCG operations to Cloud testing and IPv6 deployment
  • Providing feedback to LHCOPN/LHCONE, HEPiX, WLCG and OSG communities
  • Next version of perfSONAR will enable additional functionality as well as improve overall stability and performance
  • IRIS-HEP and SAND will contribute to the operations and R&D in the network area
  • Further analytical studies are planned to better understand our use of networks and how it could be improved


20 of 20

References
