1 of 27

The SAND Project and a New Research Networking Technical Working Group

Shawn McKee / University of Michigan

on behalf of the SAND Collaboration

At the Internet2 Community Measurement, Metrics, and Telemetry meeting - May 12, 2020

2 of 27

Outline

  • Overview of WLCG/OSG Networking; The Motivation for SAND
    • Existing data pipeline and dashboards
    • Information on the “context” for SAND
  • Part 1: The SAND project and goals
  • Part 2: A new Research Networking Technical Working Group


I2 PWG-CMMT

 Internet2 Performance Working Group Community Measurement, Metrics, and Telemetry

3 of 27

OSG/WLCG Networking Activities

  • OSG is in its 8th year of supporting WLCG/OSG networking focused on:
    • Assisting its users and affiliates in identifying and fixing network bottlenecks
    • Developing and operating a comprehensive Network Monitoring Platform
    • Improving our ability to manage and use network topology and network metrics for analytics
  • WLCG Network Throughput Working Group was established to ensure sites and experiments can better understand and fix networking issues:
    • Oversees the WLCG perfSONAR infrastructure
      • Core infrastructure for taking network measurements and performing low-level debugging activities
    • Coordinates WLCG network performance incidents - runs a dedicated support unit which involves sites, network experts, R&Es and perfSONAR developers
      • Many issues are potentially resolvable within the working group


4 of 27

perfSONAR deployment

288 Active perfSONAR instances

- 207 production endpoints

- WLCG T1/T2 coverage

- Continuously testing over 5000 links

- Testing coordinated and managed from a central place

- Dedicated latency and bandwidth nodes at each site

- Open platform - tests can be scheduled by anyone who participates in our network and runs perfSONAR


5 of 27

Network Measurement Platform Overview

  • Collects, stores, configures and transports all network metrics
    • Distributed deployment - operated in collaboration
  • All perfSONAR metrics are available via API, live stream or directly on the analytical platforms
    • Complementary network metrics such as ESnet and LHCOPN traffic are also available via the same channels


[Diagram: platform components: Collector, Store (long-term), Store (short-term), pS Monitoring, pS Configuration, Tape, Experiments, MONIT-GRAFANA, pS Dashboard]


6 of 27

Grafana - perfSONAR dashboard

  • Now includes all WLCG sites that run perfSONAR
  • Updated dashboards to support latest Grafana version
  • Added row that tracks RTT and number of hops as reported by traceroute/tracepath


6th SIG-PMV Meeting Dublin

7 of 27

Grafana - IPv6 dashboard

  • Added IPv6 dashboard
    • Side-by-side comparison between IPv4 and IPv6 performance
  • Due to performance limitations, it was agreed that we won’t configure IPv6 latency tests


See more Grafana dashboards at http://monit-grafana-open.cern.ch/


8 of 27

Current Platform Use

  • WLCG and OSG operations
    • Baseline testing and interactive debugging for incidents reported via support unit
    • Regular reports at the WLCG operations coordination and WLCG weekly operations
    • Providing Grafana dashboards that help visualise the metrics
  • Close collaboration with perfSONAR consortium
  • Enabling analytical studies - data stored in the ATLAS Analytics platform
    • Providing an important source for network metrics (bandwidth, latency, path)
  • Cloud testing - HNSciCloud - testing commercial cloud providers
    • Baselining and evaluating network performance: critical to evaluate effectiveness for LHC
  • HEPiX IPv6 WG
    • Now testing bandwidth and paths over IPv6
  • Collaboration with other science domains deploying perfSONAR
    • E.g., US Universities, Pittsburgh Supercomputing Center, European Bioinformatics Institute
    • Also close collaboration with (N)RENs who provide LHCONE perfSONAR coverage
  • What is MISSING is the work to extract value from the data being gathered!


9 of 27

The NSF SAND Project

SAND: Service Analysis and Network Diagnosis

This is an NSF-funded project (award #1827116) focused on combining, visualizing, and analyzing disparate network monitoring and service logging data. (GOAL: capitalize on our rich network dataset!)


Website https://sand-ci.org/ (Project started in September 2018 and will last 2 years)

PI: Brian Bockelman, Co-PIs: Shawn McKee, Rob Gardner

Brian Bockelman

Associate Scientist

Morgridge Institute for Research, University of Wisconsin

bbockelman@morgridge.org

Shawn McKee

Research Scientist

University of Michigan Physics

smckee@umich.edu

Rob Gardner

Senior Scientist

University of Chicago Physics

rwg@hep.uchicago.edu


10 of 27

SAND Project Vision

SAND will extend and augment the OSG networking efforts, with the primary goal of extracting useful insights and metrics from the wealth of network data being gathered from perfSONAR, FTS, R&E network flows, and related network information from HTCondor and others.

Shown on the top diagram to the right is the logical SAND data flow from source to analytics.

The bottom diagram to the right shows the potential power of the extensive network tomography we have by continuously measuring thousands of R&E network paths. In this example, 3 host-pairs see differing packet loss on intersecting paths. We can infer a solution!
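One simple way to make this kind of inference is "boolean" tomography: a hop is suspect if it lies on every lossy path and on no clean path. The sketch below is only an illustration under that assumption; the hop names, the extra clean host-pair G-H, and the 0.5% threshold are hypothetical and not taken from the SAND data.

```python
from collections import Counter

# Hypothetical traceroute-derived paths (hop names are made up for illustration).
paths = {
    ("A", "B"): ["r1", "r2", "r3", "r4"],
    ("D", "C"): ["r5", "r2", "r3", "r6"],
    ("E", "F"): ["r7", "r3", "r8"],
    ("G", "H"): ["r1", "r9", "r4"],  # assumed clean path, used to clear hops
}
# Measured packet loss in percent per host-pair (G-H is an assumed extra pair).
loss_pct = {("A", "B"): 3.0, ("D", "C"): 2.0, ("E", "F"): 1.0, ("G", "H"): 0.0}


def suspect_hops(paths, loss_pct, threshold=0.5):
    """Return hops that appear on every lossy path and on no clean path."""
    lossy = [pair for pair, loss in loss_pct.items() if loss >= threshold]
    clean = [pair for pair, loss in loss_pct.items() if loss < threshold]
    if not lossy:
        return []
    # Count, for each hop, how many lossy paths traverse it.
    counts = Counter(hop for pair in lossy for hop in set(paths[pair]))
    # Hops seen on a clean path are assumed to be healthy.
    cleared = {hop for pair in clean for hop in paths[pair]}
    return sorted(
        hop for hop, n in counts.items() if n == len(lossy) and hop not in cleared
    )


print(suspect_hops(paths, loss_pct))  # ['r3']
```

With real data the paths rarely intersect this cleanly, so a ranking by the number of lossy paths through each hop may be more practical than a strict intersection.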


[Diagram labels: packet loss per host-pair: E-F 1%, D-C 2%, A-B 3%]


11 of 27

SAND Planning and Work Areas

  • Main goal of SAND is to create new analytics, visualizations and user-interfaces to extract value from the perfSONAR (and related) network metrics
  • Initial architecture: Data-pipeline to ELK stack, visualizations via Kibana, Grafana and perhaps other tools, analytics via Jupyter notebooks and creation of “architecture plugins” to leverage this framework.
    • Examples:
      1. Alarming dashboards that show Top-N problem links:
        • SRC-DEST with largest packet loss in last N hours
        • SRC-DEST with most routes in last N hours
        • SRC-DEST with largest change in measured throughput in last N hours
        • SRC with highest packet loss averaged over all DEST
        • DEST with highest packet loss averaged over all SRC
      2. Route correlation: Identify SRC-DEST pairs with similar behavior changes at a point in time and analyze common hops in their routes
      3. Alerting system based upon alarming and route work. Users subscribe to various alerts using SRC, DEST, packet loss, change in BW, etc.
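The Top-N rankings in example 1 could be computed along these lines. This is a minimal sketch, assuming the measurements for the last N hours have already been pulled into simple (SRC, DEST, loss) records; the site names, values, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records from the last N hours: (src, dest, packet loss in percent).
records = [
    ("siteA", "siteB", 3.0),
    ("siteA", "siteB", 1.0),
    ("siteA", "siteC", 0.0),
    ("siteD", "siteC", 2.5),
    ("siteE", "siteF", 0.1),
]


def mean_loss_per_link(records):
    """Average the packet loss of all measurements for each SRC-DEST pair."""
    samples = defaultdict(list)
    for src, dest, loss in records:
        samples[(src, dest)].append(loss)
    return {link: sum(vals) / len(vals) for link, vals in samples.items()}


def top_n_links(records, n=3):
    """SRC-DEST pairs with the largest mean packet loss."""
    means = mean_loss_per_link(records)
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:n]


def top_n_sources(records, n=3):
    """SRC hosts with the highest packet loss averaged over all DEST."""
    by_src = defaultdict(list)
    for (src, _dest), loss in mean_loss_per_link(records).items():
        by_src[src].append(loss)
    means = {src: sum(vals) / len(vals) for src, vals in by_src.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)[:n]


print(top_n_links(records, n=2))
print(top_n_sources(records, n=1))
```

Averaging per link before averaging per source keeps a heavily tested link from dominating its source's score; whether that weighting is the right one for the dashboards is a design choice, not something stated on the slide.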


12 of 27

SAND Planning and Work Areas (2)

We have more items on our list:

  • Dashboards for navigating the metrics
  • Network topology - cleaning, re-organizing, visualizing.
  • Engaging the broader NSF research community (CC* grant recipients)
  • Improving end users’ ability to find networking information
  • Transitioning from a “pull” data model to a secure “push” model

In the interest of time, I will only show a couple of things.


13 of 27

Challenge: Network Topology

Whenever we identify a possible network problem, the first question is: what path is being measured?

  • Knowing the path in place when a problem is identified is critical

It should be noted that having many paths continuously monitored is a very powerful tool for both identifying network issues and localizing them!

  • Gedanken experiment: at approximately the same time, 5 host-pairs show an increase in packet loss. What is the inference we can make by correlating their paths?
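The "approximately the same time" part of the Gedanken experiment can be sketched as a two-step process: group host-pairs whose loss increase started within a short window, then intersect the hops of the grouped pairs. Everything below (event times, hop names, the 300-second window) is a hypothetical illustration, not the SAND implementation.

```python
# Hypothetical loss-onset events: (unix time in seconds, src, dest).
events = [
    (1000, "A", "B"), (1030, "C", "D"), (1055, "E", "F"),
    (1100, "G", "H"), (1110, "I", "J"), (9000, "K", "L"),
]
# Hypothetical hop lists from the traceroute taken closest to each event.
routes = {
    ("A", "B"): ["a", "x1", "b"],
    ("C", "D"): ["c", "x1", "d"],
    ("E", "F"): ["e", "x1", "x2", "f"],
    ("G", "H"): ["g", "x2", "x1", "h"],
    ("I", "J"): ["i", "x1", "j"],
    ("K", "L"): ["k", "l"],
}


def group_by_onset(events, window_s=300):
    """Group host-pairs whose onset lies within window_s of the group's first event."""
    groups = []
    for t, src, dest in sorted(events):
        if groups and t - groups[-1][0] <= window_s:
            groups[-1][1].append((src, dest))
        else:
            groups.append((t, [(src, dest)]))
    return groups


def common_hops(pairs, routes):
    """Hops shared by the routes of all given host-pairs."""
    hop_sets = [set(routes[pair]) for pair in pairs]
    return set.intersection(*hop_sets) if hop_sets else set()


for start, pairs in group_by_onset(events):
    print(start, pairs, common_hops(pairs, routes))
```

In this made-up example the five pairs that degrade together share only hop x1, which is the inference the slide is hinting at; the isolated K-L event ends up in its own group.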

Fortunately, we are scheduling regular “traceroute” tests between our perfSONAR measurement end-points

We have students working on data cleaning and topology extraction.

New path visualisation tool being developed by MEPhI SAND collaborators

    • Still in Beta, but already provides very interesting views into perfSONAR traces
    • Video of next release available at https://yadi.sk/i/tyhiA-e3GGKqDQ


14 of 27

DEMO: Dashboards for Network Metrics


15 of 27

Finding Relevant Information

So far I have shown a few different links. Another thing the SAND team would like to improve is making it easier to find all the relevant tools, docs and data.

We have set up a web server at: https://toolkitinfo.opensciencegrid.org/toolkitinfo/


The goal is to continue to maintain and add to the various menus available, so that a broad range of users can easily find and access network data and analytics results.

We will be adding info on any future containerized perfSONAR, new topology capabilities and links to adding your site data to SAND.


16 of 27

SAND Summary

  • The SAND project is working to
    • Maintain an effective, efficient metrics pipeline
    • Provide an infrastructure to monitor our infrastructure and analyze various metrics
    • Extract new insights from measurements of our existing, complex global infrastructure.
  • The primary goal for SAND is to better extract “value” for our Scientists, Site and Network Administrators from the extensive network metrics OSG/WLCG is gathering.
  • We are looking for collaborators with an interest in any of the topics I covered. Contact us if you or your group are interested.

team@sand-ci.org

Part 2: RNTWG Slides


17 of 27

Acknowledgements

We would like to thank the WLCG, HEPiX, perfSONAR and OSG organizations for their work on the topics presented.

In addition we want to explicitly acknowledge the support of the National Science Foundation which supported this work via:

  • SAND: NSF OAC-1827116
  • IRIS-HEP: NSF OAC-1836650


18 of 27

SAND References


19 of 27

SAND Backup Slides


20 of 27

OSG/WLCG networking projects


There are 4 coupled projects around the core OSG Net Area

  1. SAND (NSF) project for analytics
  2. HEPiX NFV WG
  3. perfSONAR project
  4. WLCG Network Throughput WG


21 of 27

Available Data Overview

SAND and OSG/WLCG are gathering a number of potentially very useful metrics

  • perfSONAR data from over 260 instances all over the world
  • ESnet network traffic (SNMP counters)
  • WLCG data transfers (FTS)
  • LHCOPN data (from CERN networking)

This data is transported over message buses (RabbitMQ at OSG, ActiveMQ at CERN) and ends up in two different Elasticsearch instances (the University of Chicago analytics platform and the University of Nebraska).

This data could provide powerful insights into our R&E network infrastructure by using the temporal and spatial information we have available.


22 of 27

Some Context: IRIS-HEP

The Institute for Research and Innovation in Software in High Energy Physics (IRIS-HEP) project has been funded by the National Science Foundation in the US as grant OAC-1836650, starting 1 September 2018.

The institute focuses on preparing for High Luminosity (HL) LHC and is funded at $5M / year for 5 years. There are three primary development areas:

  • Innovative algorithms for data reconstruction and triggering;
  • Highly performant analysis systems that reduce “time-to-insight” and maximize the HL-LHC physics potential;
  • Data organization, management and access systems for the community’s upcoming Exabyte era.

The institute also funds the LHC part of Open Science Grid, including the networking area, and has created a new integration path (the Scalable Systems Laboratory) to deliver its R&D activities into the distributed and scientific production infrastructures. Website for more info: http://iris-hep.org/


23 of 27

perfSONAR Data Details

We are collecting a number of different types of data from perfSONAR which are sent to different “topics” on the RabbitMQ bus and put into their own index in Elasticsearch:

  • ps_alarms : These are generated alarms based on other ps indices
  • ps_meta : Tracks toolkit version, host info, various metadata
  • ps_owd : One-way Delay measurements from perfSONAR (latency)
  • ps_packet_loss : The percentage of packets lost in latency testing (10 Hz)
  • ps_retransmits : During throughput testing, tracks retransmits
  • ps_status : Tracks status of measurements (coverage, efficiency)
  • ps_throughput : Measures throughput via iperf
  • ps_trace : Measures the layer-3 network path via traceroute

You can explore the details via Kibana: https://atlas-kibana.mwt2.org/s/networking/app/kibana#/discover?_g=()
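For programmatic access, a query against one of these indices might be built as below. This is only a sketch: the field names (src_host, dest_host, packet_loss, timestamp), the loss units, and the helper name are assumptions for illustration, so check the actual index mapping in Kibana before using them. The body uses standard Elasticsearch Query DSL (bool/filter, term, range with date math).

```python
import json


def packet_loss_query(src_host, dest_host, hours=24, min_loss=0.02):
    """Build a query body for recent high-loss measurements between two hosts.

    All field names are hypothetical; min_loss must use the same units
    as the index's loss field.
    """
    return {
        "size": 100,
        "query": {
            "bool": {
                "filter": [
                    {"term": {"src_host": src_host}},
                    {"term": {"dest_host": dest_host}},
                    {"range": {"packet_loss": {"gte": min_loss}}},
                    # Date math: only documents from the last `hours` hours.
                    {"range": {"timestamp": {"gte": f"now-{hours}h"}}},
                ]
            }
        },
        "sort": [{"timestamp": {"order": "desc"}}],
    }


# Hypothetical host names; the printed JSON would be the body of a
# search request against the ps_packet_loss index.
body = packet_loss_query("ps1.example.org", "ps2.example.org", hours=6)
print(json.dumps(body, indent=2))
```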


24 of 27

SAND Collaboration Meeting Details

Our first in-person collaboration meeting was June 17-18, 2019 at U Chicago.


Main topic areas discussed day 1

  • Network pipeline
  • Monitoring tools
  • Containerizing perfSONAR
  • Engaging with and enabling a broader community
  • Topology and data cleaning

The second day was a “hackathon” where we worked on items from day 1.

The “Team”

Picture credit: Rob Gardner (that’s why he’s missing)


25 of 27

Visualizing NSF CC* Institutions

The NSF has had a very successful series of Campus Cyberinfrastructure (CC*) solicitations, and all require recipients to deploy perfSONAR.

SAND wants to make it easy for these sites to be seen by simply adding a ‘CCSTAR’ community to their perfSONAR toolkits: https://display.sand-ci.org/


Of course showing them on the map is just a first step

We then want to provide a very easy way for sites to “opt in” to SAND so that we can begin to gather their perfSONAR data and provide our analytics, alerting and monitoring for them.


26 of 27

SAND Activities to Date

Initial efforts targeted improving the network data pipeline from OSG

  • OSG was using an infrastructure called RSV to gather perfSONAR data
    • There were issues with reliability and latency
    • With help from the SAND project, a new collector was created that has much lower latency, more complete monitoring and is significantly more robust.
  • The network metrics being collected were only going to an ELK-based analytics platform in Chicago
    • We added a new “long-term” ELK destination in Nebraska
    • We also added tape backup of the data at FNAL (tested and successfully used this year!)
  • Initial planning for a new push-based (from each toolkit) model is ready
    • Planning to have a few instances running “push” based in August (after pS 4.3 is out)

We have also been working with the collected data and have identified challenges that we need to address to make it more useful

  • As part of this, we augmented the traceroute data with ASN info (see later details)


27 of 27

Issues with Traceroute and Network Paths

While we regularly try to measure the network paths between our hosts (and by proxy, between our sites), the traceroute tool has some limitations:

  • It sometimes doesn’t reach the destination
  • Hops along the way can fail to respond in time, leaving “holes” in the path
  • Trivial variations in traceroutes can lead to tens of thousands of distinct routes
  • The “route” it delivers can be false: https://www.cellstream.com/reference-reading/tipsandtricks/403-ecmp-linux-paristr


For all these reasons, we have challenges in trying to use our traceroute results to understand the network topology

The SAND project is planning to work on cleaning things up

  • We are trying to identify logical paths that group trivially varying physical paths, to simplify things
  • We need to identify when multiple links might exist at L2
  • We have added the “AS” number to the traceroute data to make it simpler to understand when a major route change happens.
  • We are working on ways to visualize, compare and understand our network paths
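One way the AS numbers can help is by collapsing each traceroute to its sequence of autonomous systems, so that missing hops and router-level variations inside one network map to the same logical path. The sketch below assumes each hop is an (IP, ASN) pair with None for a non-responding hop; the addresses and ASNs come from documentation ranges and the function names are hypothetical.

```python
from itertools import groupby


def as_path(hops):
    """Collapse a traceroute to its AS-level path.

    hops: list of (ip, asn) tuples; both are None for a hop that did not respond.
    Holes are skipped and consecutive hops in the same AS are merged.
    """
    asns = [asn for _ip, asn in hops if asn is not None]
    return tuple(asn for asn, _group in groupby(asns))


def is_major_change(hops_before, hops_after):
    """Treat a route change as major only if the AS-level path differs."""
    return as_path(hops_before) != as_path(hops_after)


trace1 = [("10.0.0.1", 64500), (None, None), ("192.0.2.1", 64501),
          ("192.0.2.9", 64501), ("198.51.100.1", 64502)]
# Same networks, but a different router in AS 64501 and a hole elsewhere.
trace2 = [("10.0.0.1", 64500), ("192.0.2.5", 64501), (None, None),
          ("198.51.100.1", 64502)]
# Traffic now crosses a different transit network.
trace3 = [("10.0.0.1", 64500), ("203.0.113.7", 64510), ("198.51.100.1", 64502)]

print(as_path(trace1))                  # (64500, 64501, 64502)
print(is_major_change(trace1, trace2))  # False
print(is_major_change(trace1, trace3))  # True
```

A caveat of this simplification is that a non-responding hop can hide an entire AS, so two paths that look identical at the AS level may still differ.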
