1 of 86

Engagement and Performance Operations Center (EPOC)

Support to Share and Collaborate

Dr. Jennifer M. Schopf, jmschopf@tacc.utexas.edu

Jason Zurawski, zurawski@es.net

Forough Ghahramani, forough.ghahramani@njedge.net

Douglas M. Jennewein, Douglas.Jennewein@asu.edu

Nathaniel Mendoza, nmendoza@tacc.utexas.edu

National Science Foundation Award #2328479

2 of 86

Outline

  • 5-7 mins - Jen or Jason – Introduce panelists, very high level overview of EPOC as a whole, and what it means to be an EPOC partner
  • 10 mins - Forough on NJ Edge and training, and what she thinks NJ Edge gets from being an EPOC Partner
  • 10 mins - Jen – NetSage now and next, the move to netsage.io, work with ACCESS
  • 10 mins – Doug – Deep Dives and Cryo EM Work, and what Sun Corridor gets from being a partner
  • 10 mins – maybe Nathaniel – Arecibo stuff
  • 10 mins – JZ – Baseline performance testing and DME
  • 15 mins - questions – but let's think of some we might want to answer ahead of time.


3 of 86

Our Panelists

  • Dr. Jennifer M. Schopf, Director Networking Partnerships, Texas Advanced Computing Center (TACC)
  • Jason Zurawski, Science Engagement Engineer, ESnet
  • Forough Ghahramani, Associate Vice President for Research and Innovation, NJ Edge
  • Douglas M. Jennewein, Sr. Director, Research Technology Office, Arizona State University


4 of 86

What we’re going to talk about

  • High level EPOC Overview - Jen
  • Training and work with NJ Edge - Forough
  • Fasterdata DTN Framework - Jason
  • Deep Dives and Cryo EM Work - Doug
  • NetSage now and next - Jen
  • Arecibo - Jason
  • Questions?


5 of 86

Engagement and Performance Operations Center (EPOC)

  • Joint project between TACC and ESnet (LBNL/DOE)
    • $3.5M 5 year project, current end date fall 2024
  • Part of NSF CC* program for domestic science support
  • Partnerships with regional, infrastructure, and science communities that span the NSF and DOE continuum of funding
  • Promoting better data sharing in all ways


6 of 86

EPOC Core Mission

  • Understanding and supporting science use cases
  • “Smallest difference for the biggest change”
  • Campus, regional, national, and int’l support
  • Debugging any and all network complications via established measurement and monitoring infrastructure
  • Data movement at all layers of the ecosystem
    • Software, hardware, network, people


7 of 86

Current Regional Partners (13)

  • Front Range GigaPop (FRGP)
  • Great Plains Network (GPN)
  • iLight
  • KINBER
  • Lonestar Education and Research Network (LEARN)
  • NJEdge
  • NOAA N-wave
  • NYSERNet
  • Ohio Academic Resources Network (OARnet)
  • Pacific Northwest GigaPop (PNWGP)
  • Southern Crossroads (SoX)
  • Sun Corridor Network (SCN)
  • Texas Advanced Computing Center (TACC)


8 of 86

EPOC Five Focus Areas

  1. Roadside Assistance and Consulting
  2. Application Deep Dives
  3. Network Analysis (NetSage)
  4. Data Mobility Exhibition/Baseline Testing
  5. Training


9 of 86

Roadside Assistance

  • “This file transfer worked last week, but it doesn’t anymore?”
    • Think of this like a flat tire, crash repair
    • Anyone can submit
  • Contact epoc@tacc.utexas.edu
    • Within 24 hours, gets triaged
    • Some initial investigation to verify issues
    • A Case Manager and Lead Engineer are assigned
    • Shareable infrastructure set up
  • Centralization of Researcher Assistance


10 of 86

Roadside Assistance - Consulting

  • EPOC is an “Ask Me Anything” help desk
  • Often simpler questions:
    • Suggestions for data architecture choices
      • DTNs, DMZs, firewalls
    • Data projections for science fields
    • Expected (real) performance between two sites
    • Advice on how to conduct a performance assessment
    • Or others!
  • Same operations center approach, aim for 1 business day turnaround for first response

Over 300 RA/C cases to date


11 of 86

Roadside Assistance is not “normal network engineering problem solving”

  • We don’t own any of the resources having problems
    • We coordinate with the resource owners and the other networks/systems people
  • We can’t always run the tests ourselves
    • Must be a collaborative effort

Best technical choice often isn’t an option

We try to make the smallest change for the biggest difference within the constraints we’re handed


12 of 86

Deep Dives

  • Think of this as regular maintenance, an oil change, or planning to buy a car
  • Based on ESnet facility requirements reviews
    • Walk through the science workflow with the actual scientists
    • A way to understand needs and planning
  • Often identifies issues that have nothing to do with networks, and everything to do with sociology


13 of 86

Deep Dive Overview

  • Formal mechanism via structured conversations to determine shared understanding of CI needs
  • Bring together a cross section of campus
    • Network users (researchers)
    • Administrators
    • Technology providers
  • Try to find common problems and paths forward
  • In-person component a significant value add


14 of 86

Deep Dive: Face-to-Face Discussion

  • Bring together researchers, IT staff, research admin
  • Create a shared vision to go forward
  • Share information for strategic programs, initiatives
  • Guide organizational strategy
  • Build relationships with constituents
  • Identify and resolve network-related issues, existing or anticipated


15 of 86


16 of 86

We Walk Through Scientific Components…

  1. Background information
    • Brief overview of the facility, nature of the science being performed
  2. Collaborators
    • Identify people and institutions that a science group interacts with
  3. Instrumentation
    • Local and remote scientific instruments and facilities.
  4. Process of Science
    • Explain ‘a day in the life’ of the science group
    • Should tie together the instruments, the people, and the resources


17 of 86

And Also More Technical Aspects…

  5. Software Infrastructure
  6. Network and Data Architecture
  7. Cloud Services
  8. Outstanding Issues and Pain Points

Local and regional IT staff are critical sources for this information, and the process helps form valuable partnerships that may not yet exist or that could use strengthening


18 of 86

Deep Dive: Outputs

  • Identify and analyze technical gaps/bottlenecks or opportunities
  • Forecast technology/network capacity needs, particularly in regions where a site is anticipating increases or decreases in data load
  • Help inform investments in network improvements, bandwidth needs, or other application services
  • Create long-term relationships with researchers, IT staff, and administration to provide ongoing consultation and support


19 of 86

Deep Dives So Far

  • Twenty Deep Dive reports available at https://epoc.global/materials
  • Several more in various stages of progress
  • Reports include a 3-5 page executive summary with action items, used for:
    • Strategic report input
    • Funding request justification
    • Structural planning


20 of 86

Monitoring using NetSage

  • NetSage advanced measurement services for R&E traffic

    • Better understanding of current traffic patterns across instrumented circuits
    • Better understanding of large flow sources/sinks
    • Performance information for data transfers
  • Started as a collaboration between Indiana University, LBNL, and the University of Hawaii Manoa
  • Development now at TACC
    • Backend support/deployments at both TACC and IU
  • Last year: 3,000+ unique users


21 of 86

NetSage Data Sources

  • SNMP data
    • Basic bandwidth data
  • Flow data from routers
    • De-identified, only flows over 10M
  • CAIDA WhoIs data for org names, geo, etc
  • Science Registry to map IP ranges to projects, science discipline
    • Crowd sourced
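As a concrete illustration of how a crowd-sourced Science Registry mapping might be applied to a de-identified flow record, here is a minimal Python sketch; the registry entries, prefixes, and field names are hypothetical and do not reflect NetSage's actual schema:

```python
import ipaddress

# Hypothetical crowd-sourced registry entries: IP prefix -> project / discipline
SCIENCE_REGISTRY = {
    "192.0.2.0/24": {"project": "Example Sky Survey", "discipline": "Astronomy"},
    "198.51.100.0/24": {"project": "Example Genome Archive", "discipline": "Biology"},
}

def tag_flow(flow: dict) -> dict:
    """Attach project/discipline metadata to a de-identified flow record."""
    src = ipaddress.ip_address(flow["src_ip"])
    for prefix, meta in SCIENCE_REGISTRY.items():
        if src in ipaddress.ip_network(prefix):
            return {**flow, **meta}
    return {**flow, "project": "Unknown", "discipline": "Unknown"}

# Example: a flow whose (de-identified) source falls in the first registry prefix
print(tag_flow({"src_ip": "192.0.2.0", "dst_ip": "203.0.113.0", "bytes": 42_000_000_000}))
```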


22 of 86

Data Mobility Exhibition (DME) / Baseline Performance Testing

  • One terabyte of data in an hour
    • Equivalent to 2.22 Gb/s average
    • Achievable for institutions connected at 10G
  • How to find out?
    • The DME has known good endpoints to test against
    • A variety of data sizes you can transfer
    • Standard Globus setup
  • And if you can’t?
    • Work with EPOC to find the bottlenecks!


23 of 86

EPOC Supports:

  • Known good endpoints you can test against
    • One will also be deployed at TACC shortly
  • Help to get your baseline test numbers
  • Help to improve those numbers as requested


24 of 86

Training

  • Follow-on to OIN (http://oinworkshop.com)
    • Reached over 750 people in 3 years
  • Hands-on perfSONAR sessions
    • Especially for small nodes; includes file transfer tests
  • “How to talk to Scientists”
  • Science DMZ setup
  • Data Transfer Node (DTN) setup


25 of 86

Joint work with the University of South Carolina

  • Two-day online hands-on workshops
    • Often joint with a regional network
    • Introduction to tools and techniques for network design, implementation, and monitoring
    • Lab exercises on “pods” emulating networks and tools
  • Topics
    • Network tools and architecture
    • Use of perfSONAR
    • BGP attributes and configuration

http://ce.sc.edu/cyberinfra/workshop_2022.html


26 of 86

Any questions on EPOC?

  1. Roadside Assistance and Consulting
  2. Application Deep Dives
  3. Network Analysis (NetSage)
  4. Data Mobility Exhibition/Baseline Testing
  5. Training


27 of 86

What we’re going to talk about

  • High level EPOC Overview - Jen
  • Training and work with NJ Edge - Forough
  • Fasterdata DTN Framework - Jason
  • Deep Dives and Cryo EM Work - Doug
  • NetSage now and next - Jen
  • Arecibo - Jason
  • Questions?


28 of 86

New Jersey Edge - EPOC Partnership

  • Collaboration on NSF proposal
  • EPOC resources and training
  • Network analysis enabled by the NetSage monitoring suite
  • Assistance from EPOC to gain a better understanding of researcher needs based on data movement


29 of 86

Cyberinfrastructure for Research Data Management Workshop

  • Hosted by Princeton University, Edge, Globus, and EPOC (TACC and ESnet co-PIs)
  • Date/Location: May 23-24, Princeton University campus
  • Workshop designed to help research computing professionals deploy next-gen cyberinfrastructure that can effectively support data-intensive science
    • Topics include: the Science DMZ, Data Management using Globus, perfSONAR network measurement, Netsage, and other affiliated Research & Education common best practices
  • Target attendees: anyone supporting infrastructure and/or developing applications for research and education


30 of 86

What we’re going to talk about

  • High level EPOC Overview - Jen
  • Training and work with NJ Edge - Forough
  • Fasterdata DTN Framework - Jason
  • Deep Dives and Cryo EM Work - Doug
  • NetSage now and next - Jen
  • Arecibo - Jason
  • Questions?


31 of 86

Motivation/Background

  • Networks are an essential part of data-intensive science
      • Connect data sources to data analysis
      • Connect collaborators to each other
  • Performance is critical, but often overlooked
      • Exponential data growth, and new data sources are not always anticipated
      • Human factors are a constant
      • Data movement and data analysis must keep up
  • Effective use of wide area (long-haul) networks by scientists has historically been difficult
  • Different IT groups manage various components of research workflow


32 of 86

Fasterdata DTN Framework

  • ESnet has worked with the R&E community for 15+ Years on specific aspects of data mobility:
    • Science DMZ Network Architecture
    • Network test and measurement: perfSONAR, NetSage, Stardust
    • Data Transfer Hardware Design: https://fasterdata.es.net/DTN/reference-implementation/
    • Championing Efficient Applications: Globus
    • The Data Mobility Exhibition: Ensuring that a 10G connected DTN is capable of 1TB/hour, 2.22Gb/s disk-to-disk performance
  • The next evolution of this work is the “Fasterdata DTN Framework”


33 of 86

Science DMZ Design Pattern (Abstract)


34 of 86

Data Transfer Rates by Audience

A benchmark table is provided to gauge data architecture performance, which can vary depending on the number of files and folders, file sizes, the distance between sites, CI performance (network, server, disk/filesystem), and the data transfer tool.


Each tier below lists the host (Host Transfer Rates), the Data Transfer Rate/Volume seen by the researcher, the Network Transfer Rate seen by the network admin, and the Storage Transfer Rate seen by the sys/storage admin:

  • 1 TB/hr (minimum): 10G-capable DTN; 2.22 Gb/s network rate; 277.78 MB/s storage rate
  • 2 TB/hr: 10G-capable DTN; 4.44 Gb/s network rate; 555.54 MB/s storage rate
  • 3 TB/hr (½ PetaScale): 10G-capable DTN; 6.67 Gb/s network rate; 833.33 MB/s storage rate
  • 5.95 TB/hr (PetaScale: 1 PB/wk): 10xG, 25G, 40G, or 100G DTNs; 13.23 Gb/s network rate; 1.65 GB/s storage rate
  • 41.67 TB/hr (PetaScale: 1 PB/day): 10xG, 25G, 40G, or 100G DTNs; 92.59 Gb/s network rate; 11.57 GB/s storage rate
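The rows above are direct unit conversions of the researcher-facing TB/hr figures; a small sketch (assuming decimal units, 1 TB = 10^12 bytes and 1 Gb = 10^9 bits) that reproduces them:

```python
# Unit-conversion sketch reproducing the benchmark tiers above (decimal units assumed).

def rates_from_tb_per_hour(tb_per_hr: float) -> tuple[float, float]:
    """Return (network rate in Gb/s, storage rate in MB/s) for a TB/hr goal."""
    bytes_per_sec = tb_per_hr * 1e12 / 3600      # disk-to-disk byte rate
    gbits_per_sec = bytes_per_sec * 8 / 1e9      # what the network admin sees
    mbytes_per_sec = bytes_per_sec / 1e6         # what the storage admin sees
    return gbits_per_sec, mbytes_per_sec

for tb_hr in (1, 2, 3, 5.95, 41.67):
    gbps, mbps = rates_from_tb_per_hour(tb_hr)
    print(f"{tb_hr:6.2f} TB/hr  ->  {gbps:6.2f} Gb/s  /  {mbps:9.2f} MB/s")
```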

35 of 86

DTN Framework

  • Our new goals are (a quick check of these rates is sketched below):
    • Major facilities (e.g., HPC centers, instrumentation labs) should reach 2 PB of data transfer a day (i.e., sustained performance of ~200 Gbps)
    • User/collaboration sites (e.g., universities, laboratories) should reach 4 PB of data transfer a week (i.e., sustained performance of ~50 Gbps)
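A minimal sanity check of those sustained-rate figures (assuming decimal units, 1 PB = 10^15 bytes, and round-the-clock utilization):

```python
# Back-of-the-envelope check of the DTN Framework goals above.

def sustained_gbps(petabytes: float, seconds: float) -> float:
    """Sustained rate in Gb/s needed to move the given volume in the given time."""
    return petabytes * 1e15 * 8 / seconds / 1e9

print(f"2 PB/day  -> {sustained_gbps(2, 86_400):.0f} Gb/s (roughly the ~200 Gbps goal)")
print(f"4 PB/week -> {sustained_gbps(4, 7 * 86_400):.0f} Gb/s (roughly the ~50 Gbps goal)")
```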


36 of 86

Old Example: PetaScale DTN - before


37 of 86

Old Example: PetaScale DTN - after


38 of 86

Next Steps

  • Building a List of Participants for each phase:
    • DOE Laboratories and Facilities,
    • Universities
    • Regional Networks
  • Structured testing to evaluate current capabilities + improvements
  • Report out at major community events
  • For those that want to participate (or want 1:1 assistance), the Engagement and Performance Operations Center (EPOC) is available: epoc@tacc.utexas.edu


39 of 86

What we’re going to talk about

  • High level EPOC Overview - Jen
  • Training and work with NJ Edge - Forough
  • Fasterdata DTN Framework - Jason
  • Deep Dives and Cryo EM Work - Doug
  • NetSage now and next - Jen
  • Arecibo - Jason
  • Questions?


40 of 86

ASU Campus Deep Dive

  • First meeting 11/11/2019
  • Science drivers include Cryo-EM and Social Media Observatory
  • Campus deep dive planned for Feb 2020
  • But…space/logistics challenges
  • “Let's just do it in… April 2020.”
  • What could possibly go wrong?


41 of 86


42 of 86

January 2022: Virtual Deep Dive

ASU Research Engagement, System Administrators, Research Data Management Staff, and leadership.

EPOC engineers and engagement staff


43 of 86

ASU Cryo EM Core Facility


44 of 86

Social Media Observatory


New forms of political participation emerging on social media platforms

The related challenges of collecting, analyzing, and preserving data from social media platforms

45 of 86

Findings


  • “A number of research use cases have noted that they will require access to more storage capacity (e.g. multiple TBs, approaching PB).”
  • “It is recommended that technology service requests be integrated with the pre-award support office, to better identify technology needs and use cases at the time of application.”
  • “It is recommended that ASU increase the ASU support team size, services offered, and available documentation.”
  • “Data mobility infrastructure should be upgraded to integrate other Globus endpoints, as well as adopting approaches to portal applications for some users.”

46 of 86

What we’re going to talk about

  • High level EPOC Overview - Jen
  • Training and work with NJ Edge - Forough
  • Fasterdata DTN Framework - Jason
  • Deep Dives and Cryo EM Work - Doug
  • NetSage now and next - Jen
  • Arecibo - Jason
  • Questions?


47 of 86

Monitoring using NetSage

  • NetSage advanced measurement services for R&E traffic
    • Better understanding of current traffic patterns across instrumented circuits
    • Better understanding of large flow sources/sinks
    • Performance information for data transfers
  • Started as a collaboration between Indiana University, LBNL, and the University of Hawaii Manoa
  • Now all development at TACC
    • Backend support/Deployments at both TACC and IU
  • 2021: 2,500+ unique users in 85+ different countries


48 of 86

NetSage Data Sources

  • SNMP data (Passive)
    • Basic bandwidth data
  • perfSONAR (Active)
    • Active tests between sites
  • Flow data from routers (Passive)
    • Only de-identified data collected by NetSage
  • Tstat-based traffic analysis for archives (Passive)
    • TCP flow statistics: congestion window size, number of packets retransmitted, etc
    • Also de-identified before stored
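To make the Tstat-style statistics concrete, here is a small hypothetical example of turning per-flow counters into a health metric; the field names are assumptions, not the actual Tstat or NetSage schema:

```python
# Hypothetical use of Tstat-style per-flow TCP statistics; field names are illustrative.

def retransmit_pct(flow: dict) -> float:
    """Percentage of data packets that were retransmitted in one flow."""
    sent = flow["packets_sent"]
    return 100.0 * flow["packets_retransmitted"] / sent if sent else 0.0

flow = {"packets_sent": 1_000_000, "packets_retransmitted": 2_500, "max_cwnd_bytes": 4_194_304}
print(f"retransmissions: {retransmit_pct(flow):.2f}% of packets")  # -> 0.25%
```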


49 of 86

NetSage Ingest

Ingest Pipeline

50 of 86

Flow Data collection

  • Flow data is redirected to a collection point, de-identified, and then sent to the NetSage archive
  • Collection point options
    • IU collection point - BEING DISCONTINUED
    • Docker container on resources at the site’s institution
      • https://netsage-project.github.io/netsage-pipeline/docs/deploy/docker_install_simple
  • The Docker container runs as a service on an existing server
    • Linux or MacOS
    • Can be anywhere your router can reach over regular IP routing
    • If you choose this option, you have to do the updates, not us


51 of 86

NetSage Privacy

  • NetSage is committed to privacy and to preemptively addressing any security or data-sharing concerns
    • No Personally Identifiable Information (PII) is collected
    • The last octet is removed from each IP address
    • Only data on flows over 10M is kept for circuits (1M for archives); see the sketch below
  • Data Privacy Policy
    • https://drive.google.com/file/d/19ljTq4xztalXyz5DhTfyjUMO1q4s0mYR/view
  • Prototypes are behind a password until we’re told to make it public
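A minimal sketch of the two de-identification steps described above, assuming IPv4 addresses and interpreting "10M" as a 10 MB flow-size threshold; this is illustrative only, not the actual NetSage pipeline code:

```python
# Illustrative privacy steps: drop the last IPv4 octet and discard small flows.
# Threshold interpretation (10 MB) and field names are assumptions.

FLOW_THRESHOLD_BYTES = 10_000_000   # "flows over 10M" for circuits (1M for archives)

def deidentify_ip(ip: str) -> str:
    """Drop the last octet of an IPv4 address, e.g. 192.0.2.77 -> 192.0.2.0."""
    octets = ip.split(".")
    return ".".join(octets[:3] + ["0"])

def keep_flow(flow: dict) -> bool:
    """Keep only flows over the size threshold, with de-identified endpoints."""
    if flow["bytes"] <= FLOW_THRESHOLD_BYTES:
        return False
    flow["src_ip"] = deidentify_ip(flow["src_ip"])
    flow["dst_ip"] = deidentify_ip(flow["dst_ip"])
    return True
```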


52 of 86

NetSage - Built around answering questions

  • Answers questions asked by network engineers, network owners, and end-users
  • Human-readable summaries and patterns
  • Big picture overview helps highlight trends and events that can make in-depth analysis of local data more fruitful


53 of 86

Built around answering questions:


Interesting pattern. What does it mean?

Singapore to Taiwan via LA?

Why so slow?

54 of 86

NetSage Focus on Use Cases and Questions

  • Flow Data Dashboards
    • What are the top sites using my circuits?
    • What are the top sources/destinations for an organization?
    • Who’s using my archive?
  • Debugging dashboards
    • What are the flows like between these two orgs?
    • There was a performance spike on my circuit – what was it?
    • Who’s transferring a lot of data really slowly?
  • If SNMP data is available, then the Bandwidth Dashboard:
    • How much are the links used?
    • Where are congestion points?
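As an illustration, the "top pairs" style questions above boil down to a simple aggregation over de-identified flow records; the record fields below are hypothetical, not the actual NetSage data model:

```python
# Hypothetical aggregation behind a "top source/destination pairs" panel:
# group flow records by organization pair and rank by total volume.
from collections import defaultdict

flows = [
    {"src_org": "Univ A", "dst_org": "Lab B", "bytes": 3_200_000_000_000},
    {"src_org": "Univ A", "dst_org": "Lab B", "bytes": 1_100_000_000_000},
    {"src_org": "Univ C", "dst_org": "Archive D", "bytes": 800_000_000_000},
]

totals = defaultdict(int)
for f in flows:
    totals[(f["src_org"], f["dst_org"])] += f["bytes"]

for (src, dst), nbytes in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{src} -> {dst}: {nbytes / 1e12:.2f} TB")
```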


55 of 86

Dashboards Supported by EPOC

  • ALL Dashboards are public


56 of 86

Sample NetSage - FRGP


57 of 86

FRGP Front Page: https://frgp.netsage.io


Pick the question to answer

Change the timeframe

58 of 86


59 of 86

Top Pairs as a table


60 of 86


61 of 86


62 of 86


63 of 86

Protocols and Ports


64 of 86

Everything we know about one of the flows


65 of 86

Sun Corridor - Geo traffic


66 of 86


67 of 86


68 of 86

Coming Soon for NetSage

  • We’ll be adding some deployments
    • ACCESS Resource Providers
      • PSC, SDSC, NCSA, etc
    • EPOC Partners
      • NJ Edge, NCAR
  • What use cases would you like to see?


69 of 86

What we’re going to talk about

  • High level EPOC Overview - Jen
  • Training and work with NJ Edge - Forough
  • Fasterdata DTN Framework - Jason
  • Deep Dives and Cryo EM Work - Doug
  • NetSage now and next - Jen
  • Arecibo - Jason
  • Questions?


70 of 86

Arecibo - Brief History of (Planned) Challenges

  • 1963 - Arecibo in Operation (https://www.naic.edu/ao/legacy-discoveries)
  • 2006 - NSF 15% Budget cut across Astronomy
  • 2007 - Arecibo budget cut from $10.5M to $8M
    • (NASA adds ~$2.6M to help the operations budget)
  • 2015 - Facilities director Kerr quits due to funding clashes
  • 2018 - University of Central Florida takes on stewardship


71 of 86

Unplanned Challenges (1)

  • Hurricane Maria (Sept 20, 2017)
    • Category 4
    • Significant damage to the facility
    • R&E (I2) network connection lost and not replaced
      • Too expensive to repair due to budget cuts
  • Core operations resumed when the power came back...


72 of 86

Unplanned Challenges (2)

  • Magnitude 5.0-6.4 earthquakes (January 7-11, 2020)
  • Operations still going
    • Shaken, not stirred


73 of 86

Unplanned Challenges (3)

  • July 30, 2020
    • Tropical Storm Isaias
    • SCIENCE vs operations means that budget choices are made
  • Aug 10, 2020
    • First [auxiliary] cable snap
  • Nov 6, 2020
    • Second cable [primary] snaps


74 of 86

December 1, 2020 - Resulting Facility Profile


Data Center Building 1.

75 of 86

Need to Move 2+ PB of Golden Copy Data

  • Option 1: Move data to the cloud
    • Est time to transfer: 42 years
    • Est cost: Millions for downloads

  • Option 2: Move data to partner site at UCF
    • Partner site says they can’t support this option

  • Option 3: TACC steps up to save science
    • Workflow developed
    • 9 months later, all data from spinning disk was at TACC
    • …but the tapes…


76 of 86

Workflow to move data on spinning media


77 of 86

MISSION: Get the data to a safe & stable location

  • 2 Petabytes of Golden Copy need to move
      • Fiber path from the facility is operating at 1Gbps - upgrade not feasible
      • Transfers in MB/s with PB to go
      • Data movement at a scale of years (see the estimate sketched below)
  • Proposed solution: Network Attached Storage (NAS) “appliances” used with sneakernet
      • 100+ Terabytes at a time (full capacity of the NAS device)
      • Onsite team hand-carries the NAS to the closest 100Gbps links (e.g., the IRNC-funded AMPATH)
        • University of Puerto Rico - Mayaguez
        • Engine-4 Commercial Collaboration Space

Operations are still ongoing
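The "scale of years" point follows from simple arithmetic; a rough sketch, assuming 2 PB to move, steady-state throughput, and illustrative transfer rates:

```python
# Back-of-the-envelope transfer-time estimate (assumes 2 PB = 2 x 10^15 bytes,
# decimal units, and steady throughput with no protocol or disk overhead).

def transfer_days(petabytes: float, rate_mbytes_per_sec: float) -> float:
    return petabytes * 1e15 / (rate_mbytes_per_sec * 1e6) / 86_400

# Even a fully utilized 1 Gbps path (~125 MB/s) needs months for 2 PB;
# at the MB/s-scale rates actually seen, the move stretches into years.
for rate in (125, 30, 10):
    print(f"{rate:4d} MB/s -> {transfer_days(2, rate):7.1f} days")
```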


78 of 86

Call for assistance:

  • A team forms to see what can be done to get the data moving to a safe, stable state:
    • EPOC - Architect solutions to implement
    • Globus - application to handle data migration
    • UPR and Engine-4 - Conduits to 100Gbps link(s)
    • UCF - Facility management
    • TACC - long-term storage location


79 of 86

EPOC - Define the problem and approach

  • Understand Problem set:
    • Global PANDEMIC - travel was not an option
    • We can’t just ship hardware
    • Closest 100Gbps link an hour drive away
      • University of Puerto Rico and Engine-4
    • Network Attached Storage devices (movable storage appliances): small but mighty solutions that can be tuned


80 of 86

TACC - Offered up resources to save the science

  • Offered to catch and distribute the data
    • 2+ Petabyte landing zone
    • 100Gbps link analysis and monitoring
      • NetSage and dashboarding
    • Globus tools and DTNs
    • Team to help develop portals and analysis pipelines


81 of 86

Move that data! - Globus


82 of 86

But what about the tapes?


83 of 86

6,600+ Tapes


84 of 86

TACC on-site @ AO Next Week(!)

  • Working to catalog the entire tape room (manually recording each tape and its metadata)
  • Physically packing them for shipment to Austin
  • Developing a manual workflow to:
    • read tapes to block storage
    • attach metadata
    • migrate to portal system


85 of 86

What we’re going to talk about

  • High level EPOC Overview - Jen or Jason
  • Training and work with NJ Edge - Forough
  • NetSage now and next - Jen
  • Deep Dives and Cryo EM Work - Doug
  • Arecibo - Nathaniel
  • Fasterdata DTN Framework - Jason
  • Questions?


86 of 86

Acknowledgements

  • EPOC is funded by
    • US NSF award #1826994 through 2023
    • US NSF award #2328479
