The HORUS Project First Webinar
Shawn McKee for the HORUS Collaboration
Merit Network, University of Michigan, Wayne State University, Michigan State University
HORUS Webinar
December 13, 2022
HORUS: Helping Our Researchers Upgrade their Science
Introduction
Today we want to present and discuss the newly funded HORUS Project and get feedback from our user community.
We will provide an overview of the project as proposed, outline some near-term plans, and seek feedback from our collaborators about what we present and intend to build.
The NSF project page has an abstract describing HORUS; in brief, we are augmenting the existing storage from the previous OSiRIS project with a heterogeneous set of computing resources to create new capabilities for a wide range of science domains, especially at under-served institutions.
Please use/add-to the online notes during the webinar
Acronym Definitions
OSiRIS: Open Storage Research Infrastructure, NSF 5-year project, 2016-2021
HORUS: Helping Our Researchers Upgrade their Science, NSF 2-year project, 2022-2024
CE: An OSG Compute Entrypoint (CE) is the door for remote organizations to submit requests to temporarily allocate local compute resources.
NSF: National Science Foundation, which funds many types of research projects.
RHEL: Red Hat Enterprise Linux, the de facto enterprise Linux operating system.
OSG: Originally stood for Open Science Grid; now simply "OSG"
Ceph: A popular storage infrastructure uniquely delivering object, block, and file storage in one unified system.
PATh: Partnerships to Advance Throughput Computing, NSF funded project
GPU: Graphics Processing Unit, dedicated hardware for rapid image processing, but now widely used for AI (artificial intelligence) / ML (machine learning).
Agenda for Today
During this hour we want to cover the following topics:
The HORUS Project in a Nutshell
The HORUS project was conceived as a means to add a heterogeneous set of computing resources to augment the existing storage provided by the previously funded NSF OSiRIS grant (more information on OSiRIS coming up).
We proposed three distinct computing server configurations (GPU, large-memory, and standard compute nodes) to match the wide range of needs described by our science collaborators.
HORUS intends to provide integrated, easy access to the existing OSiRIS storage so our collaborators can stage and process their datasets on HORUS computing systems.
HORUS is also collaborating with the OSG / PATh projects to make 20% of our capacity available to users of the broader national infrastructure.
HORUS is funded for 2 years but we intend to operate the project for at least 4 years.
OSiRIS - The Precursor to HORUS
The OSiRIS proposal targeted the creation of a distributed storage infrastructure, built with inexpensive commercial off-the-shelf (COTS) hardware, that combines the Ceph storage system with software-defined networking to deliver a scalable platform supporting multi-institutional science.
Current status: a single Ceph cluster (Octopus 15.2.17) spanning U-M, WSU, and MSU, with 1500 OSDs / 13.7 PiB of raw storage
Building Upon OSiRIS
The OSiRIS project was funded in 2016 (a 5-year award) to create a software-defined storage infrastructure supporting the scientific storage needs of multiple diverse science domains.
The project ended in 2021 but has continued to be maintained on a best-effort basis.
The OSiRIS proposal targeted using data "in place" from well-connected locations across our member institutions, but that turned out not to be the primary way our users leveraged OSiRIS.
One request that some of our OSiRIS users raised was for local computing resources able to process the data stored and shared in OSiRIS.
OSiRIS created significant capabilities that allow scientific collaborations to register users with their institutional identities and self-manage access.
The Project Team
The HORUS project team incorporates the previous core OSiRIS team and new involvement of Merit Network.
The PI and Co-PIs: Shawn McKee / University of Michigan (PI), Bob Stovall / Merit Network (Co-PI), Rob Thompson / Wayne State University (Co-PI)
Senior Personnel: Andy Keen, Brian O’Shea / Michigan State University, Michael Thompson / Wayne State University
Project Team Members: Nick Grundler, Muhammad Akhdhor, Wendy Dronen / University of Michigan, Aragorn Steiger, Matt Lessins, Patrick Gossman / Wayne State University, Pierrette Renée Dagg / Merit Network.
HORUS Equipment Status
We have already ordered our first-year equipment based upon the specs shown below, and most of it is already racked at UM, MSU, and Wayne State.
Users will not need to worry about the physical location of these servers since jobs will be scheduled across all possible servers based upon job requirements.
The HORUS team is just beginning to test provisioning, server configuration, OSiRIS integration and user interfaces.
HORUS Systems | Model | Mem (GB)/Host | CPU | CPU Cnt | GPU | GPU Mem | GPU/Host | NICs | Yr1 Host Cnt | HT Job Slots (total) | Mem (GB)/Slot
GPU Node | Dell R750xa | 512 | Xeon Gold 6334, 8C/16T | 2 | A100 | 80 GB | 4 | 2x25G | 2 | 64 | 16.0
Large Memory Node | Dell R6525 | 1024 | AMD EPYC 7F72, 3.2 GHz, 24C/48T | 2 | - | - | - | 2x25G | 6 | 576 | 10.7
Compute Node | Dell R6525 | 512 | AMD EPYC 7H12, 2.60 GHz, 64C/128T | 2 | - | - | - | 2x25G | 6 | 1536 | 2.0
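For reference, the last two columns are derived from the hardware counts, assuming hyper-threaded job slots. For example, for the GPU nodes:
HT Job Slots (total) = CPU Cnt × threads per CPU × Yr1 Host Cnt = 2 × 16 × 2 = 64
Mem (GB)/Slot = (Mem (GB)/Host × Yr1 Host Cnt) ÷ HT Job Slots = (512 × 2) ÷ 64 = 16.0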
HORUS Networking
We will leverage the excellent networking already put in place for OSiRIS and research at our institutions.
The network infrastructure shown on the right was created by a combination of OSiRIS and a CC* Grant received by Michigan State University
Merit Network - Michigan’s Research and Education Provider
Resilient optical and Layer 3 network within Michigan
100 Gbps connections to Internet2 in Chicago and Toledo
100 Gbps connections to Internet service providers in Grand Rapids, Southfield and Detroit
Connected to Michigan’s public universities via optical transport and routing infrastructure
HORUS Building Blocks / Open Source Components
The planned HORUS software architecture is built upon a number of open-source tools and applications, grouped below by the role they play:
Expected HORUS User Interface
The HORUS user interface is one of the main areas the project team is working on.
We don't yet have the final details, but we can describe the general ideas:
We will use CILogon to support users logging into HORUS with their own, existing institutional credentials.
Users will have their own assigned project space, potentially shared with other users from the same project, and we intend to provide Globus Online and scp access for transferring data into and out of that space (a rough sketch follows below).
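The exact transfer workflow is still being designed, but as a rough sketch (the login hostname and project path below are placeholders, not final names), staging a dataset into a project space over scp could look something like this:

```python
# Hypothetical sketch only: stage data into a HORUS project space via scp.
# "horus-login.example.org" and the project path are placeholders, not
# actual HORUS hostnames or paths.
import subprocess

LOGIN_HOST = "horus-login.example.org"        # placeholder login host
PROJECT_SPACE = "/projects/my-collaboration"  # placeholder project space

def stage_dataset(local_path: str) -> None:
    """Recursively copy a local file or directory into the project space."""
    subprocess.run(
        ["scp", "-r", local_path, f"{LOGIN_HOST}:{PROJECT_SPACE}/"],
        check=True,
    )

if __name__ == "__main__":
    stage_dataset("my_dataset/")
```

A Globus Online transfer would accomplish the same staging step through its web interface or CLI rather than scp.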
Project Timeline and Milestones
Shown on the right is the project timeline.
Given recent issues with product supply chains, we may have some challenges in getting components procured, delivered and installed as we originally planned.
Part of the plan for next quarter is identifying the early adopters from our science collaborations.
Let us know if you want to be a guinea pig for the project!
Overview of the Proposed Science Collaborations
HORUS and OSG: On-ramp to National Scale Resources
As part of the NSF award requirements, the HORUS project needs to make at least 20% of its capacity available to users outside of our region:
“Proposals are required to commit to a minimum of 20% shared time on the cluster and describe their approach to making the cluster available as a shared resource external to the state/region and the set of institutions being primarily served. Proposals are strongly encouraged to address this requirement by joining the Partnerships to Advance Throughput Computing (PATh) campus federation (https://opensciencegrid.org/campus-cyberinfrastructure.html) and adopting an appropriate subset of PATh services to make the cluster available to researchers on a national scale.”
We proposed to do this via the OSG/PATh project and have a close working relationship with them (PI leads OSG Networking, UM/WSU already contribute cycles to OSG).
Our plan is to use an HTCondor meta-scheduler and integrate with an OSG CE (Compute Entrypoint) to allow users outside our region to access HORUS.
A beneficial side effect of this integration is that it also gives the HORUS team an opportunity to make OSG/PATh resources more easily accessible to HORUS users.
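To make this concrete, below is a minimal sketch of what a job submission through an HTCondor scheduler might look like, using the HTCondor Python bindings; the script name, resource requests, and any HORUS-specific attributes (e.g. for selecting GPU or large-memory nodes) are illustrative assumptions, not final interfaces:

```python
# Minimal, generic HTCondor submission sketch via the Python bindings.
# HORUS-specific routing attributes are not yet defined; the values below
# (script name, CPU/memory requests) are placeholders.
import htcondor

job = htcondor.Submit({
    "executable": "analyze.sh",        # user's analysis script (placeholder)
    "arguments": "input.dat",
    "request_cpus": "4",
    "request_memory": "8GB",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()             # connect to the local scheduler
result = schedd.submit(job, count=1)   # queue one instance of the job
print("Submitted as cluster", result.cluster())
```

Jobs arriving from outside the region through the OSG CE would be routed into the same local HTCondor pool.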
What Will Be Available and When?
As shown in the timeline slide, we hope to have initial hardware in place and configured around spring 2023.
Part of our task is to create an easy-to-use interface for our science collaborations, and this may take some time to tune and debug. We will want some early adopters who are willing to help us work through the start-up issues.
For most science users, it may take us a year to get a production infrastructure in place but we could be ready in late summer 2023, depending upon how things go.
One of the reasons we left roughly one-third of the equipment to be purchased in the second year of the project was to allow us to evaluate the real needs and tune the purchased configuration to best match them. This will require some experience and feedback from the majority of our users.
If there is interest, we can discuss timing and details later during Q&A.
Next Steps By Role
The HORUS Team
Science Collaborators
Interested Potential Users
Campus IT and Administrators
Questions or Comments
Questions / Discussion?
Please use/add-to the online notes as we discuss:
Email us with questions: horus-help@umich.edu
Website: http://www.horus-ci.org/ (not yet ready!)
Acknowledgements
We would like to thank our HORUS science partners and our host institutions for their contributions to the work described.
In addition, we want to explicitly acknowledge the National Science Foundation, which supported this work via:
Further Information
OSiRIS
http://www.osris.org project website
Details in various presentations at http://www.osris.org/publications
OSG
https://osg-htc.org/ project website
HTCondor
https://htcondor.org/ project website
Open OnDemand
https://openondemand.org/ project website
NSF CC* Program
https://beta.nsf.gov/funding/opportunities/campus-cyberinfrastructure-cc
Additional Slides Follow
Backup Slides
OSiRIS Storage Summary
We have deployed 13.7 pebibytes (PiB) of raw Ceph storage across our three research institutions in the state of Michigan.
The OSiRIS hardware is monitored by Prometheus and configuration control is provided by Puppet
Institutional identities are used to authenticate users and authorize their access via COmanage and Grouper
Augmented perfSONAR is used to monitor and discover the networks interconnecting our main science users.
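As an illustration of how that monitoring data can be consumed (the Prometheus URL below is a placeholder, and the metric name assumes the standard Ceph Prometheus exporter is enabled), total raw capacity could be queried like this:

```python
# Hypothetical example: query Prometheus for total raw Ceph capacity.
# The server URL is a placeholder; ceph_cluster_total_bytes is a metric
# exposed by the standard Ceph Prometheus module, assumed to be enabled.
import requests

PROM_URL = "http://prometheus.example.org:9090"   # placeholder endpoint

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": "ceph_cluster_total_bytes"},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    total_bytes = float(result["value"][1])
    print(f"Raw capacity: {total_bytes / 2**50:.1f} PiB")
```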
OSiRIS Science Domains
The primary driver for OSiRIS was a set of science domains with either big data or multi-institutional challenges.
OSiRIS is supporting the following science domains:
Recent Science Domains
Brainlife.io (Neuroimaging) - Brainlife organizes neuroimaging data and data derivatives using their registered data types. No single computing resource has enough storage capacity to hold all datasets, nor is any reliable enough that users can access the data whenever they need it. Brainlife will depend on OSiRIS to store datasets and transfer data between computing resources.
Oakland University - Already a user of MSU iCER compute resources, OU will leverage OSiRIS to bring their data closer for analysis and for collaboration with other institutions.
Evolution - Large-scale evolutionary analyses, primarily phylogenetic trees, molecular clocks, and pangenome analyses
Genomics - High volume of human, mammal, environmental, and intermediate analysis data
New and Ongoing Collaborations
Network Upgrades - 100Gb MiLR
Michigan Innovation Network
COmanage Credential Management
The COmanage Ceph Provisioner plugin provides a user interface to manage S3 credentials and default bucket placement.
Work is underway to include a full GUI for managing buckets: create, rename, download, set ACLs from OSiRIS groups or specific users, etc.
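As a usage sketch (the S3 endpoint URL and bucket name below are placeholders, and the credentials are assumed to have been issued through the COmanage provisioner), accessing OSiRIS object storage with those credentials might look like:

```python
# Hypothetical sketch: use S3 credentials issued via COmanage against the
# OSiRIS object gateway. Endpoint URL, keys, and bucket name are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.osris.example.org",   # placeholder endpoint
    aws_access_key_id="ACCESS_KEY_FROM_COMANAGE",
    aws_secret_access_key="SECRET_KEY_FROM_COMANAGE",
)

# List the buckets this credential can see.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])

# Upload a results file into a (placeholder) project bucket.
s3.upload_file("results.csv", "my-project-bucket", "results.csv")
```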
Globus and Gridmap
Network Management
NMAL (Network Management Abstraction Layer) work is led by the Indiana University CREST team
OSiRIS Topology Discovery and Monitoring
Summary
FABRIC Topology
FABRIC (https://fabric-testbed.net/) is a newly funded network testbed spanning the US
Michigan is an early adopter (2021)
https://whatisfabric.net/events/fabric-community-workshop-2020