1 of 23

Update on SLATE and FedOps

Rob Gardner for the SLATE Team

2021.01.06 OSG Area Coordinators Meeting

2 of 23

SLATE update

3 of 23

Quick Recap

  • SLATE - Edge Service Deployment Platform
    • Can be operated completely locally to manage containerized services (& "applications")
      • We are using it for OSG services as part of PATh
      • Sites can use it to ease deployments of hosted CE, StashCache, Frontier-Squid, Globus and other services
    • Implements a privilege model to enable trusted federated operation
      • where the rubber meets the road in terms of security policy development
  • Tutorials given cover the essential concepts and implementation
    • The tutorial we gave to the US CMS admin team is particularly good

4 of 23

SLATE numbers

  • Number of production applications: 13
  • Number of incubator applications: 44
  • Number of registered clusters: 33
  • Number of clusters in operation: 21
  • Number of groups: 51
  • Number of instances deployed:
    • snapshot 1/6/21: 128
  • Number of clusters at South Pole: 1

5 of 23

Aside: Hosted CEs operated by OSG Ops

$ slate instance list --group osg-ops
Name                               Cluster            ID
osg-hosted-ce-amnh-ares            uchicago-river-v2  instance_AOUkOiliGjg
osg-hosted-ce-amnh-hel             uchicago-river-v2  instance_-5tbF_slj3k
osg-hosted-ce-amnh-mendel          uchicago-river-v2  instance_KlZKGY-i5hM
osg-hosted-ce-asu-dell-m240        uchicago-river-v2  instance_-T9qcccY3e0
osg-hosted-ce-clarkson-acres       uchicago-prod      instance_omQQTLH2-XU
osg-hosted-ce-computecanada-cedar  uchicago-river-v2  instance_EjL5pbnc594
osg-hosted-ce-fsu-hnpgrid          uchicago-river-v2  instance_j1D_dZ3_jrI
osg-hosted-ce-gsu-acore            uchicago-river-v2  instance_qUO78rDyrSw
osg-hosted-ce-my-cluster           osgcc              instance_6YYWSOQ2Ahk
osg-hosted-ce-nd-caml-gpu          uchicago-river-v2  instance_5OP48zv5rek
osg-hosted-ce-psc-bridges          uchicago-river-v2  instance_SW6qKz9cFVA
osg-hosted-ce-sdsc-triton-stratus  uchicago-river-v2  instance_OwaaDcR5wiU
osg-hosted-ce-sut-ozstar           chtc-tiger         instance_5qgv0f6m9oQ
osg-hosted-ce-tcnj-elsa            uchicago-river-v2  instance_hLptxaFKTjI
osg-hosted-ce-tufts-cluster        chtc-tiger         instance_DNA8O0VQAIs
osg-hosted-ce-uci-gpatlas          uchicago-river-v2  instance_twHw8lU6_Zg
osg-hosted-ce-uconn-xanadu         uchicago-river-v2  instance_fI2zEGUbdWg
osg-hosted-ce-ucsd-comet           uchicago-river-v2  instance_o6038H1CDIg
osg-hosted-ce-usf-sc               uchicago-river-v2  instance_4riG7c9yTFA
osg-hosted-ce-uwm-nemo             uchicago-river-v2  instance_Zrk8YgF3yK8
osg-hosted-ce-wsu-grid             uchicago-river-v2  instance_oVjOC0nHnN8

$ slate instance list --group osg-covid19
Name                               Cluster            ID
open-science-ce-ce1                uchicago-river-v2  instance_qPo-bcR3TqU
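
A listing like the ones above can be summarised per cluster with a few lines of Python. The sketch below is illustrative only: it assumes the three whitespace-separated columns shown on this slide (Name, Cluster, ID), works on a handful of rows copied from the slide rather than on a live query, and the helper name `count_by_cluster` is hypothetical.

```python
from collections import Counter

# A few rows copied from the slide; a real run would capture the CLI output instead.
SAMPLE_OUTPUT = """\
Name Cluster ID
osg-hosted-ce-amnh-ares uchicago-river-v2 instance_AOUkOiliGjg
osg-hosted-ce-clarkson-acres uchicago-prod instance_omQQTLH2-XU
osg-hosted-ce-my-cluster osgcc instance_6YYWSOQ2Ahk
osg-hosted-ce-sut-ozstar chtc-tiger instance_5qgv0f6m9oQ
osg-hosted-ce-tufts-cluster chtc-tiger instance_DNA8O0VQAIs
"""


def count_by_cluster(listing: str) -> Counter:
    """Count instances per cluster in a 'Name Cluster ID' listing (assumed layout)."""
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, malformed lines, and the header row.
        if len(fields) != 3 or fields[0] == "Name":
            continue
        _name, cluster, _instance_id = fields
        counts[cluster] += 1
    return counts


if __name__ == "__main__":
    for cluster, n in count_by_cluster(SAMPLE_OUTPUT).most_common():
        print(f"{cluster}: {n}")
```

Run over the full osg-ops listing, the same function would show how heavily the hosted CEs concentrate on a single cluster.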

6 of 23

Aside: Hosted CEs run by others

$ slate instance list | grep hosted-ce | grep -v osg-ops
osg-hosted-ce-aggiegrid                nmsu              nmsu               instance_qeOWw6HnsiQ
osg-hosted-ce-blin-local-submit-attr   chtc-osg          osgcc              instance_SB5CJyRBj8Y
osg-hosted-ce-discovery                nmsu              nmsu               instance_qOZ2j80yaos
osg-hosted-ce-osg-gatech-dev           gatech-dev        osg-gatech-dev     instance_V1DM5Z3YNyo
osg-hosted-ce-tacc-frontera            osg-hepcloud-ops  uchicago-river-v2  instance_ZKBN7nCW1hI
osg-hosted-ce-tacc-stampede2           osg-hepcloud-ops  uchicago-river-v2  instance_MEsPx3_7fqQ
osg-hosted-ce-uiuc-htc                 mwt2              uchicago-prod      instance_YEyAToBxjJA
slate-dev-osg-hosted-ce-uchicago-grid  slate-dev         uchicago-prod      instance_Hzaq1kea1Ck
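
The pipeline above relies on plain substring matching, so `grep -v osg-ops` drops any line that contains `osg-ops` anywhere, not only in the group column. A column-aware version of the same filter can be sketched in Python. The column order (Name, Group, Cluster, ID) is inferred from the rows on this slide, the last two sample rows are adapted from other slides in this deck, and `Instance`, `parse`, and `hosted_ces_outside` are hypothetical names.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    name: str
    group: str
    cluster: str
    instance_id: str


# Sample rows; the Name/Group/Cluster/ID column order is an assumption.
ROWS = """\
osg-hosted-ce-aggiegrid nmsu nmsu instance_qeOWw6HnsiQ
osg-hosted-ce-tacc-frontera osg-hepcloud-ops uchicago-river-v2 instance_ZKBN7nCW1hI
osg-hosted-ce-uiuc-htc mwt2 uchicago-prod instance_YEyAToBxjJA
osg-hosted-ce-usf-sc osg-ops uchicago-river-v2 instance_4riG7c9yTFA
osg-frontier-squid ssl uchicago-river-dev instance_d2nV6MP2_4Y
"""


def parse(listing: str) -> List[Instance]:
    """Turn each non-blank, four-column line into an Instance."""
    return [Instance(*line.split()) for line in listing.splitlines() if line.strip()]


def hosted_ces_outside(instances: List[Instance], excluded_group: str) -> List[Instance]:
    """Keep hosted CEs whose group column is not the excluded group."""
    return [
        inst
        for inst in instances
        if "hosted-ce" in inst.name and inst.group != excluded_group
    ]


if __name__ == "__main__":
    for inst in hosted_ces_outside(parse(ROWS), "osg-ops"):
        print(inst.name, inst.group, inst.cluster)
```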

7 of 23

Aside: Various Squid Caches

$ slate instance list | grep squid
ndcms-osg-frontier-squid-global      ndcms        notredame           instance_GwswqO_izOs
osg-frontier-squid                   uu-chpc-ops  uutah-prod          instance_rz4dkyGny_A
osg-frontier-squid                   ssl          uchicago-river-dev  instance_d2nV6MP2_4Y
osg-frontier-squid                   gpn-poc      gpn-poc-onenet      instance_y1J_s4VymXY
osg-frontier-squid                   gatech-dev   osg-gatech-dev      instance_OIZCS8jjKcc
osg-frontier-squid-global            nmsu         nmsu                instance_ykKyfHfFapA
osg-frontier-squid-mwt2-iu           mwt2         mwt2-iu             instance_tVkVVIrXIIA
osg-frontier-squid-mwt2-uc           mwt2         uchicago-prod       instance_Mi9mzDKl3OI
osg-frontier-squid-mwt2-uiuc         mwt2         mwt2-uiuc           instance_9Uu0ma4pz7c
osg-frontier-squid-swt2-cpb          swt2-cpb     swt2-cpb            instance_-9j5nN97ldc
slate-dev-osg-frontier-squid-cvmfs   slate-dev    koik8scluster       instance_OxEZcuOm_H0
slate-dev-osg-frontier-squid-cvmfs   slate-dev    uchicago-prod       instance_js3-usm2paY
slate-dev-osg-frontier-squid-global  slate-dev    uchicago-prod       instance_vTb5dO1fuZA
slate-dev-osg-frontier-squid-global  slate-dev    umich-prod          instance_QAcSmU3wq8o
spt-osg-frontier-squid-global        spt          spt-npx             instance_5m107QSBV0U
ssl-osg-frontier-squid-cvmfs         ssl          uchicago-river-v2   instance_WkTVO4N_r8w

8 of 23

Various XCaches

All of the US sites:

AGLT2

BNL*

MWT2

NET2

SWT2_CPB

EU sites:

LRZ-LMU*

Prague

bold = FedOps

All of them are single node deployments.

Developing a chart that supports:

  • Multi-node deployments
  • Rebalancing
  • Heartbeats

*same image but not deployed using SLATE

9 of 23

Federated Ops Security

10 of 23

SLATE Security Personnel

Information Security Officer

Tom Barton

Email: tbarton@uchicago.edu

Mobile Phone: 773-213-1096

Many thanks to Chris Weaver, who has now moved on to a cloud engineering position on IceCube at Michigan State University

SLATE Security Staff

Mitchell Steinman

Email: mitchell.steinman@utah.edu

Office Phone: 208-721-2945

Mobile Phone: 208-721-2945

Muhammad Akhdhor

Email: muali@umich.edu

Office Phone: 734-936-3249

Mobile Phone: 813-406-1982

11 of 23

High Level Picture on Security

  • We want to prepare the policy side of SLATE as well as the technical side so that sites and organizations will be able to trust it as infrastructure
  • We are using the WISE Community SCIv2 framework as a foundation
  • We have tried to combine the list of possible documents suggested by the TrustedCI guide with a set of general areas we want to cover to form a list of documents to write
  • For each requirement in SCIv2 we have tried to identify which of our documents would be responsible for covering it, and make sure that each is appropriately covered by at least one document
  • We are part way through drafting all of the documents planned in this way
    • Some are more or less finished, some are not yet started
  • We hope to have the WLCG Federated Security WG review compliance with SCIv2, eventually
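
The coverage check described above (every SCIv2 requirement owned by at least one document) amounts to a small bookkeeping exercise. A minimal sketch, assuming a hand-maintained mapping: the requirement-to-document assignments below are invented for illustration and are not the SLATE team's actual mapping, `EXAMPLE-1` is a placeholder rather than a real SCIv2 identifier, and `uncovered` and `requirements_per_document` are hypothetical helpers.

```python
from collections import defaultdict

# Hypothetical mapping from requirement ID to the documents meant to cover it.
COVERAGE = {
    "OS3": ["Edge Admin. Obligations", "Change Management Policy"],
    "OS4": ["Master Information Security Policy and Procedures"],
    "EXAMPLE-1": [],  # placeholder for a requirement no document has claimed yet
}


def uncovered(coverage: dict) -> list:
    """Return the requirement IDs that no document covers."""
    return sorted(req for req, docs in coverage.items() if not docs)


def requirements_per_document(coverage: dict) -> dict:
    """Invert the mapping: which requirements does each document answer for?"""
    index = defaultdict(list)
    for req, docs in coverage.items():
        for doc in docs:
            index[doc].append(req)
    return dict(index)


if __name__ == "__main__":
    print("Uncovered:", uncovered(COVERAGE))
    for doc, reqs in requirements_per_document(COVERAGE).items():
        print(f"{doc}: {', '.join(reqs)}")
```

The inverted view is useful when drafting: it shows what each document has to say before it can be called finished.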

12 of 23

Roles in the SLATE Federation

Role                       Description                                          Who does this
Platform Administrator     Operates the central parts of the federation         SLATE Team
Edge Administrator         Runs a cluster which participates in the federation  OSG Sites
Application Administrator  Runs one or more services on one or more clusters    OSG operations
Application Developer      Maintains an application for use on the platform     OSG Software Team
Application Reviewer       Checks applications for consistency with policy      SLATE Team

13 of 23

Private cloud operation - DevOps basics

[Diagram: three hosts with containers, an orchestrator, a container registry, three developers, and a test/accredit step. Example technologies named on the diagram: RHEL, SEL, …; Docker, Singularity; Kubernetes (+Helm), Docker Swarm, AWS ECS; GitHub, ...; Jenkins, Puppet, ...]

A single organization manages everything

this and following from Tom Barton, SIG-ISM/WISE Workshop, Oct 2020

14 of 23

SLATE: federated DevOps

[Diagram: Site 1, Site 2, ... Site N each run hosts with containers under their own orchestrator (Kubernetes + Helm). The SLATE Platform connects them and comprises the SLATE API & user portal, a container registry, developers, and a test/accredit step.]

15 of 23

Roles in SLATE federated operations

[Diagram: the federated layout from the previous slide (Sites 1 to N and the SLATE Platform), with Projects 1 to M added and each role annotated with its responsibilities.]

Role            Responsibilities
App Admin       Instruct SLATE to run containers on edge sites
App Dev         Conform to criteria
Reviewer        Check conformance with criteria; publish conformance report
Platform Admin  Operate securely; support other roles
Edge Admin      Configure SLATE namespace & policies; permit/deny App Admin groups & containers
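
The Edge Admin's permit/deny responsibility can be pictured as an allow-list lookup performed before an instance is started. This is a conceptual sketch only, not SLATE's implementation: the `ClusterPolicy` class, its fields, and the `may_deploy` check are hypothetical, and the application names are made up for the example (the group and cluster names are borrowed from the listings earlier in the deck).

```python
from dataclasses import dataclass, field


@dataclass
class ClusterPolicy:
    """Allow-lists an Edge Admin might maintain for one cluster (hypothetical model)."""

    cluster: str
    allowed_groups: set = field(default_factory=set)
    # Per-group application allow-list; "*" stands for "any application".
    allowed_apps: dict = field(default_factory=dict)

    def may_deploy(self, group: str, app: str) -> bool:
        """Deny by default; permit only listed groups running listed applications."""
        if group not in self.allowed_groups:
            return False
        apps = self.allowed_apps.get(group, set())
        return "*" in apps or app in apps


# Example policy: osg-ops may run only hosted CEs, ssl may run anything.
policy = ClusterPolicy(
    cluster="uchicago-river-v2",
    allowed_groups={"osg-ops", "ssl"},
    allowed_apps={"osg-ops": {"osg-hosted-ce"}, "ssl": {"*"}},
)

if __name__ == "__main__":
    for group, app in [("osg-ops", "osg-hosted-ce"), ("mwt2", "osg-hosted-ce")]:
        verdict = "permit" if policy.may_deploy(group, app) else "deny"
        print(f"{group} -> {app} on {policy.cluster}: {verdict}")
```

Denying by default keeps the site manager in control: a group or application that the Edge Admin has never considered is refused rather than silently accepted.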

16 of 23

Trust issues for the site manager

How can I remain responsible for the security of my site if I permit others to run things in it?

  • SCI is designed to address this concern, at least enough to address cooperation in managing security incidents, by listing criteria for adequate security of collaborating organisations

A federated platform must further consider:

  • Prospect of platform itself giving unauthorised access
  • Prospect of containers installed creating security issues for the site

For any FedOps implementation:

  • The first requires a thorough review of platform software & host infrastructure security
  • The second is a general issue for the community

17 of 23

Prospect of a platform gaining unauth priv

What we've done in the SLATE context to address this

  • Community-reviewed security documentation
    • TrustedCI early engagement
    • Address all criteria in SCI v2
    • WLCG Federated Operations WG
    • OSG security leads
    • All review is welcome!
  • “Overview of SLATE Platform Internals and Security” doc
  • Clarity of role obligations and SLATE Platform Admins’ support of them

Extension of SCI v2 criteria to the federated operations context was accomplished through per-role Obligations documents

18 of 23

Container Security

Top container misconfiguration security risks*

RBAC; Secrets; Network policies; Privilege levels; Resource limits/requests; Read-only root file systems; Annotations, labels; Sensitive host mount and access; Image configuration, including provenance

We are currently determining additions to the per-role Obligations documents, application review criteria and procedures, and installation defaults, to address these concerns in the context of SLATE

An aspirational goal: to report each container’s adherence to application review criteria so that Edge Admins can better understand the risk

*State of Container and Kubernetes Security, Fall 2020, StackRox
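
Several of the misconfiguration risks listed above can be checked mechanically from a Kubernetes pod manifest. The sketch below is one hedged illustration of what an automated adherence report might look at. The specific checks are assumptions chosen for illustration, not SLATE's application review criteria, and `review_pod_spec` and `EXAMPLE_POD` are hypothetical names. The manifest fields it reads (`hostNetwork`, `hostPath`, `securityContext.privileged`, `securityContext.readOnlyRootFilesystem`, `resources.limits`) are standard Kubernetes Pod fields.

```python
def review_pod_spec(pod: dict) -> list:
    """Return human-readable findings for a pod manifest given as a plain dict."""
    findings = []
    spec = pod.get("spec", {})

    # Sensitive host access: host networking and hostPath volumes.
    if spec.get("hostNetwork"):
        findings.append("pod uses the host network")
    for volume in spec.get("volumes", []):
        if "hostPath" in volume:
            path = volume["hostPath"].get("path")
            findings.append(f"volume {volume.get('name')} mounts host path {path}")

    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext", {})
        # Privilege levels.
        if security.get("privileged"):
            findings.append(f"{name}: runs privileged")
        # Read-only root file systems.
        if not security.get("readOnlyRootFilesystem"):
            findings.append(f"{name}: root file system is writable")
        # Resource limits/requests.
        if "limits" not in container.get("resources", {}):
            findings.append(f"{name}: no resource limits")
    return findings


# Illustrative manifest fragment with several of the risks present.
EXAMPLE_POD = {
    "spec": {
        "volumes": [{"name": "host-etc", "hostPath": {"path": "/etc"}}],
        "containers": [
            {
                "name": "squid",
                "securityContext": {"privileged": True},
                "resources": {"requests": {"cpu": "1"}},
            }
        ],
    }
}

if __name__ == "__main__":
    for finding in review_pod_spec(EXAMPLE_POD):
        print("-", finding)
```

A report of this kind covers only what is visible in the manifest. RBAC, secrets handling, network policies, and image provenance would need checks against other objects and against the registry.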

19 of 23

What SCI v2 did and didn’t do for SLATE security work

Did

Each of its specifications informed aspects of one or more of the various SLATE security documents

Extension to the federated operations context was pretty straightforward through use of per-role Obligations documents

Didn’t

Help address container security

Provide guidance on its use in a federated operations context

OS3 and OS4 (Operational Security directives) don’t really address the upstream DevOps technologies and processes that can have more impact on the resultant security of a running container than its host’s own security configuration

20 of 23

SLATE Policy Areas and Documents

Area                                 Planned Documents                                    Status
Overview                             Master Information Security Policy and Procedures    In progress
Definition of Protected Environment  "Overview of SLATE Platform Internals and Security"  Done
  (Network Security)
Risk Assessment                      Asset Inventory                                      Done
Acceptable Use                       Acceptable Use Policy                                Done
User Data Handling                   Privacy Policy                                       Done
Incident Response                    Incident Response Policy                             Done
Obligations for each Role            Edge Admin. Obligations, App. Admin. Obligations,    Done
                                     App. Dev. Obligations, App. Reviewer Obligations
Application Review Process           Application Review Procedures                        In progress
Access Control                       Access Control Policy                                Pending
Traceability                         Traceability Policy                                  Pending
Change Management                    Change Management Policy                             Done
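
The status column can be tallied to give a quick progress figure. A small sketch, assuming each table row counts as one entry (the Obligations row, which bundles four documents, is counted once); `DOCUMENTS` and `status_summary` are hypothetical names.

```python
from collections import Counter

# (planned document, status) pairs transcribed from the table, one per row.
DOCUMENTS = [
    ("Master Information Security Policy and Procedures", "In progress"),
    ("Overview of SLATE Platform Internals and Security", "Done"),
    ("Asset Inventory", "Done"),
    ("Acceptable Use Policy", "Done"),
    ("Privacy Policy", "Done"),
    ("Incident Response Policy", "Done"),
    ("Per-role Obligations documents", "Done"),
    ("Application Review Procedures", "In progress"),
    ("Access Control Policy", "Pending"),
    ("Traceability Policy", "Pending"),
    ("Change Management Policy", "Done"),
]


def status_summary(documents: list) -> Counter:
    """Count table rows per status."""
    return Counter(status for _title, status in documents)


if __name__ == "__main__":
    summary = status_summary(DOCUMENTS)
    total = sum(summary.values())
    for status, n in summary.most_common():
        print(f"{status}: {n} of {total}")
```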

21 of 23

SLATE Policy Areas and Documents

Some are SLATE platform specific; others can be used to guide other distributed platforms

22 of 23

Next Steps

Focus is on this set of documents:

  • Incorporate additional feedback into published documents
    • Most feedback already included in current iteration
    • Specific request for Best Practices which we have gathered and are breaking out into the respective documents, especially Application Review Process
    • Complete Access Control and Traceability policies
  • Obtain another feedback pass
  • Completed documents are published here

Restart the Federated Operations Security Working Group

  • Last year this effort was put on pause due to lack of time & effort
  • However, many of the original deliverables have been accomplished and we plan to wrap up this effort

23 of 23

security@slateci.io
