DataONE FedSec Workshop Report

DataONE Federated Security Workshop Report

September 2010

Table of Contents

Introduction

Workshop Objectives

DataONE Overview

Problem Statement

Member Node Perspectives

Authentication Technology Choices

Phased Implementation

Next Steps

Open Issues

Introduction

On September 8-9, 2010, DataONE participants and collaborators gathered in Chicago to address federated security standards, management, and implementation in relation to the DataONE project. This report summarizes the outcomes of that workshop.

The workshop participants were Jon Auman, Jim Basney, Ed Bishop, Randy Butler, John Cobb, Tim DiLauro, Dale Hendrickson, Jeff Horsburgh, Matt Jones, David Kennedy, Ken Klingenstein, Kevin Murphy, Tom Sohre, and Dave Vieglais.

Workshop Objectives

DataONE Overview

DataONE and the Data Conservancy are the two DataNet program awards to date, funded by the National Science Foundation (NSF) Office of Cyberinfrastructure (OCI) and other NSF directorates for an initial five years, with plans for ongoing funding of operations. There is no news yet about future DataNet program awards. There is strong motivation for interoperability between DataNets, and identity management is a likely focus area for that interoperability.

DataONE’s goal is to enable synthesis in earth observation sciences, providing a reliable, stable, and adaptive cyberinfrastructure. DataONE consists of coordinating nodes (CNs) and member nodes (MNs). Coordinating nodes provide cataloging services and are responsible for moving information between member nodes. Member nodes each typically support a specific domain science. The target number of member nodes is unknown but may grow from 6 initially to hundreds, serving tens of thousands of users.

DataONE has relationships with other NSF cyberinfrastructures, such as ESG, NEON, and OOI. The goal is not to collate all the data into one system but instead to enable existing and emerging data repositories to interoperate with DataONE.

DataONE is concerned with science data, science metadata, and system metadata. The DataONE system metadata captures dates, types, size, sources, replication attributes, associations, owner(s), and access rules. It associates the science metadata with the science data. DataONE requires a simple process for creating system metadata at member nodes.
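As a rough illustration of the kind of record described above, the following Python sketch models a system metadata entry and the association between science metadata and science data. Every class name, field name, and identifier here is an assumption made for illustration; none of it is taken from the DataONE schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class AccessRule:
    """Hypothetical access rule: a subject granted a permission."""
    subject: str
    permission: str  # for example "read" or "write"


@dataclass
class SystemMetadata:
    """Illustrative record only; field names are assumptions."""
    identifier: str          # identifier of the object this record describes
    object_type: str         # "data" or "science_metadata"
    size_bytes: int
    date_uploaded: datetime
    source_node: str         # member node where the object originated
    owners: List[str]
    describes: List[str] = field(default_factory=list)  # associated objects
    replication_allowed: bool = True
    access_rules: List[AccessRule] = field(default_factory=list)


def make_system_metadata(identifier, object_type, content, source_node, owner):
    """Build a record from values a member node already has at hand."""
    return SystemMetadata(
        identifier=identifier,
        object_type=object_type,
        size_bytes=len(content),
        date_uploaded=datetime.now(timezone.utc),
        source_node=source_node,
        owners=[owner],
    )


# Hypothetical objects: one data file and the science metadata describing it.
data_record = make_system_metadata(
    "obj-001", "data", b"1,2,3\n", "mn-example", "alice")
metadata_record = make_system_metadata(
    "obj-002", "science_metadata", b"<eml/>", "mn-example", "alice")
metadata_record.describes.append(data_record.identifier)
metadata_record.access_rules.append(AccessRule(subject="public", permission="read"))
```

Deriving size and upload date automatically, as the helper does, is one way a member node could keep record creation simple.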

DataONE aims to support multi-master replication of science metadata and system metadata across the coordinating nodes. To replicate science data, a CN asks an MN to retrieve a copy of the data from the source MN. Motivations for replication include archiving, preservation, and survivability, as well as improved access (performance, accessibility, load balancing).
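The science data replication flow described above, in which a CN asks one MN to pull a copy from the source MN, can be sketched in a few lines of Python. This is an in-memory toy under stated assumptions: the class names, method names, and node names are hypothetical and do not correspond to the DataONE APIs.

```python
class MemberNode:
    """Toy member node holding objects in memory (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def get(self, identifier):
        return self.objects[identifier]

    def replicate_from(self, source, identifier):
        # The target MN pulls the bytes directly from the source MN;
        # the data never passes through the coordinating node.
        self.objects[identifier] = source.get(identifier)


class CoordinatingNode:
    """Toy coordinating node that only tracks where copies live."""

    def __init__(self):
        self.replica_locations = {}  # identifier -> list of node names

    def register(self, identifier, node):
        self.replica_locations.setdefault(identifier, []).append(node.name)

    def request_replication(self, identifier, source, target):
        # The CN asks the target MN to retrieve the copy, then records it.
        target.replicate_from(source, identifier)
        self.register(identifier, target)


source_mn = MemberNode("mn-source")
target_mn = MemberNode("mn-target")
cn = CoordinatingNode()

source_mn.objects["obj-001"] = b"1,2,3\n"
cn.register("obj-001", source_mn)
cn.request_replication("obj-001", source_mn, target_mn)
```

In this sketch the coordinating node keeps only the catalog of replica locations, which matches the division of labor described in the text.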

DataONE is nearing the end of a prototyping stage and moving to an evaluation phase. The version 1 public infrastructure will be released and stood up in the next 6 to 12 months. The coordinating nodes for this stage are at Oak Ridge, New Mexico, and Santa Barbara. Member nodes use different technologies that will be integrated via rich DataONE APIs and web interfaces.

More information about the DataONE system architecture and use cases is available at http://mule1.dataone.org/ArchitectureDocs.

Problem Statement

The DataONE project needs both a short-term and long-term strategy for authentication and authorization. Security requirements include:

Member Node Perspectives

Workshop participants presented details on member node requirements and operations. The slides from these presentations are available at ????.

Data access requirements across the member nodes can be categorized as:

  1. public metadata, public data (no tracking)
  2. public metadata, tracked public data
  3. public metadata, restricted data
  4. restricted metadata, restricted data
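The four categories above can be expressed as a small classification function. The sketch below is illustrative only: the enum and parameter names are assumptions, and it treats the unlisted combination of restricted metadata with public data as an error because the list does not include it.

```python
from enum import Enum


class AccessCategory(Enum):
    PUBLIC = 1             # public metadata, public data (no tracking)
    PUBLIC_TRACKED = 2     # public metadata, tracked public data
    RESTRICTED_DATA = 3    # public metadata, restricted data
    RESTRICTED_ALL = 4     # restricted metadata, restricted data


def classify(metadata_public, data_public, tracked=False):
    """Map three flags onto one of the four categories listed above."""
    if not metadata_public:
        if data_public:
            # Assumption: this combination is not among the four categories.
            raise ValueError("restricted metadata with public data is not a listed category")
        return AccessCategory.RESTRICTED_ALL
    if not data_public:
        return AccessCategory.RESTRICTED_DATA
    return AccessCategory.PUBLIC_TRACKED if tracked else AccessCategory.PUBLIC
```

For example, public data that is tracked for usage reporting would fall into category 2 under this mapping.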

Currently all data in CUAHSI and ESDIS is public. Dryad has mostly public data, with one-year data embargoes in some cases. All Dryad metadata is public. LTER data is public with tracking to comply with NSF usage reporting requirements.

Two commonly used data management software packages are Metacat and DSpace.

Authentication Technology Choices

From the wide range of authentication technology choices surveyed in the Security Landscape presentation at the workshop, attendees narrowed their focus to four options considered most promising:

  1. SAML as implemented by InCommon
  2. Certificates as implemented by CILogon
  3. OpenID as implemented by Google and Yahoo
  4. Passwords as implemented by existing LDAP deployments in the science community (for example: LTER)

The Authentication Technology Matrix documents the different attributes that were considered for each of these technology choices. The following requirements were particularly critical to reaching consensus on technology choices:

As a result of considering these and other factors detailed in the Authentication Technology Matrix, the consensus recommendation is for DataONE to adopt a combination of InCommon and CILogon authentication technologies. Note that CILogon itself depends on InCommon.

CILogon today issues certificates that can be used both inside and outside web browsers, but an initial browser-based authentication via InCommon to CILogon is required. InCommon is expected to support non-browser applications in the future via Project Moonshot and related work. CILogon certificates can also be used by services/agents, using long-lived (one year) certificates and/or RFC 3820 proxy certificates.

Phased Implementation

Four implementation phases are envisioned:

  1. Mostly public access (target date: January 2011): Only publicly readable content is replicated. Only publicly readable content is indexed for search and retrieval. Access to restricted content is through origin member node only. No authentication is required to search and retrieve public content. Authentication is required to upload (create) content.
  2. Roles supported for search and retrieval: ACLs respected by coordinating nodes. Authenticated users can discover content that is restricted to them or their groups. Restricted access content is not replicated.
  3. Roles supported for content replication: Restricted access content is replicated to member nodes with compatible ACLs and pre-arranged trust agreements.
  4. Consistent semantic and functional interoperability for identity and security: Restricted access content is replicated to any member node. Authentication by long-running workflows is supported.
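One way to read the four phases above is as a policy table governing what may be replicated and indexed. The following Python sketch encodes that reading; the function and parameter names are hypothetical, and the single `target_is_trusted` flag stands in for both "compatible ACLs" and "pre-arranged trust agreements" in phase 3.

```python
def may_replicate(phase, is_public, target_is_trusted=False):
    """Return True if content may be replicated to a target member node."""
    if phase not in (1, 2, 3, 4):
        raise ValueError("phase must be 1, 2, 3, or 4")
    if is_public:
        return True                # public content is replicated in every phase
    if phase <= 2:
        return False               # phases 1-2: restricted content stays at its origin
    if phase == 3:
        return target_is_trusted   # phase 3: only to member nodes with agreements
    return True                    # phase 4: any member node


def may_index(phase, is_public):
    """Return True if coordinating nodes may index the content for search."""
    if phase not in (1, 2, 3, 4):
        raise ValueError("phase must be 1, 2, 3, or 4")
    # Phase 1 indexes public content only; later phases also index
    # restricted content, with ACLs enforced at search time.
    return is_public or phase >= 2


def requires_authentication(action, is_public):
    """Uploads always need authentication; public search and retrieval do not."""
    if action == "create":
        return True
    return not is_public
```

Keeping the policy in small pure functions like these would make it straightforward to test each phase boundary before deployment.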

Next Steps

Given the above authentication technology choices and phased implementation plan, the next steps are:

Open Issues

We conclude by identifying open issues raised during the workshop: