DataONE Federated Security Workshop Report

September 2010

Table of Contents

Introduction

Workshop Objectives

DataONE Overview

Problem Statement

Member Node Perspectives

Authentication Technology Choices

Phased Implementation

Next Steps

Open Issues

Introduction

On September 8-9, 2010, DataONE participants and collaborators gathered in Chicago to address federated security standards, management, and implementation in relation to the DataONE project. This report summarizes the outcomes of that workshop.

The workshop participants were Jon Auman, Jim Basney, Ed Bishop, Randy Butler, John Cobb, Tim DiLauro, Dale Hendrickson, Jeff Horsburgh, Matt Jones, David Kennedy, Ken Klingenstein, Kevin Murphy, Tom Sohre, and Dave Vieglais.

Workshop Objectives

Overview and understanding of security protocols and systems at existing DataONE Member Nodes and Coordinating Nodes, and other related federated security projects.
Identification of critical short-term (12-18 month) security technology recommendations that can be reasonably implemented by DataONE and other DataNets.
Identification of cross-Member and cross-Net requirements; establish consensus.
Identification of long-term (2-5 years) strategy.
Identification of out-of-scope or unmet needs that may be targeted by new research (requires external funding).

DataONE Overview

DataONE and the Data Conservancy are the two DataNet program awards to date, funded by the National Science Foundation (NSF) Office of Cyberinfrastructure (OCI) and other NSF directorates for an initial five years, with plans for ongoing funding of operations. There is no news yet about future DataNet program awards. There is strong motivation for interoperability between DataNets, and identity management is a likely focus area for it.

DataONE’s goal is to enable synthesis in earth observation sciences, providing a reliable, stable, and adaptive cyberinfrastructure. DataONE consists of coordinating nodes (CNs) and member nodes (MNs). Coordinating nodes provide cataloging services and are responsible for moving information between member nodes. Member nodes each typically support a specific domain science. The target number of member nodes is unknown but may grow from 6 initially to hundreds, serving tens of thousands of users.

DataONE has relationships with other NSF cyberinfrastructures, such as ESG, NEON, and OOI. The goal is not to collate all the data into one system but instead to enable existing and emerging data repositories to interoperate with DataONE.

DataONE is concerned with science data, science metadata, and system metadata. The DataONE system metadata captures dates, types, size, sources, replication attributes, associations, owner(s), and access rules. It associates the science metadata with the science data. DataONE requires a simple process for creating system metadata at member nodes.
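As an illustration only, the system metadata described above could be modeled as a small record. This is a minimal sketch, not the DataONE schema: the class names, field names, and the make_system_metadata helper are all hypothetical and chosen only to mirror the attributes listed in the preceding paragraph.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Sequence


@dataclass
class AccessRule:
    """Hypothetical access rule: which subject holds which permission."""
    subject: str
    permission: str  # assumed values such as "read" or "write"


@dataclass
class SystemMetadata:
    """Hypothetical record mirroring the attributes named in the report."""
    identifier: str                 # assumed unique id for the object
    object_type: str                # type
    size_bytes: int                 # size
    date_uploaded: datetime         # dates
    source_node: str                # source (originating member node)
    owners: List[str]               # owner(s)
    describes: List[str] = field(default_factory=list)  # associations
    replication_allowed: bool = True                     # replication attribute
    access_rules: List[AccessRule] = field(default_factory=list)


def make_system_metadata(identifier: str, object_type: str, payload: bytes,
                         source_node: str, owner: str,
                         describes: Sequence[str] = ()) -> SystemMetadata:
    """Fill in what can be derived automatically so the caller supplies little."""
    return SystemMetadata(
        identifier=identifier,
        object_type=object_type,
        size_bytes=len(payload),
        date_uploaded=datetime.now(timezone.utc),
        source_node=source_node,
        owners=[owner],
        describes=list(describes),
    )
```

A member node could then create a record for a science metadata document with a single call, for example make_system_metadata("doc.1", "science-metadata", b"<eml/>", "mn-example", "alice", describes=["data.1"]), which reflects the stated need for a simple creation process.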

DataONE aims to support multi-master replication of science metadata and system metadata across the coordinating nodes. To replicate science data, a CN asks an MN to retrieve a copy of the data from the source MN. Motivations for replication include archiving, preservation, and survivability, as well as improved access (performance, accessibility, load balancing).
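The science data replication flow can be sketched as follows. This is a toy in-memory model under stated assumptions, not the DataONE API: the MemberNode and CoordinatingNode classes and their method names are hypothetical. The point it illustrates is that the coordinating node only issues the request and records the result, while the target member node pulls the bytes from the source member node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class MemberNode:
    """Hypothetical member node holding science data in memory."""
    name: str
    store: Dict[str, bytes] = field(default_factory=dict)

    def get(self, identifier: str) -> bytes:
        return self.store[identifier]

    def replicate_from(self, source: "MemberNode", identifier: str) -> None:
        # The target node retrieves the copy directly from the source node.
        self.store[identifier] = source.get(identifier)


@dataclass
class CoordinatingNode:
    """Hypothetical coordinating node that tracks where replicas live."""
    replicas: Dict[str, List[str]] = field(default_factory=dict)

    def request_replication(self, identifier: str,
                            source: MemberNode, target: MemberNode) -> None:
        # The coordinating node asks the target to pull; it moves no data itself.
        target.replicate_from(source, identifier)
        holders = self.replicas.setdefault(identifier, [source.name])
        if target.name not in holders:
            holders.append(target.name)
```

For example, after cn.request_replication("data.1", mn_a, mn_b), the object is stored on both member nodes and the coordinating node lists both as holders.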

DataONE is nearing the end of a prototyping stage and moving to an evaluation phase. The version 1 public infrastructure will be released and stood up in the next 6 to 12 months. The coordinating nodes for this stage are at Oak Ridge, New Mexico, and Santa Barbara. Member nodes use different technologies that will be integrated via rich DataONE APIs and web interfaces.

More information about the DataONE system architecture and use cases is available at http://mule1.dataone.org/ArchitectureDocs.

Problem Statement

The DataONE project needs both a short-term and long-term strategy for authentication and authorization. Security requirements include:

Secure write access for publishing, modifying, and replicating data. Data owners control their data. Stewardship of the data resides at the sites (member nodes). Coordinating nodes manage data replication.
Secure read access to private data. While most (if not all) data in DataONE is expected to become public, access may be initially embargoed. Access control on private data is set by data owners and must be enforced, including (in some cases) hiding private data from search results. Different member nodes have different data access policies.
Tracking read access to public data. Investigators require information about who is accessing their data. They are concerned about misuse or misrepresentation of the data, and need to be able to contact users of the data regarding updates.

Member Node Perspectives

Workshop participants presented details on member node requirements and operations. The slides from these presentations are available at ????.

Data access requirements across the member nodes can be categorized as:

public metadata, public data (no tracking)
public metadata, tracked public data
public metadata, restricted data
restricted metadata, restricted data
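The four categories above can be expressed as an enumeration with three flags each. This is a sketch for illustration: the AccessCategory name, the flag names, and the two helper functions are hypothetical, and the rule that public metadata implies visibility in anonymous search is one reading of the requirements, not a stated DataONE policy.

```python
from enum import Enum


class AccessCategory(Enum):
    """The four data access categories; each value is (label, flags...)."""
    PUBLIC = ("public metadata, public data (no tracking)", True, True, False)
    TRACKED_PUBLIC = ("public metadata, tracked public data", True, True, True)
    RESTRICTED_DATA = ("public metadata, restricted data", True, False, False)
    RESTRICTED_ALL = ("restricted metadata, restricted data", False, False, False)

    def __init__(self, label: str, metadata_public: bool,
                 data_public: bool, tracked: bool) -> None:
        self.label = label
        self.metadata_public = metadata_public
        self.data_public = data_public
        self.tracked = tracked


def appears_in_anonymous_search(category: AccessCategory) -> bool:
    """Assumption: content is discoverable by anyone only if its metadata is public."""
    return category.metadata_public


def data_read_needs_authorization(category: AccessCategory) -> bool:
    """Restricted data requires an access check before it is served."""
    return not category.data_public
```

Under this model, a member node serving tracked public data would check category.tracked and record each access, while restricted data would go through an authorization check first.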

Currently all data in CUAHSI and ESDIS is public. Dryad has mostly public data, with one-year data embargoes in some cases. All Dryad metadata is public. LTER data is public with tracking to comply with NSF usage reporting requirements.

Two commonly used data management software packages are Metacat and DSpace.

Authentication Technology Choices

From the wide range of authentication technology choices surveyed in the Security Landscape presentation at the workshop, attendees narrowed their focus to four options considered most promising:

SAML as implemented by InCommon
Certificates as implemented by CILogon
OpenID as implemented by Google and Yahoo
Passwords as implemented by existing LDAP deployments in the science community (for example: LTER)

The Authentication Technology Matrix documents the different attributes that were considered for each of these technology choices. The following requirements were particularly critical to reaching consensus on technology choices:

Support for authentication via web browsers and client / API applications
Support for authentication of services / agents that represent systems and people

As a result of considering these and other factors detailed in the Authentication Technology Matrix, the consensus recommendation is for DataONE to adopt a combination of InCommon and CILogon authentication technologies. Note that CILogon itself depends on InCommon.

CILogon today issues certificates that can be used both inside and outside web browsers, but an initial browser-based authentication via InCommon to CILogon is required. InCommon is expected to support non-browser applications in the future via Project Moonshot and related work. CILogon certificates can also be used by services/agents, using long-lived (one year) certificates and/or RFC 3820 proxy certificates.

Phased Implementation

Four implementation phases are envisioned:

Mostly public access (target date: January 2011): Only publicly readable content is replicated. Only publicly readable content is indexed for search and retrieval. Access to restricted content is through origin member node only. No authentication is required to search and retrieve public content. Authentication is required to upload (create) content.
Roles supported for search and retrieval: ACLs respected by coordinating nodes. Authenticated users can discover content that is restricted to them or their groups. Restricted access content is not replicated.
Roles supported for content replication: Restricted access content is replicated to member nodes with compatible ACLs and pre-arranged trust agreements.
Consistent semantic and functional interoperability for identity and security: Restricted access content is replicated to any member node. Authentication by long-running workflows is supported.
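The replication and indexing rules in the four phases above can be summarized as two small predicates. This is a sketch under assumptions: the phases are numbered 1 to 4 in the order listed, the Phase names are hypothetical, and the single target_is_trusted flag stands in for both "compatible ACLs" and "pre-arranged trust agreements".

```python
from enum import IntEnum


class Phase(IntEnum):
    """Hypothetical names for the four phases, in the order listed."""
    MOSTLY_PUBLIC = 1
    SEARCH_ROLES = 2
    REPLICATION_ROLES = 3
    FULL_INTEROPERABILITY = 4


def may_index(phase: Phase, is_public: bool) -> bool:
    """Public content is always indexed; restricted content from phase 2 on,
    when coordinating nodes respect ACLs during search and retrieval."""
    return is_public or phase >= Phase.SEARCH_ROLES


def may_replicate(phase: Phase, is_public: bool, target_is_trusted: bool) -> bool:
    """Decide whether content may be replicated to a given member node."""
    if is_public:
        return True
    if phase >= Phase.FULL_INTEROPERABILITY:
        return True  # restricted content may go to any member node
    if phase == Phase.REPLICATION_ROLES:
        return target_is_trusted  # only to nodes with compatible ACLs and trust
    return False  # phases 1 and 2: restricted content is not replicated
```

For instance, may_replicate(Phase.SEARCH_ROLES, is_public=False, target_is_trusted=True) is False, because restricted content is not replicated before the third phase regardless of trust.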

Next Steps

Given the above authentication technology choices and phased implementation plan, the next steps are:

Adapt the MN and CN software stack to the chosen authentication technologies.
Form a working group to address later phases of the work and remaining topics, including authorization.
Have DataONE join InCommon.
Follow up in person at the DataONE all-hands meeting in the first week of November.
Prepare for the DataONE project review in early February.

Open Issues

We conclude by identifying open issues raised during the workshop:

How will authentication occur outside of the web environment (e.g., exclusively on the command line, without an initial browser step to download a certificate)?
What is the DataONE privacy policy? Can users access DataONE anonymously/pseudonymously?
Will users have a “DataONE identity” (or profile)? How is it mapped to external identities?
What level of assurance (LOA) is required across DataONE? Can DataONE support multiple LOAs?
How will DataONE resolve users with multiple identities? Does it require an administrator, or can the user make the mapping? What are the implications for intellectual property ownership?
Should DataONE form a relationship with VIVO (for managing profiles)?
How is trust established between member nodes? What role do the coordinating nodes play in trust establishment?
How will DataONE handle an EML document that has inline data?
How will DataONE support interoperable authorization across MNs?