1 of 35

Dataverse, Journals, and Sensitive Data

Gustavo Durand

Dataverse Technical Lead / Architect

Data-PASS Pre-APSA Workshop - August 30, 2017

2 of 35

Dataverse

3 of 35

Dataverse

  • Overview, Features, and Technology
  • Development Process
    • Transparency, Strategic Goals, Roadmap
  • Collaborations
  • Community

4 of 35

Overview

  • An open-source platform to publish, cite, and archive research data
  • Built to support multiple types of data, users, and workflows
  • Developed at Harvard’s Institute for Quantitative Social Science (IQSS) since 2006
  • Development funded by IQSS and by grants, in collaboration with institutions around the world
  • 15 on the core team - developers, designers, UI/UX, metadata specialists, curation manager

5 of 35

Dataverse Features - Data

  • Persistent IDs / URLs
    • DataCite
    • Handle
  • Automatically Generated Citations with attribution
  • Compliant with FAIR and data citation principles
  • Domain-specific Metadata
  • Versioning
  • File Storage
    • Local
    • Swift (OpenStack)
    • S3 (Amazon)

6 of 35

Dataverse Features - Users

  • Multiple Sign In options
    • Native
    • Shibboleth
    • OAuth (ORCID)
  • Dataverses within Dataverses
  • Branding
  • Widgets

7 of 35

Dataverse Features - Workflows

  • Permissions
  • Access Controls and Terms of Use
  • Publishing Workflows
  • Private URLs
  • Upload / Download Workflows
    • Browser
    • Dropbox
    • Rsync (for big data “packages”)

8 of 35

Dataverse Features - Interoperability

  • APIs
    • SWORD
    • Native
  • Harvesting (OAI-PMH)
    • Client
    • Server
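
As a rough illustration of the harvesting client side: an OAI-PMH request is an HTTP GET whose `verb` parameter names the operation. The sketch below only builds a `ListRecords` request URL and makes no network call. The base URL and set name are hypothetical placeholders; `oai_dc` is the Dublin Core format that the OAI-PMH specification requires every server to support.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL (no network access)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec  # optional: harvest only one set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint and set name, for illustration only.
url = oai_list_records_url("https://dataverse.example.edu/oai", set_spec="my_set")
print(url)
```

A harvesting client would fetch this URL, parse the XML response, and repeat the request with the returned resumption token until the list is complete.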

9 of 35

Dataverse Technology

GlassFish Server 4.1

Java SE 8

Java EE 7

  • Presentation: JSF (PrimeFaces), RESTful API
  • Business: EJB, Transactions, Asynchronous, Timers
  • Storage: JPA (Entities), Bean Validation

Storage: Postgres, Solr, File System / Swift / S3

10 of 35

Dataverse Development Process

  • Inbox
  • Backlog
  • This Sprint
  • Development
  • Code Review
  • QA
  • Done

11 of 35

(some) Collaborations

  • SBGrid Data
    • Large Data and Support
  • Massachusetts Open Cloud
    • Big Data Storage and Compute Access (OpenStack)
  • DANS/CIMMYT
    • Handles Support
  • ResearchSpace
    • API Java Client Library
  • (soon) Provenance
    • W3C PROV

12 of 35

Dataverse Community

  • 26 installations around the world

13 of 35

Dataverse Community

  • 40+ code contributors outside of the Core Team
  • Hundreds of members of the Dataverse Community - developers, researchers, librarians, data scientists
    • Dataverse Google Group
    • Dataverse Community Calls
    • Dataverse Community Meeting

14 of 35

Community

15 of 35

Journals

16 of 35

Journals

  • Overview
  • Permissions
  • Demo
    • Review Workflow
    • Private URLs

17 of 35

Journals

https://dataverse.org/journals

  • We recommend four ways that journals can use Dataverse repositories to ensure that authors make data available and get credit for their research, with links to and from associated published articles
    • Set up a journal dataverse
    • Set up a journal dataverse with data curation & verification
    • Integrate your journal's manuscript submission system with Dataverse
    • Recommend Dataverse to authors

18 of 35

Permissions / Roles

Robust Permission System:

  • System Roles
  • Custom Roles (can be defined per installation)
  • Groups
    • Explicit
    • IP
    • Shibboleth
  • Inheritance
    • dataverse -> dataset
    • dataset -> file
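
The inheritance chain above can be pictured with a small model. This is a sketch of the idea only, not Dataverse's implementation: the `Node` class, the user names, and the file name are hypothetical, and role names are used purely as labels.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class Node:
    """A dataverse, dataset, or file in a containment hierarchy."""
    name: str
    parent: Optional["Node"] = None
    # user -> roles assigned directly on this object
    assignments: Dict[str, Set[str]] = field(default_factory=dict)

    def assign(self, user: str, role: str) -> None:
        self.assignments.setdefault(user, set()).add(role)

    def effective_roles(self, user: str) -> Set[str]:
        """Roles assigned here plus roles inherited from all ancestors."""
        roles = set(self.assignments.get(user, set()))
        if self.parent is not None:
            roles |= self.parent.effective_roles(user)
        return roles


journal = Node("journal dataverse")
dataset = Node("replication dataset", parent=journal)
data_file = Node("results.tab", parent=dataset)

journal.assign("editor", "Curator")      # flows down to dataset and file
dataset.assign("author", "Contributor")  # flows down to the file only
```

In this model a role granted on the journal dataverse is visible on every dataset and file beneath it, while a role granted on a dataset never propagates upward.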

19 of 35

Review Workflow

If you have a Contributor role in a dataverse, you can submit your dataset for review once you have finished uploading your files and filling in all of the relevant metadata fields.

  • To submit for review, go to your dataset and click the “Submit for Review” button, located next to the “Edit” button in the upper right.
  • Once the dataset is submitted for review, its Admin or Curator is notified to review it and then either “Publish” it or “Return to Author”.
    • If the dataset is published, the contributor will be notified that it is now published.
    • If the dataset is returned to the author, the contributor of this dataset will be notified that they need to make modifications before it can be submitted for review again.
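
The review steps above amount to a small state machine. The sketch below models only the contributor review path described on this slide; the state names, action names, and `step` function are hypothetical and do not come from the Dataverse code base.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    IN_REVIEW = "in review"
    PUBLISHED = "published"


# (current state, action) -> (next state, who gets notified)
TRANSITIONS = {
    (State.DRAFT, "submit_for_review"): (State.IN_REVIEW, "curator"),
    (State.IN_REVIEW, "publish"): (State.PUBLISHED, "contributor"),
    (State.IN_REVIEW, "return_to_author"): (State.DRAFT, "contributor"),
}


def step(state, action):
    """Apply an action; return (next_state, notified_party) or raise ValueError."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"{action!r} is not allowed while {state.value}") from None


state, notified = step(State.DRAFT, "submit_for_review")
print(state, notified)
```

A returned dataset goes back to the draft state, so the contributor can edit it and submit again.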

20 of 35

Private URLs

Creating a Private URL for your dataset lets you share it (for viewing and downloading files) before it is published, with a wide group of individuals who may not have a Dataverse user account. Anyone who receives the Private URL can view the dataset without logging in to Dataverse.

  • Go to your unpublished dataset
  • Select the “Edit” button
  • Select “Private URL” in the dropdown menu
  • In the pop-up select “Create Private URL”
  • Copy the Private URL created for this dataset; you can now share it with anyone who should be able to view or download files in your unpublished dataset.
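
One way to picture how such a link can grant access without a login is an unguessable token that the server maps to the unpublished dataset. The sketch below is a toy model under that assumption, not Dataverse's implementation; the class, the `/privateurl?token=` path, the host name, and the dataset identifier are all hypothetical.

```python
import secrets


class PrivateUrlRegistry:
    """Toy model: maps unguessable tokens to unpublished dataset IDs."""

    def __init__(self, base_url):
        self.base_url = base_url
        self._tokens = {}  # token -> dataset id

    def create(self, dataset_id):
        """Issue a new link for a dataset."""
        token = secrets.token_urlsafe(32)  # random, URL-safe token
        self._tokens[token] = dataset_id
        return f"{self.base_url}/privateurl?token={token}"

    def resolve(self, token):
        """Return the dataset ID for a token, or None if it is unknown."""
        return self._tokens.get(token)

    def disable(self, dataset_id):
        """Revoke every link issued for a dataset."""
        self._tokens = {t: d for t, d in self._tokens.items() if d != dataset_id}


registry = PrivateUrlRegistry("https://dataverse.example.edu")
link = registry.create("draft-dataset-42")
token = link.split("token=")[1]
```

Because possession of the link is the only credential, it should be shared as carefully as a password, and revoking it cuts off everyone who holds it.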

21 of 35

Sensitive Data

22 of 35

Sensitive Data

  • Dataverse 5
    • Infrastructure
    • DataTags
    • PSI (Differential Privacy)

23 of 35

Infrastructure

  • Encrypted Transit (already supported)
  • Encrypted Storage (#4113)
  • Require verification of e-mail address (#3300)
  • Complex Passwords (#3150)
    • :PVMinLength, :PVMaxLength
    • :PVCharacterRules, :PVNumberOfCharacteristics
    • :PVDictionaries
    • :PVGoodStrength
  • Mitigate against password guessing (#3153)
  • Bulk Removal of Roles / Permissions (#4055)
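
To make the password settings concrete, here is a minimal validator sketch. The parameter names mirror the settings listed above, but their semantics are assumptions read off the names: `number_of_characteristics` is how many character rules must be satisfied, `good_strength` is a length at which the remaining rules are waived, and `0` means "no limit". It is not Dataverse's validator.

```python
import string

CHARACTER_CLASSES = {
    "UpperCase": string.ascii_uppercase,
    "LowerCase": string.ascii_lowercase,
    "Digit": string.digits,
    "Special": string.punctuation,
}


def check_password(password, min_length=8, max_length=0,
                   character_rules=None, number_of_characteristics=2,
                   dictionary=frozenset(), good_strength=0):
    """Return a list of rule violations; an empty list means the password passes."""
    if character_rules is None:
        character_rules = {"UpperCase": 1, "Digit": 1, "Special": 1}
    # Assumed meaning of good_strength: long enough passwords skip the other rules.
    if good_strength and len(password) >= good_strength:
        return []
    errors = []
    if len(password) < min_length:
        errors.append("too short")
    if max_length and len(password) > max_length:
        errors.append("too long")
    # Count how many character rules the password satisfies.
    satisfied = sum(
        1 for name, minimum in character_rules.items()
        if sum(ch in CHARACTER_CLASSES[name] for ch in password) >= minimum
    )
    if satisfied < min(number_of_characteristics, len(character_rules)):
        errors.append("not enough character classes")
    if password.lower() in dictionary:
        errors.append("dictionary word")
    return errors


print(check_password("abc"))
```

Returning all violations at once, rather than stopping at the first, lets a sign-up form show the user everything that needs fixing in one pass.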

24 of 35

DataTags

A datatag is a set of security features and access requirements for file handling.

A datatags repository is one that stores and shares data files in accordance with a standardized and ordered level of security and access requirements.
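
The word "ordered" is what makes tags comparable. The sketch below shows that property with an integer-backed enumeration. The color names follow the scheme used by the DataTags project and are not taken from this text; the two helper functions, including the rule that a collection takes its strictest file's tag, are hypothetical illustrations.

```python
from enum import IntEnum


class DataTag(IntEnum):
    """Ordered levels: a higher value means stricter handling requirements."""
    BLUE = 1
    GREEN = 2
    YELLOW = 3
    ORANGE = 4
    RED = 5
    CRIMSON = 6


def strictest(tags):
    """Assumed rule: a collection of files needs the strictest of their tags."""
    return max(tags)


def repository_accepts(max_supported, tag):
    """A repository can hold a file only if it supports that level or higher."""
    return tag <= max_supported


print(strictest([DataTag.GREEN, DataTag.RED, DataTag.BLUE]).name)
```

Because the levels are totally ordered, both questions reduce to simple comparisons.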

25 of 35

DataTags Levels

26 of 35

DataTags

27 of 35

PolicyModels

PolicyModels is a system for creating models of policies. It can be used to run interactive interviews that yield a concrete treatment that is both human-readable and machine-actionable.

28 of 35

DataTags

29 of 35

Differential Privacy

What is Differential Privacy?

30 of 35

Differential Privacy

31 of 35

Differential Privacy

Differential Privacy is a formal, mathematical conception of privacy preservation.

It guarantees that a reported result reveals essentially no information specific to any single individual, regardless of what auxiliary information an attacker holds.
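
A standard way to achieve this guarantee for a counting query is the Laplace mechanism: add noise with scale sensitivity / epsilon to the true answer. The sketch below is a textbook illustration, not the algorithm PSI uses (the slides do not describe it); the function names and the sample data are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Release a count with epsilon-differential privacy (Laplace mechanism)."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for r in records if predicate(r))
    sensitivity = 1.0  # adding or removing one person changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)


ages = [23, 35, 41, 52, 67, 29, 48, 44]  # made-up data
noisy = private_count(ages, lambda a: a >= 40, epsilon=1.0)
print(noisy)
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means a more accurate release and weaker privacy.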

32 of 35

PSI (Differential Privacy)

Private data Sharing Interface (PSI) lets users:

  • upload private data to a secured Dataverse archive
  • decide (budget) which statistics they would like to release about that data
  • release privacy-preserving versions of those statistics to the repository
  • have those statistics explored through a curator interface, including interactive queries, without releasing the raw data

33 of 35

PSI - Budgeteer

The budgeteer lets users select which statistics they would like to calculate and gives them estimates of how accurately each statistic can be computed. Users can also redistribute their privacy budget according to which statistics they consider most valuable in their dataset.
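
The budgeting trade-off can be sketched with simple arithmetic. The example assumes each statistic is released with the Laplace mechanism, whose expected absolute error equals sensitivity / epsilon, and that the per-statistic epsilons sum to the total budget (sequential composition). The function names, statistic names, and weights are hypothetical; PSI's actual accuracy estimates may be computed differently.

```python
def split_budget(total_epsilon, weights):
    """Divide a total privacy budget across statistics in proportion to weights."""
    total_weight = sum(weights.values())
    return {name: total_epsilon * w / total_weight for name, w in weights.items()}


def expected_abs_error(epsilon, sensitivity=1.0):
    """Expected |noise| of the Laplace mechanism, which equals its scale."""
    return sensitivity / epsilon


# Weight the smoker count three times as heavily as the retiree count.
budget = split_budget(1.0, {"count_retirees": 1, "count_smokers": 3})
for name, eps in budget.items():
    print(f"{name}: epsilon={eps:.2f}, expected error={expected_abs_error(eps):.2f}")
```

Shifting budget toward one statistic makes it more accurate at the direct expense of the others, which is the choice the budgeteer exposes to the user.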

34 of 35

PSI (Differential Privacy)

35 of 35

Thank you!

Please get in touch with us!

Google Group, GitHub, IRC, Twitter - dataverse.org/contact

support@dataverse.org

Dataverse Community Meeting 2018

June 13, 14, 15 at Harvard University