1 of 26

Session 02: Using Terra for data discovery and access

Allie Hajian

Data Sciences Platform, Broad Institute

2 of 26

Learning Objectives

  • Understand how the Terra platform fits into the AnVIL ecosystem
  • Understand how to use the components of cloud-native Terra for research
    • Data
    • Bulk Analysis
    • Interactive analysis
    • Security
  • Understand the function of billing and permissions in Terra
  • Understand how to protect controlled-access data with Authorization Domains

3 of 26

The Big Data Problem

Research relies on vast stores of data generated by individual labs and large community-wide projects

How can researchers take advantage of this wealth of information?

…find and download the right data?

…interpret the format correctly?

…perform computations quickly?

Data is getting too big to practically download and store!

4 of 26

Solution: use the cloud to share data

Traditional Way: Bring data to the researchers

Cloud Way: Bring researchers to the data

Problems

Data sharing = data copying

High storage costs

Individual security implementations

Solutions

True data sharing

Store once, access by many

Centralized security implementation

5 of 26

Leveraging the Cloud for Research

Oct 2017: https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d

“…we propose the idea of creating a vibrant ecosystem… containing modular and interoperable components that can be assembled into diverse data environments”

6 of 26

How should a data biosphere be structured?

Modular

Composed of functional components with well-specified interfaces

Community focused

Created by many groups to foster a diversity of ideas

Open

Open-source licenses, software, architecture to enable extensibility

Standards Based

Consistent with standards developed by coalitions such as GA4GH

7 of 26

A cloud-based analysis platform to access and analyze data

8 of 26

NIH Interoperability Example

TOOLS

WORKSPACES

Scientific Discoveries

NHLBI DATA

(TOPMed)

Dockstore

NHGRI DATA

(CCDG, CMG)

BRING YOUR OWN DATA

9 of 26

Terra: An end-to-end, cloud-native platform

10 of 26

The Workspace - The fundamental unit in Terra

11 of 26

Organize data and tools in the Terra workspace

12 of 26

Data in dedicated CCDG and CMG workspaces

13 of 26

Two different modes of computation

  • Bulk analysis workflows
    • Some jobs take a while, so launch them and come back later! (WDL and Cromwell)
  • Interactive analysis
    • Run R or Python via Jupyter Notebooks, or run shell scripts on the terminal
    • Interrogate data in real time
    • Interactive Jupyter Notebooks
    • GWAS with Hail

14 of 26

Accessible, findable bulk analysis tools

  • Align & QC sequence data per sample
  • Call short variants per sample
  • Joint-call across population
  • Filter & QC variants

Broad Methods Repository

15 of 26

Interactive analysis with Jupyter Notebooks

16 of 26

Containers for portability and reproducibility

GATK 2.8

Java 7

R 2.5.0

GATK 4.0

Java 8

R 3.0.1

BWA

Picard

A Docker container encapsulates all the software dependencies associated with running a program

Takes the guesswork out of running workflows or notebooks on different platforms.

Standardize the analysis environment by specifying a Docker container that includes the exact libraries and packages used for your analysis.

Anyone using the same Docker image runs the analysis in the same software environment, so the same inputs should give the same results

17 of 26

Curated analyses in pre-loaded Showcase Workspaces

18 of 26

How to collaborate and share in Terra

Workspaces are shareable and you’re in control

Invite people to work with you; control who has access to the research assets in your workspaces

Workspaces are private by default. Sharing permissions include:

    • Reader
    • Writer
    • Owner

Enforce privacy and security with Authorization Domains

Owners of controlled-access data can restrict the list of people with whom any derived workspaces can be shared

19 of 26

Security is a priority that’s built into the platform

  • Certified FISMA moderate from the NCI for hosting controlled-access data owned by NCI and the NIH
  • Leverages security features in Google Cloud Platform for Federal Risk and Authorization Management Program (FedRAMP) authorization
  • 15-minute timeout for clinical researchers

Terra configures workspace resources appropriately so you can work confidently in the cloud and spend less time and effort on management

20 of 26

Workspace permissions set data access, billing

To manage access to resources, resource owners will assign permissions (roles) to individuals or groups.

Examples of resources

  • Terra Billing Projects
  • Workspaces (including data stored in the workspace bucket)
  • Workflow collections

Examples of roles

  • Owner - may add/remove users (grant access), lock workspace, etc.
  • Writer - may write to the metadata, method configs, etc.
  • Reader - may read the metadata, method configs, etc.
  • Can-compute - able to launch batch compute and interactive analyses (notebooks)
  • Share-writer - able to grant others write access
  • Share-reader - able to grant others read access
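
The roles above can be pictured as sets of allowed actions. The following is only a minimal sketch in Python, not Terra's implementation: the `Action` names, the `ROLE_ACTIONS` mapping, and the `allowed` helper are all hypothetical, and the choices that a writer can also read and that an owner can do everything are assumptions made for illustration.

```python
from enum import Enum, auto


class Action(Enum):
    READ = auto()         # read metadata, method configs, etc.
    WRITE = auto()        # write metadata, method configs, etc.
    COMPUTE = auto()      # launch batch compute and notebooks
    SHARE_READ = auto()   # grant others read access
    SHARE_WRITE = auto()  # grant others write access
    MANAGE = auto()       # add/remove users, lock the workspace


# Hypothetical mapping from the role names on the slide to allowed actions.
# Assumptions: writer implies read, and owner implies every action.
ROLE_ACTIONS = {
    "reader": {Action.READ},
    "writer": {Action.READ, Action.WRITE},
    "owner": set(Action),
    "can-compute": {Action.COMPUTE},
    "share-reader": {Action.SHARE_READ},
    "share-writer": {Action.SHARE_WRITE},
}


def allowed(roles, action):
    """Return True if any of the user's roles permits the action."""
    return any(action in ROLE_ACTIONS[role] for role in roles)


# A writer cannot launch compute unless can-compute is granted as well.
writer_only = allowed({"writer"}, Action.COMPUTE)
writer_with_compute = allowed({"writer", "can-compute"}, Action.COMPUTE)
```

Treating roles as combinable sets reflects how the slide pairs them, for example "writer - can compute".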

21 of 26

How workspace permissions set data access

Shared data resources

Workspace #1

Generated data

Workspace #2 (clone)

User A is owner

  • Shares with User B (writer - can copy)
  • User B can access shared data
  • User A’s Billing Project covers all costs

User B is owner

  • User B’s Billing Project covers costs
  • Shares with User C (writer - can compute)

Note: User C now has unauthorized access to derived data

This scenario illustrates how using permissions alone can lead to unauthorized access

22 of 26

How much it costs and how to pay

  • Terra itself won’t cost you anything - To enable scientists to get work done, the open-source platform is available and free for everyone

  • Pay only for what you use - Storage, compute, and egress

  • Terra Billing Accounts linked to Google Billing - A securely configured billing project linked to a Billing Account on the Google Cloud Platform can be shared with members of your lab or department for use in Terra.
    • Institutional billing accounts
    • STRIDES
    • Free credits
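
"Pay only for what you use" means a bill is the sum of three metered items: storage, compute, and egress. The sketch below shows that arithmetic in Python. It is an illustration only: the `Rates` class, the `estimate_monthly_cost` function, and every rate in `PLACEHOLDER_RATES` are made-up placeholders, not Google Cloud or Terra prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Unit prices in dollars. All values used below are placeholders."""
    storage_per_gb_month: float
    compute_per_vcpu_hour: float
    egress_per_gb: float


def estimate_monthly_cost(rates, stored_gb, vcpu_hours, egress_gb):
    """Break a monthly bill into storage, compute, and egress charges."""
    storage = rates.storage_per_gb_month * stored_gb
    compute = rates.compute_per_vcpu_hour * vcpu_hours
    egress = rates.egress_per_gb * egress_gb
    return {
        "storage": storage,
        "compute": compute,
        "egress": egress,
        "total": storage + compute + egress,
    }


# Placeholder rates chosen only to make the arithmetic easy to follow.
PLACEHOLDER_RATES = Rates(
    storage_per_gb_month=0.02,
    compute_per_vcpu_hour=0.05,
    egress_per_gb=0.10,
)

bill = estimate_monthly_cost(
    PLACEHOLDER_RATES, stored_gb=500, vcpu_hours=200, egress_gb=50
)
```

Keeping data in the cloud and computing next to it keeps the egress term small, which is the cost argument behind bringing researchers to the data.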

23 of 26

Protecting Controlled-Access Data

Terra uses Authorization Domains to enforce limits on data access

  • Workspace access only to users in the Authorization Domain
  • AD stays with workspace copies
  • Protects all data in the workspace
    • Phenotype data
    • Derived data
    • Data from multiple sources

24 of 26

How Authorization Domains protect data

Authorization Domain #1

User A and User B

Shared data resources

Workspace #1

User A is owner

Generated data

Workspace #2 (clone)

User B is owner

Authorization Domain #1

User A and User B

User B cannot share with anyone outside the Authorization Domain.

Any new collaborator must be added to the Authorization Domain to access the shared or generated data resources
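
The rule on this slide can be sketched as two checks that must both pass: the user is on the workspace's sharing list, and the user is a member of the Authorization Domain. The Python below is a toy model under that assumption, not Terra's code. The `Workspace` class and its `share`, `can_access`, and `clone` methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str
    owner: str
    # An empty set models a workspace with no Authorization Domain.
    auth_domain: frozenset = frozenset()
    shared_with: set = field(default_factory=set)

    def share(self, user):
        """Share the workspace; refuse users outside the Authorization Domain."""
        if self.auth_domain and user not in self.auth_domain:
            raise PermissionError(f"{user} is not in the Authorization Domain")
        self.shared_with.add(user)

    def can_access(self, user):
        """Access needs workspace permission AND domain membership."""
        has_permission = user == self.owner or user in self.shared_with
        in_domain = not self.auth_domain or user in self.auth_domain
        return has_permission and in_domain

    def clone(self, name, new_owner):
        """A clone keeps the Authorization Domain of the original."""
        if not self.can_access(new_owner):
            raise PermissionError(f"{new_owner} cannot access {self.name}")
        return Workspace(name, new_owner, self.auth_domain)


domain = frozenset({"user_a", "user_b"})
ws1 = Workspace("workspace-1", "user_a", domain)
ws1.share("user_b")
ws2 = ws1.clone("workspace-2", "user_b")

try:
    ws2.share("user_c")  # user_c is outside the domain
    blocked = False
except PermissionError:
    blocked = True
```

Creating `ws1` without a domain reproduces the earlier permissions-only scenario: the clone has no domain either, so User B can share it with User C and the derived data leaks.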

25 of 26

Authorization Domains Instructions and Caveats

  • Authorization Domains must be set up before creating a workspace. You cannot apply or change Authorization Domains for an existing workspace.
  • Include all collaborators that you want to share your data with. Note that this list can be updated!

  • ADs protect access to all data, including derived files
  1. Create a Group from the main menu (top left). Note that this is also where you can add AD members.
  2. Select the group as an Authorization Domain when creating a workspace.

26 of 26

Spend time doing science… Not wrangling software

For hands-on practice try