Session 02: Using Terra for data discovery and access
Allie Hajian
Data Sciences Platform, Broad Institute
Learning Objectives
The Big Data Problem
Research relies on vast stores of data generated by individual labs and large community-wide projects
How can researchers take advantage of this wealth of information?
…find and download the right data?
…interpret the format correctly?
…perform computations quickly?
Data is getting too big to practically download and store!
Solution: use the cloud to share data
Traditional Way: Bring data to the researchers
Cloud Way: Bring researchers to the data
Problems
Data sharing = data copying
High storage costs
Individual security implementations
Solutions
True data sharing
Store once, access by many
Centralized security implementation
Leveraging the Cloud for Research
Oct 2017: https://medium.com/@benedictpaten/a-data-biosphere-for-biomedical-research-d212bbfae95d
“…we propose the idea of creating a vibrant ecosystem… containing modular and interoperable components that can be assembled into diverse data environments”
How should a data biosphere be structured?
Modular | Composed of functional components with well-specified interfaces |
Community focused | Created by many groups to foster a diversity of ideas |
Open | Open-source licenses, software, architecture to enable extensibility |
Standards Based | Consistent with standards developed by coalitions such as GA4GH |
A cloud-based analysis platform to access and analyze data
NIH Interoperability Example
TOOLS
WORKSPACES
Scientific Discoveries
NHLBI DATA
(TOPMed)
Dockstore
NHGRI DATA
(CCDG, CMG)
BRING YOUR OWN DATA
Terra: An end-to-end, cloud-native platform
The Workspace - The fundamental unit in Terra
Organize data and tools in the Terra workspace
Data in dedicated CCDG and CMG workspaces
Two different modes of computation
Accessible, findable bulk analysis tools
Broad Methods Repository
Interactive analysis with Jupyter Notebooks
Containers for portability and reproducibility
GATK 2.8
Java 7
R 2.5.0
GATK 4.0
Java 8
R 3.0.1
BWA
Picard
A Docker container encapsulates all the software dependencies associated with running a program
Takes the guesswork out of running workflows or notebooks on different platforms.
Standardize the analysis environment by specifying a Docker container that includes the exact libraries and packages used for your analysis.
Anyone using the same Docker image runs the analysis in the same software environment, which supports reproducible results
Curated analyses in pre-loaded Showcase Workspaces
How to collaborate and share in Terra
Workspaces are shareable and you’re in control
Invite people to work with you; control who has access to the research assets in your workspaces
Workspaces are private by default. Sharing permissions include:
Enforce privacy and security with Authorization Domains
Owners of controlled-access data can restrict the list of people with whom any derived workspaces can be shared
Security is a priority that’s built into the platform
Terra configures workspace resources appropriately so you can work confidently in the cloud and spend less time and effort on management
Workspace permissions set data access and billing
To manage access to resources, resource owners will assign permissions (roles) to individuals or groups.
Examples of resources
Examples of roles
How workspace permissions set data access
Shared data resources
Workspace #1
Generated data
Workspace #2 (clone)
User A is owner
User B is owner
Note: User C now has unauthorized access to derived data
This scenario illustrates how using permissions alone can lead to unauthorized access
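The scenario above can be sketched as a toy model. This is only an illustration under stated assumptions: the `Workspace` class and its `share`, `can_read`, and `clone` methods are hypothetical names, not Terra's API, and the sketch assumes that User B clones Workspace #1 and then shares the clone with User C.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Toy workspace whose access is governed by permissions only (hypothetical)."""
    name: str
    owner: str
    readers: set = field(default_factory=set)

    def share(self, user):
        # Any owner can grant access to anyone; nothing restricts the recipient.
        self.readers.add(user)

    def can_read(self, user):
        return user == self.owner or user in self.readers

    def clone(self, new_name, new_owner):
        # A user who can read a workspace can clone it and owns the copy.
        if not self.can_read(new_owner):
            raise PermissionError(f"{new_owner} cannot read {self.name}")
        return Workspace(new_name, new_owner)


ws1 = Workspace("Workspace #1", owner="User A")
ws1.share("User B")                       # User A shares with User B

ws2 = ws1.clone("Workspace #2 (clone)", new_owner="User B")
ws2.share("User C")                       # User B shares the clone onward

# User C was never granted access by User A, yet can read the derived data.
leak = ws2.can_read("User C") and not ws1.can_read("User C")
print("Unauthorized access to derived data:", leak)
```

In this model the original owner has no say over who receives the clone, which is the gap the slide points out.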
How much it costs and how to pay
Protecting Controlled-Access Data
Terra uses Authorization Domains to enforce limits on data access
How Authorization Domains protect data
Authorization Domain #1
User A and User B
Shared data resources
Workspace #1
User A is owner
Generated data
Workspace #2 (clone)
User B is owner
Authorization Domain #1
User A and User B
User B cannot share with anyone outside the Authorization Domain.
Any new collaborator must be added to the Authorization Domain to access the shared or generated data resources
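A minimal sketch of the Authorization Domain idea, assuming a toy model rather than Terra's real implementation: the `Workspace` class, its methods, and the `domain` set are hypothetical names introduced for illustration. The sketch assumes that a clone inherits the Authorization Domain of its source and that access requires both a sharing permission and membership in the domain.

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    """Toy workspace protected by an Authorization Domain (hypothetical)."""
    name: str
    owner: str
    auth_domain: set                      # group membership, shared by reference
    readers: set = field(default_factory=set)

    def share(self, user):
        # Sharing records a permission but does not grant access by itself.
        self.readers.add(user)

    def can_read(self, user):
        has_permission = user == self.owner or user in self.readers
        # Access requires a permission AND membership in the domain.
        return has_permission and user in self.auth_domain

    def clone(self, new_name, new_owner):
        if not self.can_read(new_owner):
            raise PermissionError(f"{new_owner} cannot read {self.name}")
        # Assumption: the clone carries the same Authorization Domain.
        return Workspace(new_name, new_owner, self.auth_domain)


domain = {"User A", "User B"}             # Authorization Domain #1
ws1 = Workspace("Workspace #1", "User A", domain)
ws1.share("User B")

ws2 = ws1.clone("Workspace #2 (clone)", "User B")
ws2.share("User C")

before = ws2.can_read("User C")           # User C is outside the domain
domain.add("User C")                      # collaborator added to the domain
after = ws2.can_read("User C")
print("Before joining the domain:", before)
print("After joining the domain:", after)
```

Because the clone points at the same group, adding or removing a member changes access to the original and every derived workspace at once.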
Authorization Domains Instructions and Caveats
2. Select the group as an Authorization Domain when creating a workspace
Spend time doing science… not wrangling software
For hands-on practice try