Secure, Real-Time Sharing of Cancer Gene Information

Cancer is a disease of the genome. Thanks to new genome sequencing technology, the last ten years have brought breathtaking advances in our understanding of which genomic mutations accumulate in cancers. We have also come to realize how vast the number of possible mutations is across the three billion letters of our genome. For the majority of these mutations, we do not know whether they drive uncontrolled cell growth or are mere passengers with little to no effect. To understand them we need data, but today researchers are limited to small curated datasets, numbering in the thousands, from past participants. We need millions of data points as a reference, and new data points within weeks for prospective participants. In this workshop we’ll cover the basics of cancer genomics as well as emerging approaches to sharing genomic data globally in real time, bringing the compute to the data to preserve privacy, optimizing large-scale genomic computing for speed, and managing both the data and the compute.

Session I

Monday October 24 10:15-12:00

The Cancer Gene Trust

www.cancergenetrust.org

Rob Currie, UCSC Genomics Institute

The GA4GH Cancer Gene Trust (CGT) is a decentralized, distributed, content-addressable, real-time database. Stewards working directly with patient-participants publish de-identified genomic and basic clinical data. Researchers who find rare variants, or combinations of variants, in this global resource that are associated with specific clinical features of interest may then contact the data stewards for those participants. A submission consists of a manifest containing fields and references to files by multihash. Initial submissions will likely include de-identified clinical data, a list of somatic mutations, and gene expression data, although any type of data can be submitted and shared.
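To make the idea of a manifest that references files by multihash concrete, here is a minimal sketch in Python. It is illustrative only and is not the CGT implementation: the manifest layout, the field names, and the helper functions `sha256_multihash` and `build_manifest` are assumptions made for this example. It also hashes the raw file bytes directly, which may differ from the identifier a content-addressable store would assign to the same file.

```python
import hashlib
import json

# Base58 alphabet used by the common text form of multihashes (Bitcoin variant).
BASE58_ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"


def base58_encode(data: bytes) -> str:
    """Encode bytes as base58, preserving leading zero bytes as '1'."""
    number = int.from_bytes(data, "big")
    encoded = ""
    while number > 0:
        number, remainder = divmod(number, 58)
        encoded = BASE58_ALPHABET[remainder] + encoded
    leading_zeros = len(data) - len(data.lstrip(b"\x00"))
    return "1" * leading_zeros + encoded


def sha256_multihash(content: bytes) -> str:
    """Return the base58 sha2-256 multihash of raw content.

    A multihash is <hash function code><digest length><digest>;
    0x12 is the code for sha2-256 and 0x20 is its 32-byte length.
    """
    digest = hashlib.sha256(content).digest()
    return base58_encode(bytes([0x12, len(digest)]) + digest)


def build_manifest(fields: dict, files: dict) -> dict:
    """Build a hypothetical submission manifest.

    `fields` holds de-identified key/value data; `files` maps a file
    name to its bytes. Files are referenced by multihash only.
    """
    return {
        "fields": fields,
        "files": [
            {"name": name, "multihash": sha256_multihash(data)}
            for name, data in sorted(files.items())
        ],
    }


# Placeholder content for illustration, not real participant data.
manifest = build_manifest(
    {"diagnosis": "example-diagnosis"},
    {"somatic.vcf": b"placeholder variant calls\n"},
)
print(json.dumps(manifest, indent=2))
```

Because each reference is derived from the file's content, anyone holding the manifest can verify that a retrieved file is the one the steward published, and any change to the file yields a different reference.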

ADAM

http://bdgenomics.org/projects/adam/

Frank Austin Nothaft, UC Berkeley

ADAM provides both an application programming interface (API) and a command line interface (CLI) for manipulating genomic data on a computing cluster. ADAM operates on data stored in Parquet with the bdg-formats schemas, using Apache Spark, and provides scalable performance on clusters larger than 100 machines. ADAM is on GitHub. Quick start guides are available for running ADAM on EC2, and for building ADAM for specific CDH releases.

Session II

Monday October 24 1:30-3:00

Toil

https://toil.readthedocs.io

Hannes Schmidt, UCSC Genomics Institute

Toil is a scalable, efficient and easy-to-learn workflow engine. Workflows are written in object-oriented or functional Python, and Toil can also run standard CWL workflows. It supports a range of target environments: a single machine, bare-metal clusters (GridEngine, SLURM, LSF and Parasol) or commercial cloud platforms (Amazon, Google and Microsoft). It automatically scales clusters of virtual machines in AWS, and will soon be able to do the same in other cloud environments. Toil is available as a Docker image on quay.io or from PyPI via `pip install`. The Toil ecosystem delivers Docker images for prominent bioinformatics tools as well as Toil workflows that connect those tools into state-of-the-art genome analysis pipelines.

UCSC Genomics Analysis Core Architecture

Brian O’Connor, UCSC Genomics Institute

The analysis core at UCSC develops the infrastructure and pipelines to process large amounts of genomic data for research and clinical programs. The architecture of this system is inspired by past research efforts, with the addition of new tools such as Docker, Toil and ADAM, as well as data management solutions to coordinate with a growing list of outside medical institutions involved in translational genomics.

Smart Storage Devices in Genomics

Carlos Maltzahn, UCSC Center for Research in Open Source Software

Genomics is unique in that, to fulfill its promise, it requires extreme-scale, data-intensive processing with strong privacy guarantees. While there have been theoretical advances in the latter, practical solutions are still many orders of magnitude slower than data processing without privacy guarantees. This talk explores the implications for storage systems and some of the opportunities that “smart” storage devices might provide by offering in-storage computing.