Genomics: ENCODE leads the way on big data

 

Mark Gerstein

Nature 489, 208 (13 September 2012) doi:10.1038/489208b

Published online 12 September 2012

 

The ENCODE project offers a fresh perspective on big data by providing an organized framework for genomics (www.nature.com/encode). Other big-data efforts tend to focus on rapidly locating needles in petabyte-sized haystacks (finding the Higgs boson, for instance), whereas ENCODE aims to supply a structured overview.

 

ENCODE's organization of information is hierarchical, with raw data at the bottom and layers of annotation above. The processed summaries become progressively broader — for example, starting at the level of signals representing the degree to which DNA is bound by transcription factors, moving on to the locations of sites where these factors bind, and then to overviews of regulatory networks. At the summit are the linked publications documenting the annotation.
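
As a rough illustration of this layering, the sketch below models each level as an object that points to the more detailed level beneath it, so that any summary can be traced back down to the raw data. The `Layer` class, its fields and the `provenance` method are hypothetical names chosen for this example; they are not part of any ENCODE software or file format, and "sequencing reads" is only an assumed description of the raw data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layer:
    """One level of a layered annotation hierarchy (hypothetical model)."""
    name: str
    summary: str
    built_from: Optional["Layer"] = None  # the more detailed layer beneath

    def provenance(self) -> List[str]:
        """Return layer names from this layer down to the raw data."""
        names = []
        layer: Optional[Layer] = self
        while layer is not None:
            names.append(layer.name)
            layer = layer.built_from
        return names


# Layers named after the levels described in the text, bottom to top.
raw = Layer("raw data", "sequencing reads (assumed for illustration)")
signal = Layer("signal", "degree to which DNA is bound by transcription factors", raw)
sites = Layer("binding sites", "locations where these factors bind", signal)
network = Layer("regulatory network", "overview of regulatory relationships", sites)
papers = Layer("publications", "linked papers documenting the annotation", network)

# Walk from the summit back down to the raw data.
print(" -> ".join(papers.provenance()))
# publications -> regulatory network -> binding sites -> signal -> raw data
```

Linking each layer only to the one directly below keeps every summary traceable to its source, which is the property the hierarchy is meant to provide.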

 

The ENCODE data model could be useful in other fields. Astronomy and Earth science, for example, are in the process of organizing their reams of data (M. J. Raddick and A. S. Szalay Science 329, 1028–1029; 2010), but they do not yet match ENCODE's level of integration.