Genomics: ENCODE leads the way on big data
Nature 489, 208 (13 September 2012) doi:10.1038/489208b
Published online 12 September 2012
The ENCODE project offers a fresh perspective on big data by providing an organized framework for genomics (www.nature.com/encode). Other big-data efforts tend to focus on rapidly locating needles in petabyte-sized haystacks (finding the Higgs boson, for instance), whereas ENCODE aims to supply a structured overview.
ENCODE's organization of information is hierarchical, with raw data at the bottom and layers of annotation above. The processed summaries become progressively broader — for example, starting at the level of signals representing the degree to which DNA is bound by transcription factors, moving on to the locations of sites where these factors bind, and then to overviews of regulatory networks. At the summit are the linked publications documenting the annotation.
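As a rough illustration of this layering, the hierarchy can be sketched as a chain of records in which each layer points to the one it summarizes. This is a minimal sketch, not ENCODE's actual schema: the `Layer` class, its fields and the `provenance` method are hypothetical names chosen for illustration, and only the layer descriptions are taken from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layer:
    """One level of the hierarchy (hypothetical structure, for illustration only)."""
    name: str
    description: str
    derived_from: Optional["Layer"] = None  # the layer directly below, if any

    def provenance(self) -> List[str]:
        """Return layer names from this layer down to the raw data."""
        names = []
        layer = self
        while layer is not None:
            names.append(layer.name)
            layer = layer.derived_from
        return names


# Layers as described in the text, from the bottom of the hierarchy to the top.
raw = Layer("raw data", "unprocessed data at the bottom of the hierarchy")
signal = Layer("signal", "degree to which DNA is bound by transcription factors", raw)
sites = Layer("binding sites", "locations of sites where the factors bind", signal)
networks = Layer("regulatory networks", "overviews of regulatory networks", sites)
publications = Layer("publications", "linked papers documenting the annotation", networks)

# Walk from the summit back down to the raw data.
print(" -> ".join(publications.provenance()))
```

Linking each layer to the one below it means that any summary at the top can, in principle, be traced back to the data it was built from.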
The ENCODE data model could be useful in other fields. Astronomy and Earth science, for example, are in the process of organizing their reams of data (M. J. Raddick and A. S. Szalay Science 329, 1028–1029; 2010), but do not yet match ENCODE's level of integration.