American Gut: The goal of the American Gut project is to study the gut microbiome of people from all over the US. It is an ongoing project, and more than 20,000 samples have been collected already. The project gives anybody an opportunity to participate and send in samples to be compared with thousands of others across the US. (Cooper Devlin)
Broad Bioimage Benchmark Collection: The Broad Bioimage Benchmark Collection is a collection of microscopy image sets. In addition to the images themselves, each set includes a description of the biological application and some type of expected results. (Runyu Hong)
Broad Cancer Cell Line Encyclopedia (CCLE): The goal of the CCLE project is to conduct a detailed genetic and pharmacologic characterization of a large panel of human cancer models, to develop integrated computational analyses that link distinct pharmacologic vulnerabilities to genomic patterns, and to translate cell line integrative genomics into cancer patient stratification. The CCLE provides public access to genomic data, analysis and visualization for over 1400 cell lines. The five major dataset types are Copy Number, mRNA expression (Affy), RPPA, RRBS, and mRNA expression (RNAseq).
Cell Image Library: The goal of the Cell Image Library (CIL) was to create a valuable research tool to promote analysis and new discoveries. The CIL has high-quality images from different organisms, cell types, and processes, both normal and pathological. The project contains 10,000 unique datasets and 20 TB of data.
Electron Microscopy Pilot Image Archive (EMPIAR): EMPIAR, the Electron Microscopy Public Image Archive, is a public resource for raw 2D electron microscopy images, with 174 datasets. EMPIAR provides easy access to the state-of-the-art 2D image data that underpins 3D cryo-EM structures of biomacromolecules and molecular machines. It complements the Electron Microscopy Data Bank (EMDB), where the corresponding 3D structures are stored, and the PDB, which stores atomic models of macromolecular structures.
ENCODE Project: The goal of ENCODE is to build a comprehensive list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control the cells and circumstances in which a gene is active. The project contains many types of data, including 5C, ChIA-PET, Hi-C, DNase-seq, FAIRE-seq, ATAC-seq, RNA-seq, CLIP-seq, RIP-seq and others. (Anna Yeaton)
Ensembl Genomes: The Ensembl genome annotation system covers the annotation, analysis and display of vertebrate genomes, as well as those of bacteria, fungi, protists, plants and metazoa. In each domain, the aim is to bring the integrative power of Ensembl tools for comparative analysis, data mining and visualisation across genomes of interest to improve and deepen genome annotation and interpretation.
Gene Expression Omnibus (GEO) is an international public repository that archives and freely distributes microarray, next-generation sequencing, and other forms of high-throughput functional genomics data submitted by the research community. The GEO DataSets database stores original submitter-supplied records (Series, Samples and Platforms) as well as curated DataSets. Curated DataSets form the basis of GEO's advanced data display and analysis features, including tools to identify differences in gene expression levels and cluster heatmaps. (Sonali Narang)
GTEx: The goal of the Genotype-Tissue Expression (GTEx) Project is to increase our understanding of how changes in our genes contribute to common human diseases, in order to improve health care for future generations. It is a resource that researchers can use to study how inherited changes in genes lead to common diseases. (Cindy Wang)
Human Microbiome Project: The project has two parts, HMP1 and iHMP. HMP1 aimed to characterize the microbiome of healthy individuals at 5 major body sites. The goal of iHMP was to characterize the microbiome and the human host in 3 microbiome-associated conditions using multiple 'omics approaches. (Jonathan Abebe)
International Genome Sample Resource (IGSR): The 1000 Genomes Project aimed to create the largest public catalogue of human variation and genotype data. It was the first project to sequence the genomes of a large number of people to provide a comprehensive resource on human genetic variation.
KEGG is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.
LINCS: The LINCS project aims to collect and disseminate the data and analytical tools needed to understand how human cells respond to perturbation by drugs, the environment, and mutation. The purpose is to discover fundamental principles of cellular response to perturbation, including the relationship between dose and response, the origin and significance of cell-to-cell variation, and the molecular basis of drug sensitivity and resistance. Data include biochemical, proteomic, and imaging assays.
miRBase: miRBase provides published miRNA sequences and annotation. Each entry in the miRBase Sequence database represents a predicted hairpin portion of a miRNA transcript, with information on the location and sequence of the mature miRNA.
openSNP contains full raw genotyping data in the file formats provided by 23andMe, deCODEme and FamilyTreeDNA. Because the files can be grouped by their variations for specific phenotypes, it is easy to obtain datasets that are already usable for association studies. The database contains information about SNPs, genotypes, phenotypes and phenotype pictures.
Pathguide contains information about 702 biological pathway-related and molecular interaction-related resources.
Pfam: The Pfam database is a large collection of protein families, each represented by multiple sequence alignments and hidden Markov models. The database also provides higher-level groupings of related entries known as clans.
Sanger Catalogue of Somatic Mutations in Cancer: The Catalogue Of Somatic Mutations In Cancer (COSMIC) is the world's largest and most comprehensive resource for exploring the impact of somatic mutations in human cancer. It is divided into several projects: COSMIC, the Cell Lines project, COSMIC-3D and the Cancer Gene Census. The data include around 32,000 genomes. COSMIC provides extensive coverage of the cancer genomic landscape from a somatic perspective.
Sanger Genomics of Drug Sensitivity in Cancer Project (GDSC): The aim of this project is to improve cancer treatments by discovering therapeutic biomarkers that can be used to identify the patients most likely to respond to anticancer drugs. The data come from a large-scale drug screen that incorporates detailed genomic analyses to systematically identify drug response biomarkers. More than 1000 genetically characterized human cancer cell lines were screened with a wide range of anti-cancer therapeutics for this purpose. The database contains drug sensitivity data and genetic correlations.
The Cancer Genome Atlas (TCGA): The TCGA project has generated comprehensive, multi-dimensional maps of the key genomic changes in 33 types of cancer. The TCGA dataset comprises more than two petabytes of genomic data, and this genomic information helps the cancer research community to improve the prevention, diagnosis, and treatment of cancer. The cancer tissues collected so far include breast, central nervous system, endocrine, gastrointestinal, gynecologic, head and neck, hematologic, skin, soft tissue, thoracic and urologic tissues. (Laura McCulloch)
The Global Proteome Machine Database: The Global Proteome Machine Database was constructed to use the information obtained by GPM servers to aid in the difficult process of validating peptide MS/MS spectra as well as protein coverage patterns. This database has been integrated into the GPM server pages, allowing users to quickly compare their experimental results with the best results previously observed by other users of the machine. (Mark Grivainis)
The Human Metabolome Database: The Human Metabolome Database (HMDB) is a freely available electronic database containing detailed information about small molecule metabolites found in the human body. It is intended to be used for applications in metabolomics, clinical chemistry, biomarker discovery and general education. The database is designed to contain or link three kinds of data: 1) chemical data, 2) clinical data, and 3) molecular biology/biochemistry data. The database contains 114,100 metabolite entries, including both water-soluble and lipid-soluble metabolites. Additionally, 5,702 protein sequences are linked to these metabolite entries.
The Human Protein Atlas aims to map all the human proteins in cells, tissues and organs by integrating various omics technologies, including antibody-based imaging, mass spectrometry-based proteomics, transcriptomics and systems biology. (Dalia Barkley)
The Personal Genome Project: The Personal Genome Project, initiated in 2005, is a coalition of projects across the world dedicated to creating public genome, health, and trait data.
The European Nucleotide Archive: The European Nucleotide Archive (ENA) is a publicly available database developed by the European Molecular Biology Laboratory (EMBL) at the European Bioinformatics Institute (EBI). It serves as a resource for primary nucleotide sequence information, such as DNA and RNA sequences, including assembled data, annotation-enriched data, and raw data, made available as soon as possible and regardless of the sequencing technology used. ENA consists of three main databases: EMBL-Bank, which holds the assembled sequence data and annotation information; the Sequence Read Archive (SRA), which holds raw reads generated by next-generation sequencing technology; and the Trace Archive, which holds raw reads generated by capillary sequencing technology. The database is very easy to use. I recently used it to download data from a URL using a GSE accession number, and it turned out to be easier to use than GEO.
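As a rough illustration of the URL-based access mentioned above, the sketch below builds a request URL for the ENA Portal API file report, which lists download locations for the sequencing runs under an accession. The helper name `ena_filereport_url`, the selected fields, and the accession `PRJEB0000` are illustrative assumptions rather than part of the original entry; the endpoint and field names should be checked against the current ENA documentation before use.

```python
from urllib.parse import urlencode

# ENA Portal API file report endpoint (assumed current; verify against the ENA docs).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_filereport_url(accession, fields=("run_accession", "fastq_ftp")):
    """Build a file report URL for a study, sample, or run accession."""
    accession = accession.strip()
    if not accession:
        raise ValueError("accession must not be empty")
    query = {
        "accession": accession,
        "result": "read_run",        # one row per sequencing run
        "fields": ",".join(fields),  # columns to include in the report
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(query)}"


if __name__ == "__main__":
    # PRJEB0000 is a placeholder accession, not a real study.
    print(ena_filereport_url("PRJEB0000"))
```

The function only constructs the URL; fetching it (for example with `urllib.request.urlopen`) is left to the caller so that the sketch has no network side effects.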
The Protein Data Bank (PDB) is the largest database of 3D structures of large biological molecules, including nucleic acids and proteins. Established in 1971, the PDB had grown to contain over 77,000 structures as of 2011. All structures in the archive have been determined experimentally. While this information is primarily used in basic research to verify protein structures and ligand conformations, computational researchers have also used structural data from the PDB to try to predict 3D conformation from DNA or protein sequences. For example, Hou et al. (Bioinformatics, 2017) developed DeepSF, a deep convolutional neural network for mapping protein sequences to folds. This database has rich potential for machine learning applications, and the immense amount of structural information on known proteins could help researchers predict and validate new proteins.
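For readers who want to pull structures into a machine learning pipeline, a minimal sketch of building a download URL for a single entry follows. It assumes the RCSB file server layout `https://files.rcsb.org/download/<ID>.<format>` and the classic four-character identifier scheme; the helper name `pdb_download_url` is hypothetical, and `4HHB` is used only as an example identifier.

```python
import re

# Classic PDB identifiers: one digit (1-9) followed by three alphanumeric characters.
# This is an assumption of the sketch; extended identifier formats are not handled.
PDB_ID = re.compile(r"^[1-9][A-Za-z0-9]{3}$")


def pdb_download_url(pdb_id, fmt="pdb"):
    """Return the assumed RCSB download URL for one entry in PDB or mmCIF format."""
    pdb_id = pdb_id.strip().upper()
    if not PDB_ID.match(pdb_id):
        raise ValueError(f"not a four-character PDB identifier: {pdb_id!r}")
    if fmt not in ("pdb", "cif"):
        raise ValueError("fmt must be 'pdb' or 'cif'")
    return f"https://files.rcsb.org/download/{pdb_id}.{fmt}"


if __name__ == "__main__":
    print(pdb_download_url("4hhb"))
    print(pdb_download_url("4hhb", fmt="cif"))
```

Validating the identifier before building the URL keeps malformed input from turning into a confusing HTTP error later in a batch download.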
The 500 Cities Project
The 500 Cities Project was launched by the Robert Wood Johnson Foundation and the CDC Foundation in partnership with the Centers for Disease Control and Prevention (CDC). The project collected data on 27 measures at the census tract level for the 500 largest cities in the United States. The data were gathered by the Behavioral Risk Factor Surveillance System (BRFSS), a system of health-related telephone surveys. The measures fall into three categories: health outcomes, prevention, and unhealthy behaviours. The dataset was first published in December 2016 using data gathered during 2014. In November 2017, 20 of the 27 measures were updated to reflect data gathered by the BRFSS during 2015. Data for the 7 measures that were not updated are gathered only in even years and therefore could not be updated.
Saccharomyces Genome Database (SGD), https://www.yeastgenome.org: SGD features sequences for a variety of yeast strains, which can be downloaded or viewed on the site’s interactive genome browser. The site also includes tools for working with yeast-related data, including BLAST, gene lists, gene ontology finders, primer design tools, restriction enzyme mappers, and more. The site’s YeastMine tool can be used to search through information on genes, proteins, interactions, phenotype, function, literature references, and more.
Mouse Genome Informatics (MGI) is a database of laboratory mouse genome data, strain information and patient-derived xenograft (PDX) data. It includes several projects, such as MGD (Mouse Genome Database), GXD (Gene Expression Database), MTB (Mouse Tumor Biology Database) and the GO (Gene Ontology) project for mouse. The data include mouse strain characterization, nomenclature, mouse gene expression data from various studies, genomic data on mouse tumors, gene annotations and so on. Mice are still commonly used subjects for scientific studies, for example in PDX models for cancer target validation. This database provides valuable information and tools for understanding or mimicking human mechanisms. For example, the MGI curated mouse–human homolog information is one of the most useful gene conversion tools for researchers using syngeneic mouse tumor models to extrapolate the mechanism of action or treatment effect of therapeutic compounds or molecules in the human body.
The Progenitor Cell Biology Consortium (!Synapse:syn1773109/wiki/54962): The Progenitor Cell Biology Consortium was created in order to standardize these protocols and characterize the cells obtained. The results obtained by 10 laboratories are freely available through Synapse, which has convenient R, Python and Java packages in addition to its web interface. These sets were curated to include only true PSCs based on a number of criteria (self-renewal, expression of pluripotency markers, no expression of differentiation markers, normal karyotype). The project, completed in 2015, includes 64 cell lines, each representing 7 states including stem cell, ectoderm, endoderm and mesoderm. For each of these, the database provides metadata including the cell of origin, the reprogramming vector and gene, the laboratory, and a number of quality metrics. The characterization results include miRNA and mRNA sequencing data, as well as DNA methylation, and are available both as raw data and as summarized matrices. A user-friendly interface is also available to explore the data visually before download.
New York City Health and Nutrition Examination Survey: The New York City Health and Nutrition Examination Survey (NYC HANES) is a community-based health survey. The first citywide survey was conducted in 2004 and the most recent one in 2013-2014. The data were collected through physical examinations, clinical and laboratory tests, and face-to-face interviews. The interviewees were all 20 years or older, and households were sampled at random throughout the city to represent the total population. Most of the papers using this database were published in 2004, and there have been over 20 publications using this dataset.
The Harvard Personal Genomes Project: The Personal Genomes Project is headed by Dr. George Church. It was started in 2005 and now has almost 10,000 people enrolled. People volunteer their genome and health data to the project. The database includes microbiome data from several body sites, participant profiles including electronic medical records, survey data, and genome data. The data have been used to create more ethnically accurate reference genomes and to find associations between splice sites and diabetes.
Acute Lymphoblastic Leukemia Image Database for Image Processing (ALL-IDB) is a free, public dataset of microscopic images of blood samples, initially focused on Acute Lymphoblastic Leukemia (ALL), a serious blood pathology most common in children aged 3-5 that can be fatal in as little as a few weeks if left untreated. It is specifically designed for the evaluation and comparison of algorithms for segmentation and image classification. For each image in the dataset, the classification and position of ALL lymphoblasts are provided by expert oncologists.
