AWS provides 366 open datasets to the world, covering all scientific domains: https://aws.amazon.com/opendata
A selection of open datasets related to the life sciences is available today at https://aim-ahead.net/swb
The primary focus is genomics datasets. Several of these datasets may be of interest to AIM-AHEAD investigators and students.
The availability of social determinants of health (SDoH) variables in these datasets still needs to be explored.
Acronym | Name and description | More information | Access type | Link to request access
EMBED | EMory BrEast Imaging Dataset | https://registry.opendata.aws/emory-breast-imaging-dataset-embed/ | controlled access | https://forms.gle/HwGMM6vdv3w32TKF91
This dataset was created during an AIM-AHEAD Pilot project. Dataset description: EMBED is a racially diverse mammography dataset containing 3.4M screening and diagnostic images from 110,000 patients collected from 2013-2020, with an equal representation of black and white women. The dataset is comprised of 2D, synthetic 2D (C-view), and 3D (digital breast tomosynthesis, i.e. DBT) images. It contains 60,000 annotated lesions linked to structured imaging descriptors and ground truth pathologic outcomes grouped into six severity classes. This release represents 20% of the total 2D and C-view dataset and is available for research use. DBT, US, and MRI exams will be added at a later date. Acknowledgements - We would like to thank Glendor, Inc and MD.ai for assistance with image de-identification.
1000-genomes | 1000 Genomes | https://registry.opendata.aws/1000-genomes/ | open (no authorization required) | https://aws.amazon.com/cli/
The 1000 Genomes Project is an international collaboration which has established the most detailed catalogue of human genetic variation, including SNPs, structural variants, and their haplotype context. The final phase of the project sequenced more than 2500 individuals from 26 different populations around the world and produced an integrated set of phased haplotypes with more than 80 million variants for these individuals.
tcga | The Cancer Genome Atlas | https://registry.opendata.aws/tcga/ | both open access and controlled access | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000178.v1.p1
The Cancer Genome Atlas (TCGA), a collaboration between the National Cancer Institute (NCI) and National Human Genome Research Institute (NHGRI), aims to generate comprehensive, multi-dimensional maps of the key genomic changes in major types and subtypes of cancer. TCGA has analyzed matched tumor and normal tissues from 11,000 patients, allowing for the comprehensive characterization of 33 cancer types and subtypes, including 10 rare cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA Expression Quantification, Genotyping Array Copy Number Segment, Genotyping Array Masked Copy Number Segment, Genotyping Array Gene Level Copy Number Scores, and WXS Masked Somatic Mutation data from Genomic Data Commons (GDC). This dataset also contains controlled Whole Exome Sequencing (WXS), RNA-Seq, miRNA-Seq, ATAC-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and WXS Aggregated Somatic Mutation data from GDC. TCGA is made available on AWS via the [NIH STRIDES Initiative](https://aws.amazon.com/blogs/publicsector/aws-and-national-institutes-of-health-collaborate-to-accelerate-discoveries-with-strides-initiative/).
broad-gnomad | Genome Aggregation Database (gnomAD) | https://registry.opendata.aws/broad-gnomad/ | open (no authorization required) | https://aws.amazon.com/cli/
The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators that aggregates and harmonizes both exome and genome data from a wide range of large-scale human sequencing projects. The summary data provided here are released for the benefit of the wider scientific community without restriction on use. The v2 data set (GRCh37) spans 125,748 exome sequences and 15,708 whole-genome sequences from unrelated individuals. The v3 data set (GRCh38) spans 71,702 genomes, selected as in v2. Sign up for the gnomAD mailing list [here](http://broad.io/gnomad_list).
broad-pan-ukb | UK Biobank Pan-Ancestry Summary Statistics | https://registry.opendata.aws/broad-pan-ukb/ | open (no authorization required) | https://aws.amazon.com/cli/
A multi-ancestry analysis of 7,221 phenotypes using a generalized mixed model association testing framework, spanning 16,119 genome-wide association studies. We provide standard meta-analysis across all populations and with a leave-one-population-out approach for each trait. The data are provided in tsv format (per phenotype) and Hail MatrixTable (all phenotypes and variants). Metadata is provided in phenotype and variant manifests.
kids-first | Gabriella Miller Kids First Pediatric Research Program (Kids First) | https://registry.opendata.aws/kids-first/ | controlled access | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001138.v3.p2
The NIH Common Fund's Gabriella Miller Kids First Pediatric Research Program’s (“Kids First”) vision is to “alleviate suffering from childhood cancer and structural birth defects by fostering collaborative research to uncover the etiology of these diseases and by supporting data sharing within the pediatric research community.” The program continues to generate and share whole genome sequence data from thousands of children affected by these conditions, ranging from rare pediatric cancers, such as osteosarcoma, to more prevalent diagnoses, such as congenital heart defects. In 2018, Kids First launched the Gabriella Miller Kids First Data Resource Center, charged with building a large-scale data platform supporting clinical and genetic data from these patients and their families in order to accelerate discovery and ultimately clinical impact. Researchers can search, access, aggregate, and analyze these data through the Kids First Data Resource Portal. Additionally, by using cloud-based individual workspaces in CAVATICA, a data analysis and sharing computation platform, researchers can cross-analyze Kids First data with data from other efforts, such as NCI’s TARGET program and consortia-based datasets like the Children’s Brain Tumor Tissue Consortium (CBTTC). Kids First is made available on AWS via the [NIH STRIDES Initiative](https://aws.amazon.com/blogs/publicsector/aws-and-national-institutes-of-health-collaborate-to-accelerate-discoveries-with-strides-initiative/).
target | Therapeutically Applicable Research to Generate Effective Treatments (TARGET) | https://registry.opendata.aws/target/ | open (no authorization required) | https://aws.amazon.com/cli/
Therapeutically Applicable Research to Generate Effective Treatments (TARGET) is the collaborative effort of a large, diverse consortium of extramural and NCI investigators. The goal of the effort is to accelerate molecular discoveries that drive the initiation and progression of hard-to-treat childhood cancers and facilitate rapid translation of those findings into the clinic. TARGET projects provide comprehensive molecular characterization to determine the genetic changes that drive the initiation and progression of childhood cancers. The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, miRNA-Seq miRNA Expression Quantification data from Genomic Data Commons (GDC), and open data from GDC Legacy Archive.
hcmi-cmdc | Human Cancer Models Initiative (HCMI) Cancer Model Development Center | https://registry.opendata.aws/hcmi-cmdc/ | open (no authorization required) | https://aws.amazon.com/cli/
The Human Cancer Models Initiative (HCMI) is an international consortium that is generating novel, next-generation, tumor-derived culture models annotated with genomic and clinical data. HCMI-developed models and related data are available as a community resource. The NCI is contributing to the initiative by supporting four Cancer Model Development Centers (CMDCs). CMDCs are tasked with producing next-generation cancer models from clinical samples. The cancer models include tumor types that are rare, originate from patients from underrepresented populations, lack precision therapy, or lack cancer model tools. Throughout the development process, the CMDCs utilize stringent internal QC measures to ensure both clinical and molecular integrity. These models are then annotated with clinical and genomic data and are available as a community resource.
cgci | Cancer Genome Characterization Initiatives - Burkitt Lymphoma, HIV+ Cervical Cancer | https://registry.opendata.aws/cgci/ | both open access and controlled access | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000235.v14.p2
The Cancer Genome Characterization Initiatives (CGCI) program supports cutting-edge genomics research of adult and pediatric cancers. CGCI investigators develop and apply advanced sequencing methods that examine genomes, exomes, and transcriptomes within various types of tumors. The program includes the Burkitt Lymphoma Genome Sequencing Project (BLGSP) and the HIV+ Tumor Molecular Characterization Project - Cervical Cancer (HTMCP-CC). The dataset contains open Clinical Supplement, Biospecimen Supplement, RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data. This dataset also contains controlled WGS/Targeted Sequencing/RNA-Seq/miRNA-Seq Aligned Reads, and RNA-Seq Splice Junction Quantification.
organoid-pancreatic | Pancreatic Cancer Organoid Profiling | https://registry.opendata.aws/organoid-pancreatic/ | both open access and controlled access | https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs001611.v1.p1
This study generated a collection of patient-derived pancreatic normal and cancer organoids, which were sequenced using Whole Genome Sequencing (WGS), Whole Exome Sequencing (WXS) and RNA-Seq, along with matched tumor and normal tissue where available. The study provides a valuable resource for pancreatic cancer researchers. The dataset contains open RNA-Seq Gene Expression Quantification data and controlled WGS/WXS/RNA-Seq Aligned Reads, WXS Annotated Somatic Mutation, WXS Raw Somatic Mutation, and RNA-Seq Splice Junction Quantification.
nciccr-dlbcl | National Cancer Institute Center for Cancer Research - Diffuse Large B Cell Lymphoma (DLBCL) Genomics and Expression | https://registry.opendata.aws/nciccr-dlbcl/ | open (no authorization required) | https://aws.amazon.com/cli/
The study describes integrative analysis of genetic lesions in 574 diffuse large B cell lymphomas (DLBCL) involving exome and transcriptome sequencing, array-based DNA copy number analysis and targeted amplicon resequencing. The dataset contains open RNA-Seq Gene Expression Quantification data.
mmrf-commpass | CoMMpass from the Multiple Myeloma Research Foundation | https://registry.opendata.aws/mmrf-commpass/ | open (no authorization required) | https://aws.amazon.com/cli/
The Relating Clinical Outcomes in Multiple Myeloma to Personal Assessment of Genetic Profile study is the Multiple Myeloma Research Foundation (MMRF)’s landmark personalized medicine initiative. CoMMpass is a longitudinal observational study of around 1000 newly diagnosed myeloma patients receiving various standard approved treatments. The MMRF’s vision is to track the treatment and results for each CoMMpass patient so that someday the information can be used to guide decisions for newly diagnosed patients. CoMMpass checked on patients every 6 months for 8 years, collecting tissue samples, genetic information, quality of life data, and various disease and clinical outcomes. The study has produced one of the largest genomic and clinical datasets of a single disease.
hcp-openaccess | The Human Connectome Project | https://registry.opendata.aws/hcp-openaccess/ | controlled access | https://wiki.humanconnectome.org/display/PublicData/How+To+Connect+to+Connectome+Data+via+AWS
The Human Connectome Project (HCP Young Adult, HCP-YA) is mapping the healthy human connectome by collecting and freely distributing neuroimaging and behavioral data on 1,200 normal young adults, aged 22-35.
cptac-2 | Clinical Proteomic Tumor Analysis Consortium 2 (CPTAC-2) | https://registry.opendata.aws/cptac-2/ | open (no authorization required) | https://aws.amazon.com/cli/
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-2 is the Phase II of the CPTAC Initiative (2011-2016). Datasets contain open RNA-Seq Gene Expression Quantification, miRNA-Seq Isoform Expression Quantification, and miRNA Expression Quantification data.
cptac-3 | Clinical Proteomic Tumor Analysis Consortium 3 (CPTAC-3) | https://registry.opendata.aws/cptac-3/ | open (no authorization required) | https://registry.opendata.aws/cptac-3/
The Clinical Proteomic Tumor Analysis Consortium (CPTAC) is a national effort to accelerate the understanding of the molecular basis of cancer through the application of large-scale proteome and genome analysis, or proteogenomics. CPTAC-3 is the Phase III of the CPTAC Initiative. The dataset contains open RNA-Seq Gene Expression Quantification data.
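
The catalog above follows a regular pattern that can be captured in a few lines of code. For every entry except EMBED, the acronym matches the last path segment of its "More information" link on registry.opendata.aws, and the access type indicates whether a request has to be filed before the data can be used. The following is a minimal sketch under those assumptions; the `DatasetEntry` class, its field names, and the `needs_authorization` rule are hypothetical and are not part of the catalog or of any AWS tooling.

```python
from dataclasses import dataclass

# Base URL taken from the "More information" column of the catalog.
REGISTRY_BASE = "https://registry.opendata.aws"


@dataclass(frozen=True)
class DatasetEntry:
    """One catalog row; the class and field names are hypothetical."""

    acronym: str
    access_type: str
    slug: str = ""  # set only when the registry slug differs from the acronym

    @property
    def registry_url(self) -> str:
        # Fall back to the acronym when no explicit slug is given.
        return f"{REGISTRY_BASE}/{self.slug or self.acronym}/"

    @property
    def needs_authorization(self) -> bool:
        # Assumption: any mention of "controlled" means that at least part
        # of the dataset requires an approved access request.
        return "controlled" in self.access_type.lower()


# A few rows copied from the catalog, for illustration only.
ENTRIES = [
    DatasetEntry("EMBED", "controlled access",
                 slug="emory-breast-imaging-dataset-embed"),
    DatasetEntry("1000-genomes", "open (no authorization required)"),
    DatasetEntry("tcga", "both open access and controlled access"),
    DatasetEntry("broad-gnomad", "open (no authorization required)"),
]

if __name__ == "__main__":
    for entry in ENTRIES:
        label = "request access first" if entry.needs_authorization else "open"
        print(f"{entry.acronym}: {entry.registry_url} ({label})")
```

Entries flagged by `needs_authorization` still need the request link from the last column of the catalog; the sketch only helps decide which datasets can be explored right away.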