1) What are your name and professional affiliation?
2) What is the name of your dataset or consortium (e.g. eMERGE, 1000 Genomes)? Include version number if available.
3) How was this dataset or consortium funded?
4) Which institutions are part of this dataset? Please list them.
5) We require a contact who will coordinate onboarding of this dataset into the AnVIL, which includes activities such as depositing files into cloud buckets, determining data use limitations, and ensuring phenotypic data adheres to an agreed-upon model. Please list the institutions that are contributing to this dataset (along with contact information).
6) If this dataset is NIH funded, do program officers affiliated with this dataset organization or consortium approve onboarding into the AnVIL?
7) If this dataset is NIH funded, please list the name(s) of the Program Officer(s) affiliated with this dataset organization or consortium.
8) Are you an authorized representative capable of making decisions on behalf of the dataset?
9) What is the scientific goal of this dataset or consortium?
10) What makes this dataset impactful to the research community? Please include references to any relevant advancements, for example in defining disease etiology or advancing technology/tool development.
11) How can your dataset benefit from being hosted by the AnVIL?
12) Does this dataset contain data generated from human-derived samples?
13) Was this data generated in an ethical manner, with IRB approvals obtained as needed?
14) Are the data use limitations on the dataset defined (e.g. GRU, HMB)?
15) Does your dataset contain any Personally Identifiable Information (PII) or Personal Health Information (PHI)? *
Examples of PII and PHI include: names, locations smaller than a state, dates more specific than a year, telephone numbers, vehicle identification numbers, license plate numbers, fax numbers, serial numbers, email addresses, URLs, Social Security numbers, IP addresses, medical record numbers, biometric identifiers, beneficiary numbers, photographs, account numbers, or any other unique identifier. For additional guidance, see:
16) If the dataset contains PII/PHI as described above, has it been de-identified? *
17) Has this dataset (and all associated cohorts expected to be deposited in the AnVIL) been registered in dbGaP?
18) How many cohorts are included in this dataset? (A cohort here is defined as "An organization of data that corresponds to a single IRB-approved study protocol")
19) Does your dataset organization or consortium have a data sharing agreement in place that allows members to access restricted data outside of dbGaP data access requests?
20) Is this dataset currently available to the public via other sources?
21) What types of data files are you interested in hosting in the AnVIL?
22) Will your dataset have genomic and phenotypic data available?
23) What type of analysis was performed to generate data?
24) What sequencing metrics do you have available for your genomic data?
25) Was your genomic data aligned using a functionally equivalent pipeline? (See more info here:
26) What genome build was your genomic data aligned to?
27) What is the total size (in TB) of the genomic files you would like hosted in the AnVIL? Provide an estimate for each file type, along with the number of files.
28) What is the total size (in TB) of the phenotypic files you would like hosted in the AnVIL? Provide an estimate for each file type, along with the number of files.
29) What data model do you currently use to organize your data (e.g. OMOP, dbGaP, FHIR, i2b2)?
30) Are there any analysis tools or apps that would be useful for your consortium to be able to use within the AnVIL?
