| A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | AA | AB | AC | AD | AE | AF | AG | AH | AI | AJ | AK | AL | AM | AN | AO | AP | AQ | AR | AS | AT | ||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Field name | AnnData alignment | Description | Data type | Categorical values (ontology and enum) | Examples | Required | Rationale | Encoded by sample, cell, donor, dataset, library (or combination)? | Public / DB side | |||||||||||||||||||||||||||||||||||||
2 | 1 | title | uns | This text describes and differentiates the dataset from other datasets in the same collection. It is strongly recommended that each dataset title in a collection is unique and does not depend on other metadata such as a different assay to disambiguate it from other datasets in the collection. | string | N/A | Cells of the adult human heart collection is "All — Cells of the adult human heart". | MUST | dataset | public | |||||||||||||||||||||||||||||||||||||
3 | 2 | study_pi | uns | Principal Investigator(s) leading the study where the data is/was used. | array | N/A | Sarah,A,Teichmann | MUST | To be able to link to other studies from the same lab as sometimes labs tend to process samples similarly and generate libraries similarly, resulting in less strong batch effects | dataset | public | ||||||||||||||||||||||||||||||||||||
4 | 3 | batch_condition | uns | Values must refer to cell metadata keys in obs. Together, these keys define the batches that a normalization or integration algorithm should be aware of. For example if "patient" and "seqBatch" are keys of vectors of cell metadata, either ["patient"], ["seqBatch"], or ["patient", "seqBatch"] are valid values. | string | N/A | - | RECOMMENDED | dataset | public | |||||||||||||||||||||||||||||||||||||
5 | 4 | default_embedding | uns | The value must match a key to an embedding in obsm for the embedding to display by default in CELLxGENE Explorer. | - | RECOMMENDED | dataset | public | |||||||||||||||||||||||||||||||||||||||
6 | 5 | comments | uns | Other technical or experimental covariates that could affect the quality or batch of the sample. Must not contain identifiers. This field is designed to capture potential challenges for data integration not captured elsewhere. | string | N/A | - | RECOMMENDED | Any additional comments | dataset | public | ||||||||||||||||||||||||||||||||||||
7 | 6 | sample_id | obs | Identification number of the sample. This is the fundamental unit of sampling the tissue (the specimen taken from the subject), which can be the same as the 'donor_id', but is often different if multiple samples are taken from the same subject. Note: this is NOT a unit of multiplexing of donor samples, which should be stored in "library_id". | string | N/A | SC24; SC25; SC28 | MUST | Fundamental unit of sampling of the tissue. | sample | public | ||||||||||||||||||||||||||||||||||||
8 | 7 | donor_id | obs | This must be free-text that identifies a unique individual that data were derived from. | string | It is strongly recommended that this identifier be designed so that it is unique to: a given individual within the collection of datasets that includes this dataset, and a given individual across all collections in CELLxGENE Discover. It is strongly recommended that "pooled" be used for observations from a sample of multiple individuals that were not confidently assigned to a single individual through demultiplexing. It is strongly recommended that "unknown" ONLY be used for observations in a dataset when it is not known which observations are from the same individual. | CR_donor_1; MM_donor_1; LR_donor_2 | MUST | Fundamental unit of biological variation of the data | donor | public | ||||||||||||||||||||||||||||||||||||
9 | 8 | protocol_url | obs | The protocols.io URL (if none exists, please use the bioRxiv URL) for the full experimental protocol; if multiple protocols exist, please list them, e.g. sample preparation protocol / sequencing protocol. | array | N/A | https://www.biorxiv.org/content/early/2017/09/24/193219 | RECOMMENDED | Useful to look up protocol data that can provide insight on batch effects. As protocols can sometimes apply to a subset of the study, we capture this at a sample level. This information may not always be available. | sample | public | ||||||||||||||||||||||||||||||||||||
10 | 9 | institute | obs | Institution where the samples were processed. | array | N/A | EMBL-EBI; Genome Institute of Singapore | MUST | To be able to link to other studies from the same institution as sometimes samples from different labs in the same institute are processed via similar core facilities. Thus batch effects may be smaller for datasets from the same institute even if other factors differ. | sample | public | ||||||||||||||||||||||||||||||||||||
11 | 10 | sample_collection_site | obs | The pseudonymised name of the site where the sample was collected. | string | It is strongly recommended that this identifier be designed so that it is unique to a given site within the collection of datasets that includes this site (for example, the labels 'site1', 'site2' may appear in other datasets thus rendering them indistinguishable). | AIDA_site_1; AIDA_site_2 | RECOMMENDED | To understand whether the collection site contributes to batch effects | sample | public | ||||||||||||||||||||||||||||||||||||
12 | 11 | sample_collection_relative_time_point | obs | Time point when the sample was collected. This field is only needed if multiple samples from the same subject are available and collected at different time points. Sample collection dates (e.g. 23/09/22) cannot be used due to patient data protection, only relative time points should be used here (e.g. day3). | string | N/A | sampleX_day1 | RECOMMENDED | Explains variability in the data between samples from the same subject | sample | public | ||||||||||||||||||||||||||||||||||||
13 | 12 | library_id | obs | The unique ID that is used to track libraries in the investigator's institution (should align with the publication). | string | N/A | A24; NK_healthy_001 | MUST | A way to track the unit of data generation. This should include sample pooling | sample | public | ||||||||||||||||||||||||||||||||||||
14 | 13 | library_id_repository | obs | The unique ID used to track libraries from one of the following public data repositories: EGAX*, GSM*, SRX*, ERX* | string | N/A | GSM1684095 | RECOMMENDED | Links a dataset back to the source from which it was ingested; optional only if this is the same as the library_id. | library | public | ||||||||||||||||||||||||||||||||||||
15 | 14 | author_batch_notes | obs | Encoding of author knowledge on any further information related to likely batch effects. | string | N/A | Batch run by different personnel on different days | RECOMMENDED | Space for author intuition of batch effects in their dataset | sample | public | ||||||||||||||||||||||||||||||||||||
16 | 15 | organism_ontology_term_id | obs | The name given to the type of organism, collected in NCBITaxon:0000 format. | ontology | "NCBITaxon:9606" for Homo sapiens or "NCBITaxon:10090" for Mus musculus. | NCBITaxon:9606; NCBITaxon:10090 | MUST | Strong biological effect that needs to be considered for batch covariate selection | donor | public | ||||||||||||||||||||||||||||||||||||
17 | 16 | manner_of_death | obs | Manner of death classification based on the Hardy Scale or 'unknown' or 'not applicable': Category 1 = Violent and fast death - Deaths due to accident, blunt force trauma or suicide, terminal phase estimated at < 10 min. Category 2 = Fast death of natural causes - Sudden unexpected deaths of people who had been reasonably healthy, after a terminal phase estimated at < 1 hr (with sudden death from a myocardial infarction as a model cause of death for this category). Category 3 = Intermediate death - Death after a terminal phase of 1 to 24 hrs (not classifiable as 2 or 4); patients who were ill but death was unexpected. Category 4 = Slow death - Death after a long illness, with a terminal phase longer than 1 day (commonly cancer or chronic pulmonary disease); deaths that are not unexpected. Category 0 = Ventilator case - All cases on a ventilator immediately before death. Unknown = The cause of death is unknown. Not applicable = Subject is alive. | enum | 1; 2; 3; 4; 0; unknown; not applicable | 1; 2; 3; 4; 0; unknown; not applicable | MUST | Manner of death can affect cellular profiles. | donor | public | ||||||||||||||||||||||||||||||||||||
18 | 17 | sample_source | obs | The study subgroup that the participant belongs to. This indicates whether the participant was a surgical donor (this includes patients providing blood samples or biopsies), a postmortem donor, or an organ donor. | enum | surgical donor; postmortem donor; living organ donor | surgical donor; postmortem donor | MUST | The source of the sample (whether the sample comes from alive subject; an organ donor; or deceased subject) can result in different cellular profiles and hence batch effects. | sample | public | ||||||||||||||||||||||||||||||||||||
19 | 18 | sex_ontology_term_id | obs | Reported sex of the donor. | string | This must be a child of PATO:0001894 for phenotypic sex or "unknown" if unavailable. | PATO:0000383 for female, PATO:0000384 for male | MUST | Likely biological effect. Need to know if we have a balanced dataset or if sex is collinear with the dataset. | donor | public | ||||||||||||||||||||||||||||||||||||
20 | 19 | sample_collection_method | obs | The method by which the sample was physically obtained from the donor. | enum | brush; scraping; biopsy; surgical resection; blood draw; body fluid; other | biopsy; brush; surgical resection | MUST | Main contributor to batch effects | sample | public | ||||||||||||||||||||||||||||||||||||
21 | 20 | tissue_type | obs | Whether the tissue is "tissue", "organoid", or "cell culture". | enum | tissue; organoid; cell culture | tissue; organoid; cell culture | MUST | Source of batch effect & dataset exclusion criteria | sample | public | ||||||||||||||||||||||||||||||||||||
22 | 21 | sampled_site_condition | obs | Whether the site is considered healthy, diseased or adjacent to disease. | enum | healthy; diseased; adjacent | healthy; diseased; adjacent | MUST | Main contributor to batch effects | sample | public | ||||||||||||||||||||||||||||||||||||
23 | 22 | tissue_ontology_term_id | obs | The detailed anatomical location of the sample, please provide a specific UBERON term. | string | If tissue_type is "tissue" or "organoid", this must be the most accurate child of UBERON:0001062 for anatomical entity. If tissue_type is "cell culture" this must follow the requirements for cell_type_ontology_term_id. | UBERON:0001828; UBERON:0000966 | MUST | Major biological effect that needs to be assessed for sufficient coverage in the atlas datasets. | sample | public | ||||||||||||||||||||||||||||||||||||
24 | 23 | tissue_free_text | obs | The detailed anatomical location of the sample - this does not have to tie to an ontology term. | string | N/A | terminal ileum | RECOMMENDED | To help the integration team understand the anatomical location of the sample, specifically to solve the problem when the UBERON ontology terms are insufficiently precise. | sample | public | ||||||||||||||||||||||||||||||||||||
25 | 24 | sample_preservation_method | obs | Indicates whether the tissue was frozen at any point before library preparation, and how it was preserved. | enum | ambient temperature; cut slide; fresh; frozen at -70C; frozen at -80C; frozen at -150C; frozen in liquid nitrogen; frozen in vapor phase; paraffin block; RNAlater at 4C; RNAlater at 25C; RNAlater at -20C; other | fresh; frozen at -70C | MUST | Main contributor to batch effects | sample | public | ||||||||||||||||||||||||||||||||||||
26 | 25 | suspension_type | obs | Specifies whether the sample contains single cells or single nuclei data. | enum | This must be "cell", "nucleus", or "na". This must be the correct type for the corresponding assay: 10x transcription profiling [EFO:0030080] and its children = "cell" or "nucleus"; ATAC-seq [EFO:0007045] and its children = "nucleus"; BD Rhapsody Whole Transcriptome Analysis [EFO:0700003] = "cell"; BD Rhapsody Targeted mRNA [EFO:0700004] = "cell"; CEL-seq2 [EFO:0010010] = "cell" or "nucleus"; CITE-seq [EFO:0009294] and its children = "cell"; DroNc-seq [EFO:0008720] = "nucleus"; Drop-seq [EFO:0008722] = "cell" or "nucleus"; GEXSCOPE technology [EFO:0700011] = "cell" or "nucleus"; inDrop [EFO:0008780] = "cell" or "nucleus" | cell; nucleus; na | MUST | Major source of batch effect & dataset exclusion criteria | sample | public | ||||||||||||||||||||||||||||||||||||
27 | 26 | cell_enrichment | obs | Specifies the cell types targeted for enrichment or depletion beyond the selection of live cells. | string | This must be a Cell Ontology (CL) term (http://www.ebi.ac.uk/ols4/ontologies/cl). For cells that are enriched, list the CL code followed by a "+". For cells that were depleted, list the CL code followed by a "-". If no enrichment or depletion occurred, please use 'na' (not applicable) | CL:0000057+; na | MUST | If cell lineages were filtered, this may be a dataset exclusion criterion | sample | public | ||||||||||||||||||||||||||||||||||||
28 | 27 | cell_viability_percentage | obs | If measured, per sample cell viability before library preparation (as a percentage). | number | N/A | 88; 95; 93.5 | RECOMMENDED | Is a measure of sample quality that could be used to explain outlier samples | sample | public | ||||||||||||||||||||||||||||||||||||
29 | 28 | cell_number_loaded | obs | Estimated number of cells loaded for library construction. | integer | N/A | 5000; 4000 | RECOMMENDED | Can explain the number of doublets found in samples | sample | public | ||||||||||||||||||||||||||||||||||||
30 | 29 | sample_collection_year | obs | Year of sample collection. Should not be detailed further (to exact month and day), to prevent identifiability. | integer | N/A | 2018 | RECOMMENDED | May explain whether a dataset was separated into smaller batches. | sample | public | ||||||||||||||||||||||||||||||||||||
31 | 30 | assay_ontology_term_id | obs | Platform used for single cell library construction. | string | This must be an EFO term and either: "EFO:0002772" for assay by molecule or preferably its most accurate child; or "EFO:0010183" for single cell library construction or preferably its most accurate child. An assay based on 10X Genomics products should either be "EFO:0008995" for 10x technology or preferably its most accurate child. An assay based on SMART (Switching Mechanism at the 5' end of the RNA Template) or SMARTer technology SHOULD either be "EFO:0010184" for Smart-like or preferably its most accurate child. Recommended: 10x 3' v2 "EFO:0009899"; 10x 3' v3 "EFO:0009922"; 10x 5' v1 "EFO:0011025"; 10x 5' v2 "EFO:0009900"; Smart-seq2 "EFO:0008931"; Visium Spatial Gene Expression "EFO:0010961" | EFO:0009922 | MUST | Major source of batch effect and dataset filtering criterion | library | public | ||||||||||||||||||||||||||||||||||||
32 | 31 | library_preparation_batch | obs | Indicating which samples' libraries were prepared in the same chip/plate/etc., e.g. batch1, batch2. | string | N/A | batch01; batch02 | MUST | Sample preparation is a major source of batch effects. | library | public | ||||||||||||||||||||||||||||||||||||
33 | 32 | library_sequencing_run | obs | The identifier (or accession number) that indicates which samples' libraries were sequenced in the same run. | string | N/A | ERR10855815; run1; NV0087 | MUST | Library sequencing is a major source of batch effects | library | public | ||||||||||||||||||||||||||||||||||||
34 | 33 | sequenced_fragment | obs | Which part of the RNA transcript was targeted for sequencing. | enum | 3 prime tag; 5 prime tag; probe-based; full length | 3 prime tag; full length | MUST | May be a source of batch effect that has to be tested | library | public | ||||||||||||||||||||||||||||||||||||
35 | 34 | sequencing_platform | obs | Platform used for sequencing. | ontology | "subClassOf" : ["EFO:0002699"] - https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0002699 | EFO:0008563 | RECOMMENDED | This captures potential strand hopping which may cause data quality issues | library | public | ||||||||||||||||||||||||||||||||||||
36 | 35 | is_primary_data | obs | This must be True if this is the canonical instance of this cellular observation and False if not. This is commonly False for meta-analyses reusing data or for secondary views of data. | enum | true; false | true; false | MUST | This helps to ensure samples are not used twice. | library | public | ||||||||||||||||||||||||||||||||||||
37 | 36 | reference_genome | obs | Reference genome used for alignment. | enum | GRCh38; GRCh37; GRCm39; GRCm38; GRCm37; not applicable | GRCh38; GRCh37 | MUST | Possible source of batch effect and confounder for some biological analysis | library | public | ||||||||||||||||||||||||||||||||||||
38 | 37 | gene_annotation_version | obs | Ensembl release version accession number. | string | http://www.ensembl.org/info/website/archives/index.html or NCBI/RefSeq | v110; GCF_000001405.40 | MUST | Possible source of batch effect and confounder for some biological analysis | library | public | ||||||||||||||||||||||||||||||||||||
39 | 38 | alignment_software | obs | Protocol used for alignment analysis, please specify which version was used e.g. cell ranger 2.0, 2.1.1 etc. | string | N/A | cell ranger 3.0.1; kallisto bustools; GSNAP | MUST | Affects which cells are filtered per dataset, and which reads (introns and exons or only exons) are counted as part of the reported transcriptome. This can convey batch effects. | library | public | ||||||||||||||||||||||||||||||||||||
40 | 39 | intron_inclusion | obs | Were introns included during read counting in the alignment process? | enum | yes; no | yes; no | RECOMMENDED | Affects the number of reads per cell called in a sample | library | public | ||||||||||||||||||||||||||||||||||||
41 | 40 | author_cell_type | obs | Encoding of author intuition of cellular annotation in the dataset. | string | N/A | Goblet cell; microglia | RECOMMENDED | Encoding of author intuition of cellular annotation in their dataset. | cell | public | ||||||||||||||||||||||||||||||||||||
42 | 41 | cell_type_ontology_term_id | obs | Cell Ontology (CL) term. | string | This must be a Cell Ontology (CL) term (http://www.ebi.ac.uk/ols4/ontologies/cl). If no appropriate high-level term can be found or the cell type is unknown, then it is strongly recommended to use 'unknown'. The following terms must not be used: "CL:0000255" for eukaryotic cell; "CL:0000257" for Eumycetozoan cell; "CL:0000548" for animal cell | CL:0001204 | MUST | Encoding of cell type to help alignment with other datasets. | cell | public | ||||||||||||||||||||||||||||||||||||
43 | |||||||||||||||||||||||||||||||||||||||||||||||
44 | |||||||||||||||||||||||||||||||||||||||||||||||
45 | |||||||||||||||||||||||||||||||||||||||||||||||
46 | CELLxGENE mandatory fields (not in the HCA Tier 1 schema) | ||||||||||||||||||||||||||||||||||||||||||||||
47 | CxG mandatory biological metadata terms: disease_ontology_term_id | obs | This must be a MONDO term or "PATO:0000461" for normal or healthy. | ontology | Requirements for data contributors adhering to GDPR or like standards: In the case of disease, HCA requests that you submit a higher order ontology term - this is especially important in the case of rare disease. | MONDO:0005385; PATO:0000461 | MUST | CELLxGENE core schema | donor | public | |||||||||||||||||||||||||||||||||||||
48 | CxG mandatory biological metadata terms: self_reported_ethnicity_ontology_term_id | obs | Self reported ethnicity. | ontology | If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this must be a HANCESTRO term or "unknown". Otherwise, for all organisms, this must be "na". Requirements for data contributors adhering to GDPR or like standards: HCA will be collecting ethnicity data as part of HCA's Tier 2 metadata that is protected by managed access, therefore please put 'unknown' for this field. | unknown | MUST | CELLxGENE core schema | donor | public | |||||||||||||||||||||||||||||||||||||
49 | CxG mandatory biological metadata terms: development_stage_ontology_term_id | obs | Age of the subject. | string | If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this should be an HsapDv term. If organism_ontology_term_id is "NCBITaxon:10090" for Mus musculus, this should be an MmusDv term. Requirements for data contributors adhering to GDPR or like standards: HCA requests that you do not submit year-specific terms. For convenience, below are some broader age bracket ontology terms: Embryonic stage = A term from the set of Carnegie stages 1-23 (up to 8 weeks after conception; e.g. HsapDv:0000003); Fetal development = A term from the set of 9 to 38 week post-fertilization human stages (9 weeks after conception and before birth; e.g. HsapDv:0000046); Postnatal = Years 0-14 HsapDv:0000264; Years 15-19 HsapDv:0000268; Years 20-29 HsapDv:0000237; Years 30-39 HsapDv:0000238; Years 40-49 HsapDv:0000239; Years 50-59 HsapDv:0000240; Years 60-69 HsapDv:0000241; Years 70-79 HsapDv:0000242; Years 80-89 HsapDv:0000243 | HsapDv:0000237; unknown | MUST | CELLxGENE core schema | sample | public | |||||||||||||||||||||||||||||||||||||
50 | |||||||||||||||||||||||||||||||||||||||||||||||
51 | |||||||||||||||||||||||||||||||||||||||||||||||
52 | |||||||||||||||||||||||||||||||||||||||||||||||
53 | Collection metadata submitted by email: | ||||||||||||||||||||||||||||||||||||||||||||||
54 | consortia | uns | List relevant consortia, specifically HCA. | string | HCA | HCA | MUST | dataset | public | ||||||||||||||||||||||||||||||||||||||
55 | description | uns | Short description of the dataset. | string | N/A | - | MUST | dataset | public | ||||||||||||||||||||||||||||||||||||||
56 | contact name_email | uns | Contact name and email of the submitter. | string | N/A | Polly Bloggs, pbloggs@gmail.com | MUST | dataset | public | ||||||||||||||||||||||||||||||||||||||
57 | publication_doi | uns | The publication digital object identifier (DOI) for the protocol. If neither a preprint nor a publication exists, please write 'not applicable'. | string | N/A | 10.1016/j.cell.2016.07.054 | MUST | To enable data to be linked to the publication. | dataset | public | |||||||||||||||||||||||||||||||||||||
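
A few of the rules in the table lend themselves to automated checks before submission. The sketch below is illustrative only and is not part of the HCA or CELLxGENE tooling. It assumes `obs` is a plain dict mapping column names to lists of per-cell values and `uns` is a plain dict, as stand-ins for the corresponding AnnData slots. The function name `check_metadata`, the choice of which enum fields to include, and the seven-digit pattern for CL identifiers are assumptions made for this example.

```python
import re

# Allowed values copied from the "Categorical values" column for a few MUST enum fields.
ENUMS = {
    "manner_of_death": {"1", "2", "3", "4", "0", "unknown", "not applicable"},
    "tissue_type": {"tissue", "organoid", "cell culture"},
    "sampled_site_condition": {"healthy", "diseased", "adjacent"},
    "suspension_type": {"cell", "nucleus", "na"},
    "sequenced_fragment": {"3 prime tag", "5 prime tag", "probe-based", "full length"},
}

# cell_enrichment: a CL term followed by "+" (enriched) or "-" (depleted), or "na".
# Assumption: CL identifiers look like "CL:" plus seven digits, as in the table's example.
CELL_ENRICHMENT = re.compile(r"^(na|CL:\d{7}[+-])$")


def check_metadata(obs, uns):
    """Return a list of readable problems; an empty list means none were found."""
    problems = []
    # Enum fields: every value must be one of the allowed strings.
    for field, allowed in ENUMS.items():
        if field not in obs:
            problems.append(f"missing obs column: {field}")
            continue
        bad = sorted({str(v) for v in obs[field]} - allowed)
        if bad:
            problems.append(f"{field}: unexpected values {bad}")
    # cell_enrichment: check the format of each distinct value.
    for value in sorted({str(v) for v in obs.get("cell_enrichment", [])}):
        if not CELL_ENRICHMENT.match(value):
            problems.append(f"cell_enrichment: malformed value {value!r}")
    # batch_condition: every entry must name an existing obs column.
    for key in uns.get("batch_condition", []):
        if key not in obs:
            problems.append(f"batch_condition: {key!r} is not a key in obs")
    return problems


# Hypothetical two-cell dataset with two deliberate mistakes.
obs = {
    "donor_id": ["CR_donor_1", "CR_donor_1"],
    "manner_of_death": ["1", "not applicable"],
    "tissue_type": ["tissue", "tissue"],
    "sampled_site_condition": ["healthy", "Healthy"],  # wrong case
    "suspension_type": ["cell", "cell"],
    "sequenced_fragment": ["3 prime tag", "3 prime tag"],
    "cell_enrichment": ["CL:0000057+", "na"],
}
uns = {"batch_condition": ["donor_id", "seqBatch"]}  # "seqBatch" is not in obs

problems = check_metadata(obs, uns)
for line in problems:
    print(line)
```

With a real AnnData object, the same checks could be run by passing a dict built from `adata.obs` columns and `adata.uns`. Ontology-backed fields (for example `tissue_ontology_term_id`) need an ontology lookup and are deliberately left out of this sketch.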