1
Field name | AnnData alignment | Description | Data type | Categorical values (ontology and enum) | Examples | Required | Rationale | Encoded by sample, cell, donor, dataset, library (or combination)? | Public / DB side
2
1titleuns
This text describes and differentiates the dataset from other datasets in the same collection. It is strongly recommended that each dataset title in a collection is unique and does not depend on other metadata such as a different assay to disambiguate it from other datasets in the collection.
stringN/A
An example title for a dataset in the "Cells of the adult human heart" collection is "All — Cells of the adult human heart".
MUSTdatasetpublic
3
2study_piunsPrincipal Investigator(s) leading the study where the data is/was used.arrayN/ASarah,A,TeichmannMUST
To link to other studies from the same lab. Labs tend to process samples and generate libraries in similar ways, which results in weaker batch effects between their studies.
datasetpublic
4
3batch_conditionuns
Values must refer to cell metadata keys in obs. Together, these keys define the batches that a normalization or integration algorithm should be aware of. For example, if "patient" and "seqBatch" are keys of vectors of cell metadata, then ["patient"], ["seqBatch"], and ["patient", "seqBatch"] are all valid values.
stringN/A - RECOMMENDEDdatasetpublic
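The batch_condition rule above can be sketched as a simple membership check. This is a minimal illustration, assuming the obs column names are available as a plain collection (with AnnData this would typically be adata.obs.columns); the function name missing_batch_keys is hypothetical and not part of any schema tooling.

```python
from typing import Iterable, List


def missing_batch_keys(batch_condition: Iterable[str], obs_columns: Iterable[str]) -> List[str]:
    """Return the batch_condition entries that are not cell metadata keys in obs."""
    known = set(obs_columns)
    return [key for key in batch_condition if key not in known]


# Example from the description: "patient" and "seqBatch" are obs keys.
obs_columns = ["patient", "seqBatch", "donor_id"]
print(missing_batch_keys(["patient", "seqBatch"], obs_columns))
print(missing_batch_keys(["patient", "lane"], obs_columns))
```

An empty result means every listed key exists in obs; any returned name would need to be added to obs or removed from batch_condition.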
5
4default_embeddinguns
The value must match a key to an embedding in obsm for the embedding to display by default in CELLxGENE Explorer.
- RECOMMENDEDdatasetpublic
6
5commentsuns
Other technical or experimental covariates that could affect the quality or batch of the sample. Must not contain identifiers. This field is designed to capture potential challenges for data integration not captured elsewhere.
stringN/A - RECOMMENDEDAny additional commentsdatasetpublic
7
6sample_idobs
Identification number of the sample. This is the fundamental unit of sampling the tissue (the specimen taken from the subject), which can be the same as the 'donor_id', but is often different if multiple samples are taken from the same subject. Note: this is NOT a unit of multiplexing of donor samples, which should be stored in "library".
stringN/ASC24; SC25; SC28MUST
Fundamental unit of sampling of the tissue.
samplepublic
8
7donor_idobsThis must be free-text that identifies a unique individual that data were derived from.string
It is strongly recommended that this identifier be designed so that it is unique to: a given individual within the collection of datasets that includes this dataset, and a given individual across all collections in CELLxGENE Discover.

It is strongly recommended that "pooled" be used for observations from a sample of multiple individuals that were not confidently assigned to a single individual through demultiplexing.
It is strongly recommended that "unknown" ONLY be used for observations in a dataset when it is not known which observations are from the same individual.
CR_donor_1; MM_donor_1; LR_donor_2MUST
Fundamental unit of biological variation of the data
donorpublic
9
8protocol_urlobs
The protocols.io URL (if none exists, please use the BioRxiv URL) for the full experimental protocol; or if multiple protocols exist please list them e.g. sample preparation protocol / sequencing protocol.
arrayN/A
https://www.biorxiv.org/content/early/2017/09/24/193219
RECOMMENDED
Useful to look up protocol data that can provide insight on batch effects. As protocols can sometimes apply to a subset of the study, we capture this at a sample level. This information may not always be available.
samplepublic
10
9instituteobsInstitution where the samples were processed.arrayN/AEMBL-EBI; Genome Institute of SingaporeMUST
To be able to link to other studies from the same institution as sometimes samples from different labs in the same institute are processed via similar core facilities. Thus batch effects may be smaller for datasets from the same institute even if other factors differ.
samplepublic
11
10sample_collection_siteobsThe pseudonymised name of the site where the sample was collected.string
It is strongly recommended that this identifier be designed so that it is unique to a given site within the collection of datasets that includes this site (for example, the labels 'site1', 'site2' may appear in other datasets thus rendering them indistinguishable).
AIDA_site_1; AIDA_site_2RECOMMENDED
To understand whether the collection site contributes to batch effects
samplepublic
12
11sample_collection_relative_time_pointobs
Time point when the sample was collected. This field is only needed if multiple samples from the same subject are available and collected at different time points. Sample collection dates (e.g. 23/09/22) cannot be used due to patient data protection; only relative time points (e.g. day3) should be used here.
stringN/AsampleX_day1RECOMMENDED
Explains variability in the data between samples from the same subject
samplepublic
13
12library_idobs
The unique ID that is used to track libraries in the investigator's institution (should align with the publication).
stringN/AA24; NK_healthy_001MUST
A way to track the unit of data generation. This should include sample pooling
samplepublic
14
13library_id_repositoryobs
The unique ID used to track libraries from one of the following public data repositories: EGAX*, GSM*, SRX*, ERX*
stringN/AGSM1684095RECOMMENDED
Links a dataset back to the source from which it was ingested. Optional only if this is the same as the library_id.
librarypublic
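A minimal sketch of a format check for library_id_repository, based on the accession prefixes listed above. It assumes that the "*" in EGAX*, GSM*, SRX*, ERX* stands for one or more letters or digits; the function name is hypothetical, and the check does not confirm that the accession actually exists in the repository.

```python
import re

# Assumption: "*" in EGAX*, GSM*, SRX*, ERX* means one or more letters or digits.
REPOSITORY_ID = re.compile(r"(EGAX|GSM|SRX|ERX)[A-Za-z0-9]+")


def is_repository_library_id(value: str) -> bool:
    """Return True if value starts with a listed repository prefix and has a suffix."""
    return REPOSITORY_ID.fullmatch(value) is not None


print(is_repository_library_id("GSM1684095"))  # example value from the schema
print(is_repository_library_id("A24"))         # an institutional library_id, not a repository ID
```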
15
14author_batch_notesobs
Encoding of author knowledge on any further information related to likely batch effects.
stringN/A
Batch run by different personnel on different days
RECOMMENDED
Space for author intuition of batch effects in their dataset
samplepublic
16
15organism_ontology_term_idobsThe name given to the type of organism, collected in NCBITaxon:0000 format.ontology
"NCBITaxon:9606" for Homo sapiens or "NCBITaxon:10090" for Mus musculus.
NCBITaxon:9606; NCBITaxon:10090MUST
Strong biological effect that needs to be considered for batch covariate selection
donorpublic
17
16manner_of_deathobs
Manner of death classification based on the Hardy Scale or 'unknown' or 'not applicable':
Category 1 = Violent and fast death - Deaths due to accident, blunt force trauma or suicide, terminal phase estimated at < 10 min.
Category 2 = Fast death of natural causes - Sudden unexpected deaths of people who had been reasonably healthy, after a terminal phase estimated at < 1 hr (with sudden death from a myocardial infarction as a model cause of death for this category)
Category 3 = Intermediate death - Death after a terminal phase of 1 to 24 hrs (not classifiable as 2 or 4); patients who were ill but death was unexpected
Category 4 = Slow death - Death after a long illness, with a terminal phase longer than 1 day (commonly cancer or chronic pulmonary disease); deaths that are not unexpected
Category 0 = Ventilator case - All cases on a ventilator immediately before death
Unknown = The cause of death is unknown
Not applicable = Subject is alive
enum1; 2; 3; 4; 0; unknown; not applicable1; 2; 3; 4; 0; unknown; not applicableMUST
Manner of death can affect cellular profiles.
donorpublic
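Enum fields such as manner_of_death can be checked against their allowed values. The following is a minimal sketch, assuming the values are read as strings or integers from obs; the helper name invalid_manner_of_death is hypothetical.

```python
from typing import Iterable, List

# Allowed values as listed in the schema row for manner_of_death.
ALLOWED_MANNER_OF_DEATH = {"0", "1", "2", "3", "4", "unknown", "not applicable"}


def invalid_manner_of_death(values: Iterable[object]) -> List[str]:
    """Return the distinct values (as strings, sorted) that are not allowed."""
    seen = {str(value) for value in values}
    return sorted(seen - ALLOWED_MANNER_OF_DEATH)


print(invalid_manner_of_death(["1", 4, "unknown", "not applicable"]))
print(invalid_manner_of_death(["5", "Unknown"]))
```

The comparison is case-sensitive, so "Unknown" is reported as invalid. Whether to normalise case before checking is a design choice this sketch leaves open.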
18
17sample_sourceobs
The study subgroup that the participant belongs to. This indicates whether the participant was a surgical donor (this includes patients providing blood samples or biopsies), a postmortem donor, or an organ donor.
enumsurgical donor; postmortem donor; living organ donorsurgical donor; postmortem donorMUST
The source of the sample (whether the sample comes from a living subject, an organ donor, or a deceased subject) can result in different cellular profiles and hence batch effects.
samplepublic
19
18sex_ontology_term_idobsReported sex of the donor.string
This must be a child of PATO:0001894 for phenotypic sex or "unknown" if unavailable.
PATO:0000383 for female, PATO:0000384 for male
MUST
Likely biological effect. Need to know if we have a balanced dataset or if sex is collinear with the dataset.
donorpublic
20
19sample_collection_methodobsThe method by which the sample was physically obtained from the donor. enum
brush; scraping; biopsy; surgical resection; blood draw; body fluid; other
biopsy; brush; surgical resectionMUST
Main contributor to batch effects
samplepublic
21
20tissue_typeobsWhether the tissue is "tissue", "organoid", or "cell culture".enumtissue; organoid; cell culturetissue; organoid; cell cultureMUST
Source of batch effect & dataset exclusion criteria
samplepublic
22
21sampled_site_conditionobsWhether the site is considered healthy, diseased or adjacent to disease.enumhealthy; diseased; adjacenthealthy; diseased; adjacentMUST
Main contributor to batch effects
samplepublic
23
22tissue_ontology_term_idobs
The detailed anatomical location of the sample. Please provide a specific UBERON term.
string
If tissue_type is "tissue" or "organoid", this must be the most accurate child of UBERON:0001062 for anatomical entity. If tissue_type is "cell culture" this must follow the requirements for cell_type_ontology_term_id.
UBERON:0001828; UBERON:0000966MUST
Major biological effect that needs to be assessed for sufficient coverage in the atlas datasets.
samplepublic
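The dependency between tissue_type and tissue_ontology_term_id can be sketched as a prefix check. This is only an approximation: it does not verify that a term is a child of UBERON:0001062, and it does not apply the full cell_type_ontology_term_id rules (such as the excluded CL terms). The function name is hypothetical.

```python
def tissue_term_prefix_ok(tissue_type: str, term_id: str) -> bool:
    """Check that the ontology prefix of term_id matches what tissue_type requires."""
    if tissue_type in ("tissue", "organoid"):
        # Must be an UBERON term (anatomical entity).
        return term_id.startswith("UBERON:")
    if tissue_type == "cell culture":
        # Follows the cell_type_ontology_term_id requirements, i.e. a CL term.
        return term_id.startswith("CL:")
    raise ValueError(f"unexpected tissue_type: {tissue_type!r}")


print(tissue_term_prefix_ok("tissue", "UBERON:0001828"))
print(tissue_term_prefix_ok("cell culture", "UBERON:0001828"))
```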
24
23tissue_free_textobs
The detailed anatomical location of the sample - this does not have to tie to an ontology term.
stringN/Aterminal ileumRECOMMENDED
To help the integration team understand the anatomical location of the sample, specifically to solve the problem when the UBERON ontology terms are insufficiently precise.
samplepublic
25
24sample_preservation_methodobsIndicates whether the tissue was frozen at any point before library preparation.enum
ambient temperature; cut slide; fresh; frozen at -70C; frozen at -80C; frozen at -150C; frozen in liquid nitrogen; frozen in vapor phase; paraffin block; RNAlater at 4C; RNAlater at 25C; RNAlater at -20C; other
fresh; frozen at -70CMUST
Main contributor to batch effects
samplepublic
26
25suspension_typeobsSpecifies whether the sample contains single cells or single nuclei data.enum
This must be "cell", "nucleus", or "na".
This must be the correct type for the corresponding assay:
10x transcription profiling [EFO:0030080] and its children = "cell" or "nucleus"
ATAC-seq [EFO:0007045] and its children = "nucleus"
BD Rhapsody Whole Transcriptome Analysis [EFO:0700003] = "cell"
BD Rhapsody Targeted mRNA [EFO:0700004] = "cell"
CEL-seq2 [EFO:0010010] = "cell" or "nucleus"
CITE-seq [EFO:0009294] and its children = "cell"
DroNc-seq [EFO:0008720] = "nucleus"
Drop-seq [EFO:0008722] = "cell" or "nucleus"
GEXSCOPE technology [EFO:0700011] = "cell" or "nucleus"
inDrop [EFO:0008780] = "cell" or "nucleus"
cell; nucleus; naMUST
Major source of batch effect & dataset exclusion criteria
samplepublic
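The assay-to-suspension-type rules listed above can be expressed as a lookup table. The sketch below covers only the exact EFO terms named in this row; resolving "and its children" would need an ontology lookup, which is not attempted here. The function name and the use of None for unlisted assays are assumptions made for illustration.

```python
from typing import Optional

# Allowed suspension types per assay term, copied from the list in this row.
ALLOWED_SUSPENSION = {
    "EFO:0030080": {"cell", "nucleus"},  # 10x transcription profiling
    "EFO:0007045": {"nucleus"},          # ATAC-seq
    "EFO:0700003": {"cell"},             # BD Rhapsody Whole Transcriptome Analysis
    "EFO:0700004": {"cell"},             # BD Rhapsody Targeted mRNA
    "EFO:0010010": {"cell", "nucleus"},  # CEL-seq2
    "EFO:0009294": {"cell"},             # CITE-seq
    "EFO:0008720": {"nucleus"},          # DroNc-seq
    "EFO:0008722": {"cell", "nucleus"},  # Drop-seq
    "EFO:0700011": {"cell", "nucleus"},  # GEXSCOPE technology
    "EFO:0008780": {"cell", "nucleus"},  # inDrop
}


def suspension_type_ok(assay_term: str, suspension_type: str) -> Optional[bool]:
    """Return True/False for listed assays, or None if the assay is not in the table."""
    allowed = ALLOWED_SUSPENSION.get(assay_term)
    if allowed is None:
        return None  # child terms and unlisted assays need an ontology lookup
    return suspension_type in allowed


print(suspension_type_ok("EFO:0008720", "nucleus"))
print(suspension_type_ok("EFO:0008720", "cell"))
```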
27
26cell_enrichmentobs
Specifies the cell types targeted for enrichment or depletion beyond the selection of live cells.
string
This must be a Cell Ontology (CL) term (http://www.ebi.ac.uk/ols4/ontologies/cl). For cells that are enriched, list the CL code followed by a "+". For cells that were depleted, list the CL code followed by a "-". If no enrichment or depletion occurred, please use 'na' (not applicable)
CL:0000057+; naMUST
If cell lineages were filtered, this may be a dataset exclusion criterion
samplepublic
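A minimal sketch of how a single cell_enrichment value could be parsed, following the "+" / "-" convention described above. It assumes CL identifiers have seven digits, as in the example CL:0000057, and handles one term per value; the function name is hypothetical.

```python
import re
from typing import Optional, Tuple

# Assumption: a CL identifier is "CL:" followed by seven digits, then "+" or "-".
ENRICHMENT = re.compile(r"(CL:\d{7})([+-])")


def parse_cell_enrichment(value: str) -> Optional[Tuple[str, str]]:
    """Return (CL term, 'enriched' or 'depleted'), or None for 'na'."""
    if value == "na":
        return None  # no enrichment or depletion occurred
    match = ENRICHMENT.fullmatch(value)
    if match is None:
        raise ValueError(f"not a valid cell_enrichment value: {value!r}")
    term, sign = match.groups()
    return (term, "enriched" if sign == "+" else "depleted")


print(parse_cell_enrichment("CL:0000057+"))
print(parse_cell_enrichment("na"))
```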
28
27cell_viability_percentageobsIf measured, per sample cell viability before library preparation (as a percentage).numberN/A88; 95; 93.5RECOMMENDED
Is a measure of sample quality that could be used to explain outlier samples
samplepublic
29
28cell_number_loadedobsEstimated number of cells loaded for library construction.integerN/A5000; 4000RECOMMENDED
Can explain the number of doublets found in samples
samplepublic
30
29sample_collection_yearobs
Year of sample collection. Should not be detailed further (to exact month and day), to prevent identifiability.
integerN/A2018RECOMMENDED
May explain whether a dataset was separated into smaller batches.
samplepublic
31
30assay_ontology_term_idobsPlatform used for single cell library construction.string
This must be an EFO term and either:
"EFO:0002772" for assay by molecule or preferably its most accurate child
"EFO:0010183" for single cell library construction or preferably its most accurate child
An assay based on 10X Genomics products should either be "EFO:0008995" for 10x technology or preferably its most accurate child. An assay based on SMART (Switching Mechanism at the 5' end of the RNA Template) or SMARTer technology SHOULD either be "EFO:0010184" for Smart-like or preferably its most accurate child.
Recommended:
10x 3' v2 "EFO:0009899"
10x 3' v3 "EFO:0009922"
10x 5' v1 "EFO:0011025"
10x 5' v2 "EFO:0009900"
Smart-seq2 "EFO:0008931"
Visium Spatial Gene Expression "EFO:0010961"
EFO:0009922MUST
Major source of batch effect and dataset filtering criterion
librarypublic
32
31library_preparation_batchobs
Indicating which samples' libraries were prepared in the same chip/plate/etc., e.g. batch1, batch2.
stringN/Abatch01; batch02MUST
Sample preparation is a major source of batch effects.
librarypublic
33
32library_sequencing_runobs
The identifier (or accession number) that indicates which samples' libraries were sequenced in the same run.
stringN/AERR10855815; run1; NV0087MUST
Library sequencing is a major source of batch effects
librarypublic
34
33sequenced_fragmentobsWhich part of the RNA transcript was targeted for sequencing.enum3 prime tag; 5 prime tag; probe-based; full length3 prime tag; full lengthMUST
May be a source of batch effect that has to be tested
librarypublic
35
34sequencing_platformobsPlatform used for sequencing.ontology
"subClassOf" : ["EFO:0002699"] - https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.ebi.ac.uk%2Fefo%2FEFO_0002699
EFO:0008563RECOMMENDED
This captures potential strand hopping which may cause data quality issues
librarypublic
36
35is_primary_dataobs
This must be True if this is the canonical instance of this cellular observation and False if not. This is commonly False for meta-analyses reusing data or for secondary views of data.
enumtrue; falsetrue; falseMUST
This helps to ensure samples are not used twice.
librarypublic
37
36reference_genomeobsReference genome used for alignment.enumGRCh38; GRCh37; GRCm39; GRCm38; GRCm37; not applicableGRCh38; GRCh37MUST
Possible source of batch effect and confounder for some biological analysis
librarypublic
38
37gene_annotation_versionobsEnsembl release version accession number.string
Ensembl (http://www.ensembl.org/info/website/archives/index.html) or NCBI/RefSeq
v110; GCF_000001405.40MUST
Possible source of batch effect and confounder for some biological analysis
librarypublic
39
38alignment_softwareobs
Protocol used for alignment analysis. Please specify which version was used, e.g. cell ranger 2.0, 2.1.1.
stringN/A
cell ranger 3.0.1; kallisto bustools; GSNAP
MUST
Affects which cells are filtered per dataset, and which reads (introns and exons or only exons) are counted as part of the reported transcriptome. This can introduce batch effects.
librarypublic
40
39intron_inclusionobsWere introns included during read counting in the alignment process?enumyes; noyes; noRECOMMENDED
Affects the number of reads per cell called in a sample
librarypublic
41
40author_cell_typeobsEncoding of author intuition of cellular annotation in the dataset.stringN/AGoblet cell; microgliaRECOMMENDED
Encoding of author intuition of cellular annotation in their dataset.
cellpublic
42
41cell_type_ontology_term_idobsCell Ontology (CL) term.string
This must be a Cell Ontology (CL) term (http://www.ebi.ac.uk/ols4/ontologies/cl).
If no appropriate high-level term can be found or the cell type is unknown, then it is strongly recommended to use 'unknown'.
The following terms must not be used: "CL:0000255" for eukaryotic cell; "CL:0000257" for Eumycetozoan cell; "CL:0000548" for animal cell
CL:0001204MUST
Encoding of cell type to help alignment with other datasets.
cellpublic
46
CELLxGENE mandatory fields (not in the HCA Tier 1 schema)
47
CxG mandatory biological metadata terms:

disease_ontology_term_id
obsThis must be a MONDO term or "PATO:0000461" for normal or healthy.ontologyRequirements for data contributors adhering to GDPR or like standards: In the case of disease, HCA requests that you submit a higher-order ontology term - this is especially important in the case of rare disease.
MONDO:0005385; PATO:0000461MUSTCELLxGENE core schemadonorpublic
48
CxG mandatory biological metadata terms:

self_reported_ethnicity_ontology_term_id
obsSelf-reported ethnicity. ontologyIf organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this must be a HANCESTRO term or "unknown". Otherwise, for all other organisms, this must be "na".

Requirements for data contributors adhering to GDPR or like standards: HCA will be collecting ethnicity data as part of HCA’s Tier 2 metadata that is protected by managed access, therefore please put 'unknown' for this field.
unknownMUSTCELLxGENE core schemadonorpublic
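The organism-dependent rule for self_reported_ethnicity_ontology_term_id can be sketched as below. The check only inspects the "HANCESTRO:" prefix and does not confirm that the term exists in the ontology; the function name is hypothetical.

```python
HUMAN = "NCBITaxon:9606"


def ethnicity_value_ok(organism_term: str, ethnicity_value: str) -> bool:
    """Apply the organism-dependent rule for the ethnicity field."""
    if organism_term == HUMAN:
        # Homo sapiens: a HANCESTRO term or "unknown".
        return ethnicity_value == "unknown" or ethnicity_value.startswith("HANCESTRO:")
    # Any other organism: the value must be "na".
    return ethnicity_value == "na"


print(ethnicity_value_ok("NCBITaxon:9606", "unknown"))
print(ethnicity_value_ok("NCBITaxon:10090", "unknown"))
```

For contributors following the GDPR guidance in this row, the human case reduces to the single value "unknown".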
49
CxG mandatory biological metadata terms:

development_stage_ontology_term_id
obsAge of the subject. string
If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, this should be an HsapDv term. If organism_ontology_term_id is "NCBITaxon:10090" for Mus musculus, this should be an MmusDv term.

Requirements for data contributors adhering to GDPR or like standards: HCA requests that you do not submit year-specific terms. For convenience, below are some broader age bracket ontology terms:
Embryonic stage = A term from the set of Carnegie stages 1-23 = (up to 8 weeks after conception; e.g. HsapDv:0000003)
Fetal development = A term from the set of 9 to 38 week post-fertilization human stages = (9 weeks after conception and before birth; e.g. HsapDv:0000046)
Post natal =
Years 0-14 HsapDv:0000264
Years 15-19 HsapDv:0000268
Years 20-29 HsapDv:0000237
Years 30-39 HsapDv:0000238
Years 40-49 HsapDv:0000239
Years 50-59 HsapDv:0000240
Years 60-69 HsapDv:0000241
Years 70-79 HsapDv:0000242
Years 80-89 HsapDv:0000243
HsapDv:0000237; unknownMUSTCELLxGENE core schemasamplepublic
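The post-natal age brackets listed above can be turned into a lookup from age in years to the broader HsapDv term. This is a minimal sketch using only the brackets given in this row; ages outside them (for example 90 and above) return None because no term is listed here. The function name is hypothetical.

```python
from typing import Optional

# (first year, last year, term) for each bracket listed in this row.
AGE_BRACKETS = [
    (0, 14, "HsapDv:0000264"),
    (15, 19, "HsapDv:0000268"),
    (20, 29, "HsapDv:0000237"),
    (30, 39, "HsapDv:0000238"),
    (40, 49, "HsapDv:0000239"),
    (50, 59, "HsapDv:0000240"),
    (60, 69, "HsapDv:0000241"),
    (70, 79, "HsapDv:0000242"),
    (80, 89, "HsapDv:0000243"),
]


def age_bracket_term(age_years: float) -> Optional[str]:
    """Return the bracket term for a post-natal age in years, or None if not listed."""
    if age_years < 0:
        return None
    completed_years = int(age_years)  # 14.9 years still falls in the 0-14 bracket
    for first, last, term in AGE_BRACKETS:
        if first <= completed_years <= last:
            return term
    return None


print(age_bracket_term(25))
print(age_bracket_term(95))
```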
53
Collection metadata submitted by email:
54
consortiaunsList relevant consortia, specifically HCA.stringHCAHCAMUSTdatasetpublic
55
descriptionunsShort description of the dataset.stringN/A - MUSTdatasetpublic
56
contact name_emailunsContact name and email of the submitter.stringN/APolly Bloggs, pbloggs@gmail.comMUSTdatasetpublic
57
publication_doiuns
The publication digital object identifier (DOI) for the protocol. If neither a pre-print nor a publication exists, please write 'not applicable'.
stringN/A10.1016/j.cell.2016.07.054MUST
To enable data to be linked to the publication.
datasetpublic