# v1.0.3
This metadata template is for use with marker gene sequence data derived from host-associated environmental samples. It is adapted from the MIMARKS: survey, host-associated package to include Darwin Core terms and terms recommended by NOAA Omics.
Sheet definitions
study_data: Metadata about the study, such as project name, description, funding info, and other project-level metadata required by NCBI and OBIS. This is filled out at the start of a project.
sample_data: Contextual data about the samples collected, such as when each sample was collected, where it was collected from, what kind of sample it is, and the properties of the environment or experimental condition from which it was taken. Each row is a distinct sample. Most of this information is recorded during sample collection. Many terms have controlled vocabulary, such as organism, env_broad_scale, and waterBody. This sheet contains information that is submitted to NCBI when generating a BioSample. Other important fields for metadata processing include amplicon_sequenced, which helps to link together different types of metadata. This sheet contains terms from the MIMARKS survey host-associated 6.0 package. For other types of samples (eg, sediment), use the appropriate AOML_MIMARKS.survey.sediment file.
prep_data: Contextual data about how the samples were prepared for sequencing, including how they were extracted, what amplicon was targeted, and how they were sequenced. The 1st section of this sheet is in the format for an NCBI SRA upload and should NOT be rearranged or renamed. Each row is a separate sequencing library preparation, distinguished by a unique library_id. One sample from sample_data could be represented multiple times on this sheet if multiple marker genes were amplified.
analysis_data: Data about processing from raw sequences to the derived outputs, including software versions, processing parameters, and the reference database used. Often there is only one row for each amplicon that is sequenced.
asv_data: File generated by Tourmaline, containing ASV featureid, DNA sequence, assigned taxonomy, confidence in taxonomy, and read counts for each sample. This file is not stored in the metadata sheet and is not required for submission to NCBI, but is necessary for submission to OBIS. Sample names in the file must match names in the metadata template.
Workflow (New project | Transferring existing project metadata)
Project initiation: study_data
New project: Upon initiating a project, create a copy of the NOAA_MIMARKS.survey.host-associated.6.0 Google Sheet (File -> Make a Copy) and save it in the project's Google Drive Metadata folder with the project_id at the start of the file name (eg, gomecc_AOML_MIMARKS.survey.host-associated.6.0). Fill in as much of `study_data` as is known. You will not have info for 'accessions' until you submit data to NCBI and OBIS.
Transferring existing project metadata: Create a copy of the AOML_MIMARKS.survey.water.6.0 Google Sheet (File -> Make a Copy) and save it in the project's Google Drive Metadata folder with the project_id at the start of the file name (eg, gomecc_AOML_MIMARKS.survey.water.6.0). Fill in as much of `study_data` as is known. You will not have info for 'accessions' until you submit data to NCBI and OBIS.
Sample collection: sample_data
New project: During sample collection, record required sample-specific info on a separate local data sheet. Key fields to record include serial_number, line_id, station, ctd_bottle_no, sample_replicate, sample_type, notes_sampling, collection_date_local, depth, decimalLongitude, decimalLatitude, geodeticDatum, samp_vol_we_dna_ext, and any environmental variable not being recorded by others on the cruise. As soon as possible, transfer these data to your sample_data sheet, with one row for each filtered water sample. Have someone else double check that all values were input correctly and that GPS coordinates and dates fall within the expected range. Use the sample_data to generate unique sample names for each distinct water sample to be DNA extracted.
Transferring existing project metadata: Transfer sample data from an existing metadata sheet to your sample_data sheet, with one row for each filtered host-associated sample. Have someone else double check that all values were input correctly and that GPS coordinates and dates fall within the expected range. Use the sample_data to generate unique sample names for each distinct host-associated sample to be DNA extracted. Ensure that existing data matches the required formats, and if not, convert them. For example, collection_date should be in UTC time and ISO format. collection_date_local can be in local time, ISO format. For data that does not match columns in the template, create a new column and color it light blue.
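
The double check on dates and GPS coordinates described above can be partly automated. The following is a minimal sketch, assuming each sample_data row has been loaded as a dict keyed by column name. The function name `check_sample_row` and the example values are hypothetical, and the expected date window for a cruise is something you supply yourself.

```python
from datetime import date, datetime


def check_sample_row(row, date_min, date_max,
                     lat_range=(-90.0, 90.0), lon_range=(-180.0, 180.0)):
    """Return a list of problems found in one sample_data row (a dict)."""
    problems = []

    raw_date = str(row.get("collection_date") or "")
    try:
        # Older Python versions reject a trailing "Z" in fromisoformat(),
        # so rewrite it as an explicit UTC offset first.
        collected = datetime.fromisoformat(raw_date.replace("Z", "+00:00")).date()
    except ValueError:
        problems.append(f"collection_date is not ISO format: {raw_date!r}")
    else:
        if not date_min <= collected <= date_max:
            problems.append(f"collection_date outside expected range: {raw_date}")

    for field, (low, high) in (("decimalLatitude", lat_range),
                               ("decimalLongitude", lon_range)):
        try:
            value = float(row.get(field))
        except (TypeError, ValueError):
            problems.append(f"{field} is not a number: {row.get(field)!r}")
            continue
        if not low <= value <= high:
            problems.append(f"{field} outside expected range: {value}")

    return problems


# Example with made-up values: sampling expected during Sept-Oct 2021.
row = {"collection_date": "2021-09-14T12:30:00Z",
       "decimalLatitude": "27.5", "decimalLongitude": "-89.9"}
print(check_sample_row(row, date(2021, 9, 1), date(2021, 10, 31)))
```

Passing a narrower `lat_range` and `lon_range` for the study area catches swapped or sign-flipped coordinates that would still be valid globally. This does not replace having a second person review the sheet.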
Post-sample collection: sample_data
If you have biological replicates (eg, distinct water samples taken from the same Niskin bottle), make sure that you record one unique identifier for the full water sample in source_mat_id and list the corresponding replicates under biological_replicates. Fill in any other sample_data that is known, such as organism, env_broad_scale, env_local_scale, env_medium, geo_loc_name, waterBody, samp_collect_device, samp_mat_process, size_frac, collection_method. Many of these are controlled terms and are the same across projects.
Lab preparation: sample_data
When preparing samples for sequencing in the lab, you will generate other information that should be added to sample_data: amplicon_sequenced, dna_conc, concentrationUnit, and extract_number. You will also add a few more samples that are prepared for sequencing, such as extraction blanks and mock communities. Make sure to select the correct sample_type for these samples.
Lab preparation: prep_data
The prep_data sheet is organized with one row for each sequencing library prep. sample_name must match the name in sample_data, while library_id should be the name that was given to the sequencing center and should be unique to each sequencing library prep (so different between 16S and 18S preps, for example). amplicon_sequenced must match the name provided to amplicon_sequenced in sample_data. Some of this sheet can be filled out prior to PCR prep, as you will already know the PCR primers and conditions being used. prep_data is split into 2 sections after column M. The 1st section mostly contains controlled vocabulary that is submitted to the NCBI SRA. Do NOT reorganize or change the columns in the 1st section. Make sure to record the date and personnel for DNA extractions and PCRs. You will not have the biosample or sra accessions until after submitting to NCBI SRA.
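
The matching rules for prep_data can be checked with a short script. This is a sketch only, assuming both sheets have been loaded as lists of dicts keyed by column name. The function name `check_prep_rows` is hypothetical, and so is the assumption that a sample with several markers lists them in one amplicon_sequenced cell separated by `|`; adjust `delimiter` to whatever your sheet actually uses.

```python
from collections import Counter


def check_prep_rows(sample_rows, prep_rows, delimiter="|"):
    """Return a list of inconsistencies between sample_data and prep_data."""
    problems = []

    # Map each sample_name to the set of amplicon names listed in sample_data.
    # ASSUMPTION: multiple amplicons in one cell are separated by `delimiter`.
    amplicons = {
        row["sample_name"]: {
            name.strip() for name in row["amplicon_sequenced"].split(delimiter)
        }
        for row in sample_rows
    }

    # library_id must be unique to each sequencing library prep.
    counts = Counter(row["library_id"] for row in prep_rows)
    for library_id, n in counts.items():
        if n > 1:
            problems.append(f"library_id {library_id!r} appears {n} times")

    for row in prep_rows:
        name = row["sample_name"]
        if name not in amplicons:
            problems.append(
                f"{row['library_id']}: sample_name {name!r} not in sample_data")
        elif row["amplicon_sequenced"].strip() not in amplicons[name]:
            problems.append(
                f"{row['library_id']}: amplicon_sequenced "
                f"{row['amplicon_sequenced']!r} not listed for sample {name!r}")

    return problems
```

An empty list means every prep row points at a known sample and amplicon, and no library_id is reused.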
After sequencing: prep_data
Once you receive sequences back, enter the filenames in prep_data for each library prep. Upload all sequences to a Google Drive location and link that location on the sheet.
Analysis: analysis_data
The analysis_data sheet has one row for each amplicon_sequenced. Provide short but descriptive info on software, parameters, and versions used for assembling ASVs and assigning taxonomy. Other types of analyses (such as estimates of diversity) should be provided in a code_repo link.
NCBI SRA submission: project_data
Initiate a new NCBI SRA submission. Use the project_data sheet to fill in info for the BioProject if you have not already created a BioProject. If you have an existing BioProject, make sure to add that accession to sample_data.
NCBI SRA submission: sample_data
The sample_data sheet will be used directly to create BioSamples in NCBI. We recommend downloading the Google Sheet as an Excel file, then saving the sample_data sheet as its own Excel file. Delete any columns from this file that you do not want on NCBI (such as date_sheet_modified, modified_by, internal notes). Upload the sample_data Excel file.
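
If you prefer to strip internal columns with a script instead of by hand, the idea can be sketched as below. This assumes the sample_data sheet has been exported as CSV, which is a departure from the Excel route described above. The function name `strip_internal_columns` is hypothetical, and the default column set only covers the two bookkeeping columns named in this template; pass your own set to drop notes columns as well.

```python
import csv

# Bookkeeping columns named in this template; extend as needed.
INTERNAL_COLUMNS = frozenset({"date_sheet_modified", "modified_by"})


def strip_internal_columns(src, dst, internal=INTERNAL_COLUMNS):
    """Copy CSV rows from file object `src` to `dst`, omitting internal columns.

    Returns the list of column names that were kept, in their original order.
    """
    reader = csv.DictReader(src)
    kept = [name for name in reader.fieldnames if name not in internal]
    # extrasaction="ignore" makes the writer skip the dropped columns.
    writer = csv.DictWriter(dst, fieldnames=kept, extrasaction="ignore")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return kept
```

The original export is left untouched, so the full sheet stays available as the record of who modified what and when.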
NCBI SRA submission: prep_data
It is easiest to submit SRA data to NCBI in batches based on sequencing run or amplicon_sequenced, but you can also submit all of your data at once. We recommend submitting different markers separately. In the Google Sheet, create a new sheet that is a copy of the SRA_template sheet and name it based on the marker you are [...]
Guidelines
Ensure that sample names are consistent between sample_data, prep_data, and asv_data.
Ensure that the amplicon name provided in amplicon_sequenced is consistent across sample_data, prep_data, and analysis_data.
Do not reorganize or rename columns in the 1st section of prep_data (before column N).
Keep date_sheet_modified and modified_by as the last 2 columns in each sheet. These are set by the custom onEdit function through Apps Script.
Do not rename headers. If you wish to provide a custom column, you can add it with a cyan header.
Do not make edits to the SRA Terms sheet; it is used for the data validation in prep_data.
Saving this file as an Excel file may lose some data validation functionality.
For blank cells, NCBI only allows 'not collected', 'not applicable' or 'missing'.
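
The blank-cell rule can be applied in bulk before submission. The sketch below assumes rows are dicts keyed by column name; `fill_blank_cells` is a hypothetical helper. Which of the three values is appropriate depends on why the cell is empty, so treat the default as a placeholder to review, not a recommendation.

```python
# The three values this template says NCBI allows for blank cells.
NCBI_MISSING_VALUES = ("not collected", "not applicable", "missing")


def fill_blank_cells(rows, placeholder="not collected"):
    """Return copies of `rows` with empty cells replaced by `placeholder`."""
    if placeholder not in NCBI_MISSING_VALUES:
        raise ValueError(f"placeholder must be one of {NCBI_MISSING_VALUES}")

    def is_blank(value):
        # Treat None and whitespace-only strings as blank.
        return value is None or str(value).strip() == ""

    return [
        {column: (placeholder if is_blank(value) else value)
         for column, value in row.items()}
        for row in rows
    ]
```

The input rows are not modified, so the filled copy can be compared against the original to see which cells were changed.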
Custom AOML terms
sample_data
cruise_id
line_id
station
ctd_bottle_no
sample_replicate
source_mat_id
biological_replicates
extract_number
serial_number
biosample_accession
notes_sampling
project_id
amplicon_sequenced
metagenome_sequenced
collection_date_local
waterBody
decimalLatitude
decimalLongitude
geodeticDatum
dna_conc
concentrationUnit
sample_type
basisOfRecord
date_sheet_modified
modified_by