Dataset and software creation
best practices v.1.2
AIFARMS Data Management Working Group
Spring 2025
Table of Contents
Data collection considerations
Software publication considerations
Dataset publication considerations
Data/software packaging/preservation
Appendix
Dataset creation checklist❃
Software creation checklist❃
❃Print this page and check off the boxes that you have completed. The checkboxes are linked to the relevant sections in the
document. Alternatively, create a copy of the following Dataset/Software creation checklist templates
(courtesy of Ben Collins) and use them as an interactive checklist.
Data collection considerations
Data storage considerations
Data collection activities require storage considerations:
“data storage” by Alone forever from
Data collection documentation❃
Data collection process requires documentation:
❃More information on data collection documentation and organization can be found in the Appendix.
“Information” by ABDUL LATIF from Noun Project
Dataset publication considerations
Dataset license considerations
The publication of a dataset will require a license consideration:
License by Template from Noun Project (CC BY 3.0)
Creative Commons data licenses
A Creative Commons license is a common choice for a dataset that can be shared with the wider research community:
“Creative Commons” created by Austin Condiff from the Noun Project
An example of an AIFARMS dataset and its Creative Commons license
“Multi-camera pig tracking dataset” accompanies the Multi-camera pig tracking software that was developed under the AIFARMS auspices; the Creative Commons BY-NC-ND 4.0 license was selected to govern the use of the dataset:
The highly visual nature of the dataset, which consists of annotated images and videos, determined the selection of a Creative Commons license with stricter conditions.
Image source: https://creativecommons.org/about/cclicenses/
Proprietary and confidential data
confidential by Start Up Graphic Design from Noun Project (CC BY 3.0)
Dataset hosting platform considerations
The publication of a dataset will require a hosting platform consideration:
Illinois Data Bank is another hosting platform to consider.
“Cloud Computing” by ProSymbols from
Questions about Dataset hosting platforms
questions by AVAM from Noun Project (CC BY 3.0)
Dataset contact person considerations
Consider assigning a contact person to your dataset:
“contact person” by Ranah Pixel Studio from the
Dataset error reporting considerations
Consider whether you would be willing to make changes to the dataset if any errors are reported:
“Error” by ProSymbols from the Noun Project
Dataset citation considerations
Consider creating a citation file for your dataset:
Video source: ICPSR website: https://www.icpsr.umich.edu/web/pages/datamanagement/citations.html
Dataset DOI considerations
Consider assigning a DOI for your dataset:
can help identify a suitable hosting platform for your dataset.
Image source: https://www.doi.org/
Dataset metadata file considerations
Consider creating and adding a metadata file to your dataset:
“Metadata” by M. Oki Orlando from the
Datasheet for datasets
If you have created a machine learning dataset, consider creating an accompanying datasheet to facilitate the use of the dataset:
Please get in touch if you would like to learn more about this template.
“Datasheet” by Gacem Tachfin from the
Croissant metadata format
For making the datasets ML- and AI-ready, AIFARMS has adopted the Croissant metadata format:
Please get in touch with the Data Management working group for assistance with adding or converting to the Croissant metadata format for your dataset.
Akhtar M, Benjelloun O, Conforti C, et al. Croissant: A Metadata Format for ML-Ready Datasets. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning. 2024;
pastry by Gregor Cresnar from Noun Project (CC BY 3.0)
Software publication considerations
Software license considerations
Publication of your software code will require a license consideration:
Classification of software licenses according to their permissiveness level
Image source: https://bit.ly/3ct5HWF
Image source: https://opensource.org/
What license should I use?
Public domain versus open source software
Although there is a considerable amount of overlap, the two concepts have distinct characteristics:
“Public Domain Nouns” from the Noun Project
Software hosting considerations
Publication of software requires hosting platform considerations:
“Cloud Computing” by ProSymbols from the
Software contact person considerations
Published software requires determinations regarding its maintenance and development:
“development” by Gregor Cresnar from
Software documentation considerations
As with datasets, consider creating documentation for your published software:
“write documentation” by Juicy Fish from the Noun Project
Programming style guidelines
Consider following a programming style guideline when preparing your code for publication and writing documentation for your code:
“Clean coding” by Nhor from the Noun Project
Software citation considerations
Consider using the Citation File Format (.cff) for your software:
citation by Alice Design from the
Software DOI considerations
Consider obtaining a unique Digital Object Identifier (DOI) for your software:
Image source: https://www.doi.org/
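A minimal sketch of what a software citation file might contain, assuming the Citation File Format 1.2.0 schema, whose required keys are cff-version, message, title, and authors. The helper name build_citation_cff, the software title, the author name, and the version number are placeholders for illustration and do not describe any AIFARMS project.

```python
def build_citation_cff(title, authors, version=None, doi=None):
    """Return the text of a minimal CITATION.cff file.

    `authors` is a list of (given_names, family_names) tuples.
    Values are wrapped in double quotes as-is, so this sketch assumes
    they contain no double quotes or backslashes themselves.
    """
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f'title: "{title}"',
        "authors:",
    ]
    for given, family in authors:
        lines.append(f'  - given-names: "{given}"')
        lines.append(f'    family-names: "{family}"')
    # Optional keys are written only when a value is supplied.
    if version:
        lines.append(f'version: "{version}"')
    if doi:
        lines.append(f'doi: "{doi}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder values; replace them with the details of your own software.
    print(build_citation_cff("Example Analysis Toolkit", [("Jane", "Doe")], version="0.1.0"))
```

The resulting text would normally be saved as CITATION.cff in the root folder of the code repository. Once a DOI has been assigned, it can be passed through the doi argument so that the citation and the identifier stay together.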
Contributor License Agreement
If you are planning to maintain and develop an open source software project and accept contributions to it, consider creating a Contributor License Agreement (CLA):
Dataset/software packaging and preservation
Data packaging considerations
Examples of data packaged core datasets
Once the dataset is ready to be shared with a wider research community, the data files will require organization/packaging considerations:
RO-Crate file
To increase the visibility of your published dataset/software and to make sure they are preserved, consider creating an RO-Crate file for your dataset and/or software:
through the AIFARMS data repository;
Please get in touch if you would like to learn more about the RO-Crate file.
Image source: https://www.researchobject.org/ro-crate/
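As a rough illustration of what such a file looks like, the sketch below assembles a minimal ro-crate-metadata.json following the RO-Crate 1.1 layout: a metadata descriptor that points at the root dataset, the root dataset itself, and one entity per file. The function name build_ro_crate_metadata and all example values (dataset name, description, date, license URL, file path) are assumptions made for illustration.

```python
import json


def build_ro_crate_metadata(name, description, date_published, license_url, files):
    """Return a dict for a minimal ro-crate-metadata.json (RO-Crate 1.1 layout)."""
    # The metadata descriptor identifies this JSON file and the dataset it describes.
    descriptor = {
        "@id": "ro-crate-metadata.json",
        "@type": "CreativeWork",
        "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
        "about": {"@id": "./"},
    }
    # The root dataset carries the descriptive metadata and lists its files.
    root = {
        "@id": "./",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "datePublished": date_published,
        "license": {"@id": license_url},
        "hasPart": [{"@id": path} for path in files],
    }
    file_entities = [{"@id": path, "@type": "File"} for path in files]
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [descriptor, root, *file_entities],
    }


if __name__ == "__main__":
    # Placeholder values for a hypothetical dataset.
    crate = build_ro_crate_metadata(
        name="Example field sensor dataset",
        description="Hourly soil moisture readings from a hypothetical test plot.",
        date_published="2025-01-15",
        license_url="https://creativecommons.org/licenses/by/4.0/",
        files=["data/readings.csv"],
    )
    print(json.dumps(crate, indent=2))
```

The JSON would be saved as ro-crate-metadata.json in the root folder of the dataset, next to the files it lists.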
BagIt file packaging format
An alternative data packaging format that provides a directory structure for data and metadata:
Image source: https://bit.ly/3wq4fe9
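To make the directory structure concrete, here is a minimal sketch, using only the Python standard library, that turns a folder into a bag as described by the BagIt 1.0 specification (RFC 8493): a data/ payload folder, a bagit.txt declaration, and a manifest of SHA-256 checksums. The function name make_bag and the choice of SHA-256 are assumptions made for this example. Optional tag files such as bag-info.txt are left out.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_dir, bag_dir):
    """Copy source_dir into bag_dir/data and write bagit.txt plus a SHA-256 manifest."""
    source = Path(source_dir)
    bag = Path(bag_dir)
    payload = bag / "data"
    # copytree creates bag_dir as needed and fails if bag_dir/data already exists.
    shutil.copytree(source, payload)

    # The bag declaration names the BagIt version and the tag-file encoding.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )

    # One manifest line per payload file: checksum, then the path relative to the bag.
    manifest_lines = []
    for path in sorted(p for p in payload.rglob("*") if p.is_file()):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  {path.relative_to(bag).as_posix()}")
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag
```

A recipient can recompute the checksums listed in manifest-sha256.txt to confirm that no file was altered or lost in transfer. This sketch reads each file fully into memory, so very large files would call for chunked hashing instead.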
On making your Research FAIR and Reproducible
Arguillas, Florio, Christian, Thu-Mai, Gooch, Mandy, Honeyman, Tom, Peer, Limor, & CURE-FAIR WG. (2022). 10 Things for Curating Reproducible and FAIR Research (1.1). https://doi.org/10.15497/RDA00074.
Further reading:
https://www.go-fair.org/fair-principles/fairification-process/
APPENDIX
How should data file names be constructed?
How long should (data) file names be?
How should (data) file names be formatted?
File type
Analysis type ❃
Time period
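One way to apply a naming convention consistently is to build file names with a small helper rather than by hand. The sketch below is only an illustration under assumed rules (lowercase, no spaces, hyphens within a component, underscores between components). The function names, the extra subject component, and the example values are hypothetical and are not prescribed by this guide.

```python
import re


def slug(text):
    """Lowercase text and replace runs of non-alphanumeric characters with a hyphen."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def build_file_name(subject, analysis_type, time_period, extension):
    """Join the descriptive parts with underscores and append the file type."""
    parts = [slug(subject), slug(analysis_type), slug(time_period)]
    # Empty components are skipped so the name never contains doubled underscores.
    return "_".join(p for p in parts if p) + "." + extension.lower().lstrip(".")


print(build_file_name("Pig Tracking", "Summary Stats", "2024-06", "CSV"))
# prints "pig-tracking_summary-stats_2024-06.csv"
```

Writing the time period in year-month order keeps files in chronological order when they are sorted by name.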
How should data files be organized?
❃If possible, avoid creating nested folders beyond three levels
❃Consider including a Documentation/README file in the root folder to explain the structure of the folder, its contents, and file naming conventions.
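The two recommendations above can be checked automatically before a dataset is shared. The following sketch assumes that "three levels" is counted from the root folder of the dataset and that a documentation file is any file in the root whose name starts with "readme". Both are interpretations made for this example, and the function names are hypothetical.

```python
from pathlib import Path


def folders_deeper_than(root, max_depth=3):
    """Return the folders nested more than max_depth levels below root."""
    root = Path(root)
    too_deep = []
    for path in root.rglob("*"):
        if path.is_dir():
            # A folder directly inside root has depth 1.
            depth = len(path.relative_to(root).parts)
            if depth > max_depth:
                too_deep.append(path)
    return sorted(too_deep)


def has_readme(root):
    """Return True if the root folder contains a file whose name starts with 'readme'."""
    return any(
        p.is_file() and p.name.lower().startswith("readme")
        for p in Path(root).iterdir()
    )
```

Calling folders_deeper_than("my_dataset") returns an empty list when the folder structure stays within the limit, and has_readme("my_dataset") reports whether a documentation file is present in the root folder. The folder name my_dataset is a placeholder.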
What type of information should a documentation file include?❃
Two levels of data collection documentation/description:
Study level data:
Data level data:
❃In addition to the dataset title, Principal Investigator’s name and email address, keywords for the dataset, funding sources, language information
Data Disclaimer examples
“The data are provided ‘as is’ and the originating source for the data are not liable for any damages.”
“The accuracy or reliability of the data is not guaranteed or warranted in any way and the providers disclaim liability of any kind whatsoever, including, without limitation, liability for quality, performance, merchantability and fitness for a particular purpose arising out of the use, or inability to use the data.”
“The user of this dataset will need to take care of handling missing observations, outliers and violations of logical consistency.”
More Data Disclaimer examples
Questions?
Who to get in touch with?
How to get assistance?
Get in touch with the AIFARMS Data Management working group team:
Get in touch with the Office of Technology Management for questions
regarding customized Dataset/Software licenses and Intellectual Property Management:
Subscribe to the Data Nudge newsletter from the University of Illinois Library Research Data Services (archive of past Data Nudges) or get in touch with the service directly to seek assistance.
Svetlana Sowers - svsowers@illinois.edu
Current members of the AIFARMS Data Management Working Group
Vikram Adve
Jessica Wedow
Matthew Hudson
Ana Lučić
Rob Kooper
Isabella Condotta
Melanie Rodriguez
Past members of the AIFARMS Data Management Working Group
Jingrui He
Alex Kuhl
Pradeep Senthill
Roser Matamala
Ryan Dilger
Carl J. Bernacchi