
Dataset and software creation best practices v.1.2

AIFARMS Data Management Working Group

Spring 2025


Table of Contents

Data collection considerations

Software publication considerations

Dataset publication considerations

Data/software packaging/preservation

Appendix


Dataset creation checklist

Software creation checklist

Print this page and check off the boxes that you have completed. The checkboxes are linked to the relevant sections in the document. Alternatively, create a copy of the following Dataset/Software creation checklist templates (courtesy of Ben Collins) and use them as an interactive checklist.


Data collection considerations


Data storage considerations

Data collection activities require storage considerations:

  • AIFARMS can support data collection efforts by providing storage space on NCSA’s Taiga File System:
    • Data size: datasets smaller than 1 TB are accepted; larger datasets require permission.
    • Storage duration: the AIFARMS lifetime (if needed, we will try to secure funding for a longer period, beyond the AIFARMS lifetime).
  • Please get in touch with Rob Kooper (kooper@illinois.edu) to inquire about data storage options.
  • Storing raw data in at least two locations (e.g. UIUC servers and other institutions) with frequent backups and version control is strongly recommended.
  • Storing data on personal computers and portable storage devices is not recommended because these devices may not be regularly backed up.
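
One way to confirm that a second storage location really holds the same raw data is to compare checksums of the two copies. The following is a minimal sketch, not an AIFARMS tool; the function names and the idea of comparing two local folder paths are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file path (relative to root) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(primary: Path, backup: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different in the backup."""
    a, b = build_manifest(primary), build_manifest(backup)
    return {
        "missing_in_backup": sorted(set(a) - set(b)),
        "extra_in_backup": sorted(set(b) - set(a)),
        "changed": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

Calling compare_copies(Path("raw"), Path("/mnt/backup/raw")) (both paths hypothetical) returns three lists; all three being empty suggests the two copies match.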

“data storage” by Alone forever from Noun Project


Data collection documentation

The data collection process requires documentation:

    • It is common practice to create a documentation file (frequently called a README file), formatted as a text file, PDF, or web page, that describes the project and the data being collected;
    • The documentation file usually resides in the same folder as the data; it provides fuller context for the project and makes the data easier for your research team to understand;
    • An example of a README.txt template can be accessed through the following Box link;
    • This template can be adapted to your data collection parameters.

*More information on data collection documentation and organization can be found in the Appendix

“Information” by ABDUL LATIF from Noun Project


Dataset publication considerations


Dataset license considerations

The publication of a dataset will require a license consideration:

    • By assigning a license to your dataset, you give permission to others to use your data and specify the conditions under which the data can be used;
    • License considerations will have an impact on how the data can be hosted, shared, and used;
    • If you publish a dataset on the Web and make it available to the public but do not assign a license, the terms and conditions of data usage are undefined and some users may refrain from using the data;
    • It is not necessary to register a work with the Copyright Office in Washington in order to copyright it, nor is it necessary to include a copyright notice. Data that are collected facts may not be subject to copyright (or license);
    • You can get acquainted with a dataset that is freely available on the Web, view it, and be informed by it, but you shouldn’t assume that a derivative work can be created merely because the dataset is available on the Web and no specific restrictions are mentioned.

License by Template from Noun Project (CC BY 3.0)


Creative Commons data licenses

A Creative Commons license is a common choice for a dataset that can be shared with the wider research community:

    • The Creative Commons License Chooser site (in beta) can help you select the Creative Commons license that is appropriate for your dataset.

“Creative Commons” created by Austin Condiff from the Noun Project


An example of an AIFARMS dataset and its Creative Commons license

“Multi-camera pig tracking dataset” accompanies the Multi-camera pig tracking software that was developed under the AIFARMS auspices; the Creative Commons BY-NC-ND 4.0 license was selected to govern the use of the dataset:

  • “BY” stands for Attribution, meaning that appropriate credit must be given to the creators of the dataset.
  • “NC” stands for NonCommercial, meaning that only non-commercial uses of the dataset are allowed.
  • “ND” stands for NoDerivatives, meaning that no derivatives or modifications of the dataset are allowed without explicit permission.

The highly visual nature of the dataset, which consists of annotated images and videos, determined the selection of a Creative Commons license with stricter conditions.

Image source: https://creativecommons.org/about/cclicenses/


Proprietary and confidential data

  • In cases when data is licensed from a third party and cannot be shared with the public, it is important to mark the data as proprietary and/or confidential;
  • The status of the data and the restrictions for sharing should be made explicit;
  • This can be done by adding a statement in the README file at the top of the data folder (so that it is clear that this applies to all the folders beneath it).

confidential by Start Up Graphic Design from Noun Project (CC BY 3.0)


Dataset hosting platform considerations

The publication of a dataset will require a hosting platform consideration:

    • The AIFARMS data portal gathers the datasets generated under the AIFARMS auspices:
      • You can find either the actual data or the descriptions of the datasets, depending on the license;
    • Scholarly or discipline-specific repositories are recommended platforms to consider for the publication of your dataset;
    • For datasets that can be made available under a Creative Commons license, Illinois Data Bank is another hosting platform to consider.
    • Google Drive, Box, and Dropbox folders are not suitable for hosting/sharing your dataset:
      • These platforms are not meant for long-term data storage, curation, and management; they don’t provide statistics, nor do they give guarantees about the longevity of the data.

“Cloud Computing” by ProSymbols from Noun Project


Questions about Dataset hosting platforms

  • Please get in touch with the Data Management working group to let us know if you have published a dataset or have questions about the best place to publish it.
  • There are many options available, each with their own unique requirements and restrictions.

questions by AVAM from Noun Project (CC BY 3.0)


Dataset contact person considerations

Consider assigning a contact person to your dataset:

    • Consider selecting one or more researchers or faculty members from your team who are knowledgeable about the dataset and are expected to remain at your institution for a longer period of time.
    • If undergraduate students, graduate students, or postdoctoral researchers are involved in the dataset collection/creation, consider their intended graduation/finishing date when assigning a contact person for the dataset.

“contact person” by Ranah Pixel Studio from the Noun Project


Dataset error reporting considerations

Consider whether you would be willing to make changes to the dataset if any errors are reported:

    • If you are willing to make changes, consider adding to the description of the dataset that users can report errors/inconsistencies that they encounter;
    • If you are not willing to make the changes, consider adding a Data Disclaimer to the data description through which you would specify that the user should be responsible for handling any encountered inconsistencies;
    • Several examples of disclaimers are included in the Appendix.

“Error” by ProSymbols from the Noun Project


Dataset citation considerations

Consider creating a citation file for your dataset:

  • A dataset is an independent research artifact that requires its own citation file;
  • Researchers who helped collect, compile, or pre-process the data may not be the same researchers who contributed to the research paper, and they should be credited in the dataset citation;
  • Consider using the Citation File Format (.cff) for your dataset and software;
  • The CFFINIT tool can assist with the creation of the .cff file for the dataset;
  • More information on how to format the dataset citation is available through the University of Illinois Library Data Nudge service: https://emails.illinois.edu/newsletter/1922790509.html

Video source: ICPSR website: https://www.icpsr.umich.edu/web/pages/datamanagement/citations.html


Dataset DOI considerations

Consider assigning a DOI for your dataset:

    • A DOI identifies the dataset as a research object and increases its findability by providing a stable identifier;
    • DOI assignment is done by a registration agency; a short list is available at: https://www.doi.org/registration_agencies.html
    • Examples of Data and software repositories that can assign a DOI:
    • Re3data is a registry of research data repositories that can help identify a suitable hosting platform for your dataset.

Image source: https://www.doi.org/


Dataset metadata file considerations

Consider creating and adding a metadata file to your dataset:

  • A metadata file enhances findability of data;
  • Common metadata formats include the machine-readable XML and JSON formats;
  • To fill out a metadata file:
    • Consider using Dublin Core, a common vocabulary used for description of a range of digital objects.
    • Alternatively, consider using agriculture domain controlled vocabularies such as:

“Metadata” by M. Oki Orlando from the Noun Project
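
To make the Dublin Core suggestion above more concrete, here is a minimal sketch that builds a JSON metadata record limited to the fifteen elements of the Dublin Core Metadata Element Set. The "dc:" key prefix, the function name, and every value in the example record are assumptions for illustration; a repository may expect a different serialization.

```python
import json

# The fifteen elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = (
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
)


def make_dublin_core_record(**fields):
    """Return a dict keyed by 'dc:<element>', rejecting unknown element names."""
    unknown = sorted(set(fields) - set(DC_ELEMENTS))
    if unknown:
        raise ValueError(f"not Dublin Core elements: {unknown}")
    return {f"dc:{name}": value for name, value in fields.items()}


# Hypothetical values, for illustration only.
record = make_dublin_core_record(
    title="Example field sensor readings",
    creator=["Doe, Jane"],
    date="2025-03-14",
    type="Dataset",
    format="text/csv",
    rights="CC BY 4.0",
)
metadata_json = json.dumps(record, indent=2)
```

The metadata_json string can be written to a file such as metadata.json (a hypothetical name) alongside the data files.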


Datasheet for datasets

If you have created a machine learning dataset, consider creating an accompanying datasheet to facilitate the use of the dataset:

    • The Datasheet for datasets template is a recommended template for documenting machine learning datasets;
    • It is meant to provide the context and motivation for the dataset creation;
    • It facilitates the use of the dataset by the wider community;
    • It should be modified according to the parameters of your project;
    • It can accompany the publication of your dataset or serve as the basis for a data paper.

Please get in touch if you would like to learn more about this template.

“Datasheet” by Gacem Tachfin from the Noun Project


Croissant metadata format

For making the datasets ML- and AI-ready, AIFARMS has adopted the Croissant metadata format:

  • Croissant is a format that makes datasets more discoverable, portable, and interoperable;
  • It relies on schema.org in the background to make datasets findable;
  • It provides a specification for describing the dataset splits that are useful for machine learning tasks.

Please get in touch with the Data Management working group for assistance with adding or converting to the Croissant metadata format for your dataset.

Akhtar M, Benjelloun O, Conforti C, et al. Croissant: A Metadata Format for ML-Ready Datasets. Proceedings of the Eighth Workshop on Data Management for End-to-End Machine Learning. 2024.

pastry by Gregor Cresnar from Noun Project (CC BY 3.0)


Software Publishing Considerations


Software license considerations

Publication of your software code will require a license consideration:

  • The Choose-a-license site can assist with the process of license selection.
  • Considerations for choosing a software license:
    • Will you allow commercial applications based on your code?
    • Will you allow modifications/additions to your code?
    • Will you allow redistribution of your code?
  • Changing a software license later is hard:
    • Everyone who has ever contributed must agree to the license change.
  • Review the funding sources that enabled the development of the software project, as they might restrict the release of the software or mandate that it be released under an open source license.
  • Review the license of any third-party code you use, including open source code, as the license terms may contain restrictions that are not immediately obvious.


Classification of software licenses according to their permissiveness level

  • Permissive licenses:
    • Fewest requirements with respect to how the software can be modified and/or distributed (Apache, MIT, NCSA);
    • Allow redistribution and modification of the code without sharing changes back with the open source community;
    • New changes can have a different license (e.g. commercial).
  • Restrictive licenses:
    • More restrictions on software distribution and modification:
      • GNU GPL licenses (copyleft licenses)
    • Derivative works are allowed but must be licensed under the same license;
    • The source code of distributed modified versions must be made available under the same license.
  • Proprietary licenses:
    • No modification or redistribution of the code allowed:
      • e.g. Apple Software, Microsoft software

Image source: https://bit.ly/3ct5HWF

Image source: https://opensource.org/


What License should I use?

  • A Creative Commons license is not a suitable choice for software:
    • Creative Commons licenses do not contain specific terms about the distribution of source code, which is important for free reuse and modification of software.
    • http://bit.ly/3HvCuFL
  • Most open source software projects choose one of the most permissive licenses:
    • i.e. MIT/Apache/NCSA
  • If you are a member of the University of Illinois, you can reach out to the Office of Technology Management (OTM):
    • They can help you find the right license;
    • They can tailor a license to specific customer needs.


Public domain versus open source software

Although there is a considerable amount of overlap, the two concepts have distinct characteristics:

  • Releasing software into the public domain waives copyright, i.e. copyright ownership is given up;
  • An open source license does not waive copyright, i.e. copyright ownership is retained;
  • An open source license is preferred over releasing the software into the public domain;
  • Public domain may not be recognized in all jurisdictions:
    • Works that are in the public domain in one jurisdiction may not be in the public domain worldwide.

“Public Domain Nouns” from the Noun Project


Software hosting considerations

Publication of software requires hosting platform considerations:

  • AIFARMS has a GitHub repository page that collects code created by AIFARMS researchers;
  • Please consider adding your code to the AIFARMS GitHub repository;
  • Although AIFARMS encourages sharing and publication of the code used to create apps, files, datasets, and analyses, sharing the code is not a requirement;
  • Other software hosting platforms include Zenodo, GitLab, BitBucket (private code hosting platform).
  • Of the hosting platforms mentioned above, Zenodo is the only archival repository that guarantees longer-term preservation of the software.

“Cloud Computing” by ProSymbols from the Noun Project


Software contact person considerations

Published software requires determinations regarding its maintenance and development:

  • If the software will be actively maintained and developed, consider assigning one or more contact persons who will handle development;
  • If the software was created as part of a class project or research project and is not likely to be actively developed or maintained in the future, consider making the repository read-only (GitHub calls this archiving; it does not guarantee that the software will be preserved long term);
  • A read-only repository does not accept new development requests or issue reports, but users can still fork and star it;
  • Information on how to make a GitHub repository read-only is available here;

“development” by Gregor Cresnar from Noun Project


Software documentation considerations

As with datasets, consider creating documentation for your published software:

  • Consider adding a README.md (markdown format) or a differently formatted documentation file at the root of your repository to explain what your software does and how to use/run it;
  • Consider adding keywords to your repository that can increase the findability of your project;
  • Consider mentioning in the documentation file whether the project is actively maintained and whether others can contribute to its development.

“write documentation” by Juicy Fish from the Noun Project


Programming style guidelines

Consider following a programming style guideline when preparing your code for publication and writing documentation for your code:

“Clean coding” by Nhor from the Noun Project


Software citation considerations

Software DOI considerations

Consider using the Citation File Format (.cff) for your software:

    • More people using and citing your software is good;
    • Consider adding a CITATION.cff file so people know how to cite your software:
      • Tools can convert the citation file to a file for Zenodo to create a DOI.

Consider obtaining a unique Digital Object Identifier (DOI) for your software:

    • Zenodo can create a DOI for each release of your code on GitHub;
    • Each version of the software can have its own DOI;
    • One DOI should point to the latest version of the code.

citation by Alice Design from the Noun Project

Image source: https://www.doi.org/


Contributor License Agreement

If you are planning to maintain and develop an open source software project and accept contributions to it, consider creating a Contributor License Agreement (CLA):

  • A CLA lays out expectations and makes sure that contributors receive proper credit for their contributions;
  • A CLA also ensures that the contributors have the right to contribute to the project (e.g. the contribution does not violate an agreement that the developer might have with their employer);
  • The CLA-chooser web site can assist with creating a CLA;
  • Contact the Office of Technology Management (OTM) at otm@illinois.edu for help with a CLA.


Dataset/software packaging and preservation


Data packaging considerations

Examples of data packaged core datasets

Once the dataset is ready to be shared with a wider research community, the data files will require organization/packaging considerations:

  • Data files can be organized by including a datapackage.json file (alternatively, a .txt, .xml, .pdf, or .html file) in the top directory that explains the file organization;
  • The data files can be included in either the main directory or a subdirectory;
  • More information about creating a data package and the datapackage.json file can be found here: https://datahub.io/docs/data-packages;
  • Data files can then be archived, published, uploaded, or transferred as a .zip or .tar package.
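
The packaging step above can be sketched as a short script that lists the files under a dataset folder and writes a datapackage.json descriptor in the top directory. This is a minimal sketch assuming the Data Package layout of a name, a title, and a resources array whose entries each carry a name, a path and a format. The helper names are hypothetical, and resource names are derived from file names, so two files with the same stem would need to be renamed by hand.

```python
import json
import re
from pathlib import Path


def _slug(text: str) -> str:
    """Lowercase and replace characters outside a-z, 0-9, '.', '_' and '-'."""
    return re.sub(r"[^a-z0-9._-]+", "-", text.lower()).strip("-")


def build_datapackage(root: Path, name: str, title: str) -> dict:
    """Describe every file under root (except the descriptor itself)."""
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name != "datapackage.json"
    )
    return {
        "name": _slug(name),
        "title": title,
        "resources": [
            {
                "name": _slug(p.stem),
                "path": p.relative_to(root).as_posix(),
                "format": p.suffix.lstrip(".").lower(),
            }
            for p in files
        ],
    }


def write_datapackage(root: Path, name: str, title: str) -> Path:
    """Write datapackage.json into the top directory and return its path."""
    target = root / "datapackage.json"
    descriptor = build_datapackage(root, name, title)
    target.write_text(json.dumps(descriptor, indent=2) + "\n", encoding="utf-8")
    return target
```

After write_datapackage has run, the folder can be archived as a .zip or .tar package with the descriptor included.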


RO-CRATE file

To increase the visibility of your published dataset/software and to make sure they are preserved, consider creating a RO-CRATE file for your dataset and/or software:

  • The RO-CRATE file gathers/packages the research artifacts related to one project in one file;
  • All or several research artifacts from your project can be referenced through this file;
  • Metadata is included in the same file (JSON-LD);
  • This packaged file can be published in the Zenodo repository or, alternatively, hosted through the AIFARMS data repository;
  • This format is particularly convenient for data preservation and curation.

Please get in touch if you would like to learn more about the RO-CRATE file.
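
For orientation, the JSON-LD metadata mentioned above can be sketched as follows. In version 1.1 of the RO-Crate specification the metadata lives in a file named ro-crate-metadata.json at the root of the packaged folder and contains a metadata descriptor plus a root Dataset entity. The function name and its arguments are hypothetical, and the sketch assumes file names without spaces or other characters that would need URI encoding.

```python
import json
from pathlib import Path


def build_ro_crate(root: Path, name: str, description: str,
                   license_url: str, date_published: str) -> dict:
    """Build a minimal RO-Crate 1.1 metadata document for the files under root."""
    files = sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and p.name != "ro-crate-metadata.json"
    )
    graph = [
        {   # Metadata descriptor: points at the root dataset.
            "@id": "ro-crate-metadata.json",
            "@type": "CreativeWork",
            "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
            "about": {"@id": "./"},
        },
        {   # Root dataset: describes the crate as a whole.
            "@id": "./",
            "@type": "Dataset",
            "name": name,
            "description": description,
            "datePublished": date_published,
            "license": {"@id": license_url},
            "hasPart": [{"@id": f} for f in files],
        },
    ]
    # One File entity per packaged file.
    graph.extend({"@id": f, "@type": "File"} for f in files)
    return {"@context": "https://w3id.org/ro/crate/1.1/context", "@graph": graph}


def write_ro_crate(root: Path, **kwargs) -> Path:
    """Write ro-crate-metadata.json into root and return its path."""
    target = root / "ro-crate-metadata.json"
    target.write_text(json.dumps(build_ro_crate(root, **kwargs), indent=2) + "\n",
                      encoding="utf-8")
    return target
```

Software, papers, and other artifacts from the same project can be added to the graph as further entities referenced from hasPart.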

Image source: https://www.researchobject.org/ro-crate/


BagIt file packaging format

BagIt is an alternative data packaging format that provides a directory structure for data and metadata:

  • Specifies a set of files for storing digital content and transferring it over a network;
  • A package in the BagIt format can be created using a Python library;
  • The format is used for preserving digital assets from different domains.
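
To show what the BagIt directory structure looks like, the sketch below builds a minimal bag using only the Python standard library rather than the dedicated library mentioned above. It assumes the layout from the BagIt 1.0 specification (RFC 8493): a bagit.txt declaration, a data/ payload directory, and a manifest-sha256.txt file that lists a checksum for every payload file. The function name is hypothetical, and each file is read fully into memory, which is only reasonable for small files.

```python
import hashlib
import shutil
from pathlib import Path


def make_minimal_bag(source: Path, bag_dir: Path) -> Path:
    """Copy source into bag_dir/data and write the required BagIt tag files.

    bag_dir must not already contain a data/ directory.
    """
    payload = bag_dir / "data"
    shutil.copytree(source, payload)  # also creates bag_dir if needed

    # Bag declaration required by the specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # Payload manifest: one "<checksum>  <path>" line per payload file.
    lines = []
    for path in sorted(payload.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    return bag_dir
```

A dedicated BagIt library additionally handles validation and optional tag files such as bag-info.txt, which this sketch leaves out.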

Image source: https://bit.ly/3wq4fe9


On making your Research FAIR and Reproducible

Arguillas, Florio, Christian, Thu-Mai, Gooch, Mandy, Honeyman, Tom, Peer, Limor, & CURE-FAIR WG. (2022). 10 Things for Curating Reproducible and FAIR Research (1.1). https://doi.org/10.15497/RDA00074.


APPENDIX


How should data file names be constructed?

    • File names can reflect the type of data being collected, the time of collection, the data collection instrument or questionnaire used to collect the data, or the version of the data (e.g. v.1.0, v.1.1).

How long should (data) file names be?

    • Ideally less than 25 characters long.

How should (data) file names be formatted?

    • Spaces and special characters in file names should be avoided;
    • The following date formats are recommended (ISO-8601):
      • YYMMDD or YYYYMMDD ensures that data is listed chronologically;
    • When dealing with multiple versions of a file, consider using whole numbers to indicate major changes (e.g., 1.0, 2.0);
    • For minor changes, consider using decimals or a date at the end of the file name to indicate a different version (e.g., 1.0, 1.1, 1.2).
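
The naming recommendations above can be turned into a small helper that builds a file name and checks it against the length and character guidance. This is a minimal sketch; the ordering of the name parts (data type, date, version), the underscore separator, and the function names are assumptions, not an AIFARMS convention.

```python
import re
from datetime import date

MAX_LENGTH = 25  # "ideally less than 25 characters"
SAFE_NAME = re.compile(r"[A-Za-z0-9._-]+")  # no spaces or special characters


def make_file_name(data_type: str, collected_on: date,
                   version: str, extension: str) -> str:
    """Build a name such as soil_20250314_v1.1.csv (date as YYYYMMDD)."""
    return f"{data_type}_{collected_on:%Y%m%d}_v{version}.{extension}"


def check_file_name(name: str) -> list[str]:
    """Return a list of problems; an empty list means the name follows the guidance."""
    problems = []
    if len(name) >= MAX_LENGTH:
        problems.append(
            f"{len(name)} characters long (aim for fewer than {MAX_LENGTH})"
        )
    if not SAFE_NAME.fullmatch(name):
        problems.append("contains spaces or special characters")
    return problems


name = make_file_name("soil", date(2025, 3, 14), "1.1", "csv")
```

Because the date is written as YYYYMMDD, sorting such names alphabetically also sorts them chronologically within each data type.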


How should data files be organized?

By file type:

  • Dataset.A
    • Data Processed
    • Data Raw
  • Results
    • Figure.1
    • Figure.2
    • Models
  • README.txt

By analysis type:

  • Dataset.B
    • Figure.1
      • Data
      • Results
    • Figure.2
      • Data
      • Results
    • Table.1
      • Data
      • Results
  • README.txt

By time period:

  • Dataset.C
    • YYYYMMDD
    • YYYYMMDD
    • YYYYMMDD
  • Results
    • Figure.1
    • Figure.2
    • Models
  • README.txt

❃If possible, avoid creating nested folders beyond three levels

❃Consider including a Documentation/README file in the root folder to explain the structure of the folder, its contents, and file naming conventions.


What type of information should a documentation file include?

Two levels of data collection documentation/description:

Study level data:

  • Description of data collection context
  • Purpose of data collection
  • Data collection instruments
  • Data transformation protocols
  • Span of collection

Data level data:

  • Description of data files contents
  • Variables, labels, and types (code book, variable description)
  • Values, measures, and units used
  • Encoding/interpretation methods
  • Treatment of missing data
  • Handling of private data

❃These are in addition to the dataset title, the Principal Investigator’s name and email address, keywords for the dataset, funding sources, and language information.
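
The items listed above can be collected into an empty README skeleton for the research team to fill in. The sketch below is only one possible layout, not the README.txt template referenced earlier in this document; the section headings and the function name are assumptions.

```python
# Prompts taken from the documentation items listed above.
SECTIONS = {
    "General": [
        "Dataset title",
        "Principal Investigator name and email address",
        "Keywords",
        "Funding sources",
        "Language",
    ],
    "Study level": [
        "Description of data collection context",
        "Purpose of data collection",
        "Data collection instruments",
        "Data transformation protocols",
        "Span of collection",
    ],
    "Data level": [
        "Description of data files contents",
        "Variables, labels, and types",
        "Values, measures, and units used",
        "Encoding/interpretation methods",
        "Treatment of missing data",
        "Handling of private data",
    ],
}


def render_readme_skeleton(sections=SECTIONS) -> str:
    """Return README text with one empty 'prompt: ' line per documentation item."""
    lines = []
    for heading, prompts in sections.items():
        lines.append(heading.upper())
        lines.extend(f"{prompt}: " for prompt in prompts)
        lines.append("")  # blank line between sections
    return "\n".join(lines)


readme_text = render_readme_skeleton()
```

Saving readme_text as README.txt in the root folder gives every dataset the same starting structure.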


Data Disclaimer examples

“The data are provided ‘as is’ and the originating source for the data are not liable for any damages.”

“The accuracy or reliability of the data is not guaranteed or warranted in any way and the providers disclaim liability of any kind whatsoever, including, without limitation, liability for quality, performance, merchantability and fitness for a particular purpose arising out of the use, or inability to use the data.”

“The user of this dataset will need to take care of handling missing observations, outliers and violations of logical consistency.”


More Data Disclaimer examples

  1. The U.S. Geological Survey (USGS): https://water.usgs.gov/data/disclaimer.html
  2. National Oceanic and Atmospheric Administration (NOAA): https://tidesandcurrents.noaa.gov/disclaimers.html
  3. Environmental Protection Agency (EPA): https://www.epa.gov/web-policies-and-procedures/epa-disclaimers
  4. Forest Service: https://www.fs.usda.gov/database/gps/disclaimer.htm
  5. USDA “Energy” page disclaimer: https://www.wctsservices.usda.gov/Energy/Disclaimer


Questions?

Who to get in touch with?

How to get assistance?


Get in touch with the AIFARMS Data Management working group team:

Get in touch with the Office of Technology Management for questions regarding customized Dataset/Software licenses and Intellectual Property Management:

Subscribe to Data Nudge newsletter from the University of Illinois Library Research Data Services (archive of past Data Nudges) or get in touch with the service directly to seek assistance.

Rob Kooper - kooper@illinois.edu

Ana Lucic - alucic2@illinois.edu

Svetlana Sowers - svsowers@illinois.edu


Current members of the AIFARMS Data Management Working Group

Vikram Adve

Jessica Wedow

Matthew Hudson

Ana Lučić

Rob Kooper

Isabella Condotta

Melanie Rodriguez


Past members of the AIFARMS Data Management Working Group

Jingrui He

Alex Kuhl

Pradeep Senthill

Roser Matamala

Ryan Dilger

Carl J Bernacchi