Guidelines on Dataset Versioning

Datasets are often produced and made available on an “as-is” or ad hoc basis, without proper versioning information. Changes to the data or its properties may invalidate applications that consume the data or cause them to produce incorrect results. Dataset versioning is therefore an important concept when encouraging the provision and use of datasets.

These guidelines are intended for adoption by data providers offering data services to data users (or clients). In recent times, there has been a push for openness in the sharing of data, not only in terms of access, but also in allowing collaboration on and modification of datasets.
Data providers can make data available for consumption by other parties either as bulk downloads or through queries using application programming interfaces (APIs). In such an open environment, synchronising dataset updates becomes more complex, because of the need to differentiate among the various deltas from all distributed and collaborative datasets, as well as between a dataset and its preceding versions. Data users therefore have to rely on clear records of versioning for traceability, and to establish the reliability of the data they consume.
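
As an illustration, the sketch below shows the kind of version record a data provider might publish alongside each dataset release so that data users can trace exactly which version they consumed. The field names, dataset identifier and versioning scheme are assumptions made for this example and are not prescribed by these guidelines.

    from dataclasses import dataclass
    from datetime import datetime, timezone
    from typing import Optional
    import hashlib
    import json


    @dataclass
    class DatasetVersion:
        # Illustrative field names only; not prescribed by the guidelines.
        dataset_id: str                # stable identifier for the dataset as a whole
        version: str                   # version label, e.g. date-based or semantic
        parent_version: Optional[str]  # preceding version this release derives from
        checksum_sha256: str           # fingerprint of the published snapshot
        released_at: str               # ISO 8601 timestamp of publication
        change_summary: str            # human-readable description of the change


    def make_version_record(dataset_id: str, version: str,
                            parent_version: Optional[str],
                            snapshot: bytes, change_summary: str) -> DatasetVersion:
        """Build a version record for a newly published dataset snapshot."""
        return DatasetVersion(
            dataset_id=dataset_id,
            version=version,
            parent_version=parent_version,
            checksum_sha256=hashlib.sha256(snapshot).hexdigest(),
            released_at=datetime.now(timezone.utc).isoformat(),
            change_summary=change_summary,
        )


    if __name__ == "__main__":
        snapshot = b"id,reading\n1,10\n2,20\n"   # stand-in for a bulk download
        record = make_version_record("example-dataset", "2024-06-01",
                                     "2024-05-01", snapshot,
                                     "Added readings collected in May 2024")
        print(json.dumps(record.__dict__, indent=2))
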
This document discusses issues pertaining to dataset versioning and provides guidelines on best practices. Following these guidelines allows data users to better understand the processes used in collating the data, which in turn enables them to reproduce those processes so that the data can be analysed for defects. Dataset versioning should also allow data users to easily detect incremental changes, as well as deltas between similar datasets.
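
As a simple illustration of detecting such deltas, the sketch below compares two snapshots of a dataset and reports the rows that were added, removed or modified. It assumes each snapshot is a CSV file keyed by an “id” column; the file names and key column are illustrative only.

    import csv
    from typing import Dict, Set, Tuple


    def load_keyed_rows(path: str, key: str = "id") -> Dict[str, dict]:
        """Read a CSV snapshot into a mapping from the key column to the full row."""
        with open(path, newline="") as f:
            return {row[key]: row for row in csv.DictReader(f)}


    def delta(old: Dict[str, dict],
              new: Dict[str, dict]) -> Tuple[Set[str], Set[str], Set[str]]:
        """Return the keys of rows added, removed and modified between two versions."""
        added = set(new) - set(old)
        removed = set(old) - set(new)
        changed = {k for k in set(old) & set(new) if old[k] != new[k]}
        return added, removed, changed


    if __name__ == "__main__":
        v1 = load_keyed_rows("dataset_v1.csv")   # earlier version of the dataset
        v2 = load_keyed_rows("dataset_v2.csv")   # later version of the dataset
        added, removed, changed = delta(v1, v2)
        print("added rows:", sorted(added))
        print("removed rows:", sorted(removed))
        print("changed rows:", sorted(changed))
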
