This document presents a datasheet [2] for the WebVid-CoVR dataset.
Motivation
- For what purpose was the dataset created? The WebVid-CoVR dataset was created for the task of Composed Video Retrieval (CoVR). It aims to provide training data for models that can retrieve videos from a database by searching with both a query image and a query text (a minimal scoring sketch is given at the end of this section).
- Who created the dataset (e.g., which team, research group) and on behalf of which entity (e.g., company, institution, organization)? The dataset was created by the authors as part of a research project.
- Who funded the creation of the dataset? This work was granted access to the HPC resources of IDRIS under the allocation 2023-AD011014223 made by GENCI. The authors acknowledge a research gift from Google, the ANR project CorVis ANR-21-CE23-0003-01, and Antoine Yang's Google PhD fellowship.
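Concretely, the CoVR task mentioned above ranks a gallery of videos against a fused image+text query. Below is a minimal, hedged sketch assuming pre-computed embeddings and a simple average as the fusion step; the paper's model uses a learned cross-modal fusion, and all names here are illustrative.

```python
# Minimal CoVR scoring sketch (illustrative only).
# Assumes pre-computed, same-dimensional embeddings; the simple average
# below stands in for the learned image-text fusion used in the paper.
import numpy as np

def covr_rank(query_img_emb, mod_text_emb, gallery_video_embs):
    """Rank gallery videos for a (query image, modification text) pair.

    query_img_emb:      (d,) embedding of the query image/video
    mod_text_emb:       (d,) embedding of the modification text
    gallery_video_embs: (n, d) embeddings of the n candidate videos
    Returns gallery indices sorted from best to worst match.
    """
    query = (query_img_emb + mod_text_emb) / 2.0      # naive fusion
    query = query / np.linalg.norm(query)
    gallery = gallery_video_embs / np.linalg.norm(
        gallery_video_embs, axis=1, keepdims=True)
    scores = gallery @ query                          # cosine similarities
    return np.argsort(-scores)                        # descending order
```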
Composition
- What do the instances that comprise the dataset represent (e.g., documents, photos, people, countries)? Each instance in WebVid-CoVR represents a triplet containing: (a) a query image/video, (b) a modification text, and (c) a target video.
- How many instances are there in total (of each type, if appropriate)? There are 1.6 million triplets in the WebVid-CoVR training set and 2,556 triplets in the WebVid-CoVR-Test set.
- Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? The training set is a sample automatically generated from a larger set of videos from the WebVid2M dataset [1].
The test set was manually annotated by sampling a random subset of the triplets automatically generated from WebVid8M. (Note that to avoid overlap between the training and test sets, we define WebVid8M as the difference between the full public WebVid10M [1] and the widely used subset WebVid2M.)
- What data does each instance consist of? As described above, each instance consists of a query image id, a modification text, and a target video id (a hedged loading sketch is given at the end of this section).
- Is there a label or target associated with each instance? We provide labels in the form of triplets. For the composed retrieval scenario, the target video in the triplet is considered the correct label.
- Is any information missing from individual instances? No, all data is complete.
- Are relationships between individual instances made explicit (e.g., users' movie ratings, social network links)? Not applicable.
- Are there recommended data splits (e.g., training, development/validation, testing)? Yes, a training set (WebVid-CoVR) and test set (WebVid-CoVR-Test) are defined.
- Are there any errors, sources of noise, or redundancies in the dataset? The automatically generated training set likely contains some noise. We analyze this noise in the Source of noise section of the Appendix of our paper.
- Is the dataset self-contained, or does it link to or otherwise rely on external resources (e.g., websites, tweets, other datasets)? It relies on the links to Shutterstock videos provided by the WebVid dataset [1]. There is no guarantee that they will remain permanent due to copyright and privacy reasons. We note that our dataset users must also adhere to the terms of use stipulated by WebVid [1].
- Does the dataset contain data that might be considered confidential (e.g., data that is protected by legal privilege or by doctor–patient confidentiality, data that includes the content of individuals' non-public communications)? No, the data consists of publicly available videos and captions that have been filtered to remove inappropriate content.
- Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? We have reviewed and filtered any potential offensive data. The details of this procedure can be found in the Appendix of the paper.
- Does the dataset identify any subpopulations (e.g., by age, gender)? No.
- Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? Individuals appearing in a video may be identifiable from the video content itself, although we do not provide any associated personal identity data.
- Does the dataset contain data that might be considered sensitive in any way (e.g., data that reveals race or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? Certain modification texts may mention gender-identifying words such as "man" or "woman", or race-identifying words such as "Asian".
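As a concrete illustration of the instance format described above, here is a hedged sketch of loading triplet annotations; the CSV layout and column names (query_id, modification, target_id) are assumptions for illustration, not the released schema.

```python
# Hedged sketch of loading WebVid-CoVR triplets (illustrative schema).
import csv
from dataclasses import dataclass

@dataclass
class CoVRTriplet:
    query_video_id: str     # id of the query image/video (from WebVid)
    modification_text: str  # text describing the desired change
    target_video_id: str    # id of the target video, i.e. the label

def load_triplets(path):
    """Read triplets from a CSV file with hypothetical columns
    query_id, modification, target_id."""
    with open(path, newline="") as f:
        return [
            CoVRTriplet(
                query_video_id=row["query_id"],
                modification_text=row["modification"],
                target_video_id=row["target_id"],
            )
            for row in csv.DictReader(f)
        ]
```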
Collection Process
- How was the data associated with each instance acquired? The data was acquired by automatically mining similar video-caption pairs from the WebVid2M dataset and using a language model to generate modification texts.
- What mechanisms or procedures were used to collect the data (e.g., hardware apparatuses or sensors, manual human curation, software programs, software APIs)? The data was collected using custom software programs and large language models.
- If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)? The WebVid-CoVR training set was deterministically generated by mining pairs of videos from WebVid2M whose captions differ by a single word. The WebVid-CoVR-Test set follows the same procedure, except that we randomly sample from the triplets generated from WebVid8M and further verify their correctness manually (a hedged sketch of the mining step is given at the end of this section).
- Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)? The authors of the associated paper were involved in the data collection process. One family member helped with the manual annotation of the test set without compensation.
- Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? The data collection was performed during a few months of 2023. The manual annotation took approximately 9 hours over 3 days.
- Were any ethical review processes conducted (e.g., by an institutional review board)? No.
- Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)? The videos come from WebVid [1], which were scraped from a public website Shutterstock.
- Were the individuals in question notified about the data collection? While the contributors made their videos public on web sources, and agreed to broad licensing terms, they were not directly notified or asked to consent to the use of their videos in the WebVid-CoVR dataset specifically.
- Did the individuals in question consent to the collection and use of their data? No (see previous question).
- If consent was obtained, were the consenting individuals provided with a mechanism to revoke their consent in the future or for certain uses? N/A.
- Has an analysis of the potential impact of the dataset and its use on data subjects (e.g., a data protection impact analysis) been conducted? N/A.
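For reference, the mining step described in this section can be sketched as follows. This is a simplified, hedged stand-in: it only finds caption pairs differing by exactly one word, whereas the actual pipeline additionally filters pairs and prompts a large language model to turn the word difference into a modification text.

```python
# Hedged sketch: mine caption pairs that differ by exactly one word.
from collections import defaultdict

def mine_one_word_pairs(captions):
    """captions: dict mapping video_id -> caption string.
    Returns (video_id_a, video_id_b, word_a, word_b) tuples whose
    captions are identical except at a single word position."""
    buckets = defaultdict(list)
    for vid, cap in captions.items():
        words = cap.lower().split()
        # Blank out each position in turn; captions landing in the same
        # bucket agree everywhere except at that position.
        for i in range(len(words)):
            key = (i, tuple(words[:i] + ["_"] + words[i + 1:]))
            buckets[key].append((vid, words[i]))
    pairs = []
    for matches in buckets.values():
        for vid_a, w_a in matches:
            for vid_b, w_b in matches:
                if vid_a < vid_b and w_a != w_b:
                    pairs.append((vid_a, vid_b, w_a, w_b))
    return pairs
```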
Preprocessing/cleaning/labeling
- Was any preprocessing/cleaning/labeling of the data done? Yes, paired captions were filtered based on similarity, and inappropriate content was removed. More details are provided in the associated paper (a hedged sketch of a similarity filter is given at the end of this section).
- Was the "raw" data saved in addition to the preprocessed/cleaned/labeled data (e.g., to support unanticipated future uses)? Yes.
- Is the software that was used to preprocess/clean/label the data available? Not yet.
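As referenced above, here is a minimal sketch of what a similarity-based caption filter could look like; difflib's string ratio and the thresholds below are stand-ins for the paper's actual criteria and values.

```python
# Hedged sketch of a similarity filter for mined caption pairs.
from difflib import SequenceMatcher

def keep_pair(caption_a, caption_b, lo=0.6, hi=0.95):
    """Keep a pair only if the captions are similar enough to describe
    related scenes but not near-duplicates (thresholds are assumptions)."""
    ratio = SequenceMatcher(None, caption_a, caption_b).ratio()
    return lo <= ratio <= hi
```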
Uses
- Has the dataset been used for any tasks already? Not yet, this is a new dataset.
- Is there a repository that links to any or all papers or systems that use the dataset? There is no official repository, but users are welcome to submit their results to https://paperswithcode.com/dataset/webvid-covr.
- What (other) tasks could the dataset be used for? The dataset could be used to train models for other cross-modal video-text retrieval tasks, which are typically evaluated with standard retrieval metrics such as Recall@K (a sketch is given at the end of this section).
- Is there anything about the composition of the dataset or the way it was collected and preprocessed/cleaned/labeled that might impact future uses? The dataset was automatically generated from web-scraped video-caption pairs, so there are a few things to keep in mind:
- The videos may contain inappropriate or objectionable content, since they were scraped from the web without curation. The authors attempted to filter out some inappropriate content, but dataset consumers should be aware that objectionable material may still exist in the data.
- The automatically generated modification texts between video pairs may be noisy or not accurately describe the visual differences between videos, since they were generated from caption differences only. This could impact potential uses or analysis done on the textual aspects of the data.
- The lack of human curation means there could be biases or quality issues in the data. Consumers should evaluate the data carefully for their application.
- There may be copyright or terms of use issues with the original videos that were scraped, so consumers should review permissions and potential restrictions together with their institutions.
To mitigate risks/harms, consumers could manually review and filter objectionable content, analyze texts to exclude noisy samples, and evaluate dataset distributions/biases before use. Working with subset samples or known high-quality portions may also help. Overall, the automated nature of the dataset generation process should be kept in mind.
- Are there tasks for which the dataset should not be used? Users of the WebVid-CoVR dataset must adhere to the terms of use stipulated by WebVid [1]. The dataset should not be used for commercial applications, nor for harmful purposes such as surveillance, retrieving violent content or personal information, or orchestrating attacks. Such uses are strictly forbidden.
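As referenced above, retrieval models trained on this data are typically evaluated with Recall@K; the sketch below is illustrative, not the paper's exact evaluation code.

```python
# Hedged sketch of Recall@K for retrieval evaluation.
def recall_at_k(ranked_ids, target_id, k):
    """1.0 if the target appears among the top-k retrieved ids, else 0.0."""
    return float(target_id in ranked_ids[:k])

def mean_recall_at_k(all_rankings, all_targets, k):
    """Average Recall@K over (ranking, target) pairs of a test set."""
    hits = [recall_at_k(r, t, k) for r, t in zip(all_rankings, all_targets)]
    return sum(hits) / len(hits)
```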
Distribution
- Will the dataset be distributed to third parties outside of the entity (e.g., company, institution, organization) on behalf of which the dataset was created? Yes, the dataset will be publicly released.
- How will the dataset be distributed (e.g., tarball on website, API, GitHub)? The dataset will be distributed via direct download from the project website.
- When will the dataset be distributed? Already released.
- Will the dataset be distributed under a copyright or other intellectual property (IP) license, and/or under applicable terms of use (ToU)? The videos themselves come from WebVid and users should adhere to the terms of use stipulated by WebVid. We are releasing the WebVid-CoVR dataset for research purposes and request citing the dataset paper ("CoVR: Learning Composed Video Retrieval from Web Video Captions").
- Have any third parties imposed IP-based or other restrictions on the data associated with the instances? If so, please describe these restrictions, and provide a link or other access point to, or otherwise reproduce, any relevant licensing terms, as well as any fees associated with these restrictions. As the dataset was generated from the WebVid dataset [1], users should follow the licensing terms associated with WebVid [1].
- Do any export controls or other regulatory restrictions apply to the dataset or to individual instances? Only the ones associated with WebVid [1].
Maintenance
- Who will be supporting/hosting/maintaining the dataset? The authors.
- How can the owner/curator/manager of the dataset be contacted (e.g., email address)? By email at lucas.ventura@enpc.fr.
- Is there an erratum? There is no erratum at this time; contact information for reporting errors is provided on the project website.
- Will the dataset be updated (e.g., to correct labeling errors, add new instances, delete instances)? Yes, the authors will update the dataset within weeks if errors are found.
- If the dataset relates to people, are there applicable limits on the retention of the data associated with the instances (e.g., were the individuals in question told that their data would be retained for a fixed period of time and then deleted)? As we neither store nor release the videos, if a video is deleted from its original source, we will no longer have access to it.
- Will older versions of the dataset continue to be supported/hosted/maintained? Only the latest version will be officially supported in case of update.
- If others want to extend/augment/build on/contribute to the dataset, is there a mechanism for them to do so? Contributions are welcome but there is no official mechanism at this point.
References
- [1] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, 2021.
- [2] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Communications of the ACM, 64(12), 2021.