1 of 16

Dataverse Croissant Update

Philip Durbin

Croissant Task Force Meeting

2024-06-12

2 of 16

What is Dataverse?

  • Open source research data repository software
    • WordPress for research data
  • 118 installations
  • 37 countries
  • 430K datasets worldwide
  • 171K datasets in Harvard Dataverse
  • https://dataverse.org
  • https://github.com/IQSS/dataverse

3 of 16

Demo!

  • https://beta.dataverse.org
  • Navigate to a dataset
  • Click "Metadata"
  • Click "Export Metadata"
  • Click "Croissant"
  • Or "view source" and look in <head>
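
The "view source" route can also be scripted. The sketch below pulls JSON-LD out of a page, assuming the metadata sits in a script element of type application/ld+json. The JsonLdExtractor class, the extract_jsonld helper, and the sample page are hypothetical stand-ins, not part of Dataverse.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html):
    """Return every JSON-LD document embedded in the given HTML text."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents


# Hypothetical page source standing in for a dataset landing page.
SAMPLE_PAGE = """<html><head>
<script type="application/ld+json">{"@type": "sc:Dataset", "name": "Max Schema.org"}</script>
</head><body></body></html>"""

docs = extract_jsonld(SAMPLE_PAGE)
print(docs[0]["name"])
```

In practice the HTML would come from the dataset page itself; the parsing step stays the same.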

4 of 16

Status of activities

  • Croissant override in Dataverse 6.2 - DONE #10382
  • External Croissant jar on Maven Central - IN REVIEW #10533
  • Sync up with Kaggle - STARTED
  • Sync up with pyDataverse - STARTED
  • Advertise availability of Croissant jar (thread)
  • Add Croissant jar to Harvard Dataverse
  • (Wait for some of 100+ installations to add Croissant jar)
  • Add Dataverse installations to Croissant Online Health

5 of 16

Design influences

  • Make each file downloadable separately
    • zipping is expensive
    • direct download from S3
  • Preserve Schema.org optimizations for Google Dataset Search
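
Because each file is fetched on its own, a client can check every download against the md5 and contentSize values carried by the Croissant distribution entries. The sketch below shows one way to do that. The verify_download helper and the sample payload are hypothetical, and the bytes are built in memory rather than downloaded.

```python
import hashlib


def verify_download(file_object, payload):
    """Check downloaded bytes against a FileObject's contentSize and md5."""
    # contentSize is a string in the export, so convert before comparing.
    size_ok = int(file_object["contentSize"]) == len(payload)
    md5_ok = hashlib.md5(payload).hexdigest() == file_object["md5"]
    return size_ok and md5_ok


# Hypothetical payload and a matching FileObject built from it.
payload = b"hypothetical file contents"
file_object = {
    "@type": "cr:FileObject",
    "@id": "data/example.txt",
    "md5": hashlib.md5(payload).hexdigest(),
    "contentSize": str(len(payload)),
}

print(verify_download(file_object, payload))
print(verify_download(file_object, payload + b"!"))
```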

6 of 16

Croissant jar 0.1.2

  • Code at https://github.com/gdcc/exporter-croissant
  • Resolved since last time (2024-03-20 Croissant Task Force meeting)
    • citeAs as BibTeX - #638
    • file paths go in @id - #639
  • Resolved but not implemented for Dataverse
    • contentUrl for each format of a file (original proprietary vs archival) #641
  • README with known issues and open questions
    • 1.0 as a string should be a valid version for a dataset #609 #643
    • summary statistics (mean, max, min, etc.) #640
    • guidance on large Croissant files, especially in <head> #646

7 of 16

citeAs vs. citation (Related Publication)

"citeAs": "@data{FK2/VQTYHD_2024,author = {Durbin, Philip and IQSS},publisher = {Root},title = {Max Schema.org},year = {2024},url = {https://doi.org/10.5072/FK2/VQTYHD}}",
"citation": [{
  "@type": "CreativeWork",
  "name": "Tykhonov, V., & Durbin, P. (2024, March 20). Croissant ML standard in the context of Dataverse, EOSC and beyond. Zenodo. https://doi.org/10.5281/zenodo.10843668",
  "@id": "https://doi.org/10.5281/zenodo.10843668",
  "identifier": "https://doi.org/10.5281/zenodo.10843668",
  "url": "https://doi.org/10.5281/zenodo.10843668"
}],

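The single-line BibTeX layout of citeAs can be reproduced with a few lines of string formatting. This is only an illustrative sketch, not the exporter's implementation. The bibtex_cite_as function is hypothetical and does no escaping of special characters.

```python
def bibtex_cite_as(key, authors, publisher, title, year, url):
    """Build a one-line BibTeX @data entry in the layout shown on the slide."""
    fields = [
        ("author", " and ".join(authors)),
        ("publisher", publisher),
        ("title", title),
        ("year", str(year)),
        ("url", url),
    ]
    body = ",".join(f"{name} = {{{value}}}" for name, value in fields)
    return f"@data{{{key},{body}}}"


cite_as = bibtex_cite_as(
    key="FK2/VQTYHD_2024",
    authors=["Durbin, Philip", "IQSS"],
    publisher="Root",
    title="Max Schema.org",
    year=2024,
    url="https://doi.org/10.5072/FK2/VQTYHD",
)
print(cite_as)
```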
8 of 16

Geographic Coverage and Time Period

"spatialCoverage": [
  "Cambridge, MA, United States, Harvard Square"
],
"temporalCoverage": [
  "2023-01-01/2023-12-31"
],
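
The temporalCoverage value is a start/end pair separated by a slash. A consumer might split it as sketched below. The parse_temporal_coverage helper is hypothetical and handles only plain calendar dates, treating a value without a slash as a single day.

```python
from datetime import date


def parse_temporal_coverage(value):
    """Split a 'start/end' date interval into a (start, end) tuple of dates."""
    start_text, separator, end_text = value.partition("/")
    start = date.fromisoformat(start_text)
    # A value without "/" is treated as a single day.
    end = date.fromisoformat(end_text) if separator else start
    if end < start:
        raise ValueError(f"interval ends before it starts: {value!r}")
    return start, end


start, end = parse_temporal_coverage("2023-01-01/2023-12-31")
print(start, end, (end - start).days + 1)
```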

9 of 16

distribution and recordSet

"distribution": [ {
  "@type": "cr:FileObject",
  "@id": "data/stata13-auto.dta",
  "name": "stata13-auto.dta",
  "encodingFormat": "application/x-stata-13",
  "md5": "7b1201ce6b469796837a835377338c5a",
  "contentSize": "6443",
  "description": "",
  "contentUrl": "http://localhost:8080/api/access/datafile/6?format=original"
}],
"recordSet": [ {
  "@type": "cr:RecordSet",
  "field": [ {
    "@type": "cr:Field",
    "name": "make",
    "description": "Make and Model",
    "dataType": "sc:Text",
    "source": {"@id": "11","fileObject": {"@id": "data/stata13-auto.dta"}}
  },
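
Each field points back to a file through source.fileObject.@id, which matches the @id of an entry in distribution. The sketch below follows that link to find the download URL for each field. It uses a trimmed copy of the values above, closes the truncated fragment so it parses (an assumption about the omitted remainder), and field_download_urls is a hypothetical helper.

```python
distribution = [{
    "@type": "cr:FileObject",
    "@id": "data/stata13-auto.dta",
    "name": "stata13-auto.dta",
    "contentUrl": "http://localhost:8080/api/access/datafile/6?format=original",
}]

record_sets = [{
    "@type": "cr:RecordSet",
    "field": [{
        "@type": "cr:Field",
        "name": "make",
        "dataType": "sc:Text",
        "source": {"@id": "11", "fileObject": {"@id": "data/stata13-auto.dta"}},
    }],
}]


def field_download_urls(distribution, record_sets):
    """Map each field name to the contentUrl of the file it is sourced from."""
    files_by_id = {file_object["@id"]: file_object for file_object in distribution}
    urls = {}
    for record_set in record_sets:
        for field in record_set.get("field", []):
            file_id = field["source"]["fileObject"]["@id"]
            urls[field["name"]] = files_by_id[file_id]["contentUrl"]
    return urls


urls = field_download_urls(distribution, record_sets)
print(urls)
```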

10 of 16

Concern: large Croissant files

  • guidance on large Croissant files, especially in <head> #646
  • Dataset with 25K files
    • 4.4 MB Schema.org file
    • 7.1 MB Croissant file
  • Automatic cut off at certain size?
  • Signposting? https://signposting.org
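
One possible shape for a size cutoff: embed the Croissant JSON-LD in the page head only when it is small, and otherwise emit a link to the export. Everything here is an assumption for discussion. The 1 MB threshold, the head_markup helper, and the example URL are hypothetical, and the describedby relation follows the Signposting convention for metadata links.

```python
import html
import json

MAX_EMBED_BYTES = 1_000_000  # hypothetical cutoff, not an agreed value


def head_markup(croissant, export_url, max_bytes=MAX_EMBED_BYTES):
    """Return inline JSON-LD when small enough, else a link to the export."""
    payload = json.dumps(croissant, separators=(",", ":"))
    if len(payload.encode("utf-8")) <= max_bytes:
        # Escape "</" so the JSON cannot close the script element early.
        safe = payload.replace("</", "<\\/")
        return f'<script type="application/ld+json">{safe}</script>'
    href = html.escape(export_url, quote=True)
    return f'<link rel="describedby" type="application/ld+json" href="{href}">'


small = {"@type": "sc:Dataset", "name": "Max Schema.org"}
export_url = "https://example.org/export/croissant"  # hypothetical URL

inline = head_markup(small, export_url)
linked = head_markup(small, export_url, max_bytes=10)
print(inline)
print(linked)
```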

11 of 16

Concern: date formats

  • "datePublished":"2024-06-11"
  • "datePublished":"2024-06-11T16:53:09.0120607"
    • https://schema.org/DateTime
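
A consumer that has to accept both forms might normalize them as sketched below. The parse_date_published helper is hypothetical. It trims or pads fractional seconds to six digits, since the second example carries seven and older Python versions accept only three or six.

```python
import re
from datetime import date, datetime


def _six_digit_fraction(match):
    """Truncate or zero-pad fractional seconds to exactly six digits."""
    return "." + match.group(1)[:6].ljust(6, "0")


def parse_date_published(value):
    """Parse a date-only or date-time datePublished value."""
    if "T" not in value:
        return date.fromisoformat(value)
    normalized = re.sub(r"\.(\d+)", _six_digit_fraction, value)
    return datetime.fromisoformat(normalized)


day = parse_date_published("2024-06-11")
moment = parse_date_published("2024-06-11T16:53:09.0120607")
print(day, moment)
```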

12 of 16

Differences from Kaggle

13 of 16

Differences from pyDataverse

14 of 16

Differences from Schema.org JSON-LD

15 of 16

Once the QA'ed Croissant jar has been released...

  • Advertise availability of Croissant jar (thread)
  • Add Croissant jar to Harvard Dataverse
  • (Wait for other ~100 installations to add Croissant jar)
  • Add Dataverse installations to Croissant Online Health #530
    • health: scrapydweb fails to launch, seems to require newer version #647
    • Dataverse installations as JSON

16 of 16

Thank you!