1 of 8

Data curation and assessment

Carlo Minotti

SciCat F2F 2022

2 of 8

Outline

  • KPIs: how do we want to track usage, downloads, user clicks…
  • Data assessment: can we assess whether our metadata/data is correct and how much variability it contains?


3 of 8

Advantages

  • Useful to get an idea of UI/search use
  • Performance optimisation
  • FAIR compliance
  • Searchability


4 of 8

KPIs: how do we want to track usage, downloads, user clicks?

Adding fields to collections

Pros:

  • Implemented at the SciCat level
  • No extra requirements for each facility
  • Fast to query

Cons:

  • Need to create extra fields upfront
  • Extra fields need to be maintained
  • Not very flexible
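
A minimal sketch of the "extra fields" idea, assuming a MongoDB-style dataset document held here as a plain Python dict. The `kpi` sub-document and the counter names `views` and `downloads` are hypothetical and not part of any existing SciCat schema; in a real document store the same effect would typically come from an atomic increment rather than a read-modify-write.

```python
# Hypothetical KPI counter names; they must be agreed upfront,
# which is exactly the "not very flexible" drawback listed above.
KPI_FIELDS = ("views", "downloads")


def record_event(dataset: dict, event: str) -> dict:
    """Increment the counter for `event` on a dataset document (in place)."""
    if event not in KPI_FIELDS:
        raise ValueError(f"unknown KPI field: {event}")
    kpi = dataset.setdefault("kpi", {})
    kpi[event] = kpi.get(event, 0) + 1
    return dataset


# Illustrative document; the pid value is made up.
dataset = {"pid": "example/123"}
record_event(dataset, "views")
record_event(dataset, "views")
record_event(dataset, "downloads")
```

Querying stays cheap because the counters live next to the dataset, but every new KPI means a new field.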


5 of 8

KPIs: how do we want to track usage, downloads, user clicks?

Improve the logging/monitoring (e.g. graylog)

Pros:

  • Richer logging, an advantage also outside the context of KPIs
  • More flexibility

Cons:

  • Aggregation at facility level
  • External monitoring system
  • Some work to enrich the logs
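
A minimal sketch of what "enriching the logs" could look like, assuming events are emitted as one JSON object per log line so that an external system (e.g. Graylog) can aggregate them. The logger name, the `ListHandler` helper and the `action`/`pid` keys are assumptions for illustration; the in-memory handler only stands in for the real log shipper.

```python
import json
import logging
from collections import Counter


class ListHandler(logging.Handler):
    """Collects formatted messages in memory (stand-in for a log shipper)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record):
        self.lines.append(record.getMessage())


logger = logging.getLogger("kpi_sketch")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def log_event(action: str, pid: str) -> None:
    """Emit one structured KPI event as a JSON log line."""
    logger.info(json.dumps({"action": action, "pid": pid}))


# Illustrative events; the pids are made up.
log_event("download", "example/1")
log_event("view", "example/1")
log_event("download", "example/2")

# Aggregation that the monitoring system would normally perform.
counts = Counter(json.loads(line)["action"] for line in handler.lines)
```

New event types need no schema change, which is where the extra flexibility comes from.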


6 of 8

KPIs: how do we want to track usage, downloads, user clicks?

Plots/Jupyter notebooks on existing data (e.g. number of triggered retrieve jobs)

Pros:

  • Some statistics are already available

Cons:

  • Not really flexible
  • Queries hit the DB directly; performance impact?
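
A minimal sketch of the kind of statistic a notebook could compute from existing data, here the number of retrieve jobs per month. The job records are made-up sample data, and the field names `type` and `creationTime` are assumptions about how jobs are stored; in practice the list would come from a DB query or an API call.

```python
from collections import Counter
from datetime import datetime

# Illustrative job records (not real data).
jobs = [
    {"type": "retrieve", "creationTime": "2022-03-02T10:00:00"},
    {"type": "archive", "creationTime": "2022-03-05T11:30:00"},
    {"type": "retrieve", "creationTime": "2022-03-20T08:15:00"},
    {"type": "retrieve", "creationTime": "2022-04-01T09:00:00"},
]


def retrieve_jobs_per_month(jobs):
    """Count jobs of type 'retrieve', grouped by YYYY-MM of creation."""
    counts = Counter()
    for job in jobs:
        if job["type"] == "retrieve":
            created = datetime.fromisoformat(job["creationTime"])
            counts[created.strftime("%Y-%m")] += 1
    return dict(sorted(counts.items()))


per_month = retrieve_jobs_per_month(jobs)
```

The resulting dictionary can be passed straight to a plotting library; running the aggregation on a read replica or an export would avoid loading the production DB.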


7 of 8

Data assessment: can we assess whether our metadata/data is correct and how much variability it contains?

Simple scripts that check formatting in the DB (e.g. number of empty fields/values)

Pros:

  • Flexible

Cons:

  • Difficult to define what to check
  • Too flexible?
  • No consistent scoring across facilities
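
A minimal sketch of such a script, counting empty values per field over a list of metadata records. The records and field names are illustrative only, and what counts as "empty" (here `None`, empty string, empty list, empty dict) is itself an assumption that each facility might define differently.

```python
def is_empty(value) -> bool:
    """Treat None and empty strings/lists/dicts as empty (an assumption)."""
    return value is None or value == "" or value == [] or value == {}


def empty_field_counts(records):
    """Return, for every field seen, how many records have it empty."""
    counts = {}
    for record in records:
        for field, value in record.items():
            counts.setdefault(field, 0)
            if is_empty(value):
                counts[field] += 1
    return counts


# Illustrative records (not real data).
records = [
    {"datasetName": "run 1", "description": "", "keywords": []},
    {"datasetName": "run 2", "description": "calibration", "keywords": ["test"]},
    {"datasetName": None, "description": "", "keywords": ["x"]},
]

empty_counts = empty_field_counts(records)
```

Because each facility can change `is_empty` or the field list freely, the numbers are not directly comparable across facilities.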


8 of 8

Data assessment: can we assess whether our metadata/data is correct and how much variability it contains?

Scripts that check based on well-defined interfaces (e.g. OAI-PMH, search-API, Tubingen’s schemas)

Pros:

  • Easy to define what to check
  • Known and agreed standards
  • Facility independent
  • Easy to provide a consistent score across facilities
  • Could be a central service

Cons:

  • Not flexible
  • Need to agree on interface upfront
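
A minimal sketch of a consistent score, assuming the facilities have agreed on a set of required fields exposed through a common interface. The field list below borrows Dublin Core element names as an example and is an assumption, not an agreed standard; the records are made up.

```python
# Assumed agreed set of required fields (hypothetical choice).
REQUIRED_FIELDS = ("title", "creator", "identifier", "date")


def completeness_score(record: dict) -> float:
    """Fraction of required fields that are present and non-empty."""
    present = sum(
        1 for field in REQUIRED_FIELDS
        if record.get(field) not in (None, "", [])
    )
    return present / len(REQUIRED_FIELDS)


def facility_score(records) -> float:
    """Mean completeness over all records of one facility."""
    if not records:
        return 0.0
    return sum(completeness_score(r) for r in records) / len(records)


# Illustrative records (not real data).
records = [
    {"title": "A", "creator": "B", "identifier": "id1", "date": "2022-01-01"},
    {"title": "C", "creator": "", "identifier": "id2"},
]

score = facility_score(records)
```

Since the field list is fixed and shared, the same function could run as a central service against every facility and yield comparable numbers.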
