1 of 55

Data Discovery Paradigms Interest Group

November 7, 2018

RDA 12th Plenary Meeting, Gaborone, Botswana

Siri Jodha Singh Khalsa

Fotis Psomopoulos

Mingfang Wu

2 of 55

Collaborative Session Notes

3 of 55

Agenda Today:

  1. Goals of group and progress (5’)
  2. Report on progress/discuss current activities from Data Granularity TF (5’)
  3. Report on progress/discuss current activities from the Metadata Enrichment TF (5’)
  4. Report on progress/discuss current activities from the Research Schemas TF (50’)
    1. Introductory talk to Schema.org
    2. Applications of schema.org in different disciplines
    3. Discuss the objectives agreed during the TF calls
    4. Connect to the 4-hour “workshop” tomorrow morning
  5. Summary of results and next steps (20’)

4 of 55

Charter Data Discovery Paradigms Interest Group:

  • Motivation:
    • Helping to make research data Findable to support users in discovering data regardless of the manner in which it is stored, described and exposed.
    • Explore shared issues among those who search for data, and those who build and operate systems that enable data search.
  • Goals:
    • Provide a forum where representatives across the spectrum of stakeholders and roles can explore how to improve data discovery.
    • Produce actionable recommendations for data producers, data repositories and data seekers.

5 of 55

Timeline:

  • RDA P7: Held BoF
  • RDA P8: Kickoff of IG, captured potential topics
  • Oct 2016: Website, mailing list, poll on potential topics
  • Dec 2016: Task Forces initiated
  • RDA P9: Discuss draft Task Force outputs
  • RDA P10: Present outputs & status from Task Forces
  • RDA P11: Summarize final TF outputs; advance ongoing TF, initiate new TFs
  • Nov 7 (RDA P12): Discuss progress of new TF; initiate a Research Schemas WG

6 of 55

Data Discovery Paradigms IG

Data Granularity Task Force

Siri Jodha Singh Khalsa, ???

7 of 55

Background and Scope

  • Formed in ???
  • Objective:
    • ???

8 of 55

GRANULARITY

Of Data Discovery & Access

Image from https://openelectiondata.net/en/guide/principles/granular/

9 of 55

10 of 55

Granular Discovery

  • Data are commonly discoverable & accessible at the level of ‘datasets’, which are often aggregates of observations.
  • Efficient and effective reuse of data requires users to be able to find and access resources at a finer level of granularity, e.g. Sampling Features.
  • This requires PIDs & metadata at a more granular level to support discovery, access, & citation.
    • E.g. sample IDs or cruise IDs in DOI metadata (relatedIdentifier)
  • More granular discovery needs to lead to more granular data access.

See: NISO Webinar: “Is Granularity the Next Discovery Frontier? Part 1: Supporting Direct Access to Increasingly Granular Chunks of Content”; http://www.niso.org/news/events/2015/webinars/granularity_pt1/
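The idea of carrying sample or cruise identifiers in DOI metadata can be sketched as follows. This is a minimal illustration, assuming DataCite-style `relatedIdentifiers` entries; the DOIs, sample IDs, cruise URL, and the choice of `relationType` are all made up for the example and are not taken from any real record.

```python
# Hypothetical DataCite-style records. All identifiers below are invented.
datasets = [
    {
        "doi": "10.1234/example-a",
        "relatedIdentifiers": [
            {"relatedIdentifier": "IGSN-EXAMPLE-0001",
             "relatedIdentifierType": "IGSN",
             "relationType": "References"},
            {"relatedIdentifier": "https://example.org/cruise/CRUISE-42",
             "relatedIdentifierType": "URL",
             "relationType": "References"},
        ],
    },
    {
        "doi": "10.1234/example-b",
        "relatedIdentifiers": [
            {"relatedIdentifier": "IGSN-EXAMPLE-0002",
             "relatedIdentifierType": "IGSN",
             "relationType": "References"},
        ],
    },
]


def datasets_for_identifier(records, identifier):
    """Return the DOIs of datasets whose relatedIdentifiers mention `identifier`."""
    return [
        record["doi"]
        for record in records
        if any(rel["relatedIdentifier"] == identifier
               for rel in record.get("relatedIdentifiers", []))
    ]


print(datasets_for_identifier(datasets, "IGSN-EXAMPLE-0001"))
```

With identifiers exposed this way, a search for one sample or cruise can lead back to every dataset that reports on it, which is the step from dataset-level to granular discovery.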

11 of 55

Use Case: Geochemistry

  • A ‘dataset’ usually contains, in tabular format, a set of different compositional measurements (chemical, mineralogical) made on a set of samples.
    • Properties such as SiO2, 87Sr/86Sr, La, etc.
    • Specimens that have a name, PID, rock/sediment type, geospatial location, etc.
  • Users want to
    • Find a specific sample/specimen
    • Find specimens of a specific rock type
    • Find data of a specific property
    • Find data/samples from a cruise/field program
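The four user needs above can be illustrated with a small in-memory index of sample-level records. This is only a sketch: the `SampleRecord` fields, the `find_samples` helper, and all sample values are hypothetical and chosen to mirror the bullets, not taken from a real repository.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    sample_id: str
    rock_type: str
    cruise: str
    # Maps a measured property name (e.g. "SiO2") to its value.
    measurements: dict = field(default_factory=dict)


# Invented example records.
records = [
    SampleRecord("SAMPLE-001", "basalt", "CRUISE-A", {"SiO2": 49.8, "La": 3.1}),
    SampleRecord("SAMPLE-002", "basalt", "CRUISE-B", {"87Sr/86Sr": 0.7030}),
    SampleRecord("SAMPLE-003", "andesite", "CRUISE-A", {"SiO2": 58.2}),
]


def find_samples(records, *, sample_id=None, rock_type=None,
                 cruise=None, has_property=None):
    """Return records matching every criterion that was supplied."""
    hits = []
    for rec in records:
        if sample_id is not None and rec.sample_id != sample_id:
            continue
        if rock_type is not None and rec.rock_type != rock_type:
            continue
        if cruise is not None and rec.cruise != cruise:
            continue
        if has_property is not None and has_property not in rec.measurements:
            continue
        hits.append(rec)
    return hits


print([r.sample_id for r in find_samples(records, rock_type="basalt")])
print([r.sample_id for r in find_samples(records, has_property="SiO2", cruise="CRUISE-A")])
```

Each query only works because the metadata is held per sample rather than per dataset, which is the point of the use case.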

12 of 55

Work of the Task Forces

  • New TF based on results from:
    • Relevancy Ranking TF
  • Next steps for:
    • Metadata TF?
  • Continue with List from P8?
    • Cataloging common APIs
    • Granularity, domain-specific cross-domain issues
    • De-duplication of search results
    • Using upper-level ontologies
    • Search personalisation

13 of 55

Work of the Task Forces - 2

  • Proposals for future DDPIG work coming out of relevancy ranking survey
    • Create and maintain an environment in which community members can implement and test search algorithms and provide technical support to each other.
    • Facilitate creation of a corpus or several corpora that would be made available to the community to facilitate benchmark testing of data search systems
    • Develop evaluation standards and / or evaluate existing standards for data discovery.
    • Develop detailed recommendations on how to improve relevance rankings using a specific approach that the current group recommends.
  • Other ideas for new Task Forces we can start?

14 of 55

Suggestions from RR Survey

  1. Detailed recommendations on how to improve relevance rankings using specific approaches.
  2. New data discovery topics, such as including primary data in search, using visualizations to represent results, and new concepts of discovery.
  3. Facilitate improved relationships with journal publishers
  4. Ranking in linguistic corpus search, e.g., in terms of maximally different linguistic contexts for hits
  5. Intelligent search
  6. Clarity on the degrees of relevancy and the means to define this
  7. The need to fund software development and maintenance for repositories developed with research funds
  8. Evaluation of search engine rankings - comparison with peers.

15 of 55

Data Discovery Paradigms IG

Metadata Enrichment Task Force

Beth Huffer, Ilya Zaslavsky

16 of 55

Background and Scope

  • Formed in April 2017
  • Objective:
    • To describe and catalog various efforts to enrich research data metadata sets to satisfy several use cases

17 of 55

Deliverables

  1. A catalogue of automated metadata enrichment tools, together with information about what type of metadata they are able to produce, and the use cases for such metadata;
  2. A brief report on how metadata enrichment correlates (or doesn't) with other aspects of data discovery.

18 of 55

Planned Activities

  • The group is currently conducting a survey (March 31 response deadline) on automated methods of generating metadata (https://goo.gl/forms/i1ZAKxXoXVxNScUq2);
  • With an initial focus on automated metadata enrichment tools and services, the survey is designed to identify and document:
    • The specific method(s) being employed by each tool or service;
    • The types of metadata (e.g. methods, tools, location, provenance) being produced by each;
    • The use cases (e.g., improving search, enabling faceted browsing, facilitating data integration) those metadata are supporting;

19 of 55

Planned Activities, cont’d

  • Cross-reference metadata enrichment survey responses with responses to the Search Relevancy TF survey to look for possible correlations. For example, are repositories that perform metadata enrichment more or less likely to:
    • Analyze query logs?
    • Measure search engine performance?
    • Tune relevancy rankings using internal resources?
  • Submit follow-up questions to survey respondents, if indicated

20 of 55

From the Relevancy Ranking Survey:

  • Next: explore the mentioned systems; follow up with respondents via phone interviews; invite respondents to present and demo their metadata enrichment tools

21 of 55

Questions?

  • Contact
    • Beth Huffer, beth@lingualogica.net
    • Ilya Zaslavsky, zaslavsk@sdsc.edu
  • Shortened link to survey: https://goo.gl/qTgJ8F

22 of 55

Agenda Today:

  • Goals of group and progress (5’)
  • Report on progress/discuss current activities from Data Granularity TF (5’)
  • Report on progress/discuss current activities from the Metadata Enrichment TF (5’)
  • Report on progress/discuss current activities from the Research Schemas TF (50’)
    • Introductory talk to Schema.org
    • Applications of schema.org in different disciplines
    • Discuss the objectives agreed during the TF calls
    • Connect to the 4-hour “workshop” tomorrow morning
  • Summary of results and next steps (20’)

23 of 55

Data Discovery Paradigms IG

Research Schemas Task Force

Leyla Garcia, Nick Juty, Rafael Jimenez

24 of 55

Background and Scope

  • Under discussion since the 11th RDA Plenary Meeting in March 2018
  • Goals defined in October 2018
  • Expecting a concrete proposal by November 2018
  • Goal
    • Establish Research Schemas as an RDA Working Group and as a community initiative that agrees on and implements a common strategy to promote the use and adoption of schema.org markup, in order to increase research data discoverability (aka findability) and accessibility.

25 of 55

Roadmap

26 of 55

Schema.org

  • Collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond

27 of 55

Structured data → descriptors

  • Types: what we are talking about
  • Properties: what we can say about it

28 of 55

Bioschemas

Community initiative built on top of schema.org

  • Aim
    • Improve data discoverability and interoperability in Life Sciences
  • How
    • Adding Life Science types to schema.org
    • Providing usage guidelines, examples and tools

29 of 55

Motivation

  • Researcher looking for help to
    • sequence a cactus → people, expertise
    • find similar sequences → datasets
    • compare and visualize → tools

30 of 55

Use cases

  • Generic
    • Search engines
    • Google dataset specialized search
  • Life sciences
    • Data depositions can harvest markup from small providers → small providers can benefit in return from data depositions
    • TeSS (training and events registry) → Automatic population → trainers add markup
    • Common/rare terms → Resource index

31 of 55

Guess the person

Height: 1.72 m

Weight: unknown

Work location: 1 Einstein Drive, Princeton, New Jersey 08540, USA

Honorific prefix: Dr. Prof.

Siblings: Maria

isicv4: 7210

Name: Albert

Last name: Einstein

Birth date: 14 March 1879

Death date: 18 April 1955

Alumni of: University of Zurich

Birth place: Ulm, Kingdom of Württemberg, German Empire

Knows language: German, English

Knows about: Physics, relativity, mass energy equivalence

Award: Nobel prize in physics

32 of 55

Guess the person

Height: 1.72 m

Weight: unknown

Work location: 1 Einstein Drive, Princeton, New Jersey 08540, USA

Honorific prefix: Dr. Prof.

Siblings: Maria

isicv4: 7210

Albert

Einstein

Birth date: 14 March 1879

Death date: 18 April 1955

Alumni of: University of Zurich

Birth place: Ulm, Kingdom of Württemberg, German Empire

Knows language: German, English

Knows about: Physics, relativity, mass energy equivalence

Award: Nobel prize in physics

email

orcid
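The descriptors on this slide map naturally onto schema.org's `Person` type. The sketch below shows one possible JSON-LD rendering, built as a Python dictionary; the mapping of each slide label to a schema.org property is my reading of the slide, and height, weight, email and ORCID are left out because the slide gives no usable values for them.

```python
import json

# One possible mapping of the slide's descriptors to schema.org Person properties.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "givenName": "Albert",
    "familyName": "Einstein",
    "honorificPrefix": "Dr. Prof.",
    "birthDate": "1879-03-14",
    "deathDate": "1955-04-18",
    "birthPlace": "Ulm, Kingdom of Württemberg, German Empire",
    "alumniOf": "University of Zurich",
    "knowsLanguage": ["German", "English"],
    "knowsAbout": ["Physics", "relativity", "mass energy equivalence"],
    "award": "Nobel prize in physics",
    "isicV4": "7210",
}

# ensure_ascii=False keeps "Württemberg" readable in the output.
print(json.dumps(person, indent=2, ensure_ascii=False))
```

The type (`Person`) says what is being described, and each property says something about it, which is the types/properties split the talk introduces.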

33 of 55

Bioschemas vs schema.org

Schema.org Types

  • Generic data model
  • Generous list of properties to describe data types
  • Managed by Schema.org

Bioschemas Profiles

  • Specific for Life Sciences
  • Apply constraints to existing Schema.org types
  • Minimum properties for finding and accessing data
  • Best practices for selected properties
  • Managed by Bioschemas

Types and profiles shown: Dataset, DataCatalog, DataRecord, BioChemEntity, LabProtocol, Sample, Protein, Chemical

34 of 55

Beyond databases → digital resources

Types shown: TrainingMaterial, Course, CourseInstance, Software, Tool, Event, Person, Organization

35 of 55

Why schema.org → providers

  • Simple to adopt
  • Can be exposed in HTML & APIs
  • Supports several formats
  • Widely agreed and used
  • Considered a best practice by search engines
  • Helps search engines to index and present metadata (snippets/summaries)
  • Contributes to improved search result ranking
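As an illustration of "can be exposed in HTML", the sketch below wraps a schema.org description in the `<script type="application/ld+json">` element that a provider would place in a page. The `jsonld_script_tag` helper and the example dataset are hypothetical; only the script element's type attribute is the standard JSON-LD convention.

```python
import json


def jsonld_script_tag(metadata: dict) -> str:
    """Serialize `metadata` as a JSON-LD script element for embedding in HTML."""
    # Escape "</" so a string value cannot close the script element early;
    # "\/" is a valid JSON escape for "/", so the payload still parses.
    payload = json.dumps(metadata, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Invented example dataset description.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "description": "Placeholder description of a hypothetical dataset.",
    "url": "https://example.org/datasets/1",
}

print(jsonld_script_tag(dataset))
```

The same dictionary could be returned unchanged from an API endpoint, so one description serves both the page and the API.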

36 of 55

Why schema.org → consumers

  • Common way to describe metadata across many resources
  • Structured metadata even in resources without an API
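A consumer can read such metadata straight from a web page, without any API. The sketch below uses only Python's standard library to collect JSON-LD blocks from HTML; the `JsonLdExtractor` class, the `extract_jsonld` helper, and the sample page are illustrative assumptions, not part of any tool mentioned in the talk.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of <script type="application/ld+json"> elements."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # Skip blocks that are not valid JSON.


def extract_jsonld(html: str) -> list:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items


# Invented example page.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body><p>Human-readable page</p></body></html>"""

print(extract_jsonld(page))
```

Because every provider uses the same vocabulary, the same extractor works across resources, which is the "common way to describe metadata" benefit listed above.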

37 of 55

Why schema.org → user

  • Better findability, integration and presentation facilitated by “search engines” and “metadata catalogues”.

38 of 55

Why Bioschemas

  • Focus on key properties prioritized as Minimum, Recommended and Optional based on community agreements and common practices

  • Customization of schema.org types to better support the needs of the life sciences community

  • Additional recommendations regarding property cardinality

  • Terms reused from well-known ontologies thus avoiding reinventing the wheel
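The Minimum / Recommended prioritization can be pictured as a simple completeness check on a piece of markup. The sketch below is illustrative only: the `PROFILE` property lists and the `missing_properties` helper are hypothetical and do not reproduce any official Bioschemas profile.

```python
# Hypothetical profile; the property lists are illustrative, not an official specification.
PROFILE = {
    "minimum": ["name", "description", "url"],
    "recommended": ["keywords", "license"],
}


def missing_properties(markup: dict, profile: dict) -> dict:
    """For each priority level, list the profile properties absent from `markup`."""
    return {
        level: [prop for prop in props if prop not in markup]
        for level, props in profile.items()
    }


# Invented example markup.
markup = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "url": "https://example.org/datasets/1",
    "keywords": ["example"],
}

print(missing_properties(markup, PROFILE))
```

A provider could treat missing minimum properties as errors and missing recommended ones as warnings, which is how the prioritization guides adoption effort.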

39 of 55

Why structured data with Bioschemas

40 of 55

How structured data contributes to FAIRability

  • Findability: finding proteins, samples, phenotypes and so on exposed via web pages
  • Accessibility: enabling data extraction over web pages
  • Interoperability: gathering data from different life sciences resources following a common format, linking to each other
  • Reusability: making reuse rules explicit and providing provenance information

41 of 55

Bioschemas meets FAIR

Findable

  • Unique identifiers
  • Descriptive metadata
  • Indexed and available

Accessible

Interoperable

Reusable

42 of 55

Bioschemas meets FAIR

Findable

Accessible

  • HTTP
  • Authorization via login or IP

Interoperable

  • Schema.org → JSON-LD
  • Links to other resources

Reusable

  • License
  • Provenance

43 of 55

TeSS: Bioschemas in action

http://bioschemas.org

Bioschemas Event:

  • Contact
  • Description
  • End date
  • Event Type
  • Host Institution
  • Location
  • Name
  • Start Date
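The event properties listed on this slide could appear in markup roughly as follows. This is a hedged sketch: the property names are camel-cased versions of the slide labels (an assumption, so check the Bioschemas Event profile for the authoritative names), and all values describe an invented event.

```python
import json

# Camel-cased slide labels; assumed names, not verified against the profile.
SLIDE_PROPERTIES = [
    "contact", "description", "endDate", "eventType",
    "hostInstitution", "location", "name", "startDate",
]

# Invented example event.
event = {
    "@context": "https://schema.org",
    "@type": "Event",
    "name": "Example training workshop",
    "description": "Placeholder description of a hypothetical course.",
    "startDate": "2018-11-07",
    "endDate": "2018-11-08",
    "eventType": "Workshop",
    "hostInstitution": "Example University",
    "location": "Example City",
    "contact": "training@example.org",
}

# Report any slide property the markup does not yet provide.
missing = [prop for prop in SLIDE_PROPERTIES if prop not in event]

print(json.dumps(event, indent=2))
print("Missing slide properties:", missing)
```

A registry such as TeSS can populate an event entry from markup like this, so trainers only need to maintain the description on their own page.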

44 of 55

Prototype: MarRef → BioSamples without an API

4 June 2018

https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md

45 of 55

Identifiers.org: Bioschemas DataCatalog in action (future)

ELIXIR All Hands 2018, June 2018, Berlin, Germany

Bioschemas DataCatalog:

  • Description
  • Keywords
  • Name
  • Provider
  • URL

  • alternateName
  • citation
  • dateCreated
  • license

46 of 55

How is schema.org currently used in the sciences?

47 of 55

Questionnaire

  • https://tinyurl.com/rda-rs-questionnaire
    • Sent to about 150 people → members of DDPIG (127) + people who expressed interest in schema.org (17) + a couple more on a one-to-one basis
  • Results and discussion

48 of 55

Research schemas

How the sciences can get the most out of schema.org

49 of 55

Minutes and discussions

  • https://github.com/RDA-DDP/Schema.org-for-Research-Data/
    • RDA-DDPIG-Schema-TF-Proposal.md

50 of 55

Task force objectives

  • Objective 1: Define research schema types and minimum information guidelines for discoverability and accessibility
  • Objective 2: Crosswalk and gap analysis evaluating existing standards and guidelines
  • Objective 3: Review existing efforts working on schemas to describe scientific types
  • Objective 4: Engagement and communication strategy; collaboration with existing efforts

51 of 55

Thanks for your attention!

Questions?

52 of 55

53 of 55

Potential activities for discussion

  • Use cases and scenarios
  • Recommendations for schema.org (e.g. missing attribute, controlled value)
  • Recommendations for better description practices (how should people document and describe their data to support discovery)
  • Recommendations for including / publishing schema.org (where should it go, how can repos support it)
  • Deepening the metadata. Schema.org provides a high-level description (around Dublin Core level). A lot of the value comes from the more detailed information that is closer to a community.
  • How-to for data centres (how hard is it going to be; what is the level of effort; what are good implementation patterns)
  • Recommendation on semantics for basic data discovery, e.g. 'modified' (is it metadata, data, web page)
  • DCAT is doing a revision cycle; let's engage from the RDA community perspective

54 of 55

Data Discovery Paradigms IG

Action items and next steps

55 of 55

Closing and Actions

  1. Review of Actions coming out of this meeting
    • Action 1 (responsible person)
    • Action 2 (responsible person)
  2. Next Steps
