1 of 55

Data Discovery Paradigms Interest Group
November 7, 2018
RDA 12th Plenary Meeting, Gaborone, Botswana

Siri Jodha Singh Khalsa

Fotis Psomopoulos

Mingfang Wu


2 of 55

Collaborative Session Notes

https://tinyurl.com/p12-ddpig

https://www.rd-alliance.org/ig-data-discovery-paradigms-rda-12th-plenary-meeting

(please sign in)


3 of 55

Agenda Today:

Goals of group and progress (5’)
Report on progress/discuss current activities from Data Granularity TF (5’)
Report on progress/discuss current activities from the Metadata Enrichment TF (5’)
Report on progress/discuss current activities from the Research Schemas TF (50’)

Introductory talk to Schema.org
Applications of schema.org in different disciplines
Review the objectives discussed and decided during the TF calls
Connect to the 4-hour “workshop” tomorrow morning

Summary of results and next steps (20’)


4 of 55

Charter Data Discovery Paradigms Interest Group:

Motivation:

Helping to make research data Findable to support users in discovering data regardless of the manner in which it is stored, described and exposed.
Explore shared issues among those who search for data, and those who build and operate systems that enable data search.

Goals:

Provide a forum where representatives across the spectrum of stakeholders and roles can explore how to improve data discovery.
Produce actionable recommendations for data producers, data repositories and data seekers.


5 of 55

Timeline:

RDA P7: Held BoF
RDA P8: Kickoff of IG, captured potential topics
Oct 16: Website, mailing list, poll on potential topics
Dec 16: Task Forces initiated
RDA P9: Discuss draft Task Force outputs
RDA P10: Present outputs & status from Task Forces
RDA P11: Summarize final TF outputs; advance ongoing TF, initiate new TFs
Nov 7 (RDA P12): Discuss progress of new TF; initiate a Research Schemas WG


6 of 55

Data Discovery Paradigms IG

Data Granularity Task Force

Siri Jodha Singh Khalsa, ???


7 of 55

Background and Scope

Formed in ???
Objective:

???


8 of 55

GRANULARITY

Of Data Discovery & Access

Image from https://openelectiondata.net/en/guide/principles/granular/

9 of 55

10 of 55

Granular Discovery

Data are commonly discoverable & accessible at the level of ‘datasets’, which are often aggregates of observations.
Efficient and effective reuse of data requires users to be able to find and access resources at a higher level of granularity, e.g. Sampling Features.
This requires PIDs & metadata at a more granular level to support discovery, access, & citation.

E.g. sample IDs or cruise IDs in DOI metadata (relatedIdentifier)

More granular discovery needs to lead to more granular data access.
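As a rough illustration of the relatedIdentifier idea above, the sketch below builds a DataCite-style metadata fragment that links a dataset DOI to the samples it was derived from. The DOI, the sample identifiers, and the choice of `References` as the relation type are assumptions made for this example; a real record should follow the DataCite schema version in use.

```python
import json

# Hypothetical identifiers, for illustration only.
dataset_doi = "10.1234/example-dataset"
sample_igsns = ["XXX0000001", "XXX0000002"]


def related_identifiers(igsns, relation_type="References"):
    """Build DataCite-style relatedIdentifier entries, one per sample ID.

    The relation type is an assumption; choose the one that matches
    how the dataset actually relates to the samples.
    """
    return [
        {
            "relatedIdentifier": igsn,
            "relatedIdentifierType": "IGSN",
            "relationType": relation_type,
        }
        for igsn in igsns
    ]


record = {
    "doi": dataset_doi,
    "relatedIdentifiers": related_identifiers(sample_igsns),
}

print(json.dumps(record, indent=2))
```

With sample-level identifiers carried in the dataset's DOI metadata, a search system can resolve a query for one sample to every dataset that mentions it.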

See: NISO Webinar: “Is Granularity the Next Discovery Frontier? Part 1: Supporting Direct Access to Increasingly Granular Chunks of Content”; http://www.niso.org/news/events/2015/webinars/granularity_pt1/

11 of 55

Use Case: Geochemistry

A ‘dataset’ usually contains, in tabular format, a set of different compositional properties (chemical, mineralogical) measured on a set of samples.

Properties such as SiO₂, ⁸⁷Sr/⁸⁶Sr, La, etc.
Specimens that have a name, PID, rock/sediment type, geospatial location, etc.

Users want to

Find a specific sample/specimen
Find specimens of a specific rock type
Find data of a specific property
Find data/samples from a cruise/field program
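The four user needs above can be sketched as filters over sample-level records rather than whole datasets. Everything in this example is hypothetical: the `Measurement` record layout, the sample and cruise identifiers, and the numeric values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Measurement:
    """One row of a geochemical table, described at sample level."""
    sample_id: str
    rock_type: str
    cruise: str
    properties: dict = field(default_factory=dict)


# Hypothetical records; identifiers and values are made up.
records = [
    Measurement("S-001", "basalt", "CRUISE-A", {"SiO2": 49.8, "La": 3.1}),
    Measurement("S-002", "andesite", "CRUISE-A", {"SiO2": 58.2}),
    Measurement("S-003", "basalt", "CRUISE-B", {"87Sr/86Sr": 0.7031}),
]


def find(records, sample_id=None, rock_type=None, prop=None, cruise=None):
    """Return records matching every criterion that was supplied."""
    hits = []
    for r in records:
        if sample_id is not None and r.sample_id != sample_id:
            continue
        if rock_type is not None and r.rock_type != rock_type:
            continue
        if cruise is not None and r.cruise != cruise:
            continue
        if prop is not None and prop not in r.properties:
            continue
        hits.append(r)
    return hits


basalts = find(records, rock_type="basalt")
sio2_on_cruise_a = find(records, prop="SiO2", cruise="CRUISE-A")
print([r.sample_id for r in basalts])
print([r.sample_id for r in sio2_on_cruise_a])
```

None of these queries can be answered if only the enclosing dataset has a PID and metadata, which is the point of the use case.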

12 of 55

Work of the Task Forces

New TF based on results from:

Relevancy Ranking TF

Next steps for:

Metadata TF?

Continue with List from P8?

Cataloging common APIs
Granularity, domain-specific cross-domain issues
De-duplication of search results
Using upper-level ontologies
Search personalisation


13 of 55

Work of the Task Forces - 2

Proposals for future DDPIG work coming out of relevancy ranking survey

Create and maintain an environment in which community members can implement and test search algorithms and provide technical support to each other.
Facilitate creation of a corpus or several corpora that would be made available to the community to facilitate benchmark testing of data search systems
Develop evaluation standards and / or evaluate existing standards for data discovery.
Develop detailed recommendations on how to improve relevance rankings using a specific approach that the current group recommends.

Other ideas for new Task Forces we can start?


14 of 55

Suggestions from RR Survey

Detailed recommendations on how to improve relevance rankings using specific approaches.
New data discovery topics, such as including primary data in search, using visualizations to represent results, and new concepts of discovery.
Facilitate improved relationships with journal publishers
Ranking in linguistic corpus search, e.g., in terms of maximally different linguistic contexts for hits
Intelligent search
Clarity on the degrees of relevancy and the means to define this
The need to fund software development and maintenance for repositories developed with research funds
Evaluation of search engine rankings - comparison with peers.


15 of 55

Data Discovery Paradigms IG

Metadata Enrichment Task Force

Beth Huffer, Ilya Zaslavsky


16 of 55

Background and Scope

Formed in April 2017
Objective:

To describe and catalog various efforts to enrich research data metadata sets to satisfy several use cases


17 of 55

Deliverables

A catalogue of automated metadata enrichment tools, together with information about what type of metadata they are able to produce, and the use cases for such metadata;
A brief report on how metadata enrichment correlates (or doesn't) with other aspects of data discovery.


18 of 55

Planned Activities

The group is currently conducting a survey (March 31 response deadline) on automated methods of generating metadata (https://goo.gl/forms/i1ZAKxXoXVxNScUq2);
With an initial focus on automated metadata enrichment tools and services, the survey is designed to identify and document:

The specific method(s) being employed by each tool or service;
The types of metadata (e.g. methods, tools, location, provenance) being produced by each;
The use cases (e.g., improving search, enabling faceted browsing, facilitating data integration) those metadata are supporting;


19 of 55

Planned Activities, cont’d

Cross-reference metadata enrichment survey responses with responses to the Search Relevancy TF survey to look for possible correlations. For example, are repositories that perform metadata enrichment more or less likely to:

Analyze query logs?
Measure search engine performance?
Tune relevancy rankings using internal resources?

Submit follow-up questions to survey respondents, if indicated


20 of 55

From the Relevancy Ranking Survey:

Next: explore the systems mentioned; follow up with respondents via phone interviews; invite respondents to present and demo their metadata enrichment tools


21 of 55

Questions?

Contact

Beth Huffer

beth@lingualogica.net

Ilya Zaslavsky

zaslavsk@sdsc.edu

Shortened link to survey:

https://goo.gl/qTgJ8F


22 of 55


23 of 55

Data Discovery Paradigms IG

Research Schemas Task Force

Leyla Garcia, Nick Juty, Rafael Jimenez


24 of 55

Background and Scope

Under discussion since the 11th RDA Plenary Meeting in March 2018
Goals defined in October 2018
Expecting a concrete proposal by November 2018
Goal

Establish Research Schemas as an RDA working group and as a community initiative to agree on and implement a common strategy to promote the use and adoption of schema.org markup to increase research data discoverability (aka findability) and accessibility.


25 of 55

Roadmap


26 of 55

Collaborative, community activity with a mission to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond


27 of 55

Structured data → descriptors

What we can say about it

Types

Properties

What we are talking about


28 of 55

Bioschemas

Community initiative built on top of schema.org

Aim

Improve data discoverability and interoperability in Life Sciences

How

Adding Life Science types to schema.org
Providing usage guidelines, examples and tools


29 of 55

Motivation

Researcher looking for help to

sequence a cactus → people, expertise

find similar sequences → datasets

compare and visualize → tools


30 of 55

Use cases

Generic

Search engines
Google dataset specialized search

Life sciences

Data depositions can harvest markup from small providers → small providers can benefit back from data depositions
TeSS (training and events registry) → Automatic population → trainers add markup
Common/rare terms → Resource index


31 of 55

Guess the person

Height: 1.72 m

Weight: unknown

Work location: 1 Einstein Drive, Princeton, New Jersey 08540, USA

Honorific prefix: Dr. Prof.

Siblings: Maria

isicv4: 7210

Name: Albert

Last name: Einstein

Birth date: 14 March 1879

Death date: 18 April 1955

Alumni of: University of Zurich

Birth place: Ulm, Kingdom of Württemberg, German Empire

Knows language: German, English

Knows about: Physics, relativity, mass energy equivalence

Award: Nobel prize in physics
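Several of the descriptors above map onto properties of the schema.org `Person` type. The sketch below expresses a subset of them as JSON-LD; the choice of properties, the nested `Place` and `CollegeOrUniversity` types, and the ISO 8601 date formatting are assumptions made for illustration, not the markup used in the talk.

```python
import json

# A subset of the slide's descriptors as schema.org Person properties.
person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "givenName": "Albert",
    "familyName": "Einstein",
    "honorificPrefix": "Dr. Prof.",
    "birthDate": "1879-03-14",  # ISO 8601 form of 14 March 1879
    "deathDate": "1955-04-18",  # ISO 8601 form of 18 April 1955
    "birthPlace": {
        "@type": "Place",
        "name": "Ulm, Kingdom of Württemberg, German Empire",
    },
    "alumniOf": {
        "@type": "CollegeOrUniversity",
        "name": "University of Zurich",
    },
    "knowsLanguage": ["German", "English"],
    "knowsAbout": ["Physics", "relativity", "mass energy equivalence"],
    "award": "Nobel prize in physics",
}

# ensure_ascii=False keeps "Württemberg" readable in the output.
print(json.dumps(person, indent=2, ensure_ascii=False))
```

The type says what is being described, and the properties say what can be stated about it.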

32 of 55

Guess the person

Height: 1.72 m

Weight: unknown

Work location: 1 Einstein Drive, Princeton, New Jersey 08540, USA

Honorific prefix: Dr. Prof.

Siblings: Maria

isicv4: 7210

Albert

Einstein

Birth date: 14 March 1879

Death date: 18 April 1955

Alumni of: University of Zurich

Birth place: Ulm, Kingdom of Württemberg, German Empire

Knows language: German, English

Knows about: Physics, relativity, mass energy equivalence

Award: Nobel prize in physics

email

orcid


33 of 55

Bioschemas vs schema.org

Schema.org Types

Generic data model
Generous list of properties to describe data types
Managed by Schema.org

Bioschemas Profiles

Specific for Life Sciences
Apply constraints to existing Schema.org types
Minimum properties for finding and accessing data
Best practices for selected properties
Managed by Bioschemas

Dataset

DataRecord

BioChemEntity

LabProtocol

Dataset

DataCatalog

LabProtocol

Sample

Protein

Chemical

…


34 of 55

Beyond databases → digital resources

TrainingMaterial

Course

CourseInstance

Software

TrainingMaterial

Event

Tool

Person

Organization

Person

Course

CourseInstance


35 of 55

Why schema.org → providers

Simple to adopt
Can be exposed in HTML & APIs
Supports several formats
Widely agreed and used
Considered a best practice by search engines
Helps search engines to index and present metadata (snippets/summaries)
Contributes to improving search result ranking
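To show what exposing schema.org markup in HTML can look like for a provider, here is a minimal sketch that serializes a metadata record as JSON-LD and wraps it in a `script` element for a landing page. The helper name `jsonld_script_tag` and the dataset values are hypothetical.

```python
import json


def jsonld_script_tag(metadata: dict) -> str:
    """Wrap a metadata record in a JSON-LD script element."""
    payload = json.dumps(metadata, indent=2)
    # Escape "</" so the payload cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Hypothetical dataset description.
dataset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example dataset",
    "description": "A placeholder description of an example dataset.",
    "url": "https://example.org/datasets/1",
}

snippet = jsonld_script_tag(dataset)
print(snippet)
```

The resulting element can sit in the page head next to the human-readable content, so no separate API is required to publish the metadata.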


36 of 55

Why schema.org → consumers

Common way to describe metadata across many resources
Structured metadata even in resources without an API
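From the consumer side, structured metadata on a resource without an API can be read straight from its web pages. The sketch below uses only the Python standard library to pull JSON-LD blocks out of an HTML document; the class name `JsonLdExtractor` and the sample page are hypothetical, and a production harvester would need error handling for malformed markup.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the contents of every application/ld+json script element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            self.items.append(json.loads("".join(self._buffer)))


# Hypothetical landing page with embedded markup.
page = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Dataset", "name": "Example dataset"}
</script>
</head><body><h1>Example dataset</h1></body></html>"""

extractor = JsonLdExtractor()
extractor.feed(page)
print(extractor.items)
```

Because every provider describes its resources with the same vocabulary, the same extractor works across sites.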


37 of 55

Why schema.org → user

Better findability, integration and presentation facilitated by “search engines” and “metadata catalogues”.


38 of 55

Why Bioschemas

Focus on key properties prioritized as Minimum, Recommended and Optional based on community agreements and common practices

Customization of schema.org types to better support the needs of the life sciences community

Additional recommendations regarding property cardinality

Terms reused from well-known ontologies thus avoiding reinventing the wheel
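The Minimum / Recommended / Optional tiers can be sketched as a simple completeness check on a metadata record. The property lists in `PROFILE` below are invented for illustration and are not an actual Bioschemas profile; the function name is likewise hypothetical.

```python
# Hypothetical profile: the property lists are illustrative only.
PROFILE = {
    "minimum": ["name", "description", "url"],
    "recommended": ["keywords", "license"],
    "optional": ["alternateName"],
}


def check_against_profile(record: dict, profile: dict) -> dict:
    """Report which profile properties a record lacks, per tier.

    A record counts as valid when no Minimum property is missing.
    """
    missing = {
        level: [prop for prop in props if prop not in record]
        for level, props in profile.items()
    }
    return {"valid": not missing["minimum"], "missing": missing}


# Hypothetical record lacking a description and a license.
record = {
    "@type": "Dataset",
    "name": "Example dataset",
    "url": "https://example.org/datasets/1",
    "keywords": ["geochemistry"],
}

report = check_against_profile(record, PROFILE)
print(report)
```

Only the Minimum tier affects validity in this sketch; Recommended and Optional gaps are reported so that providers can improve their markup over time.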


39 of 55

Why structured data with Bioschemas


40 of 55

How structured data contributes to FAIRability

Findability: Finding proteins, samples, phenotypes and so on, exposed via web pages

Accessibility: Enabling data extraction over web pages

Interoperability: Gathering data from different life sciences resources following a common format, linking to each other

Reusability: Making reuse rules explicit and providing provenance information


41 of 55

Bioschemas meets FAIR

Findable

Accessible

Interoperable

Reusable

Unique identifiers

Descriptive metadata

Indexed and available


42 of 55

Bioschemas meets FAIR

HTTP

Authorization via login or IP

Schema.org → JSON-LD

Links to other resources

License

Provenance

Findable

Accessible

Interoperable

Reusable


43 of 55

TeSS: Bioschemas in action

http://bioschemas.org

Contact
Description
End date
Event Type
Host Institution

Location
Name
Start Date
…

Bioschemas Event:

44 of 55

Prototype: MarRef → BioSamples, without an API

http://bioschemas.org


4 June 2018

https://github.com/EBIBioSamples/bioschemas_marref_demo/blob/master/Summary.md

45 of 55

Identifiers.org: Bioschemas DataCatalog in action (future)

ELIXIR All Hands 2018, June 2018, Berlin, Germany

Description
Keywords
Name
Provider
URL

Bioschemas DataCatalog:

alternateName
citation
dateCreated
licence
…

46 of 55

How is schema.org currently used in the sciences?


47 of 55

Questionnaire

https://tinyurl.com/rda-rs-questionnaire

Sent to about 150 people → members of DDPIG (127) + people who expressed interest in schema.org (17) + a couple more on a one-to-one basis

Results and discussion


48 of 55

Research schemas

How the sciences can get the most out of schema.org


49 of 55

Minutes and discussions

https://github.com/RDA-DDP/Schema.org-for-Research-Data/

RDA-DDPIG-Schema-TF-Proposal.md


50 of 55

Task force objectives

Objective 1: Define research schemas types and minimum information guidelines for discoverability and accessibility
Objective 2: Crosswalk and gap analysis evaluating existing standards and guidelines
Objective 3: Review existing efforts working on schemas to describe scientific types
Objective 4: Engagement and communication strategy; collaboration with existing efforts


51 of 55

Thanks for your attention!

Questions?


52 of 55

53 of 55

Potential activities for discussion

Use cases and scenarios

Recommendations for schema.org (e.g. missing attributes, controlled values)
Recommendations for better description practices (how should people document and describe their data to support discovery)
Recommendations for including / publishing schema.org (where should it go, how can repositories support it)
Deepening the metadata. Schema.org provides a high-level description (around Dublin Core level). A lot of the value comes from the more detailed information that is closer to a community.
How-to for data centres (how hard is it going to be; what is the level of effort; what are good implementation patterns)
Recommendation on semantics for basic data discovery, e.g. 'modified' (does it refer to the metadata, the data, or the web page)
DCAT is doing a revision cycle; let's engage from the RDA community perspective

54 of 55

Data Discovery Paradigms IG

Action items and next steps


55 of 55

Closing and Actions

Review of Actions coming out of this meeting

Action 1 (responsible person)
Action 2 (responsible person)

Next Steps
