1 of 50

2023

Dr. Petr Knoth

Machine Learning and AI for and from Open Repositories:

Unlocking the power of repositories across use cases requiring machine access to open research

Big Scientific Data and Text Analytics group :

AI for open and responsible research

2 of 50

CORE delivers services for HEIs, researchers, funders and commercial partners, offering seamless access to research.

AI Applications in Research Evaluation (e.g. citation type classification, bibliometrics, impact assessment)
Automatic Expert Finder systems (e.g. for peer-review and grant applications)
Deduplication, document classification, rapid systematic reviews
Research graphs: entity extraction (affiliation, author, etc.)
Research recommender systems and academic search

Research areas

Innovation and trends analysis
Plagiarism detection
Fact checking
Finance
Health

Commercial Partners

Institutional Members

Providing seamless access to open research for humans and machines.

Dr. Petr Knoth : Senior Research Fellow in Text and Data Mining petr.knoth@open.ac.uk

CORE is the world’s most used aggregator of Open Access papers, collating and enriching content from over 11,000 repositories.

>20 Million monthly active users (MAU)
34 Million full-text research papers hosted by CORE.
260 Million metadata records

Signatory of Principles of Open Scholarly Infrastructure (POSI)

25 supporting or sustaining members

3 of 50

Outline

How Artificial Intelligence and Machine Learning (AI/ML) applied to research papers can benefit and transform research
The crucial role of repositories in providing machine access to research content
Using AI/ML for research intelligence and for improving repository workflows

4 of 50

Outline

5 of 50

How can AI/ML transform research

6 of 50

The importance of open research literature

Research literature documents the knowledge we have assembled as a species.

7 of 50

The wide variety of use cases over research literature

A limited number of use cases can be satisfied with a sample of scholarly content.
Many use cases require machine access to all existing research, from everywhere, and always up to date.
The cost is high when a repository does not participate in the open network by providing machine access: some use cases are significantly affected.

8 of 50

AI for systematic reviews

Systematic reviews

Time consuming

Rapid reviews

Limitation on the number of outcomes, interventions and comparators

Living reviews

Live updates to historic systematic reviews with the help of a recommender system

9 of 50

AI for systematic reviews

Involves many steps
Some of the most time-consuming steps can be automated

10 of 50

AI for systematic reviews

11 of 50

AI for systematic reviews

12 of 50

AI for systematic reviews

13 of 50

AI for systematic reviews

14 of 50

AI for citation typing and research assessment

Knowing not only that something was cited, but WHY it was cited.

Built the ACT dataset of >11,000 citations annotated by authors according to a classification schema

Ran 2 Shared Tasks to establish benchmarks for state-of-the-art (SoA) classification models using the ACT and extended ACT2 datasets

Currently investigating extended / dynamic citation contexts to improve model performance

Citation Function	Examples
BACKGROUND	Most of the participatory models to design educational games are founded on educational theories and game design (see for example: Amory, 2007; #CITATION_TAG).
COMPARES_CONTRASTS	Similar observations have been made in the past [30] [31] [32] [33] [34], although others have reported either no relationship or a negative association with SES [#CITATION_TAG].
EXTENSION	This database is the result of a mandatory questionnaire about the home to work displacements and the mobility management measures at large workplaces in Belgium (#CITATION_TAG).
FUTURE	We are thus exploring the option of using datasets such as CrossRef 12, Dimensions 13, OpenCitations [11], and Core [#CITATION_TAG].
MOTIVATION	To illustrate, consider the motivation given by #CITATION_TAG in developing their Bayesian account of word learning.
USES	The diffraction patterns from single crystal measurements were indexed with a home-made program based on the Fit2D software [#CITATION_TAG].
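
The examples in the table mark the citation being classified with the placeholder #CITATION_TAG. A minimal sketch of how such an input might be prepared follows; the function name, the sample sentence and the label tuple are illustrative assumptions, not the ACT tooling.

```python
# Labels copied from the table above; everything else is illustrative.
CITATION_FUNCTIONS = (
    "BACKGROUND", "COMPARES_CONTRASTS", "EXTENSION",
    "FUTURE", "MOTIVATION", "USES",
)


def mask_citation(context: str, citation: str, tag: str = "#CITATION_TAG") -> str:
    """Replace the first occurrence of the target citation with the tag."""
    if citation not in context:
        raise ValueError("target citation not found in context")
    return context.replace(citation, tag, 1)


# Hypothetical sentence: only the citation "[12]" is the one to classify.
sentence = "We index the patterns with a program based on Fit2D [12], as in [7]."
masked = mask_citation(sentence, "[12]")
print(masked)
```

Other citations in the same context stay untouched, so a classifier sees exactly one marked target per example.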

15 of 50

AI for citation typing and research assessment

11 of 34 REF2014 Peer Review Panels used citation data to ‘inform’ their decisions

REF GPA results highly correlated with citation data in these domains

Addition of citation type information can allow for better modelling of how research is being used.

Potential for development of new metrics that leverage enhanced citation information

	UoA	mn2017	med2017	mn2014	med2014
1	Chemistry	0.663	0.802	0.637	0.738
2	Biological Sciences	0.782	0.797	0.688	0.785
3	Aero. Mech. Chem. Engineering	0.771	0.758	0.745	0.760
4	Social Work and Policy	0.697	0.752	0.629	0.635
5	Computer Science and Informatics	0.715	0.743	0.720	0.678

16 of 50

AI for credible trustworthy question answering (CORE-GPT)

CORE is the world’s largest collection of Open Access papers, collating and enriching content from over 11,000 data providers.

>20 Million monthly active users
34 Million full-text research papers hosted by CORE.
260 Million metadata records

GPT large language models*

Can comprehend context and generate human-like text
Can infer meaning from large-scale data

*Other large language models are available

@JayAlammar

17 of 50

Introducing CORE-GPT

18 of 50

Introducing CORE-GPT

19 of 50

CORE-GPT Results

20 of 50

CORE-GPT Results

21 of 50

Reflections / limitations …

ChatGPT

It can start inventing things / hallucinate …

CORE-GPT

Answers need to be anchored to research papers.
More honest about what it doesn’t know => fewer hallucinations
References make it easier to assess the trustworthiness of the answer.

Both

Powerful at synthesizing content and creating summaries
Able to compare and contrast
Can get confused (esp. when answers are ambiguous), mixing content from entirely semantically different areas / uses of a concept
Can be made to argue your way, producing biased text
Critical thinking and judgement need to be exercised
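
One way to anchor answers to research papers is to pass retrieved papers to the language model as numbered sources and ask it to cite them. The sketch below only assembles such a prompt. It is not the CORE-GPT implementation; the `Paper` fields, the instruction wording and the sample paper are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical shape of a retrieved paper."""
    title: str
    url: str
    excerpt: str


def build_prompt(question: str, papers: list[Paper]) -> str:
    """Assemble a prompt that restricts the answer to numbered sources."""
    sources = "\n".join(
        f"[{i}] {p.title} ({p.url}): {p.excerpt}"
        for i, p in enumerate(papers, start=1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, "
        "say that you do not know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


papers = [Paper("Hypothetical paper A", "https://example.org/a", "Excerpt A.")]
prompt = build_prompt("What does paper A report?", papers)
print(prompt)
```

Numbering the sources lets a reader trace each claim back to a paper, and the explicit fallback instruction is one way to encourage the model to admit what it does not know.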

22 of 50

CORE - AI Expert Finder

Evaluation:

Relevancy - was the suggested candidate a suitable match?
Prior Knowledge - was the suggested candidate previously known to the enquirer?
Conflict - are there any conflicts of interest with the proposed candidate?

Prototype tool to automatically identify domain experts based on publications in >34m research papers

Applications in:

Peer review

Proposal review

Consultant/Expert recruitment

Results

74% of suggested candidates were suitable

34% of suggested candidates were not known to enquirer

23 of 50

The crucial role of repositories in providing machine access to research content.

24 of 50

Principle 1

Repositories should always establish a link from the metadata record to the item the metadata record describes using a dereferencable identifier pointing to the version held locally in the repository (if applicable). The dereferencable identifier should be provided in the appropriate metadata element in the used metadata format.
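
As an illustration of Principle 1, the sketch below checks whether a Dublin Core (oai_dc) record carries at least one identifier that is an http(s) URL. It assumes the link is exposed in `dc:identifier`; which element is appropriate depends on the metadata format and application profile in use. The sample record and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DC = "{http://purl.org/dc/elements/1.1/}"

# Hypothetical record: one OAI identifier and one link to the local file.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Hypothetical paper</dc:title>
  <dc:identifier>oai:repo.example.org:123</dc:identifier>
  <dc:identifier>https://repo.example.org/files/123/paper.pdf</dc:identifier>
</oai_dc:dc>"""


def dereferenceable_identifiers(record_xml: str) -> list[str]:
    """Return the dc:identifier values that are http(s) URLs."""
    root = ET.fromstring(record_xml)
    urls = []
    for element in root.iter(DC + "identifier"):
        value = (element.text or "").strip()
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(value)
    return urls


links = dereferenceable_identifiers(SAMPLE)
print(links)
```

A record for which this list is empty gives a harvester no way to reach the item it describes. The check does not verify that the URL actually resolves to the locally held version.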

25 of 50

Principle 2

Repositories should provide universal access to machines with the same level of access as humans have. It should be possible for machines to harvest the entire content of the repository in a reasonable time to enable a machine to maintain up-to-date information about the content held in the repository.
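
What counts as "a reasonable time" can be estimated with simple arithmetic. The sketch below uses made-up figures (repository size, records per response, seconds per request) purely for illustration.

```python
import math


def harvest_seconds(records: int, page_size: int, seconds_per_request: float) -> float:
    """Estimate the time for a full harvest with sequential paged requests."""
    requests_needed = math.ceil(records / page_size)
    return requests_needed * seconds_per_request


# Assumed figures: 100,000 records, 100 records per response, 2 s per request.
estimate = harvest_seconds(records=100_000, page_size=100, seconds_per_request=2.0)
print(f"{estimate / 60:.1f} minutes")
```

With these assumptions a full harvest takes 1,000 requests, or about 33 minutes. Halving the page size or adding rate limits scales the estimate accordingly.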

26 of 50

Functional OAI-PMH endpoint

Test: don’t take it for granted that it works

Monitor: the fact that it works now doesn’t mean it can’t go wrong when you least expect it

Use an external system to see how your repository looks from outside your organisation.
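
A basic external test of an OAI-PMH endpoint could request `ListRecords` pages and follow the `resumptionToken` until none is returned. The sketch below covers only building the request URL and summarising one response, so it runs without network access. The base URL, the sample response and the function names are assumptions; the verb and argument names come from the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url: str, token: Optional[str] = None) -> str:
    """Build a ListRecords request; a resumptionToken replaces metadataPrefix."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"


def summarise_page(xml_text: str) -> tuple[int, Optional[str], Optional[str]]:
    """Return (record count, next resumptionToken, error code) for one response."""
    root = ET.fromstring(xml_text)
    error = root.find(f"{OAI}error")
    if error is not None:
        return 0, None, error.get("code")
    records = root.findall(f"{OAI}ListRecords/{OAI}record")
    token_el = root.find(f"{OAI}ListRecords/{OAI}resumptionToken")
    token = (token_el.text or "").strip() if token_el is not None else ""
    return len(records), (token or None), None


# Hypothetical response with two records and one more page to fetch.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:repo.example.org:1</identifier></header></record>
    <record><header><identifier>oai:repo.example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

count, next_token, error_code = summarise_page(SAMPLE_PAGE)
print(list_records_url("https://repo.example.org/oai", next_token))
```

Run on a schedule from outside the institution's network, a loop over these two functions would show whether the endpoint still answers and whether a full harvest completes.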

27 of 50

Robots.txt

Be careful not to block robots

Don’t give preferential treatment
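
Both points can be checked with Python's standard `urllib.robotparser`. The sketch below parses a hypothetical robots.txt that favours one crawler and reports which user agents may fetch a full-text URL; the file content, the URL and the agent name `ExampleHarvester` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that gives one crawler preferential treatment.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /bitstream/
"""


def access_by_agent(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Report, per user agent, whether robots.txt allows fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = access_by_agent(
    ROBOTS_TXT,
    "https://repo.example.org/bitstream/1/paper.pdf",
    ["Googlebot", "ExampleHarvester"],
)
print(result)
```

Here the full text is open to one crawler and blocked for every other machine. A result in which the agents disagree is a sign of preferential treatment.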

28 of 50

Validate metadata

Adopt a relevant application profile (e.g. RIOXX.net)
Use a metadata validation service, e.g. within the CORE Repository Dashboard

29 of 50

Validate

Validate: don’t take it for granted that it works

Monitor: the fact that it works now doesn’t mean it can’t go wrong when you least expect it

30 of 50

Support Signposting

Helping machines to navigate repositories in order to locate the content.
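
Signposting exposes typed links such as `cite-as`, `item` and `describedby`, commonly in the HTTP `Link` header of a landing page. The sketch below shows how a machine might read such a header. The header value is a made-up example, and the regular-expression parser is a simplification that does not handle every valid `Link` header.

```python
import re
from typing import Optional

# Simplified grammar: <target> followed by ;name=value parameters.
LINK_RE = re.compile(r'<([^>]+)>\s*((?:;\s*[\w-]+\s*=\s*(?:"[^"]*"|[^;,\s]+)\s*)*)')
PARAM_RE = re.compile(r';\s*([\w-]+)\s*=\s*(?:"([^"]*)"|([^;,\s]+))')


def parse_link_header(value: str) -> list[tuple[str, str, Optional[str]]]:
    """Return (rel, target URL, media type) for each typed link."""
    links = []
    for target, params in LINK_RE.findall(value):
        attrs = {}
        for name, quoted, bare in PARAM_RE.findall(params):
            attrs[name.lower()] = quoted or bare
        for rel in attrs.get("rel", "").split():
            links.append((rel, target, attrs.get("type")))
    return links


# Hypothetical Link header of a repository landing page.
HEADER = (
    '<https://doi.org/10.1234/example>; rel="cite-as", '
    '<https://repo.example.org/files/1/paper.pdf>; rel="item"; type="application/pdf"'
)

links = parse_link_header(HEADER)
full_text = [url for rel, url, _ in links if rel == "item"]
print(full_text)
```

With links like these, a harvester can go from the landing page to the full text without scraping the HTML.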

31 of 50

COAR Next Generation Repositories Working Group

32 of 50

Why is CORE important?

Increase your content’s discoverability and prevent its misuse

Search, Recommender, Discovery, PMC Linkout

Make your papers uniquely identifiable and resolvable with PIDs

OAI Resolver

Assess and contribute to Open Access compliance and FAIRness

Indexed by CORE badge

Make your content machine readable

Repository Health Check, CORE API, CORE Dataset, CORE FastSync

Become a CORE Member and benefit from lots more

Dashboard: Metadata validation and monitoring

>20M monthly active users

33 of 50

AI/ML for research intelligence and for improving repository workflows

34 of 50

Affiliation extraction

Problem

Many metadata records do not have Some text …

Show an example how affiliations can be extracted. Show Grobid output …

How does this correspond with ROR

This is a problem we are currently working on

2. Publication footprint

35 of 50

Affiliation extraction

Many metadata records do not have affiliation data
Affiliation is important for a range of use cases, including publication footprint
At CORE, we developed a method to extract affiliation information from papers using a supervised ML model.
Will propagate to the CORE API and Dashboard.
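
For comparison, tools such as GROBID emit affiliations as TEI XML, which can be read with a few lines of code. The sketch below is not the supervised model described above. It assumes a TEI structure with `affiliation` and `orgName type="institution"` elements, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented TEI header in the style of GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename>Ada</forename><surname>Example</surname></persName>
      <affiliation key="aff0">
        <orgName type="department">Hypothetical Department</orgName>
        <orgName type="institution">Example University</orgName>
      </affiliation>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>"""


def institutions(tei_xml: str) -> list[str]:
    """Return the distinct institution names found in author affiliations."""
    root = ET.fromstring(tei_xml)
    names = []
    path = ".//tei:affiliation/tei:orgName[@type='institution']"
    for org in root.iterfind(path, TEI):
        name = (org.text or "").strip()
        if name and name not in names:
            names.append(name)
    return names


found = institutions(SAMPLE_TEI)
print(found)
```

The extracted strings would still need to be matched to an organisation identifier before they can support use cases such as publication footprint.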

36 of 50

Deduplication

What do duplicates look like and why do they occur in repositories?

37 of 50

Deduplication

CORE uses an adapted version of locality-sensitive hashing (simhash) for deduplication.
>90% F1-score performance
Deduplication powers our services, including in the Dashboard, for:

versioning
OA compliance (cross-repository)
With affiliation extraction, this will allow us to warn institutions before outputs become non-compliant
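
To make the idea concrete, here is a textbook simhash over word tokens. It is a minimal sketch, not CORE's adapted version: the tokenisation, the use of MD5 as the token hash and the 64-bit width are assumptions chosen for brevity.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Compute a simhash fingerprint from lower-cased word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


first = simhash("Open access to research papers")
second = simhash("open  ACCESS to research papers!")
print(hamming(first, second))
```

Documents that share most of their tokens tend to produce fingerprints that differ in few bits, so candidate duplicates can be found by comparing Hamming distances against a threshold chosen on labelled data.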

Comparison mode

38 of 50

Deduplication

Comparison mode

List of possible duplicates

39 of 50

Data enrichment

Here you can see if there is an earlier version of an article in another repository …

…and can download a spreadsheet showing deposit dates from multiple repositories

You can enrich data with DOIs identified in other repositories
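
DOI enrichment could work along these lines once duplicates are grouped: if the records in a duplicate cluster agree on a single DOI, copy it to the records that lack one. The record layout and function name below are hypothetical and do not describe the Dashboard's implementation.

```python
def enrich_dois(cluster: list[dict]) -> list[dict]:
    """Fill missing DOIs when the duplicate cluster agrees on exactly one DOI."""
    dois = {record["doi"].lower() for record in cluster if record.get("doi")}
    if len(dois) != 1:
        # No DOI known, or conflicting DOIs: leave the records unchanged.
        return [dict(record) for record in cluster]
    doi = next(iter(dois))
    return [{**record, "doi": record.get("doi") or doi} for record in cluster]


# Hypothetical cluster: the same paper deposited in two repositories.
cluster = [
    {"repository": "A", "doi": "10.1234/example"},
    {"repository": "B", "doi": None},
]
enriched = enrich_dois(cluster)
print(enriched)
```

Refusing to copy anything when the cluster contains conflicting DOIs keeps a wrong duplicate match from spreading an incorrect identifier.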

40 of 50

Document classification

Classification of research papers in a distributed environment is a problem.
Established a benchmark for research document classification as part of the SDP/COLING conference.
In the process of bringing themes to the CORE API.

41 of 50

CORE moving to a membership model

CORE will become an independent open scholarly infrastructure

CORE will no longer receive direct funding from Jisc

August 2023

CORE will be operated by The Open University

Membership

Sponsorship

(data providers)

42 of 50

CORE Membership

A network of data providers who are committed to the ongoing success of the Open Access movement
We provide tools and benefits for our members
All CORE data providers are eligible to become CORE Starting Members free of charge
Supporting and Sustaining Members:

help shape our development roadmap
support and sustain CORE

43 of 50

Three levels of CORE Membership

Starting (FREE)

Supporting

Sustaining

44 of 50

Next Generation Repositories: Behaviours

Some text …

45 of 50

46 of 50

47 of 50

48 of 50

2023

Take home …

ML/AI has the potential to transform all stages of the research process, including how we carry out research, how we assess it and how we organise research knowledge.
OA repositories play a key role in this process by providing machine access to research content.
AI/ML already provides opportunities for improving how we use repositories and how we organise, enrich and curate the content in them.

49 of 50

2023

Take home …

CORE provides a large corpus of open research papers which can be utilised to make GPT answer in an evidence-based manner using references.
Aiming to make CORE-GPT available from the CORE API.
Our group is very interested in working with others on research projects in the area of AI for open and responsible research

50 of 50

2023

THANK YOU