1 of 50

2023

Dr. Petr Knoth

Machine Learning and AI for and from Open Repositories:

Unlocking the power of repositories across use cases requiring machine access to open research

Big Scientific Data and Text Analytics group: AI for open and responsible research

2 of 50

CORE delivers services for HEIs, researchers, funders and commercial partners, offering seamless access to research.

  • AI Applications in Research Evaluation (e.g. citation type classification, bibliometrics, impact assessment)
  • Automatic Expert Finder systems (e.g. for peer-review and grant applications)
  • Deduplication, document classification, rapid systematic reviews
  • Research graphs: entity extraction (affiliation, author, etc.)
  • Research recommender systems and academic search

Research areas

  • Innovation and trends analysis
  • Plagiarism detection
  • Fact checking
  • Finance
  • Health

Commercial Partners

Institutional Members

Providing seamless access to open research for humans and machines.

Dr. Petr Knoth: Senior Research Fellow in Text and Data Mining, petr.knoth@open.ac.uk

CORE is the world’s most used aggregator of Open Access papers, collating and enriching content from over 11,000 repositories.

  • >20 Million monthly active users (MAU)
  • 34 Million full-text research papers hosted by CORE.
  • 260 Million metadata records

Signatory of Principles of Open Scholarly Infrastructure (POSI)

25 supporting or sustaining members

3 of 50

Outline

  1. How can Artificial Intelligence and Machine Learning (AI/ML) applied to research papers benefit and transform research?
  2. The crucial role of repositories in providing machine access to research content
  3. Using AI/ML for research intelligence and improving repository workflows

4 of 50

Outline

5 of 50

How can AI/ML transform research

6 of 50

The importance of open research literature

Research literature documents the knowledge we have assembled as a human species.

7 of 50

The wide variety of use cases over research literature

  • A limited number of use cases can be satisfied with a sample of scholarly content.
  • Many use cases require machine access to all existing research, from everywhere, that is always up to date.
  • When a repository does not provide machine access, and so does not participate in the open network, the cost is high: some use cases are significantly affected.

8 of 50

AI for systematic reviews

  • Systematic reviews
    • Time consuming
  • Rapid reviews
    • Limitation on the number of outcomes, interventions and comparators
  • Living reviews
    • Live updates to historic systematic reviews with the help of a recommender system

9 of 50

AI for systematic reviews

  • Involves many steps
  • Some of the most time-consuming steps can be automated

10 of 50

AI for systematic reviews

11 of 50

AI for systematic reviews

12 of 50

AI for systematic reviews

13 of 50

AI for systematic reviews

14 of 50

AI for citation typing and research assessment

Knowing not only that something was cited, but WHY it was cited.

Built the ACT dataset of >11,000 citations annotated by authors according to a classification schema

Ran 2 Shared Tasks to establish benchmarks for state-of-the-art (SoA) classification models using the ACT and extended ACT2 datasets

Currently investigating extended / dynamic citation contexts to improve model performance

Citation Function

Examples

BACKGROUND

Most of the participatory models to design educational games are founded on educational theories and game design (see for example: Amory, 2007; #CITATION_TAG).

COMPARES_CONTRASTS

Similar observations have been made in the past [30] [31] [32] [33] [34], although others have reported either no relationship or a negative association with SES [#CITATION_TAG].

EXTENSION

This database is the result of a mandatory questionnaire about the home to work displacements and the mobility management measures at large workplaces in Belgium (#CITATION_TAG).

FUTURE

We are thus exploring the option of using datasets such as CrossRef 12, Dimensions 13, OpenCitations [11], and Core [#CITATION_TAG].

MOTIVATION

To illustrate, consider the motivation given by #CITATION_TAG in developing their Bayesian account of word learning.

USES

The diffraction patterns from single crystal measurements were indexed with a home-made program based on the Fit2D software [#CITATION_TAG].
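
The examples above can be represented for a classifier roughly as follows. This is a minimal sketch, not the ACT tooling itself: the six labels and the `#CITATION_TAG` placeholder come from the slide, while the `CitationExample` class, the `mask_citation` helper and the numeric marker `12` in the sample sentence are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CitationFunction(Enum):
    """The six citation functions listed on the slide."""
    BACKGROUND = "BACKGROUND"
    COMPARES_CONTRASTS = "COMPARES_CONTRASTS"
    EXTENSION = "EXTENSION"
    FUTURE = "FUTURE"
    MOTIVATION = "MOTIVATION"
    USES = "USES"


# Placeholder that marks the citation to be classified (taken from the slide).
CITATION_TAG = "#CITATION_TAG"


@dataclass
class CitationExample:
    """Hypothetical container: one citing sentence plus its annotated function."""
    context: str
    label: CitationFunction


def mask_citation(sentence: str, marker: str) -> str:
    """Replace the first occurrence of the target citation marker with the placeholder."""
    if marker not in sentence:
        raise ValueError(f"marker {marker!r} not found in sentence")
    return sentence.replace(marker, CITATION_TAG, 1)


# Illustrative sentence; the reference number 12 is invented for this sketch.
example = CitationExample(
    context=mask_citation(
        "The patterns were indexed with a program based on the Fit2D software [12].",
        "12",
    ),
    label=CitationFunction.USES,
)
```

Masking the target citation keeps the model focused on the one reference being typed, even when the sentence contains several other citations.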

15 of 50

AI for citation typing and research assessment

11 of 34 REF2014 Peer Review Panels used citation data to ‘inform’ their decisions

REF GPA results highly correlated with citation data in these domains

Addition of citation type information can allow for better modelling of how research is being used.

Potential for development of new metrics that leverage enhanced citation information

#  UoA                               mn2017  med2017  mn2014  med2014
1  Chemistry                         0.663   0.802    0.637   0.738
2  Biological Sciences               0.782   0.797    0.688   0.785
3  Aero. Mech. Chem. Engineering     0.771   0.758    0.745   0.760
4  Social Work and Policy            0.697   0.752    0.629   0.635
5  Computer Science and Informatics  0.715   0.743    0.720   0.678

16 of 50

AI for credible trustworthy question answering (CORE-GPT)

CORE is the world’s largest collection of Open Access papers, collating and enriching content from over 11,000 data providers.

  • >20 Million monthly active users
  • 34 Million full-text research papers hosted by CORE.
  • 260 Million metadata records

GPT large language models*

  • Can comprehend context and generate human-like text
  • Can infer meaning from large-scale data

*Other large language models are available

@JayAlammar

17 of 50

Introducing CORE-GPT

18 of 50

Introducing CORE-GPT

19 of 50

CORE-GPT Results

20 of 50

CORE-GPT Results

21 of 50

Reflections / limitations …

ChatGPT

CORE-GPT

Both

  • Can get confused (esp. when answers are ambiguous) mixing content from entirely semantically different uses of a concept
  • Can be made to argue your way producing biased text
  • It can start inventing things / hallucinate …
  • Answers need to be anchored to research papers.
  • More honest about what it doesn’t know => fewer hallucinations
  • References make it easier to assess the trustworthiness of the answer.
  • Powerful at synthesizing content and creating summaries
  • Able to compare and contrast
  • Can get confused (esp. when answers are ambiguous) mixing content from entirely semantically different areas / uses of a concept
  • Can be made to argue your way producing biased text
  • Critical thinking and judgement need to be exercised
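
The idea of anchoring answers to research papers can be sketched as a prompt-building step. This is an illustration only and not how CORE-GPT is implemented: the `Paper` class, the `score` and `build_prompt` functions, the prompt wording and the example.org URLs are all assumptions, a simple word-overlap count stands in for a real search over CORE, and no language model is called.

```python
from dataclasses import dataclass


@dataclass
class Paper:
    """Hypothetical minimal record for a retrieved paper."""
    title: str
    abstract: str
    url: str


def score(question: str, paper: Paper) -> int:
    """Count question words that also occur in the paper's title or abstract."""
    q_words = set(question.lower().split())
    p_words = set((paper.title + " " + paper.abstract).lower().split())
    return len(q_words & p_words)


def build_prompt(question: str, papers: list[Paper], top_k: int = 2) -> str:
    """Pick the best-matching papers and ask for an answer that cites them."""
    ranked = sorted(papers, key=lambda p: score(question, p), reverse=True)[:top_k]
    sources = "\n".join(
        f"[{i}] {p.title} ({p.url}): {p.abstract}" for i, p in enumerate(ranked, 1)
    )
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [n]. If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )


# Invented sample data for the sketch.
papers = [
    Paper(
        "Simhash for deduplication of scholarly documents",
        "We detect duplicate documents with locality sensitive hashing.",
        "https://example.org/1",
    ),
    Paper("Citation typing", "We classify why a paper was cited.", "https://example.org/2"),
]
prompt = build_prompt("how to detect duplicate scholarly documents", papers, top_k=1)
```

Numbering the sources in the prompt lets a reader trace each claim in the answer back to a paper, which is what makes the trustworthiness of the answer easier to assess.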

22 of 50

CORE - AI Expert Finder

Evaluation:

  • Relevancy - was the suggested candidate a suitable match?
  • Prior Knowledge - was the suggested candidate previously known to the enquirer?
  • Conflict - are there any conflicts of interest with the proposed candidate?

Prototype tool to automatically identify domain experts based on publications in >34m research papers

Applications in:

Peer review

Proposal review

Consultant/Expert recruitment

Results

74% of suggested candidates were suitable

34% of suggested candidates were not known to enquirer

23 of 50

The crucial role of repositories in providing machine access to research content.

24 of 50

Principle 1

Repositories should always establish a link from the metadata record to the item the metadata record describes using a dereferencable identifier pointing to the version held locally in the repository (if applicable). The dereferencable identifier should be provided in the appropriate metadata element in the used metadata format.
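
A rough way to check a record against this principle is sketched below. It assumes the repository exposes simple Dublin Core (`oai_dc`) and that the link is carried in `dc:identifier`; other metadata formats use other elements. The function name and the sample record are hypothetical, and the check only looks at the form of the identifier, it does not try to resolve it.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used inside oai_dc records.
DC_NS = "http://purl.org/dc/elements/1.1/"


def dereferenceable_identifiers(record_xml: str) -> list[str]:
    """Return the dc:identifier values that look like HTTP(S) URLs."""
    root = ET.fromstring(record_xml)
    values = [(el.text or "").strip() for el in root.iter(f"{{{DC_NS}}}identifier")]
    return [v for v in values if v.startswith(("http://", "https://"))]


# Invented sample record: one OAI identifier and one link to the local full text.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                       xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example paper</dc:title>
  <dc:identifier>oai:repository.example.org:1234</dc:identifier>
  <dc:identifier>https://repository.example.org/1234/1/paper.pdf</dc:identifier>
</oai_dc:dc>"""

links = dereferenceable_identifiers(SAMPLE)
```

A record for which this returns an empty list describes an item that a harvester cannot reach from the metadata alone.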

25 of 50

Principle 2

Repositories should provide universal access to machines with the same level of access as humans have. It should be possible for machines to harvest the entire content of the repository in a reasonable time to enable a machine to maintain up-to-date information about the content held in the repository.

26 of 50

Functional OAI-PMH endpoint

Test: don’t take it for granted that it works

Monitor: the fact that it works now doesn’t mean it can’t go wrong when you least expect it

Use an external system to see how your repository looks from outside your organisation.
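
A basic endpoint test can be sketched as two steps: build a `ListRecords` request, then inspect the response for records, errors and a resumption token. The verb, the `metadataPrefix` and `resumptionToken` arguments and the response namespace follow the OAI-PMH 2.0 protocol; the function names, the base URL and the sample response are assumptions, and the network call itself is left out so the sketch runs offline.

```python
import xml.etree.ElementTree as ET
from typing import Optional
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url: str, resumption_token: Optional[str] = None) -> str:
    """Build a ListRecords request; a resumption token replaces all other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"


def check_response(xml_text: str) -> tuple[int, Optional[str], list[str]]:
    """Return (number of records, next resumption token or None, error codes)."""
    root = ET.fromstring(xml_text)
    errors = [el.get("code", "") for el in root.iter(f"{{{OAI_NS}}}error")]
    records = len(list(root.iter(f"{{{OAI_NS}}}record")))
    token_el = root.find(f".//{{{OAI_NS}}}resumptionToken")
    token = (token_el.text or "").strip() if token_el is not None else ""
    return records, (token or None), errors


# Invented sample response with two records and one more page to fetch.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:repository.example.org:1</identifier></header></record>
    <record><header><identifier>oai:repository.example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

first_url = list_records_url("https://repository.example.org/oai")
result = check_response(SAMPLE)
```

Running such a check on a schedule, from a machine outside the institution's network, covers both the testing and the monitoring points above.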

27 of 50

Robots.txt

  • Be careful not to block robots
  • Don’t give preferential treatment
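
Both points can be checked with Python's standard `urllib.robotparser`. In this sketch the robots.txt content, the agent names `FavouredBot` and `OtherHarvester`, and the URL are invented; the rules shown are an example of what to avoid, since they admit one crawler and block all others.

```python
from urllib.robotparser import RobotFileParser


def blocked_agents(robots_txt: str, url: str, agents: list[str]) -> list[str]:
    """Return the user agents that the given robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in agents if not parser.can_fetch(agent, url)]


# Example of preferential treatment: one named bot is allowed, everyone else is blocked.
ROBOTS = """\
User-agent: FavouredBot
Allow: /

User-agent: *
Disallow: /
"""

blocked = blocked_agents(
    ROBOTS,
    "https://repository.example.org/1234/1/paper.pdf",
    ["FavouredBot", "OtherHarvester"],
)
```

A non-empty result for the harvesters you expect to serve means the repository is either blocking robots or favouring some of them.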

28 of 50

Validate metadata

  • Adopt a relevant application profile (e.g. RIOXX.net)
  • Use a metadata validation service, e.g. within the CORE Repository Dashboard

29 of 50

Validate

Validate: don’t take it for granted that it works

Monitor: the fact that it works now doesn’t mean it can’t go wrong when you least expect it

30 of 50

Support Signposting

Helping machines to navigate repositories in order to locate the content.
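
Signposting exposes typed links, for example in the HTTP `Link` header of a landing page, using relation types such as `cite-as`, `item` and `describedby`. The sketch below shows how a machine might read them. The parser is a simplified one written for this illustration (it is not a full implementation of the Link header grammar), and the header value, DOI and URLs are invented.

```python
import re
from collections import defaultdict


def parse_link_header(value: str) -> dict[str, list[str]]:
    """Group link targets by relation type from a Link header value."""
    links: dict[str, list[str]] = defaultdict(list)
    # Each link is "<target>" followed by its parameters, up to the next "<".
    for target, params in re.findall(r"<([^>]+)>([^<]*)", value):
        match = re.search(r'rel\s*=\s*"?([^";,]+)"?', params)
        if match:
            # A rel parameter may carry several space-separated relation types.
            for rel in match.group(1).split():
                links[rel].append(target)
    return dict(links)


# Invented header of a repository landing page.
HEADER = (
    '<https://doi.org/10.1234/example>; rel="cite-as", '
    '<https://repository.example.org/1234/1/paper.pdf>; rel="item"; type="application/pdf", '
    '<https://repository.example.org/1234/metadata.xml>; rel="describedby"; type="application/xml"'
)

links = parse_link_header(HEADER)
```

With these links a harvester can go from the landing page straight to the full text (`item`) and the metadata (`describedby`) without scraping the HTML.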

31 of 50

COAR Next Generation Repositories Working Group

32 of 50

Why is CORE important?

Increase your content’s discoverability and prevent its misuse

Search, Recommender, Discovery, PMC Linkout

Make your papers uniquely identifiable and resolvable with PIDs

OAI Resolver

Assess and contribute to Open Access compliance and FAIRness

Indexed by CORE badge

Make your content machine readable

Repository Health Check, CORE API, CORE Dataset, CORE FastSync

Become a CORE Member and benefit from lots more

Dashboard: Metadata validation and monitoring

>20M monthly active users

33 of 50

AI/ML for research intelligence and for improving repository workflows

34 of 50

Affiliation extraction

  1. Problem

Many metadata records do not have Some text …

Show an example how affiliations can be extracted. Show Grobid output …

How does this correspond with ROR

This is a problem we are currently working on

2. Publication footprint

35 of 50

Affiliation extraction

  • Many metadata records do not have affiliation data
  • Affiliation is important for a range of use cases, including publication footprint
  • At CORE, we developed a method to extract affiliation information from papers using a supervised ML model.
  • Will propagate to the CORE API and Dashboard.

36 of 50

Deduplication

What do duplicates look like and why do they occur in repositories?

37 of 50

Deduplication

  • CORE uses an adapted version of locality sensitive hashing (simhash) for deduplication.
  • >90% F1-score performance
  • Deduplication powers our services, including in the Dashboard, for:
    1. versioning
    2. OA compliance (cross-repository)
    3. with affiliation extraction, this will allow us to warn institutions before outputs become non-compliant
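
The general simhash idea can be sketched in a few lines. This is the textbook technique under simple assumptions (word tokens, equal weights, a 64-bit fingerprint, BLAKE2b as the token hash) and not CORE's adapted version; the function names are hypothetical.

```python
import hashlib
import re

BITS = 64  # fingerprint length in bits


def simhash(text: str) -> int:
    """Compute a 64-bit simhash fingerprint from the word tokens of text."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


a = simhash("Deduplication of Scholarly Documents")
b = simhash("deduplication of scholarly documents.")
distance = hamming_distance(a, b)
```

Documents whose fingerprints differ in only a few bits are candidate duplicates; the distance threshold is a tuning choice that trades precision against recall.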

Comparison mode

38 of 50

Deduplication

Comparison mode

List of possible duplicates

39 of 50

Data enrichment

Here you can see if there is an earlier version of an article in another repository …

…and can download a spreadsheet showing deposit dates from multiple repositories

You can enrich data with DOIs identified in other repositories

40 of 50

Document classification

  • Classification of research papers in a distributed environment is a problem.
  • Established a benchmark for research document classification as part of the SDP/COLING conference.
  • In the process of bringing themes to the CORE API.

41 of 50

CORE moving to a membership model

CORE will become an independent open scholarly infrastructure

CORE will no longer receive direct funding from Jisc

August 2023

CORE will be operated by The Open University

Membership

Sponsorship

(data providers)

42 of 50

CORE Membership

  • A network of data providers who are committed to the ongoing success of the Open Access movement
  • We provide tools and benefits for our members
  • All CORE data providers are eligible to become CORE Starting Members free of charge
  • Supporting and Sustaining Members:
    • help shape our development roadmap
    • support and sustain CORE

43 of 50

Three levels of CORE Membership

  • Starting (FREE)
  • Supporting
  • Sustaining

44 of 50

Next Generation Repositories: Behaviours

  • Some text …

45 of 50

More reading: references

Knoth, P. (2013). From open access metadata to open access content: two principles for increased visibility of open access content. In Open Repositories 2013. Retrieved from http://oro.open.ac.uk/37824/

Pride, D., & Knoth, P. (2020). An Authoritative Approach to Citation Classification. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020. doi:10.1145/3383583.3398617

Kunnath, Suchetha N.; Pride, David; Gyawali, Bikash and Knoth, Petr (2020). Overview of the 2020 WOSP 3C Citation Context Classification Task. In: Proceedings of the 8th International Workshop on Mining Scientific Publications, Association for Computational Linguistics pp. 75–83.

Kunnath, Suchetha N.; Herrmannova, Drahomira; Pride, David; Knoth, Petr (2022). A Meta-analysis of Semantic Classification of Citations. Quantitative Science Studies, 2 (4), pp. 1170-1215

46 of 50

More reading: references

Kusa, Wojciech; Hanbury, Allan; Knoth, Petr (2022). Automation of Citation Screening for Systematic Literature Reviews using Neural Networks: A Replicability Study. In: 44th European Conference on Information Retrieval, 10-14 Apr 2022, Stavanger, Norway. Springer, 13185, pp. 584-598

Nambanoor Kunnath, Suchetha; Pride, David; Knoth, Petr (2022). Dynamic Context Extraction for Citation Classification. In: The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, 20-23 Nov 2022, Virtual

Gyawali, Bikash; Anastasiou, Lucas; Knoth, Petr (2020). Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings. In: 12th Language Resources and Evaluation Conference, 11-16 May 2020, Marseille, France. European Language Resources Association, pp. 894-903

47 of 50

More reading: references

Óscar E. Mendoza, Wojciech Kusa, Alaa El-Ebshihy, Ronin Wu, David Pride, Petr Knoth, Drahomira Herrmannova, Florina Piroi, Gabriella Pasi, and Allan Hanbury. 2022. Benchmark for Research Theme Classification of Scholarly Documents. In Proceedings of the Third Workshop on Scholarly Document Processing, pages 253–262, Gyeongju, Republic of Korea. Association for Computational Linguistics.

Pride, David; Harag, Jozef; Knoth, Petr (2019). ACT: An Annotation Platform for Citation Typing at Scale. In: JCDL 2019 - ACM/IEEE-CS Joint Conference on Digital Libraries 2019, 2-6 Jun 2019, Urbana-Champaign, Illinois

Herrmannova, Drahomira; Pontika, Nancy; Knoth, Petr (2019). Do Authors Deposit on Time? Tracking Open Access Policy Compliance. In: 2019 ACM/IEEE Joint Conference on Digital Libraries, 2-6 Jun 2019, Urbana-Champaign, IL, pp. 206-216. Best Paper Award.

48 of 50

Take home …

  • ML/AI has the potential to transform all stages of the research process, including how we carry out research, how we assess it and how we organise research knowledge.
  • OA repositories play a key role in this process by providing machine access to research content.
  • AI/ML already provides opportunities for improving the ways we use repositories and organise, enrich and curate the content in them.

49 of 50

Take home …

  • CORE provides a large corpus of open research papers which can be utilised to make GPT answer in an evidence-based manner using references.
  • Aiming to make CORE-GPT available from the CORE API.
  • Our group is very interested in working with others on research projects in the area of AI for open and responsible research.

50 of 50

THANK YOU