2023
Dr. Petr Knoth
Machine Learning and AI for and from Open Repositories:
Unlocking the power of repositories across use cases requiring machine access to open research
Big Scientific Data and Text Analytics group :
AI for open and responsible research
CORE delivers services for HEIs, researchers, funders and commercial partners, offering seamless access to research.
Research areas
Commercial Partners
Institutional Members
Providing
seamless access
to open research
for humans and
machines.
Dr. Petr Knoth : Senior Research Fellow in Text and Data Mining petr.knoth@open.ac.uk
CORE is the world’s most used aggregator of Open Access papers, collating and enriching content from over 11,000 repositories.
Signatory of Principles of Open Scholarly Infrastructure (POSI)
25 supporting or sustaining members
Outline
How can AI/ML transform research
The importance of open research literature
Research literature documents the knowledge we have assembled as a human species.
The wide variety of use cases over research literature
AI for systematic reviews
AI for citation typing and research assessment
Knowing not only that something was cited, but WHY it was cited.
Built ACT Dataset of >11,000 citations annotated by authors according to classification schema
Ran 2 Shared Tasks to establish benchmarks for state-of-the-art classification models using the ACT and extended ACT2 datasets
Currently investigating extended / dynamic citation contexts to improve model performance
Citation Function | Examples |
BACKGROUND | Most of the participatory models to design educational games are founded on educational theories and game design (see for example: Amory, 2007; #CITATION_TAG). |
COMPARES_CONTRASTS | Similar observations have been made in the past [30] [31] [32] [33] [34], although others have reported either no relationship or a negative association with SES [#CITATION_TAG]. |
EXTENSION | This database is the result of a mandatory questionnaire about the home to work displacements and the mobility management measures at large workplaces in Belgium (#CITATION_TAG). |
FUTURE | We are thus exploring the option of using datasets such as CrossRef 12, Dimensions 13, OpenCitations [11], and Core [#CITATION_TAG]. |
MOTIVATION | To illustrate, consider the motivation given by #CITATION_TAG in developing their Bayesian account of word learning. |
USES | The diffraction patterns from single crystal measurements were indexed with a home-made program based on the Fit2D software [#CITATION_TAG]. |
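As a rough illustration of the citation typing setup above, the sketch below models the six citation functions from the table as an enumeration and pulls a fixed-size word window around the #CITATION_TAG marker. The helper name citation_context and the window size are assumptions made for this example; the slides do not describe how contexts are actually extracted, and the extended / dynamic contexts under investigation would go beyond a fixed window.

```python
from enum import Enum


class CitationFunction(Enum):
    """The six citation functions listed in the table above."""
    BACKGROUND = "BACKGROUND"
    COMPARES_CONTRASTS = "COMPARES_CONTRASTS"
    EXTENSION = "EXTENSION"
    FUTURE = "FUTURE"
    MOTIVATION = "MOTIVATION"
    USES = "USES"


CITATION_TAG = "#CITATION_TAG"


def citation_context(text: str, window: int = 5) -> str:
    """Return up to `window` words on each side of the first tagged citation.

    Hypothetical helper: a fixed word window is only one simple way
    to define a citation context.
    """
    words = text.split()
    for i, word in enumerate(words):
        if CITATION_TAG in word:
            start = max(0, i - window)
            return " ".join(words[start:i + window + 1])
    raise ValueError("no citation tag found in text")


# Example sentence and label taken from the USES row of the table.
example = (
    "The diffraction patterns from single crystal measurements were indexed "
    "with a home-made program based on the Fit2D software [#CITATION_TAG]."
)
context = citation_context(example, window=3)
label = CitationFunction.USES
```

A classifier would then take the context string as input and predict one of the CitationFunction values.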
AI for citation typing and research assessment
11 of 34 REF2014 Peer Review Panels used citation data to ‘inform’ their decisions
REF GPA results highly correlated with citation data in these domains
Addition of citation type information can allow for better modelling of how research is being used.
Potential for development of new metrics that leverage enhanced citation information
# | UoA | mn2017 | med2017 | mn2014 | med2014 |
1 | Chemistry | 0.663 | 0.802 | 0.637 | 0.738 |
2 | Biological Sciences | 0.782 | 0.797 | 0.688 | 0.785 |
3 | Aero. Mech. Chem. Engineering | 0.771 | 0.758 | 0.745 | 0.760 |
4 | Social Work and Policy | 0.697 | 0.752 | 0.629 | 0.635 |
5 | Computer Science and Informatics | 0.715 | 0.743 | 0.720 | 0.678 |
AI for credible trustworthy question answering (CORE-GPT)
CORE is the world’s largest collection of Open Access papers, collating and enriching content from over 11,000 data providers.
GPT large language models*
*Other large language models are available
@JayAlammar
Introducing CORE-GPT
CORE-GPT Results
Reflections / limitations …
ChatGPT
CORE-GPT
Both
CORE - AI Expert Finder
Evaluation:
Prototype tool to automatically identify domain experts based on publications in >34m research papers
Applications in:
Peer review
Proposal review
Consultant/Expert recruitment
Results
74% of suggested candidates were suitable
34% of suggested candidates were not known to enquirer
The crucial role of repositories in providing machine access to research content.
Principle 1
Repositories should always establish a link from the metadata record to the item the metadata record describes using a dereferenceable identifier pointing to the version held locally in the repository (if applicable). The dereferenceable identifier should be provided in the appropriate metadata element in the used metadata format.
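One way to spot-check Principle 1 is sketched below, assuming the repository exposes Dublin Core (oai_dc) records and places the link in dc:identifier. The sample record, its URLs and the helper name are hypothetical. The check is purely syntactic: it finds identifiers that look like http(s) URLs, but does not confirm that they resolve or that they point to the locally held version.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical oai_dc record; the URL and DOI are placeholders.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An example paper</dc:title>
  <dc:identifier>https://repository.example.org/id/eprint/123/1/paper.pdf</dc:identifier>
  <dc:identifier>doi:10.0000/example</dc:identifier>
</oai_dc:dc>
"""


def dereferenceable_identifiers(record_xml: str) -> list[str]:
    """Return dc:identifier values that are syntactically http(s) URLs."""
    root = ET.fromstring(record_xml.strip())
    found = []
    for element in root.iter(f"{{{DC_NS}}}identifier"):
        value = (element.text or "").strip()
        parsed = urlparse(value)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            found.append(value)
    return found


links = dereferenceable_identifiers(SAMPLE_RECORD)
```

A record for which this list is empty gives a harvester no direct route from the metadata to the item.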
Principle 2
Repositories should provide universal access to machines with the same level of access as humans have. It should be possible for machines to harvest the entire content of the repository in a reasonable time to enable a machine to maintain up-to-date information about the content held in the repository.
Functional OAI-PMH endpoint
Test: don’t take it for granted that it works
Monitor: the fact that it works now doesn’t mean it can’t go wrong when you least expect it
Use an external system to see how your repository is seen from the outside of your organisation.
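The harvesting loop behind these checks can be sketched as follows. This assumes a standard OAI-PMH 2.0 endpoint serving oai_dc; the base URL, the sample response and the function names are hypothetical, and the network fetch itself is left out so that the example stays self-contained. An external monitor would fetch each URL in turn, parse the page, and keep going until no resumption token is returned.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hypothetical first page of a ListRecords response.
SAMPLE_PAGE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:repository.example.org:1</identifier></header></record>
    <record><header><identifier>oai:repository.example.org:2</identifier></header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request; a resumption token replaces other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return f"{base_url}?{urlencode(params)}"


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text.strip())
    error = root.find(f"{OAI_NS}error")
    if error is not None:
        raise RuntimeError(f"OAI-PMH error: {error.get('code')}")
    ids = [el.text for el in root.iter(f"{OAI_NS}identifier")]
    token_el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


ids, token = parse_page(SAMPLE_PAGE)
next_url = list_records_url("https://repository.example.org/oai", token)
```

Running such a loop regularly from outside the organisation, and alerting on errors or on a harvest that never completes, covers both the "test" and the "monitor" points above.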
Robots.txt
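A robots.txt file can silently undercut Principle 2 by blocking the paths where full texts live. The sketch below uses Python's standard urllib.robotparser to test whether a given crawler may fetch a URL; the robots.txt content, the user agent name and the URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks the full-text download path.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi/
Disallow: /download/
"""


def can_harvest(robots_txt, user_agent, url):
    """Return True if `user_agent` is allowed to fetch `url` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


landing_ok = can_harvest(
    ROBOTS_TXT, "ExampleHarvester",
    "https://repository.example.org/id/eprint/123",
)
fulltext_ok = can_harvest(
    ROBOTS_TXT, "ExampleHarvester",
    "https://repository.example.org/download/123/paper.pdf",
)
```

In this made-up case the landing page is reachable but the PDF is not, so a human can read the paper while a well-behaved machine cannot.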
Validate metadata
Validate: don’t take it for granted that it works
Monitor: the fact that it works now doesn’t mean it can’t go wrong when you least expect it
Support Signposting
Helping machines to navigate repositories in order to locate the content.
COAR Next Generation Repositories Working Group
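Signposting exposes typed links (for example rel="item", rel="cite-as" and rel="describedby") that a machine can follow from a landing page, commonly via the HTTP Link header. The sketch below parses such a header and picks out the targets for one relation. The header value is hypothetical, and the parser is deliberately minimal: it only handles quoted parameter values, not the full Web Linking syntax.

```python
import re

# Hypothetical Link header returned by a repository landing page.
LINK_HEADER = (
    '<https://doi.org/10.0000/example>; rel="cite-as", '
    '<https://repository.example.org/123/paper.pdf>; rel="item"; type="application/pdf", '
    '<https://repository.example.org/123/metadata.xml>; rel="describedby"; type="application/xml"'
)

LINK_PATTERN = re.compile(r'<([^>]*)>((?:\s*;\s*[\w-]+="[^"]*")*)')
PARAM_PATTERN = re.compile(r'([\w-]+)="([^"]*)"')


def parse_link_header(header):
    """Split a Link header into dicts holding the target URL and its parameters."""
    links = []
    for target, raw_params in LINK_PATTERN.findall(header):
        params = dict(PARAM_PATTERN.findall(raw_params))
        links.append({"target": target, **params})
    return links


def targets_for(header, rel):
    """Return the target URLs whose rel attribute includes `rel`."""
    return [
        link["target"]
        for link in parse_link_header(header)
        if rel in link.get("rel", "").split()
    ]


content_urls = targets_for(LINK_HEADER, "item")
```

With links like these a harvester no longer has to guess which URL on a landing page is the actual content.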
Why is CORE important?
Increase your content’s discoverability and prevent its misuse
Search, Recommender, Discovery, PMC Linkout
Make your papers uniquely identifiable and resolvable with PIDs
OAI Resolver
Assess and contribute to Open Access compliance and FAIRness
Indexed by CORE badge
Make your content machine readable
Repository Health Check, CORE API, CORE Dataset, CORE FastSync
Become a CORE Member and benefit from lots more
Dashboard: Metadata validation and monitoring
>20M monthly active users
AI/ML for research intelligence and for improving repository workflows
Affiliation extraction
Many metadata records do not have Some text …
Show an example of how affiliations can be extracted. Show Grobid output …
How does this correspond with ROR?
This is a problem we are currently working on
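For orientation, Grobid returns TEI XML in which each author element can carry affiliation elements with orgName and address children. The sketch below reads institutions and countries out of a small hand-written snippet in that style. The snippet and the function name are hypothetical and much simpler than real Grobid output, and mapping the extracted names to ROR identifiers is not attempted here.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

# Hypothetical, heavily simplified snippet in the style of a Grobid TEI header.
SAMPLE_TEI = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><sourceDesc><biblStruct><analytic>
    <author>
      <persName><forename type="first">Jane</forename><surname>Doe</surname></persName>
      <affiliation key="aff0">
        <orgName type="department">Example Department</orgName>
        <orgName type="institution">Example University</orgName>
        <address><country key="GB">United Kingdom</country></address>
      </affiliation>
    </author>
  </analytic></biblStruct></sourceDesc></fileDesc></teiHeader>
</TEI>
"""


def extract_affiliations(tei_xml):
    """Return one dict per (author, affiliation) pair found in the TEI."""
    root = ET.fromstring(tei_xml.strip())
    results = []
    for author in root.iter(f"{TEI}author"):
        surname = author.findtext(f".//{TEI}surname", default="")
        for aff in author.iter(f"{TEI}affiliation"):
            institutions = [
                org.text
                for org in aff.iter(f"{TEI}orgName")
                if org.get("type") == "institution"
            ]
            country = aff.findtext(f".//{TEI}country", default="")
            results.append(
                {"surname": surname, "institutions": institutions, "country": country}
            )
    return results


affiliations = extract_affiliations(SAMPLE_TEI)
```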
2. Publication footprint
Affiliation extraction
Deduplication
What do duplicates look like and why do they occur in repositories?
Deduplication
Comparison mode
List of possible duplicates
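A list of possible duplicates can be produced in many ways. The sketch below uses plain Jaccard similarity over word shingles of the titles, which is a simplified stand-in and not the method used in CORE (the Gyawali et al. reference listed under "More reading" describes locality sensitive hashing with word embeddings). The record identifiers, the shingle size and the threshold are assumptions chosen for illustration.

```python
import re
from itertools import combinations


def shingles(text, size=3):
    """Return the set of lowercase word n-grams of a title."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def possible_duplicates(records, threshold=0.6):
    """Return (id, id, score) for every pair of titles at or above the threshold."""
    sets = {rid: shingles(title) for rid, title in records.items()}
    pairs = []
    for left, right in combinations(sorted(sets), 2):
        score = jaccard(sets[left], sets[right])
        if score >= threshold:
            pairs.append((left, right, round(score, 2)))
    return pairs


# Hypothetical record identifiers; titles differ only in case and punctuation.
RECORDS = {
    "repoA:1": "Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings",
    "repoB:7": "Deduplication of scholarly documents using locality sensitive hashing and word embeddings.",
    "repoA:2": "Do Authors Deposit on Time? Tracking Open Access Policy Compliance",
}
candidates = possible_duplicates(RECORDS)
```

Comparing every pair does not scale to millions of records, which is the usual motivation for hashing-based approaches.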
Data enrichment
Here you can see if there is an earlier version of an article in another repository …
…and can download a spreadsheet showing deposit dates from multiple repositories
You can enrich data with DOIs identified in other repositories
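DOI enrichment of this kind can be sketched as a lookup across repositories. The example below copies a DOI onto a local record when another repository holds a record with the same normalised title. Exact title matching, the field names and the doi_source marker are assumptions for this sketch; a production system would need more robust matching.

```python
import re


def title_key(title):
    """Normalise a title to lowercase alphanumeric words for matching."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))


def enrich_with_dois(local_records, other_records):
    """Return copies of local records, adding DOIs found on matching titles elsewhere."""
    doi_by_title = {}
    for rec in other_records:
        if rec.get("doi"):
            doi_by_title.setdefault(title_key(rec["title"]), rec["doi"])
    enriched = []
    for rec in local_records:
        copy = dict(rec)
        if not copy.get("doi"):
            match = doi_by_title.get(title_key(copy["title"]))
            if match:
                copy["doi"] = match
                copy["doi_source"] = "other-repository"  # hypothetical provenance marker
        enriched.append(copy)
    return enriched


# The first title and DOI come from the reference list; the records are made up.
local = [
    {"title": "An Authoritative Approach to Citation Classification"},
    {"title": "A record with no match elsewhere"},
]
other = [
    {"title": "An authoritative approach to citation classification.",
     "doi": "10.1145/3383583.3398617"},
]
enriched = enrich_with_dois(local, other)
```

Recording where an added DOI came from keeps the enrichment auditable and easy to undo.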
Document classification
CORE moving to a membership model
CORE will become an independent open scholarly infrastructure
CORE will no longer receive direct funding from Jisc
August 2023
CORE will be operated by The Open University
Membership
Sponsorship
(data providers)
CORE Membership
Three levels of CORE Membership
Starting
Supporting
Sustaining
FREE
Next Generation Repositories: Behaviours
More reading: references
Knoth, P. (2013). From open access metadata to open access content: two principles for increased visibility of open access content. In Open Repositories 2013. Retrieved from http://oro.open.ac.uk/37824/
Pride, D., & Knoth, P. (2020). An Authoritative Approach to Citation Classification. Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020. doi:10.1145/3383583.3398617
Kunnath, Suchetha N.; Pride, David; Gyawali, Bikash and Knoth, Petr (2020). Overview of the 2020 WOSP 3C Citation Context Classification Task. In: Proceedings of the 8th International Workshop on Mining Scientific Publications, Association for Computational Linguistics pp. 75–83.
Kunnath, Suchetha N.; Herrmannova, Drahomira; Pride, David; Knoth, Petr (2022). A Meta-analysis of Semantic Classification of Citations. Quantitative Science Studies, 2 (4), pp. 1170-1215
Kusa, Wojciech; Hanbury, Allan; Knoth, Petr (2022). Automation of Citation Screening for Systematic Literature Reviews using Neural Networks: A Replicability Study. In: 44th European Conference on Information Retrieval, 10-14 Apr 2022, Stavanger, Norway. Springer, 13185, pp. 584-598
Nambanoor Kunnath, Suchetha; Pride, David; Knoth, Petr (2022). Dynamic Context Extraction for Citation Classification. In: The 2nd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 12th International Joint Conference on Natural Language Processing, 20-23 Nov 2022, Virtual
Gyawali, Bikash; Anastasiou, Lucas; Knoth, Petr (2020). Deduplication of Scholarly Documents using Locality Sensitive Hashing and Word Embeddings. In: 12th Language Resources and Evaluation Conference, 11-16 May 2020, Marseille, France. European Language Resources Association, pp. 894-903
Óscar E. Mendoza, Wojciech Kusa, Alaa El-Ebshihy, Ronin Wu, David Pride, Petr Knoth, Drahomira Herrmannova, Florina Piroi, Gabriella Pasi, and Allan Hanbury. 2022. Benchmark for Research Theme Classification of Scholarly Documents. In Proceedings of the Third Workshop on Scholarly Document Processing, pages 253–262, Gyeongju, Republic of Korea. Association for Computational Linguistics.
Pride, David; Harag, Jozef; Knoth, Petr (2019). ACT: An Annotation Platform for Citation Typing at Scale. In: JCDL 2019 - ACM/IEEE-CS Joint Conference on Digital Libraries 2019, 2-6 Jun 2019, Urbana-Champaign, Illinois
Herrmannova, Drahomira; Pontika, Nancy; Knoth, Petr (2019). Do Authors Deposit on Time? Tracking Open Access Policy Compliance. In: 2019 ACM/IEEE Joint Conference on Digital Libraries, 2-6 Jun 2019, Urbana-Champaign, IL, pp. 206-216. Best Paper Award.
Take home …
THANK YOU