1 of 48

Annif

Feeding your subject indexing robot with bibliographic metadata

Osma Suominen

LIBER 47th Annual Conference, Lille, France, 6th July 2018

2 of 48

3 of 48

4 of 48

5 of 48

6 of 48

YSA YSO

Allärs KOKO

€£$

7 of 48

Subject indexing is a hard problem

for humans:

  • Subjectivity: when two people index the same document, only ~⅓ of the subjects are the same
  • Many concepts: tens of thousands of concepts to pick from
  • Vocabulary changes: new concepts are added, existing ones are renamed and redefined

for machines:

  • Long tail phenomenon: even with large amounts of training data, most subjects are only used a small number of times
  • Many concepts: requires complex models that are computationally intensive
  • Difficult to evaluate: hard to tell “somewhat bad” answers from really wrong ones without human evaluation
  • Vocabulary changes: models must be retrained

long tail

8 of 48

Approach

Automating our own processes vs. creating generic tools for many contexts

9 of 48

Enter Annif

Feed your subject indexing robot with bibliographic metadata!

10 of 48

Machine learning requires training data

Bibliographic metadata (titles + subjects)

Fulltext docs

11 of 48

12 of 48

Hot tub by a lake

Andrei Niemimäki

CC BY-SA

Metadata about 13M documents, many of them tagged with subjects!

13 of 48

14 of 48

15 of 48

Finna API

All Finna metadata is !

16 of 48

~30 000 concepts that can be used for subject indexing

17 of 48

Annif prototype (2017)

18 of 48

19 of 48

Finna API subject searches:

  • renewable natural resources type=Subject
  • "renewable natural resources" type=Subject
  • topic_facet:"renewable natural resources"
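
A rough sketch of how the three query styles above could be turned into request URLs. The endpoint `https://api.finna.fi/v1/search` and the parameter names `lookfor`, `type` and `filter[]` are assumptions based on the public Finna API rather than anything shown on the slide, and sending the third query as a filter is likewise a guess.

```python
from urllib.parse import urlencode

# Assumed endpoint of the public Finna search API (not stated on the slide).
FINNA_SEARCH = "https://api.finna.fi/v1/search"


def subject_search_urls(subject: str) -> list[str]:
    """Build one URL per query style for a single subject label."""
    queries = [
        # 1. unquoted words, searched as type=Subject
        [("lookfor", subject), ("type", "Subject")],
        # 2. quoted phrase, searched as type=Subject
        [("lookfor", f'"{subject}"'), ("type", "Subject")],
        # 3. facet query (assumed here to be passed in filter[])
        [("filter[]", f'topic_facet:"{subject}"')],
    ]
    return [f"{FINNA_SEARCH}?{urlencode(q)}" for q in queries]


urls = subject_search_urls("renewable natural resources")
for url in urls:
    print(url)
```

The URLs are only built, not fetched, so the sketch runs without network access.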

20 of 48

Renewable energy in power systems

Luonnonvaratilinpito. Puuainestilinpito

Local politics of renewable energy : Project planning, siting conflicts and citizen participation

Sustainable biotechnology : sources of renewable energy

Native people and renewable resource management : the 1986 symposium of the Alberta Society of Professional Biologists co-sponsored by Alberta Native Affairs and Indian and Northern Affairs Canada

Tiivistelmä: Pienimuotoisten biokaasulaitosten ympäristövaikutus : paikallinen ja globaali näkökulma.

Renewable hydrogen and fuel cells in vehicles

Community action plan for renewable energies : summary = Aktionsplan der Gemeinschaft für Erneuerbare Energie : zusammenfassung = Plan d'action communautaire dans le domaine des energies renouvelables : resume

Environmental impact of household biogas plants in India : local and global perspective

Renewable natural resources : a management handbook for the 1980s

Bioenergy 2009 : 31.8.-4.9.2009 : book of proceedings part 2

The existence of steady states in growth models with renewable resources and pollution

Redox reactions and water quality in cultivated boreal acid sulphate soils in relation to water management

Tiivistelmä: Energiantuotanto ja päästöt.

Biotechnology and renewable energy

Perspectives of renewable energy resources utilization (regional aspects) : proceedings of the third international seminar, 11.-13. September 1995, Petrozavodsk, Russia

Renewable energy sources statistics in the European Union : 1989-1997

Traditional knowledge and renewable resource management in northern regions

Mechanical, microstructural and barrier properties of agricultural biopolymer films and foams : a literature review

Environmental assessment of green chemicals : LCA of bio-based chemicals produced using biocatalysis

21 of 48

22 of 48

Indexing Wikipedia by topics

Finnish Wikipedia has 410 000 articles (620 MB as raw text)

Automated subject indexing took 7 hours on a laptop

1-3 topics per article (average ~2)

23 of 48

Examples: (random sample)

Wikipedia article            YSO topics
Ahvenuslammi (Urjala)        shores
Brasilian Grand Prix 2016    race drivers, formula racing, karting
Guy Topelius                 folk poetry researcher, saccharin
HMS Laforey                  warships
Liigacup                     football, football players
Pää Kii                      ensembles (groups), pop music
RT-21M Pioneer               missiles
Runoja                       pop music, recording (music recordings), compositions (music)
Sjur Røthe                   skiers, skiing, Nordic combined
Veikko Lavi                  lyricists, comic songs

24 of 48

Most common topics in Finnish Wikipedia

25 of 48

Image credits:

Petteri Lehtonen [CC BY-SA 3.0]

Hockeybroad/Cheryl Adams [CC BY-SA 3.0]

Tomisti [CC BY-SA 3.0]

Tuomas Vitikainen [CC BY-SA 3.0]

26 of 48

People vs. Robots Workshop

20 documents

40 librarians

45 minutes

...

225 indexing results

  • 11 per document
  • 5.5 per person

27 of 48

Average similarity of subject sets

33.39 %

Using Rolling similarity, a.k.a. F1 score, to compare subject sets
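
A minimal sketch of the measure named above: for two subject sets A and B, the similarity is 2·|A ∩ B| / (|A| + |B|), which equals the F1 score when one set is treated as the reference. The two example sets are made up for illustration and are not workshop data.

```python
def rolling_similarity(a: set[str], b: set[str]) -> float:
    """2 * |A ∩ B| / (|A| + |B|); 1.0 means the two sets are identical."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty sets agree fully
    return 2 * len(a & b) / (len(a) + len(b))


# Hypothetical subject sets assigned by two indexers to the same document.
indexer_1 = {"strawberry", "berry cultivation", "farms"}
indexer_2 = {"strawberry", "local food", "farms", "customers"}

similarity = rolling_similarity(indexer_1, indexer_2)
print(round(similarity, 3))  # 2 shared subjects out of 3 + 4 -> 4/7 ≈ 0.571
```

Averaging this value over all pairs of indexers for a document is one way to arrive at a single figure like the one on the slide; how the workshop results were aggregated is not stated here.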

28 of 48

Similarity of indexing results by indexer (larger is better)

29 of 48

Similarity of indexing results (larger is better)

[Chart; labels recovered from the slide: Digitized books, Environment Institute publications, Doctoral dissertations, Serials, Non-fiction books; Librarians, Annif, Fennica]

30 of 48

Annif prototype vs. new Annif

                      Prototype (2017)              New Annif (2018→)
architecture          loose collection of scripts   Flask web application
coding style          quick and dirty               solid software engineering
backends              Elasticsearch index           TF-IDF, fastText, Maui ...
language support      Finnish, Swedish, English     any language supported by NLTK
vocabulary support    YSO, GACS ...                 YSO, YKL, others coming
REST API              minimal                       extended (e.g. list projects)
user interface        web form for testing          mobile app, HTML/CSS/JS based (native Android app?)
open source license   CC0                           Apache License 2.0

31 of 48

New Annif Architecture

[Diagram; labels recovered from the slide:]

  • Annif: Flask/Connexion web app (REST API, TF-IDF model, fastText model, HTTP backend)
  • MauiService: microservice around Maui (REST API)
  • Training data: Finna.fi metadata, fulltext docs
  • Mobile app, OCR
  • Any metadata / document management system
  • More backends can be added in future, e.g. neural network, fastXML, StarSpace

32 of 48

Backends / Algorithms

  • TF-IDF similarity
    Baseline bag-of-words similarity measure. Implemented with the Gensim library.

  • fastText by Facebook Research
    Machine learning algorithm for text classification. Uses word embeddings (similar to word2vec) and resembles a neural network architecture. Promises to be good for e.g. library classifications (DDC, UDC, YKL…)

  • HTTP backend for accessing MauiService REST API
    MauiService is a microservice wrapper around the Maui automated indexing tool. Based on traditional Natural Language Processing techniques - finds terms within text.
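
To make the TF-IDF baseline concrete, here is a small pure-Python sketch of bag-of-words similarity between a new text and per-subject training text. It is not Annif's Gensim-based implementation; the tokenizer, the IDF formula and the toy training texts are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def build_index(subject_texts):
    """Return (idf, vectors) for a {subject: training text} mapping."""
    counts = {s: Counter(tokenize(t)) for s, t in subject_texts.items()}
    n = len(counts)
    # Document frequency: in how many subjects does each term occur?
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n / d) + 1.0 for term, d in df.items()}
    vectors = {s: {t: tf * idf[t] for t, tf in c.items()} for s, c in counts.items()}
    return idf, vectors


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def suggest(text, idf, vectors, limit=3):
    """Rank subjects by cosine similarity to the text; drop zero scores."""
    query = {t: tf * idf[t] for t, tf in Counter(tokenize(text)).items() if t in idf}
    scored = sorted(((cosine(query, vec), s) for s, vec in vectors.items()), reverse=True)
    return [(s, score) for score, s in scored[:limit] if score > 0]


# Toy training data: text previously indexed with each subject (made up).
subject_texts = {
    "strawberry": "strawberry farm strawberry picking berries",
    "football": "football players league match",
    "warships": "navy warships destroyer fleet",
}
idf, vectors = build_index(subject_texts)
result = suggest("pick your own strawberry at the farm", idf, vectors)
print(result)
```

Only subjects that share at least one term with the input get a non-zero score, which is why a bag-of-words baseline struggles with subjects that are rarely named literally in the text.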

33 of 48

Ideal subject indexing / classification algorithm?

34 of 48

35 of 48

36 of 48

Backend configuration

Backends may be used alone, or in combinations (ensembles)
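
The slide does not say how results are combined, so the following is only one plausible sketch: a weighted average of per-subject scores across backends. The function name `combine`, the backend names and the scores are hypothetical.

```python
def combine(backend_scores, weights=None):
    """Weighted average of per-subject scores from several backends.

    backend_scores: {backend_name: {subject: score}}
    A subject missing from a backend's output counts as score 0 for it.
    """
    weights = weights or {name: 1.0 for name in backend_scores}
    total = sum(weights[name] for name in backend_scores)
    combined = {}
    for name, scores in backend_scores.items():
        for subject, score in scores.items():
            combined[subject] = combined.get(subject, 0.0) + weights[name] * score / total
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Made-up scores from three backends for one document.
scores = {
    "tfidf": {"strawberry": 0.4, "farms": 0.2},
    "fasttext": {"strawberry": 0.6, "customers": 0.3},
    "maui": {"strawberry": 0.8, "farms": 0.5},
}
ranked = combine(scores)
print(ranked)
```

Subjects proposed by several backends rise to the top, while a suggestion from a single backend is damped.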

37 of 48

REST API

Defined using a Swagger / OpenAPI specification

Main operations:

GET /projects/                        list available projects
GET /projects/<project_id>            show information about a project
POST /projects/<project_id>/analyze   analyze text and return subjects
POST /projects/<project_id>/explain   analyze text and return subjects, with explanations indicating why they were chosen
POST /projects/<project_id>/train     train the model by giving a document and gold standard subjects
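
A hedged sketch of calling the analyze operation from Python's standard library. The base URL, the form field name `text` and the content type are assumptions for illustration; the OpenAPI specification of a running instance is the authority on the actual parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical base URL; replace with the address of a running Annif instance.
API_BASE = "http://localhost:5000/v1"


def build_analyze_request(project_id: str, text: str) -> Request:
    """Prepare (but do not send) a POST to /projects/<project_id>/analyze."""
    # The form field name "text" is an assumption, not taken from the slide.
    body = urlencode({"text": text}).encode("utf-8")
    return Request(
        f"{API_BASE}/projects/{project_id}/analyze",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


req = build_analyze_request("tfidf-en", "Pick-your-own strawberries in Finland")
print(req.get_method(), req.full_url)
# To send it against a live server: urllib.request.urlopen(req)
```

Building the request separately from sending it keeps the sketch runnable without a server.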

38 of 48

Command line interface

Analyzing a document:

$ cat berries.txt
Rising interest in local food has boosted the popularity of pick-your-own berries in Finland – and the best time for picking is now. Mornings are quiet at the Raijan Aitta strawberry farm in Mikkeli, eastern Finland. In fields in the distance, Ukrainian workers pick strawberries for market sales. In those closer to the road are the self-pickers. This morning entrepreneur Katariina Turman sent out a text message to her regular customers, letting them know that the best time for pick-your-own (PYO) strawberries is at hand. Farms have been inviting customers to pick their own strawberries since the 1990s, when farmers began having difficulty recruiting enough employees. Then, as Finland recovered from a severe recession, many pickers were purely motivated by a chance to save money.

$ annif analyze tfidf-en <berries.txt
<http://www.yso.fi/onto/yso/p772> strawberry 0.39644203283656165
<http://www.yso.fi/onto/yso/p18109> wild strawberry 0.37539359094384245
<http://www.yso.fi/onto/yso/p25548> stolons 0.3261554545369906
<http://www.yso.fi/onto/yso/p6749> berry cultivation 0.2394291077460799
<http://www.yso.fi/onto/yso/p10631> questionnaire survey 0.22714475653823335
<http://www.yso.fi/onto/yso/p6821> farms 0.21725243067995587
<http://www.yso.fi/onto/yso/p3294> customers 0.216395821347059
<http://www.yso.fi/onto/yso/p1834> work motivation 0.21612376226244975
<http://www.yso.fi/onto/yso/p8531> customership 0.21536113638508098
<http://www.yso.fi/onto/yso/p19047> corporate clients 0.21412270159920782
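
If the output is consumed by another script, lines of the shape shown above (URI in angle brackets, label, score) can be parsed as follows. This is a sketch that assumes whitespace-separated fields exactly as printed on the slide; the helper name `parse_suggestions` is made up.

```python
import re

# <URI>, then a label that may contain spaces, then a decimal score.
LINE = re.compile(r"^<(?P<uri>[^>]+)>\s+(?P<label>.+?)\s+(?P<score>\d+\.\d+)$")


def parse_suggestions(output: str):
    """Return a list of (uri, label, score) tuples, skipping other lines."""
    results = []
    for line in output.splitlines():
        match = LINE.match(line.strip())
        if match:
            results.append((match["uri"], match["label"], float(match["score"])))
    return results


sample = """\
<http://www.yso.fi/onto/yso/p772> strawberry 0.39644203283656165
<http://www.yso.fi/onto/yso/p18109> wild strawberry 0.37539359094384245
"""
parsed = parse_suggestions(sample)
for uri, label, score in parsed:
    print(f"{score:.2f}  {label}  {uri}")
```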

39 of 48

Calculating statistical measures

$ annif evaldir tfidf-fi tests/corpora/archaeology/fulltext/
Precision: 0.17142857142857143
Recall: 0.3664965986394558
F-measure: 0.23185107376283848
NDCG@5: 0.3426718725322724
NDCG@10: 0.36769238316041325
Precision@1: 0.42857142857142855
Precision@3: 0.3571428571428571
Precision@5: 0.2857142857142857
True positives: 48
False positives: 232
False negatives: 85
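
The rank-aware measures can be sketched for a single document as below. The reported precision matches the counts (48 / (48 + 232) ≈ 0.171), whereas recall computed from the counts would be 48 / (48 + 85) ≈ 0.361, so some figures are presumably averaged per document; that reading is an inference, and the exact definitions Annif uses are not given on the slide. The example ranking and gold standard are made up.

```python
import math


def precision_at_k(ranked, gold, k):
    """Share of the top-k suggestions found in the gold standard.

    Divides by k even if fewer than k suggestions exist (one common convention).
    """
    return sum(1 for s in ranked[:k] if s in gold) / k


def ndcg_at_k(ranked, gold, k):
    """Binary-relevance NDCG: hits near the top of the ranking count more."""
    dcg = sum(1 / math.log2(i + 2) for i, s in enumerate(ranked[:k]) if s in gold)
    ideal = sum(1 / math.log2(i + 2) for i in range(min(k, len(gold))))
    return dcg / ideal if ideal else 0.0


# Hypothetical ranked suggestions and gold standard for one document.
ranked = ["strawberry", "wild strawberry", "stolons", "berry cultivation", "farms"]
gold = {"strawberry", "berry cultivation", "local food"}

for k in (1, 3, 5):
    print(f"Precision@{k}: {precision_at_k(ranked, gold, k):.3f}")
print(f"NDCG@5: {ndcg_at_k(ranked, gold, 5):.3f}")
```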

40 of 48

Mobile app

Prototype web app

ocr.space cloud OCR

A native app (Android / iOS …) could do OCR on the device.

This would enable an AR (augmented reality) mode,

where the app would “reveal” concepts when pointing the camera at text documents, book covers etc.

Watch the video for the prototype:

41 of 48

Test corpora

Full text documents indexed with YSA/YSO for training and evaluation

  • Articles from Arto database (n=6287)
    Both scientific research papers and less formal publications. Many disciplines.

  • Master’s and Doctoral theses from Jyväskylä University (n=7400)
    Long, in-depth scientific documents. Many disciplines.

  • Question/Answer pairs from an Ask a Librarian service (n=3150)
    Short, informal questions and answers about many different topics.

Available on GitHub: https://github.com/NatLibFi/Annif-corpora (for the first two corpora, only links to PDFs are provided for copyright reasons)

42 of 48

Evaluation of different backends

F-measure similarity scores against a gold standard

Observations:

  1. When using just one backend, Maui often gives the best results
  2. Combinations (ensembles) usually give at least as good results as single backends
  3. The combination of all three backends gives the best results

43 of 48

Different algorithms, different weaknesses

Receiver Operating Characteristic (ROC) Area Under Curve (AUC) scores for different YSO concepts used to index Jyväskylä University theses, by algorithm

[Chart of the top 200 most frequent concepts; score bands labelled good, questionable, worthless]
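
For one concept, the AUC can be read as the probability that a document indexed with the concept gets a higher score than a document without it: 1.0 is a perfect ranking and 0.5 is no better than chance. A minimal pairwise sketch, with made-up scores and labels:

```python
def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. Compares every positive/negative pair,
    which is fine for a sketch but quadratic in the number of documents.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores one algorithm gave four documents for a single concept;
# label 1 means the gold standard includes the concept for that document.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
auc = roc_auc(scores, labels)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly
```

Computing this per concept and per algorithm is what lets the chart show that different algorithms are weak on different concepts.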

44 of 48

Annif on GitHub

Python 3.5+ code base

Apache License 2.0

Fully unit tested (98% coverage)

PEP8 style guide compliant

Usage documentation in the wiki

https://github.com/NatLibFi/Annif

45 of 48

Apply Annif on your own data!

  1. Choose an indexing vocabulary
  2. Prepare a corpus from your existing metadata
  3. Load the corpus into Annif
  4. Use it to index new documents
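
Preparing a corpus from existing metadata might look like the sketch below, which turns (title, subject URIs) records into one tab-separated line each. The line layout is an assumption for illustration only; the usage documentation in the Annif wiki describes the corpus formats Annif actually accepts. The record contents and URIs are placeholders.

```python
def corpus_lines(records):
    """Yield 'title<TAB><uri1> <uri2> ...' for each record that has subjects.

    The layout is assumed for this sketch, not taken from the slides.
    """
    for title, uris in records:
        if not uris:
            continue  # records without subjects cannot serve as training data
        clean = " ".join(title.split())  # collapse tabs and newlines into spaces
        yield clean + "\t" + " ".join(f"<{u}>" for u in uris)


# Placeholder records standing in for rows exported from a metadata system.
records = [
    ("Renewable energy in power systems", ["http://example.org/subject/1"]),
    ("Biotechnology and\trenewable energy", ["http://example.org/subject/1",
                                             "http://example.org/subject/2"]),
    ("Record without subjects", []),
]
lines = list(corpus_lines(records))
print("\n".join(lines))
```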

46 of 48

Next steps

  1. Improved combination of results from multiple algorithms
  2. Testing on different vocabularies, including classification with DDC-based YKL
  3. Training on full text documents to further improve results
  4. Further human evaluation in an indexing quality workshop

47 of 48

Lessons learned (so far)

  • Good quality training data is key for training and evaluation
    Don’t expect good results if you don’t have the data it takes

  • Gold standard subjects are useful, but human evaluation is necessary
    Subject indexing is inherently subjective; comparing to a single gold standard can be misleading

  • All algorithms have strong and weak points
    Combinations work better than any algorithm by itself

  • Surprising amount of interest also from non-library organizations
    Archives, media organizations, book distributors … automation is better done together!

48 of 48

Thank you!

Questions?