Annif
Feeding your subject indexing robot with bibliographic metadata
Osma Suominen
LIBER 47th Annual Conference, Lille, France, 6th July 2018
YSA YSO
Allärs KOKO
€£$
Subject indexing is a hard problem
for humans:
for machines:
long tail
Approach
Automating our own processes
Creating generic tools for many contexts
vs.
Enter Annif
Feed your subject indexing robot with bibliographic metadata!
Machine learning requires training data
Bibliographic metadata (titles + subjects)
Fulltext docs
Metadata about 13M documents,
many of them tagged with subjects!
Finna API
All Finna metadata is CC0 licensed!
~30 000 concepts that can be used for subject indexing
Annif prototype (2017)
Finna API subject searches:
Renewable energy in power systems
Luonnonvaratilinpito. Puuainestilinpito
Local politics of renewable energy : Project planning, siting conflicts and citizen participation
Sustainable biotechnology : sources of renewable energy
Native people and renewable resource management : the 1986 symposium of the Alberta Society of Professional Biologists co-sponsored by Alberta Native Affairs and Indian and Northern Affairs Canada
Tiivistelmä: Pienimuotoisten biokaasulaitosten ympäristövaikutus : paikallinen ja globaali näkökulma.
Renewable hydrogen and fuel cells in vehicles
Community action plan for renewable energies : summary = Aktionsplan der Gemeinschaft für Erneuerbare Energie : zusammenfassung = Plan d'action communautaire dans le domaine des energies renouvelables : resume
Environmental impact of household biogas plants in India : local and global perspective
Renewable natural resources : a management handbook for the 1980s
Bioenergy 2009 : 31.8.-4.9.2009 : book of proceedings part 2
The existence of steady states in growth models with renewable resources and pollution
Redox reactions and water quality in cultivated boreal acid sulphate soils in relation to water management
Tiivistelmä: Energiantuotanto ja päästöt.
Biotechnology and renewable energy
Perspectives of renewable energy resources utilization (regional aspects) : proceedings of the third international seminar, 11.-13. September 1995, Petrozavodsk, Russia
Renewable energy sources statistics in the European Union : 1989-1997
Traditional knowledge and renewable resource management in northern regions
Mechanical, microstructural and barrier properties of agricultural biopolymer films and foams : a literature review
Environmental assessment of green chemicals : LCA of bio-based chemicals produced using biocatalysis
Indexing Wikipedia by topics
Finnish Wikipedia has 410 000 articles (620 MB as raw text)
Automated subject indexing took 7 hours on a laptop
1-3 topics per article (average ~2)
Examples: (random sample)
Wikipedia article | YSO topics |
Ahvenuslammi (Urjala) | shores |
Brasilian Grand Prix 2016 | race drivers, formula racing, karting |
Guy Topelius | folk poetry researcher, saccharin |
HMS Laforey | warships |
Liigacup | football, football players |
Pää Kii | ensembles (groups), pop music |
RT-21M Pioneer | missiles |
Runoja | pop music, recording (music recordings), compositions (music) |
Sjur Røthe | skiers, skiing, Nordic combined |
Veikko Lavi | lyricists, comic songs |
Most common topics in Finnish Wikipedia
Image credits:
Petteri Lehtonen [CC BY-SA 3.0]
Hockeybroad/Cheryl Adams [CC BY-SA 3.0]
Tomisti [CC BY-SA 3.0]
Tuomas Vitikainen [CC BY-SA 3.0]
People vs. Robots Workshop
20 documents
40 librarians
45 minutes
...
225 indexing results
Average similarity of subject sets
33.39 %
Using Rolling similarity, a.k.a. F1 score, to compare subject sets
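As a minimal sketch of this measure, assume each indexing result is treated as a plain set of subject labels. The similarity of two sets A and B is then 2|A ∩ B| / (|A| + |B|). The function name and the sample subjects below are illustrative only and are not taken from the workshop data.

```python
def subject_set_similarity(a, b):
    """Return 2 * |A & B| / (|A| + |B|) for two collections of subjects."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        # Convention chosen for this sketch: two empty sets count as identical.
        return 1.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


# Hypothetical indexing results for one document.
librarian = {"strawberry", "berry cultivation", "farms"}
robot = {"strawberry", "farms", "customers", "stolons"}

# Two shared subjects out of 3 + 4 assigned subjects in total.
print(subject_set_similarity(librarian, robot))
```

Averaging this value over all pairs of indexers for a document is one way to arrive at a single percentage like the one above; whether the workshop used exactly that pairing is not stated on the slide.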
Similarity of indexing results by indexer (larger is better)
Digitized books
Environment Institute publications
Doctoral dissertations
Serials
Non-fiction books
Similarity of indexing results (larger is better)
Librarians
Annif
Fennica
Annif prototype vs. new Annif
| Prototype (2017) | New Annif (2018→) |
architecture | loose collection of scripts | Flask web application |
coding style | quick and dirty | solid software engineering |
backends | Elasticsearch index | TF-IDF, fastText, Maui ... |
language support | Finnish, Swedish, English | any language supported by NLTK |
vocabulary support | YSO, GACS ... | YSO, YKL, others coming |
REST API | minimal | extended (e.g. list projects) |
user interface | web form for testing | |
mobile app | HTML/CSS/JS based | (native Android app?) |
open source license | CC0 | Apache License 2.0 |
New Annif Architecture
(Architecture diagram; components shown:)
Annif: Flask/Connexion web app with REST API
Backends: TF-IDF model, fastText model, HTTP backend
MauiService: microservice around Maui, with REST API
Training data sources: Finna.fi metadata, fulltext docs
Connected systems: mobile app, OCR, any metadata / document management system
More backends can be added in future, e.g. neural network, fastXML, StarSpace
Backends / Algorithms
Ideal subject indexing / classification algorithm?
Backend configuration
Backends may be used alone, or in combinations (ensembles)
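One simple way to combine backends is a weighted average of their per-subject scores. The sketch below assumes each backend returns a mapping from subject URI to a score between 0 and 1; the function name, the weights and the sample scores are hypothetical, and this is not necessarily how Annif itself merges results.

```python
from collections import defaultdict


def combine_scores(backend_results, weights=None):
    """Weighted average of per-subject scores from several backends.

    backend_results: list of dicts mapping subject URI -> score.
    A subject missing from one backend's results contributes 0 for it.
    Returns (subject, score) pairs sorted by descending score.
    """
    if weights is None:
        weights = [1.0] * len(backend_results)
    total_weight = sum(weights)
    combined = defaultdict(float)
    for result, weight in zip(backend_results, weights):
        for subject, score in result.items():
            combined[subject] += weight * score / total_weight
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from two backends for the same document.
tfidf = {"http://www.yso.fi/onto/yso/p772": 0.4,
         "http://www.yso.fi/onto/yso/p6821": 0.2}
fasttext = {"http://www.yso.fi/onto/yso/p772": 0.6}

for subject, score in combine_scores([tfidf, fasttext]):
    print(subject, score)
```

Subjects suggested by several backends rise to the top, while a subject proposed by only one backend is pulled towards zero.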
REST API
GET /projects/ | list available projects |
GET /projects/<project_id> | show information about a project |
POST /projects/<project_id>/analyze | analyze text and return subjects |
POST /projects/<project_id>/explain | analyze text and return subjects, with explanations indicating why they were chosen |
POST /projects/<project_id>/train | train the model by giving a document and gold standard subjects |
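A client for the analyze method could look roughly like the sketch below, which uses only the Python standard library. The base URL, the form field name `text` and the JSON response are assumptions made for illustration; only the path `/projects/<project_id>/analyze` and the POST method come from the table above.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of a locally running Annif instance; adjust as needed.
BASE_URL = "http://localhost:5000/v0"


def build_analyze_request(project_id, text, base_url=BASE_URL):
    """Build a POST request for /projects/<project_id>/analyze."""
    url = "{}/projects/{}/analyze".format(
        base_url, urllib.parse.quote(project_id, safe=""))
    # Assumption: the text is sent as a form field named "text".
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def analyze(project_id, text, base_url=BASE_URL):
    """Send the request and decode the response (assumed to be JSON)."""
    request = build_analyze_request(project_id, text, base_url)
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


# Example call (requires a running server):
#   subjects = analyze("tfidf-en", "Pick-your-own strawberries in Finland")
```

Keeping request construction separate from sending makes the client easy to check without a running server.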
Command line interface
Analyzing a document:
$ cat berries.txt
Rising interest in local food has boosted the popularity of pick-your-own berries in Finland – and the best time for picking is now. Mornings are quiet at the Raijan Aitta strawberry farm in Mikkeli, eastern Finland. In fields in the distance, Ukrainian workers pick strawberries for market sales. In those closer to the road are the self-pickers. This morning
entrepreneur Katariina Turman sent out a text message to her regular customers, letting them know that the best time for pick-your-own (PYO) strawberries is at hand. Farms have been inviting customers to pick their own strawberries since the 1990s, when farmers began having difficulty recruiting enough employees. Then, as Finland recovered from a severe recession, many pickers were purely motivated by a chance to save money.
$ annif analyze tfidf-en <berries.txt
<http://www.yso.fi/onto/yso/p772> strawberry 0.39644203283656165
<http://www.yso.fi/onto/yso/p18109> wild strawberry 0.37539359094384245
<http://www.yso.fi/onto/yso/p25548> stolons 0.3261554545369906
<http://www.yso.fi/onto/yso/p6749> berry cultivation 0.2394291077460799
<http://www.yso.fi/onto/yso/p10631> questionnaire survey 0.22714475653823335
<http://www.yso.fi/onto/yso/p6821> farms 0.21725243067995587
<http://www.yso.fi/onto/yso/p3294> customers 0.216395821347059
<http://www.yso.fi/onto/yso/p1834> work motivation 0.21612376226244975
<http://www.yso.fi/onto/yso/p8531> customership 0.21536113638508098
<http://www.yso.fi/onto/yso/p19047> corporate clients 0.21412270159920782
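Output in this shape is easy to post-process. The following sketch parses lines of the form `<URI> label score` into tuples; it assumes whitespace-separated fields exactly as printed above, and the function name is made up for this example.

```python
import re

# One result per line: <URI>, a label that may contain spaces, then a score.
LINE_PATTERN = re.compile(
    r"^<(?P<uri>[^>]+)>\s+(?P<label>.+?)\s+"
    r"(?P<score>\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)$"
)


def parse_analyze_output(output):
    """Return a list of (uri, label, score) tuples; skip non-matching lines."""
    results = []
    for line in output.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:
            results.append((match.group("uri"),
                            match.group("label"),
                            float(match.group("score"))))
    return results


sample = """<http://www.yso.fi/onto/yso/p772> strawberry 0.39644203283656165
<http://www.yso.fi/onto/yso/p18109> wild strawberry 0.37539359094384245"""

for uri, label, score in parse_analyze_output(sample):
    print(label, round(score, 3), uri)
```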
Calculating statistical measures
$ annif evaldir tfidf-fi tests/corpora/archaeology/fulltext/
Precision: 0.17142857142857143
Recall: 0.3664965986394558
F-measure: 0.23185107376283848
NDCG@5: 0.3426718725322724
NDCG@10: 0.36769238316041325
Precision@1: 0.42857142857142855
Precision@3: 0.3571428571428571
Precision@5: 0.2857142857142857
True positives: 48
False positives: 232
False negatives: 85
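The relationship between the raw counts and the ratios can be sketched as follows, pooling all documents together. Note that pooling reproduces the precision above (48 / 280), but gives slightly different recall and F-measure values than the output shows; presumably the tool averages those per document, which is an assumption and not something the output states.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute pooled precision, recall and F1 from raw counts."""
    suggested = true_positives + false_positives   # everything the tool proposed
    relevant = true_positives + false_negatives    # everything in the gold standard
    precision = true_positives / suggested if suggested else 0.0
    recall = true_positives / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the evaluation output above.
precision, recall, f1 = precision_recall_f1(48, 232, 85)
print("Precision:", precision)
print("Recall:   ", recall)
print("F-measure:", f1)
```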
Mobile app
Prototype web app
ocr.space cloud OCR
A native app (Android / iOS …) could do OCR on the device.
This would enable an AR (augmented reality) mode,
where the app would “reveal” concepts when pointing the camera at text documents, book covers etc.
Watch the video for the prototype:
Test corpora
Full text documents indexed with YSA/YSO for training and evaluation
Available on GitHub: https://github.com/NatLibFi/Annif-corpora
(for the first two corpora, only links to PDFs are provided for copyright reasons)
Evaluation of different backends
F-measure similarity scores against a gold standard
Observations:
Different algorithms, different weaknesses
Receiver Operating Characteristic (ROC) Area Under Curve (AUC) scores for different YSO concepts used to index Jyväskylä University theses, by algorithm
good
questionable
worthless
top 200 most frequent concepts
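For a single concept, the AUC can be read as the probability that a document indexed with the concept receives a higher score than a document that is not. The sketch below computes it by comparing all positive/negative pairs; the function name and the sample scores are hypothetical, and an evaluation at scale would more likely rely on a statistics library.

```python
def roc_auc(scores, labels):
    """ROC AUC for one concept.

    scores: the algorithm's score for the concept, one per document.
    labels: True/1 if the gold standard assigns the concept to that document.
    Ties between a positive and a negative document count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative document")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores for four documents; the first and third carry the concept.
example_scores = [0.9, 0.8, 0.3, 0.1]
example_labels = [1, 0, 1, 0]
print(roc_auc(example_scores, example_labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two groups.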
Annif on GitHub
Python 3.5+ code base
Apache License 2.0
Fully unit tested (98% coverage)
PEP8 style guide compliant
Usage documentation in the wiki
Apply Annif on your own data!
Choose an indexing vocabulary
Prepare a corpus from your existing metadata
Load the corpus into Annif
Use it to index new documents
Next steps
Lessons learned (so far)
Thank you!
Questions?
osma.suominen@helsinki.fi - @OsmaSuominen
Website: http://annif.org
Code: https://github.com/NatLibFi/Annif
Test corpora: https://github.com/NatLibFi/Annif-corpora
These slides: https://tinyurl.com/annif-liber