1 of 22

NCSA faculty fellowship w/iSchool on turning free-text into Knowledge-Graph triples

Mike Bobak

2 of 22

NCSA faculty fellowship w/iSchool 2021-2022

Takes free-text to Knowledge-Graph triples (entities & relationships between them)
Takes the professor's SemRep work from nlm.nih and produces an easier-to-maintain port
Started in a collection of languages incl. Prolog, then a Java port, now in Python
Has already helped with applying for an NIH grant to take the work even further
Makes use of NLM’s MetaMap-Lite (MML) which does the Named-Entity-Recognition
Then sets of rules are used to find relationships between the entities
MML's matching data can be generated from any ontology that has synonyms in each class
Also an aim to make it easier to generalize beyond the biomedical domain
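
A toy sketch of the "entities first, then rules" idea, in Python. It is not SemRep's rule set: the patterns, predicate names, and concept ids below are hypothetical, and the entity spans stand in for what an NER step such as MML might return.

```python
import re

# Hypothetical rules: a pattern over the text BETWEEN two entities -> predicate.
TOY_RULES = [
    (re.compile(r"^\s*(is|are) an?\s*$", re.I), "ISA"),
    (re.compile(r"^\s*treats?\s*$", re.I), "TREATS"),
    (re.compile(r"^\s*causes?\s*$", re.I), "CAUSES"),
]


def span(text, mention, concept_id):
    """Build a (start, end, concept_id) span for the first occurrence of mention."""
    start = text.index(mention)
    return (start, start + len(mention), concept_id)


def extract_triples(text, entities):
    """Apply the rules to each pair of adjacent entities (sorted by start offset)."""
    triples = []
    for (_, end1, c1), (start2, _, c2) in zip(entities, entities[1:]):
        between = text[end1:start2]
        for pattern, predicate in TOY_RULES:
            if pattern.match(between):
                triples.append((c1, predicate, c2))
                break
    return triples


text = "Aspirin treats headache"
entities = [
    span(text, "Aspirin", "concept:aspirin"),    # placeholder concept ids
    span(text, "headache", "concept:headache"),
]
triples = extract_triples(text, entities)
print(triples)
```

Real rules would work over a parse rather than raw text between mentions; the point here is only the shape of the output: (subject, predicate, object) triples over linked entities.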

3 of 22

I worked on:

Get the Java, then Python, code bases running on a new machine; update everything to Python 3

Start some simple logging; suggest using it to catch errors and test for changes in output

incl. some output in brat format to more easily view the parse/relationships within the sentences

Move away from socketed connections to either local calls or REST-based service calls

Update process to pull synonym references from ontologies for NER in other domains

Updated Python code to produce Data File Builder input and run that into MetaMap
also found a simple Python library to pull terms from an ontology and then match on them

Use of owlready2.pymedtermino2 for concept relationship [/ subsumption] tests

Some looking at further work

List of next steps / use in possible grants

4 of 22

Motivation: machine interpretability of knowledge from free-text

Things-not-strings via: free-text -to-> Knowledge-Graph triples (entities w/relationships)

helps achieve the goal of machine-interpretability [KGs need connected things]

blog.google/products/search/introducing-knowledge-graph-things-not

Introducing the Knowledge Graph:

things, not strings

1. Find the right thing: language can be ambiguous

2. Get the best summary: with the Knowledge Graph, Google can better understand your query

3. Go deeper and broader

Finally, the part that’s the most fun of all—the Knowledge Graph can help you make some unexpected discoveries.

5 of 22

There are several application areas for

machine interpretable knowledge

e.g.

6 of 22

Named-Entity-Recognition & Linking

wikipedia.org/wiki/Capital_city_of

7 of 22

Knowledge-Graph triples are made of URI/things,

w/some literal objects

wikipedia.org/wiki/France

wikipedia.org/wiki/Capital_city

wikipedia.org/wiki/Paris

literals are e.g. text, numbers, or any XML Schema datatype; but they can only appear as the terminal Object

dbp:Paris dbp:Population "2161000"^^xsd:int
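
A minimal Python sketch of that constraint: subject and predicate must be URIs, and only the object may be a literal. The `http://example.org/` namespace and the class names are placeholders for illustration, not DBpedia's real identifiers; the xsd:int datatype URI is the standard one.

```python
from dataclasses import dataclass
from typing import Union

XSD_INT = "http://www.w3.org/2001/XMLSchema#int"
EX = "http://example.org/"  # placeholder namespace, not a real vocabulary


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class Literal:
    value: str
    datatype: str


@dataclass(frozen=True)
class Triple:
    subject: URI
    predicate: URI
    obj: Union[URI, Literal]

    def __post_init__(self):
        # Literals are only allowed in the object position.
        if not isinstance(self.subject, URI) or not isinstance(self.predicate, URI):
            raise TypeError("subject and predicate must be URIs")


def to_line(triple):
    """Serialize one triple in an N-Triples-like form."""
    def fmt(term):
        if isinstance(term, URI):
            return f"<{term.value}>"
        return f'"{term.value}"^^<{term.datatype}>'
    return f"{fmt(triple.subject)} {fmt(triple.predicate)} {fmt(triple.obj)} ."


capital = Triple(URI(EX + "France"), URI(EX + "capital"), URI(EX + "Paris"))
population = Triple(URI(EX + "Paris"), URI(EX + "population"), Literal("2161000", XSD_INT))

print(to_line(capital))
print(to_line(population))
```

Because Paris is a URI, the first triple's object can itself be the subject of the second; the literal population value cannot be linked any further.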

8 of 22

We use MetaMap-Lite for Entity-Linking

How it works:

input text ->
sentence/line segmentation -> tokenization -> part-of-speech tagging ->
token window generation -> term normalization ->
concept dictionary lookup ->
negation detection ->
result presentation
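
The token-window and dictionary-lookup steps can be sketched as a greedy longest-match loop. This is a toy Python illustration, not MetaMap Lite's actual code: whitespace tokenization replaces the real tokenizer, lowercasing stands in for term normalization, and the dictionary entries and concept ids are made up. The sentence is the one used in the MML match example in this deck.

```python
def longest_match_spans(tokens, dictionary):
    """Left to right: try the longest window first, shrink it from the right."""
    matches = []
    start = 0
    while start < len(tokens):
        for end in range(len(tokens), start, -1):
            window = " ".join(tokens[start:end])
            concept = dictionary.get(window.lower())  # crude normalization
            if concept is not None:
                matches.append((window, concept))
                start = end  # continue after the matched window
                break
        else:
            start += 1  # no window starting here matched; skip this token
    return matches


# Hypothetical dictionary: normalized term -> placeholder concept id
TOY_DICTIONARY = {
    "papillary thyroid carcinoma": "concept-1",
    "unique": "concept-2",
    "clinical": "concept-3",
    "entity": "concept-4",
}

sentence = "Papillary Thyroid Carcinoma is a Unique Clinical Entity"
matches = longest_match_spans(sentence.split(), TOY_DICTIONARY)
for text, concept in matches:
    print(text, "-->", concept)
```

Trying the longest window first is what lets "Papillary Thyroid Carcinoma" match as one concept instead of three separate words.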

9 of 22

Example MML match:

"Papillary Thyroid Carcinoma is a Unique Clinical Entity"

"Papillary Thyroid Carcinoma is a Unique Clinical"

"Papillary Thyroid Carcinoma is a Unique"

"Papillary Thyroid Carcinoma is a"

"Papillary Thyroid Carcinoma is"

"Papillary Thyroid Carcinoma" --> match

"is a Unique Clinical Entity"

"is a Unique Clinical"

"is a Unique"

"is a"

"is"

"a Unique Clinical Entity"

"a Unique Clinical"

"a Unique"

"a"

"Unique Clinical Entity"

"Unique Clinical"

"Unique" --> match

"Clinical Entity"

"Clinical" --> match

"Entity" --> match

10 of 22

Entity Linking output to the brat rapid annotation tool

11 of 22

Expanding beyond the biomedical domain

Ontologies with the predicate hasExactSynonym

have literal objects whose text can be harvested

to make MML handle new domains.

I plan to use it for GeoCODES, & can think of many others it could be used in
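
A rough Python sketch of the harvesting step, using only the standard library. The tiny ontology, its class URIs, and the terms are invented for illustration; the oboInOwl namespace is the usual one for hasExactSynonym. How these terms are then formatted for the Data File Builder is not covered here.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}

# Made-up RDF/XML fragment standing in for a real domain ontology.
TOY_ONTOLOGY = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://example.org/onto#Basalt">
    <rdfs:label>basalt</rdfs:label>
    <oboInOwl:hasExactSynonym>basaltic rock</oboInOwl:hasExactSynonym>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/onto#Aquifer">
    <rdfs:label>aquifer</rdfs:label>
    <oboInOwl:hasExactSynonym>water-bearing layer</oboInOwl:hasExactSynonym>
    <oboInOwl:hasExactSynonym>groundwater reservoir</oboInOwl:hasExactSynonym>
  </owl:Class>
</rdf:RDF>
"""


def harvest_terms(rdf_xml):
    """Map each class URI to its label plus exact synonyms, in document order."""
    root = ET.fromstring(rdf_xml)
    about = "{%s}about" % NS["rdf"]
    terms = {}
    for cls in root.findall("owl:Class", NS):
        names = [e.text.strip() for e in cls.findall("rdfs:label", NS) if e.text]
        names += [e.text.strip()
                  for e in cls.findall("oboInOwl:hasExactSynonym", NS) if e.text]
        terms[cls.get(about)] = names
    return terms


terms = harvest_terms(TOY_ONTOLOGY)
for uri, names in terms.items():
    print(uri, names)
```

Plain XML parsing only handles this one RDF/XML layout; for real ontologies an RDF/OWL library would be the more robust route.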

12 of 22

Get the Java, then Python, code bases running on a new machine; update everything to Python 3
Start some simple logging; suggest using it to catch errors and test for changes in output

incl. some output in brat format to more easily view the parse/relationships within the sentences

Move away from socketed connections to either local calls or REST-based service calls
Update process to pull synonym references from ontologies for NER in other domains
Use of owlready2.pymedtermino2 for concept relationship tests

https://isda.ncsa.illinois.edu/~mbobak/

for February-June:

Process/documentation for regular UMLS updates

Metamorphosys
Can we rely on MetaMap Lite files?

Process/documentation for adapting MetaMap Lite to non-UMLS vocabularies/ontologies

What is required in the vocabulary/ontology? What is good-to-have?
Data File Builder
Tips/tricks

Overall infrastructure

Should we consider running MetaMap Lite and other server processes in a different way?
Logging
Unit tests
Serialization/deserialization

13 of 22

After this: extra slides. This is just a very rough first draft.

14 of 22

Clowder is mentioned in the NIH grant proposal & I will annotate this EC free-text too

15 of 22

16 of 22

Clowder organization

One space per data-facility
Datasets hold metadata
Also a Resources space:

Allows for

dataset & tool search
metadata/annotation
linking out to get the data
& sometimes (assoc) tool/s

17 of 22

Clowder search results

& a result’s metadata(tab) tree listing

18 of 22

Future work:

Linking data with tools ..
Automatic launching of tools with data
From search to use in a NoteBook
Search on map & in NoteBook
Search enhanced w/NER & more, see:
https://mbcode.github.io/ec
Getting these benefits in clowder via:

triple store sync with clowder
embedding science on schema
DCAT as a superset/furthering the gateway from schema.org to real science descriptions

19 of 22

Faster time to science

via metadata use

to get more

resources

Can take questions later: @Mike Bobak

20 of 22

extra slides