1 of 22

NCSA faculty fellowship w/iSchool on turning free-text into Knowledge-Graph triples

Mike Bobak

2 of 22

NCSA faculty fellowship w/iSchool 2021-2022

  • Takes free-text to Knowledge-Graph triples (entities & relationships between them)
  • Builds on the professor's work with SemRep (nlm.nih) to get an easier-to-maintain port
  • Started in a collection of languages incl. Prolog, then Java port, now in Python
  • Has already helped with applying for an NIH grant to take the work even further
  • Makes use of NLM’s MetaMap-Lite (MML) which does the Named-Entity-Recognition
  • Then sets of rules are used to find relationships between the entities
  • MML's matching data can be generated from any ontology that has synonyms in each class
  • Also an aim to make it easier to generalize beyond the biomedical domain

3 of 22

I worked on:

  • Get the Java, then the Python, code bases running on a new machine; update everything to Python 3

  • Start some simple logging; suggest its use to catch errors and to test for changes in output,
    incl. some output in brat format to more easily view the parse/relationships within the sentences

  • Move services away from socketed connections to either local execution or REST-based service calls

  • Update the process to pull synonym references from ontologies for NER in other domains
    • Updated the Python code to produce Data File Builder input and feed that into MetaMap
    • Also found a simple Python library to pull terms from an ontology and then match against them

  • Use of owlready2.pymedtermino2 for concept relationship [/ subsumption] tests

  • Some looking at further work
    • List of next steps / use in possible grants

4 of 22

Motivation: machine interpretability of knowledge from free-text

Things-not-strings via: free-text -to-> Knowledge-Graph triples (entities w/relationships)

helps achieve the goal of machine-interpretability [KGs need connected things]

blog.google/products/search/introducing-knowledge-graph-things-not

Introducing the Knowledge Graph:

things, not strings

1. Find the right thing: Language can be ambiguous

2. Get the best summary: With the Knowledge Graph, Google can better understand your query

3. Go deeper and broader

Finally, the part that’s the most fun of all—the Knowledge Graph can help you make some unexpected discoveries.

5 of 22

There are several application areas for

machine interpretable knowledge

e.g.

6 of 22

Named-Entity-Recognition & Linking

wikipedia.org/wiki/Capital_city_of

7 of 22

Knowledge-Graph triples are made of URI/things,

w/some literal objects

wikipedia.org/wiki/France

wikipedia.org/wiki/Capital_city

wikipedia.org/wiki/Paris

Literals are e.g. text, numbers, or any XML Schema datatype, but can only appear in the terminal Object position of a triple

dbp:Paris dbp:Population "2161000"^^xsd:int
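
A minimal Python sketch of the triple shape described on this slide. It assumes the three wikipedia.org URIs above are read as subject, predicate, and object; the `Literal` and `Triple` class names are hypothetical and only illustrate the rule that literals sit in the object position.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Literal:
    """A typed literal, e.g. the value "2161000" with datatype xsd:int."""
    value: str
    datatype: str


@dataclass(frozen=True)
class Triple:
    subject: str               # a URI / thing
    predicate: str             # a URI / thing
    obj: Union[str, Literal]   # a URI / thing, or a literal

    def __post_init__(self):
        # Literals may only appear in the object position.
        if isinstance(self.subject, Literal) or isinstance(self.predicate, Literal):
            raise TypeError("literals are only allowed as the object of a triple")


triples = [
    # Assumed reading of the slide: France -> Capital_city -> Paris
    Triple("wikipedia.org/wiki/France",
           "wikipedia.org/wiki/Capital_city",
           "wikipedia.org/wiki/Paris"),
    # The literal example from the slide
    Triple("dbp:Paris", "dbp:Population", Literal("2161000", "xsd:int")),
]
```

Trying to build `Triple(Literal("x", "xsd:string"), "p", "o")` raises a `TypeError` in this sketch, which mirrors the "only in terminal Objects" rule.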

8 of 22

We use MetaMap-Lite for Entity-Linking

How it works:

  • input text ->
  • sentence/line segmentation -> tokenization -> part-of-speech tagging ->
  • token window generation -> term normalization ->
  • concept dictionary lookup ->
  • negation detection ->
  • result presentation
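
The stages above can be sketched as a small chain of Python functions. This is only an illustration of how the steps fit together, not MetaMap Lite's code: the concept dictionary, the concept ids, and the negation trigger words are made-up assumptions, and part-of-speech tagging and token window generation are left out to keep it short.

```python
import re

# Hypothetical toy concept dictionary: normalized term -> made-up concept id.
CONCEPTS = {"fever": "C-FEVER", "cough": "C-COUGH"}
# Hypothetical negation triggers, checked in a small window before a match.
NEGATION_TRIGGERS = {"no", "not", "without", "denies"}


def segment(text):
    """Sentence/line segmentation: split after ., ! or ? and on newlines."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+|\n+", text) if s.strip()]


def tokenize(sentence):
    """Tokenization: keep runs of letters and digits."""
    return re.findall(r"[A-Za-z0-9]+", sentence)


def normalize(token):
    """Term normalization: here just lowercasing."""
    return token.lower()


def annotate(text):
    """Run the stages and return (sentence, term, concept id, negated) tuples."""
    results = []
    for sentence in segment(text):
        tokens = [normalize(t) for t in tokenize(sentence)]
        for i, tok in enumerate(tokens):
            if tok in CONCEPTS:  # concept dictionary lookup
                # Negation detection: any trigger in the 3 tokens before the match.
                negated = any(t in NEGATION_TRIGGERS for t in tokens[max(0, i - 3):i])
                results.append((sentence, tok, CONCEPTS[tok], negated))
    return results


# Result presentation
for row in annotate("Patient denies fever. Patient has a cough."):
    print(row)
```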

9 of 22

Example MML match:

"Papillary Thyroid Carcinoma is a Unique Clinical Entity"

"Papillary Thyroid Carcinoma is a Unique Clinical"

"Papillary Thyroid Carcinoma is a Unique"

"Papillary Thyroid Carcinoma is a"

"Papillary Thyroid Carcinoma is"

"Papillary Thyroid Carcinoma" --> match

"is a Unique Clinical Entity"

"is a Unique Clinical"

"is a Unique"

"is a"

"is"

"a Unique Clinical Entity"

"a Unique Clinical"

"a Unique"

"a"

"Unique Clinical Entity"

"Unique Clinical"

"Unique" --> match

"Clinical Entity"

"Clinical" --> match

"Entity" --> match
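
The window sequence on this slide can be reproduced with a short greedy, longest-window-first loop. This is a sketch of the behaviour shown above, not MetaMap Lite's implementation; the four-entry dictionary is a hypothetical stand-in for the real concept dictionary.

```python
# Hypothetical toy dictionary holding only the terms that match on the slide.
DICTIONARY = {"papillary thyroid carcinoma", "unique", "clinical", "entity"}


def greedy_window_matches(tokens, dictionary):
    """Try the longest window from each start token, shrinking until a match.

    Returns (tried, matches): every window tried, in order, and the matches.
    After a match the scan resumes at the token following the match.
    """
    tried, matches = [], []
    i = 0
    while i < len(tokens):
        matched_len = 0
        for j in range(len(tokens), i, -1):  # longest window first
            window = " ".join(tokens[i:j])
            tried.append(window)
            if window.lower() in dictionary:
                matches.append(window)
                matched_len = j - i
                break
        # Skip past a match, otherwise move on by one token.
        i += matched_len if matched_len else 1
    return tried, matches


tokens = "Papillary Thyroid Carcinoma is a Unique Clinical Entity".split()
tried, matches = greedy_window_matches(tokens, DICTIONARY)
for window in tried:
    print(f'"{window}"' + (" --> match" if window in matches else ""))
```

With this toy dictionary the loop tries the same 21 windows listed on the slide and reports the same four matches.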

10 of 22

Entity Linking output to the brat rapid annotation tool
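
A rough sketch of how entity matches could be written out as brat standoff text-bound annotations (one `T` line per entity: id, tab, type with start and end character offsets, tab, covered text). The function name and the `Concept` label are hypothetical; the project's own brat output may carry different types and also relations.

```python
def to_brat_ann(text, entities):
    """Render (start, end, label) character spans as brat .ann lines."""
    lines = []
    for n, (start, end, label) in enumerate(entities, start=1):
        lines.append(f"T{n}\t{label} {start} {end}\t{text[start:end]}")
    return "\n".join(lines)


text = "Papillary Thyroid Carcinoma is a Unique Clinical Entity"
entities = []
for term in ["Papillary Thyroid Carcinoma", "Unique", "Clinical", "Entity"]:
    start = text.find(term)  # character offset of the matched term
    entities.append((start, start + len(term), "Concept"))

ann = to_brat_ann(text, entities)
print(ann)
```

The resulting lines would go in a `.ann` file next to a `.txt` file holding the sentence, which is the pair of files brat reads.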

11 of 22

Expanding Beyond BioMedical domain

Ontologies with the predicate hasExactSynonym

have literal objects whose text can be harvested

to make MML handle new domains.

I plan to use this for GeoCODES, & can think of many other domains where it could be used
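
A minimal sketch of the harvesting idea, using only the Python standard library on a tiny inline RDF/XML sample. It assumes hasExactSynonym is the oboInOwl predicate; the `example.org` class and its terms are made up. The output is a plain term-to-IRI table, not the actual Data File Builder input format.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    # Assumption: synonyms use the oboInOwl hasExactSynonym predicate.
    "oboInOwl": "http://www.geneontology.org/formats/oboInOwl#",
}

# Made-up sample ontology fragment, for illustration only.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://example.org/onto#River">
    <rdfs:label>river</rdfs:label>
    <oboInOwl:hasExactSynonym>stream</oboInOwl:hasExactSynonym>
    <oboInOwl:hasExactSynonym>watercourse</oboInOwl:hasExactSynonym>
  </owl:Class>
</rdf:RDF>"""


def harvest_synonyms(rdf_xml):
    """Map each label and exact synonym (lowercased) to its class IRI."""
    root = ET.fromstring(rdf_xml)
    about = "{%s}about" % NS["rdf"]
    table = {}
    for cls in root.findall("owl:Class", NS):
        iri = cls.get(about)
        elements = (cls.findall("rdfs:label", NS)
                    + cls.findall("oboInOwl:hasExactSynonym", NS))
        for element in elements:
            if element.text:
                table[element.text.strip().lower()] = iri
    return table


synonyms = harvest_synonyms(SAMPLE)
print(synonyms)
```

Each harvested string then becomes a candidate term for dictionary lookup, all pointing at the same class.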

12 of 22

  • Get the java then python code bases running on a new machine, update everything to python3
  • Start some simple logging; suggest its use to catch errors and to test for changes in output,
    incl. some output in brat format to more easily view the parse/relationships within the sentences

  • Move away from socketed connections to either local calls or REST based service calls.
  • Update process to pull synonym references from ontologies for NER in other domains
  • Use of owlready2.pymedtermino2 for concept relationship tests

https://isda.ncsa.illinois.edu/~mbobak/

for February-June:

  • Process/documentation for regular UMLS updates
    • Metamorphosys
    • Can we rely on MetaMap Lite files?
  • Process/documentation for adapting MetaMap Lite to non-UMLS vocabularies/ontologies
    • What is required in the vocabulary/ontology? What is good-to-have?
    • Data File Builder
    • Tips/tricks
  • Overall infrastructure
    • Should we consider running MetaMap Lite and other server processes in a different way?
    • Logging
    • Unit tests
    • Serialization/deserialization

13 of 22

after this, extra slides, this is just a very rough, 1st draft

14 of 22

Clowder is mentioned in the NIH grant proposal & I will annotate this EC free-text too

15 of 22

16 of 22

Clowder organization

  • One space per data-facility
  • Datasets hold metadata
  • Also a Resources space:

Allows for

  • dataset & tool search
  • metadata/annotation
  • linking out to get the data
  • & sometimes (assoc) tool/s

17 of 22

Clowder search results

& a result’s metadata (tab) tree listing

18 of 22

Future work:

  • Linking data with tools ..
  • Automatic launching of tools with data
  • From search to use in a NoteBook
  • Search on map & in NoteBook
  • Search enhanced w/NER & more, see:
  • https://mbcode.github.io/ec
  • Getting these benefits in clowder via:
    • triple store sync with clowder
    • embedding science on schema
    • DCAT as a superset/furthering the gateway from schema.org to real science descriptions

19 of 22

Faster time to science

via metadata use

to get more

resources

Can take questions later: @Mike Bobak

20 of 22

21 of 22

22 of 22