1 of 40

A Lexicon and Rule-Based Tool for Translating Short Biomedical Specimen Descriptions into Ontology Terms

The William Hsiao Lab

UBC Department of Pathology & British Columbia Centre for Disease Control

Gurinder Gosal

gurinder.gosal@bccdc.ca

Damion Dooley

Scientific Programmer

damion.dooley@bccdc.ca

Backgrounder

  • Challenge: Pairing genomics with biosample metadata
  • IRIDA.ca next gen. sequencing platform for foodborne pathogens: supporting epidemiology, federated data sharing
  • Biomedical data standardization via ontology: GenEpiO + FoodOn + LexMapr + Genomic Epidemiology Entity Mart

2 of 40

A Lexicon and Rule-Based Tool for Translating Short Biomedical Specimen Descriptions into Ontology Terms

Open source!

3 of 40

LexMapr project goals

Tune ontology labels and synonyms to support common language

Transform sample text into n-gram combinations of tokens that are likely to match ontology term labels and synonyms
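
The n-gram goal above can be sketched as follows. This is a minimal illustration, not LexMapr's actual code: the function name `ngrams` and the longest-first ordering are assumptions made for the example.

```python
def ngrams(tokens, max_n=None):
    """Return every contiguous n-gram of `tokens`, longest first.

    Hypothetical helper: longer candidates come first so that a later
    lookup step can prefer the most specific ontology label.
    """
    max_n = min(max_n or len(tokens), len(tokens))
    result = []
    for n in range(max_n, 0, -1):
        for start in range(len(tokens) - n + 1):
            result.append(" ".join(tokens[start:start + n]))
    return result


if __name__ == "__main__":
    for candidate in ngrams(["frozen", "cooked", "shrimp"]):
        print(candidate)
```

Each candidate string would then be compared against ontology term labels and synonyms.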

No word association machine learning!

Why not?

Short texts often have very limited features available to build robust machine learning systems as compared to longer texts.

cow paddock straw bedding_bovine (dairy)

Sus scrofa; intestine

4 of 40

Challenges

Challenge

Observations

Examples from dataset

Synonyms

  • Significant impact on recall
  • Resources usually miss synonyms or have low coverage

feces

Horse Stool

Cattle Fecal

Abbreviations/ Acronyms

  • A rapid rate of creation
  • Difficult to get comprehensive list and keep it up-to-date

  • Abbreviation or acronym may refer to multiple concepts
  • Usage of non-standard ones

frz rock lobster tails

frz. cooked shrimp

CSF

NW → North West or net weight

frozen → frz., fz, frz, froz, frzn, fzn

Plurality

  • Has to be dealt with to boost recall
  • Dealing with false positives

sesame seeds

macadamia → macadamium

5 of 40

Challenges …

Challenge

Observations

Examples from dataset

Grammatical mistakes

  • Frequent spelling mistakes

  • Misspelled words might be valid in other contexts

cucmbers → cucumbers

“beast” as a misspelling of “breast” in “chicken breast”

Boundary detection

Boundary detection is comparatively difficult

  • Usually longer names than in other domains

  • Names can be overlapping

  • Arbitrary ordering of words

“shrimp, white, farm raised, raw, frz”

[fresh grape] tomatoes or fresh [grape tomatoes]?

frozen shrimp vs shrimp, frz

6 of 40

Challenges …

Challenge

Observations

Examples from dataset

Multilingual phrases

  • Words from several languages are commonly mixed within a sample description

sambar (seasoning mix)

Sambar (Indian food name) → lentil curry

Contextual deficit and ambiguity

  • Shorter text often devoid of surrounding contexts

  • Ambiguity increases when the terms belong to multiple underlying domains

“plain”, “ground” (environmental and food domain terms)

Term labelling in ontologies

  • The variations (and inconsistencies) in the labels of ontology terms pose additional challenges

  • Use of different suffixes (or parentheses, punctuations etc.)

FoodOn: snack, ready-to-eat

FoodOn: chicken breast (sliced, ready to eat)

FoodOn: lettuce vegetable food product

7 of 40

Resources used

8 of 40

LexMapr pipeline

9 of 40

Punctuation

&!@_#(%;@! → &!@#@!

fish-meal → fish meal

quail, frozen → quail frozen

A token is a sequence of characters (usually between spaces or punctuation) that are grouped together as a useful semantic unit for processing.
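
A minimal sketch of the punctuation treatment and tokenization described on this slide. The set of characters replaced by a space is an assumption chosen to reproduce the slide's examples; LexMapr's real rules may differ.

```python
import re

# Assumed set of punctuation that separates tokens (hypothetical choice).
SEPARATORS = re.compile(r"[-_,;()]")


def punctuation_treatment(text):
    """Replace separator punctuation with spaces and collapse whitespace."""
    cleaned = SEPARATORS.sub(" ", text)
    return " ".join(cleaned.split())


def tokenize(text):
    """Split a cleaned sample description into tokens."""
    return punctuation_treatment(text).split()


if __name__ == "__main__":
    print(punctuation_treatment("fish-meal"))
    print(punctuation_treatment("quail, frozen"))
    print(tokenize("cow paddock straw bedding_bovine (dairy)"))
```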

10 of 40

Singularization and spelling fix

Singularization

Convert to the singular form

With required exceptions (for dealing with false positives)

sesame seeds

Spelling fix

Using a misspelling lookup table.

Future: pay attention to rdfs:label language variations.

turmaric → turmeric

11 of 40

Normalization

Lowercase transform

Normalize terms to lower case

Future: unless capitalization is the norm (e.g. pH value)

Garlic Powder → garlic powder

Synonymize

Using synonym and acronym lookup tables

stool → feces

frz → frozen

Multilingual phrases

Non English food names lookup table

ashwagandha → rennet

in “ashwagandha powder”
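
The lookup-table steps on this slide and the previous one can be sketched as below. The table contents are taken from examples in this deck; the table names, the function names and the order of the steps are assumptions for illustration only.

```python
# Small illustrative lookup tables; real tables would be far larger.
SPELLING = {"turmaric": "turmeric", "cucmbers": "cucumbers"}
SYNONYMS = {
    "stool": "feces",
    "frz": "frozen", "fz": "frozen", "froz": "frozen",
    "frzn": "frozen", "fzn": "frozen",
}
NON_ENGLISH = {"ashwagandha": "rennet"}  # mapping as given on the slide


def normalize_token(token):
    """Lowercase a token, then apply spelling, synonym and language lookups."""
    token = token.lower().rstrip(".")  # "frz." is treated like "frz"
    token = SPELLING.get(token, token)
    token = SYNONYMS.get(token, token)
    return NON_ENGLISH.get(token, token)


def normalize(text):
    """Normalize every whitespace-separated token of a description."""
    return " ".join(normalize_token(t) for t in text.split())


if __name__ == "__main__":
    print(normalize("frz. cooked shrimp"))
    print(normalize("Horse Stool"))
```

The lowercase step here ignores the slide's future exception for terms where capitalization is the norm.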

12 of 40

Term mapping

Multiple rules in place

  • longest match
  • permutations and combinations of tokens, in the input and in the resource
  • addition of suffixes etc.
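
Two of the rules listed above (token permutations and suffix addition) can be sketched as follows. The tiny lexicon uses labels and IDs shown elsewhere in this deck; the function names, the "core (differentia, ...)" candidate pattern and the rule order are assumptions, and the longest-match rule is not covered. Because permutations grow factorially, this only suits short phrases.

```python
from itertools import permutations

# Illustrative lexicon built from label/ID pairs shown in this deck.
LEXICON = {
    "stream sediment": "ENVO_00002127",
    "sesame seed (hulled)": "FOODON_03304876",
    "shrimp (cooked, frozen)": "FOODON_03308827",
    "soup food product": "FOODON_00002257",
    "pork meat food product": "FOODON_00001038",
}
SUFFIXES = ["food product", "meat food product"]


def candidate_labels(tokens):
    """Yield label candidates for a list of already normalized tokens."""
    for perm in permutations(tokens):
        yield " ".join(perm)  # plain reordering
        for k in range(1, len(perm)):
            # parenthesized style, e.g. "sesame seed (hulled)"
            yield f"{' '.join(perm[:k])} ({', '.join(perm[k:])})"
    phrase = " ".join(tokens)
    for suffix in SUFFIXES:
        yield f"{phrase} {suffix}"  # suffix addition


def map_term(tokens):
    """Return (label, ontology id) for the first candidate found, else None."""
    for label in candidate_labels(tokens):
        if label in LEXICON:
            return label, LEXICON[label]
    return None


if __name__ == "__main__":
    print(map_term(["sediment", "stream"]))
    print(map_term(["frozen", "cooked", "shrimp"]))
    print(map_term(["pork"]))
```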

| Sample description | Matched terms | Ontology id |
| smoked round scad | preservation by smoking | FOODON_03470106 |
| | round scad (food source) | FOODON_03412481 |
| Octopus, frz, raw | octopus (raw) | FOODON_03310498 |
| | preservation by freezing | FOODON_03470136 |

13 of 40

Term mapping

| Sample description | Matched terms | Ontology id |
| sediment, stream | stream sediment | ENVO_00002127 |
| sesame seeds, hulled | sesame seed (hulled) | FOODON_03304876 |
| frz cooked shrimp | shrimp (cooked, frozen) | FOODON_03308827 |
| frz. lobster tails | lobster tail (frozen) | FOODON_03305435 |

14 of 40

Stop words and singularization exceptions

Customized stop words removal

We do remove selected stop words, such as the, a, an etc.

Others, such as “with” and “from”, are not removed because they contribute semantic context.

    • In “cooked shrimp ring with cocktail sauce”, “with” represents the combination of two food components.
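
A minimal sketch of customized stop word removal. The stop word set holds only the three words named on this slide; the full list LexMapr uses is not given here, and the function name is hypothetical.

```python
# Only the stop words named on the slide; "with" and "from" are kept on purpose.
STOP_WORDS = {"the", "a", "an"}


def remove_stop_words(tokens):
    """Drop selected stop words while keeping semantically useful prepositions."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


if __name__ == "__main__":
    print(remove_stop_words("cooked shrimp ring with cocktail sauce".split()))
    print(remove_stop_words("brush inside of the dryer".split()))
```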

Singularization exceptions

Generalized exception rules for the bio-sample domain.

E.g., no inflection if token ends with “us”, “ia”, “ta” etc.

    • to stop bos taurus from inflecting to bos tauru

Singularization exception terms lookup table based on the training dataset for the underlying domain

      • to stop canis from being singularized to cani
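
The two kinds of singularization exception can be sketched as below. The deck lists the `inflection` package among the dependencies; to stay self-contained, this example substitutes a naive suffix rule for it, so the singularization itself is an assumption. The exception suffixes and the term "canis" come from this slide.

```python
# Generalized exception rule: suffixes that must not be inflected.
EXCEPTION_SUFFIXES = ("us", "ia", "ta")
# Exception term lookup table (single entry from the slide).
EXCEPTION_TERMS = {"canis"}


def singularize(token):
    """Naively singularize a token unless an exception applies."""
    lower = token.lower()
    if lower in EXCEPTION_TERMS or lower.endswith(EXCEPTION_SUFFIXES):
        return token
    if lower.endswith("ies") and len(lower) > 3:
        return token[:-3] + "y"
    if lower.endswith("s") and not lower.endswith("ss"):
        return token[:-1]
    return token


if __name__ == "__main__":
    for word in ["seeds", "taurus", "canis", "macadamia", "pecans"]:
        print(word, "->", singularize(word))
```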

15 of 40

Semantic tagging and candidate terms

shrimp, tiny

tiny: Quality-Size

Brush inside of dryer

brush: Equipment-OR-Device-OR-ManmadeObject

dryer: Equipment

Unmapped entities are tagged semantically when the constituents of a text phrase do not match any ontology term

Contribution for enrichment of resources - potential new candidates for resource terms
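
A minimal sketch of semantic tagging for tokens left unmapped after term matching. The tag table holds only the three examples from this slide, and the function name and its signature are hypothetical.

```python
# Example tags taken from the slide; a real table would cover many more words.
SEMANTIC_TAGS = {
    "tiny": "Quality-Size",
    "brush": "Equipment-OR-Device-OR-ManmadeObject",
    "dryer": "Equipment",
}


def tag_unmapped(tokens, matched_tokens):
    """Return {token: semantic tag} for tokens that did not match an ontology term."""
    return {
        token: SEMANTIC_TAGS[token]
        for token in tokens
        if token not in matched_tokens and token in SEMANTIC_TAGS
    }


if __name__ == "__main__":
    # "shrimp" is assumed to have matched an ontology term already.
    print(tag_unmapped(["shrimp", "tiny"], matched_tokens={"shrimp"}))
```

Tagged tokens could then be collected as candidate terms for resource enrichment.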

Semantic Tags

………

BodyPart-OR-OrganicPart

Equipment-OR-Device-OR-ManmadeObject

Furniture

Portion_FoodOrOther

Preposition-Containment

Preposition-HavingOrigin

Preposition-Presence

Quality-Age

Quality-BodyRelated

Quality-Color

Quality-Content

Quality-Direction

Quality-Environment

Quality-Ethnic

Quality-Food

Quality-Gender

Quality-Generic

Quality-Geo

Quality-Grading

Quality-Heat

Quality-Heat-or-Food

Quality-Odour

Quality-PlantRelated

Quality-Position

Quality-Shape

Quality-Size

Quality-SpeciesRelated

Quality-State

Quality-Taste

Quality-Texture

Quality-TimeRelated

Quality-Trait

Structure-OR-Area

Structure-OR-Area-OR-ManmadeObject

TimeRelated

Trademark

Unit

WaterBody

………

16 of 40

A snapshot of mapping results

| Sample Description | Matched Ontology Terms | Resource IDs | Rule Triggered |
| Fish-meal | fish meal | FOODON_03301620 | Punctuation Treatment |
| quail, frozen | preservation by freezing | FOODON_03470136 | Punctuation Treatment |
| | quail (food source) | FOODON_03411346 | |
| Soil | soil | ENVO_00001998 | Change of Case |
| Mayonnaise | mayonnaise | FOODON_03301440 | Change of Case |
| Dog Stool | Canis lupus familiaris | NCBITaxon_9615 | Synonym Usage |
| | feces | ENVO_00002003 | |
| Cat | Felis catus | NCBITaxon_9685 | Synonym Usage |
| Pecans | pecan | FOODON_03304764 | Singularization |
| Walnuts | walnut | FOODON_03316466 | Singularization |

17 of 40

A snapshot of mapping results…

| Sample Description | Matched Ontology Terms | Resource IDs | Rule Triggered |
| homo spaiens; Stool | homo sapiens | NCBITaxon_9606 | Spelling Correction Treatment |
| | feces | UBERON_0001988 | |
| Sus scrofa; intestin | intestine | UBERON_0000160 | Spelling Correction Treatment |
| | sus scrofa | NCBITaxon_9823 | |
| spice mix | spice mixture | FOODON_03304292 | Abbreviation-Acronym Treatment |
| frz frog legs | frog leg (frozen) | FOODON_03305167 | Abbreviation-Acronym Treatment |
| dried soup | preservation by dehydration or drying | FOODON_03470116 | Suffix Addition: “food product” added to the input |
| | soup food product | FOODON_00002257 | |
| Pork | pork meat food product | FOODON_00001038 | Suffix Addition: “meat food product” added to the input |

 

18 of 40

Text sample annotation tools

Zooma

Text lookup → Selectable OLS ontologies

https://www.ebi.ac.uk/spot/zooma/

Monarch Text Annotator

Text lookup → Genotype/phenotype

https://monarchinitiative.org/annotate/text

EXTRACT 2.0 (browser add-on+API)

Text lookup → Genotype/phenotype

https://extract.jensenlab.org/

OntoMaton (Google sheets)

Phrase to term lookup → many choices. Uses all of BioPortal or OLS or LOV.

https://chrome.google.com/webstore/detail/ontomaton/

Webulous (Google sheets)

Phrase to term lookup → sparse or many choices. One or all BioPortal ontologies.

https://www.ebi.ac.uk/spot/webulous/

19 of 40

Tools: Extract 2.0

fish meal

quail, frozen

Soil

Garlic Powder

Stool

Cat

Pecans

sesame seeds

Mackeral

homo spaiens; Stool

Sus scrofa; intestine

spice mix

frz frog legs

dried soup

sediment, stream

sesame seeds, hulled

frz cooked shrimp

...

ENVO:Island

ENVO:Sand

ENVO:Sediment

ENVO:Sewage

ENVO:Soil

ENVO:Waste water

gene:Homo sapiens:SPICE1

NCBITaxon:Allium sativum

NCBITaxon:Bos taurus

NCBITaxon:Canis lupus familiaris

NCBITaxon:Coriandrum sativum

NCBITaxon:Lobster

NCBITaxon:Meleagris gallopavo

NCBITaxon:Pistacia vera

NCBITaxon:Prunus dulcis

NCBITaxon:Rusa unicolor

NCBITaxon:Scophthalmus maximus

NCBITaxon:Shorea almon

NCBITaxon:Spinacia oleracea

NCBITaxon:Sus scrofa

NCBITaxon:Sus scrofa scrofa

NCBITaxon:Withania somnifera

NCBITaxon:Zea mays subsp. mays

BTO:Intestine

Demo text: 36 phrases → Genotype/phenotype: DOID, GO, ENVO, MP, NCBITaxon ...

Fish meal quail, frozen Soil Garlic Powder Stool Cat Pecans sesame seeds Mackeral homo spaiens; Stool Sus scrofa; intestine spice mix frz frog legs dried soup sediment, stream sesame seeds, hulled frz cooked shrimp Pork ashwagandha powder sambar powder coriander, fresh Bathroom Countertop frz. Fish frz. lobster tails biological tissue and/or fluid -canis lupus familiaris raw organic baby spinach chili & lime flavored corn snacks ready to eat deli meat oven roasted turkey cow paddock straw bedding_bovine(dairy) textered soya protein made of flour for cooking sewage sand island waste water toasted soy grits roasted salted pistachio salmon turbot butterfish smoked whole raw shelled almond

20 of 40

Tools: Monarch Text Annotator

→ Genotype/phenotype: GO, NCBIGene, MONDO, PO, WBPheno...

Fish-meal quail, frozen Soil Garlic Powder Stool Cat Pecans sesame seeds Mackeral homo spaiens; Stool Sus scrofa; intestine spice mix frz frog legs dried soup sediment, stream sesame seeds, hulled frz cooked shrimp Pork ashwagandha powder sambar powder coriander, fresh Bathroom Countertop frz. Fish frz. lobster tails biological tissue and/or fluid canis lupus familiaris raw organic baby spinach chili & lime flavored corn snacks ready to eat deli meat oven roasted turkey cow paddock straw bedding_bovine (dairy) textered soya protein made of flour for cooking sewage sand island waste water toasted soy grits roasted salted pistachio salmon turbot butterfish smoked whole raw shelled almond

Note: “-canis” blocks “canis lupus familiaris”

| text | category | term |
| sesame | gene | KCNJ10 (NCBIGene:3766) |
| sesame | | Sesamum indicum (NCBITaxon:4182) |
| sesame | gene | Hira (NCBIGene:31680) |
| sesame | disease | EAST syndrome (MONDO:0013005) |
| seeds | | seed (OBO:PO_0009010) |

21 of 40

Tools: EBI Zooma

misspelling → skipped NCBITaxon_9606

no singularization → FoodOn “sesame seed” missed?

a virus!

sesame seeds

11850 - sesame seeds (efsa foodex2)

sesame seeds and products thereof

Gros IIot Sesame

Townland of Seeds

Sesame necrotic mosaic virus

Iranian Sesame phyllody phytoplasma

homo spaiens; Stool

Homo

feces collection

Homo heidelbergensis

Skerry of Stool

Bogue Homo

feces

“... -canis lupus familiaris” blocks “-cannis”

22 of 40

Tools: LexMapr

Biosample domain: have disease & gene match be an option.

Fish-meal quail, frozen Soil Garlic Powder Stool Cat Pecans sesame seeds Mackeral homo spaiens; Stool Sus scrofa; intestine spice mix frz frog legs dried soup sediment, stream sesame seeds, hulled frz cooked shrimp Pork ashwagandha powder sambar powder coriander, fresh Bathroom Countertop frz. Fish frz. lobster tails biological tissue and/or fluid -canis lupus familiaris raw organic baby spinach chili & lime flavored corn snacks ready to eat deli meat oven roasted turkey cow paddock straw bedding_bovine (dairy) textered soya protein made of flour for cooking sewage sand island waste water toasted soy grits roasted salted pistachio salmon turbot butterfish smoked whole raw shelled almond

  • Green term: good ontology match
  • Blue term: manually set as candidate term (a term from a previous Enterobase or GenomeTrakr sample set that is missing in LexMapr resources)
  • GAZ subset: countries and some states & provinces. Avoids mismatch on town etc.
  • Future: deal with stop words: and/or/not

23 of 40

BioSample

specimen

OBI:0100051

specimen extraction geographic location

GENEPIO:0001656

environmental material

(mud, soil, wastewater)

ENVO:00010483

food product type

FOODON:03400361

specimen source substance

GENEPIO:0000134

subject organism

GENEPIO:00001578

Petrous bone extraction for DNA

photo credit: https://www.nature.com/news/stop-hoarding-ancient-bones-plead-archaeologists-1.22445

OR

OR

located in

derives from

is a

is a

is a

located in

located in

24 of 40

BioSample draft

longitude

OBI:0001621

latitude

OBI:0001620

specimen source substance

GENEPIO:0000134

specimen extraction latitude and longitude

GENEPIO:0000134

is about

specimen extraction country*

GENEPIO:0000118

(ENVO:00000009)

specimen extraction city*

GENEPIO:0001785

(ENVO:00000856)

Immediate sampling site

(building wall, plumbing drain)

ENVO | AGRO

biome

(forest, grassland ...)

ENVO:00000428

built environment

(hospital, food kiosk…)

(Anthropogenic geographic feature) ENVO:00000002

physiographic feature

(watershed, desert, beach...)

ENVO:00000191

specimen extraction geographic location

GENEPIO:0000134

located in

located in

located in

located in

*Where specimen is obtained, not where processing, storage, etc. occur. Precomposed “specimen x extraction” terms are used to facilitate data transfer.

has component

environmental parameter

temperature, depth

25 of 40

BioSample draft

specimen from organism

OBI:0001479

human

NCBITaxon:9606

plant structure

PO:0009011

subject anatomical site

GENEPIO:0000025

subject age

GENEPIO:0001775

(OBI:0001169)

subject sex

GENEPIO:0000031

(PATO:0001894)

epidemiology case model

GENEPIO:0001613

has taxonomic id

anatomical structure

UBERON:0000061

is about

is about

specimen

OBI:0100051

specimen source substance

GENEPIO:0000134

derives from

subject organism

GENEPIO:00001578

produced by

derives from

OR

is a

clinical

is a

part of

subject taxonomy

GENEPIO:0001567

has quality

symptom

HP:0000118

subject disease

GENEPIO:0001617

has quality

subject demographic

GENEPIO:0001683

subject body product

GENEPIO:0000028

26 of 40

FoodOn semantics

27 of 40

FoodOn semantics

shrimp (cooked, frozen)

FOODON:03308827

frz cooked shrimp

short biosample description

LexMapr

shrimp (cooked) FOODON_03301481

shrimp (frozen)

FOODON_03301169

Cooked (boiled, grilled, microwaved?) then frozen?

Or frozen, (then thawed?) then cooked?

Now hot? Frozen? Or room temperature?

28 of 40

FoodOn semantics

shrimp (cooked, frozen)

FOODON:03308827

shrimp (cooked)

shrimp (frozen)

and

food (boiled)

“food was boiled”

Underwent a boiling process. Food’s surface temperature reached at least 100℃ (at normal atmospheric pressure).

An instance of this process has a start and end time.

No claim about food’s “current” temperature.

food (frozen)

“food was frozen”

Underwent a complete freezing process at which point its internal temperature dropped below 0℃. etc.

29 of 40

FoodOn: curation rules

A parenthesized list of differentiae is permitted in food product label along with “core” name

Food varietal or common name is included in core name rather than differentiae

purple cauliflower 

❌ cauliflower (purple)

30 of 40

FoodOn: curation rules

Processing steps a food has been subjected to may be placed in order, for future semantic use.

fish stick (breaded, quick-frozen)

❌ fish stick (quick-frozen, breaded) unlikely, if retail product

prawn (cooked, frozen)

prawn (frozen, cooked) ← possible e.g. if sample from restaurant

Currently, FoodOn does not implement logic regarding process steps.
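
The label convention in these curation rules (a core name plus an optional, ordered, parenthesized list of differentiae) can be sketched with a small parser. This is an illustration of the convention only, not FoodOn tooling; the pattern and function names are assumptions.

```python
import re

# core name, optionally followed by "(differentia, differentia, ...)"
LABEL_PATTERN = re.compile(r"^(?P<core>[^()]+?)\s*(?:\((?P<diff>[^()]*)\))?$")


def parse_label(label):
    """Split a label into (core name, ordered list of differentiae)."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected label format: {label!r}")
    diff = match.group("diff")
    differentiae = [d.strip() for d in diff.split(",")] if diff else []
    return match.group("core").strip(), differentiae


def same_steps_any_order(label_a, label_b):
    """True if two labels share a core name and the same set of differentiae."""
    core_a, diff_a = parse_label(label_a)
    core_b, diff_b = parse_label(label_b)
    return core_a == core_b and sorted(diff_a) == sorted(diff_b)


if __name__ == "__main__":
    print(parse_label("fish stick (breaded, quick-frozen)"))
    print(same_steps_any_order("prawn (cooked, frozen)", "prawn (frozen, cooked)"))
```

Keeping the differentiae as an ordered list preserves the processing order for possible future semantic use.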

31 of 40

FoodOn semantics

"Any small swimming crustacean resembling a shrimp tends to be called one." - wikipedia:shrimp

"The Dendrobranchiata consist of prawns, including many species colloquially referred to as 'shrimp', such as the 'white shrimp', Litopenaeus setiferus." - wikipedia:prawn

shrimp (cooked, frozen)

FOODON:03308827

32 of 40

FoodOn semantics

And not

food cooked!

OR

food (frozen)

food (raw)

food (boiled)

food (thawed)

food freezing process

output of

food cooking process

output of

output of

food thawing process

output of

food (fresh)

output of only

process that maintains freshness

food material class

food process class

shrimp (cooked)

shrimp (frozen)

shrimp (cooked, frozen)

food (cooked)

To make a claim about the frozen / cold / warm / hot quality of an instance of a food product, provide a time-stamped observation.

is a

is a

food boiling process

is a

33 of 40

FoodOn structure

shrimp (cooked, frozen)

crustacean (frozen)

FOODON_03310904

crustacean (cooked)

FOODON_03310615

shrimp (frozen)

shrimp (whole or parts)

FOODON_03301673

shrimp (cooked)

34 of 40

FoodOn structure

shrimp (cooked, frozen)

crustacean (frozen)

crustacean (cooked)

shrimp (cooked)

shrimp (frozen)

shrimp (whole or parts)

shrimp food product

FOODON_00002239

food (cooked)

FOODON_00001181

seafood (frozen) FOODON_03305677

35 of 40

FoodOn structure

shrimp (cooked, frozen)

crustacean (frozen)

shrimp (cooked)

shrimp (frozen)

shrimp (whole or parts)

shrimp food product

food (cooked)

seafood (frozen)

crustacean (cooked)

food cooking process

FOODON:03450002

food (frozen)

FOODON:03302148

crustacean food product

FOODON:00001792

derives from

output of

shrimp food source

FOODON:00002239

About “derives from”, why not use “has part”?

Because we are reserving “has part” for anatomical relations.

36 of 40

FoodOn structure

shrimp (cooked, frozen)

crustacean (frozen)

shrimp (cooked)

shrimp (frozen)

shrimp (whole or parts)

shrimp food product

food (cooked)

seafood (frozen)

crustacean (cooked)

shrimp (food source)

food cooking process

food (frozen)

crustacean food product

seafood product

process

vertebrate animal

food product

crustacean (food source)

food transformation process

seafood or seafood product (us cfr)

food source

FOODON:03411564

food product by organism

invertebrate

product type, u.s. code of federal regulations, title 21

FOODON:03401270

meat, poultry, seafood or related product (us cfr)

foodon product type

FOODON:00001002

decapod (food source)

food freezing process

member of

output of

output of

derives from

derives from

37 of 40

LexMapr: future directions

  • Make LexMapr a more general purpose system for mining short text in multiple domains and mapping to relevant domain ontology terms

  • Assign confidence level to matched terms

  • Organize matched terms based on ontology hierarchy and relationships

  • Compile guidelines for annotating short biosample descriptions

  • Provide service online with JSON output format

  • Incorporate matching Wikipedia definitions and ontology synonyms into lexicons

38 of 40

LexMapr Installation (Linux and macOS)

Download & install Miniconda: https://conda.io/miniconda.html

conda create -n lexmapr lexmapr

(Press ‘return/Enter’ to proceed with installation)

Activate the conda environment:

source activate lexmapr

Your shell prompt should start with (lexmapr) if it is installed and activated

39 of 40

LexMapr Installation (Windows)

Download & install Miniconda: https://conda.io/miniconda.html

conda create -n lexmapr nltk inflection wikipedia python-dateutil

(Press ‘return/Enter’ to proceed with installation)

Activate the conda environment:

activate lexmapr

Download nltk data:

python -c "import nltk; nltk.download()"

On the window that pops up, select ‘All packages’ and click ‘Download’

40 of 40

LexMapr workshop

At Terminal

    • > ssh userXY@[IP]
    • > source activate lexmapr
      LexMapr operates in the bioconda environment that was set up on the server.
    • > less /data/input/demo.txt
      This is the text file of sample descriptions that LexMapr will analyze, one sample description per line. The input format requires a header on the first line, and the data must be comma-delimited.
    • > lexmapr /data/input/demo.txt -o testout.txt -f short
      The output will be in a short format: just the sample id, description, LexMapr coding, match status, and transformation applied fields.
    • > less testout.txt
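
A minimal sketch for preparing a comma-delimited input file with a first-line header, as the workshop steps require. The column names `sample_id` and `description` are hypothetical, and whether LexMapr accepts quoted fields (which the `csv` module produces for descriptions that contain commas) is an assumption not confirmed in this deck.

```python
import csv
import io


def build_input_csv(samples):
    """Return comma-delimited text with a header row for (id, description) pairs."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["sample_id", "description"])  # hypothetical header names
    for sample_id, description in samples:
        writer.writerow([sample_id, description])
    return buffer.getvalue()


if __name__ == "__main__":
    text = build_input_csv([("s1", "frz cooked shrimp"), ("s2", "quail, frozen")])
    print(text, end="")
```

The returned text can be written to a file and passed to the `lexmapr` command shown above.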