A Lexicon and Rule-Based Tool for Translating Short Biomedical Specimen Descriptions into Ontology Terms
The William Hsiao Lab
UBC Department of Pathology & British Columbia Centre for Disease Control
Gurinder Gosal
gurinder.gosal@bccdc.ca
Damion Dooley
Scientific Programmer
damion.dooley@bccdc.ca
Backgrounder
Open source!
LexMapr project goals
Tune ontology labels and synonyms to support common language
Transform sample text into n-gram combinations of tokens that are likely to match ontology term labels and synonyms
No word association machine learning!
Why not?
Short texts offer far fewer features than longer texts, which makes it hard to build robust machine learning systems on them.
cow paddock straw bedding_bovine (dairy)
Sus scrofa; intestine
Challenges
Challenge | Observations | Examples from dataset |
Synonyms | | feces; Horse Stool; Cattle Fecal |
Abbreviations/Acronyms | | frz rock lobster tails; frz. cooked shrimp; CSF; NW → North West or net weight; frozen → frz., fz, frz, froz, frzn, fzn |
Plurality | | sesame seeds; macadamia → macadamium |
Challenges …
Challenge | Observations | Examples from dataset |
Grammatical mistakes | | cucmbers → cucumbers; “breast” misspelled as “beast” in “chicken breast” |
Boundary detection | Boundary detection is comparatively difficult | “shrimp, white, farm raised, raw, frz”; [fresh grape] tomatoes or fresh [grape tomatoes]?; frozen shrimp vs shrimp, frz |
Challenges …
Challenge | Observations | Examples from dataset |
Multilingual phrases | | sambar (seasoning mix); Sambar (Indian food name) → lentil curry |
Contextual deficit and ambiguity | | “plain”, “ground” – (environmental and food domain terms) |
Term labelling in ontologies | | FoodOn: snack, ready-to-eat; FoodOn: chicken breast (sliced, ready to eat); FoodOn: lettuce vegetable food product |
Resources used
LexMapr pipeline
Punctuation
&!@_#(%;@! → &!@#@!
fish-meal → fish meal
quail, frozen → quail frozen
A token is a sequence of characters (usually between spaces or punctuation) that are grouped together as a useful semantic unit for processing.
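The punctuation step can be sketched as follows. This is a minimal illustration, not LexMapr's actual code: the set of separator characters and the function names `treat_punctuation` and `tokenize` are assumptions made for the example.

```python
import re

# Assumption: these characters only separate words in sample descriptions.
SEPARATORS = re.compile(r"[-_,;()]")


def treat_punctuation(text):
    """Replace separator punctuation with spaces and collapse whitespace."""
    spaced = SEPARATORS.sub(" ", text)
    return " ".join(spaced.split())


def tokenize(text):
    """Split a punctuation-treated description into whitespace-delimited tokens."""
    return treat_punctuation(text).split()


print(treat_punctuation("fish-meal"))
print(treat_punctuation("quail, frozen"))
print(tokenize("Sus scrofa; intestine"))
```

Symbols that should survive the step would simply be left out of the separator set.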
Singularization and spelling fix
Singularization | Convert to the singular form With required exceptions (for dealing with false positives) | sesame seeds |
Spelling fix | Using a misspelling lookup table. Future: pay attention to rdfs:label language variations. | turmaric → turmeric |
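A misspelling lookup table can be as simple as a dictionary. The sketch below assumes a hypothetical table `MISSPELLINGS` seeded with two examples from these slides; the real table and its format may differ.

```python
# Hypothetical lookup table; entries taken from examples in this deck.
MISSPELLINGS = {
    "turmaric": "turmeric",
    "cucmbers": "cucumbers",
}


def fix_spelling(tokens, table=MISSPELLINGS):
    """Replace known misspellings; tokens not in the table pass through unchanged."""
    return [table.get(token.lower(), token) for token in tokens]


print(fix_spelling(["turmaric", "powder"]))
```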
Normalization
Lowercase transform | Normalize terms to lower case. Future: unless capitalization is the norm (e.g. pH value) | Garlic Powder → garlic powder |
Synonymize | Using synonym and acronym lookup tables | stool → feces frz → frozen |
Multilingual phrases | Non English food names lookup table | ashwagandha → rennet in “ashwagandha powder” |
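The lowercase and synonym steps might look like the sketch below. The table `SYNONYMS` and the function `normalize` are illustrative names, and stripping a trailing period from abbreviations such as “frz.” is an assumption, not a documented LexMapr rule.

```python
# Hypothetical synonym/acronym table built from the examples above.
SYNONYMS = {
    "stool": "feces",
    "frz": "frozen",
}


def normalize(tokens, synonyms=SYNONYMS):
    """Lowercase each token, drop a trailing period, then apply the synonym table."""
    normalized = []
    for token in tokens:
        key = token.lower().rstrip(".")  # assumption: "frz." is treated like "frz"
        normalized.append(synonyms.get(key, key))
    return normalized


print(normalize(["Dog", "Stool"]))
print(normalize(["frz.", "lobster", "tails"]))
```

A lookup table for non-English food names could be applied in the same way, keyed on the normalized token.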
Term mapping
Multiple rules in place
Sample description | Matched terms | Ontology id |
smoked round scad | preservation by smoking round scad (food source) | FOODON_03470106 FOODON_03412481 |
Octopus, frz, raw | octopus (raw) preservation by freezing | FOODON_03310498 FOODON_03470136 |
Term mapping
Sample description | Matched terms | Ontology id |
sediment, stream | stream sediment | ENVO_00002127 |
sesame seeds, hulled | sesame seed (hulled) | FOODON_03304876 |
frz cooked shrimp | shrimp (cooked, frozen) | FOODON_03308827 |
frz. lobster tails | lobster tail (frozen) | FOODON_03305435 |
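One way to picture n-gram matching is the sketch below: it tries the longest token spans first and tests every ordering of each span against a label table. The tiny `LABELS` dictionary, the `max_n` limit, and the function name `match_terms` are assumptions for illustration; LexMapr's own rules are more extensive.

```python
from itertools import permutations

# Hypothetical label table; IDs are taken from tables in this deck.
LABELS = {
    "stream sediment": "ENVO_00002127",
    "fish meal": "FOODON_03301620",
    "soil": "ENVO_00001998",
}


def match_terms(tokens, labels=LABELS, max_n=4):
    """Match token spans (longest first, any word order) against ontology labels."""
    matches = []
    covered = set()  # token positions already used by a match
    for n in range(min(len(tokens), max_n), 0, -1):
        for start in range(len(tokens) - n + 1):
            span = set(range(start, start + n))
            if span & covered:
                continue
            for ordering in permutations(tokens[start:start + n]):
                candidate = " ".join(ordering)
                if candidate in labels:
                    matches.append((candidate, labels[candidate]))
                    covered |= span
                    break
    return matches


print(match_terms(["sediment", "stream"]))
print(match_terms(["fresh", "fish", "meal"]))
```

Capping `max_n` keeps the number of orderings small, since permutations grow factorially with span length.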
Stop words and singularization exceptions
Customized stop words removal
Selected stop words (the, a, an, etc.) are removed.
Others (with, from, etc.) are kept because they contribute semantic context.
Singularization exceptions
Generalized exception rules for the bio-sample domain.
E.g., no inflection if token ends with “us”, “ia”, “ta” etc.
Singularization exception terms lookup table based on the training dataset for the underlying domain
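The sketch below shows how stop word removal and the singularization exceptions could fit together. The rule list is a deliberately tiny stand-in for a real inflection library, and `EXCEPTION_TERMS` holds a hypothetical entry; only the stop words and the “us”, “ia”, “ta” suffixes come from the slide.

```python
import re

STOP_WORDS = {"the", "a", "an"}              # removed; "with", "from" are kept
NO_INFLECTION_SUFFIXES = ("us", "ia", "ta")  # generalized exception rule
EXCEPTION_TERMS = {"molasses"}               # hypothetical lookup-table entry

# Simplified singularization rules, tried in order (stand-in for a library).
RULES = [
    (re.compile(r"ies$"), "y"),
    (re.compile(r"([ti])a$"), r"\1um"),  # Latin-style plural, source of false positives
    (re.compile(r"ss$"), "ss"),
    (re.compile(r"s$"), ""),
]


def singularize(token, use_exceptions=True):
    """Singularize a token unless an exception rule or term applies."""
    if use_exceptions and (
        token in EXCEPTION_TERMS or token.endswith(NO_INFLECTION_SUFFIXES)
    ):
        return token
    for pattern, replacement in RULES:
        if pattern.search(token):
            return pattern.sub(replacement, token)
    return token


def preprocess(tokens):
    """Drop stop words, then singularize what remains."""
    return [singularize(t) for t in tokens if t not in STOP_WORDS]


print(singularize("macadamia", use_exceptions=False))
print(preprocess(["the", "sesame", "seeds", "from", "macadamia"]))
```

With the exceptions switched off, the Latin-style rule produces the “macadamium” false positive listed under Challenges.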
Semantic tagging and candidate terms
shrimp, tiny | tiny: Quality-Size |
Brush inside of dryer | brush: Equipment-OR-Device-OR-ManmadeObject dryer: Equipment |
Unmapped entities are tagged semantically when the constituents of a phrase do not match any ontology term
Tagged entities help enrich the resources: they are potential new candidates for resource terms
Semantic Tags |
……… |
BodyPart-OR-OrganicPart |
Equipment-OR-Device-OR-ManmadeObject |
Furniture |
Portion_FoodOrOther |
Preposition-Containment |
Preposition-HavingOrigin |
Preposition-Presence |
Quality-Age |
Quality-BodyRelated |
Quality-Color |
Quality-Content |
Quality-Direction |
Quality-Environment |
Quality-Ethnic |
Quality-Food |
Quality-Gender |
Quality-Generic |
Quality-Geo |
Quality-Grading |
Quality-Heat |
Quality-Heat-or-Food |
Quality-Odour |
Quality-PlantRelated |
Quality-Position |
Quality-Shape |
Quality-Size |
Quality-SpeciesRelated |
Quality-State |
Quality-Taste |
Quality-Texture |
Quality-TimeRelated |
Quality-Trait |
Structure-OR-Area |
Structure-OR-Area-OR-ManmadeObject |
TimeRelated |
Trademark |
Unit |
WaterBody |
……… |
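Semantic tagging of leftover tokens can be sketched with a lookup table. The table `SEMANTIC_TAGS` and the function `tag_unmapped` are hypothetical names; the two entries use tags from the list above.

```python
# Hypothetical token-to-tag table using tags from the list above.
SEMANTIC_TAGS = {
    "tiny": "Quality-Size",
    "brush": "Equipment-OR-Device-OR-ManmadeObject",
}


def tag_unmapped(tokens, mapped, tags=SEMANTIC_TAGS):
    """Tag tokens that did not map to an ontology term.

    Returns (tagged, candidates): tagged maps token -> semantic tag, and
    candidates lists tokens with no tag yet (possible new resource terms).
    """
    tagged, candidates = {}, []
    for token in tokens:
        if token in mapped:
            continue
        if token in tags:
            tagged[token] = tags[token]
        else:
            candidates.append(token)
    return tagged, candidates


print(tag_unmapped(["shrimp", "tiny"], mapped={"shrimp"}))
```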
A snapshot of mapping results
Sample Description | Matched Ontology Terms | Resource IDs | Rule Triggered |
Fish-meal | fish meal | FOODON_03301620 | Punctuation Treatment |
quail, frozen | preservation by freezing quail (food source) | FOODON_03470136 FOODON_03411346 | |
Soil | soil | ENVO_00001998 | Change of Case |
Mayonnaise | mayonnaise | FOODON_03301440 | |
Dog Stool | Canis lupus familiaris feces | NCBITaxon_9615 ENVO_00002003 | Synonym Usage |
Cat | Felis catus | NCBITaxon_9685 | |
Pecans | pecan | FOODON_03304764 | Singularization |
Walnuts | walnut | FOODON_03316466 | |
A snapshot of mapping results…
Sample Description | Matched Ontology Terms | Resource IDs | Rule Triggered |
homo spaiens; Stool | homo sapiens feces | NCBITaxon_9606 UBERON_0001988 | Spelling Correction Treatment |
Sus scrofa; intestin | intestine sus scrofa | UBERON_0000160 NCBITaxon_9823 | |
spice mix | spice mixture | FOODON_03304292 | Abbreviation-Acronym Treatment |
frz frog legs | frog leg (frozen) | FOODON_03305167 | |
dried soup | preservation by dehydration or drying soup food product | FOODON_03470116 FOODON_00002257 | Suffix Addition: “food product” appended to the input |
Pork | pork meat food product | FOODON_00001038 | Suffix Addition: “meat food product” appended to the input |
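The suffix addition rule can be pictured as retrying a failed lookup with a suffix appended. This is a sketch under assumptions: the `LABELS` table holds only two labels from the rows above, and `match_with_suffix` is an illustrative name rather than a LexMapr function.

```python
# Hypothetical label table; labels and IDs come from the rows above.
LABELS = {
    "soup food product": "FOODON_00002257",
    "pork meat food product": "FOODON_00001038",
}
SUFFIXES = ["food product", "meat food product"]


def match_with_suffix(phrase, labels=LABELS, suffixes=SUFFIXES):
    """Try an exact label match, then retry with each suffix appended.

    Returns (label, ontology_id, suffix_used), or None if nothing matches.
    """
    if phrase in labels:
        return phrase, labels[phrase], None
    for suffix in suffixes:
        candidate = f"{phrase} {suffix}"
        if candidate in labels:
            return candidate, labels[candidate], suffix
    return None


print(match_with_suffix("pork"))
print(match_with_suffix("soup"))
```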
Text sample annotation tools
Zooma | Text lookup → Selectable OLS ontologies |
Monarch Text Annotator | Text lookup → Genotype/phenotype |
EXTRACT 2.0 (browser add-on+API) | Text lookup → Genotype/phenotype |
OntoMaton (Google sheets) | Phrase to term lookup → many choices. Uses all of BioPortal or OLS or LOV. |
Webulous (Google sheets) | Phrase to term lookup → sparse or many choices. One or all BioPortal ontologies. |
Tools: Extract 2.0
fish meal |
quail, frozen |
Soil |
Garlic Powder |
Stool |
Cat |
Pecans |
sesame seeds |
Mackeral |
homo spaiens; Stool |
Sus scrofa; intestine |
spice mix |
frz frog legs |
dried soup |
sediment, stream |
sesame seeds, hulled |
frz cooked shrimp |
... |
ENVO:Island ENVO:Sand ENVO:Sediment ENVO:Sewage ENVO:Soil ENVO:Waste water gene:Homo sapiens:SPICE1 NCBITaxon:Allium sativum NCBITaxon:Bos taurus NCBITaxon:Canis lupus familiaris NCBITaxon:Coriandrum sativum NCBITaxon:Lobster NCBITaxon:Meleagris gallopavo NCBITaxon:Pistacia vera NCBITaxon:Prunus dulcis NCBITaxon:Rusa unicolor NCBITaxon:Scophthalmus maximus NCBITaxon:Shorea almon NCBITaxon:Spinacia oleracea NCBITaxon:Sus scrofa NCBITaxon:Sus scrofa scrofa NCBITaxon:Withania somnifera NCBITaxon:Zea mays subsp. mays BTO:Intestine |
Demo text: 36 phrases → Genotype/phenotype: DOID, GO, ENVO, MP, NCBITaxon ...
Fish meal quail, frozen Soil Garlic Powder Stool Cat Pecans sesame seeds Mackeral homo spaiens; Stool Sus scrofa; intestine spice mix frz frog legs dried soup sediment, stream sesame seeds, hulled frz cooked shrimp Pork ashwagandha powder sambar powder coriander, fresh Bathroom Countertop frz. Fish frz. lobster tails biological tissue and/or fluid -canis lupus familiaris raw organic baby spinach chili & lime flavored corn snacks ready to eat deli meat oven roasted turkey cow paddock straw bedding_bovine(dairy) textered soya protein made of flour for cooking sewage sand island waste water toasted soy grits roasted salted pistachio salmon turbot butterfish smoked whole raw shelled almond
Tools: Monarch Text Annotator
→ Genotype/phenotype: GO, NCBIGene, MONDO, PO, WBPheno...
Fish-meal quail, frozen Soil Garlic Powder Stool Cat Pecans sesame seeds Mackeral homo spaiens; Stool Sus scrofa; intestine spice mix frz frog legs dried soup sediment, stream sesame seeds, hulled frz cooked shrimp Pork ashwagandha powder sambar powder coriander, fresh Bathroom Countertop frz. Fish frz. lobster tails biological tissue and/or fluid canis lupus familiaris raw organic baby spinach chili & lime flavored corn snacks ready to eat deli meat oven roasted turkey cow paddock straw bedding_bovine (dairy) textered soya protein made of flour for cooking sewage sand island waste water toasted soy grits roasted salted pistachio salmon turbot butterfish smoked whole raw shelled almond
Note: “-canis” blocks “canis lupus familiaris”
text | category | term |
sesame | gene | KCNJ10 (NCBIGene:3766) |
sesame | | Sesamum indicum (NCBITaxon:4182) |
sesame | gene | Hira (NCBIGene:31680) |
sesame | disease | EAST syndrome (MONDO:0013005) |
seeds | | seed (OBO:PO_0009010) |
Tools: EBI Zooma
misspelling → skipped NCBITaxon_9606
no singularization → FoodOn “sesame seed” missed?
a virus!
sesame seeds | 11850 - sesame seeds (efsa foodex2) | |
| sesame seeds and products thereof | |
| Gros IIot Sesame | |
| Townland of Seeds | |
| Sesame necrotic mosaic virus | |
| Iranian Sesame phyllody phytoplasma | |
homo spaiens; Stool | Homo | |
| feces collection | |
| Homo heidelbergensis | |
| Skerry of Stool | |
| Bogue Homo | |
| feces |
“... -canis lupus familiaris” blocks “-cannis”
Tools: LexMapr
Biosample domain: have disease & gene match be an option.
Fish-meal quail, frozen Soil Garlic Powder Stool Cat Pecans sesame seeds Mackeral homo spaiens; Stool Sus scrofa; intestine spice mix frz frog legs dried soup sediment, stream sesame seeds, hulled frz cooked shrimp Pork ashwagandha powder sambar powder coriander, fresh Bathroom Countertop frz. Fish frz. lobster tails biological tissue and/or fluid -canis lupus familiaris raw organic baby spinach chili & lime flavored corn snacks ready to eat deli meat oven roasted turkey cow paddock straw bedding_bovine (dairy) textered soya protein made of flour for cooking sewage sand island waste water toasted soy grits roasted salted pistachio salmon turbot butterfish smoked whole raw shelled almond
BioSample
specimen
OBI:0100051
specimen extraction geographic location
GENEPIO:0001656
environmental material
(mud, soil, wastewater)
ENVO:00010483
food product type
FOODON:03400361
specimen source substance
GENEPIO:0000134
subject organism
GENEPIO:00001578
Petrous bone extraction for DNA
photo credit: https://www.nature.com/news/stop-hoarding-ancient-bones-plead-archaeologists-1.22445
[Diagram relations: “is a”, “located in”, “derives from”; alternatives joined by OR]
BioSample draft
longitude
OBI:0001621
latitude
OBI:0001620
specimen source substance
GENEPIO:0000134
specimen extraction latitude and longitude
GENEPIO:0000134
is about
specimen extraction country*
GENEPIO:0000118
(ENVO:00000009)
specimen extraction city*
GENEPIO:0001785
(ENVO:00000856)
Immediate sampling site
(building wall, plumbing drain)
ENVO | AGRO
biome
(forest, grassland ...)
ENVO:00000428
built environment
(hospital, food kiosk…)
(Anthropogenic geographic feature) ENVO:00000002
physiographic feature
(watershed, desert, beach...)
ENVO:00000191
specimen extraction geographic location
GENEPIO:0001656
[Diagram relations: “located in” links the location terms above]
*Where specimen is obtained, not where processing, storage, etc. occur. Precomposed “specimen x extraction” terms are used to facilitate data transfer.
has component
environmental parameter
temperature, depth
BioSample draft
specimen from organism
OBI:0001479
human
NCBITaxon:9606
plant structure
PO:0009011
subject anatomical site
GENEPIO:0000025
subject age
GENEPIO:0001775
(OBI:0001169)
subject sex
GENEPIO:0000031
(PATO:0001894)
epidemiology case model
GENEPIO:0001613
has taxonomic id
anatomical structure
UBERON:0000061
is about
is about
specimen
OBI:0100051
specimen source substance
GENEPIO:0000134
derives from
produced by
derives from
OR
is a
clinical
is a
part of
subject taxonomy
GENEPIO:0001567
has quality
symptom
HP:0000118
subject disease
GENEPIO:0001617
has quality
subject demographic
GENEPIO:0001683
subject body product
GENEPIO:0000028
FoodOn semantics
shrimp (cooked, frozen)
FOODON:03308827
frz cooked shrimp
short biosample description
LexMapr
shrimp (cooked) FOODON_03301481
shrimp (frozen)
FOODON_03301169
Cooked (boiled, grilled, microwaved?) then frozen?
Or frozen, (then thawed?) then cooked?
Now hot? Frozen? Or room temperature?
FoodOn semantics
shrimp (cooked, frozen)
FOODON:03308827
shrimp (cooked)
shrimp (frozen)
and
food (boiled) | |
✔ “food was boiled” | Underwent a boiling process. Food’s surface temperature reached at least 100℃ (at normal atmospheric pressure). An instance of this process has a start and end time. No claim about food’s “current” temperature. |
food (frozen) | |
✔ “food was frozen” | Underwent a complete freezing process at which point its internal temperature dropped below 0℃. etc. |
FoodOn: curation rules
A parenthesized list of differentiae is permitted in food product label along with “core” name
Food varietal or common name is included in core name rather than differentiae
✔ purple cauliflower
❌ cauliflower (purple)
FoodOn: curation rules
Processing steps a food has been subjected to may be placed in order, for future semantic use.
✔ fish stick (breaded, quick-frozen)
❌ fish stick (quick-frozen, breaded) ← unlikely, if retail product
✔ prawn (cooked, frozen)
✔ prawn (frozen, cooked) ← possible e.g. if sample from restaurant
Currently, FoodOn does not implement logic regarding process steps.
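The label convention (a core name followed by an optional parenthesized list of differentiae) is easy to split mechanically. The sketch below is an illustration only; `parse_label` and its regular expression are assumptions, not part of FoodOn tooling.

```python
import re

# Core name, then an optional "(a, b, ...)" list of differentiae.
LABEL_PATTERN = re.compile(r"^(?P<core>[^()]+?)\s*(?:\((?P<diff>[^()]*)\))?$")


def parse_label(label):
    """Split a FoodOn-style label into (core name, ordered differentiae)."""
    match = LABEL_PATTERN.match(label.strip())
    if match is None:
        raise ValueError(f"unexpected label format: {label!r}")
    core = match.group("core").strip()
    diff = match.group("diff")
    differentiae = [d.strip() for d in diff.split(",")] if diff else []
    return core, differentiae


print(parse_label("fish stick (breaded, quick-frozen)"))
print(parse_label("purple cauliflower"))
```

Keeping the differentiae as an ordered list preserves the processing order, should that order carry meaning later.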
FoodOn semantics
"Any small swimming crustacean resembling a shrimp tends to be called one." - wikipedia:shrimp
"The Dendrobranchiata consist of prawns, including many species colloquially referred to as 'shrimp', such as the 'white shrimp', Litopenaeus setiferus." - wikipedia:prawn
shrimp (cooked, frozen)
FOODON:03308827
FoodOn semantics
And not
food cooked!
OR
food (frozen)
food (raw)
food (boiled)
food (thawed)
food freezing process
output of
food cooking process
output of
output of
food thawing process
output of
food (fresh)
output of only
process that maintains freshness
food material class
food process class
shrimp (cooked)
shrimp (frozen)
shrimp
(cooked, frozen)
food (cooked)
To make a claim about the frozen / cold / warm / hot quality of an instance of a food product, provide a time-stamped observation.
is a
is a
food boiling
process
is a
FoodOn structure
shrimp (cooked, frozen)
crustacean (frozen)
FOODON_03310904
crustacean (cooked)
FOODON_03310615
shrimp (frozen)
shrimp (whole or parts)
FOODON_03301673
shrimp (cooked)
FoodOn structure
shrimp (cooked, frozen)
crustacean (frozen)
crustacean (cooked)
shrimp (cooked)
shrimp (frozen)
shrimp (whole or parts)
shrimp food product
FOODON_00002239
food (cooked)
FOODON_00001181
seafood (frozen) FOODON_03305677
FoodOn structure
shrimp (cooked, frozen)
crustacean (frozen)
shrimp (cooked)
shrimp (frozen)
shrimp (whole or parts)
shrimp food product
food (cooked)
seafood (frozen)
crustacean (cooked)
food cooking process
FOODON:03450002
food (frozen)
FOODON:03302148
crustacean food product
FOODON:00001792
derives from
output of
shrimp food source
FOODON:00002239
About “derives from”, why not use “has part”?
Because we are reserving “has part” for anatomical relations.
FoodOn structure
shrimp (cooked, frozen)
crustacean (frozen)
shrimp (cooked)
shrimp (frozen)
shrimp (whole or parts)
shrimp food product
food (cooked)
seafood (frozen)
crustacean (cooked)
shrimp (food source)
food cooking process
food (frozen)
crustacean food product
seafood product
process
vertebrate animal
food product
crustacean
(food source)
food transformation process
seafood or seafood product (us cfr)
food source
FOODON:03411564
food product by organism
invertebrate
product type, u.s. code of federal regulations, title 21
FOODON:03401270
meat, poultry, seafood or related product (us cfr)
foodon product type
FOODON:00001002
decapod (food source)
food freezing process
member of
output of
output of
derives from
derives from
LexMapr: future directions
LexMapr Installation (Linux and macOS)
Download & install Miniconda: https://conda.io/miniconda.html
conda create -n lexmapr lexmapr
(Press ‘return/Enter’ to proceed with installation)
Activate the conda environment:
source activate lexmapr
Your shell prompt should start with (lexmapr) if it is installed and activated
LexMapr Installation (Windows)
Download & install Miniconda: https://conda.io/miniconda.html
conda create -n lexmapr nltk inflection wikipedia python-dateutil
(Press ‘return/Enter’ to proceed with installation)
Activate the conda environment:
activate lexmapr (with conda 4.4 or later, conda activate lexmapr also works)
Download nltk data:
python -c "import nltk; nltk.download()"
On the window that pops up, select ‘All packages’ and click ‘Download’
LexMapr workshop
At Terminal