1 of 10

CINECA WP3 - Text Mining Integrated Pipeline - High Level Schematic Diagram

Coming soon.

2 of 10

Models

3 of 10

LexMapr TM-Pipeline for CINECA

4 of 10

SORTA (by Chao Pang)

5 of 10

Simplified diagram of HES-SO/SIB text mining workflow

Labels recovered from the diagram:

  • CINECA cohort free text
  • Spelling corrector
  • Normalization pipeline
      - Exact match module: match to exactly one concept
      - Learning to rank module: getting more than two candidates from MetaMap, then learning to rank (new candidates order)
  • Resources: Dictionary*, UMLS, Lookup table, MetaMap
  • Decision point: still not normalized?
  • Free text normalized (CUI)

* N2C2, MedMentions data
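One plausible reading of the diagram is a cascade: correct spelling, try an exact lookup, and only rank candidates when the exact lookup fails. The sketch below illustrates that control flow under loud assumptions. It is not the HES-SO/SIB implementation: `LOOKUP` is a toy table, `correct_spelling` is a hypothetical corrector built on `difflib`, and `rank_candidates` uses plain string similarity as a stand-in for both MetaMap candidate generation and a trained learning-to-rank model.

```python
from difflib import SequenceMatcher, get_close_matches

# Toy lookup table (assumption): normalized term -> UMLS CUI.
LOOKUP = {
    "headache": "C0018681",
    "back pain": "C0004604",
}


def correct_spelling(text: str) -> str:
    """Hypothetical corrector: snap the phrase to the closest known term, if any."""
    cleaned = text.strip().lower()
    close = get_close_matches(cleaned, list(LOOKUP), n=1, cutoff=0.8)
    return close[0] if close else cleaned


def exact_match(text: str):
    """Exact match module: return one CUI or None."""
    return LOOKUP.get(text)


def rank_candidates(text: str) -> list:
    """Stand-in for MetaMap + learning to rank: order CUIs by string similarity."""
    scored = [
        (SequenceMatcher(None, text, term).ratio(), cui)
        for term, cui in LOOKUP.items()
    ]
    return [cui for _score, cui in sorted(scored, reverse=True)]


def normalize(text: str) -> list:
    """Return a single CUI on an exact hit, else a ranked candidate list."""
    corrected = correct_spelling(text)
    cui = exact_match(corrected)
    if cui is not None:
        return [cui]
    # Still not normalized: fall back to candidate ranking.
    return rank_candidates(corrected)
```

`normalize` handles one phrase at a time, so a string such as "HEADACHE, BACK PAINS" would need to be split on commas first.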

6 of 10

API concept

7 of 10

Concept type:

  • disease
  • drug
  • gender
  • procedure

Model:

  • HES-SO/SIB
  • LexMapr
  • SORTA
  • Zooma/EBI

Input free text

Load free text/semi-structured text from local file
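As an illustration of the input side, the sketch below validates a request against the concept types and models listed on this slide and loads free text from a local file. The class name, field names, and the one-entry-per-line file layout are assumptions made for the example, not part of the CINECA API.

```python
from dataclasses import dataclass
from pathlib import Path

# Values taken from the slide.
CONCEPT_TYPES = {"disease", "drug", "gender", "procedure"}
MODELS = {"HES-SO/SIB", "LexMapr", "SORTA", "Zooma/EBI"}


@dataclass(frozen=True)
class NormalizationRequest:
    """Hypothetical request shape: free text plus concept type and model."""
    free_text: str
    concept_type: str
    model: str

    def __post_init__(self):
        if self.concept_type not in CONCEPT_TYPES:
            raise ValueError(f"unknown concept type: {self.concept_type!r}")
        if self.model not in MODELS:
            raise ValueError(f"unknown model: {self.model!r}")


def load_requests(path, concept_type: str, model: str) -> list:
    """Build one request per non-empty line of a local text file (assumed layout)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [
        NormalizationRequest(line.strip(), concept_type, model)
        for line in lines
        if line.strip()
    ]
```

Validating at construction time means an unsupported concept type or model fails before any text is sent to a pipeline.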

API concept

8 of 10

Output: Normalized free text

Web API

Free text: HEADACHE, BACK PAINS

Concept type: Disease

Model: HES-SO/SIB

Normalization pipeline

API input/output example

Input:

  • Free text
  • Concept type
  • Model

Output:

Output ontology   Concept code   Concept name      Normalization score
UMLS              C0018681       Headache          0.5958
UMLS              C0004604       Back Pain         0.4882
UMLS              C4553197       Headache, CTCAE   0.3011
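A client consuming this output might keep only mappings above a score cutoff. The sketch below encodes the three rows from the example and filters them. The `Mapping` class, its field names, and the idea of a cutoff are assumptions for illustration; the slides do not specify a response schema or a threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical record for one output row."""
    ontology: str
    code: str
    name: str
    score: float


# Rows copied from the example output for "HEADACHE, BACK PAINS".
EXAMPLE_OUTPUT = [
    Mapping("UMLS", "C0018681", "Headache", 0.5958),
    Mapping("UMLS", "C0004604", "Back Pain", 0.4882),
    Mapping("UMLS", "C4553197", "Headache, CTCAE", 0.3011),
]


def filter_mappings(mappings, min_score: float) -> list:
    """Keep mappings at or above min_score, highest score first."""
    kept = [m for m in mappings if m.score >= min_score]
    return sorted(kept, key=lambda m: m.score, reverse=True)
```

With a hypothetical cutoff of 0.4, only the Headache and Back Pain rows remain.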

9 of 10

API input/output example (LexMapr)

10 of 10

ZOOMA

API input/output example: ZOOMA

Input:

Label      Type
diabetes   disease

Curated data sources
Ontologies

Output:

Term Type   Term Value   Ontology Class Label   Mapping Confidence   Ontology Class ID   Source
disease     diabetes     diabetes mellitus      High                 EFO_0000400         https://www.ebi.ac.uk/gxa
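The same lookup could be scripted against ZOOMA's REST service. The sketch below only builds the query URL and flattens one annotation record; it makes no network call. The endpoint URL, the `propertyValue`/`propertyType` parameters, and the response field names (`annotatedProperty`, `semanticTags`, `confidence`) are assumptions recalled from the public ZOOMA service and should be checked against the current EBI documentation before use.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ZOOMA annotate service; verify before relying on it.
ZOOMA_ANNOTATE = "https://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate"


def build_query(value: str, term_type: str = None) -> str:
    """Build an annotate URL for a term value and an optional term type."""
    params = {"propertyValue": value}
    if term_type:
        params["propertyType"] = term_type
    return f"{ZOOMA_ANNOTATE}?{urlencode(params)}"


def summarize(annotation: dict) -> dict:
    """Flatten one annotation record (assumed layout) into table-like fields."""
    prop = annotation.get("annotatedProperty", {})
    tags = annotation.get("semanticTags", [])
    return {
        "term_type": prop.get("propertyType"),
        "term_value": prop.get("propertyValue"),
        "confidence": annotation.get("confidence"),
        # Keep only the last path segment of each class IRI, e.g. EFO_0000400.
        "class_ids": [t.rstrip("/").rsplit("/", 1)[-1] for t in tags],
    }
```

Keeping URL construction and record flattening separate from the HTTP call makes both easy to check offline.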