LLM-based Approaches for Biomedical Named Entity Recognition and Relational Extraction
Large Language Models (LLMs) - Introduction
LLMs
(L)LMs - History
Parallel computation on GPUs for faster learning, more model parameters, and more training data
Capable of performing more complex tasks and problem-solving compared to PLMs
LLM architecture - Transformers
[Figure: the encoder-decoder Transformer. Input tokens go through embedding + positional encoding and a stack of six encoder blocks; output tokens go through their own embedding + positional encoding and six decoder blocks, followed by a linear + softmax layer that yields the next-token distribution.]
The reference problem
By the way, smartphones do not use transformers for this.
How is this problem tackled?
Text vectorization/embedding + next-token prediction via the self-attention mechanism
If a word has a single, definite meaning, we can rely on its (static) embedding.
What about the (frequent) case in which this is not true?
Solution: move the word's representation towards the more «contextually» similar words.
Vaswani et al., Attention Is All You Need
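A minimal NumPy sketch of the scaled dot-product attention behind this idea (the toy shapes and random weights are illustrative only): each token's vector is re-weighted towards the tokens it is most contextually similar to.

# Scaled dot-product attention on a toy 4-token, 8-dim example.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                    # token embeddings (+ positional encoding)
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v            # queries, keys, values
scores = Q @ K.T / np.sqrt(K.shape[-1])        # pairwise token similarities
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
contextual = weights @ V                       # each row: context-aware token representation
print(contextual.shape)                        # (4, 8)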
1. LLMs predictably get more capable with increasing investment, even without targeted innovation
2. Specific important behaviors in LLMs tend to emerge unpredictably as a byproduct of increasing investment
4. There are no reliable techniques for steering the behavior of LLMs
Techniques for steering the behavior of LLMs:
Develop a dedicated model for a precise task
Building an LLM from scratch is challenging. Steps:
Concerns:
Techniques for steering the behavior of LLMs:
Fine-tuning
LLM fine-tuning is the process of taking pre-trained models and further training them on smaller, specific datasets to refine their capabilities and improve performance in a particular task or domain.
Domain specific knowledge base(s)
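A hedged fine-tuning sketch using the Hugging Face Trainer API; the base model name and the domain_train.csv file (assumed to have text/label columns) are illustrative assumptions, not a prescribed setup.

# Fine-tune a pre-trained model on a small domain-specific dataset.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "dmis-lab/biobert-base-cased-v1.2"     # assumption: any suitable base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Hypothetical CSV with "text" and integer "label" columns.
ds = load_dataset("csv", data_files={"train": "domain_train.csv"})
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, padding="max_length"), batched=True)

args = TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=ds["train"]).train()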
Techniques for steering the behavior of LLMs:
Model prompting
Message: Support has been terrible for 2 weeks...
Sentiment: Negative
###
Message: I love your API, it is simple and so fast!
Sentiment: Positive
###
Message: GPT-J has been released 2 months ago.
Sentiment: Neutral
###
Message: The reactivity of your team has been amazing, thanks!
Sentiment:
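A minimal sketch of issuing the few-shot sentiment prompt above from Python; call_llm is a hypothetical stand-in for whichever completion endpoint and model are actually used.

# Build the few-shot prompt shown above and send it to an LLM.
EXAMPLES = [
    ("Support has been terrible for 2 weeks...", "Negative"),
    ("I love your API, it is simple and so fast!", "Positive"),
    ("GPT-J has been released 2 months ago.", "Neutral"),
]

def build_prompt(message: str) -> str:
    shots = "\n###\n".join(f"Message: {m}\nSentiment: {s}" for m, s in EXAMPLES)
    return f"{shots}\n###\nMessage: {message}\nSentiment:"

prompt = build_prompt("The reactivity of your team has been amazing, thanks!")
# sentiment = call_llm(prompt).strip()   # expected completion: "Positive"
print(prompt)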
Instruction-prompting a general purpose model
A prompt is natural language text describing the task that an AI should perform.
Prompting is the practice of writing instructions or requests for an LLM.
Prompt engineering
Prompt engineering is the process of structuring or crafting an instruction in order to produce the best possible output from an LLM.
Multiple distinct prompt engineering techniques have been published:
CoT, ToT, zero/few-shot learning, RAG, Graph-RAG, KG-RAG, …
Chain-of-Thought
Solve a problem as a series of intermediate steps before giving a final answer
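A hedged sketch contrasting a plain prompt with a chain-of-thought prompt; the worked example and the call_llm helper are illustrative assumptions.

# Chain-of-thought prompting: show (or request) intermediate reasoning
# steps before the final answer.
question = "A lab has 3 plates with 12 samples each; 7 samples failed QC. How many remain?"

# For contrast: a direct prompt with no reasoning steps.
plain_prompt = f"Q: {question}\nA:"

cot_prompt = (
    "Q: A box holds 4 racks of 6 tubes; 5 tubes are empty. How many tubes are full?\n"
    "A: 4 * 6 = 24 tubes in total. 24 - 5 = 19 tubes are full. The answer is 19.\n\n"
    f"Q: {question}\n"
    "A: Let's think step by step."
)
# answer = call_llm(cot_prompt)   # expected to reason: 3*12 = 36, 36 - 7 = 29 -> 29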
Tree-of-Thought
Explore multiple reasoning paths organized as a tree of intermediate steps, evaluating and backtracking over branches before committing to a final answer
In-context learning techniques
In-context learning refers to a model's ability to temporarily "learn" from examples and instructions provided in the prompt, without any update of its weights.
Zero-shot learning: prompt an LLM without any examples, attempting to take advantage of the reasoning patterns it has gleaned during pre-training
Identify airlines like this - [#AIRLINE_NAME_1] from the following tweet: "SouthwestAir Just got companion pass and trying to add companion"
Few-shot learning: a technique whereby we prompt an LLM with several concrete examples of task performance
Given the following tweets and their corresponding airlines: 1) SouthwestAir bags fly free ['Southwest Airlines'] ; 2) Jet Blue I don't know- no one would tell me where they were coming from ['JetBlue Airways'] – Please extract the airline(s) from the following tweet: "SouthwestAir Just got companion pass and trying to add companion" Using the following format - ['#AIRLINE_NAME_1] for one airline or ['#AIRLINE_NAME_1, #AIRLINE_NAME_2...] for multiple airlines.
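A sketch of the zero-/few-shot extraction prompts above, plus parsing of the bracketed airline list the model is asked to return; call_llm is again a hypothetical completion function.

# Zero-shot vs. few-shot airline extraction, with simple output parsing.
import re

tweet = "SouthwestAir Just got companion pass and trying to add companion"

zero_shot = f'Identify airlines like this - [#AIRLINE_NAME_1] from the following tweet: "{tweet}"'

few_shot = (
    "Given the following tweets and their corresponding airlines:\n"
    "1) SouthwestAir bags fly free ['Southwest Airlines']\n"
    "2) Jet Blue I don't know- no one would tell me where they were coming from ['JetBlue Airways']\n"
    f'Please extract the airline(s) from the following tweet: "{tweet}"\n'
    "Using the format ['#AIRLINE_NAME_1'] or ['#AIRLINE_NAME_1', '#AIRLINE_NAME_2', ...]."
)

def parse_airlines(completion: str) -> list[str]:
    """Pull quoted names out of a bracketed list like ['Southwest Airlines']."""
    return re.findall(r"'([^']+)'", completion)

# completion = call_llm(few_shot)
print(parse_airlines("['Southwest Airlines']"))   # ['Southwest Airlines']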
Retrieval Augmented Generation (RAG)
Two key steps for this approach:
• Retrieval: extract the relevant data ("context") that we want the LLM to answer questions about, e.g. from databases
• Augmented Generation: feed the LLM's context window with prompt + context; the model then generates a response grounded in that context (see the sketch below)
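A minimal RAG sketch covering these two steps; embed and call_llm are hypothetical stand-ins for an embedding model and a completion endpoint.

# (1) retrieve the most relevant passages, (2) prepend them to the prompt.
import numpy as np

def retrieve(question: str, passages: list[str], embed, k: int = 3) -> list[str]:
    q = embed(question)
    scored = sorted(passages,
                    key=lambda p: float(np.dot(q, embed(p))),   # dot-product similarity
                    reverse=True)
    return scored[:k]

def answer(question: str, passages: list[str], embed, call_llm) -> str:
    context = "\n".join(retrieve(question, passages, embed))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)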
(K)Graph-RAG
[Figure: the question is translated into a SPARQL query over a KG to retrieve a subgraph (triples) with factual relationships, which is then provided to the LLM as context.]
And furthermore: re-ranking, Agents, …
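One possible implementation of the KG retrieval step using the SPARQLWrapper library; the endpoint URL, entity IRI, and query shape are illustrative assumptions rather than the pipeline of any specific tool.

# Fetch a small subgraph of factual triples about an entity via SPARQL,
# then verbalize it as context for the LLM.
from SPARQLWrapper import SPARQLWrapper, JSON

def fetch_subgraph(endpoint: str, entity_iri: str, limit: int = 50) -> list[tuple[str, str, str]]:
    sparql = SPARQLWrapper(endpoint)
    sparql.setReturnFormat(JSON)
    sparql.setQuery(f"""
        SELECT ?p ?o WHERE {{ <{entity_iri}> ?p ?o }} LIMIT {limit}
    """)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(entity_iri, r["p"]["value"], r["o"]["value"]) for r in rows]

# triples = fetch_subgraph("https://example.org/sparql", "https://example.org/entity/miR-125b")
# context = "\n".join(f"{s} {p} {o}" for s, p, o in triples)   # feed this to the LLM prompt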
Ontology-Agnostic Approaches for RE
Key Properties:
Two main categories:
Modular RE techniques
2-step pipeline: (Named) Entity Recognition → Relation Extraction (sketched below)
Step 1 (NER) - Input: "Marie Curie discovered radium." → Output: "[Marie Curie]"; "[radium]"
Step 2 (RE) - Input: "Marie Curie discovered radium."; "[Marie Curie]"; "[radium]" → Output: "[Marie Curie] | [discovered] | [radium]"
Ontology-Agnostic Approaches & Examples:
seq2seq RE techniques
«One-shot» (single-step) extraction of RDF-like triples <subject, predicate, object>.
Input: "Marie Curie discovered radium." → Output: "[Marie Curie] | [discovered] | [radium]"
Ontology-Agnostic Approaches & Examples:
Can we combine ontologies and LLMs to help with these tasks?
Text extraction on the biomedical literature is hard
How can we curate structured data from unstructured text?
Ontologies are ubiquitous in the life sciences
[Figure (obofoundry.org): ontology adoption across the life sciences: ~15k citations/year, 1 billion annotations, 10-1000 billion instances, millions of environmental samples annotated; uses include rare disease registries, variant prioritization, and analyzing single-cell seq data.]
Ontologies help us integrate data within a consistent structure. However, …
Can we translate unstructured scientific text directly into arbitrary knowledge schemas?
Can those schemas match the data models used in other resources, like knowledge graphs?
What if we need to:
Can LLMs help?
Ontologies structure data but don’t do the work of structuring data for us
[Figure: two knowledge bases, each structured with ontologies, can still expose incompatible schemas.]
LinkML: A schema language grounded in ontologies
JSON-Schema
ShEx, SHACL
JSON-LD Contexts
Python Dataclasses
OWL
SQL DDL
TSVs
Create data models in simple YAML files, annotated using ontologies
Compile to other frameworks
Pluggable architecture for grounding
[Figure: pluggable grounding. A recipe.linkml.yaml schema (Recipe: title, ingredients, steps; Step: utensil, inputs, …) plus the recipe text is passed to the LLM, which returns a populated Recipe object (Title: Scallion cream…).]
LinkML describes the domain of interest in a textual manner →
translate LinkML schemas into a list of prompts to be issued to an LLM
Extract the following text into the following key-value pairs:
label: <name of the recipe>
description: <text description>
categories: <semi-colon separated list of categories>
ingredients: <semi-colon separated list of ingredients>
steps: <semi-colon separated list of steps>
Here is the recipe:
<RECIPE TEXT>
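A hedged sketch (not the actual OntoGPT implementation) of how a LinkML-like class definition can be turned into the key-value extraction prompt shown above; the slot dictionary is a hand-written stand-in for a parsed schema.

# Turn a class's slots and descriptions into a SPIRES-style prompt.
RECIPE_SLOTS = {
    "label": "name of the recipe",
    "description": "text description",
    "categories": "semi-colon separated list of categories",
    "ingredients": "semi-colon separated list of ingredients",
    "steps": "semi-colon separated list of steps",
}

def schema_to_prompt(slots: dict[str, str], text: str) -> str:
    lines = ["Extract the following text into the following key-value pairs:"]
    lines += [f"{slot}: <{desc}>" for slot, desc in slots.items()]
    lines += ["", "Here is the recipe:", text]
    return "\n".join(lines)

print(schema_to_prompt(RECIPE_SLOTS, "<RECIPE TEXT>"))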
OntoGPT: Ontologies and LLMs
Python libraries and command line toolkit
Components:
SPIRES for Text Extraction
Inputs:
Process:
Output:
Recursive Interrogation
Extract the following text into the following key-value pairs:
utensils: <semi-colon separated list of utensils>
inputs: <semi-colon separated list of ingredients>
outputs: <semi-colon separated list of ingredients>
steps: <semi-colon separated list of steps>
“On medium heat melt the butter and sauté the onion and bell peppers”
Action: melt; sauté
Inputs: butter; onion; bell peppers
Outputs: None
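A hedged sketch of the recursive interrogation idea: when a slot's value is itself a complex class (here, the Steps of a Recipe), the partial result is fed back through another schema-derived prompt; the mini-schemas and call_llm are illustrative assumptions.

# Recursively re-extract nested slots with the schema of their own class.
SCHEMAS = {
    "Recipe": {"label": "name of the recipe",
               "steps": "semi-colon separated list of steps"},
    "Step": {"action": "cooking action(s)",
             "inputs": "ingredients used",
             "outputs": "ingredients produced"},
}
NESTED = {"Recipe": {"steps": "Step"}}   # slots whose values get a follow-up prompt

def extract(cls: str, text: str, call_llm) -> dict:
    prompt = "\n".join(
        ["Extract the following text into these key-value pairs:"]
        + [f"{k}: <{v}>" for k, v in SCHEMAS[cls].items()]
        + ["Text:", text])
    raw = call_llm(prompt)                                     # expected "key: value" lines
    obj = dict(line.split(": ", 1) for line in raw.splitlines() if ": " in line)
    for slot, nested_cls in NESTED.get(cls, {}).items():
        if slot in obj:                                        # recursive interrogation
            obj[slot] = [extract(nested_cls, part.strip(), call_llm)
                         for part in obj[slot].split(";")]
    return obj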
SPIRES
[Figure: the (subject, predicate, object) triples extracted by SPIRES are assembled into a knowledge graph.]
KG-Hub: see kghub.org
LLMs for NER/RE - Challenges
LLM+KG: RE + validation
Hallucination
Incompleteness
Issues:
A few notes on our tool: SPIREX
RNA-KG schema
SPIRES prediction accuracy and comparison with base LLMs
Link prediction (an ML technique) for evaluating triples' plausibility (sketched below)
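A minimal sketch of link prediction as a plausibility filter for extracted triples, using a TransE-style score (plausible triples have small ||h + r - t||); the embeddings and threshold are random stand-ins, not the SPIREX configuration.

# Score a candidate triple against KG embeddings (illustrative only).
import numpy as np

rng = np.random.default_rng(42)
ENT = {"miRNA-125b": rng.normal(size=50), "GJA1": rng.normal(size=50)}   # stand-ins for
REL = {"regulates": rng.normal(size=50)}                                 # trained embeddings

def transe_score(h: str, r: str, t: str) -> float:
    return -float(np.linalg.norm(ENT[h] + REL[r] - ENT[t]))   # higher = more plausible

score = transe_score("miRNA-125b", "regulates", "GJA1")
plausible = score > -10.0        # threshold would be tuned on held-out true/false triples
print(score, plausible)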
Demo 1: LinkML design with SchemaLink
Disease:
  is_a: NamedEntity
  annotations:
    prompt: >-
      The name of a disease. Examples include: neurothekoma, retinal
      vasculitis, chicken monocytic leukemia
    annotators: sqlite:obo:mondo, sqlite:obo:hp
  id_prefixes:
    - MONDO
    - HP
[...]
RNAGeneList:
  is_a: TextWithTriples
  slot_usage:
    triples:
      range: RNAGeneRelationship
      annotations:
        prompt: >-
          A semi-colon separated list of RNA to Gene relationships
          where the relationship is regulates. For example:
          hsa-miR-1 regulates RELA;
          miR-123 regulates IGF8
(RAG-based Intelligent Component available soon)
Demo 2: NER and RE with OntoGPT-SPIRES
extracted_object:
  triples:
    - triples:
        - subject: AUTO:miRNA-125b
          predicate: RO:0002211
          object: HGNC:4274
named_entities:
  - id: AUTO:miRNA-125b
    label: miRNA-125b
    original_spans:
      - 1:10
      - 77:86
  - id: RO:0002211
    label: regulates
    original_spans:
      - 88:96
  - id: HGNC:4274
    label: GJA1
    original_spans:
      - 116:119
Abstract:
miRNA-125b is an RNA that causes myocardial infarction (infarct).
Moreover, miRNA-125b regulates the expression of GJA1, a gene involved in the cardiac valve vegetations.
Schema:
Demo 3: Relation Validation with SPIREX
testRNA-KG_enhancement.ipynb → notebook cells 78-86