1 of 61

LLM-based Approaches for Biomedical Named Entity Recognition and Relational Extraction


Large Language Models (LLMs) - Introduction

LLMs

7 of 61

(L)LMs - History

  • Stage 1 (1960-1990): Linguistic Rules, Statistics-based Models
  • Stage 2 (2000): Neural Language Models, Word Embedding, LSTM, GRU
  • Stage 3 (2010): Pre-trained Language Models (PLMs) based on Transformer, Self-attention
    • e.g., BERT, GPT-2, BART

Parallel computation on GPUs for faster learning, more model parameters, and more training data

  • Stage 4 (2020): Large Language Models (LLMs)
    • Large models (with 7-100B+ parameters)

Capable of performing more complex tasks and problem-solving compared to PLMs

8 of 61

LLM architecture - Transformers

  • Highly structured neural networks with architectures tailored for generative AI
  • Introduced in 2017, they rapidly became popular
  • Most known example: GPT
  • Pros:
    • highly efficient
    • parallel hardware can be used for training
  • Cons:
    • huge in size
    • training has high costs (time, energy, computational resources)

[Figure: Transformer encoder-decoder architecture. Encoder side: input → embedding + positional encoding → encoder blocks #1-#6. Decoder side: output → embedding + positional encoding → decoder blocks #1-#6 → linear + softmax → token distribution.]

9 of 61

The reference problem

  • Textual sequence prediction (generative AI)
  • Given the beginning of a text, give suggestions for the next words (possibly completing the sentence)
  • But there is more than that (e.g., synthesis of images or sound)
  • Everyday example: augmented keyboards in smartphones

By the way, smartphones do not use transformers for this.

10 of 61

How is this problem tackled?

Text vectorization/embedding + next-token prediction via the self-attention mechanism

If a word has a single, definite meaning, we can rely on the chosen embedding.

What about the (frequent) case where this is not true?

Solution: move towards the more «contextually» similar words.

Vaswani et al., Attention Is All You Need
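As an illustration of moving toward contextually similar words, here is a minimal sketch of scaled dot-product attention in pure Python. The vectors are hand-set toys; there are no learned projections or multiple heads, so this is only the core mixing operation, not the full Transformer layer.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Each output row is a similarity-weighted mixture of value vectors."""
    d_k = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# Three toy 4-d token embeddings attending over themselves (self-attention):
x = [[1., 0., 0., 0.],
     [0., 1., 0., 0.],
     [1., 1., 0., 0.]]
out = attention(x, x, x)
```

Each output row is a convex combination of the input rows, i.e. every token's representation has been pulled toward its contextually similar neighbours.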

11 of 61

Eight Things to Know about Large Language Models

Samuel R. Bowman

1. LLMs predictably get more capable with increasing investment, even without targeted innovation

2. Specific important behaviors in LLMs tend to emerge unpredictably as a byproduct of increasing investment

4. There are no reliable techniques for steering the behavior of LLMs

12 of 61

Techniques for steering the behavior of LLMs:

Develop a dedicated model for a precise task

Building an LLM from scratch is challenging. Steps:

  1. Task definition: NLP, chatting, sentiment analysis, …
  2. Data preparation (cleaning, annotation, ...)
  3. Tokenization (i.e., the process of splitting text into smaller units, named tokens, such as words, subwords, or characters)
  4. Architect, design, and train a transformer model
    1. Number of transformer layers
    2. Size of hidden layers and embedding vectors
    3. Number of parallel attention mechanisms
  5. Evaluation
  6. Optimization
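As a toy illustration of step 3, a minimal word-level tokenizer. Real LLMs use learned subword tokenizers such as BPE; this regex split is only a sketch of the idea of turning text into discrete units.

```python
import re

def tokenize(text):
    """Very small word-level tokenizer: lowercase the text, then split
    into runs of word characters, keeping punctuation as its own token."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Marie Curie discovered radium.")
# → ['marie', 'curie', 'discovered', 'radium', '.']
```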

Concerns:

  • Privacy: Data privacy, personally identifiable information, data retention policy, IP leakage, security vulnerabilities, legal compliance
  • Environmental: High cost, energy consumption, carbon emissions, and water usage
  • Societal Impacts: Job loss, disparities, phishing, fraud, manipulation, plagiarism, cheating, fake news, big tech monopolies, societal unrest, …

13 of 61

Techniques for steering the behavior of LLMs:

Fine-tuning

LLM fine-tuning is the process of taking pre-trained models and further training them on smaller, specific datasets to refine their capabilities and improve performance in a particular task or domain.

Domain specific knowledge base(s)

14 of 61

Techniques for steering the behavior of LLMs:

Model prompting

Message: Support has been terrible for 2 weeks...

Sentiment: Negative

###

Message: I love your API, it is simple and so fast!

Sentiment: Positive

###

Message: GPT-J has been released 2 months ago.

Sentiment: Neutral

###

Message: The reactivity of your team has been amazing, thanks!

Sentiment:

Instruction-prompting a general-purpose model

A prompt is natural language text describing the task that an AI should perform.

Prompting is the practice of writing instructions or requests for an LLM.
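The sentiment prompt above can be assembled programmatically. A minimal sketch: the example messages are the ones from the slide, while the helper name and separator handling are ours.

```python
# Few-shot examples taken from the slide: (message, sentiment) pairs.
EXAMPLES = [
    ("Support has been terrible for 2 weeks...", "Negative"),
    ("I love your API, it is simple and so fast!", "Positive"),
    ("GPT-J has been released 2 months ago.", "Neutral"),
]

def build_few_shot_prompt(message, examples=EXAMPLES, sep="###"):
    """Assemble the few-shot sentiment prompt: labeled examples,
    separated by `sep`, followed by the unlabeled target message."""
    parts = [f"Message: {m}\nSentiment: {s}\n{sep}" for m, s in examples]
    parts.append(f"Message: {message}\nSentiment:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "The reactivity of your team has been amazing, thanks!")
```

The prompt ends with a bare "Sentiment:" so the model's continuation is the label.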

15 of 61

Prompt engineering

Prompt engineering is the process of structuring or crafting an instruction in order to produce the best possible output from a LLM.

Multiple distinct prompt engineering techniques have been published:

CoT, ToT, zero/few-shot learning, RAG, Graph-RAG, KG-RAG, …

16 of 61

Chain-of-Thought

Solve a problem as a series of intermediate steps before giving a final answer
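A minimal sketch of what a chain-of-thought prompt looks like; the arithmetic example is invented for illustration. The "Let's think step by step" cue asks the model to emit intermediate reasoning before its final answer.

```python
# A worked chain-of-thought exemplar: intermediate steps precede the answer.
cot_prompt = (
    "Q: A lab orders 3 boxes of 12 pipette tips and uses 20. "
    "How many tips remain?\n"
    "A: Let's think step by step. "
    "3 boxes x 12 tips = 36 tips. 36 - 20 = 16. "
    "The answer is 16."
)
```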


18 of 61

Tree-of-Thought

19 of 61

In-context learning techniques

In-context learning refers to a model's ability to temporarily learn from prompts.

Zero-shot learning: prompt an LLM without any examples, attempting to take advantage of the reasoning patterns it has gleaned

Identify airlines like this - [#AIRLINE_NAME_1] from the following tweet: "SouthwestAir Just got companion pass and trying to add companion"

Few-shot learning: a technique whereby we prompt an LLM with several concrete examples of task performance

Given the following tweets and their corresponding airlines: 1) SouthwestAir bags fly free ['Southwest Airlines'] ; 2) Jet Blue I don't know- no one would tell me where they were coming from ['JetBlue Airways'] – Please extract the airline(s) from the following tweet: "SouthwestAir Just got companion pass and trying to add companion" Using the following format - ['#AIRLINE_NAME_1] for one airline or ['#AIRLINE_NAME_1, #AIRLINE_NAME_2...] for multiple airlines.

20 of 61

Retrieval-Augmented Generation (RAG)

  • Few-shot learning powered by meaningful «context»
  • RAG also empowers users to prompt an LLM with information the model has not been previously exposed to (updated data/information without the need to retrain/fine-tune the model)

Two key steps for this approach:

Retrieval: Extract the relevant data («context») that we want the LLM to answer questions about, e.g. from databases

Augmented Generation: Feed an LLM's context window with prompt + context; the model generates a response based on that
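A minimal sketch of the two steps, using a tiny in-memory corpus and bag-of-words cosine similarity in place of a real vector store. All names and documents here are illustrative.

```python
import math
from collections import Counter

# Toy in-memory "knowledge base" (a real RAG system would use a vector store).
DOCS = {
    "doc1": "miRNA-125b regulates the expression of GJA1.",
    "doc2": "Radium was discovered by Marie Curie in 1898.",
}

def _bow(text):
    """Bag-of-words counts for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, k=1):
    """Retrieval step: rank documents by similarity to the query."""
    q = _bow(query)
    ranked = sorted(DOCS.items(),
                    key=lambda kv: cosine(q, _bow(kv[1])), reverse=True)
    return [text for _, text in ranked[:k]]

def augmented_prompt(question):
    """Augmented-generation step: prepend retrieved context to the prompt."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

p = augmented_prompt("Which gene does miRNA-125b regulate?")
```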

21 of 61

(K)Graph-RAG

[Figure: KG-RAG pipeline. A SPARQL query is issued against a knowledge graph (KG) to retrieve a subgraph (triples) with factual-based relationships, which is then provided to the LLM as context.]
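A minimal sketch of the KG-RAG idea, with a tiny in-memory triple store standing in for a real knowledge graph queried via SPARQL. The triples are illustrative.

```python
# Minimal in-memory triple store standing in for a KG + SPARQL endpoint.
TRIPLES = [
    ("miRNA-125b", "regulates", "GJA1"),
    ("GJA1", "involved_in", "cardiac valve vegetations"),
    ("miRNA-125b", "causes", "myocardial infarction"),
]

def retrieve_subgraph(entity):
    """Retrieval step: all factual triples mentioning the entity
    (what a SPARQL query over the KG would return)."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def kg_rag_prompt(question, entity):
    """Serialize the retrieved subgraph as factual context for the LLM."""
    facts = "\n".join(f"{s} {p} {o}." for s, p, o in retrieve_subgraph(entity))
    return f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

p = kg_rag_prompt("Which gene does miRNA-125b regulate?", "miRNA-125b")
```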

22 of 61

And furthermore: re-ranking, Agents, …

41 of 61

Ontology-Agnostic Approaches for RE

Key Properties:

  • Generalizable to new domains or previously unseen relations
  • Particularly useful in scenarios where: no ontology schema exists / the schema is difficult to define

Two main categories:

  • Modular RE
  • seq2seq RE

42 of 61

Modular RE techniques

2-step pipeline: (Named) Entity Recognition → Relation Extraction

  1. (N)ER

Input: "Marie Curie discovered radium."
Output: "[Marie Curie]"; "[radium]"

  2. RE

Input: "Marie Curie discovered radium."; "[Marie Curie]"; "[radium]"

Output: "[Marie Curie] | [discovered] | [radium]"

Ontology-Agnostic Approaches & Examples:

  • Feature-based: ML classifier trained on linguistic features (word distance between entities) e.g. OpenNLP
  • Symbolic/rule-based: pattern matching approach ("[sub] verb [obj]") e.g. CoreNLP
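A minimal sketch of the symbolic/rule-based idea: match a "[sub] verb [obj]" pattern over pre-recognized entities. This is a toy stand-in for what tools like CoreNLP do with full parsers.

```python
import re

def extract_relation(sentence, entities):
    """Rule-based RE: find an '<entity> <verb> <entity>' pattern, where
    the entities come from a previous (N)ER step."""
    ents = "|".join(map(re.escape, entities))
    m = re.search(rf"({ents})\s+(\w+)\s+({ents})", sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None

triple = extract_relation("Marie Curie discovered radium.",
                          ["Marie Curie", "radium"])
# → ('Marie Curie', 'discovered', 'radium')
```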

43 of 61

seq2seq RE techniques

«One-shot» (single-step) extraction of RDF-like triples <subject, predicate, object>.

Input: "Marie Curie discovered radium."
Output: "[Marie Curie] | [discovered] | [radium]"

Ontology-Agnostic Approaches & Examples:

  • REBEL: Autoregressive model based on BART-large that generates all triplets present in the input text
  • PubMedBERT: Biomedical domain-specific model, pretrained from scratch on PubMed abstracts
  • T5-based models: Reformulate RE as QA ("What is the relation between X and Y?") or as controlled text generation
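A minimal sketch of the QA reformulation mentioned for T5-based models; the prompt format is illustrative, not the exact one used by any specific model, and the model call itself is omitted.

```python
def re_as_qa(text, e1, e2):
    """Recast relation extraction as a QA prompt for a seq2seq model:
    the model's generated answer is taken as the predicate."""
    return (f"question: What is the relation between {e1} and {e2}? "
            f"context: {text}")

prompt = re_as_qa("Marie Curie discovered radium.", "Marie Curie", "radium")
```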

44 of 61

Can we combine ontologies and LLMs to help with these tasks?

Text extraction on the biomedical literature is hard

How can we curate structured data from unstructured text?

45 of 61

Ontologies are ubiquitous in the life sciences

[Figure: ontology usage across the life sciences (obofoundry.org): ~15k citations/year, 1 billion annotations, 10-1000 billion instances, millions of environmental samples annotated; uses include rare disease registries, variant prioritization, and analyzing single-cell seq data.]

Ontologies help us integrate data within a consistent structure. However, …

46 of 61

Can we translate unstructured scientific text directly into arbitrary knowledge schemas?

Can those schemas match the data models used in other resources, like knowledge graphs?

What if we need to:

  • model complex and nested subclasses?
  • link to external unique identifiers?
  • trust the validity of our data?
  • ask questions about integrated data?

Can LLMs help?

Ontologies structure data but don’t do the work of structuring data for us

[Figure: knowledge bases and ontologies with incompatible schemas.]

47 of 61

LinkML: A schema language grounded in ontologies

Create datamodels in simple YAML files, annotated using ontologies (grounding sources such as BioPortal, Wikidata, OBO, SciSpacy).

Compile to other frameworks: JSON-Schema, ShEx/SHACL, JSON-LD Contexts, Python Dataclasses, OWL, SQL DDL, TSVs.

[Figure: a Biocurator and a Data Scientist (dct:creator) collaborate on a LinkML schema; pluggable architecture for grounding.]

48 of 61

[Figure: a LinkML schema (recipe.linkml.yaml) defining Recipe (title, ingredients, steps) and Step (utensil, inputs, …) drives an LLM that turns free text into a structured Recipe ("Title: Scallion cream…").]

LinkML describes the domain of interest in a textual manner → translate LinkML schemas into a list of prompts to be issued to an LLM

Extract the following text into the following key-value pairs:

label: <name of the recipe>

description: <text description>

categories: <semi-colon separated list of categories>

ingredients: <semi-colon separated list of ingredients>

steps: <semi-colon separated list of steps>

Here is the recipe:

<RECIPE TEXT>

49 of 61

OntoGPT: Ontologies and LLMs

Python libraries and command line toolkit

Components:

  • SPIRES: Extracting from text
    • Structured Prompt Interrogations and Recursive Extraction of Semantics
  • HALO:
    • Hallucinating Latent Ontologies
  • SPINDOCTOR: interpreting genomics experiments
    • Structured Prompting Interpolating Narrative Descriptions Or Controlled Terms for Ontological Reporting

50 of 61

SPIRES for Text Extraction

Inputs:

  • LinkML Schema
  • Free Text

Process:

  • Recursive descent into the schema tree, querying an LLM (e.g. via the OpenAI GPT API)
  • Entity linking using look-up tables

Output:

  • YAML extracted from text
  • Conforms to schema

51 of 61

Recursive Interrogation

Extract the following text into the following key-value pairs:

utensils: <semi-colon separated list of utensils>

inputs: <semi-colon separated list of ingredients>

outputs: <semi-colon separated list of ingredients>

steps: <semi-colon separated list of steps>

“On medium heat melt the butter and sauté the onion and bell peppers”

Action: melt; sauté

Inputs: butter; onion; bell peppers

Outputs: None
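The recursive interrogation can be sketched as a recursive descent over a schema, one prompt per slot; `ask_llm` below is a stub standing in for the real LLM call, and the two-class schema is invented for illustration.

```python
# Toy schema: slot → range; a range naming another class triggers recursion.
SCHEMA = {
    "Recipe": {"title": "str", "steps": "Step"},
    "Step": {"action": "str", "inputs": "str", "outputs": "str"},
}

def ask_llm(prompt):
    """Placeholder: a real implementation would query an LLM here."""
    return "stub"

def extract(class_name, text):
    """Recursively descend the schema tree, issuing one prompt per slot
    and recursing into slots whose range is another schema class."""
    result = {}
    for slot, rng in SCHEMA[class_name].items():
        if rng in SCHEMA:   # nested class → recurse on the LLM's answer
            result[slot] = extract(rng, ask_llm(f"Extract {slot} from: {text}"))
        else:               # leaf slot → direct query
            result[slot] = ask_llm(f"Extract the {slot} from: {text}")
    return result

result = extract("Recipe",
                 "On medium heat melt the butter and sauté the onion")
```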

52 of 61

SPIRES

[Figure: SPIRES emits subject-predicate-object (s, p, o) triples that populate a Knowledge Graph; see KG-Hub (kghub.org).]

53 of 61

LLMs for NER/RE - Challenges

  • Alignment Problem: LLMs may produce undesirable outputs (e.g. entities and facts not grounded in schemas/ontologies)

  • Hallucination: generation of syntactically correct but semantically/factually incorrect statements

  • Lack of Consistency: Generating logically contradictory outputs → low semantic similarity of LLM outputs due to violation of important relational properties such as negation, symmetry, and transitivity

  • Black-box Model: Many LLMs are proprietary and little information is released about them. Difficult to explain LLM predictions with billions of parameters. Knowledge in LLMs is hard to interpret, update, and is prone to bias. Challenging to deploy LLMs in decision-critical applications
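Consistency violations of the kind mentioned above can be checked mechanically once triples are extracted. A minimal sketch for symmetry; the relation name and triples are illustrative.

```python
def symmetry_violations(triples, symmetric={"interacts_with"}):
    """For a symmetric relation, s-p-o implies o-p-s: report extracted
    triples whose inverse is missing (a consistency violation)."""
    tset = set(triples)
    return [(s, p, o) for s, p, o in tset
            if p in symmetric and (o, p, s) not in tset]

triples = [("A", "interacts_with", "B"),
           ("B", "interacts_with", "A"),
           ("A", "interacts_with", "C")]   # inverse of this one is missing
violations = symmetry_violations(triples)
```

Analogous checks can be written for negation (a triple and its negation must not co-occur) and transitivity.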

54 of 61

LLM+KG: RE + validation

Issues:

  • Hallucination
  • Incompleteness

55 of 61

A few notes on our tool: SPIREX

56 of 61

RNA-KG schema

57 of 61

SPIRES prediction accuracy and comparison with base LLMs

58 of 61

Link prediction (ML technique) for evaluating triples' plausibility
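Link prediction scores a candidate triple's plausibility with KG embeddings. A minimal TransE-style sketch with hand-set toy vectors; a real setup would use embeddings trained on a knowledge graph such as RNA-KG.

```python
import math

# Toy 2-d embeddings, hand-set so the true triple scores perfectly.
EMB = {
    "miRNA-125b": [1.0, 0.0],
    "GJA1":       [1.0, 1.0],
    "regulates":  [0.0, 1.0],
}

def transe_score(s, p, o):
    """TransE plausibility: ||e_s + e_p - e_o||. Lower is more plausible."""
    return math.sqrt(sum((a + b - c) ** 2
                         for a, b, c in zip(EMB[s], EMB[p], EMB[o])))

score = transe_score("miRNA-125b", "regulates", "GJA1")   # plausible
bad = transe_score("GJA1", "regulates", "miRNA-125b")     # wrong direction
```

Comparing scores of extracted triples against a threshold (or against corrupted triples) gives a plausibility filter for LLM-extracted relations.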

59 of 61

Demo 1: LinkML design with SchemaLink

Disease:
  is_a: NamedEntity
  annotations:
    prompt: >-
      The name of a disease. Examples include: neurothekoma, retinal
      vasculitis, chicken monocytic leukemia
    annotators: sqlite:obo:mondo, sqlite:obo:hp
  id_prefixes:
    - MONDO
    - HP

[...]

RNAGeneList:
  is_a: TextWithTriples
  slot_usage:
    triples:
      range: RNAGeneRelationship
      annotations:
        prompt: >-
          A semi-colon separated list of RNA to Gene relationships
          where the relationship is regulates. For example:
          hsa-miR-1 regulates RELA;
          miR-123 regulates IGF8

SchemaLink

(RAG-based Intelligent Component available soon)

60 of 61

Demo 2: NER and RE with OntoGPT-SPIRES

extracted_object:
  triples:
    - triples:
        - subject: AUTO:miRNA-125b
          predicate: RO:0002211
          object: HGNC:4274
named_entities:
  - id: AUTO:miRNA-125b
    label: miRNA-125b
    original_spans:
      - 1:10
      - 77:86
  - id: RO:0002211
    label: regulates
    original_spans:
      - 88:96
  - id: HGNC:4274
    label: GJA1
    original_spans:
      - 116:119

Abstract:

miRNA-125b is an RNA that causes myocardial infarction (infarct).

Moreover, miRNA-125b regulates the expression of GJA1, a gene involved in cardiac valve vegetations.

Schema:

61 of 61

Demo 3: Relation Validation with SPIREX

testRNA-KG_enhancement.ipynb notebook's cells 78-86