1 of 61

LLM-based Approaches for Biomedical Named Entity Recognition and Relational Extraction


Large Language Models (LLMs) - Introduction

LLMs

7 of 61

(L)LMs - History

  • Stage 1 (1960-1990): Linguistic Rules, Statistics-based Models
  • Stage 2 (2000): Neural Language Models, Word Embedding, LSTM, GRU
  • Stage 3 (2010): Pre-trained Language Models (PLMs) based on Transformer, Self-attention
    • e.g., BERT, GPT-2, BART

Parallel computation on GPUs for faster learning, more model parameters, and more training data

  • Stage 4 (2020): Large Language Models (LLMs)
    • Large models (with 7-100B+ parameters)

Capable of performing more complex tasks and problem-solving compared to PLMs

8 of 61

LLM architecture - Transformers

  • Highly structured neural networks with architectures tailored for generative AI
  • Introduced in 2017, they rapidly became popular
  • Most known example: GPT
  • Pros:
    • highly efficient
    • parallel hardware can be used for training
  • Cons:
    • huge in size
    • training has high costs (time, energy, computational resources)

[Figure: Transformer encoder-decoder architecture. Encoder side: input → embedding + positional encoding → encoder blocks #1-#6. Decoder side: output → embedding + positional encoding → decoder blocks #1-#6 → linear + softmax → token distribution.]

9 of 61

The reference problem

  • Textual sequence prediction (generative AI)
  • Given the beginning of a text, give suggestions for the next words (possibly completing the sentence)
  • But there is more than that (e.g., synthesis of images or sound)
  • Everyday example: augmented keyboards in smartphones

By the way, smartphones do not use transformers for this.

10 of 61

How is this problem tackled?

Text vectorization/embedding + next-token prediction via the self-attention mechanism

If a word has a single, definite meaning, we can rely on the chosen embedding.

What about the (frequent) case where this is not true?

Solution: move towards the more «contextually» similar words.

Vaswani et al., Attention Is All You Need
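As an illustration of moving toward contextually similar words, here is a minimal sketch of scaled dot-product attention in pure Python. The vectors are hand-set toys; there are no learned projections or multiple heads, so this is only the core mixing operation, not the full Transformer layer.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    Each output row is a similarity-weighted mixture of value vectors."""
    d_k = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

# Three toy 4-d token embeddings attending over themselves (self-attention):
x = [[1., 0., 0., 0.],
     [0., 1., 0., 0.],
     [1., 1., 0., 0.]]
out = attention(x, x, x)
```

Each output row is a convex combination of the input rows, i.e. every token's representation has been pulled toward its contextually similar neighbours.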

11 of 61

Eight Things to Know about Large Language Models

Samuel R. Bowman

1. LLMs predictably get more capable with increasing investment, even without targeted innovation

2. Specific important behaviors in LLMs tend to emerge unpredictably as a byproduct of increasing investment

4. There are no reliable techniques for steering the behavior of LLMs

12 of 61

Techniques for steering the behavior of LLMs:

Develop a dedicated model for a precise task

Building an LLM from scratch is challenging. Steps:

  1. Task definition: NLP, chatting, sentiment analysis, …
  2. Data preparation (cleaning, annotation, ...)
  3. Tokenization (i.e., the process of splitting text into smaller units, named tokens, such as words, subwords, or characters)
  4. Architect, design, and train a transformer model
    1. Number of transformer layers
    2. Size of hidden layers and embedding vectors
    3. Number of parallel attention mechanisms
  5. Evaluation
  6. Optimization
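As a toy illustration of step 3, a minimal word-level tokenizer. Real LLMs use learned subword tokenizers such as BPE; this regex split is only a sketch of the idea of turning text into discrete units.

```python
import re

def tokenize(text):
    """Very small word-level tokenizer: lowercase the text, then split
    into runs of word characters, keeping punctuation as its own token."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

tokens = tokenize("Marie Curie discovered radium.")
# → ['marie', 'curie', 'discovered', 'radium', '.']
```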

Concerns:

  • Privacy: Data privacy, personally identifiable information, data retention policy, IP leakage, security vulnerabilities, legal compliance
  • Environmental: High cost, energy consumption, carbon emissions, and water usage
  • Societal Impacts: Job loss, disparities, phishing, fraud, manipulation, plagiarism, cheating, fake news, big tech monopolies, societal unrest, …

13 of 61

Techniques for steering the behavior of LLMs:

Fine-tuning

LLM fine-tuning is the process of taking pre-trained models and further training them on smaller, specific datasets to refine their capabilities and improve performance in a particular task or domain.

Domain specific knowledge base(s)

14 of 61

Techniques for steering the behavior of LLMs:

Model prompting

Message: Support has been terrible for 2 weeks...

Sentiment: Negative

###

Message: I love your API, it is simple and so fast!

Sentiment: Positive

###

Message: GPT-J has been released 2 months ago.

Sentiment: Neutral

###

Message: The reactivity of your team has been amazing, thanks!

Sentiment:

Instruction-prompting a general-purpose model

A prompt is natural language text describing the task that an AI should perform.

Prompting is the practice of writing instructions or requests for an LLM.
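The sentiment prompt above can be assembled programmatically. A minimal sketch: the example messages are the ones from the slide, while the helper name and separator handling are ours.

```python
# Few-shot examples taken from the slide: (message, sentiment) pairs.
EXAMPLES = [
    ("Support has been terrible for 2 weeks...", "Negative"),
    ("I love your API, it is simple and so fast!", "Positive"),
    ("GPT-J has been released 2 months ago.", "Neutral"),
]

def build_few_shot_prompt(message, examples=EXAMPLES, sep="###"):
    """Assemble the few-shot sentiment prompt: labeled examples,
    separated by `sep`, followed by the unlabeled target message."""
    parts = [f"Message: {m}\nSentiment: {s}\n{sep}" for m, s in examples]
    parts.append(f"Message: {message}\nSentiment:")
    return "\n".join(parts)

prompt = build_few_shot_prompt(
    "The reactivity of your team has been amazing, thanks!")
```

The prompt ends with a bare "Sentiment:" so the model's continuation is the label.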

15 of 61

Prompt engineering

Prompt engineering is the process of structuring or crafting an instruction in order to produce the best possible output from a LLM.

Multiple distinct prompt engineering techniques have been published:

CoT, ToT, zero/few-shot learning, RAG, Graph-RAG, KG-RAG, …

16 of 61

Chain-of-Thought

Solve a problem as a series of intermediate steps before giving a final answer
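A minimal sketch of what a chain-of-thought prompt looks like; the arithmetic example is invented for illustration. The "Let's think step by step" cue asks the model to emit intermediate reasoning before its final answer.

```python
# A worked chain-of-thought exemplar: intermediate steps precede the answer.
cot_prompt = (
    "Q: A lab orders 3 boxes of 12 pipette tips and uses 20. "
    "How many tips remain?\n"
    "A: Let's think step by step. "
    "3 boxes x 12 tips = 36 tips. 36 - 20 = 16. "
    "The answer is 16."
)
```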


18 of 61

Tree-of-Thought

19 of 61

In-context learning techniques

In-context learning refers to a model's ability to temporarily learn from prompts.

Zero-shot learning: prompt an LLM without any examples, attempting to take advantage of the reasoning patterns it has gleaned

Identify airlines like this - [#AIRLINE_NAME_1] from the following tweet: "SouthwestAir Just got companion pass and trying to add companion"

Few-shot learning: a technique whereby we prompt an LLM with several concrete examples of task performance

Given the following tweets and their corresponding airlines: 1) SouthwestAir bags fly free ['Southwest Airlines'] ; 2) Jet Blue I don't know- no one would tell me where they were coming from ['JetBlue Airways'] – Please extract the airline(s) from the following tweet: "SouthwestAir Just got companion pass and trying to add companion" Using the following format - ['#AIRLINE_NAME_1] for one airline or ['#AIRLINE_NAME_1, #AIRLINE_NAME_2...] for multiple airlines.

20 of 61

Retrieval-Augmented Generation (RAG)

  • Few-shot learning powered by meaningful «context»
  • RAG also empowers users to prompt an LLM with information the model has not been previously exposed to (updated data/information without the need to retrain/fine-tune the model)

Two key steps for this approach:

Retrieval: Extract the relevant data («context») that we want the LLM to answer questions about, e.g. from databases

Augmented Generation: Feed an LLM's context window with prompt + context; the model generates a response based on that
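A minimal sketch of the two steps, using a tiny in-memory corpus and bag-of-words cosine similarity in place of a real vector store. All names and documents here are illustrative.

```python
import math
from collections import Counter

# Toy in-memory "knowledge base" (a real RAG system would use a vector store).
DOCS = {
    "doc1": "miRNA-125b regulates the expression of GJA1.",
    "doc2": "Radium was discovered by Marie Curie in 1898.",
}

def _bow(text):
    """Bag-of-words counts for a text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    num = sum(a[t] * b[t] for t in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def retrieve(query, k=1):
    """Retrieval step: rank documents by similarity to the query."""
    q = _bow(query)
    ranked = sorted(DOCS.items(),
                    key=lambda kv: cosine(q, _bow(kv[1])), reverse=True)
    return [text for _, text in ranked[:k]]

def augmented_prompt(question):
    """Augmented-generation step: prepend retrieved context to the prompt."""
    context = "\n".join(retrieve(question))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

p = augmented_prompt("Which gene does miRNA-125b regulate?")
```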

21 of 61

(K)Graph-RAG

[Figure: KG-RAG pipeline. A SPARQL query is issued against a knowledge graph (KG) to retrieve a subgraph (triples) with factual-based relationships, which is then provided to the LLM as context.]
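A minimal sketch of the KG-RAG idea, with a tiny in-memory triple store standing in for a real knowledge graph queried via SPARQL. The triples are illustrative.

```python
# Minimal in-memory triple store standing in for a KG + SPARQL endpoint.
TRIPLES = [
    ("miRNA-125b", "regulates", "GJA1"),
    ("GJA1", "involved_in", "cardiac valve vegetations"),
    ("miRNA-125b", "causes", "myocardial infarction"),
]

def retrieve_subgraph(entity):
    """Retrieval step: all factual triples mentioning the entity
    (what a SPARQL query over the KG would return)."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def kg_rag_prompt(question, entity):
    """Serialize the retrieved subgraph as factual context for the LLM."""
    facts = "\n".join(f"{s} {p} {o}." for s, p, o in retrieve_subgraph(entity))
    return f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

p = kg_rag_prompt("Which gene does miRNA-125b regulate?", "miRNA-125b")
```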

22 of 61

And furthermore: re-ranking, Agents, …

41 of 61

Ontology-Agnostic Approaches for RE

Key Properties:

  • Generalizable to new domains or previously unseen relations
  • Particularly useful in scenarios where: no ontology schema exists / the schema is difficult to define

Two main categories:

  • Modular RE
  • seq2seq RE

42 of 61

Modular RE techniques

2-step pipeline: (Named) Entity Recognition → Relation Extraction

  1. (N)ER

Input: "Marie Curie discovered radium."
Output: "[Marie Curie]"; "[radium]"

  2. RE

Input: "Marie Curie discovered radium."; "[Marie Curie]"; "[radium]"

Output: "[Marie Curie] | [discovered] | [radium]"

Ontology-Agnostic Approaches & Examples:

  • Feature-based: ML classifier trained on linguistic features (word distance between entities) e.g. OpenNLP
  • Symbolic/rule-based: pattern matching approach ("[sub] verb [obj]") e.g. CoreNLP
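A minimal sketch of the symbolic/rule-based idea: match a "[sub] verb [obj]" pattern over pre-recognized entities. This is a toy stand-in for what tools like CoreNLP do with full parsers.

```python
import re

def extract_relation(sentence, entities):
    """Rule-based RE: find an '<entity> <verb> <entity>' pattern, where
    the entities come from a previous (N)ER step."""
    ents = "|".join(map(re.escape, entities))
    m = re.search(rf"({ents})\s+(\w+)\s+({ents})", sentence)
    return (m.group(1), m.group(2), m.group(3)) if m else None

triple = extract_relation("Marie Curie discovered radium.",
                          ["Marie Curie", "radium"])
# → ('Marie Curie', 'discovered', 'radium')
```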

43 of 61

seq2seq RE techniques

«One-shot» (single-step) extraction of RDF-like triples <subject, predicate, object>.

Input: "Marie Curie discovered radium."
Output: "[Marie Curie] | [discovered] | [radium]"

Ontology-Agnostic Approaches & Examples:

  • REBEL: Autoregressive model based on BART-large that generates all triplets present in the input text
  • PubMedBERT: Biomedical domain-specific model, pretrained from scratch on PubMed abstracts
  • T5-based models: Reformulate RE as QA ("What is the relation between X and Y?") or as controlled text generation
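A minimal sketch of the QA reformulation mentioned for T5-based models; the prompt format is illustrative, not the exact one used by any specific model, and the model call itself is omitted.

```python
def re_as_qa(text, e1, e2):
    """Recast relation extraction as a QA prompt for a seq2seq model:
    the model's generated answer is taken as the predicate."""
    return (f"question: What is the relation between {e1} and {e2}? "
            f"context: {text}")

prompt = re_as_qa("Marie Curie discovered radium.", "Marie Curie", "radium")
```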

44 of 61

Can we combine ontologies and LLMs to help with these tasks?

Text extraction on the biomedical literature is hard

How can we curate structured data from unstructured text?

45 of 61

Ontologies are ubiquitous in the life sciences

[Figure: ontology usage across the life sciences (obofoundry.org): ~15k citations/year, 1 billion annotations, 10-1000 billion instances, millions of environmental samples annotated; uses include rare disease registries, variant prioritization, and analyzing single-cell seq data.]

Ontologies help us integrate data within a consistent structure. However, …

46 of 61

Can we translate unstructured scientific text directly into arbitrary knowledge schemas?

Can those schemas match the data models used in other resources, like knowledge graphs?

What if we need to:

  • model complex and nested subclasses?
  • link to external unique identifiers?
  • trust the validity of our data?
  • ask questions about integrated data?

Can LLMs help?

Ontologies structure data but don’t do the work of structuring data for us

[Figure: knowledge bases and ontologies with incompatible schemas.]

47 of 61

LinkML: A schema language grounded in ontologies

Create datamodels in simple YAML files, annotated using ontologies (grounding sources such as BioPortal, Wikidata, OBO, SciSpacy).

Compile to other frameworks: JSON-Schema, ShEx/SHACL, JSON-LD Contexts, Python Dataclasses, OWL, SQL DDL, TSVs.

[Figure: a Biocurator and a Data Scientist (dct:creator) collaborate on a LinkML schema; pluggable architecture for grounding.]

48 of 61

[Figure: a LinkML schema (recipe.linkml.yaml) defining Recipe (title, ingredients, steps) and Step (utensil, inputs, …) drives an LLM that turns free text into a structured Recipe ("Title: Scallion cream…").]

LinkML describes the domain of interest in a textual manner → translate LinkML schemas into a list of prompts to be issued to an LLM

Extract the following text into the following key-value pairs:

label: <name of the recipe>

description: <text description>

categories: <semi-colon separated list of categories>

ingredients: <semi-colon separated list of ingredients>

steps: <semi-colon separated list of steps>

Here is the recipe:

<RECIPE TEXT>

49 of 61

OntoGPT: Ontologies and LLMs

Python libraries and command line toolkit

Components:

  • SPIRES: Extracting from text
    • Structured Prompt Interrogations and Recursive Extraction of Semantics
  • HALO:
    • Hallucinating Latent Ontologies
  • SPINDOCTOR: interpreting genomics experiments
    • Structured Prompting Interpolating Narrative Descriptions Or Controlled Terms for Ontological Reporting

50 of 61

SPIRES for Text Extraction

Inputs:

  • LinkML Schema
  • Free Text

Process:

  • Recursive descent into the schema tree, querying an LLM (e.g. via the OpenAI GPT API)
  • Entity linking using look-up tables

Output:

  • YAML extracted from text
  • Conforms to schema

51 of 61

Recursive Interrogation

Extract the following text into the following key-value pairs:

utensils: <semi-colon separated list of utensils>

inputs: <semi-colon separated list of ingredients>

outputs: <semi-colon separated list of ingredients>

steps: <semi-colon separated list of steps>

“On medium heat melt the butter and sauté the onion and bell peppers”

Action: melt; sauté

Inputs: butter; onion; bell peppers

Outputs: None
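The recursive interrogation can be sketched as a recursive descent over a schema, one prompt per slot; `ask_llm` below is a stub standing in for the real LLM call, and the two-class schema is invented for illustration.

```python
# Toy schema: slot → range; a range naming another class triggers recursion.
SCHEMA = {
    "Recipe": {"title": "str", "steps": "Step"},
    "Step": {"action": "str", "inputs": "str", "outputs": "str"},
}

def ask_llm(prompt):
    """Placeholder: a real implementation would query an LLM here."""
    return "stub"

def extract(class_name, text):
    """Recursively descend the schema tree, issuing one prompt per slot
    and recursing into slots whose range is another schema class."""
    result = {}
    for slot, rng in SCHEMA[class_name].items():
        if rng in SCHEMA:   # nested class → recurse on the LLM's answer
            result[slot] = extract(rng, ask_llm(f"Extract {slot} from: {text}"))
        else:               # leaf slot → direct query
            result[slot] = ask_llm(f"Extract the {slot} from: {text}")
    return result

result = extract("Recipe",
                 "On medium heat melt the butter and sauté the onion")
```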

52 of 61

SPIRES

[Figure: SPIRES emits subject-predicate-object (s, p, o) triples that populate a Knowledge Graph; see KG-Hub (kghub.org).]

53 of 61

LLMs for NER/RE - Challenges

  • Alignment Problem: LLMs may produce undesirable outputs (e.g. entities and facts not grounded in schemas/ontologies)

  • Hallucination: generation of syntactically correct but semantically/factually incorrect statements

  • Lack of Consistency: Generating logically contradictory outputs → low semantic similarity of LLM outputs due to violation of important relational properties such as negation, symmetry, and transitivity

  • Black-box Model: Many LLMs are proprietary and little information is released about them. Difficult to explain LLM predictions with billions of parameters. Knowledge in LLMs is hard to interpret, update, and is prone to bias. Challenging to deploy LLMs in decision-critical applications
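Consistency violations of the kind mentioned above can be checked mechanically once triples are extracted. A minimal sketch for symmetry; the relation name and triples are illustrative.

```python
def symmetry_violations(triples, symmetric={"interacts_with"}):
    """For a symmetric relation, s-p-o implies o-p-s: report extracted
    triples whose inverse is missing (a consistency violation)."""
    tset = set(triples)
    return [(s, p, o) for s, p, o in tset
            if p in symmetric and (o, p, s) not in tset]

triples = [("A", "interacts_with", "B"),
           ("B", "interacts_with", "A"),
           ("A", "interacts_with", "C")]   # inverse of this one is missing
violations = symmetry_violations(triples)
```

Analogous checks can be written for negation (a triple and its negation must not co-occur) and transitivity.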

54 of 61

LLM+KG: RE + validation

Issues:

  • Hallucination
  • Incompleteness

55 of 61

A few notes on our tool: SPIREX

56 of 61

RNA-KG schema

57 of 61

SPIRES prediction accuracy and comparison with base LLMs

58 of 61

Link prediction (ML technique) for evaluating triples' plausibility
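Link prediction scores a candidate triple's plausibility with KG embeddings. A minimal TransE-style sketch with hand-set toy vectors; a real setup would use embeddings trained on a knowledge graph such as RNA-KG.

```python
import math

# Toy 2-d embeddings, hand-set so the true triple scores perfectly.
EMB = {
    "miRNA-125b": [1.0, 0.0],
    "GJA1":       [1.0, 1.0],
    "regulates":  [0.0, 1.0],
}

def transe_score(s, p, o):
    """TransE plausibility: ||e_s + e_p - e_o||. Lower is more plausible."""
    return math.sqrt(sum((a + b - c) ** 2
                         for a, b, c in zip(EMB[s], EMB[p], EMB[o])))

score = transe_score("miRNA-125b", "regulates", "GJA1")   # plausible
bad = transe_score("GJA1", "regulates", "miRNA-125b")     # wrong direction
```

Comparing scores of extracted triples against a threshold (or against corrupted triples) gives a plausibility filter for LLM-extracted relations.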

59 of 61

Demo 1: LinkML design with SchemaLink

Disease:
  is_a: NamedEntity
  annotations:
    prompt: >-
      The name of a disease. Examples include: neurothekoma, retinal
      vasculitis, chicken monocytic leukemia
    annotators: sqlite:obo:mondo, sqlite:obo:hp
  id_prefixes:
    - MONDO
    - HP

[...]

RNAGeneList:
  is_a: TextWithTriples
  slot_usage:
    triples:
      range: RNAGeneRelationship
      annotations:
        prompt: >-
          A semi-colon separated list of RNA to Gene relationships
          where the relationship is regulates. For example:
          hsa-miR-1 regulates RELA;
          miR-123 regulates IGF8

SchemaLink

(RAG-based Intelligent Component available soon)

60 of 61

Demo 2: NER and RE with OntoGPT-SPIRES

extracted_object:
  triples:
    - triples:
        - subject: AUTO:miRNA-125b
          predicate: RO:0002211
          object: HGNC:4274
named_entities:
  - id: AUTO:miRNA-125b
    label: miRNA-125b
    original_spans:
      - 1:10
      - 77:86
  - id: RO:0002211
    label: regulates
    original_spans:
      - 88:96
  - id: HGNC:4274
    label: GJA1
    original_spans:
      - 116:119

Abstract:

miRNA-125b is an RNA that causes myocardial infarction (infarct).

Moreover, miRNA-125b regulates the expression of GJA1, a gene involved in cardiac valve vegetations.

Schema:

61 of 61

Demo 3: Relation Validation with SPIREX

testRNA-KG_enhancement.ipynb notebook's cells 78-86