1 of 16

Role of ChatGPT in OBO Ontology Development

Sierra Moxon

Lawrence Berkeley National Lab

ICBO Ontologies Tutorial: August 2023

smoxon@lbl.gov

2 of 16

Helpful resources

https://learn.deeplearning.ai/chatgpt-prompt-eng - great guided tutorial, some code examples.

3 of 16

Thanks!

  • Berkeley Bioinformatics Open-source Projects (BBOP)
  • Translational and Integrative Science Lab (TISLab)

Especially to: J. Harry Caufield, Marcin Joachimiak, Justin Reese, Harshad Hegde, Chris Mungall

4 of 16

How did I get started?

Key to learning a new tool is play.

Try to trick the tool into making ridiculously hallucinated disorders and diseases.

“kool-aid man syndrome is a…”

5 of 16

How do I use LLMs/Machine Learning daily?

LLMs

  • GitHub Copilot and ChatGPT for development.
    • Anyone willing can write Python, SPARQL, SQL, or JSON with ChatGPT.
  • ChatGPT for helping pick well known words and phrases to make models that are scientifically relevant, precise and human readable.
  • Routine script writing
  • Definition crafting
  • Explaining error messages and functions

Machine Learning

  • Grammarly for all online writing.
    • basic spelling, grammar, tone transformation.

6 of 16

Prompt Engineering: Talking to an extremely well-read teenager.

    • Give clear and specific instructions.
    • Use different words to get better results the second time.
    • Give them time to think.

7 of 16

DEMO: check algorithm results quickly

  • “What drugs may treat Dentin Dysplasia?”

8 of 16

DEMO: find all the obsolete terms in an ontology

“find me all the obsolete terms in MONDO disease ontology”
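The same question can be answered directly from a local copy of the ontology, which is a useful check on whatever ChatGPT returns. This is a minimal sketch, assuming the ontology is available as obographs-style JSON in which obsolete classes carry `"deprecated": true` under `meta`; that layout is an assumption to verify against the actual file, and the sample IDs and labels below are made up.

```python
import json

# Tiny stand-in for an obographs-style JSON export. IDs and labels are made up.
SAMPLE = json.loads("""
{"graphs": [{"nodes": [
  {"id": "http://example.org/EX_0000001", "lbl": "example disease", "type": "CLASS"},
  {"id": "http://example.org/EX_0000002", "lbl": "obsolete example disease",
   "type": "CLASS", "meta": {"deprecated": true}}
]}]}
""")


def obsolete_terms(doc):
    """Return (id, label) pairs for every node flagged as deprecated."""
    found = []
    for graph in doc.get("graphs", []):
        for node in graph.get("nodes", []):
            # Assumption: obsolete terms are marked with meta.deprecated == true.
            if node.get("meta", {}).get("deprecated") is True:
                found.append((node["id"], node.get("lbl", "")))
    return found


for term_id, label in obsolete_terms(SAMPLE):
    print(term_id, label)
```

For a real file, replace `SAMPLE` with the result of `json.load()` on the downloaded JSON.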

9 of 16

DEMO

given a list of terms separated by commas (the list is surrounded by triple ticks) : ```Cell, Neuron, Hippocampus,Microarray``` I want you to use lexical matching to find potential matches to existing ontology terms.
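For comparison, lexical matching can also be done without an LLM. The sketch below is one simple way to do it, using only the Python standard library: try an exact match on the normalized label first, then fall back to fuzzy string similarity. The lookup table, its `EX:` identifiers, and the 0.6 cutoff are hypothetical; a real table would be built from the labels and synonyms of an actual ontology.

```python
import difflib

# Hypothetical label -> ID lookup. A real one would come from an ontology file.
ONTOLOGY_LABELS = {
    "cell": "EX:0000001",
    "neuron": "EX:0000002",
    "hippocampal formation": "EX:0000003",
    "microarray platform": "EX:0000004",
}


def normalize(text):
    """Lowercase and collapse whitespace so 'Cell ' and 'cell' compare equal."""
    return " ".join(text.lower().split())


def lexical_matches(terms, labels=ONTOLOGY_LABELS, cutoff=0.6):
    """Map each input term to a list of (label, id, match_type) candidates."""
    results = {}
    for term in terms:
        key = normalize(term)
        if key in labels:
            results[term] = [(key, labels[key], "exact")]
        else:
            # Fall back to similarity-based candidates above the cutoff.
            close = difflib.get_close_matches(key, list(labels), n=3, cutoff=cutoff)
            results[term] = [(label, labels[label], "fuzzy") for label in close]
    return results


terms = [t.strip() for t in "Cell, Neuron, Hippocampus,Microarray".split(",")]
for term, candidates in lexical_matches(terms).items():
    print(term, candidates)
```

Terms with an empty candidate list are the ones worth sending to a curator or to an LLM for a second look.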

10 of 16

Add ClinGen xrefs to MONDO

“using the mondo disease ontology json file, can you return, in a simple table, a list of mondo term ids and labels that do not have ClinGen xrefs”

“Can you please make the same table, but I only want IDs that have the string "MONDO" in them”

“I just uploaded another file, the clingen curation activity summary report. I see some MONDO ids in this file. Can we use this file with the table you just generated (the full table, not just a few examples that you printed on the screen), to generate a mapping table linking MONDO ids that do not have clingen xrefs with the xrefs in the "disease_url" column of this new file?”

“the id attribute of each object in the mondo json file has a full URL, in order to map the id in the mondo file to the id in the fifth column of the clingen file, we need to extract the id from the url in the mondo json file. we should extract everything after (and including the string MONDO) in the id field of the json file, and replace the "_" with ":" -- once we have these transformed ids, we can map them to the clingen file”

“can you show me examples of this table?”

“instead of showing me "URL for MONDO:..." just show me the full url in the table”
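The ID transformation described in these prompts is small enough to write out, which makes it easy to spot-check the table ChatGPT produces. This is a minimal sketch of that step and the join that follows it, not the code ChatGPT generated. The `disease_url` column name comes from the prompt above; the `mondo_id` key, the sample ID, the label, and the example URL are hypothetical.

```python
def mondo_curie(node_id):
    """Turn '.../MONDO_1234567' into 'MONDO:1234567'; return None if there is no MONDO ID."""
    start = node_id.find("MONDO")
    if start == -1:
        return None
    # Keep everything from 'MONDO' onward and swap '_' for ':'.
    return node_id[start:].replace("_", ":")


def map_to_clingen(mondo_nodes, clingen_rows):
    """Join MONDO nodes to ClinGen rows on the transformed ID."""
    urls_by_id = {}
    for row in clingen_rows:
        # 'mondo_id' is a hypothetical name for the fifth ClinGen column.
        urls_by_id.setdefault(row["mondo_id"], []).append(row["disease_url"])
    table = []
    for node in mondo_nodes:
        curie = mondo_curie(node["id"])
        for url in urls_by_id.get(curie, []):
            table.append((curie, node.get("lbl", ""), url))
    return table


# Made-up sample data for illustration only.
nodes = [{"id": "http://purl.obolibrary.org/obo/MONDO_1234567", "lbl": "example label"}]
rows = [{"mondo_id": "MONDO:1234567", "disease_url": "https://example.org/disease/1"}]
print(map_to_clingen(nodes, rows))
```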

11 of 16

Write clear and specific instructions

  • short does not equal clear
  • use delimiters to specify key components
  • ask for structured output
  • ask the model to check whether conditions are satisfied.
    • (Tell it to return “not found” if it can’t satisfy the condition)
  • provide successful examples of completing tasks, then ask the model to perform the task.
  • Describe the input and output formats
  • Tell it who the audience is
  • Give the model a series of steps to complete on its way to the answer
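Several of these tips can be combined in a single prompt template. The sketch below only builds the prompt string and does not call any model; the task (listing disease names), the default audience, the JSON key, and the example sentence are all illustrative assumptions.

```python
DELIM = "`" * 3  # triple-backtick delimiter around the input text


def build_prompt(text, audience="ontology curators"):
    """Assemble a prompt with delimiters, steps, an example, and a fallback answer."""
    example_in = "Patients with cystic fibrosis were enrolled."
    example_out = '{"diseases": ["cystic fibrosis"]}'
    lines = [
        f"You are writing for {audience}.",                       # audience
        "Follow these steps:",                                     # series of steps
        f"1. Read the text delimited by {DELIM}.",                 # delimiters
        "2. List every disease name mentioned in the text.",
        '3. Return JSON with one key, "diseases", holding a list of strings.',  # structured output
        'If the text mentions no diseases, return exactly "not found".',        # condition check
        f"Example input: {DELIM}{example_in}{DELIM}",              # successful example
        f"Example output: {example_out}",
        f"Text: {DELIM}{text}{DELIM}",
    ]
    return "\n".join(lines)


print(build_prompt("Dentin dysplasia affects tooth development."))
```

Keeping the template in a function makes it easy to change one instruction at a time and compare the results.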

12 of 16

ontogpt

ontogpt: https://github.com/monarch-initiative/ontogpt/tree/main/src/ontogpt/

  • SPIRES: extract nested semantic structures from text

Use a schema to ask for results in a particular format, feed in some text, and use some code to ground the results in existing ontology terms.

  • HALO: generating a domain ontology given a few examples
  • SPINDOCTOR: Summarize gene set descriptions (pseudo gene-set enrichment)

13 of 16

https://github.com/monarch-initiative/ontogpt/blob/main/src/ontogpt/templates/mendelian-disease.yaml

% ontogpt extract -t mendelian_disease.MendelianDisease dentin.txt

14 of 16

Create new ontology terms?

15 of 16

Curate-GPT

curate-gpt: https://github.com/monarch-initiative/curate-gpt

16 of 16

Limitations and gotchas

  • Training data covers 2021 and before, so it knows nothing more recent
  • Isn’t actually connected to the internet (can’t follow or extract text from URLs)
  • LLMs are OK, but not great, at following precise numeric instructions
  • Vast amount of knowledge that it has not perfectly memorized, so it doesn’t know the boundary of its knowledge.
  • Odds of trying this the first time and having it be perfect are low, but it is MUCH faster than ML approaches that require training data sets
  • biased, expensive to create (both monetarily and environmentally), echo chamber effect, copyright infringement, face replicators, citations? …