Use of AI Large Language Models in SMD
Dr. Rahul Ramachandran, NASA/MSFC IMPACT
Contact:
rahul.ramachandran@nasa.gov
Motivation
LLMs offer transformative benefits for NASA SMD, streamlining both science and data system operations, thereby accelerating the pace of discovery and enhancing data utility.
Goal for this session
Responsibly integrate AI and LLMs within the Science Mission Directorate, fostering innovation through shared education, collaboration, and best practices.
Build SMD Focused Comprehensive Training Program
Implementing a structured training program and creating a shared resource are crucial steps toward infusing AI/LLM applications across different stakeholder groups within SMD.
Assumptions
The LLM field is rapidly evolving and requires flexibility in anticipation of future advancements, while remaining grounded in Open Science principles
General AI Ethics Reminder
Ethical AI use demands openness, critical questioning, collaborative partnership, and vigilant verification to ensure responsible and effective outcomes
Agenda: Recipes for Using LLMs
Recipe | Type | Audience |
1 | Prompt Engineering | Everyone |
2 | Quick Prototype Applications Using RAG | Everyone/Developers |
3 | Creating and Deploying AI Applications | Developers |
4 | Fine Tuning LLMs to Create Custom Applications | ML Engineers |
https://github.com/NASA-IMPACT/LLM-cookbook-for-open-science
Chef icon by Devendra Karkar, Noun Project (CC BY 3.0)
Getting access to the workshop environment
Recipe 1: Prompt Engineering for Science
Kaylin Bugbee, NASA/MSFC IMPACT
What is Prompt Engineering?
LLMs have PETs
Parameters
Embeddings
Tokens
LLM Information
More information about the basics of LLMs can be found in the GitHub documentation including:
https://github.com/NASA-IMPACT/LLM-cookbook-for-open-science
Prompt Engineering Ethics and Best Practices
Prompt Patterns for Science
Pattern Category | Prompt Pattern |
Output Customization | Recipe, Output Automator, Persona |
Interaction | Flipped Interaction |
Prompt Improvement | Question Refinement, Alternative Approach, Cognitive Verifier |
Error Identification | Fact Check List |
Context Control | Context Manager |
Reference: White et al. ‘A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT.’ https://arxiv.org/abs/2302.11382
Output Customization: Recipe
Definition
Template
“I am trying to [complete a task]. I know that I need [step A, B, C]. Please provide a complete sequence of steps. Please fill in any missing steps.”
Recipe Science Examples
Output Customization: Output Automator
Definition
Template
“Create a script that [describes the task to be automated], using [specific parameters or conditions]. Output the steps in [desired format or language].”
Output Automator Science Examples
Output Customization: Persona
Definition
Template
“Respond to my questions about [a specific topic or issue] as if you are [specific profession].”
Persona Science Examples
Prompt Improvement: Alternative Approach
Definition
Template
“Provide different approaches to solve [specific problem or task], considering various data, methods, tools, or algorithms that could be applied.”
Alternative Approach Science Examples
Putting It All Together: Combining Prompt Patterns
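The patterns above can be combined into a single prompt. The sketch below composes the Persona, Recipe, and Fact Check List patterns from the catalog; the profession, task, and steps are hypothetical placeholders, and the helper function is illustrative, not part of any library.

```python
# Illustrative composition of three prompt patterns from the catalog:
# Persona, Recipe, and Fact Check List. The science task and steps
# below are hypothetical placeholders.

def build_prompt(profession, task, known_steps):
    """Combine the Persona, Recipe, and Fact Check List patterns."""
    persona = f"Respond as if you are {profession}."
    recipe = (
        f"I am trying to {task}. I know that I need to "
        f"{', '.join(known_steps)}. Please provide a complete sequence "
        "of steps and fill in any missing steps."
    )
    fact_check = ("At the end of your answer, list the key facts it "
                  "depends on so I can verify them.")
    return "\n".join([persona, recipe, fact_check])

prompt = build_prompt(
    profession="an atmospheric scientist",
    task="analyze a satellite aerosol optical depth time series",
    known_steps=["acquire the data", "apply quality flags", "plot trends"],
)
print(prompt)
```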
Recipe 2: Quick Prototype Applications Using RAG
Kaylin Bugbee, NASA/MSFC IMPACT
Ashish Acharya and Nish Pantha, UAH/IMPACT
Walter Alvarado, NASA Ames Research Center
What is RAG?
Source: Gartner. What Technical Professionals Need to Know About Large Language Models
Quick Prototyping with LangFlow - RAG Chatbot Example
Goal
Approach
Value
Implementation Steps
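At its core, RAG retrieves documents relevant to a user query and injects them into the prompt so the LLM answers from trusted content rather than memory alone. A minimal sketch of that loop follows; the corpus, the word-overlap retriever, and the prompt wording are toy stand-ins (a real pipeline would use vector embeddings and an actual LLM call).

```python
# Minimal RAG sketch: retrieve the most relevant document by simple
# word overlap, then assemble an augmented prompt for an LLM.
# The corpus and scoring below are toy placeholders.

corpus = {
    "sde": "The Science Discovery Engine indexes curated NASA open science data.",
    "osdr": "The Open Science Data Repository holds space biology datasets.",
}

def retrieve(query, docs):
    """Return the doc id whose text shares the most words with the query."""
    q = set(query.lower().replace("?", "").split())
    return max(docs, key=lambda k: len(q & set(docs[k].lower().split())))

def build_rag_prompt(query, docs):
    context = docs[retrieve(query, docs)]
    return f"Context: {context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_rag_prompt("What is the Science Discovery Engine?", corpus))
```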
Science Discovery Engine (SDE)
The SDE is
a source of trusted and curated open science data and information
The SDE includes
Metadata about science data
Code
Documentation
Images
Tutorials
Mission and instrument information
Access at: https://sciencediscoveryengine.nasa.gov/
Image credit: SDE team
Platform for Building LLM-Based Applications
Image Credit: NASA IMPACT team
PromptLab
Ashish Acharya, UAH/NASA IMPACT
Objectives
LangChain
We will see examples of solving scientific problems using LangChain in subsequent presentations today.
Image Source: Microsoft Blog
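The core idea LangChain implements is composing a prompt template, a model call, and output handling into a reusable "chain." The sketch below illustrates that concept in plain Python rather than the LangChain API itself; `FakeLLM` is a stand-in for a real model endpoint, and all class names here are invented for illustration.

```python
# Conceptual sketch of the "chain" idea LangChain implements:
# a prompt template feeding a model call. FakeLLM stands in for a
# real hosted model; these classes are illustrative, not LangChain's.

class PromptTemplate:
    def __init__(self, template):
        self.template = template
    def format(self, **kwargs):
        return self.template.format(**kwargs)

class FakeLLM:
    def invoke(self, prompt):
        # A real chain would call an LLM API here.
        return f"[model response to: {prompt}]"

class Chain:
    def __init__(self, template, llm):
        self.template, self.llm = template, llm
    def run(self, **kwargs):
        return self.llm.invoke(self.template.format(**kwargs))

chain = Chain(PromptTemplate("Summarize this dataset: {name}"), FakeLLM())
print(chain.run(name="MODIS aerosol optical depth"))
```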
LangChain
LangFlow
LangFlow Community Examples
PromptLab
Sinequa Retriever
Takeaways
Open Science Data Repository (OSDR) Chatbot using PromptLab
Walter Alvarado, Space Biosciences Research Branch, NASA Ames
Walter Alvarado, Ph.D.
Open Science Data Repository
NASA Ames Research Center
Space Biology Data
Earth Science Division
Heliophysics Division
Planetary Science Division
Biological and Physical Sciences Division
Astrophysics Division
Space Biology Program
Studying the impact of space travel on living systems
NASA Open Science Data Repository
Physiological/Phenotypic/Imaging/ Environmental Telemetry Data
Molecular/Omics Data
Biospecimens
NASA Open Science Data Repository (OSDR)
osdr.nasa.gov/bio
Tabular, text, imaging, telemetry, video, code
NASA Open Science Data Repository
488 Studies
77 Assays
45 Species
>150 TB Data
910 Datasets
91 Original Publications Have Data Submitted to OSDR
65 Publications Enabled by OSDR Data Mining
142+ Datasets Used in Enabled Publications
In-situ Analytics & Hardware
Cloud Labs & Automated Inventory Management
Biomonitoring & Precision Space Health
Data Management
Machine Learning
Language Models & Generalist AI
Data Standardization & Data Engineering
Basic Biological Discovery
References:
Scott et al. (2023), Nat Mach Intell, https://rdcu.be/c8jSO
Li et al. (2023), https://arxiv.org/abs/2311.12045
Soboczenski (2024), https://huggingface.co/kenobi/NASA_GeneLab_MBT
Sanders et al. (2023), Nat Mach Intell, https://rdcu.be/c8jSS
Li et al. (2023), PMID: 38092777
AWS BPS Microscopy Benchmark, https://registry.opendata.aws/bps_microscopy/
Open Science Data Repository (OSDR) Chatbot Evaluation
Nishan Pantha, UAH/NASA IMPACT
Recipe 3: Creating and Deploying LLM Based Applications
Iksha Gurung, UAH/IMPACT
Creating and Deploying AI Applications
Goal
Approach
Value
Implementation Steps
Recipe 4: Fine Tuning LLMs to Create Custom Applications
Muthukumaran Ramasubramanian, UAH/IMPACT
Fine Tuning LLMs to Create Custom Applications
Goal
Approach
Value
Implementation Steps
Meeting Community Search Needs
Data and information volumes are growing, with content dispersed across numerous repositories, web pages, and code repositories
Scientific communities have a need to curate and share data and information around their specific use case
Leveraging the broader curation effort happening for the SDE, communities can work with the SDE team to build curated search applications
Citation: Bugbee, K., D. Smith, S. Wingo, and E. Foshee (2023), The art of scientific curation, Eos, 104, https://doi.org/10.1029/2023EO230201. Published on 19 May 2023.
March 2024
SMD Large Language Model
Large Language Models: Overview
Fig. Types of Language Model Architectures (Source: https://medium.com/@yulemoon/an-in-depth-look-at-the-transformer-based-models-22e5f5d17b6b)
Large Language Models: Use Cases
Large Language Models: Fine-tuning
Fig. Training and Fine tuning Pipeline
Encoder LLM
HuggingFace Link
SMD Large Language Model: Training Sources
Dataset | Domain | # Tokens | Ratio |
NASA CMR Dataset Description | Earth Science | 0.3 B | 1% |
AGU and AMS Papers | Earth Science | 2.8 B | 4% |
English Wikipedia | General | 5.0 B | 8% |
PubMed Abstracts | Biomedical | 6.9 B | 10% |
PMC | Biomedical | 18.5 B | 28% |
SAO/NASA ADS | Astronomy, Astrophysics, Physics, General Science | 32.7 B | 49% |
Total | | 66.2 B | 100% |
A curated dataset, spanning multiple science domains, sets a solid foundation for future model development and enhancements.
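The Ratio column follows directly from the token counts. As a quick check, the snippet below recomputes it from the table's values; note the smallest source (0.3 B tokens) is under half a percent but is listed as 1% on the slide.

```python
# Recompute the Ratio column of the training-sources table from the
# token counts (in billions of tokens). Values match the slide after
# rounding; the 0.3 B entry rounds below 1% but is listed as 1%.

tokens_b = {
    "NASA CMR Dataset Description": 0.3,
    "AGU and AMS Papers": 2.8,
    "English Wikipedia": 5.0,
    "PubMed Abstracts": 6.9,
    "PMC": 18.5,
    "SAO/NASA ADS": 32.7,
}
total = sum(tokens_b.values())
print(f"Total: {total:.1f} B tokens")
for name, t in tokens_b.items():
    print(f"{name}: {100 * t / total:.0f}%")
```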
SMD Large Language Model: Base Encoder
Base Model | RoBERTa |
Parameter Size | 125 M |
Pre-training Strategy | Masked Language Modeling |
Application Areas | Named Entity Recognition, Information Retrieval, Sentence Transformers, Extractive QA |
Resource Consumed | 192 V100 GPUs, 500K steps, 10 days training time |
Knowledge Distillation | 30M Model |
A domain-adapted, efficient encoder model marks an advancement for NASA SMD and can be used in supporting many different applications.
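Masked Language Modeling, the pre-training strategy listed above, hides a fraction of the input tokens and trains the model to recover them. The toy sketch below illustrates only the masking step, using whitespace tokens and an example sentence of my own; real pre-training operates on subword tokens from a learned tokenizer.

```python
# Toy illustration of the masking step in Masked Language Modeling:
# replace ~15% of tokens with [MASK]; the model is trained to predict
# the hidden originals. Real pre-training uses a subword tokenizer,
# not str.split(). The sentence is an invented example.

import random

def mask_tokens(text, mask_rate=0.15, seed=1):
    rng = random.Random(seed)
    tokens = text.split()
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # record the hidden original
            masked.append("[MASK]")
        else:
            masked.append(tok)
    return " ".join(masked), targets

masked, targets = mask_tokens(
    "aerosol optical depth retrieved from satellite observations")
print(masked)
print(targets)
```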
HuggingFace Link
SMD LLM Downstream Models: Sentence transformer
Sentence transformers support information retrieval by encoding documents into vectors that capture text semantics, enabling semantic search and other text analysis tasks.
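Retrieval with a sentence transformer typically compares those vectors by cosine similarity: the closest document vector to the query vector is returned. A minimal sketch of that comparison follows; the 3-d vectors are made-up stand-ins for real embedding outputs.

```python
# Cosine similarity, the comparison a sentence-transformer index relies
# on: nearby vectors mean semantically similar text. The 3-d vectors
# here are invented stand-ins for real embeddings.

import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

query_vec = [0.9, 0.1, 0.2]            # e.g. "aerosol optical depth"
doc_vecs = {
    "aerosol paper": [0.8, 0.2, 0.1],
    "galaxy survey": [0.1, 0.9, 0.3],
}
best = max(doc_vecs, key=lambda k: cosine(query_vec, doc_vecs[k]))
print(best)
```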
Fig: Sentence Transformer Architecture
Fig: Sentence Transformer Evaluation
HuggingFace Link
SMD LLM Downstream Models: Passage Reranker
Fig. Passage Reranker Architecture
Query
Relevancy Score
Passage
Encoder
Models | MS MARCO Dev | Science QA Dataset |
RoBERTa | 35.9 | 31.1 |
nasa-smd-ibm-ranker | 36.4 | 33.2 |
Fig. Passage Re-Ranker Evaluation using Relevancy Scores
HuggingFace Link
The passage reranker further improves the relevancy of passages retrieved using the sentence transformer
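A reranker scores each (query, passage) pair jointly and reorders the retrieved set by that score, as shown in the architecture figure above. The sketch below uses simple word overlap as a stand-in for the learned relevancy score of a model like nasa-smd-ibm-ranker; the query and passages are illustrative.

```python
# Toy passage reranking: score each (query, passage) pair and sort by
# score, as a cross-encoder reranker would. Word overlap stands in for
# the learned relevancy score; query and passages are invented.

def relevancy(query, passage):
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q)

query = "space biology omics datasets"
passages = [
    "OSDR provides omics datasets from space biology experiments",
    "The mission launched in 2018 from Cape Canaveral",
    "Tabular telemetry data from the ISS",
]
reranked = sorted(passages, key=lambda p: relevancy(query, p), reverse=True)
for p in reranked:
    print(f"{relevancy(query, p):.2f}  {p}")
```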
SMD LLM Downstream Models: Usage
Sentence transformer
Passage Reranker
SMD LLM Downstream Models: Hands-On
Notebook Demo
Decoder Model: Metadata Extraction