1 of 63

Use of AI Large Language Models in SMD

Dr. Rahul Ramachandran, NASA/MSFC IMPACT

Contact:

rahul.ramachandran@nasa.gov

2 of 63

Motivation

  • LLMs increase research efficiency, saving scientists significant time.
    • Streamline workflows: data discovery, access, literature review, and coding.
    • Reduce the share of time spent on "data wrangling" (often cited at ~80% of a project) to far less, accelerating scientific discovery.
  • LLMs enhance data system value: improving data visibility, access, usability, and value.
    • Facilitate new discovery pathways and applications, increasing data system use.
    • Enable easier and more contextual information retrieval, aligning with user expectations.

LLMs offer transformative benefits for NASA SMD, streamlining both science and data system operations, thereby accelerating the pace of discovery and enhancing data utility.

3 of 63

Goal for this session

  • Lean into AI, particularly LLMs, to enhance the Science Mission Directorate's capabilities.
    • Ensure responsible and ethical AI/LLM usage by recognizing limits and implementing safeguards.
  • Promote knowledge exchange and collaboration for proficient LLM application.
  • Offer AI/LLM training covering:
    • Techniques like Retrieval-Augmented Generation (RAG) and prompt engineering.
    • Distinctions between Encoder and Decoder models.

Responsibly integrate AI and LLMs within the Science Mission Directorate, fostering innovation through shared education, collaboration, and best practices.

4 of 63

Build SMD Focused Comprehensive Training Program

  • Establish a knowledge commons for training materials, prompts, and LLM-based applications to foster knowledge sharing.
    • Training on AI/LLMs, covering techniques like RAG, prompt engineering, and fine-tuning.
  • Explain the distinct value and functions of types of models (encoder, decoder).
  • Design specialized training modules for three key groups: all users, developers, and ML developers.

Implementing a structured training program and creating a shared resource are crucial steps toward infusing AI/LLM applications across different stakeholder groups within SMD.

5 of 63

Assumptions

  • Recognize the fast-paced evolution of the LLM landscape and infrastructure.
  • Prepare for frequent releases of improved LLM models.
    • Anticipate a decrease in current limitations and continuous methodological evolution.
  • Expect a future where open-source models effectively meet scientific requirements.
  • Embrace Open Science principles, aligning with Strategic Plan Directive (SPD) 41.
  • Ensure infrastructure flexibility to include Azure, Bedrock, Vertex, IBM WatsonX.

LLM field is rapidly evolving and requires flexibility in anticipation of future advancements while being grounded in Open Science principles

6 of 63

General AI Ethics Reminder

  • Prioritize open models, data, workflows, and code for transparency and collaboration.
  • Devote significant effort to formulating clear, concise, and correct questions.
    • Embrace Albert Einstein's approach: Spend most of the time defining the problem accurately to enable rapid solutions.
    • Ensure questions are precisely worded to retrieve relevant answers.
  • Apply Carl Sagan’s “baloney detection kit” for critical analysis and fact verification.
    • Adhere to the principle of "trust but verify" to maintain rigorous scrutiny.
    • Avoid unquestioning acceptance of AI-generated outputs.
  • Approach AI as a collaborative partner, not a sole decision-maker.
    • Employ the co-pilot analogy, emphasizing shared responsibility in AI interaction.
    • Remain accountable for the integrity and accuracy of AI-assisted outcomes.

Ethical AI use demands openness, critical questioning, collaborative partnership, and vigilant verification to ensure responsible and effective outcomes

7 of 63

Agenda: Recipes for Using LLMs

Recipe   Type                                             Audience
1        Prompt Engineering                               Everyone
2        Quick Prototype Applications Using RAG           Everyone/Developers
3        Creating and Deploying AI Applications           Developers
4        Fine Tuning LLMs to Create Custom Applications   ML Engineers

https://github.com/NASA-IMPACT/LLM-cookbook-for-open-science

chef by Devendra Karkar Noun Project (CC BY 3.0)

8 of 63

Getting access to the workshop environment

9 of 63

Recipe 1: Prompt Engineering for Science

Kaylin Bugbee, NASA/MSFC IMPACT

10 of 63

What is Prompt Engineering?

  • The design and optimization of effective queries or instructions (also known as prompt patterns) for generative AI tools like ChatGPT, in order to elicit the desired responses.
  • Understanding how to write prompts will increase the quality of responses received.
  • In other words, ask the right questions to get better answers.

11 of 63

LLMs have PETs

Parameters

    • Model Parameters
      • Indicate the size of the LLM. LLMs vary in size - larger models do not always equate to better performance.
    • Prompt Parameters
      • Parameters that can be changed or set during prompting. These can be modified either while using the LLM’s API or via the LLM dashboard

Embeddings

    • A mathematical representation of words that captures meaning and context. An embedding is essentially a list of numbers, one per dimension the embedding represents; this list is also called a vector.

Tokens

    • The basic unit of input a LLM uses. It can be a word, part of a word, or some other segment of input. Text is converted to numbers by a tokenizer.
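To make tokens and embeddings concrete, here is a library-free sketch. The greedy longest-match tokenizer and the tiny embedding table are illustrative stand-ins for the BPE/WordPiece tokenizers and learned embedding matrices real LLMs use; the vocabulary and vector values are made up.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenizer: a simplified stand-in for
    the BPE/WordPiece tokenizers used by real LLMs."""
    tokens = []
    for word in text.lower().split():
        while word:
            for end in range(len(word), 0, -1):
                piece = word[:end]
                # Fall back to a single character if no vocab entry matches.
                if piece in vocab or end == 1:
                    tokens.append(piece)
                    word = word[end:]
                    break
    return tokens

# Toy vocabulary and embedding table (values are illustrative, not learned).
vocab = {"geo": 0, "science": 1, "data": 2, "helio": 3}
embeddings = {
    0: [0.12, -0.40, 0.88],   # "geo"
    1: [0.35, 0.10, -0.27],   # "science"
    2: [-0.51, 0.64, 0.02],   # "data"
    3: [0.09, -0.33, 0.71],   # "helio"
}

tokens = tokenize("Geoscience data", vocab)    # ["geo", "science", "data"]
token_ids = [vocab[t] for t in tokens]         # [0, 1, 2]
vectors = [embeddings[i] for i in token_ids]   # one 3-d vector per token
```

Note how "geoscience" is split into two subword tokens: a word missing from the vocabulary is still representable from smaller pieces.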

12 of 63

LLM Information

More information about the basics of LLMs can be found in the GitHub documentation including:

  • Definitions
  • Working with parameters
  • LLM Quick Start Guide breaking down the most popular models (GPT, Llama)

https://github.com/NASA-IMPACT/LLM-cookbook-for-open-science

13 of 63

Prompt Engineering Ethics and Best Practices

  • Privacy: Avoid requesting or including personal information; adhere to privacy and security protocols.
  • Misinformation: Reference scientific consensus from credible, peer-reviewed sources.
  • Bias/Fairness: Verify fairness in citations.
  • Ownership/Copyright: Generate original content, respecting copyright laws.
  • Transparency: Ask the model to explain the reasoning behind AI outputs, or use prompts like 'fact check' to help verify them.
  • Iteration on the prompt may be required until the output meets your expectations.
  • Avoid long sessions. Restart sessions when you need to reset context or want to provide different prompting instructions.
  • Do not share sensitive information.

14 of 63

Prompt Patterns for Science

Pattern Category       Prompt Patterns
Output Customization   Recipe, Output Automator, Persona
Interaction            Flipped Interaction
Prompt Improvement     Question Refinement, Alternative Approach, Cognitive Verifier
Error Identification   Fact Check List
Context Control        Context Manager

Reference: White et al. ‘A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT.’ https://arxiv.org/abs/2302.11382

15 of 63

Output Customization: Recipe

Definition

  • Constrains the output to a complete sequence of steps toward a stated goal, given a partial set of "ingredient" steps supplied by the user.

  • Helpful for tasks when the user knows the desired end result and the ingredients needed to achieve the result but not the detailed steps themselves.

Template

“I am trying to [complete a task]. I know that I need [step A, B, C]. Please provide a complete sequence of steps. Please fill in any missing steps.”
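In code, the template is just string construction. This small helper (the function name and arguments are our own, not from the cookbook) fills the Recipe pattern programmatically so it can be reused across tasks:

```python
def recipe_prompt(task, known_steps):
    """Fill the Recipe prompt pattern with a task and the steps already known."""
    steps = ", ".join(known_steps)
    return (f"I am trying to {task}. I know that I need {steps}. "
            "Please provide a complete sequence of steps. "
            "Please fill in any missing steps.")

prompt = recipe_prompt(
    "preprocess Landsat 8 Level-1 data",
    ["to download the data", "to georeference",
     "to apply atmospheric corrections"],
)
```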

16 of 63

Recipe Science Examples

  • I am trying to preprocess Landsat 8 Level-1 data. I know that I need to find and download the data. I know that I need to complete georeferencing, conversion to radiance, solar corrections and atmospheric corrections. I know I will use the ENVI software. Please provide a complete sequence of steps. Please fill in any missing steps.
  • I am trying to find and download infrared data of the Crab Nebula. I know that I need to identify the various coordinates of the Crab Nebula. I know that I need to search for data across a number of astronomical catalogs. Please provide a complete sequence of steps. Please fill in any missing steps.

17 of 63

Output Customization: Output Automator

Definition

  • The LLM generates a script that can automatically perform any steps it recommends taking as part of its output.
  • The goal is to reduce the manual effort needed to implement any LLM output recommendations.

Template

“Create a script that [describes the task to be automated], using [specific parameters or conditions]. Output the steps in [desired format or language].”

18 of 63

Output Automator Science Examples

  • Create a script that automatically compiles and summarizes the number of new planets confirmed in the previous week using the NASA exoplanet archive data. Include data on planet name, host name and discovery method. Output the summary in a CSV format.
  • Create a script that uses the HAPI API to store data from the Parker Solar Probe in an array. Output the summary in a JSON format.
  • Create a script that automatically compiles and summarizes weekly seismic activity reports from the USGS database, focusing on earthquakes above magnitude 4.0. Include data on location, magnitude, depth, and potential affected areas. Output the summary in a CSV format.
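A script like the one requested in the last example might look as follows. This is a sketch, not a vetted product: the feed URL and field layout follow USGS's public GeoJSON earthquake feeds, but verify both against current USGS documentation before relying on it.

```python
import csv
import json
import urllib.request

# Public USGS GeoJSON feed of the past week's earthquakes (check the URL
# against current USGS feed documentation before use).
FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.geojson"

def rows_above_magnitude(feed_json, min_mag=4.0):
    """Extract one summary row per quake above min_mag from USGS GeoJSON."""
    rows = []
    for feature in feed_json["features"]:
        mag = feature["properties"]["mag"]
        if mag is None or mag <= min_mag:
            continue
        lon, lat, depth_km = feature["geometry"]["coordinates"]
        rows.append({"place": feature["properties"]["place"],
                     "magnitude": mag, "depth_km": depth_km,
                     "lat": lat, "lon": lon})
    return rows

def write_weekly_report(path="quakes.csv"):
    """Fetch the feed and write the filtered rows to a CSV file."""
    with urllib.request.urlopen(FEED) as resp:
        feed_json = json.load(resp)
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["place", "magnitude", "depth_km", "lat", "lon"])
        writer.writeheader()
        writer.writerows(rows_above_magnitude(feed_json))
```

Keeping the filtering in a pure function (`rows_above_magnitude`) separate from the network and file I/O makes the LLM-generated script easy to test.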

19 of 63

Output Customization: Persona

Definition

  • Allows the user to specify a point of view or perspective for the LLM to adopt.
  • The pattern allows users to identify what they need help with without knowing the exact details.

Template

“Respond to my questions about [a specific topic or issue] as if you are [specific profession].”

20 of 63

Persona Science Examples

  • Respond to my questions about gravitational waves as if you are an expert astrophysicist.
  • Respond to my questions about the formation of gas planets as if you are an expert planetary scientist.
  • Respond to my questions about the effects of spaceflight on life as if you are an expert space biologist.

21 of 63

Prompt Improvement: Alternative Approach

Definition

  • Allows the LLM to provide alternative approaches to accomplishing a task.

Template

“Provide different approaches to solve [specific problem or task], considering various data, methods, tools, or algorithms that could be applied.”

22 of 63

Alternative Approach Science Examples

  • Provide different approaches to studying the Earth from space, considering various methods, tools, or perspectives that could be applied.
  • Provide different approaches to detecting exoplanets, considering various data, methods, tools, or algorithms that could be applied.
  • Provide different approaches to determining Earth's surface reflectance, considering various data, methods, tools, or algorithms that could be applied.

23 of 63

Putting It All Together: Combining Prompt Patterns

  • For this exercise, the Persona, Recipe and Output Automator prompt patterns will be combined to help you create either a requirements document or a procedure plan related to scientific data governance and management.
  • This activity uses the Modern Scientific Data Governance Framework (mDGF) to help you easily answer questions about government mandates and organizational policies related to scientific data management. You will also be able to use the prompt to create either a requirements document or a procedure plan informed by the mDGF.
  • The goal of this activity is to make it easier to develop a plan to implement what is needed to be compliant with policies and procedures.

24 of 63

Putting It All Together: Combining Prompt Patterns

25 of 63

Recipe 2: Quick Prototype Applications Using RAG

Kaylin Bugbee, NASA/MSFC IMPACT
Ashish Acharya and Nish Pantha, UAH/IMPACT

Walter Alvarado, NASA Ames Research Center

26 of 63

What is RAG?

  • Retrieval Augmented Generation
  • Allows you to provide the LLM with access to
    • More up-to-date information
    • Domain specific information that may not have been available to the LLM

Source: Gartner. What Technical Professionals Need to Know About Large Language Models
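The core RAG loop can be shown without any framework: retrieve relevant text, then prepend it to the prompt. The term-overlap scoring below is a deliberately crude stand-in for the embedding-based retrieval a production RAG system would use, and the documents are made up.

```python
def retrieve(query, documents, k=2):
    """Rank documents by term overlap with the query (a crude stand-in
    for embedding similarity) and return the top-k."""
    query_terms = set(query.lower().split())
    def score(doc):
        return len(query_terms & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:k]

def build_rag_prompt(query, documents):
    """Augment the prompt with retrieved context before calling the LLM."""
    context = "\n".join(retrieve(query, documents))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "OSDR hosts rodent spaceflight transcriptomics data.",
    "The Parker Solar Probe studies the solar corona.",
    "GeneLab provides omics data from spaceflight experiments.",
]
prompt = build_rag_prompt("Where can I find rodent spaceflight data?", docs)
```

The LLM then answers from the injected context rather than from its (possibly stale) training data, which is exactly what grounds the chatbot built later in this recipe.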

27 of 63

Quick Prototyping with LangFlow - RAG Chatbot Example

Goal

  • Demonstrate rapid prototyping with LangFlow using a RAG chatbot for the Open Science Data Repository (OSDR). The OSDR provides open access to NASA’s space biology data including GeneLab and the Ames Life Sciences Data Archive (ALSDA).

Approach

  • Leverage trusted and curated NASA SMD resources from the Science Discovery Engine (SDE) index in order to create a topical chatbot focused on the OSDR.
  • We will utilize prebuilt LangFlow components for SMD, requiring minimal configuration.

Value

  • Science Users: Facilitate direct interaction with authoritative domain-specific sources to streamline workflows
  • Data Stewards: Highlight the benefits of curated SDE resources for chatbot development.

Implementation Steps

  1. Begin with existing workflow templates for speed and efficiency.
  2. Customize the SDE retriever to focus on specific topics or themes.
  3. Engage with the chatbot through the chat interface or a Python interface for versatility.

28 of 63

Science Discovery Engine (SDE)

The SDE is

a source of trusted and curated open science data and information

The SDE includes

Metadata about science data

Code

Documentation

Images

Tutorials

Mission and instrument information

Image credit: SDE team

29 of 63

Platform for Building LLM-Based Applications

Image Credit: NASA IMPACT team

30 of 63

PromptLab

Ashish Acharya, UAH/NASA IMPACT

31 of 63

Objectives

  • Main Goal: Demonstrate integrating LLMs with the Science Discovery Engine (SDE) to build a chatbot that is grounded on scientific documents.
  • A brief introduction to LangChain
  • LangFlow as a GUI tool to work with LangChain
  • PromptLab: A fork of LangFlow that we customized in-house and manage on AWS
  • Takeaways

32 of 63

LangChain

  • A tool for leveraging language models in application development.
  • Integrates language models into various workflows.
  • Enables us to:
    • Automate routine science data tasks
    • Derive insights from large datasets
    • Facilitate innovative scientific experiments

We will see examples of solving scientific problems using LangChain in subsequent presentations today.

Image Source: Microsoft Blog

33 of 63

LangChain

  • LangChain consists of a few modules.
    • Model
    • Prompt
    • Memory
    • Chain
    • Agents
  • As its name suggests, chaining different modules together is the main purpose of LangChain.
  • This dynamic chaining of modules gives us the power to solve various scientific problems with the help of LLMs.
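The chaining idea itself needs no library. This toy pipeline (pure Python, with a fake model call standing in for a real LLM, and all names our own) mirrors how LangChain composes a prompt template, a model, and an output parser:

```python
class Chain:
    """Toy illustration of chaining: each step's output feeds the next."""
    def __init__(self, *steps):
        self.steps = steps

    def run(self, value):
        for step in self.steps:
            value = step(value)
        return value

def build_prompt(question):
    return f"Summarize in one sentence: {question}"

def fake_llm(prompt):
    # Stand-in for a real model call (e.g., an OpenAI or Bedrock client).
    return f"[model output for: {prompt}]"

def parse_output(text):
    return text.strip("[]")

chain = Chain(build_prompt, fake_llm, parse_output)
result = chain.run("What does RAG add to an LLM?")
```

Swapping `fake_llm` for a real model client, or inserting a retrieval step before `build_prompt`, changes the chain's behavior without touching the other modules.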

34 of 63

LangFlow

  • LangFlow is an open-source web tool built in Python that provides a graphical interface for working with LangChain, making it simple to handle Chains, Agents, and prompt engineering.
  • Even non-technical users can build and experiment with LLMs and LangChain while bringing their own data.
  • You can also build bespoke components in Python to solve your science use cases.
  • LangFlow is highly customizable and uses popular technologies such as FastAPI and ReactJS.

35 of 63

LangFlow Community Examples

36 of 63

PromptLab

  • A managed LangFlow instance that we customized and deployed on AWS. You don't have to worry about servers, databases, load balancing, and networking because it's a managed instance.
  • Publishing Flows
    • Feature to publish your flows and share them with other users of the application (Open Science!)
    • Unique API Keys for your published flows to keep track of usage.
    • View published flows from other users and clone them into your own account, then modify them for your own use case.
  • SDE Sinequa Retriever
    • A custom component to talk to the Sinequa instance that powers SDE
    • Can pass a query to SDE and fetch results
    • Then pass the results to other parts of the pipeline (for example, chatbots)
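As a rough sketch of what such a retriever component does under the hood: build a search payload, POST it to the search service, and hand the results downstream. The endpoint shape, payload fields, and auth scheme below are illustrative assumptions, not the actual Sinequa or PromptLab API.

```python
import json
import urllib.request

def build_query_payload(query, source="OSDR", page_size=5):
    """Assemble a search payload. Field names here are illustrative
    assumptions, not the real Sinequa schema."""
    return {"text": query,
            "filters": {"source": source},
            "pageSize": page_size}

def search_sde(endpoint, query, api_key):
    """POST the payload to a search endpoint and return parsed results."""
    payload = json.dumps(build_query_payload(query)).encode()
    request = urllib.request.Request(
        endpoint, data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In PromptLab the equivalent component hides this plumbing; you only configure the query and wire its output into the next node of the flow.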

37 of 63

Sinequa Retriever

38 of 63

Takeaways

  • LangChain is a framework that lets us integrate LLMs into our application workflows
  • LangFlow is a GUI application that lets us build LangChain flows with little to no coding
  • We have customized LangFlow and deployed it to AWS. Our instance is called PromptLab
  • We also built custom components to interact with the Science Discovery Engine (SDE)
  • You will see examples of how we use PromptLab to build a chatbot for the NASA Open Science Data Repository (OSDR) in subsequent presentations

39 of 63

Open Science Data Repository (OSDR) Chatbot using PromptLab

Walter Alvarado, Space Biosciences Research Branch, NASA Ames

40 of 63

Walter Alvarado, Ph.D.

Open Science Data Repository

NASA Ames Research Center

41 of 63

Space Biology Data


The Space Biology Program, within the Biological and Physical Sciences Division (one of SMD's science divisions alongside Earth Science, Heliophysics, Planetary Science, and Astrophysics), studies the impact of space travel on living systems:

  • Data from spaceflown experiments (e.g. cell culture, rodent, plant)
  • Data from ground simulations and analogs (e.g. particle accelerators, centrifuges)

42 of 63

NASA Open Science Data Repository

osdr.nasa.gov/bio

The OSDR holds physiological, phenotypic, imaging, and environmental telemetry data, molecular/omics data, and biospecimen records, spanning tabular, text, imaging, telemetry, video, and code formats.

  • Single Submission Portal (BDME)
  • User Interface/Website Tool for RDSAs (Research Data Submission Agreements)
  • Maximally Open Access with Necessary Controls for Sensitive Data
  • Data Maximally FAIR

43 of 63

NASA Open Science Data Repository

  • 488 Studies
  • 77 Assays
  • 45 Species
  • >150 TB of Data
  • 910 Datasets ('omics and physio-pheno)
  • 91 Original publications have data submitted to OSDR
  • 65 Publications enabled by OSDR data mining
  • 142+ Datasets used in enabled publications

44 of 63

  • In-situ Analytics & Hardware
  • Cloud Labs & Automated Inventory Management
  • Biomonitoring & Precision Space Health
  • Data Management
  • Machine Learning
  • Language Models & Generalist AI
  • Data Standardization & Data Engineering
  • Basic Biological Discovery

References:
  • Scott et al. (2023), Nat Mach Intell, https://rdcu.be/c8jSO
  • Sanders et al. (2023), Nat Mach Intell, https://rdcu.be/c8jSS
  • Li et al. (2023), https://arxiv.org/abs/2311.12045
  • Li et al. (2023), PMID: 38092777
  • Soboczenski (2024), https://huggingface.co/kenobi/NASA_GeneLab_MBT
  • AWS BPS Microscopy Benchmark, https://registry.opendata.aws/bps_microscopy/

45 of 63

Open Science Data Repository (OSDR) Chatbot Evaluation

Nishan Pantha, UAH/NASA IMPACT

46 of 63

Recipe 3: Creating and Deploying LLM Based Applications

Iksha Gurung, UAH/IMPACT

47 of 63

Creating and Deploying AI Applications

Goal

  • Teach developers to create applications using LLMs that enable users to effortlessly query Earth Science or Astrophysics datasets and observations with natural language.

Approach

  • Adopt the LangChain framework for the integration of existing data and information systems.
  • Implement ReACT pattern orchestration for dynamic interaction.

Value

  • Science Users: advanced search capabilities streamline data access workflows, making science more efficient and scalable
  • Data stewards: Increased data utilization and reusability through advanced search capabilities.

Implementation Steps

  1. Identify and define LLM-compatible tools for enhanced query handling.
  2. Employ ReACT patterns for structured data interaction and response.
  3. Implement quantitative validations to ensure accuracy and reliability.
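The ReACT orchestration in step 2 can be sketched in a few lines. Here a scripted decision function stands in for the LLM and a dictionary of callables stands in for real tools (dataset search APIs, etc.); all names are our own, not a real framework's API.

```python
def react_agent(question, decide, tools, max_turns=5):
    """Toy ReACT loop: `decide` plays the LLM, returning either
    ("act", tool_name, argument) or ("finish", answer)."""
    history = [f"Question: {question}"]
    for _ in range(max_turns):
        decision = decide(history)
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, argument = decision
        # Run the chosen tool and feed the observation back to the model.
        observation = tools[tool_name](argument)
        history.append(f"Observation: {observation}")
    return None  # gave up within the turn budget

# Stand-ins for a real LLM and a real dataset-search tool.
def scripted_decide(history):
    if len(history) == 1:
        return ("act", "search_datasets", "surface reflectance")
    return ("finish", history[-1].removeprefix("Observation: "))

tools = {"search_datasets": lambda q: f"Found 3 datasets matching '{q}'"}
answer = react_agent("Which datasets cover surface reflectance?",
                     scripted_decide, tools)
```

A real implementation (e.g., with LangChain agents) replaces `scripted_decide` with prompted LLM calls, but the reason-act-observe loop is the same.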

48 of 63

Recipe 4: Fine Tuning LLMs to Create Custom Applications

Muthukumaran Ramasubramanian, UAH/IMPACT

49 of 63

Fine Tuning LLMs to Create Custom Applications

Goal

  • Modify an existing LLM to aid in scientific thematic data curation.

Approach

  • Use the SMD Encoder model and training (labeled) data to train a classifier.

Value

  • Science Users: thematic search applications enable the discovery of new, research-relevant datasets
  • Data Stewards:
    • Automation Benefit: Augments and streamlines the current manual curation process.
    • Scientific Advancement: Increase discovery and use of new, research-relevant datasets.

Implementation Steps

  1. Begin with fine-tuning the Encoder Model.
  2. Conduct comparative analysis with the Decoder Model.
  3. Explore results against One-Shot and Few-Shot learning methods.
  4. Perform quantitative evaluation to measure classifier performance.
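The essence of step 1, training a lightweight classification head on top of frozen encoder embeddings, fits in a few lines. This pure-Python logistic head over made-up 2-d "embeddings" is only a conceptual stand-in for fine-tuning the real SMD encoder on labeled curation data.

```python
import math

def train_head(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on frozen embedding vectors."""
    dim = len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            grad = p - y                    # dLoss/dz for log loss
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Made-up 2-d "embeddings": class 1 = research-relevant, class 0 = not.
X = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
y = [1, 1, 0, 0]
weights, bias = train_head(X, y)
```

In practice the head would sit on the SMD encoder's output and be trained with a framework like Hugging Face Transformers, but the frozen-features-plus-trainable-head structure is the same.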

50 of 63

Meeting Community Search Needs

Data and information volumes are growing, with content being dispersed across a number of repositories, web pages, code repos, etc…

Scientific communities have a need to curate and share data and information around their specific use case

Leveraging the broader curation effort happening for the SDE, communities can work with the SDE to build curated search applications. Examples include:

    • Environmental Justice
    • Multi-messenger, time domain astronomy

Citation: Bugbee, K., D. Smith, S. Wingo, and E. Foshee (2023), The art of scientific curation, Eos, 104, https://doi.org/10.1029/2023EO230201. Published on 19 May 2023.

March 2024


51 of 63

52 of 63

SMD Large Language Model

53 of 63

Large Language Models: Overview

Fig. Types of Language Model Architectures (Source: https://medium.com/@yulemoon/an-in-depth-look-at-the-transformer-based-models-22e5f5d17b6b)

54 of 63

Large Language Models: Use Cases

55 of 63

Large Language Models: Fine-tuning

Fig. Training and Fine-tuning Pipeline

56 of 63

Encoder LLM

HuggingFace Link

57 of 63

SMD Large Language Model: Training Sources

Dataset                        Domain                                              # Tokens   Ratio
NASA CMR Dataset Description   Earth Science                                       0.3 B      1%
AGU and AMS Papers             Earth Science                                       2.8 B      4%
English Wikipedia              General                                             5.0 B      8%
PubMed Abstracts               Biomedical                                          6.9 B      10%
PMC                            Biomedical                                          18.5 B     28%
SAO/NASA ADS                   Astronomy, Astrophysics, Physics, General Science   32.7 B     49%
Total                                                                              66.2 B     100%

A curated dataset, spanning multiple science domains, sets a solid foundation for future model development and enhancements.

58 of 63

SMD Large Language Model: Base Encoder

Base Model: RoBERTa
Parameter Size: 125 M
Pre-training Strategy: Masked Language Modeling
Application Areas: Named Entity Recognition, Information Retrieval, Sentence Transformers, Extractive QA
Resources Consumed: 192 V100 GPUs, 500K steps, 10 days of training time
Knowledge Distillation: 30M model

A domain-adapted, efficient encoder model marks an advancement for NASA SMD and can be used in supporting many different applications.

HuggingFace Link
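The masked-language-modeling objective can be illustrated in a few lines: hide a fraction of tokens and keep the originals as prediction targets. The 15% rate and `[MASK]` symbol mirror BERT/RoBERTa conventions; the snippet only shows the data side, not the model that learns to fill the blanks.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Prepare one masked-language-modeling example: replace a fraction of
    tokens with a mask symbol and record the originals as targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[position] = token  # the model must predict this
        else:
            masked.append(token)
    return masked, targets

sentence = "the smd encoder was pretrained on scientific text".split()
masked, targets = mask_tokens(sentence)
```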

59 of 63

SMD LLM Downstream Models: Sentence transformer

Sentence Transformers can assist in information retrieval by encoding documents to efficiently understand text semantics and help text analysis tasks.

Fig: Sentence Transformer Architecture

Fig: Sentence Transformer Evaluation

HuggingFace Link
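Retrieval with sentence embeddings reduces to cosine similarity over document vectors. The 3-d vectors below are made up for illustration; a real system would obtain them from the sentence transformer's encode step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=2):
    """Return names of the k documents closest to the query embedding."""
    ranked = sorted(doc_vecs,
                    key=lambda name: cosine(query_vec, doc_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Made-up 3-d embeddings standing in for sentence-transformer output.
doc_vecs = {
    "landsat_guide": [0.9, 0.1, 0.0],
    "exoplanet_catalog": [0.0, 0.9, 0.2],
    "osdr_overview": [0.1, 0.2, 0.9],
}
best = top_k([0.85, 0.15, 0.05], doc_vecs, k=1)
```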

60 of 63

SMD LLM Downstream Models: Passage Reranker

Fig. Passage Reranker Architecture: (Query, Passage) → Encoder → Relevancy Score

Models                MsMarco-Dev   Science QA Dataset
RoBERTa               35.9          31.1
nasa-smd-ibm-ranker   36.4          33.2

Fig. Passage Re-Ranker Evaluation using Relevancy Scores

HuggingFace Link

Passage re-rankers further improve the relevancy of passages retrieved using the sentence transformers
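Functionally, a reranker rescores each (query, passage) pair and reorders the first-stage results. The overlap scorer below is only a stand-in for the trained encoder's relevancy score, and the passages are invented for illustration.

```python
def rerank(query, passages, score_fn, top_n=2):
    """Re-order first-stage retrieval results by a (query, passage) scorer,
    as a cross-encoder passage reranker does."""
    return sorted(passages, key=lambda p: score_fn(query, p),
                  reverse=True)[:top_n]

def overlap_score(query, passage):
    # Stand-in for a trained encoder's relevancy score.
    query_terms = set(query.lower().split())
    return len(query_terms & set(passage.lower().split()))

passages = [
    "Solar wind observations from the Parker Solar Probe.",
    "Surface reflectance products derived from Landsat imagery.",
    "Landsat surface reflectance correction algorithms.",
]
best = rerank("landsat surface reflectance", passages, overlap_score)
```

Because the scorer sees the query and the passage together, a learned reranker can capture interactions that independent sentence embeddings miss.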

61 of 63

SMD LLM Downstream Models: Usage

Sentence transformer

Passage Reranker

62 of 63

SMD LLM Downstream Models: Hands-On

Notebook Demo

63 of 63

Decoder Model: Metadata Extraction