1 of 63

Use of AI Large Language Models in SMD

Dr. Rahul Ramachandran, NASA/MSFC IMPACT

Contact:

rahul.ramachandran@nasa.gov

2 of 63

Motivation

  • LLMs increase research efficiency, saving scientists significant time.
    • Streamline workflows: data discovery, access, literature review, and coding.
    • Reduce the share of time spent on "data wrangling" (often cited at ~80% of a project) to far less, accelerating scientific discovery.
  • LLMs enhance data system value: improving data visibility, access, usability, and value.
    • Facilitate new discovery pathways and applications, increasing data system use.
    • Enable easier and more contextual information retrieval, aligning with user expectations.

LLMs offer transformative benefits for NASA SMD, streamlining both science and data system operations, thereby accelerating the pace of discovery and enhancing data utility.

3 of 63

Goal for this session

  • Lean into AI, particularly LLMs, to enhance the Science Mission Directorate's capabilities.
    • Ensure responsible and ethical AI/LLM usage by recognizing limits and implementing safeguards.
  • Promote knowledge exchange and collaboration for proficient LLM application.
  • Offer AI/LLM training covering:
    • Techniques like Retrieval-Augmented Generation (RAG) and prompt engineering.
    • Distinctions between Encoder and Decoder models.

Responsibly integrate AI and LLMs within the Science Mission Directorate, fostering innovation through shared education, collaboration, and best practices.

4 of 63

Build SMD Focused Comprehensive Training Program

  • Establish a knowledge commons for training materials, prompts, and LLM-based applications to foster knowledge sharing.
    • Training on AI/LLMs, covering techniques like RAG, prompt engineering, and fine-tuning.
  • Explain the distinct value and functions of types of models (encoder, decoder).
  • Design specialized training modules for three key groups: all users, developers, and ML developers.

Implementing a structured training program and creating a shared resource are crucial steps toward infusing AI/LLM applications across different stakeholder groups within SMD.

5 of 63

Assumptions

  • Recognize the fast-paced evolution of the LLM landscape and infrastructure.
  • Prepare for frequent releases of improved LLM models.
    • Anticipate a decrease in current limitations and continuous methodological evolution.
  • Expect a future where open-source models effectively meet scientific requirements.
  • Embrace Open Science principles, aligning with Strategic Plan Directive (SPD) 41.
  • Ensure infrastructure flexibility to include Azure, Bedrock, Vertex, IBM WatsonX.

LLM field is rapidly evolving and requires flexibility in anticipation of future advancements while being grounded in Open Science principles

6 of 63

General AI Ethics Reminder

  • Prioritize open models, data, workflows, and code for transparency and collaboration.
  • Devote significant effort to formulating clear, concise, and correct questions.
    • Embrace Albert Einstein's approach: Spend most of the time defining the problem accurately to enable rapid solutions.
    • Ensure questions are precisely worded to retrieve relevant answers.
  • Apply Carl Sagan’s “baloney detection kit” for critical analysis and fact verification.
    • Adhere to the principle of "trust but verify" to maintain rigorous scrutiny.
    • Avoid unquestioning acceptance of AI-generated outputs.
  • Approach AI as a collaborative partner, not a sole decision-maker.
    • Employ the co-pilot analogy, emphasizing shared responsibility in AI interaction.
    • Remain accountable for the integrity and accuracy of AI-assisted outcomes.

Ethical AI use demands openness, critical questioning, collaborative partnership, and vigilant verification to ensure responsible and effective outcomes

7 of 63

Agenda: Recipes for Using LLMs

Recipe   Type                                             Audience
1        Prompt Engineering                               Everyone
2        Quick Prototype Applications Using RAG           Everyone/Developers
3        Creating and Deploying AI Applications           Developers
4        Fine Tuning LLMs to Create Custom Applications   ML Engineers

https://github.com/NASA-IMPACT/LLM-cookbook-for-open-science

chef by Devendra Karkar Noun Project (CC BY 3.0)

8 of 63

Getting access to the workshop environment

9 of 63

Recipe 1: Prompt Engineering for Science

Kaylin Bugbee, NASA/MSFC IMPACT

10 of 63

What is Prompt Engineering?

  • The design and optimization of effective queries or instructions (also known as prompt patterns) for generative AI tools like ChatGPT, in order to elicit the desired responses.
  • Understanding how to write prompts will increase the quality of responses received.
  • In other words, ask the right questions to get better answers.

11 of 63

LLMs have PETs

Parameters

    • Model Parameters
      • Indicate the size of the LLM. LLMs vary in size - larger models do not always equate to better performance.
    • Prompt Parameters
      • Parameters that can be changed or set during prompting. These can be modified either while using the LLM’s API or via the LLM dashboard

Embeddings

    • A mathematical representation of words that captures meaning and context. An embedding is essentially a list of numbers, one per dimension the embedding represents; this list is also called a vector.

Tokens

    • The basic unit of input a LLM uses. It can be a word, part of a word, or some other segment of input. Text is converted to numbers by a tokenizer.
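To make tokens and embeddings concrete, here is a library-free sketch. The greedy longest-match tokenizer and the tiny embedding table are illustrative stand-ins for the BPE/WordPiece tokenizers and learned embedding matrices real LLMs use; the vocabulary and vector values are made up.

```python
def tokenize(text, vocab):
    """Greedy longest-match subword tokenizer: a simplified stand-in for
    the BPE/WordPiece tokenizers used by real LLMs."""
    tokens = []
    for word in text.lower().split():
        while word:
            for end in range(len(word), 0, -1):
                piece = word[:end]
                # Fall back to a single character if no vocab entry matches.
                if piece in vocab or end == 1:
                    tokens.append(piece)
                    word = word[end:]
                    break
    return tokens

# Toy vocabulary and embedding table (values are illustrative, not learned).
vocab = {"geo": 0, "science": 1, "data": 2, "helio": 3}
embeddings = {
    0: [0.12, -0.40, 0.88],   # "geo"
    1: [0.35, 0.10, -0.27],   # "science"
    2: [-0.51, 0.64, 0.02],   # "data"
    3: [0.09, -0.33, 0.71],   # "helio"
}

tokens = tokenize("Geoscience data", vocab)    # ["geo", "science", "data"]
token_ids = [vocab[t] for t in tokens]         # [0, 1, 2]
vectors = [embeddings[i] for i in token_ids]   # one 3-d vector per token
```

Note how "geoscience" is split into two subword tokens: a word missing from the vocabulary is still representable from smaller pieces.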

12 of 63

LLM Information

More information about the basics of LLMs can be found in the GitHub documentation including:

  • Definitions
  • Working with parameters
  • LLM Quick Start Guide breaking down the most popular models (GPT, Llama)

https://github.com/NASA-IMPACT/LLM-cookbook-for-open-science

13 of 63

Prompt Engineering Ethics and Best Practices

  • Privacy: Avoid requesting or including personal information; adhere to privacy and security protocols.
  • Misinformation: Reference scientific consensus from credible, peer-reviewed sources.
  • Bias/Fairness: Verify fairness in citations.
  • Ownership/Copyright: Generate original content, respecting copyright laws.
  • Transparency: Ask the model to explain the reasoning behind AI outputs, or use prompts like 'fact check' to help verify them.
  • Iteration on the prompt may be required until the output meets your expectations.
  • Avoid long sessions. Restart sessions when you need to reset context or want to provide different prompting instructions.
  • Do not share sensitive information.

14 of 63

Prompt Patterns for Science

Pattern Category       Prompt Patterns
Output Customization   Recipe, Output Automator, Persona
Interaction            Flipped Interaction
Prompt Improvement     Question Refinement, Alternative Approach, Cognitive Verifier
Error Identification   Fact Check List
Context Control        Context Manager

Reference: White et al. ‘A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT.’ https://arxiv.org/abs/2302.11382

15 of 63

Output Customization: Recipe

Definition

  • Constrains the output to a complete sequence of steps toward a stated goal, given a partial set of "ingredient" steps supplied by the user.

  • Helpful for tasks when the user knows the desired end result and the ingredients needed to achieve the result but not the detailed steps themselves.

Template

“I am trying to [complete a task]. I know that I need [step A, B, C]. Please provide a complete sequence of steps. Please fill in any missing steps.”
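In code, the template is just string construction. This small helper (the function name and arguments are our own, not from the cookbook) fills the Recipe pattern programmatically so it can be reused across tasks:

```python
def recipe_prompt(task, known_steps):
    """Fill the Recipe prompt pattern with a task and the steps already known."""
    steps = ", ".join(known_steps)
    return (f"I am trying to {task}. I know that I need {steps}. "
            "Please provide a complete sequence of steps. "
            "Please fill in any missing steps.")

prompt = recipe_prompt(
    "preprocess Landsat 8 Level-1 data",
    ["to download the data", "to georeference",
     "to apply atmospheric corrections"],
)
```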

16 of 63

Recipe Science Examples

  • I am trying to preprocess Landsat 8 Level-1 data. I know that I need to find and download the data. I know that I need to complete georeferencing, conversion to radiance, solar corrections and atmospheric corrections. I know I will use the ENVI software. Please provide a complete sequence of steps. Please fill in any missing steps.
  • I am trying to find and download infrared data of the Crab Nebula. I know that I need to identify the various coordinates of the Crab Nebula. I know that I need to search for data across a number of astronomical catalogs. Please provide a complete sequence of steps. Please fill in any missing steps.

17 of 63

Output Customization: Output Automator

Definition

  • The LLM generates a script that can automatically perform any steps it recommends taking as part of its output.
  • The goal is to reduce the manual effort needed to implement any LLM output recommendations.

Template

“Create a script that [describes the task to be automated], using [specific parameters or conditions]. Output the steps in [desired format or language].”

18 of 63

Output Automator Science Examples

  • Create a script that automatically compiles and summarizes the number of new planets confirmed in the previous week using the NASA exoplanet archive data. Include data on planet name, host name and discovery method. Output the summary in a CSV format.
  • Create a script that uses the HAPI API to store data from the Parker Solar Probe in an array. Output the summary in a JSON format.
  • Create a script that automatically compiles and summarizes weekly seismic activity reports from the USGS database, focusing on earthquakes above magnitude 4.0. Include data on location, magnitude, depth, and potential affected areas. Output the summary in a CSV format.
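A script like the one requested in the last example might look as follows. This is a sketch, not a vetted product: the feed URL and field layout follow USGS's public GeoJSON earthquake feeds, but verify both against current USGS documentation before relying on it.

```python
import csv
import json
import urllib.request

# Public USGS GeoJSON feed of the past week's earthquakes (check the URL
# against current USGS feed documentation before use).
FEED = "https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_week.geojson"

def rows_above_magnitude(feed_json, min_mag=4.0):
    """Extract one summary row per quake above min_mag from USGS GeoJSON."""
    rows = []
    for feature in feed_json["features"]:
        mag = feature["properties"]["mag"]
        if mag is None or mag <= min_mag:
            continue
        lon, lat, depth_km = feature["geometry"]["coordinates"]
        rows.append({"place": feature["properties"]["place"],
                     "magnitude": mag, "depth_km": depth_km,
                     "lat": lat, "lon": lon})
    return rows

def write_weekly_report(path="quakes.csv"):
    """Fetch the feed and write the filtered rows to a CSV file."""
    with urllib.request.urlopen(FEED) as resp:
        feed_json = json.load(resp)
    with open(path, "w", newline="") as fh:
        writer = csv.DictWriter(
            fh, fieldnames=["place", "magnitude", "depth_km", "lat", "lon"])
        writer.writeheader()
        writer.writerows(rows_above_magnitude(feed_json))
```

Keeping the filtering in a pure function (`rows_above_magnitude`) separate from the network and file I/O makes the LLM-generated script easy to test.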

19 of 63

Output Customization: Persona

Definition

  • Allows the user to specify a point of view or perspective for the LLM to adopt.
  • The pattern allows users to identify what they need help with without knowing the exact details.

Template

“Respond to my questions about [a specific topic or issue] as if you are [specific profession].”

20 of 63

Persona Science Examples

  • Respond to my questions about gravitational waves as if you are an expert astrophysicist.
  • Respond to my questions about the formation of gas planets as if you are an expert planetary scientist.
  • Respond to my questions about the effects of spaceflight on life as if you are an expert space biologist.

21 of 63

Prompt Improvement: Alternative Approach

Definition

  • Allows the LLM to provide alternative approaches to accomplishing a task.

Template

“Provide different approaches to solve [specific problem or task], considering various data, methods, tools, or algorithms that could be applied.”

22 of 63

Alternative Approach Science Examples

  • Provide different approaches to studying the Earth from space, considering various methods, tools, or perspectives that could be applied.
  • Provide different approaches to detecting exoplanets, considering various data, methods, tools, or algorithms that could be applied.
  • Provide different approaches to determining Earth's surface reflectance, considering various data, methods, tools, or algorithms that could be applied.

23 of 63

Putting It All Together: Combining Prompt Patterns

  • For this exercise, the Persona, Recipe and Output Automator prompt patterns will be combined to help you create either a requirements document or a procedure plan related to scientific data governance and management.
  • This activity uses the Modern Scientific Data Governance Framework (mDGF) to help you easily answer questions about government mandates and organizational policies related to scientific data management. You will also be able to use the prompt to create either a requirements document or a procedure plan informed by the mDGF.
  • The goal of this activity is to make it easier to develop a plan to implement what is needed to be compliant with policies and procedures.

24 of 63

Putting It All Together: Combining Prompt Patterns

25 of 63

Recipe 2: Quick Prototype Applications Using RAG

Kaylin Bugbee, NASA/MSFC IMPACT
Ashish Acharya and Nish Pantha, UAH/IMPACT

Walter Alvarado, NASA Ames Research Center

26 of 63

What is RAG?

  • Retrieval Augmented Generation
  • Allows you to provide the LLM with access to
    • More up-to-date information
    • Domain specific information that may not have been available to the LLM

Source: Gartner. What Technical Professionals Need to Know About Large Language Models
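The core RAG loop can be shown without any framework: retrieve relevant text, then prepend it to the prompt. The term-overlap scoring below is a deliberately crude stand-in for the embedding-based retrieval a production RAG system would use, and the documents are made up.

```python
def retrieve(query, documents, k=2):
    """Rank documents by term overlap with the query (a crude stand-in
    for embedding similarity) and return the top-k."""
    query_terms = set(query.lower().split())
    def score(doc):
        return len(query_terms & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:k]

def build_rag_prompt(query, documents):
    """Augment the prompt with retrieved context before calling the LLM."""
    context = "\n".join(retrieve(query, documents))
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

docs = [
    "OSDR hosts rodent spaceflight transcriptomics data.",
    "The Parker Solar Probe studies the solar corona.",
    "GeneLab provides omics data from spaceflight experiments.",
]
prompt = build_rag_prompt("Where can I find rodent spaceflight data?", docs)
```

The LLM then answers from the injected context rather than from its (possibly stale) training data, which is exactly what grounds the chatbot built later in this recipe.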

27 of 63

Quick Prototyping with LangFlow - RAG Chatbot Example

Goal

  • Demonstrate rapid prototyping with LangFlow using a RAG chatbot for the Open Science Data Repository (OSDR). The OSDR provides open access to NASA’s space biology data including GeneLab and the Ames Life Sciences Data Archive (ALSDA).

Approach

  • Leverage trusted and curated NASA SMD resources from the Science Discovery Engine (SDE) index in order to create a topical chatbot focused on the OSDR.
  • We will utilize prebuilt LangFlow components for SMD, requiring minimal configuration.

Value

  • Science Users: Facilitate direct interaction with authoritative domain-specific sources to streamline workflows
  • Data Stewards: Highlight the benefits of curated SDE resources for chatbot development.

Implementation Steps

  1. Begin with existing workflow templates for speed and efficiency.
  2. Customize the SDE retriever to focus on specific topics or themes.
  3. Engage with the chatbot through the chat interface or a Python interface for versatility.

28 of 63

Science Discovery Engine (SDE)

The SDE is

a source of trusted and curated open science data and information

The SDE includes

Metadata about science data

Code

Documentation

Images

Tutorials

Mission and instrument information

Image credit: SDE team

29 of 63

Platform for Building LLM-Based Applications

Image Credit: NASA IMPACT team

30 of 63

PromptLab

Ashish Acharya, UAH/NASA IMPACT

31 of 63

Objectives

  • Main Goal: Demonstrate integrating LLMs with the Science Discovery Engine (SDE) to build a chatbot that is grounded on scientific documents.
  • A brief introduction to LangChain
  • LangFlow as a GUI tool to work with LangChain
  • PromptLab: A fork of LangFlow that we customized in-house and manage on AWS
  • Takeaways

32 of 63

LangChain

  • A tool for leveraging language models in application development.
  • Integrates language models into various workflows.
  • Enables us to:
    • Automate routine science data tasks
    • Derive insights from large datasets
    • Facilitate innovative scientific experiments

We will see examples of solving scientific problems using LangChain in subsequent presentations today.

Image Source: Microsoft Blog

33 of 63

LangChain

  • LangChain consists of a few modules.
    • Model
    • Prompt
    • Memory
    • Chain
    • Agents
  • As its name suggests, chaining different modules together is the main purpose of LangChain.
  • This dynamic chaining of modules gives us the power to solve various scientific problems with the help of LLMs.
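The chaining idea itself needs no library. This toy pipeline (pure Python, with a fake model call standing in for a real LLM, and all names our own) mirrors how LangChain composes a prompt template, a model, and an output parser:

```python
class Chain:
    """Toy illustration of chaining: each step's output feeds the next."""
    def __init__(self, *steps):
        self.steps = steps

    def run(self, value):
        for step in self.steps:
            value = step(value)
        return value

def build_prompt(question):
    return f"Summarize in one sentence: {question}"

def fake_llm(prompt):
    # Stand-in for a real model call (e.g., an OpenAI or Bedrock client).
    return f"[model output for: {prompt}]"

def parse_output(text):
    return text.strip("[]")

chain = Chain(build_prompt, fake_llm, parse_output)
result = chain.run("What does RAG add to an LLM?")
```

Swapping `fake_llm` for a real model client, or inserting a retrieval step before `build_prompt`, changes the chain's behavior without touching the other modules.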

34 of 63

LangFlow

  • LangFlow is an open-source web tool built in Python that provides a graphical interface for working with LangChain, making it simple to handle Chains, Agents, and prompt engineering.
  • Even non-technical users can build and experiment with LLMs and LangChain while bringing their own data.
  • You can also build bespoke components in Python to solve your science use cases.
  • LangFlow is highly customizable and uses popular technologies such as FastAPI and ReactJS.

35 of 63

LangFlow Community Examples

36 of 63

PromptLab

  • A managed LangFlow instance that we customized and deployed on AWS. You don't have to worry about servers, databases, load balancing, and networking because it's a managed instance.
  • Publishing Flows
    • Feature to publish your flows and share them with other users of the application (Open Science!)
    • Unique API Keys for your published flows to keep track of usage.
    • View published flows from other users and clone them into your own account, then modify them for your own use case.
  • SDE Sinequa Retriever
    • A custom component to talk to the Sinequa instance that powers SDE
    • Can pass a query to SDE and fetch results
    • Then pass the results to other parts of the pipeline (for example, chatbots)
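As a rough sketch of what such a retriever component does under the hood: build a search payload, POST it to the search service, and hand the results downstream. The endpoint shape, payload fields, and auth scheme below are illustrative assumptions, not the actual Sinequa or PromptLab API.

```python
import json
import urllib.request

def build_query_payload(query, source="OSDR", page_size=5):
    """Assemble a search payload. Field names here are illustrative
    assumptions, not the real Sinequa schema."""
    return {"text": query,
            "filters": {"source": source},
            "pageSize": page_size}

def search_sde(endpoint, query, api_key):
    """POST the payload to a search endpoint and return parsed results."""
    payload = json.dumps(build_query_payload(query)).encode()
    request = urllib.request.Request(
        endpoint, data=payload,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

In PromptLab the equivalent component hides this plumbing; you only configure the query and wire its output into the next node of the flow.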

37 of 63

Sinequa Retriever

38 of 63

Takeaways

  • LangChain is a framework that lets us integrate LLMs into our application workflows
  • LangFlow is a GUI application that lets us build LangChain flows with little to no coding
  • We have customized LangFlow and deployed it to AWS. Our instance is called PromptLab
  • We also built custom components to interact with the Science Discovery Engine (SDE)
  • You will see examples of how we use PromptLab to build a chatbot for the NASA Open Science Data Repository (OSDR) in subsequent presentations

39 of 63

Open Science Data Repository (OSDR) Chatbot using PromptLab

Walter Alvarado, Space Biosciences Research Branch, NASA Ames

40 of 63

Walter Alvarado, Ph.D.

Open Science Data Repository

NASA Ames Research Center

41 of 63

Space Biology Data


The Space Biology Program, within the Biological and Physical Sciences Division (one of SMD's science divisions alongside Earth Science, Heliophysics, Planetary Science, and Astrophysics), studies the impact of space travel on living systems:

  • Data from spaceflown experiments (e.g. cell culture, rodent, plant)
  • Data from ground simulations and analogs (e.g. particle accelerators, centrifuges)

42 of 63

NASA Open Science Data Repository

osdr.nasa.gov/bio

The OSDR holds physiological, phenotypic, imaging, and environmental telemetry data, molecular/omics data, and biospecimen records, spanning tabular, text, imaging, telemetry, video, and code formats.

  • Single Submission Portal (BDME)
  • User Interface/Website Tool for RDSAs (Research Data Submission Agreements)
  • Maximally Open Access with Necessary Controls for Sensitive Data
  • Data Maximally FAIR

43 of 63

NASA Open Science Data Repository

  • 488 Studies
  • 77 Assays
  • 45 Species
  • >150 TB of Data
  • 910 Datasets ('omics and physio-pheno)
  • 91 Original publications have data submitted to OSDR
  • 65 Publications enabled by OSDR data mining
  • 142+ Datasets used in enabled publications

44 of 63

  • In-situ Analytics & Hardware
  • Cloud Labs & Automated Inventory Management
  • Biomonitoring & Precision Space Health
  • Data Management
  • Machine Learning
  • Language Models & Generalist AI
  • Data Standardization & Data Engineering
  • Basic Biological Discovery

References:
  • Scott et al. (2023), Nat Mach Intell, https://rdcu.be/c8jSO
  • Sanders et al. (2023), Nat Mach Intell, https://rdcu.be/c8jSS
  • Li et al. (2023), https://arxiv.org/abs/2311.12045
  • Li et al. (2023), PMID: 38092777
  • Soboczenski (2024), https://huggingface.co/kenobi/NASA_GeneLab_MBT
  • AWS BPS Microscopy Benchmark, https://registry.opendata.aws/bps_microscopy/

45 of 63

Open Science Data Repository (OSDR) Chatbot Evaluation

Nishan Pantha, UAH/NASA IMPACT

46 of 63

Recipe 3: Creating and Deploying LLM Based Applications

Iksha Gurung, UAH/IMPACT

47 of 63

Creating and Deploying AI Applications

Goal

  • Teach developers to create applications using LLMs that enable users to effortlessly query Earth Science or Astrophysics datasets and observations with natural language.

Approach

  • Adopt the LangChain framework for the integration of existing data and information systems.
  • Implement ReACT pattern orchestration for dynamic interaction.

Value

  • Science Users: advanced search capabilities streamline data access workflows, making science more efficient and scalable
  • Data stewards: Increased data utilization and reusability through advanced search capabilities.

Implementation Steps

  1. Identify and define LLM-compatible tools for enhanced query handling.
  2. Employ ReACT patterns for structured data interaction and response.
  3. Implement quantitative validations to ensure accuracy and reliability.
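The ReACT orchestration in step 2 can be sketched in a few lines. Here a scripted decision function stands in for the LLM and a dictionary of callables stands in for real tools (dataset search APIs, etc.); all names are our own, not a real framework's API.

```python
def react_agent(question, decide, tools, max_turns=5):
    """Toy ReACT loop: `decide` plays the LLM, returning either
    ("act", tool_name, argument) or ("finish", answer)."""
    history = [f"Question: {question}"]
    for _ in range(max_turns):
        decision = decide(history)
        if decision[0] == "finish":
            return decision[1]
        _, tool_name, argument = decision
        # Run the chosen tool and feed the observation back to the model.
        observation = tools[tool_name](argument)
        history.append(f"Observation: {observation}")
    return None  # gave up within the turn budget

# Stand-ins for a real LLM and a real dataset-search tool.
def scripted_decide(history):
    if len(history) == 1:
        return ("act", "search_datasets", "surface reflectance")
    return ("finish", history[-1].removeprefix("Observation: "))

tools = {"search_datasets": lambda q: f"Found 3 datasets matching '{q}'"}
answer = react_agent("Which datasets cover surface reflectance?",
                     scripted_decide, tools)
```

A real implementation (e.g., with LangChain agents) replaces `scripted_decide` with prompted LLM calls, but the reason-act-observe loop is the same.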

48 of 63

Recipe 4: Fine Tuning LLMs to Create Custom Applications

Muthukumaran Ramasubramanian, UAH/IMPACT

49 of 63

Fine Tuning LLMs to Create Custom Applications

Goal

  • Modify an existing LLM to aid in scientific thematic data curation.

Approach

  • Use the SMD Encoder model and training (labeled) data to train a classifier.

Value

  • Science Users: thematic search applications enable the discovery of new, research-relevant datasets
  • Data Stewards:
    • Automation Benefit: Augments and streamlines the current manual curation process.
    • Scientific Advancement: Increase discovery and use of new, research-relevant datasets.

Implementation Steps

  1. Begin with fine-tuning the Encoder Model.
  2. Conduct comparative analysis with the Decoder Model.
  3. Explore results against One-Shot and Few-Shot learning methods.
  4. Perform quantitative evaluation to measure classifier performance.
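The essence of step 1, training a lightweight classification head on top of frozen encoder embeddings, fits in a few lines. This pure-Python logistic head over made-up 2-d "embeddings" is only a conceptual stand-in for fine-tuning the real SMD encoder on labeled curation data.

```python
import math

def train_head(embeddings, labels, epochs=200, lr=0.5):
    """Fit a logistic-regression head on frozen embedding vectors."""
    dim = len(embeddings[0])
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            grad = p - y                    # dLoss/dz for log loss
            weights = [w - lr * grad * xi for w, xi in zip(weights, x)]
            bias -= lr * grad
    return weights, bias

def predict(weights, bias, x):
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

# Made-up 2-d "embeddings": class 1 = research-relevant, class 0 = not.
X = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
y = [1, 1, 0, 0]
weights, bias = train_head(X, y)
```

In practice the head would sit on the SMD encoder's output and be trained with a framework like Hugging Face Transformers, but the frozen-features-plus-trainable-head structure is the same.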

50 of 63

Meeting Community Search Needs

Data and information volumes are growing, with content being dispersed across a number of repositories, web pages, code repos, etc…

Scientific communities have a need to curate and share data and information around their specific use case

Leveraging the broader curation effort happening for the SDE, communities can work with the SDE to build curated search applications. Examples include:

    • Environmental Justice
    • Multi-messenger, time domain astronomy

Citation: Bugbee, K., D. Smith, S. Wingo, and E. Foshee (2023), The art of scientific curation, Eos, 104, https://doi.org/10.1029/2023EO230201. Published on 19 May 2023.

March 2024


51 of 63

52 of 63

SMD Large Language Model

53 of 63

Large Language Models: Overview

Fig. Types of Language Model Architectures (Source: https://medium.com/@yulemoon/an-in-depth-look-at-the-transformer-based-models-22e5f5d17b6b)

54 of 63

Large Language Models: Use Cases

55 of 63

Large Language Models: Fine-tuning

Fig. Training and Fine-tuning Pipeline

56 of 63

Encoder LLM

HuggingFace Link

57 of 63

SMD Large Language Model: Training Sources

Dataset                        Domain                                              # Tokens   Ratio
NASA CMR Dataset Description   Earth Science                                       0.3 B      1%
AGU and AMS Papers             Earth Science                                       2.8 B      4%
English Wikipedia              General                                             5.0 B      8%
PubMed Abstracts               Biomedical                                          6.9 B      10%
PMC                            Biomedical                                          18.5 B     28%
SAO/NASA ADS                   Astronomy, Astrophysics, Physics, General Science   32.7 B     49%
Total                                                                              66.2 B     100%

A curated dataset, spanning multiple science domains, sets a solid foundation for future model development and enhancements.

58 of 63

SMD Large Language Model: Base Encoder

Base Model: RoBERTa
Parameter Size: 125 M
Pre-training Strategy: Masked Language Modeling
Application Areas: Named Entity Recognition, Information Retrieval, Sentence Transformers, Extractive QA
Resources Consumed: 192 V100 GPUs, 500K steps, 10 days of training time
Knowledge Distillation: 30M model

A domain-adapted, efficient encoder model marks an advancement for NASA SMD and can be used in supporting many different applications.

HuggingFace Link
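The masked-language-modeling objective can be illustrated in a few lines: hide a fraction of tokens and keep the originals as prediction targets. The 15% rate and `[MASK]` symbol mirror BERT/RoBERTa conventions; the snippet only shows the data side, not the model that learns to fill the blanks.

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]", seed=0):
    """Prepare one masked-language-modeling example: replace a fraction of
    tokens with a mask symbol and record the originals as targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for position, token in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_token)
            targets[position] = token  # the model must predict this
        else:
            masked.append(token)
    return masked, targets

sentence = "the smd encoder was pretrained on scientific text".split()
masked, targets = mask_tokens(sentence)
```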

59 of 63

SMD LLM Downstream Models: Sentence transformer

Sentence Transformers can assist in information retrieval by encoding documents to efficiently understand text semantics and help text analysis tasks.

Fig: Sentence Transformer Architecture

Fig: Sentence Transformer Evaluation

HuggingFace Link
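Retrieval with sentence embeddings reduces to cosine similarity over document vectors. The 3-d vectors below are made up for illustration; a real system would obtain them from the sentence transformer's encode step.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k=2):
    """Return names of the k documents closest to the query embedding."""
    ranked = sorted(doc_vecs,
                    key=lambda name: cosine(query_vec, doc_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Made-up 3-d embeddings standing in for sentence-transformer output.
doc_vecs = {
    "landsat_guide": [0.9, 0.1, 0.0],
    "exoplanet_catalog": [0.0, 0.9, 0.2],
    "osdr_overview": [0.1, 0.2, 0.9],
}
best = top_k([0.85, 0.15, 0.05], doc_vecs, k=1)
```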

60 of 63

SMD LLM Downstream Models: Passage Reranker

Fig. Passage Reranker Architecture: (Query, Passage) → Encoder → Relevancy Score

Models                MsMarco-Dev   Science QA Dataset
RoBERTa               35.9          31.1
nasa-smd-ibm-ranker   36.4          33.2

Fig. Passage Re-Ranker Evaluation using Relevancy Scores

HuggingFace Link

Passage re-rankers further improve the relevancy of passages retrieved using the sentence transformers
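Functionally, a reranker rescores each (query, passage) pair and reorders the first-stage results. The overlap scorer below is only a stand-in for the trained encoder's relevancy score, and the passages are invented for illustration.

```python
def rerank(query, passages, score_fn, top_n=2):
    """Re-order first-stage retrieval results by a (query, passage) scorer,
    as a cross-encoder passage reranker does."""
    return sorted(passages, key=lambda p: score_fn(query, p),
                  reverse=True)[:top_n]

def overlap_score(query, passage):
    # Stand-in for a trained encoder's relevancy score.
    query_terms = set(query.lower().split())
    return len(query_terms & set(passage.lower().split()))

passages = [
    "Solar wind observations from the Parker Solar Probe.",
    "Surface reflectance products derived from Landsat imagery.",
    "Landsat surface reflectance correction algorithms.",
]
best = rerank("landsat surface reflectance", passages, overlap_score)
```

Because the scorer sees the query and the passage together, a learned reranker can capture interactions that independent sentence embeddings miss.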

61 of 63

SMD LLM Downstream Models: Usage

Sentence transformer

Passage Reranker

62 of 63

SMD LLM Downstream Models: Hands-On

Notebook Demo

63 of 63

Decoder Model: Metadata Extraction