1 of 22

Leveraging Large Language Model(LLM)-based Agents with Multiple Tool Integration for Enhanced Search in the Canadian Astronomy Data Centre

Sara Shishehchi, Shaylin Thadani, Shavon Thadani, Patrick Dowler, Adrian Damian, Hossen Teimoorinia

2 of 22

Agenda

Introduction to LLms and text to SQL

History and challenges of text to SQL

LLM-Based Agents and Tool Integration

Project Overview:

  • Background, Goal and Objectives
  • Architecture

Benchmarking Against Existing Solutions

Demo

3 of 22

LLMs and Text to SQL

Text-to-SQL: a well-established task in natural language processing (NLP) designed to translates natural language queries into executable SQL commands that can retrieve data from a database.

Why text-to-SQL:

  • Democratize access to data
  • No need to learn SQL syntax
  • Eliminates the need for manual data searches through complex GUIs

Why LLMs:

Superior abilities in understanding human instructions and generating code.

4 of 22

History and Challenges of Text to SQL

Challenges of Text-to-SQL:

  • Linguistic Complexity and Ambiguity
  • Schema Understanding and Representation
  • Domain-specific table rules
  • Lack of generated SQL verification.
  • Model’s context window limits when dealing with large-scale real-world databases
  • Lack of large curated dataset to use for fine-tuning

Most recent research focused on In-Context Learning prompt strategies and supervised fine-tuning using data from the target

Hong et al., arXiv:2406.08426v1, Jun 2024

5 of 22

LLM-Based Agents and Tool Integration

Benefits:

  • Verify SQL query and iterate to correct it
  • Incorporate various tools to:
    • Enrich the prompt with additional data (e.g. RAG)
    • Verify a user’s input against factual data. Iterate back with the user and correct
  • Scale and extend to additional agents/use cases
  • Collaborative multi-agent systems proved to accomplish complex tasks

Agent: LLM applications that can execute complex tasks through the use of an architecture that combines LLMs with key modules like planning, tools and memory

6 of 22

Project Background and Goal

Canadian Astronomical Data Centre Database

15 tables with varying relationships

  • 2 primary tables
    • Observation
    • Plane

335 columns of various data types

7 primary columns

    • Band
    • Collection
    • Instrument
    • Filter
    • Calibration Level
    • Data Type
    • Observation Type

Current Data Query Process

  1. Astronomers search data through applying filters on the CADC website
  2. The backend constructs and executes ADQL (Astronomical Data Query Language) code to retrieve the requested observational data
  3. The CADC website displays the list of results
  4. Astronomers can select certain rows and download FITS files

7 of 22

Current Implementation - CADC Advanced Search

8 of 22

Project Goal

9 of 22

Architecture

10 of 22

Key Characteristics of our Architecture

Our Architecture

An Existing solution: CADC Access GPT

• A specialized version of Chat GPT tailored to generating ADQL code�• Access to DB Schema and limited information about the Database

• Generates code in one shot

1. Preprocessing Agent: Corrects and refines user query�

Tools:� • Check Column: Recommends columns if specified column is missing

Alternate Column: Substitutes columns with their correct names

Check Values: Recommends values if specified value is missing

Alternate Values: Substitutes values with their correct names��

2. ADQL Generator Agent: Builds ADQL code from refined query��Tools:� • Database Schema

Mock Database: Cloned copy to test and self-correct before final output

11 of 22

CADC Access GPT vs Our Custom Agent

Criteria for grading

75 questions

4 categories:

  • Spelling Error
  • Incorrect Column
  • Incorrect Value
  • Normal Question (No errors)

12 of 22

Conclusion and Future Work

Conclusion:

  • Supervised fine-tuning and limited model context window pose challenges in text-to-SQL tasks.
  • We developed a chatbot that allows users to interact with Canadian Astronomy Data Center using English language.
  • The chatbot leverages the power of LLM-based agents and integration of various tools to enhance accuracy.
  • Our system outperforms GPT systems built only based on table schema and prompt information.

Future work:

  • Additional tools:
    • Include a tool to fact-check user’s request
    • Apply extra pre-processing steps to understand astronomical language and coordinates
  • Collect more questions to benchmark
  • Extend to collaborative multi-agent systems
  • Extend to use cases beyond text-to-SQL (chat, Q&A)

13 of 22

Demo: Using Incorrect Column Names

Query: Select the columns collection and Data Type from the observation table and plane table where Band is 'UV'

Substitutions:

Data Type dataProductType

Band energy_emBand

14 of 22

Demo: Using Incorrect Column Names

15 of 22

Demo: Using Incorrect Column Names

Query: Select the columns collection and Data Type from the observation table and plane table where Band is 'UV'

Output: SELECT o.collection, p.dataProductType, p.publisherID FROM observation o JOIN plane p ON o.obsID = p.obsID WHERE p.energy_emBand = ‘UV’ LIMIT 5;

16 of 22

Demo: Using Incorrect Value Names

Query: Find the first 20 rows of the instrument_name column where the instrument_name is 3rd band, Fourth band, B-Five, or Sixth band in the observation table.

Substitutions:

3rd band Band 3

Fourth band Band 4

B-Five Band 5

Sixth band Band 6

17 of 22

Demo: Using Incorrect Value Names

18 of 22

Demo: Using Incorrect Value Names

Query: Find the first 20 rows of the instrument_name column where the instrument_name is 3rd band, Fourth band, B-Five, or Sixth band in the observation table.

Output: SELECT p.publisherOD, o.instrument_name FROM observation o JOIN plane p ON o.obsID p.obsID WHERE o.instrument_name IN (‘Band 3’, ‘Band 4’, ‘Band 5’, ‘Band 6’) LIMIT 5;

19 of 22

Demo: Spelling Errors

Query: Get all obsevrations with 'Infrraed' or 'Opitcal' enregy_emBand and colection GEMINY.

Substitutions:

obsevrations observations

Infrraed Infrared

Opitcal Optical

enregy_emBand energy_emBand

colection collection

20 of 22

Demo: Spelling Errors

21 of 22

Demo: Spelling Errors

Query: Get all obsevrations with 'Infrraed' or 'Opitcal' enregy_emBand and colection GEMINY.

Output: SELECT p.publisherID, o.* FROM observation o JOIN plane p ON o.obsID = p.obsID WHERE (p.energy_emBand = Infrared OR p.energy_emBand = Optical’) AND o.collection = ‘GEMINI’ LIMIT 5;

22 of 22