Leveraging Large Language Model(LLM)-based Agents with Multiple Tool Integration for Enhanced Search in the Canadian Astronomy Data Centre
Sara Shishehchi, Shaylin Thadani, Shavon Thadani, Patrick Dowler, Adrian Damian, Hossen Teimoorinia
Agenda
Introduction to LLms and text to SQL
History and challenges of text to SQL
LLM-Based Agents and Tool Integration
Project Overview:
Benchmarking Against Existing Solutions
Demo
LLMs and Text to SQL
Text-to-SQL: a well-established task in natural language processing (NLP) designed to translates natural language queries into executable SQL commands that can retrieve data from a database.
Why text-to-SQL:
Why LLMs:
Superior abilities in understanding human instructions and generating code.
History and Challenges of Text to SQL
Challenges of Text-to-SQL:
Most recent research focused on In-Context Learning prompt strategies and supervised fine-tuning using data from the target
Hong et al., arXiv:2406.08426v1, Jun 2024
LLM-Based Agents and Tool Integration
Benefits:
Agent: LLM applications that can execute complex tasks through the use of an architecture that combines LLMs with key modules like planning, tools and memory
Project Background and Goal
Canadian Astronomical Data Centre Database
15 tables with varying relationships
335 columns of various data types
7 primary columns
Current Data Query Process
Current Implementation - CADC Advanced Search
Project Goal
Architecture
Key Characteristics of our Architecture
Our Architecture
An Existing solution: CADC Access GPT
• A specialized version of Chat GPT tailored to generating ADQL code�• Access to DB Schema and limited information about the Database
• Generates code in one shot
1. Preprocessing Agent: Corrects and refines user query�
Tools:� • Check Column: Recommends columns if specified column is missing
• Alternate Column: Substitutes columns with their correct names
• Check Values: Recommends values if specified value is missing
• Alternate Values: Substitutes values with their correct names��
2. ADQL Generator Agent: Builds ADQL code from refined query��Tools:� • Database Schema
• Mock Database: Cloned copy to test and self-correct before final output
CADC Access GPT vs Our Custom Agent
Criteria for grading
75 questions
4 categories:
Conclusion and Future Work
Conclusion:
Future work:
Demo: Using Incorrect Column Names
Query: Select the columns collection and Data Type from the observation table and plane table where Band is 'UV'
Substitutions:
Data Type dataProductType
Band energy_emBand
Demo: Using Incorrect Column Names
Demo: Using Incorrect Column Names
Query: Select the columns collection and Data Type from the observation table and plane table where Band is 'UV'
Output: SELECT o.collection, p.dataProductType, p.publisherID FROM observation o JOIN plane p ON o.obsID = p.obsID WHERE p.energy_emBand = ‘UV’ LIMIT 5;
Demo: Using Incorrect Value Names
Query: Find the first 20 rows of the instrument_name column where the instrument_name is 3rd band, Fourth band, B-Five, or Sixth band in the observation table.
Substitutions:
3rd band Band 3
Fourth band Band 4
B-Five Band 5
Sixth band Band 6
Demo: Using Incorrect Value Names
Demo: Using Incorrect Value Names
Query: Find the first 20 rows of the instrument_name column where the instrument_name is 3rd band, Fourth band, B-Five, or Sixth band in the observation table.
Output: SELECT p.publisherOD, o.instrument_name FROM observation o JOIN plane p ON o.obsID p.obsID WHERE o.instrument_name IN (‘Band 3’, ‘Band 4’, ‘Band 5’, ‘Band 6’) LIMIT 5;
Demo: Spelling Errors
Query: Get all obsevrations with 'Infrraed' or 'Opitcal' enregy_emBand and colection GEMINY.
Substitutions:
obsevrations observations
Infrraed Infrared
Opitcal Optical
enregy_emBand energy_emBand
colection collection
Demo: Spelling Errors
Demo: Spelling Errors
Query: Get all obsevrations with 'Infrraed' or 'Opitcal' enregy_emBand and colection GEMINY.
Output: SELECT p.publisherID, o.* FROM observation o JOIN plane p ON o.obsID = p.obsID WHERE (p.energy_emBand = ‘Infrared’ OR p.energy_emBand = ‘Optical’) AND o.collection = ‘GEMINI’ LIMIT 5;