Zero-shot HPO and model selection for agentic tasks
ML-Agents
Prior Works
Trial & Error for Simpler Tasks
Other AutoML works
AutoSkLearn [arxiv.org/pdf/2007.04074] (NN)
Google Vizier [arxiv.org/pdf/2207.13676] (NN)
OptFormer [arxiv.org/pdf/2205.13320] (NN)
Zero-shot for Complex Tasks
Our work (ZeroHPO)
Zero-shot for Simpler Tasks
µTransfer [arxiv.org/pdf/2203.03466] (NN)
Zero-shot HPO [arxiv.org/pdf/2007.13382] (NN)
APT [arxiv.org/pdf/2502.04573] (Tabular)
Extending TWIG [arxiv.org/pdf/2412.14801] (Graphs)
Trial & Error for Complex Tasks
Blackbox-PBT [arxiv.org/pdf/1902.01894] (RL)
AutoFT [arxiv.org/pdf/2401.10220] (LMs)
AutoGluon-Tabular [arxiv.org/pdf/2003.06505] (DS)
AutoML-Agent [arxiv.org/pdf/2410.02958] (DS/ML)
Datasets & Baselines
Do all tasks benefit equally from reasoning?
Do certain tasks require a certain range of hyperparameters?
HPO Selection tools
What works best: To Search or to Tune?
Do the findings vary across task types?
Evals and Analysis
How much improvement can we get over direct usage?
How much improvement can we get compared to the user's own expertise?
An Overview of project scope
Part 1: Datasets & Baselines
Datasets
Collecting and creating datasets
Collecting Datasets
Existing benchmarks & datasets for agentic tasks mostly range from 100-500 samples each
Categorizing Dataset Samples
What tasks does each sample focus on?
What issues do the prompts have?
Re-Annotating Samples
In alternative ways (technical / non-technical phrasing); a sample-schema sketch follows the example below
Example
Non-technical: How many people do you think would look at my website next month?
Technical: Can you make a forecasting model to predict the traffic for next month?
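A minimal sketch of how each re-annotated sample could be stored. The field names (task type, difficulty, the two phrasings, prompt issues) are illustrative assumptions, not a settled schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgenticSample:
    """One benchmark sample with both phrasings of the same request.

    Field names are illustrative; the actual schema is still to be decided.
    """
    sample_id: str
    task_type: str             # e.g. "forecasting", "classification", "eda"
    difficulty: str            # e.g. "easy" / "medium" / "hard"
    non_technical_prompt: str  # how a layperson would ask
    technical_prompt: str      # the same request phrased technically
    data_files: list = field(default_factory=list)     # attached datasets, if any
    prompt_issues: list = field(default_factory=list)  # e.g. "ambiguous metric"

# The website-traffic example from this slide, stored in both phrasings:
example = AgenticSample(
    sample_id="traffic-001",
    task_type="forecasting",
    difficulty="easy",
    non_technical_prompt="How many people do you think would look at my website next month?",
    technical_prompt="Can you make a forecasting model to predict the traffic for next month?",
)
```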
Baseline Analysis
How do these agents perform by default?
Effect of prompts (Technical / Non-Technical)
To what extent does performance drop or change based on how you prompt it? Which issues in the prompts affect it the most?
Effect of reasoning on each kind of task
Do all task types benefit equally from reasoning?
How does the number of reasoning tokens affect performance?
Effect of Models and Hyperparameters
What models are better suited to which kinds of tasks?
What ranges of hyperparameters work better for each task? (A baseline-sweep sketch follows below.)
Can we detect the difficulty of a task?
Can we detect the task type and difficulty level directly from the input query itself?
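A rough sketch of the kind of baseline sweep this slide implies: run each sample through every (model, prompt style, hyperparameter) combination and log the score. The search-space values, model names, and the `run_agent` / `score` helpers are all placeholders; establishing the real ranges is exactly the point of Part 1.

```python
import itertools

# Illustrative search space -- the actual ranges are what Part 1 should establish.
MODELS = ["model-a", "model-b"]              # placeholder model names
PROMPT_STYLES = ["technical", "non_technical"]
HYPERPARAMS = {
    "temperature": [0.0, 0.3, 0.7, 1.0],
    "top_p": [0.8, 1.0],
    "reasoning_budget": [0, 1024, 4096],     # allowed reasoning tokens
}

def run_agent(model, prompt, **hp):
    """Hypothetical: call the agent framework and return its final answer."""
    raise NotImplementedError

def score(answer, sample):
    """Hypothetical: task-specific metric (e.g. forecast error for forecasting tasks)."""
    raise NotImplementedError

def baseline_sweep(samples):
    """samples: list of dicts mirroring the sample schema sketched earlier."""
    keys, values = zip(*HYPERPARAMS.items())
    rows = []
    for sample, model, style in itertools.product(samples, MODELS, PROMPT_STYLES):
        prompt = sample[f"{style}_prompt"]
        for combo in itertools.product(*values):
            hp = dict(zip(keys, combo))
            rows.append({"sample_id": sample["sample_id"], "model": model,
                         "prompt_style": style, **hp,
                         "score": score(run_agent(model, prompt, **hp), sample)})
    return rows
```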
Part 1: Discussion
Part 2: HPO Tools (Search / Tune?)
HPO by Tuning
Trying HPO Tools by Fine-tuning
We now have baseline datasets
showing which set of hyperparameters and which model work best for each kind of task and query
Tune or search?
Would fine-tuning work better than search?
Tuning approaches
Testing various models and fine-tuning approaches (a data-format sketch follows below)
When does tuning fail?
On unseen task types? On any particular task?
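One way the "tuning" route could look, sketched under assumptions: turn the Part 1 baseline logs into (query → best-found configuration) pairs and fine-tune a small model to emit the configuration as JSON. The row keys follow the sweep sketch above, and the chat format is a generic instruction-tuning layout, not tied to any specific trainer.

```python
import json

def build_finetune_pairs(baseline_rows, samples_by_id):
    """Group baseline runs by sample and keep the best-scoring config as the target."""
    best = {}
    for row in baseline_rows:
        sid = row["sample_id"]
        if sid not in best or row["score"] > best[sid]["score"]:
            best[sid] = row

    pairs = []
    for sid, row in best.items():
        # Train on the raw user phrasing; the target is the winning configuration.
        query = samples_by_id[sid]["non_technical_prompt"]
        target = {k: row[k] for k in ("model", "temperature", "top_p", "reasoning_budget")}
        pairs.append({
            "messages": [
                {"role": "user", "content": f"Pick a model and hyperparameters for: {query}"},
                {"role": "assistant", "content": json.dumps(target)},
            ]
        })
    return pairs  # feed to whichever SFT trainer we end up using
```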
HPO by Search
Trying HPO Tools by Search
We now have baseline datasets
showing which set of hyperparameters and which model work best for each kind of task and query
Tune or search?
Would search work better than fine-tuning?
Search approaches
Search by query similarity, task similarity, etc. (a similarity-search sketch follows below)
When does searching fail?
On unseen task types? On any particular task?
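A minimal sketch of the query-similarity route, assuming sentence-transformers embeddings and a plain nearest-neighbour lookup over the Part 1 baseline queries; the embedding model name and k are illustrative choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def build_index(best_configs):
    """best_configs: list of {"query": str, "config": dict} from the Part 1 baselines."""
    queries = [c["query"] for c in best_configs]
    embeddings = encoder.encode(queries, normalize_embeddings=True)
    return np.asarray(embeddings), best_configs

def search_hpo(new_query, embeddings, best_configs, k=3):
    """Return the configs of the k most similar past queries (cosine similarity)."""
    q = encoder.encode([new_query], normalize_embeddings=True)[0]
    sims = embeddings @ q
    top = np.argsort(-sims)[:k]
    return [best_configs[i]["config"] for i in top]
```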
Part 2: Discussion
Part 3: Evals and Analysis
Testing Internally
On a held-out test set from Part 1
What works better on seen tasks?
Does tuning work better than search on seen task types? How do the comparisons vary across tasks, models, and frameworks?
Can these systems be better than us?
Tested in LM-Arena style: some contributors provide a task along with their chosen model, framework, and hyperparameters, while our system answers the same query on the same framework with its own choice of model and hyperparameters
How do the win rates vary across task, framework, and query types? (A win-rate aggregation sketch follows below.)
Comparisons to observe whether the systems need improvement or are ready to launch
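A small sketch of how the arena-style comparison could be scored: each vote records who won (the contributor's manual setup vs. our HPO system) plus the task and framework, and win rates are aggregated per group. Field names are illustrative assumptions.

```python
from collections import Counter, defaultdict

def win_rates(votes, group_by="task_type"):
    """votes: list of {"winner": "ours" | "manual" | "tie", "task_type": ..., "framework": ...}.

    Returns the per-group win rate of our HPO system, counting ties as half a win.
    """
    counts = defaultdict(Counter)
    for v in votes:
        counts[v[group_by]][v["winner"]] += 1

    rates = {}
    for group, c in counts.items():
        total = sum(c.values())
        rates[group] = (c["ours"] + 0.5 * c["tie"]) / total if total else None
    return rates

# Example: win_rates(votes, group_by="framework") gives the per-framework breakdown.
```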
Testing Publicly
Human Evals (if the internal evals pass)
How do the win rates change between internal tests and public tests?
Which aspects and features deviate between public tests and internal evals?
Annotation platform setup
Setting up an HF Space with Gradio or Streamlit for users to vote between their own choices and our HPO versions. Users upload a file, ask questions, and vote between the default and our HPO variant's responses in a double-blind way (a minimal Gradio sketch follows below).
When is our system more/less useful?
i.e. How does the win rate vary with the user's level of expertise? Can our system outperform those very familiar with the area? On what tasks does this work better? How does it fare on unseen kinds of tasks?
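A minimal Gradio sketch of the double-blind voting Space, assuming two hypothetical helpers `default_answer` and `hpo_answer` that run the agent without and with our HPO choices; the two responses are shuffled so voters cannot tell which side is which.

```python
import random
import gradio as gr

def default_answer(file, question):
    """Hypothetical: run the agent with the user's / default settings."""
    return "default response (placeholder)"

def hpo_answer(file, question):
    """Hypothetical: run the agent with our HPO-selected model and hyperparameters."""
    return "HPO response (placeholder)"

def generate(file, question):
    answers = [("default", default_answer(file, question)),
               ("hpo", hpo_answer(file, question))]
    random.shuffle(answers)  # double-blind: the UI never reveals which side is which
    (label_a, text_a), (label_b, text_b) = answers
    return text_a, text_b, {"A": label_a, "B": label_b}

def vote(choice, mapping):
    # In the real Space this would be logged to a dataset rather than just echoed.
    return f"Vote recorded for the {mapping[choice]} variant."

with gr.Blocks() as demo:
    file = gr.File(label="Upload your data")
    question = gr.Textbox(label="Your question")
    run = gr.Button("Generate responses")
    out_a = gr.Textbox(label="Response A")
    out_b = gr.Textbox(label="Response B")
    mapping = gr.State()  # hidden A/B-to-variant assignment
    choice = gr.Radio(["A", "B"], label="Which response is better?")
    submit = gr.Button("Vote")
    status = gr.Textbox(label="Status")

    run.click(generate, [file, question], [out_a, out_b, mapping])
    submit.click(vote, [choice, mapping], status)

demo.launch()
```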
Part 3: Discussion
Resources (Compute / API Credits)
What we have
Should most likely be sufficient
Should ideally begin experimenting with the Cohere credits before extending to the rest
End goals
Timeline depends on contributors
Titles TBD:
ZeroHPO: Zero-shot HPO for agentic tasks
Do all tasks benefit equally from reasoning?
Other Queries / Add-Ons
Thank you