Zero-shot HPO and model selection for agentic tasks
ML-Agents
Prior Works
Trial & Error for Simpler Tasks
Other AutoML works
AutoSkLearn [arxiv.org/pdf/2007.04074] (NN)
Google Vizier [arxiv.org/pdf/2207.13676] (NN)
OptFormer [arxiv.org/pdf/2205.13320] (NN)
Zero-shot for Complex Tasks
Our work (ZeroHPO)
Zero-shot for Simpler Tasks
µTransfer [arxiv.org/pdf/2203.03466] (NN)
Zero-shot HPO [arxiv.org/pdf/2007.13382] (NN)
APT [arxiv.org/pdf/2502.04573] (Tabular)
Extending TWIG [arxiv.org/pdf/2412.14801] (Graphs)
Trial & Error for Complex Tasks
Blackbox-PBT [arxiv.org/pdf/1902.01894] (RL)
AutoFT [arxiv.org/pdf/2401.10220] (LMs)
AutoGluon-Tabular [arxiv.org/pdf/2003.06505] (DS)
AutoML-Agent [arxiv.org/pdf/2410.02958] (DS/ML)
Datasets & Baselines
Do all tasks benefit equally from reasoning?
Do certain tasks require a certain range of hyperparameters?
HPO Selection tools
What works best: To Search or to Tune?
Do the findings vary across task types?
Evals and Analysis
How much improvement can we get over direct usage?
How much improvement can we get compared to the user's own expertise?
An Overview of project scope
Part 1: Datasets & Baselines
Datasets
Collecting and creating datasets
Collecting Datasets
Existing benchmarks & datasets for agentic tasks mostly range from 100-500 samples each
Categorizing Dataset Samples
What tasks does each sample focus on?
What issues do the prompts have?
Re-Annotating Samples
In alternative ways (technical / non-technical phrasing); a sample-schema sketch follows the example below
Example
Non-technical: How many people do you think would look at my website next month?
Technical: Can you make a forecasting model to predict the traffic for next month?
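A minimal sketch of how each re-annotated sample could be stored. The field names (task type, difficulty, the two phrasings, prompt issues) are illustrative assumptions, not a settled schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgenticSample:
    """One benchmark sample with both phrasings of the same request.

    Field names are illustrative; the actual schema is still to be decided.
    """
    sample_id: str
    task_type: str             # e.g. "forecasting", "classification", "eda"
    difficulty: str            # e.g. "easy" / "medium" / "hard"
    non_technical_prompt: str  # how a layperson would ask
    technical_prompt: str      # the same request phrased technically
    data_files: list = field(default_factory=list)     # attached datasets, if any
    prompt_issues: list = field(default_factory=list)  # e.g. "ambiguous metric"

# The website-traffic example from this slide, stored in both phrasings:
example = AgenticSample(
    sample_id="traffic-001",
    task_type="forecasting",
    difficulty="easy",
    non_technical_prompt="How many people do you think would look at my website next month?",
    technical_prompt="Can you make a forecasting model to predict the traffic for next month?",
)
```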
Baseline Analysis
How do these agents perform by default?
Effect of prompts (Technical / Non-Technical)
To what extent does performance drop or change based on how you prompt it? Which issues in the prompts affect it the most?
Effect of reasoning on each kind of task
Do all task types benefit equally from reasoning?
How does the number of reasoning tokens affect performance?
Effect of Models and Hyperparameters
What models are better suited to which kinds of tasks?
What ranges of hyperparameters work better for each task? (A baseline-sweep sketch follows below.)
Can we detect the difficulty of a task?
Can we detect the task type and difficulty level directly from the input query itself?
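A rough sketch of the kind of baseline sweep this slide implies: run each sample through every (model, prompt style, hyperparameter) combination and log the score. The search-space values, model names, and the `run_agent` / `score` helpers are all placeholders; establishing the real ranges is exactly the point of Part 1.

```python
import itertools

# Illustrative search space -- the actual ranges are what Part 1 should establish.
MODELS = ["model-a", "model-b"]              # placeholder model names
PROMPT_STYLES = ["technical", "non_technical"]
HYPERPARAMS = {
    "temperature": [0.0, 0.3, 0.7, 1.0],
    "top_p": [0.8, 1.0],
    "reasoning_budget": [0, 1024, 4096],     # allowed reasoning tokens
}

def run_agent(model, prompt, **hp):
    """Hypothetical: call the agent framework and return its final answer."""
    raise NotImplementedError

def score(answer, sample):
    """Hypothetical: task-specific metric (e.g. forecast error for forecasting tasks)."""
    raise NotImplementedError

def baseline_sweep(samples):
    """samples: list of dicts mirroring the sample schema sketched earlier."""
    keys, values = zip(*HYPERPARAMS.items())
    rows = []
    for sample, model, style in itertools.product(samples, MODELS, PROMPT_STYLES):
        prompt = sample[f"{style}_prompt"]
        for combo in itertools.product(*values):
            hp = dict(zip(keys, combo))
            rows.append({"sample_id": sample["sample_id"], "model": model,
                         "prompt_style": style, **hp,
                         "score": score(run_agent(model, prompt, **hp), sample)})
    return rows
```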
Part 1: Discussion
Part 2: HPO Tools (Search / Tune?)
HPO by Tuning
Trying HPO Tools by Fine-tuning
We now have baseline datasets
showing which set of hyperparameters and which model work best for each kind of task and query
Tune or search?
Would fine-tuning work better than search?
Tuning approaches
Testing various models and fine-tuning approaches (a data-format sketch follows below)
When does tuning fail?
On unseen task types? On any particular task?
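One way the "tuning" route could look, sketched under assumptions: turn the Part 1 baseline logs into (query → best-found configuration) pairs and fine-tune a small model to emit the configuration as JSON. The row keys follow the sweep sketch above, and the chat format is a generic instruction-tuning layout, not tied to any specific trainer.

```python
import json

def build_finetune_pairs(baseline_rows, samples_by_id):
    """Group baseline runs by sample and keep the best-scoring config as the target."""
    best = {}
    for row in baseline_rows:
        sid = row["sample_id"]
        if sid not in best or row["score"] > best[sid]["score"]:
            best[sid] = row

    pairs = []
    for sid, row in best.items():
        # Train on the raw user phrasing; the target is the winning configuration.
        query = samples_by_id[sid]["non_technical_prompt"]
        target = {k: row[k] for k in ("model", "temperature", "top_p", "reasoning_budget")}
        pairs.append({
            "messages": [
                {"role": "user", "content": f"Pick a model and hyperparameters for: {query}"},
                {"role": "assistant", "content": json.dumps(target)},
            ]
        })
    return pairs  # feed to whichever SFT trainer we end up using
```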
HPO by Search
Trying HPO Tools by Search
We now have baseline datasets
showing which set of hyperparameters and which model work best for each kind of task and query
Tune or search?
Would search work better than fine-tuning?
Search approaches
Search by query similarity, task similarity, etc. (a similarity-search sketch follows below)
When does searching fail?
On unseen task types? On any particular task?
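A minimal sketch of the query-similarity route, assuming sentence-transformers embeddings and a plain nearest-neighbour lookup over the Part 1 baseline queries; the embedding model name and k are illustrative choices.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def build_index(best_configs):
    """best_configs: list of {"query": str, "config": dict} from the Part 1 baselines."""
    queries = [c["query"] for c in best_configs]
    embeddings = encoder.encode(queries, normalize_embeddings=True)
    return np.asarray(embeddings), best_configs

def search_hpo(new_query, embeddings, best_configs, k=3):
    """Return the configs of the k most similar past queries (cosine similarity)."""
    q = encoder.encode([new_query], normalize_embeddings=True)[0]
    sims = embeddings @ q
    top = np.argsort(-sims)[:k]
    return [best_configs[i]["config"] for i in top]
```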
Part 2: Discussion
Part 3: Evals and Analysis
Testing Internally
On a held-out test set from Part 1
What works better on seen tasks?
Does tuning work better than search on seen task types? How do the comparisons vary across tasks, models, and frameworks?
Can these systems be better than us?
Tested in LM-Arena style: some contributors provide a task along with their chosen model, framework, and hyperparameters, while our system answers the same query on the same framework with its own choice of model and hyperparameters
How do the win rates vary across task, framework, and query types? (A win-rate aggregation sketch follows below.)
Comparisons to observe whether the systems need improvement or are ready to launch
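A small sketch of how the arena-style comparison could be scored: each vote records who won (the contributor's manual setup vs. our HPO system) plus the task and framework, and win rates are aggregated per group. Field names are illustrative assumptions.

```python
from collections import Counter, defaultdict

def win_rates(votes, group_by="task_type"):
    """votes: list of {"winner": "ours" | "manual" | "tie", "task_type": ..., "framework": ...}.

    Returns the per-group win rate of our HPO system, counting ties as half a win.
    """
    counts = defaultdict(Counter)
    for v in votes:
        counts[v[group_by]][v["winner"]] += 1

    rates = {}
    for group, c in counts.items():
        total = sum(c.values())
        rates[group] = (c["ours"] + 0.5 * c["tie"]) / total if total else None
    return rates

# Example: win_rates(votes, group_by="framework") gives the per-framework breakdown.
```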
Testing Publicly
Human Evals (if the internal evals pass)
How do the win rates change between internal tests and public tests?
Which aspects and features deviate between public tests and internal evals?
Annotation platform setup
Setting up an HF Space with Gradio or Streamlit for users to vote between their own choices and our HPO versions. Users upload a file, ask questions, and vote between the default and our HPO variant's responses in a double-blind way (a minimal Gradio sketch follows below).
When is our system more/less useful?
i.e. How does the win rate vary with the user's level of expertise? Can our system outperform those very familiar with the area? On what tasks does this work better? How does it fare on unseen kinds of tasks?
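A minimal Gradio sketch of the double-blind voting Space, assuming two hypothetical helpers `default_answer` and `hpo_answer` that run the agent without and with our HPO choices; the two responses are shuffled so voters cannot tell which side is which.

```python
import random
import gradio as gr

def default_answer(file, question):
    """Hypothetical: run the agent with the user's / default settings."""
    return "default response (placeholder)"

def hpo_answer(file, question):
    """Hypothetical: run the agent with our HPO-selected model and hyperparameters."""
    return "HPO response (placeholder)"

def generate(file, question):
    answers = [("default", default_answer(file, question)),
               ("hpo", hpo_answer(file, question))]
    random.shuffle(answers)  # double-blind: the UI never reveals which side is which
    (label_a, text_a), (label_b, text_b) = answers
    return text_a, text_b, {"A": label_a, "B": label_b}

def vote(choice, mapping):
    # In the real Space this would be logged to a dataset rather than just echoed.
    return f"Vote recorded for the {mapping[choice]} variant."

with gr.Blocks() as demo:
    file = gr.File(label="Upload your data")
    question = gr.Textbox(label="Your question")
    run = gr.Button("Generate responses")
    out_a = gr.Textbox(label="Response A")
    out_b = gr.Textbox(label="Response B")
    mapping = gr.State()  # hidden A/B-to-variant assignment
    choice = gr.Radio(["A", "B"], label="Which response is better?")
    submit = gr.Button("Vote")
    status = gr.Textbox(label="Status")

    run.click(generate, [file, question], [out_a, out_b, mapping])
    submit.click(vote, [choice, mapping], status)

demo.launch()
```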
Part 3: Discussion
Resources (Compute / API Credits)
What we have
Should most likely be sufficient
Should ideally begin experimenting with the Cohere credits before extending to the rest
End goals
Timeline depends on contributors
Titles TBD:
ZeroHPO: Zero-shot HPO for agentic tasks
Do all tasks benefit equally from reasoning?
Other Queries / Add-Ons
Thank you