Industrializing
Continuous Learning
1: retraining framework
2: Tool-Calling
3: Training & evaluating
4: Serving
CONTENTS
A pip-installable sandbox as well as a production environment
https://github.com/aurelienmorgan/retrain-pipelines/tree/master/sample_pipelines
allowing users to focus on their data, model performance and system compatibility
You can launch an execution from anywhere
Adaptable to your needs
from a notebook, via the cell magic
programmatically, using the Python method
from the CLI, using the utility
https://github.com/aurelienmorgan/retrain-pipelines/blob/master/sample_pipelines/dag_engine/example_wf_7.py
An internal DAG engine
An internal WebConsole
[GDrive video link]
Have teams collaborate and share tasks
portable HTML files that can be embedded alongside a serving endpoint, as a standalone document for the version being served
pipeline-card
A central place for your execution, with dedicated sections
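A minimal sketch of what makes such a card portable: everything (markup, styles, data) inlined in a single HTML string, so the file travels on its own. The `render_pipeline_card` helper and the run name are illustrative, not the actual retrain-pipelines implementation.

```python
import html

def render_pipeline_card(execution_name: str, metrics: dict) -> str:
    """Render a minimal, fully self-contained HTML pipeline-card.

    Illustrative sketch only -- the real pipeline-card has richer
    sections (DAG view, logs, artifacts, etc.).
    """
    rows = "\n".join(
        f"<tr><td>{html.escape(name)}</td><td>{value:.4f}</td></tr>"
        for name, value in metrics.items()
    )
    # everything inline: no external css/js, so the file stays portable
    return (
        "<!DOCTYPE html><html><head><style>"
        "table{border-collapse:collapse}td{border:1px solid #888;padding:4px}"
        "</style></head><body>"
        f"<h1>{html.escape(execution_name)}</h1>"
        f"<table>{rows}</table>"
        "</body></html>"
    )

card = render_pipeline_card("example_wf_7 - run #42", {"accuracy": 0.755})
```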
Third party integration - the Hugging Face Hub
https://huggingface.co/retrain-pipelines/function_caller_lora/blob/6cfca854516e0eefdb9898bb9fbc0ee4ab8e1e24/README.md
v0.29 of the retrain-pipelines function_caller_lora adapter
inspectors (1/2)
retrain-pipelines includes programmatic means to investigate any execution in detail, if required. Did any of your parallel trainings go off-road? Inspect it.
Also, the retrain-pipelines Hugging Face Hub integration comes with a model-versions inspector:
inspectors (2/2)
1: retraining framework
2: Tool-Calling
3: Training & evaluating
4: Serving
CONTENTS
State of agentic function-calling
1. user query + a set of definitions of accessible tools => LLM + constrained generation => set of actionable tool-call commands
2. tool calls => code interpreter
3. user query + tool-call responses as context => LLM => response to user
user query:
is 48 a perfect square ?

user query + tool definitions:
is 48 a perfect square ?
{"name": "is_perfect_square", "description": "Checks if a number is a perfect square.", "parameters": {"num": {"description": "The number to check.", "type": "int"}}}
{"name": "is_prime", "description": "Checks if a number is prime.", "parameters": {"num": {"description": "The number to be checked.", "type": "int"}}}

constrained generation (pydantic class or grammar):
is_perfect_square(num=48)

tool-call returned: False

user query and tool-call context => formulate an answer:
no, 48 is not a perfect square.
question-answering task
function-calling task
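The flow above can be sketched end-to-end. This is an illustrative stub, not the retrain-pipelines code: `fake_llm` stands in for the two constrained-generation LLM calls (it returns canned strings), and `run_agent_turn` is a hypothetical name for the loop.

```python
import math
import re

# the "code interpreter" side of the diagram: actual tool implementations
def is_perfect_square(num: int) -> bool:
    return math.isqrt(num) ** 2 == num

TOOLS = {"is_perfect_square": is_perfect_square}

def fake_llm(prompt: str) -> str:
    """Stand-in for the constrained-generation LLM calls of the diagram."""
    if "tool-call returned" in prompt:
        return "no, 48 is not a perfect square."
    return "is_perfect_square(num=48)"

def run_agent_turn(query: str) -> str:
    # 1) user query + tool definitions -> actionable tool-call command
    call = fake_llm(query)
    name, arg = re.fullmatch(r"(\w+)\(num=(\d+)\)", call).groups()
    # 2) execute the tool call via the code interpreter
    result = TOOLS[name](int(arg))
    # 3) user query + tool-call response as context -> answer to user
    return fake_llm(f"{query}\ntool-call returned: {result}")

answer = run_agent_turn("is 48 a perfect square ?")
```

The constrained generation (pydantic class or grammar) is what guarantees the model's output has the parsable `name(arg=value)` shape the loop relies on.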
The Completion and Responses APIs
Chip Huyen's blog post on agents: https://huyenchip.com/2025/01/07/agents.html
Completion API
Responses API
The main differences lie in the structure of the returned "response" (and how identified tool-calls can be accessed)
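To make the structural difference concrete, here are two hand-built sample payloads mimicking the two shapes. The field names follow OpenAI's documented schemas, but these dicts are illustrative samples, not live API responses:

```python
# Completion (chat completions) API shape: tool calls nested under a message
completion_resp = {
    "choices": [{
        "message": {
            "tool_calls": [{
                "type": "function",
                "function": {"name": "is_perfect_square",
                             "arguments": '{"num": 48}'},
            }]
        }
    }]
}

# Responses API shape: tool calls are top-level output items
responses_resp = {
    "output": [
        {"type": "reasoning", "summary": []},
        {"type": "function_call", "name": "is_perfect_square",
         "arguments": '{"num": 48}'},
    ]
}

# accessing the identified tool calls differs accordingly:
completion_calls = [
    tc["function"]["name"]
    for tc in completion_resp["choices"][0]["message"]["tool_calls"]
]
responses_calls = [
    item["name"] for item in responses_resp["output"]
    if item["type"] == "function_call"
]
```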
Berkeley Function-Calling Leaderboard
| Rank | Overall Acc | Model | Single Turn: Non-live (AST) | Single Turn: Live (AST) | Multi Turn | Agentic: Web Search | Agentic: Memory |
|---|---|---|---|---|---|---|---|
| 1 | 70.85 | GLM-4.5 (FC) | 86.6 | 81.72 | 65.62 | 79 | 50.75 |
| 2 | 70.36 | Claude-Opus-4-1-20250805 (FC) | 88.38 | 81.5 | 57.88 | 77 | 62.15 |
| 3 | 70.29 | Claude-Sonnet-4-20250514 (FC) | 88.38 | 81.05 | 54.75 | 84 | 59.35 |
| 4 | 67.87 | GLM-4.5-Air (FC) | 87.15 | 79.42 | 62.5 | 73.5 | 47.53 |
| 5 | 61.6 | Grok-4-0709 (Prompt) | 81.27 | 69.73 | 43.25 | 72 | 54.41 |
| 6 | 61.01 | Grok-4-0709 (FC) | 85.21 | 74.39 | 36.12 | 72.5 | 65.38 |
| 7 | 59.22 | GPT-5-2025-08-07 (FC) | 72.92 | 58.25 | 28.5 | 84.5 | 57.63 |
| 8 | 58.76 | o3-2025-04-16 (Prompt) | 81.42 | 73.43 | 56.12 | 43.5 | 46.45 |
| 9 | 56.07 | Moonshotai-Kimi-K2-Instruct (FC) | 85.17 | 80.83 | 48.75 | 59 | 25.16 |
| 10 | 55.3 | Moonshotai-Kimi-K2-Instruct (Prompt) | 84.02 | 77.57 | 41.25 | 63 | 33.98 |
https://gorilla.cs.berkeley.edu/leaderboard.html
Aug. 25th, 2025
1: retraining framework
2: Tool-Calling
3: Training & evaluating
4: Serving
CONTENTS
A base LLM + an on/off knowledge-enhanced task-expert adapter
A function-calling LoRA adapter with tools memory
The retrain-pipelines/func_calls_ds training dataset
We still need to instruct on the legitimate absence of tool calls!
Data Augmentation & Data Enrichment for hallucination mitigation
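One way to instruct on the legitimate absence of tool calls: augment the training set with queries that need no tool, paired with the same tool definitions and an empty call list as the target. A minimal sketch; the record shape and `augment_with_no_call_examples` are illustrative, not the actual func_calls_ds schema.

```python
import random

# illustrative record shape -- not the actual func_calls_ds schema
base_records = [
    {"query": "is 48 a perfect square ?",
     "tools": ["is_perfect_square", "is_prime"],
     "answer": "is_perfect_square(num=48)"},
]

# queries for which none of the offered tools is relevant
NO_TOOL_QUERIES = [
    "write a haiku about autumn",
    "summarize the plot of Hamlet",
]

def augment_with_no_call_examples(records, seed=0):
    """Teach the model that an empty tool-call list is a valid answer."""
    rng = random.Random(seed)
    extra = [
        {"query": query,
         "tools": rng.choice(records)["tools"],  # tools offered, none relevant
         "answer": "[]"}                         # legitimate absence of calls
        for query in NO_TOOL_QUERIES
    ]
    return records + extra

augmented = augment_with_no_call_examples(base_records)
```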
The PEFT/Unsloth Trainer for the pipeline CPT & SFT tasks
We can either merge the CPT adapter into the base model, or keep training SFT on top of it. Both options keep the adapter 100% on/off-pluggable.
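Conceptually, merging a LoRA adapter into the base weights is just adding the scaled low-rank update, W' = W + (alpha/r) * B @ A. Below is a plain-Python sketch of that arithmetic (in practice one would call e.g. PEFT's `merge_and_unload`); `matmul` and `merge_lora` are illustrative helpers.

```python
def matmul(A, B):
    """Plain-Python matrix multiply (lists of lists)."""
    return [[sum(a * b for a, b in zip(row, col))
             for col in zip(*B)] for row in A]

def merge_lora(W, A, B, alpha: float, r: int):
    """W' = W + (alpha / r) * B @ A  -- the whole LoRA merge."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, BA)]

# 2x2 base weight, rank-1 adapter (r=1)
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]            # r x d_in
B = [[0.5], [0.25]]         # d_out x r
W_merged = merge_lora(W, A, B, alpha=2.0, r=1)
```

Because the update is purely additive, keeping the adapter unmerged (and simply loading/unloading it at inference time) is what makes it 100% on/off-pluggable.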
Evaluating our trained on-demand tool-call expert adapter
https://huggingface.co/retrain-pipelines/function_caller_lora/blob/6cfca854516e0eefdb9898bb9fbc0ee4ab8e1e24/README.md
v0.29 of the retrain-pipelines function_caller_lora adapter (March 2025)
a 75.5% accuracy from its intrinsic knowledge-bank of 4,200+ tools, without the usual extended-context arsenal!
our model adapter scores almost perfectly on not calling any tool when the user query doesn't require one
There are many false negatives in how we computed the previous chart, e.g.:
Tool-call & eval, relationship status: it's complicated
etc.
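Exact string matching flags equivalent calls as wrong (quoted vs. unquoted numbers, keyword-argument order, ...). Normalizing the parsed calls before comparing recovers some of those false negatives. An illustrative sketch, not the pipeline's actual scorer; `parse_call`, `normalize` and `calls_match` are hypothetical helpers.

```python
import ast

def parse_call(call: str):
    """Parse "name(arg=value, ...)" into (name, {arg: value})."""
    node = ast.parse(call, mode="eval").body
    return node.func.id, {kw.arg: kw.value.value for kw in node.keywords}

def normalize(args: dict) -> dict:
    """Coerce numerics so e.g. '48' (str) and 48 (int) compare equal."""
    out = {}
    for key, value in args.items():
        if isinstance(value, (int, float, str)):
            try:
                value = float(value)
            except ValueError:
                pass  # non-numeric string: keep as-is
        out[key] = value
    return out

def calls_match(pred: str, gold: str) -> bool:
    p_name, p_args = parse_call(pred)
    g_name, g_args = parse_call(gold)
    # dict equality also absorbs keyword-argument ordering differences
    return p_name == g_name and normalize(p_args) == normalize(g_args)
```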
1: retraining framework
2: Tool-Calling
3: Training & evaluating
4: Serving
CONTENTS
Multi-adapters single-endpoint server (1/2)
The prompt template plays a major role in guiding the base model to integrate the task-specific expertise correctly.
For inference, it's critical to switch the prompt_template whenever switching between adapters
Multi-adapters single-endpoint server (2/2)
Switching on/off any of the named adapters for batch queries
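The routing logic can be sketched with a registry pairing each named adapter with its own prompt template; requests with no adapter fall through to the raw base model. A stdlib sketch under stated assumptions: `ADAPTERS`, `build_prompt` and `serve_batch` are illustrative names, not the actual server API, and the templates are made-up.

```python
# adapter registry: each named adapter travels with its own prompt template
ADAPTERS = {
    "function_caller": {
        "template": "### Tools:\n{tools}\n### Query:\n{query}\n### Call:\n",
    },
    "summarizer": {
        "template": "Summarize the following:\n{query}\n",
    },
}

def build_prompt(adapter_name, query: str, tools: str = "") -> str:
    """Pick the prompt template matching the (possibly disabled) adapter."""
    if adapter_name is None:          # adapter switched off: raw base model
        return query
    template = ADAPTERS[adapter_name]["template"]
    return template.format(query=query, tools=tools)

def serve_batch(requests):
    """One endpoint, per-request adapter selection (illustrative stub)."""
    return [
        (req.get("adapter"),
         build_prompt(req.get("adapter"), req["query"], req.get("tools", "")))
        for req in requests
    ]

batch = serve_batch([
    {"adapter": "function_caller", "query": "is 48 a perfect square ?",
     "tools": "is_perfect_square, is_prime"},
    {"query": "hello"},               # no adapter: base model, raw prompt
])
```

In a real server the adapter name selected here would also drive which LoRA weights are activated for that request, keeping the two switches (weights and template) in lock-step.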
An army of specialized experts
The use-case covered here is thought of as a stepping stone toward large-scale, adaptable corporate agentic systems
1: retraining framework
2: Tool-Calling
3: Training & evaluating
4: Serving
CONTENTS
Thank You
Hugging Face
GitHub