1 of 29

Industrializing

Continuous Learning

2 of 29

1: retraining framework

2: Tool-Calling

3: Training & evaluating

4: Serving

CONTENTS

3 of 29

1: retraining framework

2: Tool-Calling

3: Training & evaluating

4: Serving

CONTENTS

4 of 29

From a pip-installable sandbox to a production environment

  • pre-built, highly adaptable pipeline examples that work out of the box, allowing users to focus on their data, model performance and system compatibility
  • Key features of retrain-pipelines executions include:
    • Model version blessing
    • Infrastructure validation
    • Comprehensive documentation (pipeline-card)

https://github.com/aurelienmorgan/retrain-pipelines/tree/master/sample_pipelines

5 of 29

You can launch an execution from anywhere

Adaptable to your needs:

  • from a notebook, via the cell magic
  • programmatically, using the Python method
  • from the CLI, using the utility

6 of 29

https://github.com/aurelienmorgan/retrain-pipelines/blob/master/sample_pipelines/dag_engine/example_wf_7.py

An internal DAG engine

  • easy declaration of retraining pipelines
  • can combine taskgroups (async parallel tasks, all getting the same inputs) and sub-DAGs (parallel branches, each getting a slice of the upstream task's input); see the sketch after this list
  • user-defined aggregators for parallel tasks
  • WebConsole customizability
  • all docstrings show in the dynamic DAG rendering
  • all elements can have user-defined CSS styling
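The taskgroup vs. sub-DAG distinction can be pictured with a generic asyncio sketch. This is not the retrain-pipelines DAG-engine syntax (see the linked example_wf_7.py for that); task names and payloads below are made up for illustration.

```python
# Generic asyncio illustration only; NOT the retrain-pipelines DAG-engine API.
import asyncio

async def run_task(name: str, payload) -> str:
    await asyncio.sleep(0)          # stand-in for real work
    return f"{name} <- {payload}"

async def main() -> None:
    upstream_output = ["shard-0", "shard-1", "shard-2"]

    # taskgroup: async parallel tasks, all receiving the same inputs
    taskgroup = await asyncio.gather(
        *(run_task(n, upstream_output) for n in ("eda", "schema_check", "drift_check"))
    )

    # sub-DAG: parallel branches, each receiving one slice of the upstream input
    sub_dag = await asyncio.gather(
        *(run_task(f"train_branch_{i}", shard) for i, shard in enumerate(upstream_output))
    )

    print(*taskgroup, *sub_dag, sep="\n")

asyncio.run(main())
```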

7 of 29

An internal WebConsole

[GDrive video link]

8 of 29

Have teams collaborate and share tasks

9 of 29

pipeline-card

A central place for your execution, with sections for:

  • EDA
  • training
  • key artifacts
  • pipeline DAG

Portable HTML files; they can be embedded with a serving endpoint as a standalone document on the model version being served.

10 of 29

Third party integration - the Hugging Face Hub

11 of 29

inspectors (1/2)

retrain-pipelines includes programmatic means to investigate any execution in detail, if required. Did any of your parallel trainings go off the rails? Inspect it.

Also, the retrain-pipelines Hugging Face Hub integration comes with a model-versions inspector:

12 of 29

inspectors (2/2)

13 of 29

1: retraining framework

2: Tool-Calling

3: Training & evaluating

4: Serving

CONTENTS

14 of 29

State of agentic function-calling

function-calling task:
  user query + a set of definitions of accessible tools
  → LLM + constrained generation (pydantic class or grammar)
  → set of actionable tool-call commands (tool calls)
  → code interpreter

question-answering task:
  user query + tool-call responses as context
  → LLM
  → response to user

Example:

  user query: is 48 a perfect square ?

  function-calling prompt:
    is 48 a perfect square ?
    {"name": "is_perfect_square", "description": "Checks if a number is a perfect square.", "parameters": {"num": {"description": "The number to check.", "type": "int"}}}
    {"name": "is_prime", "description": "Checks if a number is prime.", "parameters": {"num": {"description": "The number to be checked.", "type": "int"}}}

  tool call: is_perfect_square(num=48)

  question-answering prompt (user query and tool-call context => formulate an answer):
    is 48 a perfect square ?
    tool-call returned: False

  response to user: no, 48 is not a perfect square.
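The "pydantic class or grammar" box can be pictured with a minimal sketch: the generated text only becomes an actionable tool-call command once it parses into a schema. The ToolCall class and the raw output string below are illustrative, not taken from any particular library.

```python
# Minimal sketch of schema-constrained tool-call parsing (illustrative only).
import json
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    name: str        # must match one of the advertised tool definitions
    arguments: dict  # e.g. {"num": 48}

raw_llm_output = '{"name": "is_perfect_square", "arguments": {"num": 48}}'

try:
    call = ToolCall(**json.loads(raw_llm_output))
    # hand the validated command over to the code interpreter
    print(f"{call.name}(**{call.arguments})")   # -> is_perfect_square(**{'num': 48})
except (json.JSONDecodeError, ValidationError) as err:
    # malformed generation: retry, or fall back to answering without a tool
    print("rejected tool call:", err)
```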

15 of 29

The Completion and Responses APIs

Chip Huyen's blog post on agents: https://huyenchip.com/2025/01/07/agents.html

Completion API vs. Responses API: the main differences lie in the structure of the returned "response" and in how identified tool calls can be accessed, as sketched below.
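A hedged side-by-side sketch with the OpenAI Python client (the model name is a placeholder and the tool schema just mirrors the slide's example; check the current SDK docs for exact field names): with the Completion API, tool calls hang off `choices[0].message.tool_calls`; with the Responses API, they come back as `function_call` items in the flat `output` list.

```python
from openai import OpenAI

client = OpenAI()

tool_schema = {
    "type": "object",
    "properties": {"num": {"type": "integer", "description": "The number to check."}},
    "required": ["num"],
}

# Completion (chat.completions) API: tools are nested under a "function" key,
# and identified tool calls hang off the returned assistant message.
completion = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "is 48 a perfect square ?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "is_perfect_square",
            "description": "Checks if a number is a perfect square.",
            "parameters": tool_schema,
        },
    }],
)
for tc in completion.choices[0].message.tool_calls or []:
    print(tc.function.name, tc.function.arguments)

# Responses API: the function tool schema is flattened, and tool calls arrive
# as items of type "function_call" in the response's `output` list.
response = client.responses.create(
    model="gpt-4o-mini",  # placeholder model
    input="is 48 a perfect square ?",
    tools=[{
        "type": "function",
        "name": "is_perfect_square",
        "description": "Checks if a number is a perfect square.",
        "parameters": tool_schema,
    }],
)
for item in response.output:
    if item.type == "function_call":
        print(item.name, item.arguments)
```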

16 of 29

Berkeley Function-Calling Leaderboard

(per-category columns report overall accuracy)

| Rank | Overall Acc | Model | Single Turn: Non-live (AST) | Single Turn: Live (AST) | Multi Turn | Agentic: Web Search | Agentic: Memory |
|---|---|---|---|---|---|---|---|
| 1 | 70.85 | GLM-4.5 (FC) | 86.6 | 81.72 | 65.62 | 79 | 50.75 |
| 2 | 70.36 | Claude-Opus-4-1-20250805 (FC) | 88.38 | 81.5 | 57.88 | 77 | 62.15 |
| 3 | 70.29 | Claude-Sonnet-4-20250514 (FC) | 88.38 | 81.05 | 54.75 | 84 | 59.35 |
| 4 | 67.87 | GLM-4.5-Air (FC) | 87.15 | 79.42 | 62.5 | 73.5 | 47.53 |
| 5 | 61.6 | Grok-4-0709 (Prompt) | 81.27 | 69.73 | 43.25 | 72 | 54.41 |
| 6 | 61.01 | Grok-4-0709 (FC) | 85.21 | 74.39 | 36.12 | 72.5 | 65.38 |
| 7 | 59.22 | GPT-5-2025-08-07 (FC) | 72.92 | 58.25 | 28.5 | 84.5 | 57.63 |
| 8 | 58.76 | o3-2025-04-16 (Prompt) | 81.42 | 73.43 | 56.12 | 43.5 | 46.45 |
| 9 | 56.07 | Moonshotai-Kimi-K2-Instruct (FC) | 85.17 | 80.83 | 48.75 | 59 | 25.16 |
| 10 | 55.3 | Moonshotai-Kimi-K2-Instruct (Prompt) | 84.02 | 77.57 | 41.25 | 63 | 33.98 |

https://gorilla.cs.berkeley.edu/leaderboard.html

Aug. 25th, 2025

17 of 29

1: retraining framework

2: Tool-Calling

3: Training & evaluating

4: Serving

CONTENTS

18 of 29

A base LLM + an on/off knowledge-enhanced task-expert adapter

A function-calling LoRA adapter with tools memory
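A minimal sketch of the on/off, plug-and-play adapter idea with `transformers` + `peft`; the base-model and adapter ids below are placeholders, not the ones used by the pipeline.

```python
# Sketch only: placeholder model / adapter ids, NOT retrain-pipelines code.
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base_id = "Qwen/Qwen2.5-0.5B-Instruct"        # placeholder base model
adapter_id = "org/func-calling-lora-adapter"  # placeholder LoRA adapter repo

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

# plug the function-calling expert on top of the frozen base
model = PeftModel.from_pretrained(base, adapter_id, adapter_name="func_calling")

# "on": adapter active -> task-expert behaviour
model.set_adapter("func_calling")

# "off": temporarily run the untouched base model
with model.disable_adapter():
    pass  # plain question-answering, no tool-calling expertise
```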

19 of 29

The retrain-pipelines/func_calls_ds training dataset

We still need to instruct the model on the legitimate absence of tool calls!
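Illustrative records only (this is not the actual retrain-pipelines/func_calls_ds schema); the second example is the kind of sample that teaches the model that "no tool call" is sometimes the right answer.

```python
# Hypothetical training records; field names are made up for illustration.
records = [
    {
        "query": "is 48 a perfect square ?",
        "tools": ["is_perfect_square", "is_prime"],
        "answer": '[{"name": "is_perfect_square", "arguments": {"num": 48}}]',
    },
    {
        "query": "tell me a joke about squares",
        "tools": ["is_perfect_square", "is_prime"],
        "answer": "[]",   # legitimate absence of tool calls
    },
]
```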

20 of 29

Data Augmentation & Data Enrichment for hallucination mitigation

21 of 29

The PEFT/Unsloth Trainer for the pipeline CPT & SFT tasks

We can either merge the CPT adapter into the base model, or keep training it during SFT. Both options keep it 100% on/off pluggable (see the sketch below).
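A sketch of the two options using `peft` (placeholder model and adapter ids, not the actual pipeline code): after continued pre-training (CPT) of a LoRA adapter, either fold it into the base before SFT, or keep fine-tuning the same adapter weights.

```python
# Sketch only: placeholder ids, NOT retrain-pipelines code.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # placeholder
cpt_model = PeftModel.from_pretrained(base, "org/cpt-lora-adapter")        # placeholder

# Option 1: fold the CPT weights into the base, then attach a fresh adapter for SFT
merged_base = cpt_model.merge_and_unload()

# Option 2: keep training the same LoRA weights during SFT
#           (hand `cpt_model` to the SFT trainer as-is)
```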

22 of 29

Evaluating our trained on-demand tool-call expert Adapter

75.5% accuracy from its intrinsic knowledge-bank of 4,200+ tools, without the usual extended-context arsenal!

Our model adapter scores almost perfectly on not calling any tool when the user query doesn't require one.

23 of 29

Tool-call & eval, relationship status: it's complicated

There are many false negatives in how we computed the previous chart, e.g.:

etc.
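One hypothetical illustration of such a false negative (not necessarily one of the cases shown on the slide): an exact string match rejects a generated tool call that is semantically identical to the ground truth.

```python
# Hypothetical false-negative example: exact-match scoring vs. a canonicalized comparison.
import json

expected  = '[{"name": "is_perfect_square", "arguments": {"num": 48}}]'
generated = '[{"name": "is_perfect_square", "arguments": {"num": 48.0}}]'  # float vs int

print(generated == expected)               # False -> counted as an error

def canon(calls: str):
    # normalize numeric argument types before comparing
    return [
        (c["name"], {k: float(v) if isinstance(v, (int, float)) else v
                     for k, v in c["arguments"].items()})
        for c in json.loads(calls)
    ]

print(canon(generated) == canon(expected))  # True -> arguably a correct call
```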

24 of 29

1: retraining framework

2: Tool-Calling

3: Training & evaluating

4: Serving

CONTENTS

25 of 29

Multi-adapters single-endpoint server (1/2)

  • We've seen that LLMs instantiated with the `transformers` library can host on/off, plug-and-play adapters compatible with the `PEFT` library.
  • During SFT training, for effective learning, "query"/"response" pairs are prepended with a well-drafted, task-specific system prompt. It plays a major role in guiding the base model into integrating the task-specific expertise right.

For inference, it's critical to switch the prompt template when switching between adapters (a sketch follows below).
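A sketch of keeping the system prompt in lock-step with whichever adapter is active at inference time; the prompt texts and adapter names are placeholders, and the adapter switch relies on the `peft.PeftModel.set_adapter` call shown earlier.

```python
# Sketch only: placeholder prompts and hypothetical adapter names.
SYSTEM_PROMPTS = {
    "func_calling": "You are a function-calling assistant. Reply with JSON tool calls only.",
    "summarizer":   "You are a summarization assistant. Reply with a concise summary.",
}

def build_messages(peft_model, adapter_name: str, user_query: str) -> list[dict]:
    """Activate the adapter and return the matching chat messages."""
    # switching the adapter without switching the prompt template degrades quality
    peft_model.set_adapter(adapter_name)      # peft.PeftModel API
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[adapter_name]},
        {"role": "user", "content": user_query},
    ]
```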

26 of 29

Multi-adapters single-endpoint server (2/2)

  • Uses a custom retrain-pipelines implementation on top of LitServe by Lightning AI
  • Takes a YAML config file specifying the base model and the list of adapters to be loaded
  • Switches any of the named adapters on/off for batch queries (see the sketch below)
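A minimal LitServe sketch of a single endpoint that routes each request to a named adapter. This is not the retrain-pipelines implementation: the model loading is stubbed out with canned replies, and the request schema is made up for illustration.

```python
# Sketch only: stubbed "experts", hypothetical request schema.
import litserve as ls

class MultiAdapterAPI(ls.LitAPI):
    def setup(self, device):
        # real code would load the base model once and register every adapter
        # listed in the YAML config; a dict of canned replies stands in here
        self.experts = {
            "func_calling": lambda q: '[{"name": "is_perfect_square", "arguments": {"num": 48}}]',
            "summarizer":   lambda q: "a one-line summary",
        }

    def decode_request(self, request):
        # e.g. {"adapter": "func_calling", "query": "is 48 a perfect square ?"}
        return request["adapter"], request["query"]

    def predict(self, inputs):
        adapter_name, query = inputs
        return self.experts[adapter_name](query)   # adapter selected per request

    def encode_response(self, output):
        return {"response": output}

if __name__ == "__main__":
    ls.LitServer(MultiAdapterAPI()).run(port=8000)
```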

27 of 29

An army of specialized experts

The use case covered here is thought of as a stepping stone toward large-scale, adaptable, corporate agentic systems

  • Own your stack – full control, no lock-in
  • Self-hostable – easy deployment
  • All-in-one – one model = complete system
  • Interchangeable expertise – many domain-expert adapters, no context overhead
  • Runs on small models – efficient, low vRAM

28 of 29

1: retraining framework

2: Tool-Calling

3: Training & evaluating

4: Serving

CONTENTS

29 of 29

Thank You

Hugging Face

GitHub