1 of 20

Self-directed Synthetic Dialogues (SDSD)

and other musing on synthetic data

SDSD - Lambert 2024 - 1

Nathan Lambert, July 2024

https://arxiv.org/abs/2407.18421

2 of 20

Synthetic data

Using outputs of language models to improve other language models
No longer need costly human data
Extensively used in industry and academia
Many, many unknowns
Does not cause model collapse if human + initial data included

SDSD - Lambert 2024 - 2

3 of 20

Synthetic data is the near-term future of pre and post-training. It’s real.

SDSD - Lambert 2024 - 3

4 of 20

Self-directed Synthetic Dialogues (SDSD)

Goal: Replicate Constitutional AI from Anthropic

SDSD - Lambert 2024 - 4

5 of 20

Self-directed Synthetic Dialogues (SDSD)

Goal: Replicate Constitutional AI data from Anthropic

SDSD - Lambert 2024 - 5

Constitutional AI: Harmlessness from AI Feedback, Bai et al. 2022�https://arxiv.org/abs/2212.08073

6 of 20

Self-directed Synthetic Dialogues (SDSD)

Goal: Replicate Constitutional AI data from Anthropic

Meanwhile:

SDSD - Lambert 2024 - 6

Constitutional AI: Harmlessness from AI Feedback, Bai et al. 2022�https://arxiv.org/abs/2212.08073

Constitutional AI with Open LLMs, Huang et al. 2024�https://huggingface.co/blog/constitutional_ai

7 of 20

Self-directed Synthetic Dialogues (SDSD)

Changed to: Generate our own conversations (rather than redo them)

Key components:

Topics (subject of conversation)
Principles (watch for violation, like CAI)
Goals (how to steer conversation to an end)

SDSD - Lambert 2024 - 7

8 of 20

Generating an online dialogue

Have the language model talk to itself with the provided information.

SDSD - Lambert 2024 - 8

9 of 20

Generating an online dialogue

If violation occurs, write a critique of the message and rewrite.

SDSD - Lambert 2024 - 9

10 of 20

Putting it together

SDSD - Lambert 2024 - 10

11 of 20

SDSD - Example

SDSD - Lambert 2024 - 11

12 of 20

SDSD - Lambert 2024 - 12

13 of 20

SDSD - Statistics

SDSD - Lambert 2024 - 13

14 of 20

SDSD - Synthetic data lessons

Automatic filtering and/or verification is required
Debug per-step
Revisions, critiques, and language model feedback are sensitive
Balancing procedurally generation

SDSD - Lambert 2024 - 14

15 of 20

Other interesting recent synthetic datasets

SDSD - Lambert 2024 - 15

16 of 20

AI-MO/NuminaMath-TIR

… we selected approximately 70k problems from the NuminaMath-CoT dataset, focusing on those with numerical outputs, most of which are integers. We then utilized a pipeline leveraging GPT-4 to generate TORA-like reasoning paths, executing the code and producing results until the solution was complete. We filtered out solutions where the final answer did not match the reference and repeated this process three times to ensure accuracy and consistency. This iterative approach allowed us to generate high-quality TORA data efficiently.

SDSD - Lambert 2024 - 16

17 of 20

nvidia/Daring-Anteater

High quality synthetic data used to train Nemotron 340b.

Helps on IFEval and other tests in our early use :)

SDSD - Lambert 2024 - 17

18 of 20

Magpie-Align/Magpie-Pro-MT-300K-v0.1

Having language models generate their own instructions by manipulating the chat template tokens.

Image credit: https://magazine.sebastianraschka.com/p/instruction-pretraining-llms

SDSD - Lambert 2024 - 18

19 of 20

proj-persona/PersonaHub

Crazy project trying to generate 1 billion personas / system prompts plus synthetic instructions.

SDSD - Lambert 2024 - 19

1 of 20

2 of 20

3 of 20

4 of 20

5 of 20

6 of 20

7 of 20

8 of 20

9 of 20

10 of 20

11 of 20

12 of 20

13 of 20

14 of 20

15 of 20

16 of 20

17 of 20

18 of 20

19 of 20

20 of 20