Self-directed Synthetic Dialogues (SDSD)
and other musings on synthetic data
SDSD - Lambert 2024 - 1
Nathan Lambert, July 2024
https://arxiv.org/abs/2407.18421
Synthetic data
Synthetic data is the near-term future of pre- and post-training. It’s real.
Self-directed Synthetic Dialogues (SDSD)
Goal: Replicate Constitutional AI data from Anthropic
Meanwhile:
Constitutional AI: Harmlessness from AI Feedback, Bai et al. 2022
https://arxiv.org/abs/2212.08073
Constitutional AI with Open LLMs, Huang et al. 2024
https://huggingface.co/blog/constitutional_ai
Self-directed Synthetic Dialogues (SDSD)
Changed to: generate our own conversations (rather than redoing them)
Key components:
Generating an online dialogue
Have the language model talk to itself with the provided information.
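A minimal sketch of this self-dialogue step, with a stand-in `generate` function in place of a real LLM call (the function names and seeding fields are illustrative, not the paper's exact pipeline):

```python
# Sketch of the SDSD self-dialogue loop: one model plays both sides of the
# conversation by alternating which role it completes next, seeded with the
# provided information (persona, topic, plan). `generate` is a stand-in for
# a real LLM call (e.g. via an API client).

def generate(prompt: str) -> str:
    """Placeholder LLM call; returns a canned reply for illustration."""
    return f"[reply to: {prompt[-40:]}]"

def self_dialogue(persona: str, topic: str, plan: str, turns: int = 4) -> list[dict]:
    """Have the model talk to itself for `turns` messages."""
    messages = [{"role": "system",
                 "content": f"Persona: {persona}\nTopic: {topic}\nPlan: {plan}"}]
    for i in range(turns):
        role = "user" if i % 2 == 0 else "assistant"  # alternate who speaks
        prompt = "\n".join(m["content"] for m in messages)
        messages.append({"role": role, "content": generate(prompt)})
    return messages

dialogue = self_dialogue("curious student", "synthetic data", "ask, answer, follow up")
```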
Generating an online dialogue
If a violation occurs, write a critique of the message and rewrite it.
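The critique-then-rewrite step can be sketched as below. `generate` and `detect_violation` are stand-ins: in the actual pipeline both are LLM calls (a judge flags the violation against the principles, then the model critiques and rewrites the turn).

```python
# Minimal sketch of the critique-and-rewrite step, assuming a list of
# principles plays the role of the constitution.

def generate(prompt: str) -> str:
    """Stand-in LLM call; echoes a truncated prompt for illustration."""
    return "REVISED: " + prompt[:40]

def detect_violation(message: str, principles: list[str]) -> bool:
    """Toy check: real use asks an LLM judge about each principle."""
    return any(p in message.lower() for p in principles)

def critique_and_revise(message: str, principles: list[str]) -> str:
    """If a message violates a principle, critique it and rewrite it."""
    if not detect_violation(message, principles):
        return message                      # no violation: keep the turn as-is
    critique = generate(f"Critique this message against the principles: {message}")
    return generate(f"Rewrite the message using this critique: {critique}")

revised = critique_and_revise("this reply leaks personal data", ["personal data"])
kept = critique_and_revise("hello there", ["personal data"])
```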
Putting it together
SDSD - Example
SDSD - Statistics
SDSD - Synthetic data lessons
Other interesting recent synthetic datasets
… we selected approximately 70k problems from the NuminaMath-CoT dataset, focusing on those with numerical outputs, most of which are integers. We then utilized a pipeline leveraging GPT-4 to generate TORA-like reasoning paths, executing the code and producing results until the solution was complete. We filtered out solutions where the final answer did not match the reference and repeated this process three times to ensure accuracy and consistency. This iterative approach allowed us to generate high-quality TORA data efficiently.
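The filter-by-reference-answer loop quoted above can be sketched as follows. The names are mine, not from the NuminaMath pipeline, and `solver` stands in for "generate ToRA-style reasoning, execute the code, read off the answer":

```python
# Hedged sketch: keep a problem only if an executed answer matches its
# reference, retrying up to three times ("repeated this process three times").

def solver(problem: str, attempt: int) -> int:
    """Toy solver: problems here are literal arithmetic expressions."""
    return eval(problem)  # real pipeline: run model-written code instead

def filter_solutions(problems: dict[str, int], max_attempts: int = 3) -> dict[str, int]:
    """Filter out solutions whose final answer does not match the reference."""
    kept = {}
    for problem, reference in problems.items():
        for attempt in range(max_attempts):
            if solver(problem, attempt) == reference:
                kept[problem] = reference     # answer matched: keep the solution
                break
    return kept

kept = filter_solutions({"2 + 3": 5, "4 * 4": 15})  # second reference is wrong on purpose
```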
High-quality synthetic data used to train Nemotron 340b.
Helps on IFEval and other tests in our early use :)
Having language models generate their own instructions by manipulating the chat template tokens.
Image credit: https://magazine.sebastianraschka.com/p/instruction-pretraining-llms
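The chat-template trick can be sketched as below: give the model only the tokens that normally precede a user message, so its raw continuation *is* a fresh instruction. The special tokens follow the Llama 3 format; `complete` is a stand-in for a raw (non-chat) completion call to the aligned model.

```python
# Sketch of instruction generation via chat-template token manipulation.

BOS = "<|begin_of_text|>"
USER_HEADER = "<|start_header_id|>user<|end_header_id|>\n\n"
EOT = "<|eot_id|>"

def complete(prefix: str) -> str:
    """Stand-in: a real call would sample a continuation from the model."""
    return "Explain how synthetic data is used in post-training." + EOT

def sample_instruction() -> str:
    prefix = BOS + USER_HEADER            # stop exactly where the user would type
    continuation = complete(prefix)       # the model fills in a plausible instruction
    return continuation.split(EOT)[0].strip()

instruction = sample_instruction()
```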
Crazy project trying to generate 1 billion personas / system prompts plus synthetic instructions.
Get in touch