Trillion Parameter Consortium: Generative AI for Science
We will start at 9:00 am CDT
Getting us started
We want to build the world's most powerful FMs for Science
These models need to be examples of responsible AI
We have created the TPC to help do this
Trillion Parameter Consortium: goals
Goal 1. Build an open community of researchers that are interested in creating state-of-the-art large-scale generative AI models (FMs/LLMs) aimed broadly at advancing progress on scientific and engineering problems, by sharing methods, approaches, tools, insights, and workflows.
Goal 2. Incubate, launch, and loosely (voluntarily) coordinate projects that build specific models at specific sites, avoiding unnecessary duplication of effort and maximizing the projects' impact in the broader AI and scientific community. Where possible, we will work out what we can do together for maximum leverage versus what needs to be done in smaller groups.
Goal 3. Create a global network of resources and expertise that can help facilitate teaming and training the next generation of AI and related researchers interested in the development and use of large-scale AI in advancing science and engineering.
AI for Science, Energy and Security
What changed in three years? 2019 → 2022
2020 DOE Office of Science ASCR Advisory Committee report recommending major DOE AI4S program
Report posted here:
Workshops organized on six crosscutting themes:
AI for advanced properties inference and inverse design
AI and robotics for autonomous discovery
AI-based surrogates for high-performance computing
AI for software engineering and programming
AI for prediction and control of complex engineered systems
Foundational, assured AI for scientific knowledge
Energy Storage
Proteins, Polymers, Stockpile Modernization
Materials, Chemistry, Biology
Light Sources, Neutrons
Climate Ensembles
Exascale apps with surrogates: 1000x faster => zettascale now
Code Translation, Optimization
Quantum Compilation, Quantum Algorithms
Accelerators, Buildings, Cities
Reactors, Power Grid, Networks
Hypothesis Formation, Math
Theory and Modeling Synthesis
Leveraging Community Efforts
Foundation Models for Science — Opportunities
After experimenting with GPT-4 in our own research domains in materials chemistry, physics and quantum information, we find that ChatGPT-4 is knowledgeable, frequently wrong, and interesting to talk to. In other words, not unlike a college professor or a colleague. https://arxiv.org/pdf/2304.12208.pdf
It is likely that many of the use cases we imagine in the AI4SES report can be driven directly or indirectly from sufficiently powerful Foundation Models
Leveraging Community Efforts
[Figure: Scientific & Engineering Datasets (Mathematics, Biology, Materials, Chemistry, Particle Physics, Nuclear Physics, Computer Science, Climate, Medicine, Cosmology, Fusion Energy, Accelerators, Reactors, Energy Systems, Manufacturing) feeding Exemplar DOE Mission Tasks (Autonomous Experiments, Scientific Discovery, Digital Twins, Inverse Design, Code Optimization, Accelerated Simulations, Secure Data Infrastructure, Co-Design).]
Since 2019, LLM development has accelerated
Rapid Development of Large Language Models
Explosion of development of LLM-based AI systems since 2019
Foundation Models are replacing narrow AI systems at a rapid pace
Foundation Models are the closest things yet created that hint at the possibility of Artificial General Intelligence
The most capable models today are in the private sector (GPT-4, Claude, GPT-3.5/ChatGPT, Bard)
Large models with emergent behavior
Assistant models have been further trained to act as helpful chatbots
Open Source Models
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
Falcon 40B Instruct – Trained on AWS by TII UAE
https://huggingface.co/tiiuae/falcon-40b-instruct
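A minimal sketch of querying the Falcon-40B-Instruct checkpoint with the Hugging Face transformers pipeline, following the pattern on the model card (the prompt is illustrative; assumes a node with enough GPU memory to host a 40B-parameter model):

```python
import torch
import transformers
from transformers import AutoTokenizer

model_name = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Falcon ships custom modeling code, hence trust_remote_code=True;
# device_map="auto" shards the model across available GPUs.
pipe = transformers.pipeline(
    "text-generation",
    model=model_name,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

out = pipe(
    "Summarize how AI surrogates can accelerate exascale simulations.",
    max_new_tokens=200,
    do_sample=True,
    top_k=10,
)
print(out[0]["generated_text"])
```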
In 2020 we started exploring the feasibility of Trillion Parameter Foundation Models for Science
Generative AI for Science – Trillion Parameter Consortium
If you are interested in participating, please contact me.
Building a useful engine for science is more than training a single model
High-Level Plan – Code base
Test Shots in April Hackathon
Key Objectives for the Model Breakouts
Compute needed to train: ~1M to ~4M GPU hours
PaLM 540B was trained for 29 days on 6144 TPU v4 chips at ~1 Exaflop/s sustained in BF16 (arXiv:2204.02311v2 [cs.CL], 7 Apr 2022)
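A quick consistency check, assuming the standard ~6 FLOPs per parameter per token estimate and PaLM's reported ~780B training tokens: 6 × (540 × 10^9) × (780 × 10^9) ≈ 2.5 × 10^24 FLOPs, and 29 days at a sustained ~10^18 FLOP/s is 29 × 86,400 × 10^18 ≈ 2.5 × 10^24 FLOPs, so the reported figures are self-consistent.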
Training LLMs is a big computing task
State-of-the-art: a 1 trillion parameter model
6 FLOPs per token per parameter
~1-3 months on an exascale system
(2 × 10^19 FLOP/s) × (2 × 10^6 s) = 4 × 10^25 FLOPs == Aurora for 1 month
If N ≈ 1 × 10^12 then D ≈ 10^12.8, i.e. D ≈ 6T tokens
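A few lines of Python reproduce this arithmetic (the 2 × 10^19 FLOP/s sustained rate for Aurora is the slide's assumption; achieved throughput will vary with utilization):

```python
# Back-of-the-envelope training cost using the ~6*N*D FLOPs rule.
N = 1e12                       # parameters (1T)
D = 6e12                       # training tokens (~6T, per the scaling estimate)
flops = 6 * N * D              # ~3.6e25 FLOPs; the slide rounds to 4e25

aurora_rate = 2e19             # sustained FLOP/s (slide's assumption)
seconds = flops / aurora_rate  # ~1.8e6 s

print(f"{flops:.1e} FLOPs -> {seconds / 86400:.0f} days on Aurora")
# ~3.6e+25 FLOPs -> ~21 days, i.e. roughly one month
```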
High-Level Plan – Pretraining data
Thoughts on data preparation
Meta example from genomics
<organism> <gene name> <gene IDs> <aliases> <loci> <gene function text> <dna sequence> <protein name> <protein IDs> <protein function text> <protein sequence>
Sequences longer than the model's context window may have to be chunked, with the headers replicated in each chunk.
The key idea is to put enough information in the header that prompts using names (words) overlap with the names in free text and in semi-structured data, so that IDs and sequences can be integrated; see the sketch below.
These ideas need to be tested via tuning and RAG-type experiments and with early runs on smaller models.
Alternatives are needed
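A minimal sketch of the chunking-with-replicated-headers idea (the field names, character-count budget as a proxy for tokens, and the toy record are all illustrative assumptions, not the project's actual schema):

```python
# Serialize a genomics record as a semi-structured header plus sequence,
# chunking long sequences so every chunk carries the full header and
# names/IDs co-occur with every piece of the sequence.

def make_header(rec: dict) -> str:
    """Semi-structured header carrying names, IDs, and function text."""
    return (
        f"<organism>{rec['organism']}</organism>"
        f"<gene_name>{rec['gene_name']}</gene_name>"
        f"<gene_ids>{','.join(rec['gene_ids'])}</gene_ids>"
        f"<gene_function>{rec['gene_function']}</gene_function>"
    )

def chunk_with_header(header: str, sequence: str, max_chars: int = 2048):
    """Yield training documents, each repeating the full header."""
    budget = max_chars - len(header)
    assert budget > 0, "header alone exceeds the chunk budget"
    for i in range(0, len(sequence), budget):
        yield header + "<dna_sequence>" + sequence[i:i + budget] + "</dna_sequence>"

record = {
    "organism": "E. coli",
    "gene_name": "lacZ",
    "gene_ids": ["b0344", "EG10527"],
    "gene_function": "beta-galactosidase; lactose catabolism",
    "dna_sequence": "ATG" + "ACGT" * 2000,  # toy sequence
}
docs = list(chunk_with_header(make_header(record), record["dna_sequence"]))
print(len(docs), docs[0][:120])
```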
High-Level Plan – Pretraining Campaigns
High-Level Plan – Model Evaluation
444 Authors
132 Institutions
~200 different tasks to evaluate language models
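Benchmarks like this typically score multiple-choice items by comparing the model's log-likelihood of each candidate answer given the question. A minimal sketch of that scoring pattern (the gpt2 checkpoint and the question are illustrative placeholders):

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def choice_logprob(prompt: str, choice: str) -> float:
    """Sum of log-probabilities of the choice tokens given the prompt.
    Uses the common simplification that tokenizing prompt and
    prompt+choice agree on the prompt's token boundary."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # logits at position p predict the token at position p+1
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)
    total = 0.0
    for pos in range(prompt_ids.shape[1], full_ids.shape[1]):
        total += logprobs[pos - 1, full_ids[0, pos]].item()
    return total

prompt = "Q: What force holds nuclei together?\nA:"
choices = [" the strong nuclear force", " gravity", " static electricity"]
scores = {c: choice_logprob(prompt, c) for c in choices}
print(max(scores, key=scores.get))  # model's preferred answer
```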
High-Level Plan – post-pretraining
Important Issues
AI Accelerated Post-Exascale Ecosystem
[Figure: Scientific & Engineering Datasets (Mathematics, Biology, Materials, Chemistry, Particle Physics, Nuclear Physics, Computer Science, Climate, Medicine, Cosmology, Fusion Energy, Accelerators, Reactors, Energy Systems, Manufacturing) and Text and Code Corpora (General Text, Social Media, News, Humanities, History, Law, Digital Libraries, OSTI Archive, Scientific Journals, arXiv, Code Repositories, Laboratory Notes, PubMed, Agency Archives) are used to train Open Science Foundation Models and National Security Foundation Models on DOE and NNSA Exascale Systems, with Common AI Software Frameworks and Responsible AI Techniques. The foundation models are tuned and adapted into downstream models serving Exemplar DOE Mission Tasks (Autonomous Experiments, Scientific Discovery, Digital Twins, Inverse Design, Code Optimization, Accelerated Simulations, Secure Data Infrastructure, Co-Design), connected to the Integrated Research Infrastructure, Online Experimental Facilities, and Strategic Partnerships.]
Alignment, Responsibility and Safety
Managing Risks of widespread use of LLMs
https://arxiv.org/pdf/2306.03809.pdf
Responsible AI for Scientific Use
Goals
Open Questions
TPC related breakout
A Few Closing Comments