1 of 49

Trillion Parameter Consortium: Generative AI for Science

We will start at 9:00 am CDT

2 of 49

Getting us started

  • Welcome … we are glad you are here!
  • Overview of the TPC
  • Plan for the next few days
    • AuroraGPT and TPC organizational efforts
  • Some high-level goals for the breakouts
    • Data organization
    • Model related planning, experiments and tests
    • TPC related development and governance
  • Hybrid meeting (with its logistic challenges)
    • Key people to know for getting help

3 of 49

We want to build the world's most powerful FMs for Science

4 of 49

These models need to be examples of responsible AI

5 of 49

We have created the TPC to help do this

6 of 49

Trillion Parameter Consortium: goals

Goal 1. Build an open community of researchers that are interested in creating state-of-the-art large-scale generative AI models (FMs/LLMs) aimed broadly at advancing progress on scientific and engineering problems, by sharing methods, approaches, tools, insights, and workflows.

 Goal 2. Incubate, launch, and loosely (voluntarily) coordinate specific projects to build specific models at specific sites and attempt to avoid unnecessary duplication of effort and to maximize the impact of the projects in the broader AI and scientific community. Where possible we will work out what we can do together for maximum leverage vs. what needs to be done in smaller groups.

 

Goal 3. Create a global network of resources and expertise that can help facilitate teaming and training the next generation of AI and related researchers interested in the development and use of large-scale AI in advancing science and engineering.

7 of 49

AI for Science, Energy and Security

What changed in three years?

  • Language Models (e.g. ChatGPT) released
  • Artificial image generation took off
  • AI folded a billion proteins
  • AI hints at advancing mathematics
  • AI automation of computer programming
  • Explosion of new AI hardware
  • AI accelerates HPC simulations
  • Exascale machines start to arrive

Timeline: 2019 to 2022

2020: DOE Office of Science ASCR Advisory Committee report recommending a major DOE AI4S program

8 of 49

Workshops organized on six crosscutting themes

AI for advanced properties inference and inverse design

AI and robotics for autonomous discovery

AI-based surrogates for high-performance computing

AI for software engineering and programming

AI for prediction and control of complex engineered systems

Foundation, Assured AI for scientific knowledge

Example areas per theme: Energy Storage; Proteins, Polymers, Stockpile modernization; Materials, Chemistry, Biology; Light Sources, Neutrons; Climate Ensembles; Exascale apps with surrogates; 1000x faster => Zettascale now; Code Translation, Optimization; Quantum Compilation, QAlgs; Accelerators, Buildings, Cities; Reactors, Power Grid, Networks; Hypothesis Formation, Math; Theory and Modeling Synthesis

9 of 49

Leveraging Community Efforts


10 of 49

Foundation Models for Science — Opportunities

  • FMs can summarize and distill knowledge – extract information from millions of papers into compact computational representations – PPI networks, materials compositions, code kernels, biological function, etc.
  • FMs can synthesize – combine information from multiple sources – generate small programs for specific tasks – quantum computing programs using Qiskit & Cirq, derivations for applied physics, code for visualization and animation, etc.
  • FMs can generate plans, solve logic problems, and write experimental protocols for robots – powering self-driving labs, generating strategies for problem solving, and planning for testing hypotheses
  • FMs, with additional research, may be able to generate hypotheses to be tested and new theories for exploration – a full-time shared scientific assistant that learns from across all of science is possible

After experimenting with GPT-4 in our own research domains in materials chemistry, physics and quantum information, we find that ChatGPT-4 is knowledgeable, frequently wrong, and interesting to talk to. In other words, not unlike a college professor or a colleague. https://arxiv.org/pdf/2304.12208.pdf

11 of 49

It is likely that many of the use cases we imagine in the AI4SES report can be driven directly or indirectly from sufficiently powerful Foundation Models

12 of 49

Leveraging Community Efforts


Scientific & Engineering Datasets: Mathematics, Biology, Materials, Chemistry, Particle Physics, Nuclear Physics, Computer Science, Climate, Medicine, Cosmology, Fusion Energy, Accelerators, Reactors, Energy Systems, Manufacturing

Exemplar DOE Mission Tasks: Autonomous Experiments, Scientific Discovery, Digital Twins, Inverse Design, Code Optimization, Accelerated Simulations, Secure Data Infrastructure, Co-Design

13 of 49

Since 2019, LLM development has accelerated

14 of 49

Rapid Development of Large Language Models

Explosion of development of LLM-based AI systems since 2019

Foundation Models are replacing narrow AI systems at a rapid pace

Foundation Models are the closest things yet created that hint at the possibility of Artificial General Intelligence

15 of 49

The most capable models today are in the private sector (GPT-4, Claude, ChatGPT-3.5, Bard)

Large models with emergent behavior

16 of 49

17 of 49

18 of 49

Assistant models have been further trained to act as helpful chatbots

19 of 49

Open Source Models

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

20 of 49

Falcon 40B Instruct – Trained on AWS by TII UAE

https://huggingface.co/tiiuae/falcon-40b-instruct

21 of 49

In 2020 we started exploring the feasibility of Trillion Parameter Foundation Models for Science

22 of 49

Generative AI for Science – Trillion Parameter Consortium

  • Intel, HPE, AMD and NVIDIA, …
    • Optimize implementations on our platforms
  • Argonne, Oak Ridge, Brookhaven, Berkeley, Pacific Northwest, … Laboratories
    • Curating data, designing evaluations, scaling models etc.
  • UChicago, Northwestern, UIC, Caltech, UIUC, U Montreal, MILA,
    • Dozens of academic collaborators, NLP, Safety, evaluation, etc.
  • RIKEN, MSKCC, AI2, BSC, LRZ, CSC, ...
    • Strategic partnerships with centers working on similar things
  • Cerebras, SambaNova, Microsoft, Together, Run.AI, Dimension.AI, ..
    • AI industry collaborators

If you are interested in participating please contact me.

23 of 49

Building a useful engine for science is more than training a single model

24 of 49

25 of 49

High-Level Plan – Code base

  • Leveraging Megatron, FlashAttention, DeepSpeed, and other code bases
  • Implementations that are optimized for target GPUs
    • Operator fusions, offloads, etc.
  • Libraries optimized for Aurora (PVC) and Frontier (MI250X) and A100/H100 systems
  • Series of models (7B, 70B, 200B, 1000B) (trillion parameters goal)
  • Multiple levels of parallelism
    • Data Parallel (multiple model instances)
    • Pipeline Parallel (partition by layers)
    • Model Parallel (partition vertically)
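A toy sketch of how the three parallelism levels above compose. The GPU counts and degrees are purely illustrative; real Megatron/DeepSpeed configurations also weigh interconnect topology, per-GPU memory, and batch-size limits:

```python
def factor_parallelism(total_gpus, tensor_parallel, pipeline_parallel):
    """Split a GPU pool into (data, pipeline, tensor) parallel degrees.

    Illustrative only: tensor ("model") parallelism partitions layers
    vertically, pipeline parallelism partitions by layers, and whatever
    GPUs remain form independent data-parallel model replicas.
    """
    model_gpus = tensor_parallel * pipeline_parallel  # GPUs per model replica
    if total_gpus % model_gpus != 0:
        raise ValueError("GPU count must be divisible by TP * PP")
    data_parallel = total_gpus // model_gpus  # independent model instances
    return data_parallel, pipeline_parallel, tensor_parallel

# e.g., 2048 GPUs, 8-way tensor parallel, 16-way pipeline parallel
dp, pp, tp = factor_parallelism(2048, tensor_parallel=8, pipeline_parallel=16)
print(dp, pp, tp)  # 16 data-parallel replicas of 128 GPUs each
```

In practice the product of the three degrees must equal the total GPU count, which is why configs are usually chosen in this divisible form.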

Test Shots in April Hackathon

26 of 49

Key Objectives for the Model Breakouts

  • Recommendations
  • Specific model configurations (~7B, 70B, 200B, and 1T*)
    • Given the target machines
      • Aurora, Polaris, Frontier, Fugaku, Lumi, Cerebras Clusters, etc.
    • Recommended parallelism strategies
      • Model, Pipe, Data
    • Should the 1T model be dense or sparse, MoE or something else?
  • Estimated pre-training times (assume we will have at least 30T tokens of data)
    • Assume Chinchilla scaling for the analysis
    • Include frequent checkpointing and restarts
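A back-of-the-envelope helper for the pre-training time estimates requested above. It assumes ~6 FLOPs per parameter per token, a Chinchilla-style D ≈ 20N token budget, and a padding factor for checkpointing and restarts; all constants are assumptions, not measured numbers:

```python
def pretrain_time_days(n_params, sustained_flops, tokens_per_param=20.0,
                       overhead=1.1):
    """Rough pre-training wall-clock estimate in days.

    total FLOPs ~ 6 * N * D, with D = tokens_per_param * N (assumed
    Chinchilla-style ratio); `overhead` pads for checkpoint/restart cost.
    """
    tokens = tokens_per_param * n_params
    total_flops = 6.0 * n_params * tokens * overhead
    seconds = total_flops / sustained_flops
    return seconds / 86400.0

# Sketch for the slide's model series on a ~2e19 FLOP/s sustained system
for n in (7e9, 70e9, 200e9, 1e12):
    print(f"{n/1e9:.0f}B params: ~{pretrain_time_days(n, 2e19):.1f} days")
```

For the 1T model this lands in the 1-3 month range the later slides quote, before counting crashes and evaluation cycles.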

27 of 49

~1M GPU hours / ~4M GPU hours

28 of 49

Compute Needed to Train

arXiv:2204.02311v2 [cs.CL] 7 Apr 2022

PaLM 540B was trained for 29 days on 4096 TPU v4 chips at ~1 exaflop/s (BF16)

29 of 49

Training LLMs is a big computing task

State-of-the-Art: 1 trillion parameter model

6 FLOPs per token per parameter

~1-3 months on an exascale system

30 of 49

(2 × 10^19 FLOP/s) × (2 × 10^6 s) = 4 × 10^25 FLOPs == Aurora for 1 month

If N ~ 1 × 10^12, then D ~ 10^12.8, i.e., D ~ 6T tokens

31 of 49

High-Level Plan – Pretraining data

  • Broad corpus of data (~30 trillion tokens to start): general text plus four science tranches
    • General text and code – ~2-10? trillion tokens
    • Biology, biochemistry, genomics, proteomics, structural biology, discrete math, medicine, drugs, etc. – ~1-5 trillion tokens
    • Materials, chemistry, nanoscience, x-ray science, neutron science, math, etc. – ~1-5 trillion tokens
    • Mathematical physics, nuclear physics, high-energy and particle physics, QED, QCD, astronomy, astrophysics and cosmology, etc. – ~1-5 trillion tokens
    • Climate, environmental science, atmospheric science, ecology, biogeochemistry, etc. – ~1-5 trillion tokens
  • Different scaling models lead to different estimates for the scale of training data (Chinchilla vs. Kaplan)
    • Kaplan, Brown would scale model parameters faster than data (M>D)
    • Chinchilla would scale data faster than model parameters (D>M)
  • For current planning, let's assume Chinchilla scaling
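The Chinchilla-vs-Kaplan tradeoff above can be illustrated with a small sketch that allocates a fixed compute budget two ways. The exponents and the calibration point are rough readings of the two papers, not the slide's exact figures, so the outputs will not match the slide's round numbers:

```python
def optimal_n_d(compute_flops, n_exp, ref_n=1e9):
    """Scale model size N and token count D under compute C = 6*N*D.

    Calibrated at an assumed reference point where a 1B-parameter model
    is Chinchilla-optimal (D = 20*N). `n_exp` is how fast N grows with
    compute: ~0.5 for Chinchilla (N and D grow together), ~0.73 for a
    Kaplan-style fit (parameters grow faster than data).
    """
    ref_c = 6.0 * ref_n * (20.0 * ref_n)        # compute at the reference
    n = ref_n * (compute_flops / ref_c) ** n_exp
    d = compute_flops / (6.0 * n)               # tokens implied by C = 6*N*D
    return n, d

budget = 4e25  # ~one month of an exascale system, from the slides
for name, n_exp in [("Chinchilla", 0.5), ("Kaplan", 0.73)]:
    n, d = optimal_n_d(budget, n_exp)
    print(f"{name}: N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

The point of the sketch is the direction of the split: a Kaplan-style fit pours the budget into parameters, while Chinchilla demands far more tokens, which is why the data-tranche targets above matter.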

32 of 49

Thoughts on data preparation

  • Our strategy splits data prep into two sets of activities
    • General (mostly web) text and code (general data modules)
    • Scientific datasets (science domain modules) that combine structured and unstructured data
  • Specialized scientific text could go into either, but we need to deduplicate between modules
  • Scientific domain modules
    • Key idea: each input context needs enough “semantic” overlap with the domain text that prompts using that text will have completions that can bring in structured scientific data (sequences, tables, IDs, equations, images, etc.)
    • “Cards” may be one approach that could achieve this
  • For many scientific problems, multidimensional “images” and time series are important datatypes
    • Long contexts (~100K tokens and beyond) might be needed for many types of structured data, but it appears key that these examples use the full context, or at least its end, to generate correct completions, so that models learn to use the long context
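A minimal sketch of the cross-module deduplication step above, using exact-match paragraph fingerprints. A production pipeline would more likely use MinHash/LSH to catch near-duplicates as well; this only catches verbatim overlap after normalization:

```python
import hashlib

def paragraph_fingerprints(text):
    """Exact-match fingerprints for deduplication between data modules.

    Paragraphs are normalized (case and whitespace) and hashed; shared
    hashes between two modules mark content to deduplicate.
    """
    fps = set()
    for para in text.split("\n\n"):
        norm = " ".join(para.lower().split())  # normalize whitespace/case
        if norm:
            fps.add(hashlib.sha256(norm.encode()).hexdigest())
    return fps

# Toy modules: the first paragraph appears in both, with different spacing
general = "Proteins fold into structures.\n\nThe weather is nice."
science = "proteins fold   into structures.\n\nAlphaFold predicts folds."
overlap = paragraph_fingerprints(general) & paragraph_fingerprints(science)
print(len(overlap))  # 1 -- the shared paragraph survives normalization
```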

33 of 49

34 of 49

Meta example from genomics

  • Say we want to include genes, protein function and sequence for many thousands of organisms
  • We might consider encoding datasets along the following lines

<organism> <gene name> <gene IDs> <aliases><loci><gene function text><dna sequence><protein name><protein IDs><protein function text><protein sequence>

Sequences longer than the context window might have to be chunked, with the headers replicated in each chunk

The key idea is to have enough information in the header that prompts using names (words) will overlap with names in text and in semi-structured data, so that IDs and sequences can be integrated

These ideas need to be tested via tuning and RAG-type experiments, and with early runs on smaller models

Alternatives are needed
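A minimal sketch of the chunking-with-replicated-headers idea above. The header fields follow the illustrative genomics layout, and the context budget is an arbitrary toy number:

```python
def chunk_with_header(header, sequence, max_len=2048):
    """Chunk a long sequence, replicating the descriptive header on each
    chunk so every training context keeps the semantic link between
    names/IDs and the raw sequence data."""
    body_len = max_len - len(header)
    if body_len <= 0:
        raise ValueError("header alone exceeds the context budget")
    chunks = []
    for i in range(0, len(sequence), body_len):
        chunks.append(header + sequence[i:i + body_len])
    return chunks

# Toy header and a 9000-character DNA sequence (both illustrative)
header = "<organism>E. coli</organism><gene name>lacZ</gene name>"
seq = "ATG" * 3000
chunks = chunk_with_header(header, seq, max_len=2048)
print(len(chunks), len(chunks[0]))
```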

35 of 49

36 of 49

37 of 49

38 of 49

39 of 49

High-Level Plan – Pretraining Campaigns

  • Basic approach is to sample/mix data during training so that it is roughly balanced (don't train sequentially)
  • One can over/under-sample subsets of data to achieve some target balance
  • You want to see each token 1-4 times; more than 4 times increases the risk of memorization
  • Models are often unstable and will crash during training
  • There are also periods of warmup, etc.
  • 150,000 iterations are typical, with checkpoints every 1K iterations to enable recovery
  • Smaller models are trained first (and in parallel) as pilots
  • 1 month of exascale compute is probably a minimum, spread over 3-4 months of real time
  • 3 months of exascale compute probably spreads over 9 months
  • Models need to be monitored 24/7 during training to recover quickly
  • SOTA models have crashed ~10-30 times during training
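The over/under-sampling and 1-4x token-epoch guidance above can be checked with a small helper. Corpus sizes and the target mix below are toy numbers, not a proposed plan:

```python
def sampling_plan(token_counts, target_shares, train_tokens):
    """For each dataset, compute how many times each token is seen when
    sampling according to `target_shares` for `train_tokens` total, and
    flag any dataset exceeding the ~4-epoch memorization-risk guideline.
    """
    plan = {}
    for name, share in target_shares.items():
        drawn = share * train_tokens          # tokens drawn from this set
        epochs = drawn / token_counts[name]   # times each token is seen
        plan[name] = (epochs, epochs > 4.0)
    return plan

counts = {"web": 10e12, "bio": 2e12, "materials": 1e12}   # toy corpus sizes
shares = {"web": 0.5, "bio": 0.3, "materials": 0.2}       # target mix
for name, (epochs, risky) in sampling_plan(counts, shares, 20e12).items():
    print(f"{name}: ~{epochs:.1f} epochs{'  <-- over 4x' if risky else ''}")
```

Small science tranches get over-sampled to hit the target balance, so this kind of check is what keeps them under the 4x repetition limit.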

40 of 49

High-Level Plan – Model Evaluation

  • Each group that is working on datasets (training corpora) will also need to be designing model evaluation suites and diagnostics
  • What can we learn from BIG-bench, HELM, etc.?
  • Which harness to use, and what types of challenge problems?
  • We also plan to use existing models (GPT-4, Claude2, Llama2, Bard, etc.) to help in generating test cases
  • Model evaluations will be run in parallel on checkpoints as they are produced to track progress
  • Perhaps as much as 30% additional cycles need to be committed to evaluation and downstream model improvement
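One way the evaluate-checkpoints-as-they-are-produced loop above might be organized, sketched with a placeholder scorer. The harness, checkpoint naming, and scores are all assumptions, not a chosen design:

```python
def evaluate_new_checkpoints(ckpt_paths, seen, evaluate):
    """Score any checkpoints not yet evaluated, in order, and record them.

    `evaluate` stands in for whatever harness is eventually chosen
    (lm-eval-harness, a HELM-style suite, etc.).
    """
    results = []
    for ckpt in sorted(p for p in ckpt_paths if p not in seen):
        seen.add(ckpt)
        results.append({"ckpt": ckpt, "scores": evaluate(ckpt)})
    return results

# Toy run with a dummy scorer; a real loop would poll the checkpoint
# directory as training writes snapshots every ~1K iterations.
seen = set()
fake_eval = lambda ckpt: {"science_qa": 0.5}
out = evaluate_new_checkpoints(["iter_1000", "iter_2000"], seen, fake_eval)
print([r["ckpt"] for r in out])
```

Running this in parallel with training is what consumes the extra ~30% of cycles the slide budgets for evaluation.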

41 of 49

444 authors, 132 institutions, ~200 different tasks to evaluate language models

42 of 49

High-Level Plan – post-pretraining

  • Raw LLMs are somewhat useful; however, they are too raw for most users
    • Not safe, not optimized for “chat mode”
    • Not optimized for alignment with human tasks
  • To produce LLMs for Science that will be useful to our community, we will need to polish them with additional steps
    • Finetuning on tasks relevant to scientific use cases (we need those tasks)
    • Reward modeling (improve the model's likelihood of a good response)
    • Reinforcement learning with (human) feedback (alignment)
  • Post-pretraining is the least developed part of our plan
    • Need to accelerate this part of the work and to automate it
  • As public models become more capable, it is possible that we will be able to use them to critique our Science models in some scalable fashion, addressing the post-training human bottlenecks
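As a numeric sketch of the reward-modeling step above, the pairwise (Bradley-Terry) loss commonly used to train reward models can be computed as follows. The scalar scores stand in for a real reward model's outputs:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss for reward modeling: the reward model
    should score the preferred response above the rejected one.

    loss = -log(sigmoid(r_chosen - r_rejected))
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair is cheap; a mis-ordered pair is penalized
print(round(preference_loss(2.0, 0.0), 3))   # ~0.127
print(round(preference_loss(0.0, 2.0), 3))   # ~2.127
```

Minimizing this over human (or model-generated) preference pairs is what produces the reward signal that the RLHF stage then optimizes against.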

43 of 49

Important Issues

  • Linking Scientific Data (structured) to the unstructured (Text)
    • Need robust way to do this so that semantic coupling occurs
  • Large context windows – for many scientific use cases large context windows are needed; there is interest in developing models with 100K-token or longer contexts
  • Going Multimodal (initial models will be primarily text), but quickly we will want to go multimodal (images, 3D, timeseries, etc.)
  • Representation of domains, interests, directions in an inclusive form to avoid bias in disciplinary orientation (i.e., avoid the model being a physicist or biologist)
    • Need a forum to work on this
  • Many issues associated with release of models especially the upstream versions that may not be safe (various uses of the term safe)

44 of 49

AI Accelerated Post-Exascale Ecosystem

Scientific & Engineering Datasets: Mathematics, Biology, Materials, Chemistry, Particle Physics, Nuclear Physics, Computer Science, Climate, Medicine, Cosmology, Fusion Energy, Accelerators, Reactors, Energy Systems, Manufacturing

Exemplar DOE Mission Tasks: Autonomous Experiments, Scientific Discovery, Digital Twins, Inverse Design, Code Optimization, Accelerated Simulations, Secure Data Infrastructure, Co-Design

Text and Code Corpora: General Text, Social Media, News, Humanities, History, Law, Digital Libraries, OSTI Archive, Scientific Journals, arXiv, Code repositories, Laboratory Notes, PubMed, Agency Archives

Infrastructure: DOE and NNSA Exascale Systems; Common AI Software Frameworks; Responsible AI Techniques; Training of Open Science Foundation Models and National Security Foundation Models; Tuned and Adapted Downstream Models; Integrated Research Infrastructure; Online Experimental Facilities; Strategic Partnerships

45 of 49

Alignment, Responsibility and Safety

46 of 49

Managing Risks of widespread use of LLMs

https://arxiv.org/pdf/2306.03809.pdf

47 of 49

Responsible AI for Scientific Use

Goals:

  • Alignment with human values and operational constraints
  • Compliance with known laws of physics and logic when required
  • Exhibit reproducible behavior and results
  • Robustness to noise and changes in operating environments
  • Respect privacy and are resistant to manipulation to reveal restricted info
  • Compliance with regulatory or policy requirements
  • Can explain their reasoning and justify their conclusions

Open Questions:

  • How to systematically compare behaviors between models?
  • How to comprehensively assess the domain knowledge of models?
  • How to assess emergent behaviors or novel capabilities?
  • How to assess knowledge synthesis capabilities?

48 of 49

TPC related breakout

  • To support the TPC development we need to make progress in parallel with the technical activities on some organizational and governance issues
  • We need a website, an executive committee (smallish), and a steering committee (larger)
  • We need some volunteers to help with organizational issues
  • We want to have a process for posting events and venues for events
  • We need to have some groups working on some TPC policies on sharing and releasing things that many people have contributed to
  • We need to have a few policy statements related to responsible AI
    • Assessment of models prior to general release, Non-consumptive use of data, use of AI for filtering content rather than humans, etc.
  • We also need to define more clearly how we want to operate between common efforts (data prep, pipelines, etc.) and specific model building efforts (e.g. AuroraGPT, GPT Fugaku, etc.)

49 of 49

A Few Closing Comments

  • We endeavor to create "open" models that can be used for constructive purposes in science and society, and to reduce as much as possible the risk that our models will be used to cause harm
  • AI regulation is a hot topic, and we expect new announcements from governments around the world on how each country or region is going to approach AI risks and manage AI impacts on society
  • Some governments will have restrictions on what models can and can't do
  • Our input on these policies is already being sought
  • Therefore, it will be very important for TPC to have formal release protocols for models that are associated with the TPC
    • It will be important for us to have a formal process for model evaluation, testing, and assessment prior to general or open release
    • We will be expected to have an "exemplar" process for this