1 of 49

Trillion Parameter Consortium: Generative AI for Science

We will start at 9:00 am CDT

2 of 49

Getting us started

  • Welcome … we are glad you are here!
  • Overview of the TPC
  • Plan for the next few days
    • AuroraGPT and TPC organizational efforts
  • Some high-level goals for the breakouts
    • Data organization
    • Model related planning, experiments and tests
    • TPC related development and governance
  • Hybrid meeting (with its logistic challenges)
    • Key people to know for getting help

3 of 49

We want to build the world's most powerful FMs for Science

4 of 49

These models need to be examples of responsible AI

5 of 49

We have created the TPC to help do this

6 of 49

Trillion Parameter Consortium: goals

Goal 1. Build an open community of researchers that are interested in creating state-of-the-art large-scale generative AI models (FMs/LLMs) aimed broadly at advancing progress on scientific and engineering problems, by sharing methods, approaches, tools, insights, and workflows.

 Goal 2. Incubate, launch, and loosely (voluntarily) coordinate specific projects to build specific models at specific sites and attempt to avoid unnecessary duplication of effort and to maximize the impact of the projects in the broader AI and scientific community. Where possible we will work out what we can do together for maximum leverage vs. what needs to be done in smaller groups.

 

Goal 3. Create a global network of resources and expertise that can help facilitate teaming and training the next generation of AI and related researchers interested in the development and use of large-scale AI in advancing science and engineering.

7 of 49

AI for Science, Energy and Security

What changed in three years?

  • Language Models (e.g. ChatGPT) released
  • Artificial image generation took off
  • AI folded a billion proteins
  • AI hints at advancing mathematics
  • AI automation of computer programming
  • Explosion of new AI hardware
  • AI accelerates HPC simulations
  • Exascale machines start to arrive

Timeline: 2019 to 2022

2020: DOE Office of Science ASCR Advisory Committee report recommending a major DOE AI4S program

8 of 49

Workshops organized on six crosscutting themes

AI for advanced properties inference and inverse design

AI and robotics for autonomous discovery

AI-based surrogates for high-performance computing

AI for software engineering and programming

AI for prediction and control of complex engineered systems

Foundation, Assured AI for scientific knowledge

Example areas per theme: Energy Storage; Proteins, Polymers, Stockpile modernization; Materials, Chemistry, Biology; Light Sources, Neutrons; Climate Ensembles; Exascale apps with surrogates; 1000x faster => Zettascale now; Code Translation, Optimization; Quantum Compilation, QAlgs; Accelerators, Buildings, Cities; Reactors, Power Grid, Networks; Hypothesis Formation, Math; Theory and Modeling Synthesis

9 of 49

Leveraging Community Efforts


10 of 49

Foundation Models for Science — Opportunities

  • FMs can summarize and distill knowledge – extract information from millions of papers into compact computational representations – PPI networks, materials compositions, code kernels, biological function, etc.
  • FMs can synthesize – combine information from multiple sources – generate small programs for specific tasks – quantum computing programs using Qiskit & Cirq, derivations for applied physics, code for visualization and animation, etc.
  • FMs can generate plans, solve logic problems, and write experimental protocols for robots – powering self-driving labs, generating strategies for problem solving, and planning for testing hypotheses
  • FMs, with additional research, may be able to generate hypotheses to be tested and new theories for exploration – a full-time shared scientific assistant that learns from across all of science is possible

After experimenting with GPT-4 in our own research domains in materials chemistry, physics and quantum information, we find that ChatGPT-4 is knowledgeable, frequently wrong, and interesting to talk to. In other words, not unlike a college professor or a colleague. https://arxiv.org/pdf/2304.12208.pdf

11 of 49

It is likely that many of the use cases we imagine in the AI4SES report can be driven directly or indirectly from sufficiently powerful Foundation Models

12 of 49

Leveraging Community Efforts


Scientific & Engineering Datasets: Mathematics, Biology, Materials, Chemistry, Particle Physics, Nuclear Physics, Computer Science, Climate, Medicine, Cosmology, Fusion Energy, Accelerators, Reactors, Energy Systems, Manufacturing

Exemplar DOE Mission Tasks: Autonomous Experiments, Scientific Discovery, Digital Twins, Inverse Design, Code Optimization, Accelerated Simulations, Secure Data Infrastructure, Co-Design

13 of 49

Since 2019, LLM development has accelerated

14 of 49

Rapid Development of Large Language Models

Explosion of development of LLM-based AI systems since 2019

Foundation Models are replacing narrow AI systems at a rapid pace

Foundation Models are the closest things yet created that hint at the possibility of Artificial General Intelligence

15 of 49

The most capable models today are in the private sector (GPT-4, Claude, ChatGPT-3.5, Bard)

Large models with emergent behavior

16 of 49

17 of 49

18 of 49

Assistant models have been further trained to act as helpful chatbots

19 of 49

Open Source Models

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

20 of 49

Falcon 40B Instruct – Trained on AWS by TII UAE

https://huggingface.co/tiiuae/falcon-40b-instruct

21 of 49

In 2020 we started exploring the feasibility of Trillion Parameter Foundation Models for Science

22 of 49

Generative AI for Science – Trillion Parameter Consortium

  • Intel, HPE, AMD and NVIDIA, …
    • Optimize implementations on our platforms
  • Argonne, Oak Ridge, Brookhaven, Berkeley, Pacific Northwest, … Laboratories
    • Curating data, designing evaluations, scaling models etc.
  • UChicago, Northwestern, UIC, Caltech, UIUC, U Montreal, MILA,
    • Dozens of academic collaborators, NLP, Safety, evaluation, etc.
  • RIKEN, MSKCC, AI2, BSC, LRZ, CSC, ...
    • Strategic partnerships with centers working on similar things
  • Cerebras, SambaNova, Microsoft, Together, Run.AI, Dimension.AI, ..
    • AI industry collaborators

If you are interested in participating please contact me.

23 of 49

Building a useful engine for science is more than training a single model

24 of 49

25 of 49

High-Level Plan – Code base

  • Leveraging Megatron, FlashAttention, DeepSpeed, and other code bases
  • Implementations that are optimized for target GPUs
    • Operator fusions, offloads, etc.
  • Libraries optimized for Aurora (PVC) and Frontier (MI250X) and A100/H100 systems
  • Series of models (7B, 70B, 200B, 1000B) (trillion parameters goal)
  • Multiple levels of parallelism
    • Data Parallel (multiple model instances)
    • Pipeline Parallel (partition by layers)
    • Model Parallel (partition vertically)
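A toy sketch of how the three parallelism levels above compose. The GPU counts and degrees are purely illustrative; real Megatron/DeepSpeed configurations also weigh interconnect topology, per-GPU memory, and batch-size limits:

```python
def factor_parallelism(total_gpus, tensor_parallel, pipeline_parallel):
    """Split a GPU pool into (data, pipeline, tensor) parallel degrees.

    Illustrative only: tensor ("model") parallelism partitions layers
    vertically, pipeline parallelism partitions by layers, and whatever
    GPUs remain form independent data-parallel model replicas.
    """
    model_gpus = tensor_parallel * pipeline_parallel  # GPUs per model replica
    if total_gpus % model_gpus != 0:
        raise ValueError("GPU count must be divisible by TP * PP")
    data_parallel = total_gpus // model_gpus  # independent model instances
    return data_parallel, pipeline_parallel, tensor_parallel

# e.g., 2048 GPUs, 8-way tensor parallel, 16-way pipeline parallel
dp, pp, tp = factor_parallelism(2048, tensor_parallel=8, pipeline_parallel=16)
print(dp, pp, tp)  # 16 data-parallel replicas of 128 GPUs each
```

In practice the product of the three degrees must equal the total GPU count, which is why configs are usually chosen in this divisible form.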

Test Shots in April Hackathon

26 of 49

Key Objectives for the Model Breakouts

  • Recommendations
  • Specific model configurations (~7B, 70B, 200B, and 1T*)
    • Given the target machines
      • Aurora, Polaris, Frontier, Fugaku, Lumi, Cerebras Clusters, etc.
    • Recommended parallelism strategies
      • Model, Pipe, Data
    • Should the 1T model be dense or sparse, MoE or something else?
  • Estimated pre-training times (assume we will have at least 30T tokens of data)
    • Assume Chinchilla scaling for the analysis
    • Include frequent checkpointing and restarts
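A back-of-the-envelope helper for the pre-training time estimates requested above. It assumes ~6 FLOPs per parameter per token, a Chinchilla-style D ≈ 20N token budget, and a padding factor for checkpointing and restarts; all constants are assumptions, not measured numbers:

```python
def pretrain_time_days(n_params, sustained_flops, tokens_per_param=20.0,
                       overhead=1.1):
    """Rough pre-training wall-clock estimate in days.

    total FLOPs ~ 6 * N * D, with D = tokens_per_param * N (assumed
    Chinchilla-style ratio); `overhead` pads for checkpoint/restart cost.
    """
    tokens = tokens_per_param * n_params
    total_flops = 6.0 * n_params * tokens * overhead
    seconds = total_flops / sustained_flops
    return seconds / 86400.0

# Sketch for the slide's model series on a ~2e19 FLOP/s sustained system
for n in (7e9, 70e9, 200e9, 1e12):
    print(f"{n/1e9:.0f}B params: ~{pretrain_time_days(n, 2e19):.1f} days")
```

For the 1T model this lands in the 1-3 month range the later slides quote, before counting crashes and evaluation cycles.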

27 of 49

~1M GPU hours / ~4M GPU hours

28 of 49

Compute Needed to Train

arXiv:2204.02311v2 [cs.CL] 7 Apr 2022

PaLM 540B was trained for 29 days on 4096 TPU v4 chips at ~1 exaflop/s (BF16)

29 of 49

Training LLMs is a big computing task

State-of-the-Art: 1 trillion parameter model

6 FLOPs per token per parameter

~1-3 months on an exascale system

30 of 49

(2 × 10^19 FLOP/s) × (2 × 10^6 s) = 4 × 10^25 FLOPs == Aurora for 1 month

If N ~ 1 × 10^12, then D ~ 10^12.8, i.e., D ~ 6T tokens

31 of 49

High-Level Plan – Pretraining data

  • Broad corpus of data (~30 trillion tokens to start): general text plus four science tranches
    • General text and code – ~2-10? trillion tokens
    • Biology, biochemistry, genomics, proteomics, structural biology, discrete math, medicine, drugs, etc. – ~1-5 trillion tokens
    • Materials, chemistry, nanoscience, x-ray science, neutron science, math, etc. – ~1-5 trillion tokens
    • Mathematical physics, nuclear physics, high-energy and particle physics, QED, QCD, astronomy, astrophysics and cosmology, etc. – ~1-5 trillion tokens
    • Climate, environmental science, atmospheric science, ecology, biogeochemistry, etc. – ~1-5 trillion tokens
  • Different scaling models lead to different estimates for the scale of training data (Chinchilla vs. Kaplan)
    • Kaplan, Brown would scale model parameters faster than data (M>D)
    • Chinchilla would scale data faster than model parameters (D>M)
  • For current planning, let's assume Chinchilla scaling
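The Chinchilla-vs-Kaplan tradeoff above can be illustrated with a small sketch that allocates a fixed compute budget two ways. The exponents and the calibration point are rough readings of the two papers, not the slide's exact figures, so the outputs will not match the slide's round numbers:

```python
def optimal_n_d(compute_flops, n_exp, ref_n=1e9):
    """Scale model size N and token count D under compute C = 6*N*D.

    Calibrated at an assumed reference point where a 1B-parameter model
    is Chinchilla-optimal (D = 20*N). `n_exp` is how fast N grows with
    compute: ~0.5 for Chinchilla (N and D grow together), ~0.73 for a
    Kaplan-style fit (parameters grow faster than data).
    """
    ref_c = 6.0 * ref_n * (20.0 * ref_n)        # compute at the reference
    n = ref_n * (compute_flops / ref_c) ** n_exp
    d = compute_flops / (6.0 * n)               # tokens implied by C = 6*N*D
    return n, d

budget = 4e25  # ~one month of an exascale system, from the slides
for name, n_exp in [("Chinchilla", 0.5), ("Kaplan", 0.73)]:
    n, d = optimal_n_d(budget, n_exp)
    print(f"{name}: N ~ {n:.2e} params, D ~ {d:.2e} tokens")
```

The point of the sketch is the direction of the split: a Kaplan-style fit pours the budget into parameters, while Chinchilla demands far more tokens, which is why the data-tranche targets above matter.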

32 of 49

Thoughts on data preparation

  • Our strategy splits data prep into two sets of activities
    • General (mostly web) text and code (general data modules)
    • Scientific datasets (science domain modules) that combine structured and unstructured data
  • Specialized scientific text could go into either, but we need to deduplicate between modules
  • Scientific domain modules
    • Key idea: each input context needs enough “semantic” overlap with the domain text that prompts using that text will have completions that can bring in structured scientific data (sequences, tables, IDs, equations, images, etc.)
    • “Cards” may be one approach that could achieve this
  • For many scientific problems, multidimensional “images” and time series are important datatypes
    • Long contexts (~100K tokens and beyond) might be needed for many types of structured data, but it appears key that these examples use the full context, or at least its end, to generate correct completions, so that models learn to use the long context
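A minimal sketch of the cross-module deduplication step above, using exact-match paragraph fingerprints. A production pipeline would more likely use MinHash/LSH to catch near-duplicates as well; this only catches verbatim overlap after normalization:

```python
import hashlib

def paragraph_fingerprints(text):
    """Exact-match fingerprints for deduplication between data modules.

    Paragraphs are normalized (case and whitespace) and hashed; shared
    hashes between two modules mark content to deduplicate.
    """
    fps = set()
    for para in text.split("\n\n"):
        norm = " ".join(para.lower().split())  # normalize whitespace/case
        if norm:
            fps.add(hashlib.sha256(norm.encode()).hexdigest())
    return fps

# Toy modules: the first paragraph appears in both, with different spacing
general = "Proteins fold into structures.\n\nThe weather is nice."
science = "proteins fold   into structures.\n\nAlphaFold predicts folds."
overlap = paragraph_fingerprints(general) & paragraph_fingerprints(science)
print(len(overlap))  # 1 -- the shared paragraph survives normalization
```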

33 of 49

34 of 49

Meta example from genomics

  • Say we want to include genes, protein function and sequence for many thousands of organisms
  • We might consider encoding datasets along the following lines

<organism> <gene name> <gene IDs> <aliases><loci><gene function text><dna sequence><protein name><protein IDs><protein function text><protein sequence>

Sequences longer than the context window might have to be chunked, with the headers replicated in each chunk

The key idea is to have enough information in the header that prompts using names (words) will overlap with names in text and in semi-structured data, so that IDs and sequences can be integrated

These ideas need to be tested via tuning and RAG-type experiments, and with early runs on smaller models

Alternatives are needed
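A minimal sketch of the chunking-with-replicated-headers idea above. The header fields follow the illustrative genomics layout, and the context budget is an arbitrary toy number:

```python
def chunk_with_header(header, sequence, max_len=2048):
    """Chunk a long sequence, replicating the descriptive header on each
    chunk so every training context keeps the semantic link between
    names/IDs and the raw sequence data."""
    body_len = max_len - len(header)
    if body_len <= 0:
        raise ValueError("header alone exceeds the context budget")
    chunks = []
    for i in range(0, len(sequence), body_len):
        chunks.append(header + sequence[i:i + body_len])
    return chunks

# Toy header and a 9000-character DNA sequence (both illustrative)
header = "<organism>E. coli</organism><gene name>lacZ</gene name>"
seq = "ATG" * 3000
chunks = chunk_with_header(header, seq, max_len=2048)
print(len(chunks), len(chunks[0]))
```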

35 of 49

36 of 49

37 of 49

38 of 49

39 of 49

High-Level Plan – Pretraining Campaigns

  • Basic approach is to sample/mix data during training so that it is roughly balanced (don't train sequentially)
  • One can over/under-sample subsets of data to achieve some target balance
  • You want to see each token 1-4 times; more than 4 times increases the risk of memorization
  • Models are often unstable and will crash during training
  • There are also periods of warmup, etc.
  • 150,000 iterations are typical, with checkpoints every 1K iterations to enable recovery
  • Smaller models are trained first (and in parallel) as pilots
  • 1 month of exascale compute is probably a minimum, spread over 3-4 months of real time
  • 3 months of exascale compute probably spreads over 9 months
  • Models need to be monitored 24/7 during training to recover quickly
  • SOTA models have crashed ~10-30 times during training
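The over/under-sampling and 1-4x token-epoch guidance above can be checked with a small helper. Corpus sizes and the target mix below are toy numbers, not a proposed plan:

```python
def sampling_plan(token_counts, target_shares, train_tokens):
    """For each dataset, compute how many times each token is seen when
    sampling according to `target_shares` for `train_tokens` total, and
    flag any dataset exceeding the ~4-epoch memorization-risk guideline.
    """
    plan = {}
    for name, share in target_shares.items():
        drawn = share * train_tokens          # tokens drawn from this set
        epochs = drawn / token_counts[name]   # times each token is seen
        plan[name] = (epochs, epochs > 4.0)
    return plan

counts = {"web": 10e12, "bio": 2e12, "materials": 1e12}   # toy corpus sizes
shares = {"web": 0.5, "bio": 0.3, "materials": 0.2}       # target mix
for name, (epochs, risky) in sampling_plan(counts, shares, 20e12).items():
    print(f"{name}: ~{epochs:.1f} epochs{'  <-- over 4x' if risky else ''}")
```

Small science tranches get over-sampled to hit the target balance, so this kind of check is what keeps them under the 4x repetition limit.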

40 of 49

High-Level Plan – Model Evaluation

  • Each group that is working on datasets (training corpora) will also need to be designing model evaluation suites and diagnostics
  • What can we learn from BIG-bench, HELM, etc.?
  • Which harness to use, and what types of challenge problems?
  • We also plan to use existing models (GPT-4, Claude2, Llama2, Bard, etc.) to help in generating test cases
  • Model evaluations will be run in parallel on checkpoints as they are produced to track progress
  • Perhaps as much as 30% additional cycles need to be committed to evaluation and downstream model improvement
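One way the evaluate-checkpoints-as-they-are-produced loop above might be organized, sketched with a placeholder scorer. The harness, checkpoint naming, and scores are all assumptions, not a chosen design:

```python
def evaluate_new_checkpoints(ckpt_paths, seen, evaluate):
    """Score any checkpoints not yet evaluated, in order, and record them.

    `evaluate` stands in for whatever harness is eventually chosen
    (lm-eval-harness, a HELM-style suite, etc.).
    """
    results = []
    for ckpt in sorted(p for p in ckpt_paths if p not in seen):
        seen.add(ckpt)
        results.append({"ckpt": ckpt, "scores": evaluate(ckpt)})
    return results

# Toy run with a dummy scorer; a real loop would poll the checkpoint
# directory as training writes snapshots every ~1K iterations.
seen = set()
fake_eval = lambda ckpt: {"science_qa": 0.5}
out = evaluate_new_checkpoints(["iter_1000", "iter_2000"], seen, fake_eval)
print([r["ckpt"] for r in out])
```

Running this in parallel with training is what consumes the extra ~30% of cycles the slide budgets for evaluation.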

41 of 49

444 authors, 132 institutions, ~200 different tasks to evaluate language models

42 of 49

High-Level Plan – post-pretraining

  • Raw LLMs are somewhat useful; however, they are too raw for most users
    • Not safe, not optimized for “chat mode”
    • Not optimized for alignment with human tasks
  • To produce LLMs for Science that will be useful to our community, we will need to polish them with additional steps
    • Finetuning on tasks relevant to scientific use cases (we need those tasks)
    • Reward modeling (improve the model's likelihood of a good response)
    • Reinforcement learning with (human) feedback (alignment)
  • Post-pretraining is the least developed part of our plan
    • Need to accelerate this part of the work and to automate it
  • As public models become more capable, it is possible that we will be able to use them to critique our Science models in some scalable fashion, addressing the post-training human bottlenecks
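As a numeric sketch of the reward-modeling step above, the pairwise (Bradley-Terry) loss commonly used to train reward models can be computed as follows. The scalar scores stand in for a real reward model's outputs:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss for reward modeling: the reward model
    should score the preferred response above the rejected one.

    loss = -log(sigmoid(r_chosen - r_rejected))
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair is cheap; a mis-ordered pair is penalized
print(round(preference_loss(2.0, 0.0), 3))   # ~0.127
print(round(preference_loss(0.0, 2.0), 3))   # ~2.127
```

Minimizing this over human (or model-generated) preference pairs is what produces the reward signal that the RLHF stage then optimizes against.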

43 of 49

Important Issues

  • Linking Scientific Data (structured) to the unstructured (Text)
    • Need robust way to do this so that semantic coupling occurs
  • Large context windows – for many scientific use cases large context windows are needed; there is interest in developing models with 100K-token or longer contexts
  • Going Multimodal (initial models will be primarily text), but quickly we will want to go multimodal (images, 3D, timeseries, etc.)
  • Representation of domains, interests, directions in an inclusive form to avoid bias in disciplinary orientation (i.e., avoid the model being a physicist or biologist)
    • Need a forum to work on this
  • Many issues associated with release of models especially the upstream versions that may not be safe (various uses of the term safe)

44 of 49

AI Accelerated Post-Exascale Ecosystem

Scientific & Engineering Datasets: Mathematics, Biology, Materials, Chemistry, Particle Physics, Nuclear Physics, Computer Science, Climate, Medicine, Cosmology, Fusion Energy, Accelerators, Reactors, Energy Systems, Manufacturing

Exemplar DOE Mission Tasks: Autonomous Experiments, Scientific Discovery, Digital Twins, Inverse Design, Code Optimization, Accelerated Simulations, Secure Data Infrastructure, Co-Design

Text and Code Corpora: General Text, Social Media, News, Humanities, History, Law, Digital Libraries, OSTI Archive, Scientific Journals, arXiv, Code repositories, Laboratory Notes, PubMed, Agency Archives

Infrastructure: DOE and NNSA Exascale Systems; Common AI Software Frameworks; Responsible AI Techniques; Training of Open Science Foundation Models and National Security Foundation Models; Tuned and Adapted Downstream Models; Integrated Research Infrastructure; Online Experimental Facilities; Strategic Partnerships

45 of 49

Alignment, Responsibility and Safety

46 of 49

Managing Risks of widespread use of LLMs

https://arxiv.org/pdf/2306.03809.pdf

47 of 49

Responsible AI for Scientific Use

Goals:

  • Alignment with human values and operational constraints
  • Compliance with known laws of physics and logic when required
  • Exhibit reproducible behavior and results
  • Robustness to noise and changes in operating environments
  • Respect privacy and are resistant to manipulation to reveal restricted info
  • Compliance with regulatory or policy requirements
  • Can explain their reasoning and justify their conclusions

Open Questions:

  • How to systematically compare behaviors between models?
  • How to comprehensively assess the domain knowledge of models?
  • How to assess emergent behaviors or novel capabilities?
  • How to assess knowledge synthesis capabilities?

48 of 49

TPC related breakout

  • To support the TPC development we need to make progress in parallel with the technical activities on some organizational and governance issues
  • We need a website, an executive committee (smallish), and a steering committee (larger)
  • We need some volunteers to help with organizational issues
  • We want to have a process for posting events and venues for events
  • We need to have some groups working on some TPC policies on sharing and releasing things that many people have contributed to
  • We need to have a few policy statements related to responsible AI
    • Assessment of models prior to general release, Non-consumptive use of data, use of AI for filtering content rather than humans, etc.
  • We also need to define more clearly how we want to operate between common efforts (data prep, pipelines, etc.) and specific model building efforts (e.g. AuroraGPT, GPT Fugaku, etc.)

49 of 49

A Few Closing Comments

  • We endeavor to create "open" models that can be used for constructive purposes in science and society, and to reduce as much as possible the risk that our models will be used to cause harm
  • AI regulation is a hot topic, and we expect new announcements from governments around the world on how each country or region is going to approach AI risks and manage AI impacts on society
  • Some governments will have restrictions on what models can and can't do
  • Our input on these policies is already being sought
  • Therefore, it will be very important for TPC to have formal release protocols for models that are associated with the TPC
    • It will be important for us to have a formal process for model evaluation, testing, and assessment prior to general or open release
    • We will be expected to have an "exemplar" process for this