1 of 36

A Research Agenda for the Evaluation of AI-Based Weather Forecasting Models

Presented by: Imme Ebert-Uphoff (CIRA)

Jebb Q. Stewart (NOAA)

Jacob T. Radford (CIRA, NOAA)

With contributions from many others: see next slides.

CIRA = Cooperative Institute for Research in the Atmosphere @ Colorado State University

NOAA = National Oceanic and Atmospheric Administration @ Boulder, Colorado

EGU General Assembly 2024 - Session AS1.2 “Forecasting the Weather” (Apr 15, 2024)

2 of 36

Today’s presenters

Jebb Q. Stewart

NOAA

Jacob T. Radford

CIRA, NOAA

Imme Ebert-Uphoff

CIRA


3 of 36

Other co-authors / contributors

CIRA

  • Kyle A. Hilburn
  • Kate D. Musgrave
  • Robert T. DeMaria
  • Randy Chase
  • Ryan Lagerquist (CIRA & NOAA-GSL)
  • Charles White
  • Yoonjin Lee
  • Jason Apke
  • Lander Ver Hoef
  • Marie McGraw
  • Mark DeMaria
  • Galina Chirokova

NOAA-GSL

  • Christina E. Kumler (CIRES & NOAA-GSL)
  • Matthew S. Wandishin
  • Jeffrey D. Duda (CIRES & NOAA-GSL)
  • Isidora Jankov
  • David D. Turner

NOAA-PSL

  • Sergey Frolov
  • Laura Slivinski
  • Tim Smith
  • Niraj Agarwal
  • Philip Pegion

CIRA = Cooperative Institute for Research in the Atmosphere @ Colorado State University

CIRES = Cooperative Institute for Research in Environmental Sciences

@ University of Colorado Boulder

GSL = Global Systems Laboratory

@ Boulder, Colorado

PSL = Physical Science Laboratory

@ Boulder, Colorado

4 of 36

AI-based forecasting - evaluation needs

  • AI-based weather forecasting models are already starting to be used for decision-making:
    • Google’s MetNet-3 model already powers the default weather app on Google’s Pixel phones.
    • Anecdotal evidence suggests that forecasters in some countries are starting to take the output of AI models into account before issuing forecasts.
  • We had better hurry to develop a framework to evaluate such AI models!

Lots of great research already happening, but need coordinated effort that includes:

    • Feedback from many different groups, including forecasters and research community,
    • Detailed evaluation of the value of these models for specific applications,
    • Making sure that perspectives, ideas, criteria, from community are being incorporated.


5 of 36

Potential Opportunities of pure AI-based models

Key strength #1: Speed (up to 1,000x-10,000x faster than NWP models).

Key strength #2: Less computational power needed to run models. (Still need significant GPU power to train models.)

Potential Opportunities:

  • Create large ensembles (using generative AI)
  • Include complex mechanisms that are too computationally expensive for NWP
  • Increase temporal resolution
  • Increase spatial resolution
  • Ask fundamental science questions about predictability

6 of 36

Potential Opportunities of pure AI-based models


AI models present enormous potential to improve our ability to predict the weather!

But …


7 of 36

But … we need to answer many questions before we can put these models into widespread use

Sample questions:

Forecast Value:

  1. What exactly is the value of these models for forecasters for specific applications?
  2. What exactly is the value of these models for the public? What about the private sector?

Questions to help answer the above:

  • How well do the models represent specific meteorological features, such as tropical cyclones, atmospheric rivers, etc.?
  • When will we see forecasts for other key variables, such as precipitation type, CAPE, CIN?
  • How well do the models represent rare events?
  • What are the biases / failure modes of the AI-based models that forecasters need to be aware of?

Data-related:

  • With their complete dependence on learning from data, are we introducing many types of unconscious bias that physics-based NWP models do not have (or have less of)?
  • To create higher-resolution global weather forecasting models, where are the high-resolution data coming from?

Operational use (e.g., at NOAA):

  • What kind of training do forecasters need to gain maximal benefit from these models?
  • Who will maintain or fix these models if something goes wrong? What if libraries are no longer supported?


8 of 36

The perfect storm: If someone had tried to create chaos on purpose, they could not have come up with a better combination of disruptive factors.

Model development:

  • NWP: coordinated effort of few groups
  • AI: uncoordinated, tons of small groups around the world.

Underlying principles:

  • NWP: physics
  • AI: ML methods

Hardware needed:

  • NWP: CPUs
  • AI: GPUs

Feedback from forecasters during development:

  • NWP: Close connection
  • AI: So far - close to none.

Expertise needed:

  • NWP: ATS, software engineering, etc.
  • AI: machine learning

Where to learn about models:

  • NWP: project updates
  • AI: suddenly on arXiv

Time to develop a new model:

  • NWP: years
  • AI: months

Acknowledge impact of AI model development on meteorological community


9 of 36


AI models seem to

  • Come “out of nowhere”
  • Pop up on arXiv every few weeks
  • Lack documentation of changes between models, with no logical order
  • Be based on unfamiliar concepts / vocabulary
  • Move at unfamiliar speed


10 of 36


This seems to result (anecdotal evidence) in:

  • Feeling overwhelmed: always trying to catch up, in an area folks weren’t even trained in!
  • Confusion / chaos: you cannot plan next steps if you don’t know what to expect the next day.
  • Feeling left out / worried: uncertain how this is going to change one’s job.


11 of 36

Forecaster perspective

12 of 36

Forecaster perspective - key questions

Goal: Laying the groundwork for AI-based models in forecast operations

Questions we need to answer:

  1. How do AI models differ from physics models in terms of operations?
    • Simpler in some ways, more complex in others:
      • Simple: Uses the current state of the atmosphere to predict the future state, based on history
      • Complex: Uses complex techniques like graph neural networks
    • Much faster. Enables:
      • Sensitivity analysis
      • Ensembles
    • No physics or tools of diagnosis
    • Coarse resolution and limited output variables


13 of 36

Forecaster Perspective - key questions

2) How do these differences influence:

    • Likelihood of a forecaster to use the model
    • How and when AI models are applied
      • Deterministically or probabilistically?
      • As alternatives to physics-based models or as supplements?
    • Trustworthiness of the model
    • Communication of output to partners
      • Does the lack of physics or diagnostics make it more difficult to convey uncertainty or confidence in a forecast?


Status: Very little research has been done in this area.

Need big research effort on all of these topics!

14 of 36

Forecaster Perspective

  • What is the added value of an AI model for the end user (forecaster / public)?
  • What does it take for forecasters to trust AI models? See: Bostrom, A., Demuth, J.L., Wirz, C.D., Cains, M.G., Schumacher, A., Madlambayan, D., Bansal, A.S., Bearth, A., Chase, R., Crosman, K.M. and Ebert-Uphoff, I., 2023. Trust and trustworthy artificial intelligence: A research agenda for AI in the environmental sciences. Risk Analysis. https://onlinelibrary.wiley.com/doi/full/10.1111/risa.14245
  • How much does it help for decision-making? Example:
    • A frequent temperature correction from 86 to 87 degrees does not help much.
    • An occasional temperature correction from 86 to 100 degrees is very helpful.

Research needs:

  • Need a big effort to get forecasters involved in the discussions.
  • Important to get social scientists involved in these discussions.
  • Study the impact of AI models on end-user decisions and their consequences.


15 of 36

  • Output variables currently very limited.
  • So far primarily optimizing image similarity measures from computer vision, rather than meteorological features.
  • Version management: model jungle; lack of documentation of changes; no protocol for readiness level.
  • Forecasters not yet trained on use of these models.
  • Forecasters rarely involved in development and evaluation so far.
  • Added value for forecasters: not yet known.
  • Feedback from forecasters: not yet available.
  • Development by many different groups, primarily AI companies.
  • Development still in early stages.
  • Hard to maintain access to all models: setting up runs of new AI models; setting up real-time visualization; creating and storing archives of past forecasts.
  • Hard for meteorological community to keep up and contribute.
  • AI models getting very complex.

Need coordination!


16 of 36

Training Data

17 of 36

Data perspective: Available datasets for training

NWP models are primarily based on physics.

  • NWP models have certain well-known physical biases. See, for example, the list of “Subjective Model Performance Characteristics” maintained by the National Weather Service’s Weather Prediction Center: https://www.wpc.ncep.noaa.gov/mdlbias/biastext.shtml

AI models are based on data. Issues that come with the data:

Data issue #1: Limited availability of reliable, high-resolution data sets:

    • One advantage of AI models is that they have lower computational complexity, which should enable developing models with higher spatial resolution.
    • But which high-resolution data sets are available for training?


18 of 36

Available data sets

Dominant dataset for training of AI global weather forecasting models so far:

  • ECMWF Reanalysis v5 (ERA5)

New datasets:

  • NOAA UFS Replay (2023): a “replay” of the coupled UFS model to ECMWF reanalyses.
  • CONUS404 dataset:
    • Covers “CONterminous United States for over 40 years at 4 km resolution”
    • https://rda.ucar.edu/datasets/ds559.0/
    • 4 km long-term reanalysis

Research needs:

  • We need to develop reliable high-resolution data sets to be able to train high-resolution AI models.


19 of 36

Data perspective: New biases?

Data issue #2: If the training data are biased, the AI models inherit those biases.

    • This is known as “coded bias” and has already led to many problems in other areas, e.g., the legal system, face recognition, etc.
    • Let’s not repeat these mistakes in weather forecasting:

Depending on the data, we risk introducing new types of biases that NWP models do not have. Example: Data quality may differ regionally based on available sensors, due to terrain but also due to economic/historical differences, etc.

    • We need to be extra careful: before training any AI model, ask: Do we know all the biases in the data?

Research needs:

  • Best practices to identify and document biases in training data used for AI model development.
  • Methods for testing for common biases in resulting AI models, and for alleviating bias.
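As a concrete (if toy) illustration of the second research need, model errors can be stratified by region to surface inherited data-quality biases. A minimal Python sketch with synthetic numbers; the helper name `regional_error_summary` is hypothetical, not from any library or from this project:

```python
import numpy as np

def regional_error_summary(errors, region_labels):
    """Mean absolute model error per region. Stratifying errors by region
    (or by sensor coverage, terrain class, etc.) is one simple way to
    check whether an AI model inherited regional data-quality biases."""
    return {int(r): float(np.abs(errors[region_labels == r]).mean())
            for r in np.unique(region_labels)}

# Toy grid: region 1 = well-observed, region 2 = data-sparse.
errors = np.array([[0.5, -0.4],
                   [2.0, -3.0]])
labels = np.array([[1, 1],
                   [2, 2]])
print(regional_error_summary(errors, labels))  # {1: 0.45, 2: 2.5}
```

A large gap between regions, as in this toy case, would be a flag to investigate the training data before trusting the model there.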


20 of 36

Data readiness/bias: Sample resources/approaches

NCAI (NOAA Center for AI) & ESIP (Earth Science Information Partners): developed guidance on AI-ready data.

NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES; https://www.ai2es.org/):

  • McGovern et al. (2024): Identifying and Categorizing Bias in AI/ML for Earth Sciences, BAMS Jan 2024.

Research needs: Good start above, but a lot more work needed, especially for

  • Framework for automatic detection of common biases in data.
  • Framework for automatic detection of common biases in AI models.
  • Methods to avoid inheriting known bias from training data.


21 of 36

Access to trained AI models

22 of 36

Access to trained AI models

Trends in 2022-2023:

  • Development of purely AI-based models took off.
  • Some key models: FourCastNet, Pangu-weather, and GraphCast.
    • Note that trained versions of all three models are available to the public.
    • Special thanks to ECMWF for their code repository that makes it easy to install these models: https://github.com/ecmwf-lab/ai-models
  • ECMWF decided to invest heavily in AI models: the first result is their AIFS model.

Potential trends in 2024:

  • Some high-tech companies developed AI4NWP models primarily to test new AI algorithms, so they were open to making them public. As companies become aware of the commercial value of their models, this might change, and future models might no longer be made public.
  • That would be a problem: the community would no longer get access to test the newest models.
  • Appeal to model developers: Please continue to make your models available.
  • Appeal to reviewers: If you review a paper on any of these models, please request that the trained models be made publicly available upon publication. (Most editors cannot enforce this unless reviewers request it.)


23 of 36

Trends in AI4NWP model development - increasing model complexity

  • Early models were relatively simple and could be run on any GPU.
  • Newer models are becoming increasingly complex. That creates several issues:
    • Training the newest models: requires huge GPU resources that most academic groups, and even many government agencies, do not (yet?) have access to.
    • Running the newest models: memory needs are increasing. Running GraphCast (Operational) requires ~36 GB of GPU memory.
    • Understanding the models: models are becoming harder to understand, since they are currently developed to maximize performance, without regard to interpretability.
  • In contrast, only a few groups focus on developing minimal models:
    • Great example: Matthias Karlbauer, Nathaniel Cresswell-Clay, Raul A. Moreno, Dale R. Durran, Thorsten Kurth, Martin V. Butz, Advancing Parsimonious Deep Learning Weather Prediction using the HEALPix Mesh, 2023. https://arxiv.org/abs/2311.06253
    • We could use more research in this area.


24 of 36

Access to real-time and archived forecasts from AI models

25 of 36

Access to real-time forecasts

  • Forecasters and research community need easy access to view daily real-time forecasts from AI models.

Sample visualizations of real-time forecasts from various AI models are available:

ECMWF’s model charts for AI-based weather models: https://charts.ecmwf.int/catalogue/packages/ai_models/
Models: FourCastNet v2, Pangu-weather, GraphCast, AIFS (ECMWF’s own model), FuXi.
Forecasts initialized with IFS data.

CIRA-NOAA’s real-time visualizations for purely AI-based weather models: https://aiweather.cira.colostate.edu/
Models: FourCastNet v2, Pangu-weather, GraphCast, GFS, IFS.
Forecasts initialized with GFS initial conditions.

Research needs:

  • Start effort to engage with forecasters to get feedback on these models (and visualization) early on.
  • Keep adding real-time visualization of all the new models that come out.


26 of 36

Example: Our Real-Time Website

  • 5 model options + 4-panel comparison
  • Last 4 initializations
  • 240-hr forecasts
  • Interactive maps
  • Want to simplify inter-model comparisons

Developed by Jacob Radford (CIRA/NOAA-GSL)


27 of 36

Example: Our Real-Time Website

Developed by Jacob Radford (CIRA/NOAA-GSL)



28 of 36

Access to archives of AI model forecasts

Research community needs easy access to a multi-year archive of forecasts.

Sample activity:

CIRA-NOAA is building an archive of forecasts

    • Models: FourCastNet v2, Pangu-weather, GraphCast
    • Timeframe: 09/2021 to present
    • Forecasts: initialized twice daily (0Z, 12Z) for 10 days, saving forecasts in 6-hour time steps.
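The archive layout just described (two daily cycles, 10-day forecasts, 6-hour steps) implies a fixed grid of (initialization, lead time) pairs. A minimal Python sketch of that enumeration; `archive_valid_times` is a hypothetical helper, not part of the actual CIRA archive code:

```python
from datetime import datetime

def archive_valid_times(init_day):
    """For one calendar day, yield (init_time, lead_hours) pairs matching
    the archive layout: two runs per day (00Z, 12Z), 10-day forecasts
    saved at 6-hour steps."""
    for cycle in (0, 12):
        init = init_day.replace(hour=cycle)
        for lead_hr in range(0, 240 + 6, 6):  # 0 h .. 240 h in 6 h steps
            yield init, lead_hr

pairs = list(archive_valid_times(datetime(2023, 5, 1)))
print(len(pairs))  # 82 (2 cycles x 41 saved lead times per day)
```

Multiplying 82 fields per day per variable per model over three years makes the storage pressure mentioned below easy to appreciate.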

Research needs:

  • So far, limited effort: selected models, selected duration (3 years), etc.
  • Cannot keep up with new models coming out (set-up, storage needs).
  • How do we bring structure to the model jungle? Selection criteria for which models to feature?


Developed by Jacob Radford (CIRA/NOAA-GSL)

29 of 36

Derived fields: Severe-weather parameters

We have developed accurate, efficient, unit-tested code to add the following parameters to the CIRA archive:

  • CAPE: most unstable (MUCAPE), surface-based (SBCAPE), mixed-layer (MLCAPE)
  • MUCIN, SBCIN, MLCIN
  • Lifted index (surface to 500 hPa)
  • Precipitable water
  • Wind shear: surface to 500 hPa, surface to 850 hPa, 850 to 200 hPa
  • Storm-relative helicity: 0-1 km, 0-3 km
  • Height of planetary boundary layer (PBL)
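The CIRA code itself is not shown here, but two of the simpler parameters above can be sketched from textbook formulas. A minimal Python illustration (hypothetical helper names; simplified trapezoid-rule column integral, not the operational implementation):

```python
import numpy as np

def bulk_shear(u_bot, v_bot, u_top, v_top):
    """Magnitude (m/s) of the bulk wind-shear vector between two levels,
    e.g. the surface-to-500-hPa shear in the parameter list above."""
    return float(np.hypot(u_top - u_bot, v_top - v_bot))

def precipitable_water_mm(q, p_pa):
    """Precipitable water (mm) from specific humidity q (kg/kg) on pressure
    levels p_pa (Pa, increasing toward the surface), via the column
    integral PW = (1 / (g * rho_w)) * integral(q dp)."""
    g, rho_w = 9.81, 1000.0
    dp = np.diff(p_pa)
    integral = np.sum(0.5 * (q[:-1] + q[1:]) * dp)  # trapezoid rule
    return float(integral / (g * rho_w) * 1000.0)

# Toy column: calm surface with a 20 m/s westerly at 500 hPa,
# constant q = 0.01 kg/kg between 500 and 1000 hPa.
print(bulk_shear(0.0, 0.0, 20.0, 0.0))                     # 20.0
print(round(precipitable_water_mm(np.full(2, 0.01),
                                  np.array([5e4, 1e5]))))  # 51
```

Parameters like CAPE, CIN, and lifted index require full parcel thermodynamics and are much more involved, which is why accurate, unit-tested code matters.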

Initial archive: one month (May 2023) for all models (4 daily runs of GraphCast, FourCastNet v2, Pangu)

Full archive: coming soon (needs HPC resources)


Developed by Ryan Lagerquist (CIRA/NOAA-GSL)

30 of 36

Evaluation of AI models

31 of 36

Evaluate output of AI model as if it were an NWP model.

Applying objective validation measures from NWP models

There is a fairly standard suite of verification measures, applied to either global, coarse-resolution or regional, fine-scale model forecasts, that can be used to evaluate forecast output.

There are two broad classes of meteorological model output fields: continuous and feature-specific. �

  • Continuous fields: tend to be smooth and do not vary much from one grid point to the next. Examples: geopotential height, temperature, wind speed.
  • Feature-specific fields: look more like objects where an event occurs and can be dominated by “zero/null/empty” values with abrupt spatial gradients. Examples: precipitation, simulated radar reflectivity, cloud ceiling, visibility

Deterministic forecast verification

  • RMSE for continuous fields
  • Bias/mean error of continuous fields as well as precipitation
  • Dichotomous events (i.e., scores based on a 2x2 contingency table, can be summarized using a performance diagram)
  • Object-oriented
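For illustration, the deterministic scores above reduce to a few lines of array arithmetic. A minimal Python sketch with toy data (hypothetical helper names; no handling of empty contingency-table categories):

```python
import numpy as np

def rmse(forecast, obs):
    """Root-mean-square error for a continuous field."""
    return float(np.sqrt(np.mean((forecast - obs) ** 2)))

def bias(forecast, obs):
    """Mean error (forecast minus observation)."""
    return float(np.mean(forecast - obs))

def contingency_scores(forecast, obs, threshold):
    """POD, FAR, and CSI from the 2x2 contingency table for a dichotomous
    event (exceedance of `threshold`) - the ingredients of a
    performance diagram."""
    f, o = forecast >= threshold, obs >= threshold
    hits = int(np.sum(f & o))
    misses = int(np.sum(~f & o))
    false_alarms = int(np.sum(f & ~o))
    pod = hits / (hits + misses)
    far = false_alarms / (hits + false_alarms)
    csi = hits / (hits + misses + false_alarms)
    return pod, far, csi

fc = np.array([2.0, 0.0, 5.0, 1.0])
ob = np.array([1.5, 0.5, 0.0, 2.0])
pod, far, csi = contingency_scores(fc, ob, threshold=1.0)
print(pod, csi)  # 1.0 0.6666666666666666
```

The same helpers apply unchanged to AI and NWP output, which is exactly the point of this slide.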

For ensembles, probabilistic verification

  • Brier (skill) score; Continuous Ranked Probability (Skill) Score
  • Reliability diagrams
  • ROC diagrams
  • Forecast sharpness histograms
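The probabilistic scores above can likewise be sketched in a few lines. A minimal Python illustration with toy data (hypothetical helper names; equal-width probability bins assumed):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Brier score: mean squared error of probability forecasts against
    binary (0/1) outcomes. Lower is better; 0 is perfect."""
    p, o = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((p - o) ** 2))

def reliability_points(probs, outcomes, n_bins=10):
    """(mean forecast probability, observed frequency) per probability
    bin - the points plotted in a reliability diagram."""
    p, o = np.asarray(probs, float), np.asarray(outcomes, float)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return [(float(p[idx == b].mean()), float(o[idx == b].mean()))
            for b in range(n_bins) if np.any(idx == b)]

p = [0.1, 0.9, 0.8, 0.2]
y = [0, 1, 1, 0]
print(brier_score(p, y))  # ≈ 0.025
```

Bin counts from the same `idx` array would also give the forecast sharpness histogram.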

Credit: Material on this slide provided by Jeff Duda.


Status: Lots of research in the works.

32 of 36

Test weaknesses of AI models that NWP models do not have - for which no standard tests yet exist

Develop tests for specific weaknesses of AI models

Suggested sample topics (not an exhaustive list):

  1. Test temporal consistency: NWP models are consistent in time by nature. AI models may not be quite as consistent.
  2. Test distribution of extremes: Compare the distribution of extreme temperature / wind / etc. in the output of AI models to those from NWP models. Are AI models capturing the extremes, or are they missing many of them? How does this distribution change with lead time?
  3. Test consistency between variables (covariance): NWP models by definition guarantee physical relationships between different fields (hard constraints). In AI models those relationships are soft constraints, i.e. the AI model may choose to put more emphasis on other criteria, resulting in inconsistencies between variables.
  4. Estimate other fields that forecasters need to see and compare those to NWP models. Examples: precipitation type, CAPE, CIN, wind shear…
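Two of the suggested tests (temporal consistency, consistency between variables) admit very simple first-cut diagnostics. A minimal Python sketch with toy arrays; the statistics and helper names are illustrative suggestions, not an established protocol:

```python
import numpy as np

def temporal_roughness(series, axis=0):
    """Mean absolute step-to-step increment along the time axis.
    Comparing this statistic between an AI model and an NWP baseline is
    one crude probe of temporal consistency (topic 1 above)."""
    return float(np.mean(np.abs(np.diff(series, axis=axis))))

def cross_variable_mismatch(u, v, speed):
    """RMS difference between a directly predicted wind-speed field and
    the speed implied by the predicted u/v components (topic 3 above).
    Zero for a model with perfectly consistent variables."""
    return float(np.sqrt(np.mean((speed - np.hypot(u, v)) ** 2)))

u = np.array([3.0, 0.0])
v = np.array([4.0, 1.0])
print(cross_variable_mismatch(u, v, np.array([5.0, 1.0])))  # 0.0
print(temporal_roughness(np.array([1.0, 2.0, 4.0])))        # 1.5
```

A nonzero mismatch is not automatically wrong, but growth of either statistic with lead time would be exactly the kind of AI-specific failure mode these tests are meant to catch.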

Activity: See prior slide - work by Ryan Lagerquist.

  • Evaluate dynamics of AI models for weather prediction. Key paper: Gregory J. Hakim and Sanjit Masanam, Dynamical tests of a deep-learning weather prediction model, Sept 19, 2023. https://arxiv.org/abs/2309.10867

  • Do AI models represent meteorological features correctly? Examples: tropical cyclones, synoptic fronts, atmospheric rivers, ocean eddies. Activity: evaluation of tropical cyclones, see next slide.


Status: Lots of research in the works.

33 of 36

Evaluation of Forecasts of Tropical Cyclones

  • Project Goal
    • Use CIRA AI4NWP model archive to evaluate tropical cyclone track/intensity forecasts

  • Project Team
    • Mark DeMaria, Kate Musgrave, Galina Chirokova, Robert DeMaria, Jacob T. Radford, Imme Ebert-Uphoff - CIRA
    • James Franklin, National Hurricane Center Contractor
  • Method
    • Apply a TC tracker to FourCastNet v1 and v2, Pangu-weather, and GraphCast
    • Use NHC verification rules for track and intensity (max wind)
    • Compare with operational model TC track/intensity forecasts
    • Northern Hemisphere TC cases, May-Oct 2023
  • Results
    • Track forecasts are as good as or better than those from GFS, ECMWF, and regional hurricane models
    • Intensity forecasts are much worse, due to a low bias
    • Intensity forecasts may still be useful if a post-processing bias correction is applied
    • Manuscript in preparation for AMS AIES journal
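For reference, the core quantities in NHC-style verification (great-circle track error, intensity bias) are straightforward to compute. A minimal Python sketch with toy positions; the haversine formula and helper names are illustrative, not the project’s actual tracker or verification code:

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance (km) between a forecast TC position and the
    best-track position - the basis of track error."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def intensity_bias(fc_max_wind, bt_max_wind):
    """Mean (forecast - best track) max wind; a negative value is the
    kind of low intensity bias reported above."""
    return sum(f - b for f, b in zip(fc_max_wind, bt_max_wind)) / len(fc_max_wind)

# One degree of latitude is ~111 km:
print(round(great_circle_km(20.0, -60.0, 21.0, -60.0)))  # 111
print(intensity_bias([80, 90], [100, 100]))              # -15.0
```

A systematic negative `intensity_bias` like the toy value here is also what makes a post-processing bias correction plausible.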


34 of 36

Track and Intensity MAE and Intensity Bias for 2023 AI4NWP TC Forecasts

(Figure: three panels showing track error, intensity error, and intensity bias.)

35 of 36

Conclusions - key research needs

Need coordinated, comprehensive evaluation effort that includes:

  • Physical validation of AI models
  • Value to forecasters / impact on decision-making
  • AI pitfalls / bias identification and mitigation

Need to involve:

  • NWP experts
  • AI experts
  • Social scientists
  • Forecasters
  • R2O - operational perspective
  • Software managers

Urgent need: bring forecasters and social scientists more into the discussion. Social scientists are needed to elicit feedback from forecasters + many other roles.

36 of 36

Acknowledgements

Funding for this work was provided by:

  • NOAA GSL AI funding
  • CIRA ML strategic funding
  • Pilot project under HFIP/STI support to CIRA TC group’s efforts

Thank You to the conveners of these EGU sessions on “Forecasting the Weather”:

Yong Wang, Aitor Atencia, Kan Dai, Lesley De Cruz, Daniele Nerini.

Kyle Hilburn CIRA/CSU
