1 of 36

A Research Agenda for the Evaluation of AI-Based Weather Forecasting Models

Presented by: Imme Ebert-Uphoff (CIRA)

Jebb Q. Stewart (NOAA)

Jacob T. Radford (CIRA, NOAA)

With contributions from many others: see next slides.

CIRA = Cooperative Institute for Research in the Atmosphere @ Colorado State University

NOAA = National Oceanic and Atmospheric Administration @ Boulder, Colorado

EGU General Assembly 2024 - Session AS1.2 “Forecasting the Weather” (Apr 15, 2024)

2 of 36

Today’s presenters

Jebb Q. Stewart

NOAA

Jacob T. Radford

CIRA, NOAA

Imme Ebert-Uphoff

CIRA


3 of 36

Other co-authors / contributors

CIRA

  • Kyle A. Hilburn
  • Kate D. Musgrave
  • Robert T. DeMaria
  • Randy Chase
  • Ryan Lagerquist (CIRA & NOAA-GSL)
  • Charles White
  • Yoonjin Lee
  • Jason Apke
  • Lander Ver Hoef
  • Marie McGraw
  • Mark DeMaria
  • Galina Chirokova

NOAA-GSL

  • Christina E. Kumler (CIRES & NOAA-GSL)
  • Matthew S. Wandishin
  • Jeffrey D. Duda (CIRES & NOAA-GSL)
  • Isidora Jankov
  • David D. Turner

NOAA-PSL

  • Sergey Frolov
  • Laura Slivinski
  • Tim Smith
  • Niraj Agarwal
  • Philip Pegion

CIRA = Cooperative Institute for Research in the Atmosphere @ Colorado State University

CIRES = Cooperative Institute for Research in Environmental Sciences

@ University of Colorado Boulder

GSL = Global Systems Laboratory

@ Boulder, Colorado

PSL = Physical Science Laboratory

@ Boulder, Colorado

4 of 36

AI-based forecasting - evaluation needs

  • AI-based weather forecasting models are already starting to be used for decision-making:
    • Google’s MetNet-3 model already powers the default weather app on Google’s Pixel phones.
    • Anecdotal evidence suggests that forecasters in some countries are starting to take the output of AI models into account before issuing forecasts.
  • We had better hurry to develop a framework to evaluate such AI models!

Lots of great research already happening, but need coordinated effort that includes:

    • Feedback from many different groups, including forecasters and research community,
    • Detailed evaluation of the value of these models for specific applications,
    • Making sure that perspectives, ideas, criteria, from community are being incorporated.


5 of 36

Potential Opportunities of pure AI-based models

Key strength #1: Speed (up to 1,000x-10,000x faster than NWP models).

Key strength #2: Less computational power needed to run models. (Still need significant GPU power to train models.)

Potential Opportunities:

  • Create large ensembles (using generative AI)
  • Include complex mechanisms that are too computationally expensive for NWP
  • Increase temporal resolution
  • Increase spatial resolution
  • Ask fundamental science questions about predictability

6 of 36

Potential Opportunities of pure AI-based models


AI models present enormous potential to improve our ability to predict the weather!

But …


7 of 36

But … we need to answer many questions before we can put these models into widespread use

Sample questions:

Forecast Value:

  1. What exactly is the value of these models for forecasters for specific applications?
  2. What exactly is the value of these models for the public? What about the private sector?

Questions to help answer the above:

  • How well do the models represent specific meteorological features, such as tropical cyclones, atmospheric rivers, etc.?
  • When will we see forecasts for other key variables, such as precipitation type, CAPE, CIN?
  • How well do the models represent rare events?
  • What are the biases / failure modes of the AI-based models that forecasters need to be aware of?

Data-related:

  • With their complete dependence on learning from data, are we introducing many types of unconscious bias that physics-based NWP models do not have (or have less of)?
  • To create higher-resolution global weather forecasting models, where are the high-resolution data coming from?

Operational use (e.g., at NOAA):

  • What kind of training do forecasters need to gain maximal benefit from these models?
  • Who will maintain or fix these models if something goes wrong? What if libraries are no longer supported?


8 of 36

The perfect storm: If someone had tried to create chaos on purpose, they could not have come up with a better combination of disruptive factors.

Model development:

  • NWP: coordinated effort of few groups
  • AI: uncoordinated, tons of small groups around the world.

Underlying principles:

  • NWP: physics
  • AI: ML methods

Hardware needed:

  • NWP: CPUs
  • AI: GPUs

Feedback from forecasters during development:

  • NWP: Close connection
  • AI: So far - close to none.

Expertise needed:

  • NWP: ATS, software engineering, etc.
  • AI: machine learning

Where to learn about models:

  • NWP: project updates
  • AI: suddenly on arXiv

Time to develop a new model:

  • NWP: years
  • AI: months

Acknowledge impact of AI model development on meteorological community


9 of 36


AI models seem to

  • Come “out of nowhere”
  • Pop up on arXiv every few weeks
  • Lack documentation of changes between models, with no logical order
  • Be based on unfamiliar concepts / vocabulary
  • Move at unfamiliar speed


10 of 36


This seems to result (anecdotal evidence) in:

  • Feeling overwhelmed: always trying to catch up, in an area folks weren’t even trained in!
  • Confusion / chaos: you cannot plan next steps if you don’t know what to expect the next day.
  • Feeling left out / worried: uncertain how this is going to change one’s job.


11 of 36

Forecaster perspective

12 of 36

Forecaster perspective - key questions

Goal: Laying the groundwork for AI-based models in forecast operations

Questions we need to answer:

  1. How do AI models differ from physics models in terms of operations?
    • Simpler in some ways, more complex in others:
      • Simple: Uses the current state of the atmosphere to predict the future state, based on history
      • Complex: Uses complex techniques like graph neural networks
    • Much faster. Enables:
      • Sensitivity analysis
      • Ensembles
    • No physics or tools of diagnosis
    • Coarse resolution and limited output variables


13 of 36

Forecaster Perspective - key questions

2) How do these differences influence:

    • Likelihood of a forecaster to use the model
    • How and when AI models are applied
      • Deterministically or probabilistically?
      • As alternatives to physics-based models or as supplements?
    • Trustworthiness of the model
    • Communication of output to partners
      • Does the lack of physics or diagnostics make it more difficult to convey uncertainty or confidence in a forecast?


Status: Very little research has been done in this area.

Need big research effort on all of these topics!

14 of 36

Forecaster Perspective

  • What is the added value of an AI model for the end user (forecaster / public)?
  • What does it take for forecasters to trust AI models? See: Bostrom, A., Demuth, J.L., Wirz, C.D., Cains, M.G., Schumacher, A., Madlambayan, D., Bansal, A.S., Bearth, A., Chase, R., Crosman, K.M. and Ebert-Uphoff, I., 2023. Trust and trustworthy artificial intelligence: A research agenda for AI in the environmental sciences. Risk Analysis. https://onlinelibrary.wiley.com/doi/full/10.1111/risa.14245
  • How much does it help for decision-making? Example:
    • A frequent temperature correction from 86 to 87 degrees does not help much.
    • An occasional temperature correction from 86 to 100 degrees is very helpful.

Research needs:

  • Need a big effort to get forecasters involved in the discussions.
  • Important to get social scientists involved in these discussions.
  • Study the impact of AI models on end-user decisions and their consequences.


15 of 36

  • Output variables currently very limited.
  • So far primarily optimizing image similarity measures from computer vision, rather than meteorological features.
  • Version management: model jungle; lack of documentation of changes; no protocol for readiness level.
  • Forecasters not yet trained on use of these models.
  • Forecasters rarely involved in development and evaluation so far.
  • Added value for forecasters: not yet known.
  • Feedback from forecasters: not yet available.
  • Development by many different groups, primarily AI companies.
  • Development still in early stages.
  • Hard to maintain access to all models: setting up runs of new AI models; setting up real-time visualization; creating and storing archives of past forecasts.
  • Hard for meteorological community to keep up and contribute.
  • AI models getting very complex.

Need coordination!


16 of 36

Training Data

17 of 36

Data perspective: Available datasets for training

NWP models are primarily based on physics.

  • NWP models have certain well-known physical biases. See, for example, the list of “Subjective Model Performance Characteristics” maintained by the National Weather Service’s Weather Prediction Center: https://www.wpc.ncep.noaa.gov/mdlbias/biastext.shtml

AI models are based on data. Issues that come with the data:

Data issue #1: Limited availability of reliable, high-resolution data sets:

    • One advantage of AI models is that they have lower computational complexity, which should enable developing models with higher spatial resolution.
    • But which high-resolution data sets are available for training?


18 of 36

Available data sets

Dominant dataset for training of AI global weather forecasting models so far:

  • ECMWF Reanalysis v5 (ERA5)

New datasets:

  • NOAA UFS Replay (2023): a “replay” of the coupled UFS model to ECMWF reanalyses.
  • CONUS404 dataset:
    • Covers “CONterminous United States for over 40 years at 4 km resolution”
    • https://rda.ucar.edu/datasets/ds559.0/
    • 4 km long-term reanalysis

Research needs:

  • We need to develop reliable high-resolution data sets to be able to train high-resolution AI models.


19 of 36

Data perspective: New biases?

Data issue #2: If the training data are biased, the AI models inherit those biases.

    • This is known as “coded bias” and has already led to many problems in other areas, e.g., the legal system, face recognition, etc.
    • Let’s not repeat these mistakes in weather forecasting:

Depending on the data, we risk introducing new types of biases that NWP models do not have. Example: Data quality may differ regionally based on available sensors, due to terrain but also due to economic/historical differences, etc.

    • We need to be extra careful: before training any AI model, ask: Do we know all the biases in the data?

Research needs:

  • Best practices to identify and document biases in training data used for AI model development.
  • Methods for testing for common biases in resulting AI models, and for alleviating bias.
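As a concrete (if toy) illustration of the second research need, model errors can be stratified by region to surface inherited data-quality biases. A minimal Python sketch with synthetic numbers; the helper name `regional_error_summary` is hypothetical, not from any library or from this project:

```python
import numpy as np

def regional_error_summary(errors, region_labels):
    """Mean absolute model error per region. Stratifying errors by region
    (or by sensor coverage, terrain class, etc.) is one simple way to
    check whether an AI model inherited regional data-quality biases."""
    return {int(r): float(np.abs(errors[region_labels == r]).mean())
            for r in np.unique(region_labels)}

# Toy grid: region 1 = well-observed, region 2 = data-sparse.
errors = np.array([[0.5, -0.4],
                   [2.0, -3.0]])
labels = np.array([[1, 1],
                   [2, 2]])
print(regional_error_summary(errors, labels))  # {1: 0.45, 2: 2.5}
```

A large gap between regions, as in this toy case, would be a flag to investigate the training data before trusting the model there.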


20 of 36

Data readiness/bias: Sample resources/approaches

NCAI (NOAA Center for AI) & ESIP (Earth Science Information Partners): developed guidance on AI-ready data.

NSF AI Institute for Research on Trustworthy AI in Weather, Climate, and Coastal Oceanography (AI2ES; https://www.ai2es.org/):

  • McGovern et al. (2024): Identifying and Categorizing Bias in AI/ML for Earth Sciences, BAMS Jan 2024.

Research needs: Good start above, but a lot more work needed, especially for

  • Framework for automatic detection of common biases in data.
  • Framework for automatic detection of common biases in AI models.
  • Methods to avoid inheriting known bias from training data.


21 of 36

Access to trained AI models

22 of 36

Access to trained AI models

Trends in 2022-2023:

  • Development of purely AI-based models took off.
  • Some key models: FourCastNet, Pangu-weather, and GraphCast.
    • Note that trained versions of all three models are available to the public.
    • Special thanks to ECMWF for their code repository that makes it easy to install these models: https://github.com/ecmwf-lab/ai-models
  • ECMWF decided to invest heavily in AI models: the first result is their AIFS model.

Potential trends in 2024:

  • Some high-tech companies developed AI4NWP models primarily to test new AI algorithms, so they were open to making them public. As companies become aware of the commercial value of their models, this might change, and future models might no longer be made public.
  • That would be a problem: the community would no longer get access to test the newest models.
  • Appeal to model developers: Please continue to make your models available.
  • Appeal to reviewers: If you review a paper on any of these models, please request that the trained models be made publicly available upon publication. (Most editors cannot enforce this unless reviewers request it.)


23 of 36

Trends in AI4NWP model development - increasing model complexity

  • Early models were relatively simple and could be run on any GPU.
  • Newer models are becoming increasingly complex. That creates several issues:
    • Training the newest models: requires huge GPU resources that most academic groups, and even many government agencies, do not (yet?) have access to.
    • Running the newest models: memory needs are increasing. Running GraphCast (Operational) requires ~36 GB of GPU memory.
    • Understanding the models: models are becoming harder to understand, since they are currently developed to maximize performance, without regard to interpretability.
  • In contrast, only a few groups focus on developing minimal models:
    • Great example: Matthias Karlbauer, Nathaniel Cresswell-Clay, Raul A. Moreno, Dale R. Durran, Thorsten Kurth, Martin V. Butz, Advancing Parsimonious Deep Learning Weather Prediction using the HEALPix Mesh, 2023. https://arxiv.org/abs/2311.06253
    • We could use more research in this area.


24 of 36

Access to real-time and archived forecasts from AI models

25 of 36

Access to real-time forecasts

  • Forecasters and research community need easy access to view daily real-time forecasts from AI models.

Sample visualizations of real-time forecasts from various AI models are available:

ECMWF’s model charts for AI-based weather models: https://charts.ecmwf.int/catalogue/packages/ai_models/
Models: FourCastNet v2, Pangu-weather, GraphCast, AIFS (ECMWF’s own model), FuXi.
Forecasts initialized with IFS data.

CIRA-NOAA’s real-time visualizations for purely AI-based weather models: https://aiweather.cira.colostate.edu/
Models: FourCastNet v2, Pangu-weather, GraphCast, GFS, IFS.
Forecasts initialized with GFS initial conditions.

Research needs:

  • Start effort to engage with forecasters to get feedback on these models (and visualization) early on.
  • Keep adding real-time visualization of all the new models that come out.


26 of 36

Example: Our Real-Time Website

  • 5 model options + 4-panel comparison
  • Last 4 initializations
  • 240-hr forecasts
  • Interactive maps
  • Want to simplify inter-model comparisons

Developed by Jacob Radford (CIRA/NOAA-GSL)


27 of 36

Example: Our Real-Time Website

Developed by Jacob Radford (CIRA/NOAA-GSL)



28 of 36

Access to archives of AI model forecasts

Research community needs easy access to a multi-year archive of forecasts.

Sample activity:

CIRA-NOAA is building an archive of forecasts

    • Models: FourCastNet v2, Pangu-weather, GraphCast
    • Timeframe: 09/2021 to present
    • Forecasts: initialized twice daily (0Z, 12Z) for 10 days, saving forecasts in 6-hour time steps.
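The archive layout just described (two daily cycles, 10-day forecasts, 6-hour steps) implies a fixed grid of (initialization, lead time) pairs. A minimal Python sketch of that enumeration; `archive_valid_times` is a hypothetical helper, not part of the actual CIRA archive code:

```python
from datetime import datetime

def archive_valid_times(init_day):
    """For one calendar day, yield (init_time, lead_hours) pairs matching
    the archive layout: two runs per day (00Z, 12Z), 10-day forecasts
    saved at 6-hour steps."""
    for cycle in (0, 12):
        init = init_day.replace(hour=cycle)
        for lead_hr in range(0, 240 + 6, 6):  # 0 h .. 240 h in 6 h steps
            yield init, lead_hr

pairs = list(archive_valid_times(datetime(2023, 5, 1)))
print(len(pairs))  # 82 (2 cycles x 41 saved lead times per day)
```

Multiplying 82 fields per day per variable per model over three years makes the storage pressure mentioned below easy to appreciate.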

Research needs:

  • So far, limited effort: selected models, selected duration (3 years), etc.
  • Cannot keep up with new models coming out (set-up, storage needs).
  • How do we bring structure to the model jungle? Selection criteria for which models to feature?


Developed by Jacob Radford (CIRA/NOAA-GSL)

29 of 36

Derived fields: Severe-weather parameters

We have developed accurate, efficient, unit-tested code to add the following parameters to the CIRA archive:

  • CAPE: most unstable (MUCAPE), surface-based (SBCAPE), mixed-layer (MLCAPE)
  • MUCIN, SBCIN, MLCIN
  • Lifted index (surface to 500 hPa)
  • Precipitable water
  • Wind shear: surface to 500 hPa, surface to 850 hPa, 850 to 200 hPa
  • Storm-relative helicity: 0-1 km, 0-3 km
  • Height of planetary boundary layer (PBL)
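The CIRA code itself is not shown here, but two of the simpler parameters above can be sketched from textbook formulas. A minimal Python illustration (hypothetical helper names; simplified trapezoid-rule column integral, not the operational implementation):

```python
import numpy as np

def bulk_shear(u_bot, v_bot, u_top, v_top):
    """Magnitude (m/s) of the bulk wind-shear vector between two levels,
    e.g. the surface-to-500-hPa shear in the parameter list above."""
    return float(np.hypot(u_top - u_bot, v_top - v_bot))

def precipitable_water_mm(q, p_pa):
    """Precipitable water (mm) from specific humidity q (kg/kg) on pressure
    levels p_pa (Pa, increasing toward the surface), via the column
    integral PW = (1 / (g * rho_w)) * integral(q dp)."""
    g, rho_w = 9.81, 1000.0
    dp = np.diff(p_pa)
    integral = np.sum(0.5 * (q[:-1] + q[1:]) * dp)  # trapezoid rule
    return float(integral / (g * rho_w) * 1000.0)

# Toy column: calm surface with a 20 m/s westerly at 500 hPa,
# constant q = 0.01 kg/kg between 500 and 1000 hPa.
print(bulk_shear(0.0, 0.0, 20.0, 0.0))                     # 20.0
print(round(precipitable_water_mm(np.full(2, 0.01),
                                  np.array([5e4, 1e5]))))  # 51
```

Parameters like CAPE, CIN, and lifted index require full parcel thermodynamics and are much more involved, which is why accurate, unit-tested code matters.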

Initial archive: one month (May 2023) for all models (4 daily runs of GraphCast, FourCastNet v2, Pangu)

Full archive: coming soon (needs HPC resources)


Developed by Ryan Lagerquist (CIRA/NOAA-GSL)

30 of 36

Evaluation of AI models

31 of 36

Evaluate output of AI model as if it were an NWP model.

Applying objective validation measures from NWP models

There is a fairly standard suite of verification measures, applied to either global, coarse-resolution or regional, fine-scale model forecasts, that can be used to evaluate forecast output.

There are two broad classes of meteorological model output fields: continuous and feature-specific. �

  • Continuous fields: tend to be smooth and do not vary much from one grid point to the next. Examples: geopotential height, temperature, wind speed.
  • Feature-specific fields: look more like objects where an event occurs and can be dominated by “zero/null/empty” values with abrupt spatial gradients. Examples: precipitation, simulated radar reflectivity, cloud ceiling, visibility

Deterministic forecast verification

  • RMSE for continuous fields
  • Bias/mean error of continuous fields as well as precipitation
  • Dichotomous events (i.e., scores based on a 2x2 contingency table, can be summarized using a performance diagram)
  • Object-oriented
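For illustration, the deterministic scores above reduce to a few lines of array arithmetic. A minimal Python sketch with toy data (hypothetical helper names; no handling of empty contingency-table categories):

```python
import numpy as np

def rmse(forecast, obs):
    """Root-mean-square error for a continuous field."""
    return float(np.sqrt(np.mean((forecast - obs) ** 2)))

def bias(forecast, obs):
    """Mean error (forecast minus observation)."""
    return float(np.mean(forecast - obs))

def contingency_scores(forecast, obs, threshold):
    """POD, FAR, and CSI from the 2x2 contingency table for a dichotomous
    event (exceedance of `threshold`) - the ingredients of a
    performance diagram."""
    f, o = forecast >= threshold, obs >= threshold
    hits = int(np.sum(f & o))
    misses = int(np.sum(~f & o))
    false_alarms = int(np.sum(f & ~o))
    pod = hits / (hits + misses)
    far = false_alarms / (hits + false_alarms)
    csi = hits / (hits + misses + false_alarms)
    return pod, far, csi

fc = np.array([2.0, 0.0, 5.0, 1.0])
ob = np.array([1.5, 0.5, 0.0, 2.0])
pod, far, csi = contingency_scores(fc, ob, threshold=1.0)
print(pod, csi)  # 1.0 0.6666666666666666
```

The same helpers apply unchanged to AI and NWP output, which is exactly the point of this slide.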

For ensembles, probabilistic verification

  • Brier (skill) score; Continuous Ranked Probability (Skill) Score
  • Reliability diagrams
  • ROC diagrams
  • Forecast sharpness histograms
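The probabilistic scores above can likewise be sketched in a few lines. A minimal Python illustration with toy data (hypothetical helper names; equal-width probability bins assumed):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Brier score: mean squared error of probability forecasts against
    binary (0/1) outcomes. Lower is better; 0 is perfect."""
    p, o = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((p - o) ** 2))

def reliability_points(probs, outcomes, n_bins=10):
    """(mean forecast probability, observed frequency) per probability
    bin - the points plotted in a reliability diagram."""
    p, o = np.asarray(probs, float), np.asarray(outcomes, float)
    idx = np.minimum((p * n_bins).astype(int), n_bins - 1)
    return [(float(p[idx == b].mean()), float(o[idx == b].mean()))
            for b in range(n_bins) if np.any(idx == b)]

p = [0.1, 0.9, 0.8, 0.2]
y = [0, 1, 1, 0]
print(brier_score(p, y))  # ≈ 0.025
```

Bin counts from the same `idx` array would also give the forecast sharpness histogram.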

Credit: Material on this slide provided by Jeff Duda.


Status: Lots of research in the works.

32 of 36

Test weaknesses of AI models that NWP models do not have - for which no standard tests yet exist

Develop tests for specific weaknesses of AI models

Suggested sample topics (not an exhaustive list):

  1. Test temporal consistency: NWP models are consistent in time by nature. AI models may not be quite as consistent.
  2. Test distribution of extremes: Compare the distribution of extreme temperature / wind / etc. in the output of AI models to those from NWP models. Are AI models capturing the extremes, or are they missing many of them? How does this distribution change with lead time?
  3. Test consistency between variables (covariance): NWP models by definition guarantee physical relationships between different fields (hard constraints). In AI models those relationships are soft constraints, i.e. the AI model may choose to put more emphasis on other criteria, resulting in inconsistencies between variables.
  4. Estimate other fields that forecasters need to see and compare those to NWP models. Examples: precipitation type, CAPE, CIN, wind shear…
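Two of the suggested tests (temporal consistency, consistency between variables) admit very simple first-cut diagnostics. A minimal Python sketch with toy arrays; the statistics and helper names are illustrative suggestions, not an established protocol:

```python
import numpy as np

def temporal_roughness(series, axis=0):
    """Mean absolute step-to-step increment along the time axis.
    Comparing this statistic between an AI model and an NWP baseline is
    one crude probe of temporal consistency (topic 1 above)."""
    return float(np.mean(np.abs(np.diff(series, axis=axis))))

def cross_variable_mismatch(u, v, speed):
    """RMS difference between a directly predicted wind-speed field and
    the speed implied by the predicted u/v components (topic 3 above).
    Zero for a model with perfectly consistent variables."""
    return float(np.sqrt(np.mean((speed - np.hypot(u, v)) ** 2)))

u = np.array([3.0, 0.0])
v = np.array([4.0, 1.0])
print(cross_variable_mismatch(u, v, np.array([5.0, 1.0])))  # 0.0
print(temporal_roughness(np.array([1.0, 2.0, 4.0])))        # 1.5
```

A nonzero mismatch is not automatically wrong, but growth of either statistic with lead time would be exactly the kind of AI-specific failure mode these tests are meant to catch.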

Activity: See prior slide - work by Ryan Lagerquist.

  • Evaluate dynamics of AI models for weather prediction. Key paper: Gregory J. Hakim and Sanjit Masanam, Dynamical tests of a deep-learning weather prediction model, Sept 19, 2023. https://arxiv.org/abs/2309.10867

  • Do AI models represent meteorological features correctly? Examples: tropical cyclones, synoptic fronts, atmospheric rivers, ocean eddies. Activity: evaluation of tropical cyclones, see next slide.


Status: Lots of research in the works.

33 of 36

Evaluation of Forecasts of Tropical Cyclones

  • Project Goal
    • Use CIRA AI4NWP model archive to evaluate tropical cyclone track/intensity forecasts

  • Project Team
    • Mark DeMaria, Kate Musgrave, Galina Chirokova, Robert DeMaria, Jacob T. Radford, Imme Ebert-Uphoff - CIRA
    • James Franklin, National Hurricane Center Contractor
  • Method
    • Apply a TC tracker to FourCastNet v1 and v2, Pangu-weather, and GraphCast
    • Use NHC verification rules for track and intensity (max wind)
    • Compare with operational model TC track/intensity forecasts
    • Northern Hemisphere TC cases, May-Oct 2023
  • Results
    • Track forecasts are as good as or better than those from GFS, ECMWF, and regional hurricane models
    • Intensity forecasts are much worse, due to a low bias
    • Intensity forecasts may still be useful if a post-processing bias correction is applied
    • Manuscript in preparation for AMS AIES journal
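For reference, the core quantities in NHC-style verification (great-circle track error, intensity bias) are straightforward to compute. A minimal Python sketch with toy positions; the haversine formula and helper names are illustrative, not the project’s actual tracker or verification code:

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance (km) between a forecast TC position and the
    best-track position - the basis of track error."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def intensity_bias(fc_max_wind, bt_max_wind):
    """Mean (forecast - best track) max wind; a negative value is the
    kind of low intensity bias reported above."""
    return sum(f - b for f, b in zip(fc_max_wind, bt_max_wind)) / len(fc_max_wind)

# One degree of latitude is ~111 km:
print(round(great_circle_km(20.0, -60.0, 21.0, -60.0)))  # 111
print(intensity_bias([80, 90], [100, 100]))              # -15.0
```

A systematic negative `intensity_bias` like the toy value here is also what makes a post-processing bias correction plausible.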


34 of 36

Track and Intensity MAE and Intensity Bias for 2023 AI4NWP TC Forecasts

(Figure: three panels showing track error, intensity error, and intensity bias.)

35 of 36

Conclusions - key research needs

Need coordinated, comprehensive evaluation effort that includes:

  • Physical validation of AI models
  • Value to forecasters / impact on decision-making
  • AI pitfalls / bias identification and mitigation

Need to involve:

  • NWP experts
  • AI experts
  • Social scientists
  • Forecasters
  • R2O - operational perspective
  • Software managers

Urgent need: bring forecasters and social scientists more into the discussion. Social scientists are needed to elicit feedback from forecasters + many other roles.

36 of 36

Acknowledgements

Funding for this work was provided by:

  • NOAA GSL AI funding
  • CIRA ML strategic funding
  • Pilot project under HFIP/STI support to CIRA TC group’s efforts

Thank You to the conveners of these EGU sessions on “Forecasting the Weather”:

Yong Wang, Aitor Atencia, Kan Dai, Lesley De Cruz, Daniele Nerini.

Kyle Hilburn CIRA/CSU
