Facilitating higher resolution Ocean simulations in EC-Earth4: NEMO mixed precision and I/O at 3km global
Miguel Castrillo
BSC-ES, Computational Earth Sciences
09/02/2021
EC-Earth meeting
MareNostrum 4
Total peak performance: 13.7 Pflops
General Purpose Cluster: 11.15 Pflops (1.07.2017)
CTE1-P9+Volta: 1.57 Pflops (1.03.2018)
CTE2-AMD: 0.52 Pflops (2020)
CTE3-Arm V8: 0.5 Pflops (2020)
MareNostrum 1
2004 – 42.3 Tflops
1st Europe / 4th World
New technologies
MareNostrum 2
2006 – 94.2 Tflops
1st Europe / 5th World
New technologies
MareNostrum 3
2012 – 1.1 Pflops
12th Europe / 36th World
MareNostrum 4
2017 – 11.1 Pflops
2nd Europe / 13th World
New technologies
Access: prace-ri.eu/hpc_acces
Access: bsc.es/res-intranet
MareNostrum 5. A European pre-exascale supercomputer
200 Petaflops peak performance (200 x 10^15)
Experimental platform to create supercomputing technologies “made in Europe”
223 M€ of investment
Hosting Consortium:
Spain
Portugal
Turkey
Croatia
ESiWACE2 WP1 - GCM 10 km
Excellence in Simulation of Weather and Climate in Europe
NEMO 4
Mixed precision calculations in NEMO
In a nutshell
2019: “How to use mixed precision in ocean models: exploring a potential reduction of numerical precision in NEMO 4.0 and ROMS 3.6.”
Tintó Prims, O., Acosta, M.C., Moore, A.M., Castrillo, M., Serradell, K., Cortés, A. and Doblas-Reyes, F.J.
Discriminating accurate results in nonlinear models
Difference between double and single after one month
Sea Surface Temperature diff.
Oriol Tintó
Discriminating accurate results in nonlinear models
Difference between double and mixed after one month
Sea Surface Temperature diff.
The range is a hundred times smaller!
Oriol Tintó
NEMO efficiency evolution
ORCA1 scalability (MareNostrum4)
OCE, ICE
1-year runs
Reduced outclass (4 3D, 23 2D monthly avgs)
NEMO efficiency evolution
ORCA1 scalability - Mixed precision branch (MareNostrum4)
The ORCA36 configuration
From ORCA2 to ORCA36
name | jpiglo | jpjglo | jpk | size (million vertices) | resolution (km) |
ORCA2 | 182 | 149 | 31 | 0.84 | 220.19 |
ORCA1 (SR) | 362 | 292 | 75 | 7.92 | 110.7 |
ORCA025 (HR) | 1,442 | 1,021 | 75 | 110.42 | 27.79 |
ORCA12 (VHR) | 4,322 | 3,059 | 75 | 991.57 | 9.27 |
ORCA36 (VVHR?) | 12,962 | 9,173 | 75 | 8,917.53 | 3.09 |
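Size increase between consecutive configurations: x9.4 (ORCA2 to ORCA1), x14 (ORCA1 to ORCA025), x9 (ORCA025 to ORCA12), x9 (ORCA12 to ORCA36)
Overall, from ORCA2 to ORCA36: x10,650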
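The size and resolution columns of the table can be recomputed from the grid dimensions. The sketch below assumes that the size is simply jpiglo x jpjglo x jpk and that the nominal resolution is the equatorial circumference divided by jpiglo; the value 40,075 km and the helper names are assumptions, not taken from the slides, although both formulas agree with the tabulated values to within rounding.

```python
EQUATOR_KM = 40075.0  # assumption: approximate equatorial circumference of the Earth

# name: (jpiglo, jpjglo, jpk), copied from the table
grids = {
    "ORCA2": (182, 149, 31),
    "ORCA1": (362, 292, 75),
    "ORCA025": (1442, 1021, 75),
    "ORCA12": (4322, 3059, 75),
    "ORCA36": (12962, 9173, 75),
}


def size_millions(jpiglo, jpjglo, jpk):
    """Number of grid points in millions."""
    return jpiglo * jpjglo * jpk / 1e6


def resolution_km(jpiglo):
    """Nominal zonal grid spacing at the equator, in km (hypothetical helper)."""
    return EQUATOR_KM / jpiglo


names = list(grids)
for name in names:
    dims = grids[name]
    print(f"{name}: {size_millions(*dims):.2f} M points, {resolution_km(dims[0]):.2f} km")

# Growth factor between each configuration and the next finer one
for coarse, fine in zip(names, names[1:]):
    factor = size_millions(*grids[fine]) / size_millions(*grids[coarse])
    print(f"{coarse} -> {fine}: x{factor:.1f}")
```

The growth factors printed by the loop can be compared directly with the multipliers quoted on the slide.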
Global 1/36° (ORCA36): context
IMMERSE (EU H2020):
demonstrator for developments in NEMO 4 (HPC developments)
ESiWACE2 (EU H2020):
demonstrator for “production runs at unprecedented resolution on pre-exascale supercomputers”
CMEMS contract with BSC:
“87-GLOBAL-CMEMS-NEMO: EVOLUTION AND OPTIMISATION OF THE NEMO CODE USED FOR THE MFC-GLO IN CMEMS”:
NEMO HPC performance, global 1/36°
Clement Bricaud (MOI)
Scaling ORCA36 - Grand Challenge 2019
ORCA36 scalability (MareNostrum4)
NEMO r4.0
30 s time step
300 steps
No output
MN4 total = 165,888 cores!!
NEMO I/O
NEMO I/O scalability
ORCA1 - XIOS scalability (MareNostrum4)
x15 !!!
Constant number of I/O servers
48 XIOS servers (1 node)
Writing monthly averages
One-file mode
Average step
I/O step
Scaling ORCA36 - Grand Challenge 2020
ORCA36 - XIOS scalability (MareNostrum4)
Performance mode
(large buffer)
Memory mode
(large buffer)
x10 !!!
NEMO using 512 nodes
Scaling I/O servers
5-hour 3D output (340 GB per hour)
Multiple-file mode
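To put the 340 GB per hour figure in context, the sketch below estimates the raw size of one full ORCA36 3D field. It is only an order-of-magnitude check under stated assumptions: the grid dimensions are those of the ORCA36 row in the configuration table, 1 GB is taken as 10^9 bytes, the 4-byte and 8-byte widths are assumptions (the slide does not say which fields or precision were written), and land masking, compression and file metadata are ignored.

```python
JPIGLO, JPJGLO, JPK = 12_962, 9_173, 75  # ORCA36 dimensions from the configuration table
OUTPUT_GB_PER_HOUR = 340                 # figure quoted on the slide


def field_size_gb(bytes_per_value: int) -> float:
    """Raw size of one full 3D field in GB (10^9 bytes), no masking or compression."""
    return JPIGLO * JPJGLO * JPK * bytes_per_value / 1e9


for label, width in (("single precision", 4), ("double precision", 8)):
    size = field_size_gb(width)
    fields = OUTPUT_GB_PER_HOUR / size
    print(f"{label}: {size:.2f} GB per 3D field, about {fields:.1f} fields per 340 GB")
```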
Take home messages
Thank you
miguel.castrillo@bsc.es
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 823988.
NEMO efficiency evolution
ORCA025 scalability
OCE, ICE
2-month runs
Reduced outclass (monthly 4 3D, 23 2D)
Intel 2017.4; IMPI 2018.4
Interpolation from G2V4 to the model grid for initial conditions (IC)
¼° (ORCA025) | 1/12° (ORCA12) | 1/36° (ORCA36) | 1/36° (ORCA36)
IC smooth | IC no smooth
SST after 1 hour
MOD(UV) after 7 days (hourly)
global 1/12° | global 1/36° | global ¼°
Surface velocities after 3 weeks (1-hour outputs)
ORCA36
Configurations
Code | Time step (s) | Init T&S | Atmospheric Forcing | ICE | Runoff | Geothermal heating | QSR |
O36-I | 90 | F | F | F | F | F | F |
O36-II | 90 | F | 512x256 | F | F | F | F |
O36_ICE | 90 | F | 512x256 | T | F | F | F |
O36_FULL* | 30 | 9,173x12,962 | 512x256 | T | 9,173x12,962 | 360x180 | 9,173x12,962 |
ORCA36 in MareNostrum4
Resource constraints
Configuration | Minimum resources standard nodes (96GB) | Minimum resources high-mem nodes (384GB) |
O36-I | 64 nodes, 6TB memory | 16 nodes, 6TB memory |
O36-II | 64 nodes, 6TB memory | 16 nodes, 6TB memory |
O36_ICE | 64 nodes, 6TB memory | 16 nodes, 6TB memory |
O36_FULL* | - | 16 nodes, 6TB memory |
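The two columns of the table describe the same aggregate memory. A minimal check, assuming the binary convention 1 TB = 1024 GB (an assumption, the slide does not state the convention) and using a hypothetical helper name:

```python
GB_PER_TB = 1024  # assumption: binary convention

# (number of nodes, memory per node in GB), copied from the table
standard_nodes = (64, 96)
high_mem_nodes = (16, 384)


def total_memory_tb(nodes: int, gb_per_node: int) -> float:
    """Aggregate memory of an allocation, in TB."""
    return nodes * gb_per_node / GB_PER_TB


for label, (nodes, gb) in (("standard", standard_nodes), ("high-mem", high_mem_nodes)):
    print(f"{label}: {nodes} nodes x {gb} GB = {total_memory_tb(nodes, gb):.1f} TB")
```

Both allocations provide the same total memory; the high-memory option reaches it with a quarter of the nodes.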
The key_single
To enable compilation in mixed precision, add key_single to the CPP keys.
The key enables the following code:
single_precision_substitute.h90
#if defined key_single
#  define CASTWP(x) REAL(x,wp)
#  define CASTDP(x) REAL(x,dp)
#else
#  define CASTWP(x) x
#  define CASTDP(x) x
#endif
par_kind.F90
#if defined key_single
   INTEGER, PUBLIC, PARAMETER :: wp = sp
#else
   INTEGER, PUBLIC, PARAMETER :: wp = dp
#endif
Stella Paronuzzi
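The effect of switching the working precision, and of promoting selected operations back to double as CASTDP does, can be illustrated outside Fortran. The Python sketch below is not NEMO code: it emulates single precision with the standard `struct` module and compares a running sum kept entirely in single precision with one whose accumulator stays in double. The helper names (`to_sp`, `accumulate`) and the test data are hypothetical.

```python
import math
import struct


def to_sp(x: float) -> float:
    """Round a Python float (IEEE double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(values, promote_to_double: bool) -> float:
    """Sum single-precision inputs with a single or a double accumulator."""
    total = 0.0
    for v in values:
        total += to_sp(v)          # inputs live in the working precision (wp = sp)
        if not promote_to_double:
            total = to_sp(total)   # keep the running sum in single precision too
    return total


values = [0.1] * 100_000                         # hypothetical test data
reference = math.fsum(to_sp(v) for v in values)  # accurately rounded sum of the sp inputs

err_single = abs(accumulate(values, promote_to_double=False) - reference)
err_mixed = abs(accumulate(values, promote_to_double=True) - reference)
print(f"single accumulator error: {err_single:.3e}")
print(f"double accumulator error: {err_mixed:.3e}")
```

The inputs are stored in single precision in both cases; only the accumulator differs, which is the idea behind keeping a few sensitive computations in double while the bulk of the model runs in single.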
ORCA36 scalability with I/O
3D hourly output
One-file mode
NEMO proc. | XIOS proc. | NEMO step time | XIOS step time | Steps/second |
1536 | 1536 | ~18s | ~366s | 0.05 |
3072 | 1536 | ~8s | ~348s | 0.097 |
3072 | 1920 | ~8s | ~376s | 0.095 |
Multiple-file mode
NEMO proc. | XIOS proc. | NEMO step time | XIOS step time | Steps/second |
1536 | 1536 | ~18s | ~17s | 0.056 |
3072 | 1536 | ~8s | ~17s | 0.122 |
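One way to read the two tables is to compare the measured throughput with the compute-only bound of one step per NEMO step time. The sketch below does this under the assumption that the step-time columns are seconds per step; since those times are approximate (marked with ~), ratios slightly above 1 are possible. The helper name `io_efficiency` is hypothetical.

```python
# (NEMO proc., XIOS proc., NEMO step time in s, XIOS step time in s, steps/second),
# copied from the two tables
first_table = [
    (1536, 1536, 18, 366, 0.05),
    (3072, 1536, 8, 348, 0.097),
    (3072, 1920, 8, 376, 0.095),
]
second_table = [
    (1536, 1536, 18, 17, 0.056),
    (3072, 1536, 8, 17, 0.122),
]


def io_efficiency(nemo_step_s: float, steps_per_s: float) -> float:
    """Measured throughput as a fraction of the compute-only bound 1 / nemo_step_s."""
    return steps_per_s * nemo_step_s


for label, table in (("first table", first_table), ("second table", second_table)):
    for nemo_p, xios_p, nemo_t, xios_t, rate in table:
        eff = io_efficiency(nemo_t, rate)
        print(f"{label}: {nemo_p} NEMO / {xios_p} XIOS procs -> {eff:.2f} of compute-only rate")
```

With these numbers, the second table stays close to the compute-only rate in both rows, while the first table falls to roughly three quarters of it at 3072 NEMO processes.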