WarpX
Structure of the code, and how to navigate through the source
Maxence Thévenet
LBNL
03/05/2020
Part I: code organization
AMReX basics: block-structured mesh refinement
AMReX class | Description |
Box | Lower and upper indices |
FArrayBox | Array defined on a Box |
FabArray/MultiFab | Collection of FArrayBox |
ParticleContainer | Collection of particles per box, with an iterator |
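For illustration, a minimal AMReX sketch (not WarpX code) that ties these objects together; the 64^3 domain and the 32-cell maximum box size are arbitrary choices:

#include <AMReX_MultiFab.H>

void MakeFieldSketch ()
{
    // Domain of 64^3 cells, chopped into Boxes of at most 32 cells per side
    amrex::Box domain(amrex::IntVect::TheZeroVector(),
                      amrex::IntVect(AMREX_D_DECL(63, 63, 63)));
    amrex::BoxArray ba(domain);
    ba.maxSize(32);
    amrex::DistributionMapping dm(ba);   // assign Boxes to MPI ranks
    amrex::MultiFab mf(ba, dm, 1, 1);    // 1 component, 1 guard cell
    for (amrex::MFIter mfi(mf); mfi.isValid(); ++mfi) {
        amrex::FArrayBox& fab = mf[mfi]; // the FArrayBox living on this Box
        fab.setVal(0.0);
    }
}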
for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti) {   // iterator over Box/tile
    auto& attribs = pti.GetAttribs();                        // particle attributes (SoA)
    auto& uxp = attribs[PIdx::ux];
    const FArrayBox& exfab = Ex[pti];                        // array associated with the Box/tile
}
Main WarpX classes (everything in Source/, .H file with same name as class)
Class | Directory | Role |
WarpX : AmrCore | ./ | All classes, fields (= MultiFab), full diags (plotfiles), time steps, some BC |
WarpXParticleContainer : ParticleContainer | Particles/ | 1 species (physical or not), stored per box, based on Particle; field gather (FG), particle push (PP), current deposition (CD) |
MultiParticleContainer | Particles/ | Vector of all species, and loops over species |
FiniteDifferenceSolver | FieldSolver/FiniteDifferenceSolver/ | Evolve E and B (CKC, Yee, orders?) |
SpectralSolver | FieldSolver/SpectralSolver/ | Evolve E and B (PSATD, etc.) |
ILaserProfile, GaussianLaserProfile, etc. | Laser/ | Laser profile (Gaussian, from file, parsed, etc.) |
LaserParticleContainer | Laser/ | Antenna |
PlasmaInjector | Initialization/ | Plasma profile |
WarpXParser | Parser/ | General parser (used in many places) |
ReducedDiagnostics | Diagnostics/ReducedDiags/ | Reduced diags, well-separated module |
BackTransformedDiagnostics | Diagnostics/ | Big machinery, for runs in a boosted frame |
GuardCellManager | Parallelization/ | Grid communications; relies on MultiFab::FillBoundary |
PML | BoundaryConditions/ | Well-separated module, BC |
Filter | Filter/ | NCI filter (on E and B), bilinear filter (on J) |
“Multi” = “collection of”
BIG AMReX dependency:
Particles
Field solver
Initial conditions
Boundary conditions
Diagnostics
Communications
Numerics
WarpX: Fields
Efield_fp
WarpX has a LOT of fields; all of them are members of the class WarpX.
void
WarpX::EvolveE (amrex::Real a_dt)
{
    for (int lev = 0; lev <= finest_level; ++lev) {
        EvolveE(lev, a_dt);
    }
}
All field arrays are directly members of the class WarpX
class WarpX
    amrex::Vector<std::array< std::unique_ptr<amrex::MultiFab>, 3 > > Efield_fp;
    // outer Vector: MR levels; inner std::array: the 3 components (x, y, z)
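As a small illustration (a sketch from inside a WarpX member function, with the level index lev assumed), one field component at one MR level is accessed and then manipulated through the usual amrex::MultiFab member functions:

    // Sketch: get the x-component of E on MR level lev and zero it out.
    amrex::MultiFab& Ex_lev = *Efield_fp[lev][0];   // [MR level][component]
    Ex_lev.setVal(0.0);                             // a MultiFab member function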
We use the amrex::MultiFab member functions for most of our operations:
Most functions have a general and a per-level version
void
WarpX::EvolveE (int lev, amrex::Real a_dt)
{
m_fdtd_solver_fp[lev]->EvolveE( Efield_fp[lev], etc. );
}
Why not amrex::Vector<std::unique_ptr<amrex::MultiFab> > Efield_fp; with 3-component MultiFabs? Because Ex, Ey and Ez are staggered differently on the grid, so each component needs its own MultiFab with its own index type.
WarpX: Particles 1/2 : Particles and ParticleContainers
for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti)
{
    const auto GetPosition = GetParticlePosition(pti, offset);
    auto& attribs = pti.GetAttribs();
    amrex::ParticleReal* const uxp = attribs[PIdx::ux].dataPtr();
    const long np = pti.numParticles();
    amrex::ParallelFor(np,
        [=] AMREX_GPU_DEVICE (long ip)
        {
            amrex::ParticleReal xp, yp, zp;
            GetPosition(ip, xp, yp, zp);
            // now I can play with xp, uxp[ip], etc.
        }
    );
}
amrex::Particle
amrex::ParticleContainer
Array-of-Struct: x1 y1 z1 x2 y2 z2 … xn yn zn : position, id, CPU
Struct-of-Array: ux1 ux2 … uxn … uz1 uz2 … uzn : momentum, fields, etc.
p.pos(0) = x position of the particle
ParticleContainer = species
Particles are stored per-box (or per-tile on CPU)
STANDARD WAY: loop with WarpXParIter and access the AoS and SoA data separately (as above).
NEW WAY: access everything through a single SuperParticleType:

using PType = typename WarpXParticleContainer::SuperParticleType;
PType p;
p.pos(0);
p.rdata(PIdx::w);
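For instance, a minimal sketch of the new way combined with an AMReX particle reduction (the container pc is assumed; amrex::ReduceSum comes from AMReX_ParticleReduce.H):

    // Sum the weights of all particles in the container, on CPU or GPU.
    using PType = typename WarpXParticleContainer::SuperParticleType;
    amrex::Real total_weight = amrex::ReduceSum(pc,
        [=] AMREX_GPU_HOST_DEVICE (const PType& p) -> amrex::Real
        {
            // p gives unified access: p.pos(0) is x (AoS), p.rdata(PIdx::w) is the weight (SoA)
            return p.rdata(PIdx::w);
        });
    amrex::ParallelDescriptor::ReduceRealSum(total_weight);   // sum over MPI ranks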
WarpX: Particles 2/2: MultiParticleContainer
class MultiParticleContainer
amrex::Vector<std::unique_ptr<WarpXParticleContainer> > allcontainers;
Multi-species handling (QED, etc.)
void
MultiParticleContainer::FieldGather (/* lev, Ex, Ey, Ez, Bx, By, Bz */)
{
    // loop over all species
    for (auto& pc : allcontainers) {
        pc->FieldGather(lev, Ex, Ey, Ez, Bx, By, Bz);
    }
}
1 ParticleContainer = 1 species.
Class that collects all ParticleContainers ( = all species): MultiParticleContainer
Class WarpX has a SINGLE MultiParticleContainer member variable: mypc
class WarpX
std::unique_ptr<MultiParticleContainer> mypc;
What happens when I run a simulation? Overview
MPI_Init(&argc, &argv);
amrex::Initialize(argc,argv);
WarpX warpx;
warpx.InitData();
warpx.Evolve();
amrex::Finalize();
MPI_Finalize();
OMP_NUM_THREADS=2
mpirun -np 4 ./warpx.ex inputs
amr.n_cell = 64 64 64
amr.max_grid_size = 64
amr.blocking_factor = 32
amr.max_level = 0
geometry.coord_sys = 0
geometry.is_periodic = 1 1 1
geometry.prob_lo = -2. -2. -2.
geometry.prob_hi = 2. 2. 2.
All the rest in inputs
Initialize MPI
Initialize AMReX (box decomposition, distribution mapping)
Construct WarpX: does NOT allocate memory
InitData: allocate memory, and actually create fields and particles with the specified distributions
Evolve: run the iterations
What happens when I run a simulation? Initialization sequence
WarpX warpx;
warpx.InitData();
WarpX::WarpX
warpx.do_filter
particles.species_names = electrons
electrons.num_particles_per_cell = 2
After the constructor: null pointers, no particles yet.
WarpX::InitData
Allocate all field MultiFabs!
Inject particles with requested plasma profile
PIC loop
What happens when I run a simulation? Time iteration sequence
warpx.Evolve();
WarpX::EvolveEM (Evolve): loop over iterations
WarpX::OneStep_nosub
mypc->doFieldIonization();
mypc->doCoulombCollisions();
PushParticlesandDepose();
SyncCurrent();
EvolveB(0.5*dt[0]);
FillBoundaryB();
EvolveE(dt[0]);
FillBoundaryE();
EvolveB(0.5*dt[0]);
if (do_pml)
DampPML();
FillBoundaryE();
FillBoundaryB();
WarpX::PushParticlesandDepose: loop over ParticleContainers and call WarpXParticleContainer::Evolve
for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti)
applyNCIFilter
FieldGather
PushPX
DepositCurrent
Portability: amrex::ParallelFor
Portable version (compiled for CPU or GPU):

amrex::ParallelFor(
    N,
    [=] AMREX_GPU_DEVICE (int i) {
        xp[i] += 1.;
    }
);

Loop over particles: CPU

for (int i = 0; i < N; i++) {
    xp[i] += 1.;
}

Loop over particles: GPU, CUDA (NVIDIA)

__global__ void kernel (int* xp, int N) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) xp[i] += 1.;
}
kernel<<<(N+255)/256, 256>>>(xp, N);
We don’t want to write everything twice. AMReX provides an abstraction (portability) layer; its main function is ParallelFor.
We write the code only once, and it is compiled as a plain for loop or as a kernel launch, depending on USE_GPU.
AMReX provides different flavors of ParallelFor, for loops over particles, fields, etc.
amrex::ParallelFor(tex,
    [=] AMREX_GPU_DEVICE (int i, int j, int k){
        Ex(i, j, k) += 1.;
    }
);

amrex::ParallelFor(tex, tey, tez,
    [=] AMREX_GPU_DEVICE (int i, int j, int k){
        Ex(i, j, k) += 1.;},
    [=] AMREX_GPU_DEVICE (int i, int j, int k){
        Ey(i, j, k) += 1.;},
    [=] AMREX_GPU_DEVICE (int i, int j, int k){
        Ez(i, j, k) += 1.;}
);
We cannot use member variables (or any host-only object) directly inside a ParallelFor lambda: on GPU the lambda captures by value, and a member access would capture "this", a host pointer.
🡪 Make a local copy of just the variable you need, and capture that copy.
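A minimal sketch of this rule (the member name m_charge and the arrays wp, np are assumptions, not actual WarpX names):

    // BAD on GPU: using m_charge inside the lambda would capture "this" (a host pointer).
    // GOOD: copy the member into a local variable, and capture that copy by value.
    const amrex::Real q = m_charge;
    amrex::ParallelFor(np,
        [=] AMREX_GPU_DEVICE (long ip)
        {
            wp[ip] *= q;   // safe: q was copied before the kernel launch
        }
    );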
PIC elementary functions
File | Function | Where it is called |
Particles/Gather/FieldGather.H | void doGatherShapeN | free-standing function, called in PhysicalParticleContainer::Evolve |
Particles/Pusher/UpdateMomentumBoris.H | void PhysicalParticleContainer::PushPX | member of PhysicalParticleContainer, called in PhysicalParticleContainer::Evolve (see the schematic sketch below) |
Particles/Deposition/CurrentDeposition.H | void doDepositionShapeN | free-standing function, called in PhysicalParticleContainer::Evolve |
FieldSolver/FiniteDifferenceSolver/EvolveE.cpp | void FiniteDifferenceSolver::EvolveE | member of FiniteDifferenceSolver, called in WarpX::OneStep_nosub |
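For reference, the Boris momentum update implemented in UpdateMomentumBoris.H looks schematically like this (non-relativistic, single particle; a sketch, not the actual WarpX implementation):

void BorisPushSketch (amrex::Real& ux, amrex::Real& uy, amrex::Real& uz,
                      amrex::Real Ex, amrex::Real Ey, amrex::Real Ez,
                      amrex::Real Bx, amrex::Real By, amrex::Real Bz,
                      amrex::Real q, amrex::Real m, amrex::Real dt)
{
    const amrex::Real econst = 0.5*q*dt/m;
    // first half electric-field push
    ux += econst*Ex; uy += econst*Ey; uz += econst*Ez;
    // magnetic rotation
    const amrex::Real tx = econst*Bx, ty = econst*By, tz = econst*Bz;
    const amrex::Real t2 = tx*tx + ty*ty + tz*tz;
    const amrex::Real sx = 2.*tx/(1.+t2), sy = 2.*ty/(1.+t2), sz = 2.*tz/(1.+t2);
    const amrex::Real upx = ux + (uy*tz - uz*ty);
    const amrex::Real upy = uy + (uz*tx - ux*tz);
    const amrex::Real upz = uz + (ux*ty - uy*tx);
    ux += upy*sz - upz*sy;
    uy += upz*sx - upx*sz;
    uz += upx*sy - upy*sx;
    // second half electric-field push
    ux += econst*Ex; uy += econst*Ey; uz += econst*Ez;
}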
Communications
A PIC code should only require local communications between neighboring boxes.
Communication patterns can be tricky with mesh refinement, though 🡪 AMReX handles ALL of it.
Field halo exchanges: FillBoundary. Called many times per iteration (see the sketch after this list).
In WarpX, we try to exchange as little data as possible: GuardCellManager keeps track of the number of guard cells we need to exchange at each stage of the PIC loop. That could be improved (in particular for PMLs).
Particle halo exchanges: Redistribute. Called once per iteration.
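A minimal sketch of a field halo exchange with AMReX (not the actual WarpX call sites; mf and geom are assumed to be the MultiFab and the Geometry of the current level):

    // Fill the guard cells of mf from neighboring boxes,
    // including across periodic boundaries.
    mf.FillBoundary(geom.periodicity());
    // Or restrict the exchange to a few components (ncomp assumed):
    // mf.FillBoundary(0, ncomp, geom.periodicity());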
PIC loop
WarpX::OneStep_nosub
mypc->doFieldIonization();
mypc->doCoulombCollisions();
PushParticlesandDepose();
SyncCurrent();
EvolveB(0.5*dt[0]);
FillBoundaryB();
EvolveE(dt[0]);
FillBoundaryE();
EvolveB(0.5*dt[0]);
if (do_pml)
DampPML();
FillBoundaryE();
FillBoundaryB();
Diagnostics
Full diagnostics (plotfiles): Part of class WarpX ☹
Mostly in Diagnostics/
WarpX::EvolveEM
Slice diags: ~parallel implementation
Back-transformed diagnostics (boosted-frame) ☺
Mostly in Diagnostics/
Class BackTransformedDiagnostic
std::vector<std::unique_ptr<LabFrameDiag> > m_LabFrameDiags_;
Class LabFrameDiag
std::unique_ptr<amrex::MultiFab> m_data_buffer_;
WarpX::EvolveEM
Reduced diagnostics (~1 number per diag iteration) ☺☺
ALL in Diagnostics/ReducedDiags/
class WarpX
MultiReducedDiags* reduced_diags;
class MultiReducedDiags
std::vector<std::unique_ptr<ReducedDiags>> m_multi_rd;
class ReducedDiags
base class for these reduced diagnostics
WarpX::EvolveEM
🡪 reduced_diags->ComputeDiags(step);
🡪 reduced_diags->WriteToFile(step);
Then, the instance of ReducedDiags gets a pointer to WarpX to access all the data (unfortunately in a non-const fashion)
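To make the structure concrete, here is a hypothetical reduced diagnostic sketched from the classes above; the member m_data and the accessors WarpX::GetInstance and getEfield are assumptions, not verified API:

// Hypothetical diagnostic: store the maximum of Ex at each diag iteration.
class MaxEx : public ReducedDiags
{
public:
    MaxEx (std::string rd_name) : ReducedDiags(rd_name) { m_data.resize(1, 0.); }

    // called from WarpX::EvolveEM through reduced_diags->ComputeDiags(step)
    virtual void ComputeDiags (int step) override
    {
        auto& warpx = WarpX::GetInstance();                  // access to all WarpX data
        const amrex::MultiFab& Ex = warpx.getEfield(0, 0);   // assumed accessor: level 0, x-component
        m_data[0] = Ex.norm0();                              // one number, written by WriteToFile
    }
};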
1 LabFrameDiag = 1 snapshot in the lab frame (t = t1, z in [zmin, zmax]).
At each iteration, it corresponds to a different (t’, z’) slice in the boosted frame.
Diags should be reorganized soon(ish)!
Mesh Refinement
PIC loop without MR:
PushParticlesandDepose()
    FieldGather()
    PushPX()
    DepositCurrent()
SumBoundaryJ()
EvolveB(0.5*dt[0])
FillBoundaryB()
EvolveE(dt[0])
FillBoundaryE()
EvolveB(0.5*dt[0])
FillBoundaryB()

PIC loop with MR:
UpdateAuxilaryData()
PushParticlesandDepose()
    FieldGather()
    PushPX()
    DepositCurrent()
SyncCurrent()
EvolveB(0.5*dt[0])
FillBoundaryB()
EvolveE(dt[0])
FillBoundaryE()
EvolveB(0.5*dt[0])
FillBoundaryB()
Substitution: F_{n+1}(a) = I[ F_n(a) - F_{n+1}(c) ] + F_{n+1}(f)
a = auxiliary, f = fine, c = coarse; n, n+1 are the refinement levels L_n, L_{n+1}; I = interpolation from the coarse level to the fine level
[Figure: levels L_n and L_{n+1}; each refinement patch is surrounded by an absorbing layer (PML)]
J 🡪 SyncCurrent():
    interpCurrentFineToCoarse()
    AddCurrentFromFineLevelandSumBoundary()
    ApplyFilterandSumBoundaryJ()
    MultiFab::ParallelAdd()
E, B 🡪 UpdateAuxilaryData():
    UpdateAuxilaryDataStagToNodal()
    MultiFab::setVal()
    MultiFab::ParallelCopy()
    MultiFab::Subtract()
Continuous Integration
Nightly builds on the CRD clusters Battra (CPU) and Garuda (GPU): WarpX/Regression/, configured in WarpX-tests.ini
🡪 catch everything (and more)
Travis CI tests on every commit on GitHub: WarpX/Examples/ (prepare_file_travis.py reformats the test configuration for Travis)
🡪 only catch what we ask for
There are also performance tests, but that is another story.
Profiling in WarpX
void PhysicalParticleContainer::Evolve (){
    WARPX_PROFILE("PPC::Evolve()");
    WARPX_PROFILE_VAR_NS("PPC::FieldGather", blp_fg);

    WARPX_PROFILE_VAR_START(blp_fg);
    FieldGather();
    WARPX_PROFILE_VAR_STOP(blp_fg);
}
-------------------------------------------------------------------------------------------
Name NCalls Excl. Min Excl. Avg Excl. Max Max %
-------------------------------------------------------------------------------------------
PPC::FieldGather 800 29.06 29.23 29.32 68.63%
PPC::Evolve() 100 8.175 8.264 8.329 19.50%
FabArray::ParallelCopy() 600 1.11 1.239 1.434 3.36%
FillBoundary_finish() 1200 0.6782 0.8679 1.03 2.41%
Redistribute_partition 102 0.8006 0.8529 0.8762 2.05%
FillBoundary_nowait() 1200 0.4906 0.6733 0.8346 1.95%
AMReX Tiny Profiler: BL_PROFILE
We have a wrapper in WarpX: WARPX_PROFILE
I think current deposition is not instrumented
Why bother with PRs?
Write a PR
Write good code
Prepare a PR
Test your code
Clean your PR
Comment your code
Write a PR description
Ask for review
Part II: Performance and GPUs
GPU-accelerated supercomputing
CPU supercomputer: Cori @ NERSC (13/500), programmed with MPI + X
GPU-accelerated supercomputer: Summit @ OLCF (1/500), programmed with MPI + X
All pre-exascale and planned exascale supercomputers are accelerated
1 Cori node (3 TFLOPS, 112 GB) | 1 Intel KNL CPU (3 TFLOPS, 96 GB DDR4 + 16 GB HBM) |
1 Summit node (42 TFLOPS, 608 GB) | 2 IBM Power9 CPUs (0.8 TFLOPS, 256 GB DDR4) | 6 NVIDIA V100 GPUs (7.8 TFLOPS, 16 GB HBM) |
GPU-accelerated supercomputing
[Timeline diagram: the host issues kernel launches 1-4 and continues computing; the device executes kernels 1-4 asynchronously.]
Asynchronous execution
You can explicitly synchronize with cudaDeviceSynchronize
GPU-accelerated supercomputing
[Timeline diagram: each ParallelFor() on the host is just a kernel launch that returns immediately; host-side time() calls therefore do not measure when kernels 1-4 actually execute on the device.]
Makes it difficult to have accurate timers
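A minimal sketch of the pitfall (xp and N are assumed; the synchronization call is the cudaDeviceSynchronize mentioned above):

    double t0 = amrex::ParallelDescriptor::second();
    amrex::ParallelFor(N, [=] AMREX_GPU_DEVICE (int i) { xp[i] += 1.; });
    // Without this synchronization, t1-t0 only measures the kernel *launch*:
    cudaDeviceSynchronize();
    double t1 = amrex::ParallelDescriptor::second();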
Now, where is the data?
GPU-accelerated supercomputing
MPI_Init(&argc, &argv);
amrex::Initialize(argc,argv);
WarpX warpx;
warpx.InitData();
warpx.Evolve();
amrex::Finalize();
MPI_Finalize();
Data initialized on the host, in unified memory.
Managed/Unified memory: single memory address space!
Host execution | Host memory (256 GB) | Device execution | Device memory (16 GB) |
Allocate data | | | |
Process data | | | |
ParallelFor(data) = kernel launch | | | |
| | I need data! pagefault | |
| | Process data | |
Don’t need data anymore. | | | |
Memory bandwidths: host memory ~170 GB/s, device memory (HBM) ~900 GB/s, host↔device link ~50 GB/s
A detailed example
warpx/FieldSolver/FiniteDifferenceSolver/EvolveB.cpp
amrex/Base/AMReX_Array4.H
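Schematically, the pattern in EvolveB.cpp combines an MFIter loop, Array4 views of the MultiFabs and a ParallelFor over the box. The sketch below is simplified (the coefficients dtsdy, dtsdz and the guard-cell handling are assumptions), not the actual WarpX implementation:

void EvolveBxSketch (amrex::MultiFab& Bx,
                     const amrex::MultiFab& Ey, const amrex::MultiFab& Ez,
                     amrex::Real dtsdy, amrex::Real dtsdz)
{
    for (amrex::MFIter mfi(Bx, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        const amrex::Box& tbx = mfi.tilebox();
        amrex::Array4<amrex::Real> const& Bxa = Bx.array(mfi);
        amrex::Array4<const amrex::Real> const& Eya = Ey.const_array(mfi);
        amrex::Array4<const amrex::Real> const& Eza = Ez.const_array(mfi);
        amrex::ParallelFor(tbx,
            [=] AMREX_GPU_DEVICE (int i, int j, int k)
            {
                // dBx/dt = dEy/dz - dEz/dy (finite differences on the staggered grid)
                Bxa(i,j,k) += dtsdz*(Eya(i,j,k+1) - Eya(i,j,k))
                            - dtsdy*(Eza(i,j+1,k) - Eza(i,j,k));
            }
        );
    }
}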
MFIter, device synchronize and CUDA streams
[Diagram: in the MFIter loop, the kernels of Box 0 are launched into CUDA Stream 0 and those of Box 1 into Stream 1, so kernels 1-3 of the two boxes can overlap on the device.]
CUDA streams allow for executing multiple kernels simultaneously on a GPU!
See AMReX GPU strategy
https://amrex-codes.github.io/amrex/docs_html/GPU.html
// See PhysicalParticleContainer::Evolve
for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti)
{
    FArrayBox filtered_Ez;
    Elixir ezeli;
    const Box& tbox = …;
    filtered_Ez.resize(amrex::convert(tbox, Ez[pti].box().ixType()));
    // The Elixir keeps the temporary filtered_Ez alive until the
    // asynchronous GPU kernels that use it have completed.
    ezeli = filtered_Ez.elixir();
    Filter[lev]->ApplyStencil(filtered_Ez, Ez[pti]);
    // Particles gather from filtered_Ez
}   // ParIter (MFIter) destructor 🡪 cudaDeviceSynchronize()
Parallelization
Grid = box
Run efficiently (on Summit, but also on any other platform!)
For a lot of optimizations, the Tiny Profiler and the standard output give enough information.
For GPU, the standard output reports the memory usage:
Total GPU global memory (MB) spread across MPI: [16128 ... 16128]
🡪 hardware limit (16 GB)
Free GPU global memory (MB) spread across MPI: [4 ... 6]
🡪 GPU memory that is still available
[The Arena] space (MB) allocated spread across MPI: [24367 ... 27302]
🡪 memory allocated by the Arena (the AMReX module that handles memory allocation and management)
PPC::CurrentDeposition 1106 0 4.966 20.61 19.44%
🡪 large spread between Excl. Min and Excl. Max across ranks: load imbalance. Load balancing can be activated in the inputs:
# warpx.load_balance_int = 10
# algo.load_balance_costs_update = timers
# warpx.load_balance_with_sfc = 1