1 of 31

WarpX

Structure of the code, and how to navigate through the source

Maxence Thévenet

LBNL

03/05/2020

Part I: code organization

  • Main classes

  • Main steps in a simulation

  • Zoom on specific capabilities

  • How to be a good citizen-contributor

2 of 31

AMReX basics: block-structured mesh refinement

  • Box: lower and upper indices
  • FArrayBox: array defined on a Box
  • FabArray/MultiFab: collection of FArrayBoxes
  • ParticleContainer: collection of particles per box, with an iterator

for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti){   // iterator over Box/tile
    auto& attribs = pti.GetAttribs();                       // particle attributes (SoA)
    auto& uxp = attribs[PIdx::ux];
    const FArrayBox& exfab = Ex[pti];                       // array associated with the box/tile
}
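For reference, a minimal sketch (not from the WarpX source; sizes and names are illustrative, 3D build assumed) of how these AMReX objects fit together inside an initialized AMReX program:

#include <AMReX_MultiFab.H>   // also pulls in Box, BoxArray, DistributionMapping

amrex::Box domain(amrex::IntVect(0,0,0), amrex::IntVect(63,63,63));  // lower and upper indices
amrex::BoxArray ba(domain);
ba.maxSize(32);                               // split the domain into Boxes of at most 32^3 cells
amrex::DistributionMapping dm(ba);            // assign each Box to an MPI rank
amrex::MultiFab phi(ba, dm, 1, 0);            // one FArrayBox per Box: 1 component, 0 guard cells
phi.setVal(0.);                               // operate on all FArrayBoxes at once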

3 of 31

Main WarpX classes (everything in Source/, .H file with same name as class)

  • WarpX : AmrCore (./): all classes, fields (= MultiFab), full diags (plotfiles), time steps, some BC
  • WarpXParticleContainer : ParticleContainer (Particles/): one species (physical or not), stored per box; based on Particle; field gather (FG), particle push (PP), current deposition (CD)
  • MultiParticleContainer (Particles/): vector of all species, and loops over species
  • FiniteDifferenceSolver (FieldSolver/FiniteDifferenceSolver/): evolve E and B (CKC, Yee; orders?)
  • SpectralSolver (FieldSolver/SpectralSolver/): evolve E and B (PSATD, etc.)
  • ILaserProfile, GaussianLaserProfile, etc. (Laser/): laser profile (Gaussian, from file, parsed, etc.)
  • LaserParticleContainer (Laser/): laser antenna
  • PlasmaInjector (Initialization/): plasma profile
  • WarpXParser (Parser/): general parser (used in many places)
  • ReducedDiags (Diagnostics/ReducedDiags/): reduced diagnostics, well-separated module
  • BackTransformedDiagnostics (Diagnostics/): big machinery, for runs in a boosted frame
  • GuardCellManager (Parallelization/): grid communications; relies on MultiFab::FillBoundary
  • PML (BoundaryConditions/): boundary conditions, well-separated module
  • Filter (Filter/): NCI filter (on E and B), bilinear filter (on J)

“Multi” = “collection of”

BIG AMReX dependency:

  • Domain decomposition (Box, BoxArray, DistributionMapping, Geometry)
  • Structures for fields (MultiFab) and particles (ParticleContainer)
  • Communications (FillBoundary, Redistribute)
  • Portability (ParallelFor)

These classes cover the main modules: particles, field solver, initial conditions, boundary conditions, diagnostics, communications, numerics.

4 of 31

WarpX: Fields

WarpX has a LOT of fields, and all of them are directly members of the class WarpX. The names combine:

  • the field: E, B, J, rho
  • the patch: fine patch, coarse patch, auxiliary, coarse-aux
  • the space: physical space or Fourier space

Example: Efield_fp is the E field on the fine patch.

class WarpX

amrex::Vector<std::array< std::unique_ptr<amrex::MultiFab>, 3 > > Efield_fp;

The outer amrex::Vector indexes the MR levels; the inner std::array<...,3> holds the three components.

We use the amrex::MultiFab member functions for most of our operations:

  • Communication
  • Interpolation
  • Basic operations

Most functions have a general and a per-level version:

void
WarpX::EvolveE (amrex::Real a_dt)
{
    for (int lev = 0; lev <= finest_level; ++lev) {
        EvolveE(lev, a_dt);
    }
}

void
WarpX::EvolveE (int lev, amrex::Real a_dt)
{
    m_fdtd_solver_fp[lev]->EvolveE( Efield_fp[lev], etc. );
}

Why not amrex::Vector<std::unique_ptr<amrex::MultiFab> > Efield_fp; with 3-component MultiFabs?
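Whichever layout is chosen, each per-component MultiFab is used through its member functions. A minimal illustration (not a WarpX excerpt; lev is assumed to be a valid level index):

amrex::MultiFab& Ex = *Efield_fp[lev][0];     // level lev, x component
Ex.setVal(0.);                                // basic operation
Ex.FillBoundary(Geom(lev).periodicity());     // guard-cell communication
// interpolation between levels also goes through AMReX routines (e.g. amrex::average_down)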

5 of 31

WarpX: Particles 1/2 : Particles and ParticleContainers

for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti){
    const auto GetPosition = GetParticlePosition(pti, offset);
    auto& attribs = pti.GetAttribs();
    auto& uxp = attribs[PIdx::ux];
    const long np = pti.numParticles();
    amrex::ParallelFor( np,
        [=] AMREX_GPU_DEVICE (long ip) {
            amrex::ParticleReal xp, yp, zp;
            GetPosition(ip, xp, yp, zp);
            // now I can play with xp, uxp[ip], etc.
        }
    );
}

amrex::Particle

amrex::ParticleContainer

  • WarpXParticleContainer : amrex::ParticleContainer
  • LaserParticleContainer : WarpXParticleContainer
  • PhysicalParticleContainer : WarpXParticleContainer
  • PhotonParticleContainer : PhysicalParticleContainer
  • RigidInjectedParticleContainer : PhysicalParticleContainer

Array-of-Structs (AoS): x1 y1 z1 | x2 y2 z2 | ... | xn yn zn : position, id, CPU

Struct-of-Arrays (SoA): ux1 ux2 ... uxn | uz1 uz2 ... uzn | ... : momentum, fields, etc.

p.pos(0) = x position of the particle

ParticleContainer = species

Particles are stored per-box (or per-tile on CPU)

STANDARD WAY: the WarpXParIter loop with ParallelFor, as above.

NEW WAY:

using PType = typename WarpXParticleContainer::SuperParticleType;
PType p;
p.pos(0);
p.rdata(PIdx::w);

6 of 31

WarpX: Particles 2/2: MultiParticleContainer

class MultiParticleContainer

amrex::Vector<std::unique_ptr<WarpXParticleContainer> > allcontainers;   // multi-species handling (QED, etc.)

void
MultiParticleContainer::FieldGather (int lev,
    const amrex::MultiFab& Ex, const amrex::MultiFab& Ey, const amrex::MultiFab& Ez,
    const amrex::MultiFab& Bx, const amrex::MultiFab& By, const amrex::MultiFab& Bz)
{
    for (auto& pc : allcontainers) {
        pc->FieldGather(lev, Ex, Ey, Ez, Bx, By, Bz);
    }
}

1 ParticleContainer = 1 species.

Class that collects all ParticleContainers ( = all species): MultiParticleContainer

Class WarpX has a SINGLE MultiParticleContainer member variable: mypc

class WarpX

std::unique_ptr<MultiParticleContainer> mypc;

7 of 31

What happens when I run a simulation? Overview

MPI_Init(&argc, &argv);
amrex::Initialize(argc, argv);
WarpX warpx;
warpx.InitData();
warpx.Evolve();
amrex::Finalize();
MPI_Finalize();

OMP_NUM_THREADS=2 mpirun -np 4 ./warpx.ex inputs

amr.n_cell           = 64 64 64
amr.max_grid_size    = 64
amr.blocking_factor  = 32
amr.max_level        = 0
geometry.coord_sys   = 0
geometry.is_periodic = 1 1 1
geometry.prob_lo     = -2. -2. -2.
geometry.prob_hi     =  2.  2.  2.

Everything else is set in the inputs file.
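These parameters are read with amrex::ParmParse. A minimal sketch (variable names are illustrative) of how values from the inputs file are queried:

amrex::ParmParse pp_amr("amr");
int max_level = 0;
pp_amr.query("max_level", max_level);    // optional: keeps the default if the parameter is absent
amrex::Vector<int> n_cell;
pp_amr.getarr("n_cell", n_cell);         // required: aborts if the parameter is missing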

What each step does:

  • MPI_Init / amrex::Initialize: initialize MPI and AMReX (box decomposition, distribution mapping)
  • WarpX warpx: construct WarpX, i.e. create the WarpXParticleContainers and the MultiFabs, but DON'T allocate memory
  • warpx.InitData(): allocate memory, and actually create fields and particles with the specified distributions
  • warpx.Evolve(): run the iterations

8 of 31

What happens when I run a simulation? Initialization sequence

WarpX warpx;

warpx.InitData();

WarpX::WarpX

  • WarpX::ReadParameters
  • MultiParticleContainer::MultiParticleContainer
    • MultiParticleContainer::ReadParameters
    • Loop over all species
      • ParticleContainer::ParticleContainer
        • ParticleContainer::ParticleContainer
        • ParmParse::query()

Example parameters read at this stage:

warpx.do_filter
particles.species_names = electrons
electrons.num_particles_per_cell = 2

After the constructor: null pointers, no particles yet. Advice?

WarpX::InitData

  • WarpX::InitFromScratch
    • AmrCore::InitFromScratch
      • WarpX::MakeNewLevelFromScratch
        • WarpX::AllocLevelData
          • WarpX::AllocLevelMF
        • WarpX::InitLevelData
          • MultiFab::setVal
    • MultiParticleContainer::AllocData
    • MultiParticleContainer::InitData
      • PhysicalParticleContainer::AddParticles(0)
        • PhysicalParticleContainer::AddPlasma
      • ParticleContainer::Redistribute
    • WarpX::ComputeSpaceChargeField
    • WarpX::InitPML
      • PML::PML
  • WarpX::InitFilter
    • Filter::Filter
  • WarpX::InitDiags (BTD)

WarpX::AllocLevelMF allocates all the field MultiFabs; PhysicalParticleContainer::AddPlasma injects particles with the requested plasma profile.
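A hedged sketch of what the AllocLevelMF step amounts to for one field component (ba, dm, ncomps and ngE are placeholders for the BoxArray, DistributionMapping, number of components and guard cells of that level; Ex_nodal_flag is assumed to hold the Ex staggering):

Efield_fp[lev][0] = std::make_unique<amrex::MultiFab>(
    amrex::convert(ba, Ex_nodal_flag),    // BoxArray with the Ex staggering
    dm, ncomps, ngE);                     // guard cells as requested by the GuardCellManager
Efield_fp[lev][0]->setVal(0.);            // WarpX::InitLevelData then fills the fields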

9 of 31

PIC loop

What happens when I run a simulation? Time iteration sequence

warpx.Evolve();

WarpX::EvolveEM (Evolve): loop over iterations

  • MultiParticleContainer::PushP: if needed, half push back for the particle momenta
  • UpdateAuxilaryData()
  • WarpX::OneStep_nosub
  • BackTransformedDiagnostics::writeLabFrameData
  • MoveWindow
  • MultiParticleContainer::Redistribute
  • MultiParticleContainer::SortParticlesByBin
  • ReducedDiags::ComputeDiags
  • WarpX::WritePlotFile

WarpX::OneStep_nosub

    mypc->doFieldIonization();
    mypc->doCoulombCollisions();
    PushParticlesandDepose();
    SyncCurrent();
    EvolveB(0.5*dt[0]);
    FillBoundaryB();
    EvolveE(dt[0]);
    FillBoundaryE();
    EvolveB(0.5*dt[0]);
    if (do_pml) {
        DampPML();
        FillBoundaryE();
        FillBoundaryB();
    }

WarpX::PushParticlesandDepose: loop over ParticleContainers and call WarpXParticleContainer::Evolve, which loops over boxes/tiles:

for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti)

  • applyNCIFilter
  • FieldGather
  • PushPX
  • DepositCurrent

10 of 31

Topics treated below:

  • Portability
  • PIC elementary routines
  • Communications
  • Diagnostics
  • MR
  • CI
  • PR
  • Profiling

Could be added:

  • QED module
  • Build system
  • Staggering

11 of 31

Portability: AMReX::ParallelFor

Loop over particles, CPU:

for (int i = 0; i < N; i++) {
    xp[i] += 1.;
}

Loop over particles, GPU, CUDA (NVIDIA):

__global__ void kernel (double* xp, int N) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) xp[i] += 1.;
}
kernel<<<(N+255)/256, 256>>>(xp, N);

We don't want to write everything twice. AMReX provides an abstraction layer (portability layer); its main function is ParallelFor:

amrex::ParallelFor( N,
    [=] AMREX_GPU_DEVICE (int i) {
        xp[i] += 1.;
    }
);

We write the code only once, and it is compiled as a for loop or as a kernel launch, depending on USE_GPU.

AMReX provides different flavors of ParallelFor, for loops over particles, fields, etc.:

amrex::ParallelFor( tex,
    [=] AMREX_GPU_DEVICE (int i, int j, int k){
        Ex(i,j,k) += 1.;
    }
);

amrex::ParallelFor( tex, tey, tez,
    [=] AMREX_GPU_DEVICE (int i, int j, int k){ Ex(i,j,k) += 1.; },
    [=] AMREX_GPU_DEVICE (int i, int j, int k){ Ey(i,j,k) += 1.; },
    [=] AMREX_GPU_DEVICE (int i, int j, int k){ Ez(i,j,k) += 1.; }
);

Not everything can be used inside a ParallelFor!

  • std::vector is NOT going to work well on GPU
  • If you use a member variable of a class C inside a ParallelFor, the whole class gets copied to the GPU. WE DON'T DO THAT (except for small classes that we know are GPU-friendly)

🡪 Make a local copy of just the variable, and use this copy (see the sketch below)
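For example, a minimal sketch (the member m_dt and the arrays xp and uxp are made-up names) of copying a member variable into a local before the capture:

// Capturing m_dt directly would capture `this` and copy the whole class to the GPU
const amrex::Real dt_copy = m_dt;
amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (long ip) {
    xp[ip] += dt_copy * uxp[ip];   // only dt_copy and the raw pointers are captured
});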

12 of 31

PIC elementary functions

  • Particles/Gather/FieldGather.H: void doGatherShapeN, a free-standing function, called in PhysicalParticleContainer::Evolve
  • Particles/Pusher/UpdateMomentumBoris.H: void PhysicalParticleContainer::PushPX, a member of PhysicalParticleContainer, called in PhysicalParticleContainer::Evolve
  • Particles/Deposition/CurrentDeposition.H: void doDepositionShapeN, a free-standing function, called in PhysicalParticleContainer::Evolve
  • FieldSolver/FiniteDifferenceSolver/EvolveE.cpp: void FiniteDifferenceSolver::EvolveE, a member of FiniteDifferenceSolver, called in WarpX::OneStep_nosub
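As an illustration of what one of these elementary routines computes, here is a self-contained sketch of a Boris momentum update (the standard algorithm, not the exact signature or implementation of UpdateMomentumBoris.H):

#include <cmath>

// u = gamma*v (normalized momentum): half electric kick, magnetic rotation, half electric kick
void BorisPushSketch (double& ux, double& uy, double& uz,
                      double Ex, double Ey, double Ez,
                      double Bx, double By, double Bz,
                      double q, double m, double dt)
{
    constexpr double c = 299792458.;
    const double econst = 0.5*q*dt/m;
    // First half acceleration by E
    ux += econst*Ex;  uy += econst*Ey;  uz += econst*Ez;
    // Rotation around B
    const double inv_gamma = 1./std::sqrt(1. + (ux*ux + uy*uy + uz*uz)/(c*c));
    const double tx = econst*Bx*inv_gamma;
    const double ty = econst*By*inv_gamma;
    const double tz = econst*Bz*inv_gamma;
    const double s  = 2./(1. + tx*tx + ty*ty + tz*tz);
    const double upx = ux + uy*tz - uz*ty;
    const double upy = uy + uz*tx - ux*tz;
    const double upz = uz + ux*ty - uy*tx;
    ux += s*(upy*tz - upz*ty);
    uy += s*(upz*tx - upx*tz);
    uz += s*(upx*ty - upy*tx);
    // Second half acceleration by E
    ux += econst*Ex;  uy += econst*Ey;  uz += econst*Ez;
}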

13 of 31

Communications

A PIC code should only require local communications between neighbor boxes. Comm patterns can be tricky with MR, though 🡪 AMReX handles ALL of it.

Field halo exchanges: FillBoundary. Called many times per iteration. In WarpX, we try to exchange as little data as possible: GuardCellManager keeps track of the number of guard cells we need to exchange at each part of the PIC loop. This could be improved (in particular for the PMLs).

Particle halo exchanges: Redistribute. Called once per iteration.
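Schematically (a hedged sketch; lev and the field names are assumed), the two calls look like:

// Field halo exchange: copy valid data into the guard cells of neighboring boxes
Efield_fp[lev][0]->FillBoundary(Geom(lev).periodicity());
// Particle halo exchange: move particles that left their box to the box (and rank) that owns them
mypc->Redistribute();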

PIC loop

WarpX::OneStep_nosub

    mypc->doFieldIonization();
    mypc->doCoulombCollisions();
    PushParticlesandDepose();
    SyncCurrent();
    EvolveB(0.5*dt[0]);
    FillBoundaryB();
    EvolveE(dt[0]);
    FillBoundaryE();
    EvolveB(0.5*dt[0]);
    if (do_pml) {
        DampPML();
        FillBoundaryE();
        FillBoundaryB();
    }

14 of 31

Diagnostics

Full diagnostics (plotfiles): Part of class WarpX

Mostly in Diagnostics/

WarpX::EvolveEM

  • WarpX::WritePlotFile
    • WarpX::prepareFields
      • WarpX::AverageAndPackFields
      • WarpX::coarsenCellCenteredFields
    • amrex::WriteMultiLevelPlotfile (or openPMD!)
    • WarpX::WriteRawField
    • MultiParticleContainer::WritePlotFile
    • WarpX::WriteJobInfo
    • WarpX::WriteWarpXHeader

Slice diags: ~parallel implementation

Back-transformed diagnostics (boosted-frame)

Mostly in Diagnostics/

Class BackTransformedDiagnostic

std::vector<std::unique_ptr<LabFrameDiag> > m_LabFrameDiags_;

Class LabFrameDiag

std::unique_ptr<amrex::MultiFab> m_data_buffer_;

WarpX::EvolveEM

  • GetCellCenteredData
    • AverageAndPackVectorField
    • AverageAndPackScalarField
  • LabFrameDiags::writeLabFrameData
    • LabFrameDiags::LorentzTransformZ
    • … some more stuff with buffers

Reduced diagnostics (~1 number per diag iteration) ☺☺

ALL in Diagnostics/ReducedDiags/

class WarpX

MultiReducedDiags* reduced_diags;

class MultiReducedDiags

std::vector<std::unique_ptr<ReducedDiags>> m_multi_rd;

class ReducedDiags

base class for these reduced diagnostics

WarpX::EvolveEM

🡪 reduced_diags->ComputeDiags(step);

🡪 reduced_diags->WriteToFile(step);

Then, the instance of ReducedDiags gets a pointer to WarpX to access all the data (unfortunately in a non-const fashion)
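A hedged sketch of how a reduced diagnostic plugs into this machinery (the class name and the accessor used to fill m_data are hypothetical; ReducedDiags, ComputeDiags and m_data come from the base class):

class MyMaxField : public ReducedDiags
{
public:
    MyMaxField (std::string rd_name) : ReducedDiags(rd_name)
    {
        m_data.resize(1, 0.);                        // one output column
    }

    // Called by MultiReducedDiags::ComputeDiags at the requested steps
    void ComputeDiags (int step) final
    {
        auto& warpx = WarpX::GetInstance();          // non-const access to the simulation data
        // fill m_data; the base class then writes it to file
        m_data[0] = warpx.getEfield(0, 0).norm0();   // assumed accessor: max(|Ex|) on level 0
    }
};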

1 LabFrameDiag = 1 snapshot in the lab frame (t = t1, z in [zmin, zmax]). It corresponds to different (t', z') in the boosted frame, so at each iteration we:

  • take the (z', t') slice that belongs to a given LabFrameDiag, and cell-center it
  • Lorentz-transform it (to the lab frame)
  • stack it in a buffer

Diags should be reorganized soon(ish)!

15 of 31

Mesh Refinement

PIC loop without MR:

  • PushParticlesandDepose(): FieldGather(), PushPX(), DepositCurrent()
  • SumBoundaryJ()
  • EvolveB(0.5*dt[0]); FillBoundaryB()
  • EvolveE(dt[0]); FillBoundaryE()
  • EvolveB(0.5*dt[0]); FillBoundaryB()

PIC loop with MR:

  • UpdateAuxilaryData()
  • PushParticlesandDepose(): FieldGather(), PushPX(), DepositCurrent()
  • SyncCurrent()
  • EvolveB(0.5*dt[0]); FillBoundaryB()
  • EvolveE(dt[0]); FillBoundaryE()
  • EvolveB(0.5*dt[0]); FillBoundaryB()

Substitution for the auxiliary field (a = auxiliary, f = fine patch, c = coarse patch, I = interpolation, with MR levels L_n and L_{n+1}):

F_{n+1}(a) = I[ F_n(a) - F_{n+1}(c) ] + F_{n+1}(f)

The fine patch is surrounded by an absorbing layer (PML).

For J: SyncCurrent()

  • interpCurrentFineToCoarse()
  • AddCurrentFromFineLevelandSumBoundary()
  • ApplyFilterandSumBoundaryJ()
  • MultiFab::ParallelAdd()

For E and B: UpdateAuxilaryData()

  • UpdateAuxilaryDataStagToNodal
  • MultiFab::setVal
  • MultiFab::ParallelCopy
  • MultiFab::Subtract

16 of 31

Continuous Integration

WarpX/Regression/

Nightly builds on CRD clusters Battra (CPU) and Garuda (GPU)

  • Every night
  • See https://github.com/ECP-WarpX/regression_testing
  • Compare with ref to machine precision
  • Published at https://ccse.lbl.gov/pub/RegressionTesting/WarpX/

🡪 Catch everything (and more)

WarpX-tests.ini

WarpX/Examples/

  • Input files
  • Analysis script

Travis CI tests on GitHub

  • Run every time you push to a branch with an open PR
  • GitHub tells you when they fail
  • Jobs are submitted in batches (see .travis.yml)
  • Only test compilation, run and analysis!!

🡪 Only catch what we ask for

prepare_file_travis.py 🡪 reformats WarpX-tests.ini for the Travis runs

There are also performance tests, but that is another story.

17 of 31

Profiling in WarpX

void PhysicalParticleContainer::Evolve () {
    WARPX_PROFILE("PPC::Evolve()");
    WARPX_PROFILE_VAR_NS("PPC::FieldGather", blp_fg);

    WARPX_PROFILE_VAR_START(blp_fg);
    FieldGather();
    WARPX_PROFILE_VAR_STOP(blp_fg);
}

-------------------------------------------------------------------------------------------

Name                                        NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %

-------------------------------------------------------------------------------------------

PPC::FieldGather                               800      29.06      29.23      29.32  68.63%

PPC::Evolve()                                  100      8.175      8.264      8.329  19.50%

FabArray::ParallelCopy()                       600       1.11      1.239      1.434   3.36%

FillBoundary_finish()                         1200     0.6782     0.8679       1.03   2.41%

Redistribute_partition                         102     0.8006     0.8529     0.8762   2.05%

FillBoundary_nowait()                         1200     0.4906     0.6733     0.8346   1.95%

AMReX provides the Tiny Profiler (BL_PROFILE); in WarpX we have a wrapper: WARPX_PROFILE. I think current deposition is not instrumented (see the sketch below).
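If one wanted to instrument it, a sketch following the same pattern (the timer label is illustrative, and the call site is left as a placeholder):

WARPX_PROFILE_VAR_NS("PPC::CurrentDeposition", blp_deposit);   // declare the timer, not started
WARPX_PROFILE_VAR_START(blp_deposit);
DepositCurrent(/* same arguments as the existing call */);
WARPX_PROFILE_VAR_STOP(blp_deposit);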

18 of 31

Why bother with PRs?

19 of 31

Write a PR

Write good code

  • To the point
  • At the right place
  • Modular
  • Straightforward
  • Comments
  • No duplication

Prepare a PR

  • Communicate!
  • One topic
  • To the point
  • Look at what’s already there

Test your code

  • 2D local is not enough
  • 2D/3D/RZ, CPU and GPU, serial and parallel, threaded and non-threaded

Clean your PR: remove

  • Print statements
  • Commented-out code
  • Useless functions/variables
  • Etc.

Comment your code

  • To the point
  • What’s needed and not more

Write a PR description

  • To the point
  • What’s needed and not more
  • Labels

Ask for review

  • Check the code
  • Focus on what’s tricky

20 of 31

Part II: Performance and GPUs

  • GPU computing

  • Parallelization

21 of 31

GPU-accelerated supercomputing

CPU supercomputer: Cori @ NERSC (#13 in the Top500)

GPU-accelerated supercomputer: Summit @ OLCF (#1 in the Top500)

Both are programmed with MPI + X.

All pre-exascale and planned exascale supercomputers are accelerated.

1 Cori node: 1 Intel KNL CPU, ~3 TFLOPS, 112 GB (96 GB DDR4 + 16 GB HBM)

1 Summit node: 2 IBM Power9 CPUs (~0.8 TFLOPS, 256 GB DDR4 each) and 6 NVIDIA V100 GPUs (~7.8 TFLOPS, 16 GB HBM each), i.e. ~42 TFLOPS and 608 GB per node

22 of 31

GPU-accelerated supercomputing

23 of 31

[Diagram: each ParallelFor() on the host launches a kernel (Kernel 1-4) and returns immediately; the device executes the kernels while the host keeps computing.]

Asynchronous execution: you can explicitly synchronize with cudaDeviceSynchronize().

24 of 31

[Diagram: the same timeline, now with time() calls on the host between the kernel launches; the host-side timers return before the kernels have actually finished.]

Makes it difficult to have accurate timers
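A small sketch (xp and N are assumed device data and size) of why a synchronize is needed around host-side timers:

auto t0 = amrex::second();
amrex::ParallelFor(N, [=] AMREX_GPU_DEVICE (int i) { xp[i] += 1.; });
amrex::Gpu::synchronize();   // without this, t1-t0 only measures the kernel launch, not the kernel itself
auto t1 = amrex::second();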

Now, where is the data?


25 of 31

GPU-accelerated supercomputing

MPI_Init(&argc, &argv);
amrex::Initialize(argc, argv);
WarpX warpx;
warpx.InitData();
warpx.Evolve();
amrex::Finalize();
MPI_Finalize();

Data is initialized on the host, in unified memory.

Managed/unified memory: a single memory address space!

[Diagram: the host (host execution, 256 GB host memory) allocates and first processes the data; the first ParallelFor(data), i.e. a kernel launch on the device (device execution, 16 GB device memory), needs the data, page-faults, and migrates it to device memory, where the following kernels keep processing it until the host needs it again.]

  • No explicit copying to GPU memory
  • Pass raw pointers
  • Keep data on GPU as much as possible
  • Back to CPU if:
    • Out of memory
    • Communication/LB
    • Diagnostics
    • Any data access from the host…

Indicative bandwidths: 170 GB/s (host memory), 900 GB/s (device memory), 50 GB/s (host-device link).
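A minimal sketch (assuming a GPU build where amrex::The_Arena() hands out managed memory; N is an illustrative size):

const int N = 1024;
auto* p = static_cast<amrex::Real*>(amrex::The_Arena()->alloc(N*sizeof(amrex::Real)));
amrex::ParallelFor(N, [=] AMREX_GPU_DEVICE (int i) { p[i] = 0.; });  // first touch on the device: pages migrate there
// keep working on p in device kernels; reading p[i] from the host would page-fault it back
amrex::Gpu::synchronize();
amrex::The_Arena()->free(p);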

26 of 31

A detailed example

  • What is Array4?
  • Why #ifdef _OPENMP?
  • Why TilingIfNotGPU?
  • How many kernel launches?
  • What exactly is copied to the GPU memory (if anything)?

warpx/FieldSolver/FiniteDifferenceSolver/EvolveB.cpp

amrex/Base/AMReX_Array4.H
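To help answer these questions, here is a hedged, simplified sketch of the pattern used there (the field name and the update are placeholders, not the actual EvolveB stencil):

#ifdef _OPENMP
#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())   // OpenMP threads over tiles, on CPU only
#endif
for (amrex::MFIter mfi(*Bfield[0], amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const amrex::Box& tbx = mfi.tilebox();                         // tile on CPU, full box on GPU
    amrex::Array4<amrex::Real> const& Bx = Bfield[0]->array(mfi);  // lightweight view, cheap to copy to the GPU
    amrex::ParallelFor(tbx,                                        // one kernel launch per box and per ParallelFor
        [=] AMREX_GPU_DEVICE (int i, int j, int k)
        {
            Bx(i,j,k) += 1.;                                       // placeholder for the curl(E) update
        });
}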

27 of 31

MFIter, device synchronize and CUDA streams

[Diagram: the MFIter iteration over Box 0 launches its Kernels 1-3 on CUDA stream 0, and the iteration over Box 1 launches its Kernels 1-3 on stream 1, so kernels from different boxes can execute concurrently while the host keeps launching.]

CUDA streams allow for executing multiple kernels simultaneously on a GPU!

28 of 31

See AMReX GPU strategy

https://amrex-codes.github.io/amrex/docs_html/GPU.html

29 of 31

See AMReX GPU strategy

https://amrex-codes.github.io/amrex/docs_html/GPU.html

// See PhysicalParticleContainer::Evolve
for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti){
    FArrayBox filtered_Ez;
    Elixir ezeli;
    const Box& tbox = …;
    filtered_Ez.resize(amrex::convert(tbox, Ez.box().ixType()));
    ezeli = filtered_Ez.Elixir();
    Filter[lev]->ApplyStencil(filtered_Ez, Ez[pti]);
    // Particles gather from filtered_Ez
}

At the end, the ParIter (or MFIter) destructor is called, which triggers a cudaDeviceSynchronize().

30 of 31

Parallelization

Grid = box

  • amr.max_grid_size is the maximum number of points per grid along each direction (default: amr.max_grid_size = 32 in 3D).
  • amr.blocking_factor: the size of each grid must be divisible by the blocking_factor along all dimensions (default: amr.blocking_factor = 8).
  • max_grid_size also has to be divisible by blocking_factor.
  • The actual box decomposition depends on the number of MPI ranks requested (see the example below).
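For example, a hypothetical decomposition (not from an actual input deck): with the settings below, the 256^3 domain is split into 64 boxes of 64^3 cells, which can then be spread over the MPI ranks.

amr.n_cell          = 256 256 256
amr.max_grid_size   = 64
amr.blocking_factor = 32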

For GPU:

  • Use CUDA-aware MPI: amrex.use_gpu_aware_mpi = 1
  • We currently synchronize at every WARPX_PROFILE call

For a lot of optimizations, the Tiny Profiler and the standard output give enough information:

  • Is the problem well decomposed?
  • Is it load-balanced (LB)?
  • Does it fit in GPU memory?
  • Is the problem big enough per GPU?
  • Is it dominated by communications?

31 of 31

Run efficiently (on Summit, but also on any platform!)

  • The standard output tells you the total number of GPUs used (so far we use 1 MPI rank per GPU) and the number of boxes (called "grids") before the first iteration. Use it to make sure that you are using a few (1-10) boxes per GPU; this is how AMReX is meant to operate.
  • A simulation executed efficiently should take < 1 s per iteration on Summit, unless you are using the PSATD solver.
  • If a run is very slow, the most likely cause is that your simulation is running out of GPU memory (see page faults), generating CPU-GPU transfers at each iteration. This is extremely inefficient, and WarpX is not meant to run this way. It usually means that you need to run on a larger number of nodes.
  • If the simulation completes successfully, you can verify the memory utilization per GPU in the lines printed after the Tiny Profiler timers:

Total GPU global memory (MB) spread across MPI: [16128 ... 16128]
🡪 hardware limit (16 GB)

Free GPU global memory (MB) spread across MPI: [4 ... 6]
🡪 GPU memory that is still available

[The Arena] space (MB) allocated spread across MPI: [24367 ... 27302]
🡪 memory allocated by the Arena (the AMReX module that handles memory allocation and management)

  • As soon as "Free GPU global memory" is below ~1000 (it will never be exactly 0), you may safely assume that you are overflowing the GPU memory. This is confirmed by the third line, which shows that the memory allocated is > 16 GB, so again this run should use more nodes.
  • AMReX has an option to abort if the run exceeds GPU memory: amrex.abort_on_out_of_gpu_memory. I like it a lot.
  • If a run has load imbalance, this can be seen in the Tiny Profiler output, for example in the current deposition time:

PPC::CurrentDeposition                        1106          0      4.966      20.61  19.44%

  • meaning that one GPU spends 0 s in deposition while another spends over 20 s. In this case, note that load imbalance pollutes the timers of the communication routines, so do not trust them too much. To fix this, you may use the AMReX load balancer with some of the options below (see the documentation for all options, and ask Michael for more details). Feel free to try these on test2 (we didn't have time to cover it today).

# warpx.load_balance_int = 10
# algo.load_balance_costs_update = timers
# warpx.load_balance_with_sfc = 1