1 of 31

WarpX

Structure of the code, and how to navigate through the source

Maxence Thévenet

LBNL

03/05/2020

Part I: code organization

  • Main classes

  • Main steps in a simulation

  • Zoom on specific capabilities

  • How to be a good citizen-contributor

2 of 31

AMReX basics: block-structured mesh refinement

  • Box: lower and upper indices
  • FArrayBox: array defined on a Box
  • FabArray/MultiFab: collection of FArrayBoxes
  • ParticleContainer: collection of particles per box, with an iterator

for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti){   // iterator over Box/tile
    auto& attribs = pti.GetAttribs();                       // particle attributes (SoA)
    auto& uxp = attribs[PIdx::ux];
    const FArrayBox& exfab = Ex[pti];                       // array associated with the box/tile
}
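For reference, a minimal sketch (not from the WarpX source; sizes and names are illustrative, 3D build assumed) of how these AMReX objects fit together inside an initialized AMReX program:

#include <AMReX_MultiFab.H>   // also pulls in Box, BoxArray, DistributionMapping

amrex::Box domain(amrex::IntVect(0,0,0), amrex::IntVect(63,63,63));  // lower and upper indices
amrex::BoxArray ba(domain);
ba.maxSize(32);                               // split the domain into Boxes of at most 32^3 cells
amrex::DistributionMapping dm(ba);            // assign each Box to an MPI rank
amrex::MultiFab phi(ba, dm, 1, 0);            // one FArrayBox per Box: 1 component, 0 guard cells
phi.setVal(0.);                               // operate on all FArrayBoxes at once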

3 of 31

Main WarpX classes (everything in Source/, .H file with same name as class)

  • WarpX : AmrCore (./): all classes, fields (= MultiFab), full diags (plotfiles), time steps, some BC
  • WarpXParticleContainer : ParticleContainer (Particles/): one species (physical or not), stored per box; based on Particle; field gather (FG), particle push (PP), current deposition (CD)
  • MultiParticleContainer (Particles/): vector of all species, and loops over species
  • FiniteDifferenceSolver (FieldSolver/FiniteDifferenceSolver/): evolve E and B (CKC, Yee; orders?)
  • SpectralSolver (FieldSolver/SpectralSolver/): evolve E and B (PSATD, etc.)
  • ILaserProfile, GaussianLaserProfile, etc. (Laser/): laser profile (Gaussian, from file, parsed, etc.)
  • LaserParticleContainer (Laser/): laser antenna
  • PlasmaInjector (Initialization/): plasma profile
  • WarpXParser (Parser/): general parser (used in many places)
  • ReducedDiags (Diagnostics/ReducedDiags/): reduced diagnostics, well-separated module
  • BackTransformedDiagnostics (Diagnostics/): big machinery, for runs in a boosted frame
  • GuardCellManager (Parallelization/): grid communications; relies on MultiFab::FillBoundary
  • PML (BoundaryConditions/): boundary conditions, well-separated module
  • Filter (Filter/): NCI filter (on E and B), bilinear filter (on J)

“Multi” = “collection of”

BIG AMReX dependency:

  • Domain decomposition (Box, BoxArray, DistributionMapping, Geometry)
  • Structures for fields (MultiFab) and particles (ParticleContainer)
  • Communications (FillBoundary, Redistribute)
  • Portability (ParallelFor)

These classes cover the main modules: particles, field solver, initial conditions, boundary conditions, diagnostics, communications, numerics.

4 of 31

WarpX: Fields

WarpX has a LOT of fields, and all of them are directly members of the class WarpX. The names combine:

  • the field: E, B, J, rho
  • the patch: fine patch, coarse patch, auxiliary, coarse-aux
  • the space: physical space or Fourier space

Example: Efield_fp is the E field on the fine patch.

class WarpX

amrex::Vector<std::array< std::unique_ptr<amrex::MultiFab>, 3 > > Efield_fp;

The outer amrex::Vector indexes the MR levels; the inner std::array<...,3> holds the three components.

We use the amrex::MultiFab member functions for most of our operations:

  • Communication
  • Interpolation
  • Basic operations

Most functions have a general and a per-level version:

void
WarpX::EvolveE (amrex::Real a_dt)
{
    for (int lev = 0; lev <= finest_level; ++lev) {
        EvolveE(lev, a_dt);
    }
}

void
WarpX::EvolveE (int lev, amrex::Real a_dt)
{
    m_fdtd_solver_fp[lev]->EvolveE( Efield_fp[lev], etc. );
}

Why not amrex::Vector<std::unique_ptr<amrex::MultiFab> > Efield_fp; with 3-component MultiFabs?
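Whichever layout is chosen, each per-component MultiFab is used through its member functions. A minimal illustration (not a WarpX excerpt; lev is assumed to be a valid level index):

amrex::MultiFab& Ex = *Efield_fp[lev][0];     // level lev, x component
Ex.setVal(0.);                                // basic operation
Ex.FillBoundary(Geom(lev).periodicity());     // guard-cell communication
// interpolation between levels also goes through AMReX routines (e.g. amrex::average_down)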

5 of 31

WarpX: Particles 1/2 : Particles and ParticleContainers

for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti){
    const auto GetPosition = GetParticlePosition(pti, offset);
    auto& attribs = pti.GetAttribs();
    auto& uxp = attribs[PIdx::ux];
    const long np = pti.numParticles();
    amrex::ParallelFor( np,
        [=] AMREX_GPU_DEVICE (long ip) {
            amrex::ParticleReal xp, yp, zp;
            GetPosition(ip, xp, yp, zp);
            // now I can play with xp, uxp[ip], etc.
        }
    );
}

amrex::Particle

amrex::ParticleContainer

  • WarpXParticleContainer : amrex::ParticleContainer
  • LaserParticleContainer : WarpXParticleContainer
  • PhysicalParticleContainer : WarpXParticleContainer
  • PhotonParticleContainer : PhysicalParticleContainer
  • RigidInjectedParticleContainer : PhysicalParticleContainer

Array-of-Structs (AoS): x1 y1 z1 | x2 y2 z2 | ... | xn yn zn : position, id, CPU

Struct-of-Arrays (SoA): ux1 ux2 ... uxn | uz1 uz2 ... uzn | ... : momentum, fields, etc.

p.pos(0) = x position of the particle

ParticleContainer = species

Particles are stored per-box (or per-tile on CPU)

STANDARD WAY: the WarpXParIter loop with ParallelFor, as above.

NEW WAY:

using PType = typename WarpXParticleContainer::SuperParticleType;
PType p;
p.pos(0);
p.rdata(PIdx::w);

6 of 31

WarpX: Particles 2/2: MultiParticleContainer

class MultiParticleContainer

amrex::Vector<std::unique_ptr<WarpXParticleContainer> > allcontainers;   // multi-species handling (QED, etc.)

void
MultiParticleContainer::FieldGather (int lev,
    const amrex::MultiFab& Ex, const amrex::MultiFab& Ey, const amrex::MultiFab& Ez,
    const amrex::MultiFab& Bx, const amrex::MultiFab& By, const amrex::MultiFab& Bz)
{
    for (auto& pc : allcontainers) {
        pc->FieldGather(lev, Ex, Ey, Ez, Bx, By, Bz);
    }
}

1 ParticleContainer = 1 species.

Class that collects all ParticleContainers ( = all species): MultiParticleContainer

Class WarpX has a SINGLE MultiParticleContainer member variable: mypc

class WarpX

std::unique_ptr<MultiParticleContainer> mypc;

7 of 31

What happens when I run a simulation? Overview

MPI_Init(&argc, &argv);
amrex::Initialize(argc, argv);
WarpX warpx;
warpx.InitData();
warpx.Evolve();
amrex::Finalize();
MPI_Finalize();

OMP_NUM_THREADS=2 mpirun -np 4 ./warpx.ex inputs

amr.n_cell           = 64 64 64
amr.max_grid_size    = 64
amr.blocking_factor  = 32
amr.max_level        = 0
geometry.coord_sys   = 0
geometry.is_periodic = 1 1 1
geometry.prob_lo     = -2. -2. -2.
geometry.prob_hi     =  2.  2.  2.

Everything else is set in the inputs file.
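These parameters are read with amrex::ParmParse. A minimal sketch (variable names are illustrative) of how values from the inputs file are queried:

amrex::ParmParse pp_amr("amr");
int max_level = 0;
pp_amr.query("max_level", max_level);    // optional: keeps the default if the parameter is absent
amrex::Vector<int> n_cell;
pp_amr.getarr("n_cell", n_cell);         // required: aborts if the parameter is missing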

What each step does:

  • MPI_Init / amrex::Initialize: initialize MPI and AMReX (box decomposition, distribution mapping)
  • WarpX warpx: construct WarpX, i.e. create the WarpXParticleContainers and the MultiFabs, but DON'T allocate memory
  • warpx.InitData(): allocate memory, and actually create fields and particles with the specified distributions
  • warpx.Evolve(): run the iterations

8 of 31

What happens when I run a simulation? Initialization sequence

WarpX warpx;

warpx.InitData();

WarpX::WarpX

  • WarpX::ReadParameters
  • MultiParticleContainer::MultiParticleContainer
    • MultiParticleContainer::ReadParameters
    • Loop over all species
      • ParticleContainer::ParticleContainer
        • ParticleContainer::ParticleContainer
        • ParmParse::query()

Example parameters read at this stage:

warpx.do_filter
particles.species_names = electrons
electrons.num_particles_per_cell = 2

After the constructor: null pointers, no particles yet. Advice?

WarpX::InitData

  • WarpX::InitFromScratch
    • AmrCore::InitFromScratch
      • WarpX::MakeNewLevelFromScratch
        • WarpX::AllocLevelData
          • WarpX::AllocLevelMF
        • WarpX::InitLevelData
          • MultiFab::setVal
    • MultiParticleContainer::AllocData
    • MultiParticleContainer::InitData
      • PhysicalParticleContainer::AddParticles(0)
        • PhysicalParticleContainer::AddPlasma
      • ParticleContainer::Redistribute
    • WarpX::ComputeSpaceChargeField
    • WarpX::InitPML
      • PML::PML
  • WarpX::InitFilter
    • Filter::Filter
  • WarpX::InitDiags (BTD)

WarpX::AllocLevelMF allocates all the field MultiFabs; PhysicalParticleContainer::AddPlasma injects particles with the requested plasma profile.
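A hedged sketch of what the AllocLevelMF step amounts to for one field component (ba, dm, ncomps and ngE are placeholders for the BoxArray, DistributionMapping, number of components and guard cells of that level; Ex_nodal_flag is assumed to hold the Ex staggering):

Efield_fp[lev][0] = std::make_unique<amrex::MultiFab>(
    amrex::convert(ba, Ex_nodal_flag),    // BoxArray with the Ex staggering
    dm, ncomps, ngE);                     // guard cells as requested by the GuardCellManager
Efield_fp[lev][0]->setVal(0.);            // WarpX::InitLevelData then fills the fields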

9 of 31

PIC loop

What happens when I run a simulation? Time iteration sequence

warpx.Evolve();

WarpX::EvolveEM (Evolve): loop over iterations

  • MultiParticleContainer::PushP: if needed, half push back for the particle momenta
  • UpdateAuxilaryData()
  • WarpX::OneStep_nosub
  • BackTransformedDiagnostics::writeLabFrameData
  • MoveWindow
  • MultiParticleContainer::Redistribute
  • MultiParticleContainer::SortParticlesByBin
  • ReducedDiags::ComputeDiags
  • WarpX::WritePlotFile

WarpX::OneStep_nosub

    mypc->doFieldIonization();
    mypc->doCoulombCollisions();
    PushParticlesandDepose();
    SyncCurrent();
    EvolveB(0.5*dt[0]);
    FillBoundaryB();
    EvolveE(dt[0]);
    FillBoundaryE();
    EvolveB(0.5*dt[0]);
    if (do_pml) {
        DampPML();
        FillBoundaryE();
        FillBoundaryB();
    }

WarpX::PushParticlesandDepose: loop over ParticleContainers and call WarpXParticleContainer::Evolve, which loops over boxes/tiles:

for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti)

  • applyNCIFilter
  • FieldGather
  • PushPX
  • DepositCurrent

10 of 31

Topics treated below:

  • Portability
  • PIC elementary routines
  • Communications
  • Diagnostics
  • MR
  • CI
  • PR
  • Profiling

Could be added:

  • QED module
  • Build system
  • Staggering

11 of 31

Portability: AMReX::ParallelFor

Loop over particles, CPU:

for (int i = 0; i < N; i++) {
    xp[i] += 1.;
}

Loop over particles, GPU, CUDA (NVIDIA):

__global__ void kernel (double* xp, int N) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < N) xp[i] += 1.;
}
kernel<<<(N+255)/256, 256>>>(xp, N);

We don't want to write everything twice. AMReX provides an abstraction layer (portability layer); its main function is ParallelFor:

amrex::ParallelFor( N,
    [=] AMREX_GPU_DEVICE (int i) {
        xp[i] += 1.;
    }
);

We write the code only once, and it is compiled as a for loop or as a kernel launch, depending on USE_GPU.

AMReX provides different flavors of ParallelFor, for loops over particles, fields, etc.:

amrex::ParallelFor( tex,
    [=] AMREX_GPU_DEVICE (int i, int j, int k){
        Ex(i,j,k) += 1.;
    }
);

amrex::ParallelFor( tex, tey, tez,
    [=] AMREX_GPU_DEVICE (int i, int j, int k){ Ex(i,j,k) += 1.; },
    [=] AMREX_GPU_DEVICE (int i, int j, int k){ Ey(i,j,k) += 1.; },
    [=] AMREX_GPU_DEVICE (int i, int j, int k){ Ez(i,j,k) += 1.; }
);

Not everything can be used inside a ParallelFor!

  • std::vector is NOT going to work well on GPU
  • If you use a member variable of a class C inside a ParallelFor, the whole class gets copied to the GPU. WE DON'T DO THAT (except for small classes that we know are GPU-friendly)

🡪 Make a local copy of just the variable, and use this copy (see the sketch below)
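For example, a minimal sketch (the member m_dt and the arrays xp and uxp are made-up names) of copying a member variable into a local before the capture:

// Capturing m_dt directly would capture `this` and copy the whole class to the GPU
const amrex::Real dt_copy = m_dt;
amrex::ParallelFor(np, [=] AMREX_GPU_DEVICE (long ip) {
    xp[ip] += dt_copy * uxp[ip];   // only dt_copy and the raw pointers are captured
});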

12 of 31

PIC elementary functions

  • Particles/Gather/FieldGather.H: void doGatherShapeN, a free-standing function, called in PhysicalParticleContainer::Evolve
  • Particles/Pusher/UpdateMomentumBoris.H: void PhysicalParticleContainer::PushPX, a member of PhysicalParticleContainer, called in PhysicalParticleContainer::Evolve
  • Particles/Deposition/CurrentDeposition.H: void doDepositionShapeN, a free-standing function, called in PhysicalParticleContainer::Evolve
  • FieldSolver/FiniteDifferenceSolver/EvolveE.cpp: void FiniteDifferenceSolver::EvolveE, a member of FiniteDifferenceSolver, called in WarpX::OneStep_nosub
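As an illustration of what one of these elementary routines computes, here is a self-contained sketch of a Boris momentum update (the standard algorithm, not the exact signature or implementation of UpdateMomentumBoris.H):

#include <cmath>

// u = gamma*v (normalized momentum): half electric kick, magnetic rotation, half electric kick
void BorisPushSketch (double& ux, double& uy, double& uz,
                      double Ex, double Ey, double Ez,
                      double Bx, double By, double Bz,
                      double q, double m, double dt)
{
    constexpr double c = 299792458.;
    const double econst = 0.5*q*dt/m;
    // First half acceleration by E
    ux += econst*Ex;  uy += econst*Ey;  uz += econst*Ez;
    // Rotation around B
    const double inv_gamma = 1./std::sqrt(1. + (ux*ux + uy*uy + uz*uz)/(c*c));
    const double tx = econst*Bx*inv_gamma;
    const double ty = econst*By*inv_gamma;
    const double tz = econst*Bz*inv_gamma;
    const double s  = 2./(1. + tx*tx + ty*ty + tz*tz);
    const double upx = ux + uy*tz - uz*ty;
    const double upy = uy + uz*tx - ux*tz;
    const double upz = uz + ux*ty - uy*tx;
    ux += s*(upy*tz - upz*ty);
    uy += s*(upz*tx - upx*tz);
    uz += s*(upx*ty - upy*tx);
    // Second half acceleration by E
    ux += econst*Ex;  uy += econst*Ey;  uz += econst*Ez;
}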

13 of 31

Communications

A PIC code should only require local communications between neighbor boxes. Comm patterns can be tricky with MR, though 🡪 AMReX handles ALL of it.

Field halo exchanges: FillBoundary. Called many times per iteration. In WarpX, we try to exchange as little data as possible: GuardCellManager keeps track of the number of guard cells we need to exchange at each part of the PIC loop. This could be improved (in particular for the PMLs).

Particle halo exchanges: Redistribute. Called once per iteration.
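Schematically (a hedged sketch; lev and the field names are assumed), the two calls look like:

// Field halo exchange: copy valid data into the guard cells of neighboring boxes
Efield_fp[lev][0]->FillBoundary(Geom(lev).periodicity());
// Particle halo exchange: move particles that left their box to the box (and rank) that owns them
mypc->Redistribute();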

PIC loop

WarpX::OneStep_nosub

    mypc->doFieldIonization();
    mypc->doCoulombCollisions();
    PushParticlesandDepose();
    SyncCurrent();
    EvolveB(0.5*dt[0]);
    FillBoundaryB();
    EvolveE(dt[0]);
    FillBoundaryE();
    EvolveB(0.5*dt[0]);
    if (do_pml) {
        DampPML();
        FillBoundaryE();
        FillBoundaryB();
    }

14 of 31

Diagnostics

Full diagnostics (plotfiles): Part of class WarpX

Mostly in Diagnostics/

WarpX::EvolveEM

  • WarpX::WritePlotFile
    • WarpX::prepareFields
      • WarpX::AverageAndPackFields
      • WarpX::coarsenCellCenteredFields
    • amrex::WriteMultiLevelPlotfile (or openPMD!)
    • WarpX::WriteRawField
    • MultiParticleContainer::WritePlotFile
    • WarpX::WriteJobInfo
    • WarpX::WriteWarpXHeader

Slice diags: ~parallel implementation

Back-transformed diagnostics (boosted-frame)

Mostly in Diagnostics/

Class BackTransformedDiagnostic

std::vector<std::unique_ptr<LabFrameDiag> > m_LabFrameDiags_;

Class LabFrameDiag

std::unique_ptr<amrex::MultiFab> m_data_buffer_;

WarpX::EvolveEM

  • GetCellCenteredData
    • AverageAndPackVectorField
    • AverageAndPackScalarField
  • LabFrameDiags::writeLabFrameData
    • LabFrameDiags::LorentzTransformZ
    • … some more stuff with buffers

Reduced diagnostics (~1 number per diag iteration) ☺☺

ALL in Diagnostics/ReducedDiags/

class WarpX

MultiReducedDiags* reduced_diags;

class MultiReducedDiags

std::vector<std::unique_ptr<ReducedDiags>> m_multi_rd;

class ReducedDiags

base class for these reduced diagnostics

WarpX::EvolveEM

🡪 reduced_diags->ComputeDiags(step);

🡪 reduced_diags->WriteToFile(step);

Then, the instance of ReducedDiags gets a pointer to WarpX to access all the data (unfortunately in a non-const fashion)
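A hedged sketch of how a reduced diagnostic plugs into this machinery (the class name and the accessor used to fill m_data are hypothetical; ReducedDiags, ComputeDiags and m_data come from the base class):

class MyMaxField : public ReducedDiags
{
public:
    MyMaxField (std::string rd_name) : ReducedDiags(rd_name)
    {
        m_data.resize(1, 0.);                        // one output column
    }

    // Called by MultiReducedDiags::ComputeDiags at the requested steps
    void ComputeDiags (int step) final
    {
        auto& warpx = WarpX::GetInstance();          // non-const access to the simulation data
        // fill m_data; the base class then writes it to file
        m_data[0] = warpx.getEfield(0, 0).norm0();   // assumed accessor: max(|Ex|) on level 0
    }
};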

1 LabFrameDiag = 1 snapshot in the lab frame (t = t1, z in [zmin, zmax]). It corresponds to different (t', z') in the boosted frame, so at each iteration we:

  • take the (z', t') slice that belongs to a given LabFrameDiag, and cell-center it
  • Lorentz-transform it (to the lab frame)
  • stack it in a buffer

Diags should be reorganized soon(ish)!

15 of 31

Mesh Refinement

PIC loop without MR:

  • PushParticlesandDepose(): FieldGather(), PushPX(), DepositCurrent()
  • SumBoundaryJ()
  • EvolveB(0.5*dt[0]); FillBoundaryB()
  • EvolveE(dt[0]); FillBoundaryE()
  • EvolveB(0.5*dt[0]); FillBoundaryB()

PIC loop with MR:

  • UpdateAuxilaryData()
  • PushParticlesandDepose(): FieldGather(), PushPX(), DepositCurrent()
  • SyncCurrent()
  • EvolveB(0.5*dt[0]); FillBoundaryB()
  • EvolveE(dt[0]); FillBoundaryE()
  • EvolveB(0.5*dt[0]); FillBoundaryB()

Substitution for the auxiliary field (a = auxiliary, f = fine patch, c = coarse patch, I = interpolation, with MR levels L_n and L_{n+1}):

F_{n+1}(a) = I[ F_n(a) - F_{n+1}(c) ] + F_{n+1}(f)

The fine patch is surrounded by an absorbing layer (PML).

For J: SyncCurrent()

  • interpCurrentFineToCoarse()
  • AddCurrentFromFineLevelandSumBoundary()
  • ApplyFilterandSumBoundaryJ()
  • MultiFab::ParallelAdd()

For E and B: UpdateAuxilaryData()

  • UpdateAuxilaryDataStagToNodal
  • MultiFab::setVal
  • MultiFab::ParallelCopy
  • MultiFab::Subtract

16 of 31

Continuous Integration

WarpX/Regression/

Nightly builds on CRD clusters Battra (CPU) and Garuda (GPU)

  • Every night
  • See https://github.com/ECP-WarpX/regression_testing
  • Compare with ref to machine precision
  • Published at https://ccse.lbl.gov/pub/RegressionTesting/WarpX/

🡪 Catch everything (and more)

WarpX-tests.ini

WarpX/Examples/

  • Input files
  • Analysis script

Travis CI tests on GitHub

  • Run every time you push to a branch with an open PR
  • GitHub tells you when they fail
  • Jobs are submitted in batches (see .travis.yml)
  • Only test compilation, run and analysis!!

🡪 Only catch what we ask for

prepare_file_travis.py 🡪 reformats WarpX-tests.ini for the Travis runs

There are also performance tests, but that is another story.

17 of 31

Profiling in WarpX

void PhysicalParticleContainer::Evolve () {
    WARPX_PROFILE("PPC::Evolve()");
    WARPX_PROFILE_VAR_NS("PPC::FieldGather", blp_fg);

    WARPX_PROFILE_VAR_START(blp_fg);
    FieldGather();
    WARPX_PROFILE_VAR_STOP(blp_fg);
}

-------------------------------------------------------------------------------------------

Name                                        NCalls  Excl. Min  Excl. Avg  Excl. Max   Max %

-------------------------------------------------------------------------------------------

PPC::FieldGather                               800      29.06      29.23      29.32  68.63%

PPC::Evolve()                                  100      8.175      8.264      8.329  19.50%

FabArray::ParallelCopy()                       600       1.11      1.239      1.434   3.36%

FillBoundary_finish()                         1200     0.6782     0.8679       1.03   2.41%

Redistribute_partition                         102     0.8006     0.8529     0.8762   2.05%

FillBoundary_nowait()                         1200     0.4906     0.6733     0.8346   1.95%

AMReX provides the Tiny Profiler (BL_PROFILE); in WarpX we have a wrapper: WARPX_PROFILE. I think current deposition is not instrumented (see the sketch below).
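If one wanted to instrument it, a sketch following the same pattern (the timer label is illustrative, and the call site is left as a placeholder):

WARPX_PROFILE_VAR_NS("PPC::CurrentDeposition", blp_deposit);   // declare the timer, not started
WARPX_PROFILE_VAR_START(blp_deposit);
DepositCurrent(/* same arguments as the existing call */);
WARPX_PROFILE_VAR_STOP(blp_deposit);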

18 of 31

Why bother with PRs?

19 of 31

Write a PR

Write good code

  • To the point
  • At the right place
  • Modular
  • Straightforward
  • Comments
  • No duplication

Prepare a PR

  • Communicate!
  • One topic
  • To the point
  • Look at what’s already there

Test your code

  • 2D local is not enough
  • 2D/3D/RZ, CPU and GPU, serial and parallel, threaded and non-threaded

Clean your PR: remove

  • Print statements
  • Commented-out code
  • Useless functions/variables
  • Etc.

Comment your code

  • To the point
  • What’s needed and not more

Write a PR description

  • To the point
  • What’s needed and not more
  • Labels

Ask for review

  • Check the code
  • Focus on what’s tricky

20 of 31

Part II: Performance and GPUs

  • GPU computing

  • Parallelization

21 of 31

GPU-accelerated supercomputing

CPU supercomputer: Cori @ NERSC (#13 in the Top500)

GPU-accelerated supercomputer: Summit @ OLCF (#1 in the Top500)

Both are programmed with MPI + X.

All pre-exascale and planned exascale supercomputers are accelerated.

1 Cori node: 1 Intel KNL CPU, ~3 TFLOPS, 112 GB (96 GB DDR4 + 16 GB HBM)

1 Summit node: 2 IBM Power9 CPUs (~0.8 TFLOPS, 256 GB DDR4 each) and 6 NVIDIA V100 GPUs (~7.8 TFLOPS, 16 GB HBM each), i.e. ~42 TFLOPS and 608 GB per node

22 of 31

GPU-accelerated supercomputing

23 of 31

[Diagram: each ParallelFor() on the host launches a kernel (Kernel 1-4) and returns immediately; the device executes the kernels while the host keeps computing.]

Asynchronous execution: you can explicitly synchronize with cudaDeviceSynchronize().

24 of 31

[Diagram: the same timeline, now with time() calls on the host between the kernel launches; the host-side timers return before the kernels have actually finished.]

Makes it difficult to have accurate timers
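A small sketch (xp and N are assumed device data and size) of why a synchronize is needed around host-side timers:

auto t0 = amrex::second();
amrex::ParallelFor(N, [=] AMREX_GPU_DEVICE (int i) { xp[i] += 1.; });
amrex::Gpu::synchronize();   // without this, t1-t0 only measures the kernel launch, not the kernel itself
auto t1 = amrex::second();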

Now, where is the data?


25 of 31

GPU-accelerated supercomputing

MPI_Init(&argc, &argv);
amrex::Initialize(argc, argv);
WarpX warpx;
warpx.InitData();
warpx.Evolve();
amrex::Finalize();
MPI_Finalize();

Data is initialized on the host, in unified memory.

Managed/unified memory: a single memory address space!

[Diagram: the host (host execution, 256 GB host memory) allocates and first processes the data; the first ParallelFor(data), i.e. a kernel launch on the device (device execution, 16 GB device memory), needs the data, page-faults, and migrates it to device memory, where the following kernels keep processing it until the host needs it again.]

  • No explicit copying to GPU memory
  • Pass raw pointers
  • Keep data on GPU as much as possible
  • Back to CPU if:
    • Out of memory
    • Communication/LB
    • Diagnostics
    • Any data access from the host…

Indicative bandwidths: 170 GB/s (host memory), 900 GB/s (device memory), 50 GB/s (host-device link).
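A minimal sketch (assuming a GPU build where amrex::The_Arena() hands out managed memory; N is an illustrative size):

const int N = 1024;
auto* p = static_cast<amrex::Real*>(amrex::The_Arena()->alloc(N*sizeof(amrex::Real)));
amrex::ParallelFor(N, [=] AMREX_GPU_DEVICE (int i) { p[i] = 0.; });  // first touch on the device: pages migrate there
// keep working on p in device kernels; reading p[i] from the host would page-fault it back
amrex::Gpu::synchronize();
amrex::The_Arena()->free(p);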

26 of 31

A detailed example

  • What is Array4?
  • Why #ifdef _OPENMP?
  • Why TilingIfNotGPU?
  • How many kernel launches?
  • What exactly is copied to the GPU memory (if anything)?

warpx/FieldSolver/FiniteDifferenceSolver/EvolveB.cpp

amrex/Base/AMReX_Array4.H
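To help answer these questions, here is a hedged, simplified sketch of the pattern used there (the field name and the update are placeholders, not the actual EvolveB stencil):

#ifdef _OPENMP
#pragma omp parallel if (amrex::Gpu::notInLaunchRegion())   // OpenMP threads over tiles, on CPU only
#endif
for (amrex::MFIter mfi(*Bfield[0], amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
{
    const amrex::Box& tbx = mfi.tilebox();                         // tile on CPU, full box on GPU
    amrex::Array4<amrex::Real> const& Bx = Bfield[0]->array(mfi);  // lightweight view, cheap to copy to the GPU
    amrex::ParallelFor(tbx,                                        // one kernel launch per box and per ParallelFor
        [=] AMREX_GPU_DEVICE (int i, int j, int k)
        {
            Bx(i,j,k) += 1.;                                       // placeholder for the curl(E) update
        });
}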

27 of 31

MFIter, device synchronize and CUDA streams

[Diagram: the MFIter iteration over Box 0 launches its Kernels 1-3 on CUDA stream 0, and the iteration over Box 1 launches its Kernels 1-3 on stream 1, so kernels from different boxes can execute concurrently while the host keeps launching.]

CUDA streams allow for executing multiple kernels simultaneously on a GPU!

28 of 31

See AMReX GPU strategy

https://amrex-codes.github.io/amrex/docs_html/GPU.html

29 of 31

See AMReX GPU strategy

https://amrex-codes.github.io/amrex/docs_html/GPU.html

// See PhysicalParticleContainer::Evolve
for (WarpXParIter pti(*this, lev); pti.isValid(); ++pti){
    FArrayBox filtered_Ez;
    Elixir ezeli;
    const Box& tbox = …;
    filtered_Ez.resize(amrex::convert(tbox, Ez.box().ixType()));
    ezeli = filtered_Ez.Elixir();
    Filter[lev]->ApplyStencil(filtered_Ez, Ez[pti]);
    // Particles gather from filtered_Ez
}

At the end, the ParIter (or MFIter) destructor is called, which triggers a cudaDeviceSynchronize().

30 of 31

Parallelization

Grid = box

  • amr.max_grid_size is the maximum number of points per grid along each direction (default: amr.max_grid_size = 32 in 3D).
  • amr.blocking_factor: the size of each grid must be divisible by the blocking_factor along all dimensions (default: amr.blocking_factor = 8).
  • max_grid_size also has to be divisible by blocking_factor.
  • The actual box decomposition depends on the number of MPI ranks requested (see the example below).
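For example, a hypothetical decomposition (not from an actual input deck): with the settings below, the 256^3 domain is split into 64 boxes of 64^3 cells, which can then be spread over the MPI ranks.

amr.n_cell          = 256 256 256
amr.max_grid_size   = 64
amr.blocking_factor = 32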

For GPU:

  • Use CUDA-aware MPI: amrex.use_gpu_aware_mpi = 1
  • We currently synchronize at every WARPX_PROFILE call

For a lot of optimizations, the Tiny Profiler and the standard output give enough information:

  • Is the problem well decomposed?
  • Is it load-balanced (LB)?
  • Does it fit in GPU memory?
  • Is the problem big enough per GPU?
  • Is it dominated by communications?

31 of 31

Run efficiently (on Summit, but also on any platform!)

  • The standard output tells you the total number of GPUs used (so far we use 1 MPI rank per GPU) and the number of boxes (called "grids") before the first iteration. Use it to make sure that you are using a few (1-10) boxes per GPU; this is how AMReX is meant to operate.
  • A simulation executed efficiently should take < 1 s per iteration on Summit, unless you are using the PSATD solver.
  • If a run is very slow, the most likely cause is that your simulation is running out of GPU memory (see page faults), generating CPU-GPU transfers at each iteration. This is extremely inefficient, and WarpX is not meant to run this way. It usually means that you need to run on a larger number of nodes.
  • If the simulation completes successfully, you can verify the memory utilization per GPU in the lines printed after the Tiny Profiler timers:

Total GPU global memory (MB) spread across MPI: [16128 ... 16128]
🡪 hardware limit (16 GB)

Free GPU global memory (MB) spread across MPI: [4 ... 6]
🡪 GPU memory that is still available

[The Arena] space (MB) allocated spread across MPI: [24367 ... 27302]
🡪 memory allocated by the Arena (the AMReX module that handles memory allocation and management)

  • As soon as "Free GPU global memory" is below ~1000 (it will never be exactly 0), you may safely assume that you are overflowing the GPU memory. This is confirmed by the third line, which shows that the memory allocated is > 16 GB, so again this run should use more nodes.
  • AMReX has an option to abort if the run exceeds GPU memory: amrex.abort_on_out_of_gpu_memory. I like it a lot.
  • If a run has load imbalance, this can be seen in the Tiny Profiler output, for example in the current deposition time:

PPC::CurrentDeposition                        1106          0      4.966      20.61  19.44%

  • meaning that one GPU spends 0 s in deposition while another spends over 20 s. In this case, note that load imbalance pollutes the timers of the communication routines, so do not trust them too much. To fix this, you may use the AMReX load balancer with some of the options below (see the documentation for all options, and ask Michael for more details). Feel free to try these on test2 (we didn't have time to cover it today).

# warpx.load_balance_int = 10
# algo.load_balance_costs_update = timers
# warpx.load_balance_with_sfc = 1