LLVM GPU Working Group Meeting
Agenda / Notes
This document is public
Meeting Information
Open LLVM GPU Working Group Meetings.
Meeting room link: https://argonne.zoomgov.com/j/1602786184
Meeting ID: 160 278 6184
Public calendar link: https://calendar.google.com/calendar/embed?src=c_f5cpcv8upnjksh60vb16kf7hik%40group.calendar.google.com
LLVM Community Code of Conduct
Agenda
Apr 14, 2023 8am PT, 10am CT, 11am ET, 3pm UTC, 8:30pm IST
- Sameer: The LLVM convergent attribute is not sufficient to express whether or not a call preserves a thread set
- Introduce "convergence tokens" that represent converged instances.
- Uses LLVM token values because they cannot be used in a PHI
- Dynamic instances create tokens in each block; if the tokens are equal, the instances are considered converged
- Nicolai: The tokens should let you look at the IR and visualize which communication takes place
- Johannes: This has nothing to do with the tokens; why do we need them? We need to preserve convergence
- Nicolai: Tokens are needed but the example is poor
- Johannes: Need a clearer set of rules. Our goal is to modify underlying control flow, while preserving the old control flow to check for thread convergence.
- Convergent right now simply says not to modify it. Tokens are fine in concept but we simply want rules to outline how they are used
- We need a set of verifier rules, so that we can detect when a transformation that does not understand the tokens breaks the rules.
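As an illustrative aside (not from the meeting notes): a minimal CUDA sketch of the kind of convergent operation the token proposal is about; the shuffle communicates within whichever set of threads executes it together, so a transformation that changes that set changes the result.

```cuda
#include <cuda_runtime.h>

// Warp-level sum: every lane of the (converged) warp takes part in the shuffles.
__device__ int warpSum(int v) {
  for (int offset = 16; offset > 0; offset /= 2)
    v += __shfl_down_sync(0xffffffffu, v, offset); // convergent operation
  return v;
}

__global__ void blockLeaderSum(const int *in, int *out) {
  int v = in[blockIdx.x * blockDim.x + threadIdx.x];
  // The sum is computed while the whole warp is converged. If an optimizer
  // sank the call into the divergent branch below, only one lane per warp
  // would reach the shuffles and the result would change. `convergent`
  // forbids that move, but it cannot express *which* set of threads a call
  // communicates with; convergence tokens are meant to make that explicit.
  int s = warpSum(v);
  if (threadIdx.x % 32 == 0)
    atomicAdd(&out[blockIdx.x], s);
}
```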
- Joseph H: Short GPU libc update
Feb 3, 2023 8am PT, 10am CT, 11am ET, 4pm UTC, 9:30pm IST
- Use existing libc in LLVM and target it for the GPU
- Currently only builds basic support for libc functions that don’t require the OS (unlike, e.g., strdup, which needs malloc), and is not yet tested
- Need remote procedure calls for OS functions
- Goal: call C functions like printf on the device
- Matt: Host call does that already, but does not work on all systems
- Ptrs + implicit input buffer, associated runtime thread, black box to the compiler
- Joseph: We want something common across implementations, as agnostic to the underlying runtime as possible
- Generic server that can communicate with whatever
- Right now host call is integrated into the runtime itself, need something more plug-and-play
- Joseph: More testing for the GPUs
- Get the libC project to run GPU tests, get coverage there
- Matt: Recently could not figure out how to build the OpenMP test suite
- This would make it easier for folks to write tests for it
- Joseph: Need something like gtest tests
- Loader + start code
- Cross compile and launch on the GPU
- Printf support will allow gtest on the GPU
- Joseph: Eventually extend to libc++
- Eventually convince people to use Clang instead of Nvidia tools
- Ravi: When can we expect printf?
- Joseph: Implementation details pending
- Each warp/wave has its own queue, waiting for the RPC server
- In nvidia implementation, flush at the end of the kernel
- Joseph: Some generic way to call into host for reverse offloading
- Need warp-local stuff
- Async thread on the host
- Ravi: People prefer something simple and fast
- Joseph: Big important functions: malloc, printf, free
- Difficult to make malloc fast, a lot of optimization work on the server-side
- Starting initial stuff in the next week
- Ravi: fprintf more difficult
- Joseph: want to print errors for tests
- Joseph: Timeline before release 17 / next LLVM dev mtg
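As a rough illustration of the host-service idea above (the actual libc RPC interface was still an implementation detail at this point), a hypothetical CUDA sketch of an asynchronous "server" thread on the host polling a mailbox that device code fills in; the opcode, layout, and single-slot design are made up for clarity, whereas the real design uses per-warp queues.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical single-slot mailbox placed in host-mapped (zero-copy) memory.
struct Mailbox {
  volatile int opcode; // 0 = empty, 1 = "print this string"
  char payload[128];
};

__global__ void client(Mailbox *mb) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    const char msg[] = "hello from the GPU\n";
    memcpy(mb->payload, msg, sizeof(msg));
    __threadfence_system();    // make the payload visible to the host
    mb->opcode = 1;            // post the request
    while (mb->opcode != 0) {} // wait for the host server to acknowledge
  }
}

int main() {
  Mailbox *mb;
  // With unified addressing, the mapped host pointer is usable on the device.
  cudaHostAlloc(&mb, sizeof(Mailbox), cudaHostAllocMapped);
  memset((void *)mb, 0, sizeof(Mailbox));
  client<<<1, 32>>>(mb); // kernel launch is asynchronous
  // Minimal "RPC server" loop: wait for a request, service it, acknowledge.
  while (mb->opcode != 1) {}
  fputs(mb->payload, stderr);
  mb->opcode = 0;
  cudaDeviceSynchronize();
  cudaFreeHost(mb);
  return 0;
}
```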
- Joseph H: Registration methods for offloading runtime.
- In OpenMP, we depend on linker-defined symbols
- Applies to CUDA/HIP as well
- Alternatives that would be more common? Linker magic instead of special handling of sections. Export stubs dynamically and look them up in dynamic libraries (check for stub with a matching name)
- High-level goal – unify logic across Clang for all OSs
- Ravi: Intel/Windows should be similar, section name, dollar sign, …
- Ravi: Clang should already do that for comdat
- Joseph: should also be similar to what ASan does
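For context, a minimal host-side sketch (in CUDA/C++) of the linker-defined-symbol scheme referred to above: registration records go into a named section, and the runtime walks the array between the `__start_`/`__stop_` symbols an ELF linker synthesizes for it. The section name and entry layout here are invented for illustration and are not the real OpenMP/CUDA offload entry ABI.

```cuda
#include <cstddef>
#include <cstdio>

// Illustrative entry record; the real offload entry layout differs.
struct OffloadEntry {
  void *addr;       // host address of the stub or global
  const char *name; // name used to look up the device-side counterpart
  size_t size;      // size for globals, 0 for functions
};

#define REGISTER_OFFLOAD_ENTRY(sym)                                           \
  static OffloadEntry sym##_entry                                             \
      __attribute__((used, section("my_offload_entries"))) = {(void *)&sym,   \
                                                              #sym, 0}

void kernelStub() {} // stand-in for a host-side kernel stub
REGISTER_OFFLOAD_ENTRY(kernelStub);

// For a section whose name is a valid C identifier, an ELF linker defines
// __start_<section> and __stop_<section> around its contents.
extern OffloadEntry __start_my_offload_entries[];
extern OffloadEntry __stop_my_offload_entries[];

int main() {
  for (OffloadEntry *e = __start_my_offload_entries;
       e != __stop_my_offload_entries; ++e)
    printf("would register %s\n", e->name);
  return 0;
}
```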
- Joseph H: Automate process to create generic IR libraries?
- It’s difficult to make this generic: CUDA relies on reflection passes to mask unsupported intrinsics; AMD requires weird linked libraries to override symbols
- Matt: Building for every target is ridiculous, should build per generation
- Work on making it easier to link libraries
- Weak-linking based overloading for fast cases
- ifunc/branches on target features
- Joseph: Maintenance: can we automate this process
- Automatically handle target-dependent functions
- Matt: 6-20-ish functions, not >100
- Combine identical cases
- Matt: Few more general math intrinsics
- Joseph: Ideally, one library per target triple
- Need infrastructure for more generic GPU bitcode libraries
- Matt: Ugly case: wavefront reduction functions
- Deeply nested macros, but really lots of repetition/redundancy
- Need to implement those in the backend
- Eymen: +1, having those exposed as intrinsics would drastically simplify things
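A tiny sketch of the "weak-linking based overloading" idea from the list above, using a made-up function name: the generic bitcode library ships a weak, portable fallback, and a target-specific library can provide a strong definition that wins at (LTO) link time.

```cuda
// Generic library: portable fallback, weak so a per-target library can
// override it without touching the caller.
__device__ __attribute__((weak)) int gpu_popcount(unsigned x) {
  int n = 0;
  for (; x; x &= x - 1)
    ++n;
  return n;
}

// A target-specific translation unit could provide a strong definition that
// maps onto a hardware instruction, e.g.
//   __device__ int gpu_popcount(unsigned x) { return __popc(x); }
// and the (bitcode) linker picks the strong symbol whenever it is present.

__global__ void kernel(const unsigned *in, int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = gpu_popcount(in[i]);
}
```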
Jan 6, 2023 8am PT, 10am CT, 11am ET, 4pm UTC, 9:30pm IST
- Jakub K: What produces these?
- Matt A: These are available in CUDA
Dec 2, 2022 8am PT, 10am CT, 11am ET, 4pm UTC, 9:30 IST
- Jakub K: Change of format for LLVM GPU News
- Jakub: Biweekly cadence too time-consuming for me without contributions from other folks
- Johannes D: Irregular release schedule, based on how much content there is
- Suspending LLVM GPU News and sending Alex some tips on things to include in LLVM Weekly
- Alexey B: LLVM GPU News was a good venue for out-of-tree projects like Intel’s SYCL that are outside of the scope of LLVM Weekly
- Jakub: Try the irregular schedule first, suspend the newsletter if there is no interest within a month or so
Oct 21, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Note: We switched to Zoom (instead of Meet) starting from this meeting: https://argonne.zoomgov.com/j/1602786184
- Johannes: Implementing libm and libc for the GPU
- Joseph H: Prerequisites for moving CUDA and HIP to the new driver
Sep 16, 2022 CANCELED 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
Aug 19, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Sameer S: Uniformity Analysis
Jul 15, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Alex B: Opaque types - common requirements between DXIL, SPIR-V, WebAssembly. How to coordinate and collaborate.
- Context: https://discourse.llvm.org/t/rfc-better-support-for-typed-pointers-in-an-opaque-pointer-world/63339
- Chris B: From DX perspective, significant overlap
- Nicolai H: Do WebAsm types have type parameters?
- Alex: They can have parameters, structs can contain other types.
- Alex: Do not need any demangling in the middle end, just need to preserve type identity until the backend.
- Jakub K: Do you have some linking requirements?
- Alex: Later support for importing/exporting types, not in the type system right now.
- Alex: Ultimately a WebAsm verifier will check types
- Johannes: Does WebAsm support physical pointers?
- Alex: Yes, for main memory (which maps well to the LLVM model) but not for GC types (no int2ptr/ptr2int, no arbitrary bitcasts)
- Johannes: Would like (typed) pointers to be useful for telling how objects are used
- Matt: There are existing passes that do something like this, what else would we need?
- Johannes: Reason about more than one level of indirection.
- Jakub: For Wasm, would you need to keep offsets separate?
- Alex: With Wasm you would not need to carry around an additional offset, but we would have to extend support for limited casting (inserted by the frontend to cast the return value of an intrinsic, given intrinsics can’t be type-parameterised), intrinsics, name mangling
- Nicolai: Name mangling can get messy, experimented with tablegen-based approach but not for parameterized types
- Joshua: Name mangling is difficult for SPIR-V, we would like to move away from that hack. Because Clang generates allocas, you have to be able to load/store them.
- Alex: In the wasm use case the reason I think we’ll need a textual type description (name mangled or looked up via indirection) is it’s not clear we can cleanly map all wasm GC types to LLVM’s type system. E.g. if struct field attributes aren’t available in LLVM IR etc.
- Nicolai: Should this be an intrinsic or builtin attribute?
- Matt: Users are always allowed to call the exposed builtins, which are poorly defined
- Johannes: Decide whether HIP/Cuda should ...?
- Nicolai: In the graphics world, programs don't pass uninitialized values to functions, is this different for compute programs?
- Matt: Is this a target property at the end? All or nothing approach.
- Chris: For HLSL this should result in compiler errors, modulo implementation issues.
- Johannes: Make exceptions for exactly these intrinsics.
- Nicolai: You should not really have to initialize those values. In the source language if you initialize the variable the compiler cannot optimize this later.
- Matt: In HIP this looks like a regular function call (which internally calls the builtin)
- Johannes: Should we support user wrapper functions like this? Is there enough usage to make it worth it? Instead, add an attribute that says don't put noundef on this.
- Johannes: Removing extra information everywhere seems like the worst of both worlds.
- Johannes: Passing undef to noundef allows you to assume this does not happen and remove the code.
- Johannes: Would 'maybeundef' attribute solve the problem?
- Matt: HIP and Cuda don't really have language specs.
- Nicolai: Would slightly prefer the attribute over special casing those functions.
- Nicolai: Alternative would be to tell developers to initialize these variables and then try to remove the initialization in the backend.
- Johannes: We would have to write this analysis and folks would have to modify the code to initialize variables. We can choose to drop the noundef, regardless of the spec.
- Johannes: Will ping Clang folks and see if they are fine with a new attribute 'maybeundef'.
- Matt: I would prefer to force users to initialize than optimize.
- Nicolai: Some cases may be difficult to optimize away.
- Jakub: If we do not optimize, how much performance do we lose?
- Nicolai: These are rare in the first place.
- Johannes: Maybe first try to go the optimization route, and then if we lose too much revisit the maybeundef attribute?
- Matt: maybeundef is less work overall.
- Post-script: https://discourse.llvm.org/t/llvm-gpu-working-group-meeting-friday-july-15-2022/63765/3
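An illustrative HIP/CUDA-style example of the pattern behind this discussion (names invented): a user-level wrapper around a convergent builtin where some lanes deliberately pass a value they never initialized, which is exactly what clashes with Clang marking the wrapper's parameters `noundef`.

```cuda
#include <cuda_runtime.h>

// Looks like a regular function to the compiler, but internally calls the
// builtin; Clang would normally mark its parameter `noundef`.
__device__ int broadcastFromLane0(int v) {
  return __shfl_sync(0xffffffffu, v, /*srcLane=*/0);
}

__global__ void kernel(int *out) {
  int v; // only lane 0 ever initializes this
  if (threadIdx.x % 32 == 0)
    v = 42;
  // Every lane must call the wrapper (the builtin is convergent), so the
  // remaining lanes pass an uninitialized value that nothing ever reads.
  // With `noundef` on the parameter this becomes undefined behaviour, which
  // is what the 'maybeundef' / "initialize and optimize it away later"
  // alternatives above try to address.
  out[threadIdx.x] = broadcastFromLane0(v);
}
```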
- Johannes D: libm, libc(++), how to do it, where to put it
- Johannes: Inside or outside of the main libc(++).
- Johannes: What's the SYCL strategy?
- Alexey: In Intel's implementation, separate GPU library
- Johannes: The benefit of living outside of libc(++) is that we can support libstdc++ as well
- Ravi: Would have to restrict some functionality
- Alexey: In our implementation, first develop the building blocks, and later revisit how to implement those standard libraries
- Johannes: Like the idea of supporting the host libraries from the beginning
- Alexey: Our implementation is already open source. License compatible with LLVM. Compatible with libstdc++ and Microsoft's libraries.
- Johannes: Could some of this go to LLVM mainline
- Alexey: Yes, that's the goal
- Johannes: Need an RFC to make sure we find the right place for these GPU libraries
Jun 17, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Anastasia S: Representation of special types like textures, samplers, etc - using opaque pointers is fragile
- Designed to cover OpenCL use cases, fragile solution (metadata) but easy to do
- Lost representation for special types with the transition to opaque ptrs
- These types don't exist in LLVM (e.g., Image)
- Matt A: These should use addr spaces and fat pointers
- Anastasia: Native solution preferred
- Nicolai: Similar direction would work for the AMD compiler.
- Difficult to encode SPIR-V types as llvm addr spaces
- Joshua C: Issues caused by ptr2int
- Jakub K: How about addr spaces and module metadata
- Joshua C, Anastasia: intrinsic type (bit width, identifier) would be more elegant
- Jakub K: What kind of IR entity would that be?
- Anastasia: something like opaque struct types
- Justin H: would these be target-specific?
- Anastasia: target-specific by design, for SPIR-V we work around it already
- Joshua C: similar to scalable vector types: can put in a struct, load, store, but not convert to an int
- Nicolai: bit sizes don't matter for SPIR-V anyway
- You would expect these to appear in intrinsics, phi nodes, selects
- Anastasia: Need this also for function declarations, otherwise won't link with spirv-link
- Joshua C: Instead of metadata, element type attribute should work
- Limited to intrinsics and inline assembly, but that should be easy to change
- Anastasia: Would other languages/targets use it if introduced?
- Chris B: DXIL support linking, would strongly prefer not to use metadata
- Nicolai: When consuming DXIL, would have to preserve this as well
- Lei Z: Android drivers use old toolchain and won't be updated, will still require typed pointers in the future
- Joshua C: even old drivers should be able to handle new input SPIR-V
- Joseph H: Should LTO be the default target for GPU compilation.
- For CUDA this is always expected to improve performance
- Previously AMD used bitcode linking but did not do LTO
- Downside: longer compilation times
- Chris B: should be decided on a language-by-language basis, should not be the default for HLSL
- Joseph: check if supported by the toolchain first
- Matt A: makes sense over opt-in
- Johannes D: parallel compilation should solve the issue with longer compilation times
- Matt A: all compilation time issues are in scheduling/reg alloc, LTO does not matter much
- Joseph: for giant applications, 2x speedup at the cost of 20s more to compile
- Ravi N: It's just pulling it in up front
- Joseph: With LTO being the default, we can also enable the RDC mode by default
- Jakub K: Can you start an RFC?
- Joseph: Yes, but llvm 16 timeline
- Matt A: Does anything break when enabled?
- Joseph: Poor test suite now, but tested and works
- Maybe some build system workarounds would stop working
- Won't embed fat binaries into the module
- Would lose mutual compatibility with how CUDA does this compilation, unclear how important that is
- Johannes D: Write an RFC and post to the discourse
- Ravi N: we will retain the option to opt-out
- Ravi N: How does this increase the testing burden? You would have to exercise both flows.
- Shouldn't depend on assumptions, should test this.
- LTO may hide bugs.
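For illustration, a two-file CUDA example of what device LTO buys in RDC mode: without LTO, the cross-TU device call below is opaque at device link time and cannot be inlined, which is where the speedups mentioned above come from.

```cuda
// a.cu: device function defined in one translation unit.
__device__ int scale(int x) { return 3 * x + 1; }

// b.cu: kernel in another translation unit calling it. This requires
// relocatable device code (RDC); with device LTO the call can be inlined and
// optimized across the file boundary at link time.
extern __device__ int scale(int x);

__global__ void kernel(const int *in, int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = scale(in[i]);
}
```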
May 20, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- (Chris B) Discuss the impact of opaque pointers on GPU code generation targets
- Both DXIL and SPIR-V require typed pointers in their output, can we share infrastructure?
- A parallel approach for SPIR-V will work too
- Chris B: Typed pointers for DX are IR types but don’t live inside Module, similar to MemSSA
- Jakub K: Can we pull typed pointers and pointer type inference out of the DX namespace?
- Chris: We have to expect push back from the community
- Joseph H: Doesn’t Clang already associate types with pointers
- Chris: Only in some cases, Clang attaches attribute data to parameter values only (builtins only?)
- Nicolai H: In SPIR-V pointers are very different, for example Logical Pointers come with a lot of restrictions
- Vulkan is moving away from this
- Jakub K: Let’s get back to this during the next meeting or on Discourse
- (Nicolai H) Proposal: let fences limit which memory kinds they order
- Limits the memory kind(s) that are ordered by the fence.
- Memory kinds would be target-specific. Analogous but orthogonal to syncscope.
- AMDGPU needs at least “global” and “workgroup” (__shared memory)
- No explicit memory kind operand ⇒ all memory is ordered by the fence
- Prior art: SPIR-V’s UniformMemory, WorkgroupMemory, etc. bits on MemorySemantics fields
- Also related: Vulkan/SPIR-V Private vs. NonPrivate
- Jakub K: Is the set of memory scopes closed (fixed number) or open (arbitrary IDs, string, etc.)
- Nicolai: Fixed number, like in SPIR-V
- Joseph H: Why not address space?
- Nicolai: Not the same thing. Private scope != private address space.
- Fences don’t have an address space at all.
- For AMD, address spaces are about how bit pattern translates to addressing
- It’s an orthogonal concept. We used to hack it this way, but gaps started to show.
- Johannes D: Does not sound convincing
- Justin H: This would allow for not having different address spaces for same memory regions
- Nicolai: This would introduce a whole bunch of address space casts, most could be optimized away in practice
- Justin H: Increases compilation time for nothing
- Justin H: If these are target specific, how can middle-end optimize around this?
- Nicolai: Conservative by default, maybe target-specific hooks in the future
- Kind of like AA but applied to memory ordering. Not fully thought through.
- Johannes D: It’s more about generating fences, not movement of instructions?
- Ravi N: What happens on CPUs?
- Nicolai: Does not apply to CPUs
- Johannes D: This has to be written up in an RFC, with both alternatives explained
- Nicolai H: No timeline, one of those papercuts you can live with for a very long time, not inherently broken
- (Joseph H) Discuss new driver linking CUDA upstream with LTO and applying to HIP / SYCL
- Static libraries on the GPU with the new driver (e.g. libm.a)
- Looking for folks to test stuff
- No need to use external tools, new driver makes this work automatically
- Ravi N: This makes the driver much simpler
- Johannes H: Unifies the ecosystem
- Waiting for comments from SYCL folks, is Intel interested?
- Alexey B: Intel’s SYCL LTO support was implemented separately
- Not sure if they have time to work on this
- Tried HIP in the past, but did not have success with that
- Joseph H: AMD does not really have relocatable object files, always used llvm-link IR linking
- Johannes D: SYCL needs more time to figure things out, let’s keep in touch with the SYCL folks
- Joseph H: Another new thing is static linking which should enable better libm support
- (Johannes D) AS-aware Alias Analysis?
- Jakub K: Some address spaces overlap, some not, right? How can we tell? Target data layout info?
- Johannes D: We would need target-specific hooks to tell.
- Nobody on the call is aware of any existing address space-aware AA implementation
Apr 22, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- (Johannes) Where to put libm.<arch>.bc
- For llvm.<math> support we still need some magic, e.g., an "implemented by" mapping.
- CUDA/HIP headers define all math builtins
- Pre-include those headers for programs that use math
- These math calls become intrinsics
- Instead, we could compile those headers into bitcode files which can decide how to perform math at link time
- Where should those things live?
- Matt A: Resource directory (clang)
- This is not sufficient to implement math intrinsics, we should invest more into making them work
- Ravi N: How would you do name mangling?
- Johannes: .bc file name with architecture name
- Ravi N: Is API different for every architecture?
- Johannes: API the same, different implementation.
- Keep function names around so that the compiler can understand them
- Eventually, we want to inline math functions
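For illustration, the user-facing goal: source code just calls libm, and the pre-included headers plus the per-architecture bitcode library (the libm.<arch>.bc discussed above) decide at link time how the call is implemented, late enough that it can still be inlined.

```cuda
#include <cmath>

// An ordinary libm call in device code; the pre-included math headers turn it
// into a call that the per-architecture bitcode library resolves at link time.
__global__ void apply_sin(double *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i] = sin(x[i]);
}
```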
- (Joseph, Johannes) New embedding for CUDA/HIP/SYCL
- (Slides: New driver overview)
- In the old driver, each offloading language implemented embedding differently
- New driver simplifies this
- The plan is to make all offloading models use the common implementation (?)
- Ravi N: A data section? How is this different from image and obj
- Johannes: Can embed into IR or object code.
- Ravi N: Linker wrapper responsible for unbundling.
- Johannes: Only requires shallow tools on top of existing elf tools, or llvm-extract for IR
- Johannes: Same registration scheme for OpenMP and CUDA, improves OpenMP/CUDA interoperability
- (Johannes) CUDA/HIP/SYCL runtime tests?
- Matt A: Some in the HIP repo,[a] but this is a different project
- Johannes: The LLVM test suite should have some tests for the other offloading models as well
- Jakub C: In OneAPI the choice was to make tests (GPU) target-independent
- Johannes: LIT could check the GPU and pass it down as a test parameter
- Matt A: Should Clang auto-detect the GPU? I think not.
- Johannes: Even if we are required to provide it, we could still pass it to LIT while making clang generic.
- Matt A: It's like -march=native. Concerning.
- Ravi N: Making it auto-detect could make users think all tests pass when only their host-supported tests ran.
- Johannes: Buildbots are for that.
- Matt A: Would be nicer if we could just use normal tools to test offloading.
- Johannes: For example, opaque ptrs broke OpenMP for a week because we did not notice
- Johannes: For HIP/CUDA we can't even offload to host
- (Mahesha S): Flaky OpenMP tests
- Johannes: That's some bug, it's not supposed to be like this. Please report this.
- We are aware of some races in libompoffload
- (Johannes): Main point from today: think where to put tests.
- (Ravi N): Ask Joseph H. to give some docs on how embedding is done (not just flows).
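A sketch of the kind of self-checking runtime test discussed above: a small offloading program the test suite could build for whichever GPU lit detects (or is told about) and verify on the host. Equivalent HIP or OpenMP-offload versions of the same check would exercise the other runtimes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const int *a, const int *b, int *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    c[i] = a[i] + b[i];
}

int main() {
  const int n = 1024;
  int *a, *b, *c;
  cudaMallocManaged(&a, n * sizeof(int));
  cudaMallocManaged(&b, n * sizeof(int));
  cudaMallocManaged(&c, n * sizeof(int));
  for (int i = 0; i < n; ++i) {
    a[i] = i;
    b[i] = 2 * i;
  }
  vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
  cudaDeviceSynchronize();
  for (int i = 0; i < n; ++i)
    if (c[i] != 3 * i) {
      printf("FAIL at %d\n", i);
      return 1;
    }
  printf("PASS\n");
  return 0;
}
```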
Mar 18, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- (Chris Bieneman) HLSL and graphics code generation targets
- Restricted IR?
- Chris: prototype backend, structured as a backend
- Backend code does not really support emitting IR
- Series of passes that massage the IR
- Refactoring upstream bitcode writer
- Jakub K: Do you need to emit DXIL textual representation besides bitcode?
- Chris: No, will emit textual IR based on new LLVM
- LLVM IR is not stable in general
- Original SPIR was LLVM IR-based
- Nicolai: How to represent graphics-specific operations, metadata, etc?
- Chris: Specifics of DXIL in the backend
- Want to use more general / flexible IR in the frontend, separated and abstracted away from the backend
- Do not want to pollute the rest of LLVM
- DXIL operations are functions not intrinsics, with the first parameter being a unique opcode value
- Translate some metadata to target-specific attributes, clean up the codegen coming out of Clang
- Diego: Isn't AST -> MLIR -> DXIL (and AST -> MLIR -> SPIR-V) cleaner?
- You would want to write passes on top of LLVM IR not DXIL
- Chris: But DXIL is LLVM IR
- Most transformations straightforward in LLVM, would have to reimplement them in MLIR and then lower to LLVM anyway
- Diego: SPIR-V backend could become a DXIL -> SPIR-V conversion
- The difficult part of the SPIR-V conversion is transformations on SPIR-V, currently uses SPIR-V backend
- Lei: In MLIR we don't pull in SPIR-V tools, everything defined in MLIR
- Diego: Mostly about SPIR-V legalization passes that depend on the SPIR-V Tools optimizer
- Lei: Some passes like inlining already ported to MLIR
- Diego: Replicating what SPIR-V tools do is a waste, would prefer to do this on LLVM IR / MLIR
- Chris: MLIR-based flows would have wider benefits, but nobody upstream seems to be working on Clang -> MLIR
- OTOH, SPIR-V backend is making progress
- If the SPIR-V backend doesn't meet the needs, look into MLIR
- MLIR support in Clang is a very large project
- If the SPIR-V backend produces poor code, we can always call an external library / tool from spirv-tools (shell subprocess or something)
- Chris: Going through the current codegen flow instead of MLIR has a much shorter timeframe, attractive from the perspective of dropping support for DXC earlier
- Given infinite time and resources, MLIR path would be preferred
- Johannes: much wider use case for SPIR-V codegen
- Some users want to go via LLVM IR anyway
- Lei: People have been trying to make LLVM IR work for graphics for many years with ongoing problems
- Neither the SPIR-V backend nor MLIR SPIR-V support graphics now
- Lei: MLIR has some graphics bits already
- Diego: Given a DXIL backend, we can have a path from DXIL to SPIR-V
- This will be done regardless of the HLSL inclusion into Clang
- Chris: Want to have HLSL-based abstractions, so that DirectX does not restrict the Vulkan side
- Hai: Make sure to include Khronos folks in future meetings
- William M, post-meeting: I would recommend you look at the Polygeist incubator RFC (RFC: Polygeist LLVM Incubator Proposal - #17 by clattner) which contains a clang-based C/C++ frontend (though doesn’t yet cover the entire language).
- (Joseph Huber) Short update on Math and Cuda using the new driver
- Goal: Regular math call on the GPU, difficulty: nasty headers
- Bitcode library turns intermediary calls into math intrinsics
- Trying to figure out how to compile CUDA and HIP from OpenMP, mostly working through the new driver already
- In the future, fully compile CUDA with the new driver
- Make redistributable code work with Clang
- This should make compiling CUDA with Clang as easy as with nvcc[b]
- Jakub K: Next meeting in a month, there will be an announcement on discourse
Feb 18, 2022 8am PST, 10am CST, 11am EST, 4pm UTC, 9:30pm IST
- (Jakub Kuderski) Administrivia:
- We have a gpu tag on Discourse, use it when creating GPU-related topics: https://discourse.llvm.org/tag/gpu
- To join future meetings without waiting to be approved, you need to be on the list of guests. Message me on Discourse (username kuhar) or via email with your gmail account to be added.
- Meeting duration: 30mins vs 45mins vs 60mins. Is it acceptable for a meeting to run longer than scheduled?
- Johannes: We need to consider short cadence. Schedule for 45 mins but allow it to run longer if necessary.
- Johannes: Need to figure out announcements (people not used to Discourse yet)
- Johannes: One offloading embedding method for all languages, including the new ones
- Joseph: Currently working on moving CUDA offloading to the new scheme
- LTO step in the linker wrapper
- Chris B: Treating GPU compilation as a linkage pipeline would simplify a lot of things
- Johannes/Joseph: Eventually replace invoking Clang again with lld with more arguments
- Alexey: Very similar thing for SYCL
- Want to support standard linkers, considering linker plugins
- Joseph: This scheme should work with whatever linker you use
- Joseph: It's more straightforward to keep device code in ELF data sections
- (Jakub Chlanda) Discussion: LLVM IR passes (such as GVN) not honoring semantics of barriers for multithreaded architectures
- Jakub C: GVN(?); intrinsics are treated specially somewhere and assumed not to clobber anything
- Matt A: Solved by nosync?
- Johannes: Intrinsics need to be annotated probably, need to look for nosync in passes
- Intrinsics should not be made special except for the attributes
- Might look at it, make Jakub C the reviewer
- File a proper bug for this
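A minimal CUDA illustration of the barrier semantics at stake: the load after the barrier observes a store made by a different thread, so a pass must treat the barrier intrinsic as ordering/clobbering memory rather than as a special call that touches nothing.

```cuda
// Launch with up to 256 threads per block.
__global__ void neighbourSum(const int *in, int *out) {
  __shared__ int smem[256];
  unsigned t = threadIdx.x;
  smem[t] = in[t];
  __syncthreads(); // all stores above are visible to all loads below
  unsigned nb = (t + 1) % blockDim.x;
  // smem[nb] was written by another thread; hoisting this load above the
  // barrier, or assuming the barrier cannot affect shared memory, would let
  // it observe a stale value.
  out[t] = smem[t] + smem[nb];
}
```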
- Jakub K: The next meeting in one month (March 18), check the agenda doc or shared calendar to confirm and watch for announcements on discourse.
Jan 14, 2022 8am PST, 10am CST, 11am EST, 4pm UTC, 9:30pm IST
Presentation: LLVM GPU Working Group --- First Meeting
- (Jakub & Johannes) Administrative stuff: time slot, recurrence, meeting room tool, …
- A lot of different topics, e.g., MLIR, Offloading, divergence
- Johannes: other meetings add a lot of things to agenda and then attempt to go over all of them
- If we have too much content (presentations, etc.), trim it down and discuss later
- This group is very diverse
- The agenda doc should be persistent, add new meetings to the top
- We should have a top-level list of future topics
- GPU/Offloading is a meta-category and does not fit exactly into any existing category
- Follow up: official proposal to add a GPU/Offloading category to llvm discourse:
- (Johannes) LLVM GPU Math Library
- You are not allowed to create intrinsics out of functions
- Main idea: optimizer should be able to optimize based on the knowledge of math functions, even when we provide an implementation.
- OpenMP is moving to the LTO pipeline.
- Ravi: For OpenMP we can do this with variants
- Johannes: we would need dynamic variants, maybe?
- Johannes: we need a header as a math overlay
- Johannes: but then we would have different symbols
- Workaround: mapping from math intrinsics to math functions
- Jon: concerned about accuracy
- Matt: Only OpenCL gives you ULP bounds
- Johannes: if you disable errno, you get a pure llvm.* call
- Discussion on how this interacts with constant folding based on host/device implementations
- Matt: difference in who is responsible for providing these in CPU (the platform) and GPU compilation
- Artem: can the optimizer materialize new math function calls?
- Who should do the linking and when and where and how?
- Matt: it already introduces math functions out of thin air
- Discussion on bugs caused by llvm emitting calls to libc functions that may not be there
- TODO(Johannes)[c]: Clean up these notes and summarize the remaining discussion
- (Johannes) LLVM GPU (pre-commit) buildbots[d][e][f]
- We did not get to this agenda item
[a]CUDA buildbots are running CUDA tests in the test-suite with different CUDA versions and standard C++ library variants. https://github.com/llvm/llvm-test-suite/tree/main/External/CUDA
CUDA test-suite also supports running https://github.com/NVIDIA/thrust tests that were pretty good at finding all sorts of problems in both clang and nvcc, but those are rather heavyweight for the bots and are likely bitrotten by now for clang.
[b]Alas, math headers are not the only ones we need from the CUDA SDK and we are expected to pre-include them. So the --cuda-path would still be needed.
That said, I do think that we're moving in the right direction. The new driver + external intrinsic implementation + our own math library will go a long way towards making CUDA compilation way less kludgy than it is right now.
[c]@johannesdoerfert@gmail.com
_Assigned to johannesdoerfert@gmail.com_
[d]FYI: MLIR has a buildbot with a Nvidia GPU which includes running some end-to-end tests: https://lab.llvm.org/buildbot/#/builders/61
[e]And we have CUDA bots, too:
- https://lab.llvm.org/buildbot/#/builders/46
- https://lab.llvm.org/buildbot/#/builders/55
- https://lab.llvm.org/buildbot/#/builders/1