LLVM GPU Working Group Meeting
Agenda / Notes
This document is public
Meeting Information
Open LLVM GPU Working Group Meetings.
Meeting room link: https://argonne.zoomgov.com/j/1602786184
Meeting ID: 160 278 6184
Public calendar link: https://calendar.google.com/calendar/embed?src=c_f5cpcv8upnjksh60vb16kf7hik%40group.calendar.google.com
LLVM Community Code of Conduct
Agenda
Apr 14, 2023 8am PT, 10am CT, 11am ET, 3pm UTC, 8:30pm IST
- Sameer: The LLVM convergent attribute is not sufficient to express whether or not a call preserves a thread set
- Introduce "convergence tokens" that represent converged instances.
- Uses LLVM token values because they cannot be used in a PHI
- Dynamic instances create tokens in each block; if the tokens are equal, the instances are considered converged
- Nicolai: The tokens should let you look at the IR and visualize which communication takes place
- Johannes: This has nothing to do with the tokens; why do we need them? We need to preserve convergence
- Nicolai: Tokens are needed but the example is poor
- Johannes: Need a clearer set of rules. Our goal is to modify underlying control flow, while preserving the old control flow to check for thread convergence.
- Convergent right now simply says not to modify it. Tokens are fine in concept but we simply want rules to outline how they are used
- We need a set of verifier rules, so that we can detect when a transformation that does not understand the tokens breaks the rules.
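As an illustrative aside (not from the meeting notes): a minimal CUDA sketch of the kind of convergent operation the token proposal is about; the shuffle communicates within whichever set of threads executes it together, so a transformation that changes that set changes the result.

```cuda
#include <cuda_runtime.h>

// Warp-level sum: every lane of the (converged) warp takes part in the shuffles.
__device__ int warpSum(int v) {
  for (int offset = 16; offset > 0; offset /= 2)
    v += __shfl_down_sync(0xffffffffu, v, offset); // convergent operation
  return v;
}

__global__ void blockLeaderSum(const int *in, int *out) {
  int v = in[blockIdx.x * blockDim.x + threadIdx.x];
  // The sum is computed while the whole warp is converged. If an optimizer
  // sank the call into the divergent branch below, only one lane per warp
  // would reach the shuffles and the result would change. `convergent`
  // forbids that move, but it cannot express *which* set of threads a call
  // communicates with; convergence tokens are meant to make that explicit.
  int s = warpSum(v);
  if (threadIdx.x % 32 == 0)
    atomicAdd(&out[blockIdx.x], s);
}
```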
- Joseph H: Short GPU libc update
Feb 3, 2023 8am PT, 10am CT, 11am ET, 4pm UTC, 9:30pm IST
- Use existing libc in LLVM and target it for the GPU
- Currently only builds basic support for libc functions that don’t require the OS (unlike, e.g., strdup, which needs malloc), and is not yet tested
- Need remote procedure calls for OS functions
- Goal: call C functions like printf on the device
- Matt: Host call does that already, but does not work on all systems
- Ptrs + implicit input buffer, associated runtime thread, black box to the compiler
- Joseph: We want something common across implementations, as agnostic to the underlying runtime as possible
- Generic server that can communicate with whatever
- Right now host call is integrated into the runtime itself, need something more plug-and-play
- Joseph: More testing for the GPUs
- Get the libC project to run GPU tests, get coverage there
- Matt: Recently could not figure out how to build the OpenMP test suite
- This would make it easier for folks to write tests for it
- Joseph: Need something like gtest tests
- Loader + start code
- Cross compile and launch on the GPU
- Printf support will allow gtest on the GPU
- Joseph: Eventually extend to libc++
- Eventually convince people to use Clang instead of Nvidia tools
- Ravi: When can we expect printf?
- Joseph: Implementation details pending
- Each warp/wave has its own queue, waiting for the RPC server
- In nvidia implementation, flush at the end of the kernel
- Joseph: Some generic way to call into host for reverse offloading
- Need warp-local stuff
- Async thread on the host
- Ravi: People prefer something simple and fast
- Joseph: Big important functions: malloc, printf, free
- Difficult to make malloc fast, a lot of optimization work on the server-side
- Starting initial stuff in the next week
- Ravi: fprintf more difficult
- Joseph: want to print errors for tests
- Joseph: Timeline before release 17 / next LLVM dev mtg
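As a rough illustration of the host-service idea above (the actual libc RPC interface was still an implementation detail at this point), a hypothetical CUDA sketch of an asynchronous "server" thread on the host polling a mailbox that device code fills in; the opcode, layout, and single-slot design are made up for clarity, whereas the real design uses per-warp queues.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical single-slot mailbox placed in host-mapped (zero-copy) memory.
struct Mailbox {
  volatile int opcode; // 0 = empty, 1 = "print this string"
  char payload[128];
};

__global__ void client(Mailbox *mb) {
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    const char msg[] = "hello from the GPU\n";
    memcpy(mb->payload, msg, sizeof(msg));
    __threadfence_system();    // make the payload visible to the host
    mb->opcode = 1;            // post the request
    while (mb->opcode != 0) {} // wait for the host server to acknowledge
  }
}

int main() {
  Mailbox *mb;
  // With unified addressing, the mapped host pointer is usable on the device.
  cudaHostAlloc(&mb, sizeof(Mailbox), cudaHostAllocMapped);
  memset((void *)mb, 0, sizeof(Mailbox));
  client<<<1, 32>>>(mb); // kernel launch is asynchronous
  // Minimal "RPC server" loop: wait for a request, service it, acknowledge.
  while (mb->opcode != 1) {}
  fputs(mb->payload, stderr);
  mb->opcode = 0;
  cudaDeviceSynchronize();
  cudaFreeHost(mb);
  return 0;
}
```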
- Joseph H: Registration methods for offloading runtime.
- In OpenMP, we depend on linker-defined symbols
- Applies to CUDA/HIP as well
- Alternatives that would be more common? Linker magic instead of special handling of sections. Export stubs dynamically and look them up in dynamic libraries (check for stub with a matching name)
- High-level goal – unify logic across Clang for all OSs
- Ravi: Intel/Windows should be similar, section name, dollar sign, …
- Ravi: Clang should already do that for comdat
- Joseph: should also be similar to what ASan does
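For context, a minimal host-side sketch (in CUDA/C++) of the linker-defined-symbol scheme referred to above: registration records go into a named section, and the runtime walks the array between the `__start_`/`__stop_` symbols an ELF linker synthesizes for it. The section name and entry layout here are invented for illustration and are not the real OpenMP/CUDA offload entry ABI.

```cuda
#include <cstddef>
#include <cstdio>

// Illustrative entry record; the real offload entry layout differs.
struct OffloadEntry {
  void *addr;       // host address of the stub or global
  const char *name; // name used to look up the device-side counterpart
  size_t size;      // size for globals, 0 for functions
};

#define REGISTER_OFFLOAD_ENTRY(sym)                                           \
  static OffloadEntry sym##_entry                                             \
      __attribute__((used, section("my_offload_entries"))) = {(void *)&sym,   \
                                                              #sym, 0}

void kernelStub() {} // stand-in for a host-side kernel stub
REGISTER_OFFLOAD_ENTRY(kernelStub);

// For a section whose name is a valid C identifier, an ELF linker defines
// __start_<section> and __stop_<section> around its contents.
extern OffloadEntry __start_my_offload_entries[];
extern OffloadEntry __stop_my_offload_entries[];

int main() {
  for (OffloadEntry *e = __start_my_offload_entries;
       e != __stop_my_offload_entries; ++e)
    printf("would register %s\n", e->name);
  return 0;
}
```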
- Joseph H: Automate process to create generic IR libraries?
- It’s difficult to make this generic: CUDA relies on reflection passes to mask unsupported intrinsics; AMD requires weird linked libraries to override symbols
- Matt: Building for every target is ridiculous, should build per generation
- Work on making it easier to link libraries
- Weak-linking based overloading for fast cases
- ifunc/branches on target features
- Joseph: Maintenance: can we automate this process
- Automatically handle target-dependent functions
- Matt: 6-20-ish functions, not >100
- Combine identical cases
- Matt: Few more general math intrinsics
- Joseph: Ideally, one library per target triple
- Need infrastructure for more generic GPU bitcode libraries
- Matt: Ugly case: wavefront reduction functions
- Deeply nested macros, but really lots of repetition/redundancy
- Need to implement those in the backend
- Eymen: +1, having those exposed as intrinsics would drastically simplify things
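A tiny sketch of the "weak-linking based overloading" idea from the list above, using a made-up function name: the generic bitcode library ships a weak, portable fallback, and a target-specific library can provide a strong definition that wins at (LTO) link time.

```cuda
// Generic library: portable fallback, weak so a per-target library can
// override it without touching the caller.
__device__ __attribute__((weak)) int gpu_popcount(unsigned x) {
  int n = 0;
  for (; x; x &= x - 1)
    ++n;
  return n;
}

// A target-specific translation unit could provide a strong definition that
// maps onto a hardware instruction, e.g.
//   __device__ int gpu_popcount(unsigned x) { return __popc(x); }
// and the (bitcode) linker picks the strong symbol whenever it is present.

__global__ void kernel(const unsigned *in, int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = gpu_popcount(in[i]);
}
```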
Jan 6, 2023 8am PT, 10am CT, 11am ET, 4pm UTC, 9:30pm IST
- Jakub K: What produces these?
- Matt A: These are available in CUDA
Dec 2, 2022 8am PT, 10am CT, 11am ET, 4pm UTC, 9:30 IST
- Jakub K: Change of format for LLVM GPU News
- Jakub: Biweekly cadence too time-consuming for me without contributions from other folks
- Johannes D: Irregular release schedule, based on how much content there is
- Suspending LLVM GPU News and sending Alex some tips on things to include in LLVM Weekly
- Alexey B: LLVM GPU News was a good venue for out-of-tree projects like Intel’s SYCL that are outside of the scope of LLVM Weekly
- Jakub: Try the irregular schedule first, suspend the newsletter if there is no interest within a month or so
Oct 21, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Note: We switched to Zoom (instead of Meet) starting from this meeting: https://argonne.zoomgov.com/j/1602786184
- Johannes: Implementing libm and libc for the GPU
- Joseph H: Prerequisites for moving CUDA and HIP to the new driver
Sep 16, 2022 CANCELED 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
Aug 19, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Sameer S: Uniformity Analysis
Jul 15, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Alex B: Opaque types - common requirements between DXIL, SPIR-V, WebAssembly. How to coordinate and collaborate.
- Context: https://discourse.llvm.org/t/rfc-better-support-for-typed-pointers-in-an-opaque-pointer-world/63339
- Chris B: From DX perspective, significant overlap
- Nicolai H: Do WebAsm types have type parameters?
- Alex: They can have parameters, structs can contain other types.
- Alex: Do not need any demangling in the middle end, just need to preserve type identity until the backend.
- Jakub K: Do you have some linking requirements?
- Alex: Later support for importing/exporting types, not in the type system right now.
- Alex: Ultimately a WebAsm verifier will check types
- Johannes: Does WebAsm support physical pointers?
- Alex: Yes, for main memory (which maps well to the LLVM model) but not for GC types (no int2ptr/ptr2int, no arbitrary bitcasts)
- Johannes: Would like (typed) pointers to be useful for telling how objects are used
- Matt: There are existing passes that do something like this, what else would we need?
- Johannes: Reason about more than one level of indirection.
- Jakub: For Wasm, would you need to keep offsets separate?
- Alex: With Wasm you would not need to carry around an additional offset, but we would have to extend support for limited casting (inserted by the frontend to cast the return value of an intrinsic, given intrinsics can’t be type-parameterised), intrinsics, name mangling
- Nicolai: Name mangling can get messy, experimented with tablegen-based approach but not for parameterized types
- Joshua: Name mangling is difficult for SPIR-V, we would like to move away from that hack. Because Clang generates allocas, you have to be able to load/store them.
- Alex: In the wasm use case the reason I think we’ll need a textual type description (name mangled or looked up via indirection) is it’s not clear we can cleanly map all wasm GC types to LLVM’s type system. E.g. if struct field attributes aren’t available in LLVM IR etc.
- Nicolai: Should this be an intrinsic or builtin attribute?
- Matt: Users are always allowed to call the exposed builtins, which are poorly defined
- Johannes: Decide whether HIP/Cuda should ...?
- Nicolai: In the graphics world, programs don't pass uninitialized values to functions, is this different for compute programs?
- Matt: Is this a target property at the end? All or nothing approach.
- Chris: For HLSL this should result in compiler errors, modulo implementation issues.
- Johannes: Make exceptions for exactly these intrinsics.
- Nicolai: You should not really have to initialize those values. In the source language if you initialize the variable the compiler cannot optimize this later.
- Matt: In HIP this looks like a regular function call (which internally calls the builtin)
- Johannes: Should we support user wrapper functions like this? Is there enough usage to make it worth it? Instead, add an attribute that says don't put noundef on this.
- Johannes: Removing extra information everywhere seems like the worst of both worlds.
- Johannes: Passing undef to noundef allows you to assume this does not happen and remove the code.
- Johannes: Would 'maybeundef' attribute solve the problem?
- Matt: HIP and Cuda don't really have language specs.
- Nicolai: Would slightly prefer the attribute over special casing those functions.
- Nicolai: Alternative would be to tell developers to initialize these variables and then try to remove the initialization in the backend.
- Johannes: We would have to write this analysis and folks would have to modify the code to initialize variables. We can choose to drop the noundef, regardless of the spec.
- Johannes: Will ping Clang folks and see if they are fine with a new attribute 'maybeundef'.
- Matt: I would prefer to force users to initialize than optimize.
- Nicolai: Some cases may be difficult to optimize away.
- Jakub: If we do not optimize, how much performance do we lose?
- Nicolai: These are rare in the first place.
- Johannes: Maybe first try to go the optimization route, and then if we lose too much revisit the maybeundef attribute?
- Matt: maybeundef is less work overall.
- Post-script: https://discourse.llvm.org/t/llvm-gpu-working-group-meeting-friday-july-15-2022/63765/3
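An illustrative HIP/CUDA-style example of the pattern behind this discussion (names invented): a user-level wrapper around a convergent builtin where some lanes deliberately pass a value they never initialized, which is exactly what clashes with Clang marking the wrapper's parameters `noundef`.

```cuda
#include <cuda_runtime.h>

// Looks like a regular function to the compiler, but internally calls the
// builtin; Clang would normally mark its parameter `noundef`.
__device__ int broadcastFromLane0(int v) {
  return __shfl_sync(0xffffffffu, v, /*srcLane=*/0);
}

__global__ void kernel(int *out) {
  int v; // only lane 0 ever initializes this
  if (threadIdx.x % 32 == 0)
    v = 42;
  // Every lane must call the wrapper (the builtin is convergent), so the
  // remaining lanes pass an uninitialized value that nothing ever reads.
  // With `noundef` on the parameter this becomes undefined behaviour, which
  // is what the 'maybeundef' / "initialize and optimize it away later"
  // alternatives above try to address.
  out[threadIdx.x] = broadcastFromLane0(v);
}
```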
- Johannes D: libm, libc(++), how to do it, where to put it
- Johannes: Inside or outside of the main libc(++).
- Johannes: What's the SYCL strategy?
- Alexey: In Intel's implementation, separate GPU library
- Johannes: The benefit of living outside of libc(++) is that we can support libstdc++ as well
- Ravi: Would have to restrict some functionality
- Alexey: In our implementation, first develop the building blocks, and later revisit how to implement those standard libraries
- Johannes: Like the idea of supporting the host libraries from the beginning
- Alexey: Our implementation is already open source. License compatible with LLVM. Compatible with libstdc++ and Microsoft's libraries.
- Johannes: Could some of this go to LLVM mainline
- Alexey: Yes, that's the goal
- Johannes: Need an RFC to make sure we find the right place for these GPU libraries
Jun 17, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- Anastasia S: Representation of special types like textures, samplers, etc - using opaque pointers is fragile
- Designed to cover OpenCL use cases, fragile solution (metadata) but easy to do
- Lost representation for special types with the transition to opaque ptrs
- These types don't exist in LLVM (e.g., Image)
- Matt A: These should use addr spaces and fat pointers
- Anastasia: Native solution preferred
- Nicolai: Similar direction would work for the AMD compiler.
- Difficult to encode SPIR-V types as llvm addr spaces
- Joshua C: Issues caused by ptr2int
- Jakub K: How about addr spaces and module metadata
- Joshua C, Anastasia: intrinsic type (bit width, identifier) would be more elegant
- Jakub K: What kind of IR entity would that be?
- Anastasia: something like opaque struct types
- Justin H: would these be target-specific?
- Anastasia: target-specific by design, for SPIR-V we work around it already
- Joshua C: similar to scalable vector types: can put in a struct, load, store, but not convert to an int
- Nicolai: bit sizes don't matter for SPIR-V anyway
- You would expect these to appear in intrinsics, phi nodes, selects
- Anastasia: Need this also for function declarations, otherwise won't link with spirv-link
- Joshua C: Instead of metadata, element type attribute should work
- Limited to intrinsics and inline assembly, but that should be easy to change
- Anastasia: Would other languages/targets use it if introduced?
- Chris B: DXIL support linking, would strongly prefer not to use metadata
- Nicolai: When consuming DXIL, would have to preserve this as well
- Lei Z: Android drivers use old toolchain and won't be updated, will still require typed pointers in the future
- Joshua C: even old drivers should be able to handle new input SPIR-V
- Joseph H: Should LTO be the default target for GPU compilation.
- For CUDA this is always expected to improve performance
- Previously AMD used bitcode linking but did not do LTO
- Downside: longer compilation times
- Chris B: should be decided on a language-by-language basis, should not be the default for HLSL
- Joseph: check if supported by the toolchain first
- Matt A: makes sense over opt-in
- Johannes D: parallel compilation should solve the issue with longer compilation times
- Matt A: all compilation time issues are in scheduling/reg alloc, LTO does not matter much
- Joseph: for giant applications, 2x speedup at the cost of 20s more to compile
- Ravi N: It's just pulling it in up front
- Joseph: With LTO being the default, we can also enable the RDC mode by default
- Jakub K: Can you start an RFC?
- Joseph: Yes, but llvm 16 timeline
- Matt A: Does anything break when enabled?
- Joseph: Poor test suite now, but tested and works
- Maybe some build system workarounds would stop working
- Won't embed fat binaries into the module
- Would lose mutual compatibility with how CUDA does this compilation, unclear how important that is
- Johannes D: Write an RFC and post to the discourse
- Ravi N: we will retain the option to opt-out
- Ravi N: How does this increase the testing burden? You would have to exercise both flows.
- Shouldn't depend on assumptions, should test this.
- LTO may hide bugs.
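For illustration, a two-file CUDA example of what device LTO buys in RDC mode: without LTO, the cross-TU device call below is opaque at device link time and cannot be inlined, which is where the speedups mentioned above come from.

```cuda
// a.cu: device function defined in one translation unit.
__device__ int scale(int x) { return 3 * x + 1; }

// b.cu: kernel in another translation unit calling it. This requires
// relocatable device code (RDC); with device LTO the call can be inlined and
// optimized across the file boundary at link time.
extern __device__ int scale(int x);

__global__ void kernel(const int *in, int *out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = scale(in[i]);
}
```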
May 20, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- (Chris B) Discuss the impact of opaque pointers on GPU code generation targets
- Both DXIL and SPIR-V require typed pointers in their output, can we share infrastructure?
- A parallel approach for SPIR-V will work too
- Chris B: Typed pointers for DX are IR types but don’t live inside Module, similar to MemSSA
- Jakub K: Can we pull typed pointers and pointer type inference out of the DX namespace?
- Chris: We have to expect push back from the community
- Joseph H: Doesn’t Clang already associate types with pointers
- Chris: Only in some cases, Clang attaches attribute data to parameter values only (builtins only?)
- Nicolai H: In SPIR-V pointers are very different, for example Logical Pointers come with a lot of restrictions
- Vulkan is moving away from this
- Jakub K: Let’s get back to this during the next meeting or on Discourse
- (Nicolai H) Proposal: let fences limit which memory kinds they order
- Limits the memory kind(s) that are ordered by the fence.
- Memory kinds would be target-specific. Analogous but orthogonal to syncscope.
- AMDGPU needs at least “global” and “workgroup” (__shared memory)
- No explicit memory kind operand ⇒ all memory is ordered by the fence
- Prior art: SPIR-V’s UniformMemory, WorkgroupMemory, etc. bits on MemorySemantics fields
- Also related: Vulkan/SPIR-V Private vs. NonPrivate
- Jakub K: Is the set of memory scopes closed (fixed number) or open (arbitrary IDs, string, etc.)
- Nicolai: Fixed number, like in SPIR-V
- Joseph H: Why not address space?
- Nicolai: Not the same thing. Private scope != private address space.
- Fences don’t have an address space at all.
- For AMD, address spaces are about how bit pattern translates to addressing
- It’s an orthogonal concept. We used to hack it this way, but gaps started to show.
- Johannes D: Does not sound convincing
- Justin H: This would allow for not having different address spaces for same memory regions
- Nicolai: This would introduce a whole bunch of address space casts, most could be optimized away in practice
- Justin H: Increases compilation time for nothing
- Justin H: If these are target specific, how can middle-end optimize around this?
- Nicolai: Conservative by default, maybe target-specific hooks in the future
- Kind of like AA but applied to memory ordering. Not fully thought through.
- Johannes D: It’s more about generating fences, not movement of instructions?
- Ravi N: What happens on CPUs?
- Nicolai: Does not apply to CPUs
- Johannes D: This has to be written up in an RFC, with both alternatives explained
- Nicolai H: No timeline, one of those papercuts you can live with for a very long time, not inherently broken
- (Joseph H) Discuss new driver linking CUDA upstream with LTO and applying to HIP / SYCL
- Static libraries on the GPU with the new driver (e.g. libm.a)
- Looking for folks to test stuff
- No need to use external tools, new driver makes this work automatically
- Ravi N: This makes the driver much simpler
- Johannes H: Unifies the ecosystem
- Waiting for comments from SYCL folks, is Intel interested?
- Alexey B: Intel’s SYCL LTO support was implemented separately
- Not sure if they have time to work on this
- Tried HIP in the past, but did not have success with that
- Joseph H: AMD does not really have relocatable object files, always used llvm-link IR linking
- Johannes D: SYCL needs more time to figure things out, let’s keep in touch with the SYCL folks
- Joseph H: Another new thing is static linking which should enable better libm support
- (Johannes D) AS-aware Alias Analysis?
- Jakub K: Some address spaces overlap, some not, right? How can we tell? Target data layout info?
- Johannes D: We would need target-specific hooks to tell.
- Nobody on the call is aware of any existing address space-aware AA implementation
Apr 22, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- (Johannes) Where to put libm.<arch>.bc
- For llvm.<math> support we still need some magic, e.g., an "implemented by" mapping.
- CUDA/HIP headers define all math builtins
- Pre-include those headers for programs that use math
- These math calls become intrinsics
- Instead, we could compile those headers into bitcode files which can decide how to perform math at link time
- Where should those things live?
- Matt A: Resource directory (clang)
- This is not sufficient to implement math intrinsics, we should invest more into making them work
- Ravi N: How would you do name mangling?
- Johannes: .bc file name with architecture name
- Ravi N: Is API different for every architecture?
- Johannes: API the same, different implementation.
- Keep function names around so that the compiler can understand them
- Eventually, we want to inline math functions
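For illustration, the user-facing goal: source code just calls libm, and the pre-included headers plus the per-architecture bitcode library (the libm.<arch>.bc discussed above) decide at link time how the call is implemented, late enough that it can still be inlined.

```cuda
#include <cmath>

// An ordinary libm call in device code; the pre-included math headers turn it
// into a call that the per-architecture bitcode library resolves at link time.
__global__ void apply_sin(double *x, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    x[i] = sin(x[i]);
}
```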
- (Joseph, Johannes) New embedding for CUDA/HIP/SYCL
- (Slides: New driver overview)
- In the old driver, each offloading language implemented embedding differently
- New driver simplifies this
- The plan is to make all offloading models use the common implementation (?)
- Ravi N: A data section? How is this different from image and obj
- Johannes: Can embed into IR or object code.
- Ravi N: Linker wrapper responsible for unbundling.
- Johannes: Only requires shallow tools on top of existing elf tools, or llvm-extract for IR
- Johannes: Same registration scheme for OpenMP and CUDA, improves OpenMP/CUDA interoperability
- (Johannes) CUDA/HIP/SYCL runtime tests?
- Matt A: Some in the HIP repo,[a] but this is a different project
- Johannes: The LLVM test suite should have some tests for the other offloading models as well
- Jakub C: In OneAPI the choice was to make tests (GPU) target-independent
- Johannes: LIT could check the GPU and pass it down as a test parameter
- Matt A: Should Clang auto-detect the GPU? I think not.
- Johannes: Even if we are required to provide it, we could still pass it to LIT while making clang generic.
- Matt A: It's like -march=native. Concerning.
- Ravi N: Making it auto-detect could make users think all tests pass when only their host-supported tests ran.
- Johannes: Buildbots are for that.
- Matt A: Would be nicer if we could just use normal tools to test offloading.
- Johannes: For example, opaque ptrs broke OpenMP for a week because we did not notice
- Johannes: For HIP/CUDA we can't even offload to host
- (Mahesha S): Flaky OpenMP tests
- Johannes: That's some bug, it's not supposed to be like this. Please report this.
- We are aware of some races in libompoffload
- (Johannes): Main point from today: think where to put tests.
- (Ravi N): Ask Joseph H. to give some docs on how embedding is done (not just flows).
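A sketch of the kind of self-checking runtime test discussed above: a small offloading program the test suite could build for whichever GPU lit detects (or is told about) and verify on the host. Equivalent HIP or OpenMP-offload versions of the same check would exercise the other runtimes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vecAdd(const int *a, const int *b, int *c, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    c[i] = a[i] + b[i];
}

int main() {
  const int n = 1024;
  int *a, *b, *c;
  cudaMallocManaged(&a, n * sizeof(int));
  cudaMallocManaged(&b, n * sizeof(int));
  cudaMallocManaged(&c, n * sizeof(int));
  for (int i = 0; i < n; ++i) {
    a[i] = i;
    b[i] = 2 * i;
  }
  vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
  cudaDeviceSynchronize();
  for (int i = 0; i < n; ++i)
    if (c[i] != 3 * i) {
      printf("FAIL at %d\n", i);
      return 1;
    }
  printf("PASS\n");
  return 0;
}
```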
Mar 18, 2022 8am PDT, 10am CDT, 11am EDT, 3pm UTC, 8:30pm IST
- (Chris Bieneman) HLSL and graphics code generation targets
- Restricted IR?
- Chris: prototype backend, structured as a backend
- Backend code does not really support emitting IR
- Series of passes that massage the IR
- Refactoring upstream bitcode writer
- Jakub K: Do you need to emit DXIL textual representation besides bitcode?
- Chris: No, will emit textual IR based on new LLVM
- LLVM IR is not stable in general
- Original SPIR was LLVM IR-based
- Nicolai: How to represent graphics-specific operations, metadata, etc?
- Chris: Specifics of DXIL in the backend
- Want to use more general / flexible IR in the frontend, separated and abstracted away from the backend
- Do not want to pollute the rest of LLVM
- DXIL operations are functions not intrinsics, with the first parameter being a unique opcode value
- Translate some metadata to target-specific attributes, clean up the codegen coming out of Clang
- Diego: Isn't AST -> MLIR -> DXIL (and AST -> MLIR -> SPIR-V) cleaner?
- You would want to write passes on top of LLVM IR not DXIL
- Chris: But DXIL is LLVM IR
- Most transformations straightforward in LLVM, would have to reimplement them in MLIR and then lower to LLVM anyway
- Diego: SPIR-V backend could become a DXIL -> SPIR-V conversion
- The difficult part of the SPIR-V conversion is transformations on SPIR-V, currently uses SPIR-V backend
- Lei: In MLIR we don't pull in SPIR-V tools, everything defined in MLIR
- Diego: Mostly about SPIR-V legalization passes that depend on the SPIR-V Tools optimizer
- Lei: Some passes like inlining already ported to MLIR
- Diego: Replicating what SPIR-V tools do is a waste, would prefer to do this on LLVM IR / MLIR
- Chris: MLIR-based flows would have wider benefits, but nobody upstream seems to be working on Clang -> MLIR
- OTOH, SPIR-V backend is making progress
- If the SPIR-V backend doesn't meet the needs, look into MLIR
- MLIR support in Clang is a very large project
- If the SPIR-V backend produces poor code, we can always call an external library / tool from spirv-tools (shell subprocess or something)
- Chris: Going through the current codegen flow instead of MLIR has a much shorter timeframe, attractive from the perspective of dropping support for DXC earlier
- Given infinite time and resources, MLIR path would be preferred
- Johannes: much wider use case for SPIR-V codegen
- Some users want to go via LLVM IR anyway
- Lei: People have been trying to make LLVM IR work for graphics for many years with ongoing problems
- Neither the SPIR-V backend nor MLIR SPIR-V support graphics now
- Lei: MLIR has some graphics bits already
- Diego: Given a DXIL backend, we can have a path from DXIL to SPIR-V
- This will be done regardless of the HLSL inclusion into Clang
- Chris: Want to have HLSL-based abstractions, so that DirectX does not restrict the Vulkan side
- Hai: Make sure to include Khronos folks in future meetings
- William M, post-meeting: I would recommend you look at the Polygeist incubator RFC (RFC: Polygeist LLVM Incubator Proposal - #17 by clattner) which contains a clang-based C/C++ frontend (though doesn’t yet cover the entire language).
- (Joseph Huber) Short update on Math and Cuda using the new driver
- Goal: Regular math call on the GPU, difficulty: nasty headers
- Bitcode library turns intermediary calls into math intrinsics
- Trying to figure out how to compile CUDA and HIP from OpenMP, mostly working through the new driver already
- In the future, fully compile CUDA with the new driver
- Make redistributable code work with Clang
- This should make compiling CUDA with Clang as easy as with nvcc[b]
- Jakub K: Next meeting in a month, there will be an announcement on discourse
Feb 18, 2022 8am PST, 10am CST, 11am EST, 4pm UTC, 9:30pm IST
- (Jakub Kuderski) Administrivia:
- We have a gpu tag on Discourse, use it when creating GPU-related topics: https://discourse.llvm.org/tag/gpu
- To join future meetings without waiting to be approved, you need to be on the list of guests. Message me on Discourse (username kuhar) or via email with your gmail account to be added.
- Meeting duration: 30mins vs 45mins vs 60mins. Is it acceptable for a meeting to run longer than scheduled?
- Johannes: We need to consider short cadence. Schedule for 45 mins but allow it to run longer if necessary.
- Johannes: Need to figure out announcements (people not used to Discourse yet)
- Johannes: One offloading embedding method for all languages, including the new ones
- Joseph: Currently working on moving CUDA offloading to the new scheme
- LTO step in the linker wrapper
- Chris B: Treating GPU compilation as a linkage pipeline would simplify a lot of things
- Johannes/Joseph: Eventually replace invoking Clang again with lld with more arguments
- Alexey: Very similar thing for SYCL
- Want to support standard linkers, considering linker plugins
- Joseph: This scheme should work with whatever linker you use
- Joseph: It's more straightforward to keep device code in ELF data sections
- (Jakub Chlanda) Discussion: LLVM IR passes (such as GVN) not honoring semantics of barriers for multithreaded architectures
- Jakub C: GVN(?); intrinsics are treated specially somewhere and assumed not to clobber anything
- Matt A: Solved by nosync?
- Johannes: Intrinsics need to be annotated probably, need to look for nosync in passes
- Intrinsics should not be made special except for the attributes
- Might look at it, make Jakub C the reviewer
- File a proper bug for this
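A minimal CUDA illustration of the barrier semantics at stake: the load after the barrier observes a store made by a different thread, so a pass must treat the barrier intrinsic as ordering/clobbering memory rather than as a special call that touches nothing.

```cuda
// Launch with up to 256 threads per block.
__global__ void neighbourSum(const int *in, int *out) {
  __shared__ int smem[256];
  unsigned t = threadIdx.x;
  smem[t] = in[t];
  __syncthreads(); // all stores above are visible to all loads below
  unsigned nb = (t + 1) % blockDim.x;
  // smem[nb] was written by another thread; hoisting this load above the
  // barrier, or assuming the barrier cannot affect shared memory, would let
  // it observe a stale value.
  out[t] = smem[t] + smem[nb];
}
```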
- Jakub K: The next meeting in one month (March 18), check the agenda doc or shared calendar to confirm and watch for announcements on discourse.
Jan 14, 2022 8am PST, 10am CST, 11am EST, 4pm UTC, 9:30pm IST
Presentation: LLVM GPU Working Group --- First Meeting
- (Jakub & Johannes) Administrative stuff: time slot, recurrence, meeting room tool, …
- A lot of different topics, e.g., MLIR, Offloading, divergence
- Johannes: other meetings add a lot of things to agenda and then attempt to go over all of them
- If we have too much content (presentations, etc.), trim it down and discuss later
- This group is very diverse
- The agenda doc should be persistent, add new meetings to the top
- We should have a top-level list of future topics
- GPU/Offloading is a meta-category and does not fit exactly into any existing category
- Follow up: official proposal to add a GPU/Offloading category to llvm discourse:
- (Johannes) LLVM GPU Math Library
- You are not allowed to create intrinsics out of functions
- Main idea: optimizer should be able to optimize based on the knowledge of math functions, even when we provide an implementation.
- OpenMP is moving to the LTO pipeline.
- Ravi: For OpenMP we can do this with variants
- Johannes: we would need dynamic variants, maybe?
- Johannes: we need a header as a math overlay
- Johannes: but then we would have different symbols
- Workaround: mapping from math intrinsics to math functions
- Jon: concerned about accuracy
- Matt: Only OpenCL gives you ULP bounds
- Johannes: if you disable errno, you get a pure llvm.* call
- Discussion on how this interacts with constant folding based on host/device implementations
- Matt: difference in who is responsible for providing these in CPU (the platform) and GPU compilation
- Artem: can the optimizer materialize new math function calls?
- Who should do the linking and when and where and how?
- Matt: it already introduces math functions out of thin air
- Discussion on bugs caused by llvm emitting calls to libc functions that may not be there
- TODO(Johannes)[c]: Clean up these notes and summarize the remaining discussion
- (Johannes) LLVM GPU (pre-commit) buildbots[d][e][f]
- We did not get to this agenda item
[a]CUDA buildbots are running CUDA tests in the test-suite with different CUDA versions and standard C++ library variants. https://github.com/llvm/llvm-test-suite/tree/main/External/CUDA
CUDA test-suite also supports running https://github.com/NVIDIA/thrust tests that were pretty good at finding all sorts of problems in both clang and nvcc, but those are rather heavyweight for the bots and are likely bitrotten by now for clang.
[b]Alas, math headers are not the only ones we need from the CUDA SDK and we are expected to pre-include them. So the --cuda-path would still be needed.
That said, I do think that we're moving in the right direction. The new driver + external intrinsic implementation + our own math library will go a long way towards making CUDA compilation way less kludgy than it is right now.
[c]@johannesdoerfert@gmail.com
_Assigned to johannesdoerfert@gmail.com_
[d]FYI: MLIR has a buildbot with a Nvidia GPU which includes running some end-to-end tests: https://lab.llvm.org/buildbot/#/builders/61
[e]And we have CUDA bots, too:
- https://lab.llvm.org/buildbot/#/builders/46
- https://lab.llvm.org/buildbot/#/builders/55
- https://lab.llvm.org/buildbot/#/builders/1