1 of 26

Embrace... Extend... Extinguish...

Valentin Churavy
vchuravy@mit.edu

2 of 26

3 of 26

Yet another high-level language?

Dynamically typed, high-level syntax

Open-source, permissive license

Built-in package manager

Interactive development

julia> function mandel(z)
           c = z
           maxiter = 80
           for n = 1:maxiter
               if abs(z) > 2
                   return n-1
               end
               z = z^2 + c
           end
           return maxiter
       end

julia> mandel(complex(.3, -.6))
14

4 of 26

Yet another high-level language?

Typical features

Dynamically typed, high-level syntax

Open-source, permissive license

Built-in package manager

Interactive development

Unusual features

Great performance!

JIT compilation in an AOT style ("just-ahead-of-time")

Most of Julia is written in Julia

Reflection and metaprogramming

5 of 26

Embrace...

Julia is built on the shoulders of giants.

  • LLVM as a core piece of compiler infrastructure
  • Easy interoperability
    • C/Fortran
    • Python/R/...
    • Not so easy: C++
  • @cfunction creates a C-function pointer to use as a callback
    • Currently (≤ v1.8) it can only be called from a Julia-managed thread (except for very carefully crafted exceptions)
    • Julia v1.9 adds support for callbacks from arbitrary foreign threads
    • Calling Julia from C with MPI: https://github.com/omlins/libdiffusion
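Both directions of the interop fit in a few lines; the classic libc `qsort` example (mirroring the one in the Julia manual, not tied to the talk's codebase) has Julia call C, and C call back into a Julia comparison function through an `@cfunction` pointer:

```julia
# Julia callback that C will invoke; ::Cint return type matches
# qsort's expected int(*)(const void*, const void*) comparator.
function mycompare(a, b)::Cint
    return (a < b) ? -1 : ((a > b) ? +1 : 0)
end

# Create a C function pointer from the Julia function.
cmp_c = @cfunction(mycompare, Cint, (Ref{Cdouble}, Ref{Cdouble}))

A = [3.0, 1.0, 2.0]
# Call libc qsort, passing the Julia-backed callback.
ccall(:qsort, Cvoid, (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Cvoid}),
      A, length(A), sizeof(eltype(A)), cmp_c)
# A is now [1.0, 2.0, 3.0]
```

Before v1.9 this pattern was safe precisely because `qsort` invokes the callback on the calling (Julia-managed) thread; callbacks from foreign threads needed the v1.9 support mentioned above.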

6 of 26

Example UCX.jl

function ucp_put_nb(ep, buffer, length, remote_addr, rkey, cb)
    ccall(
        (:ucp_put_nb, libucp),
        ucs_status_ptr_t,
        (ucp_ep_h, Ptr{Cvoid}, Csize_t, UInt64, ucp_rkey_h, ucp_send_callback_t),
        ep, buffer, length, remote_addr, rkey, cb)
end

function send_callback(req::Ptr{Cvoid}, status::API.ucs_status_t, user_data::Ptr{Cvoid})
    @assert user_data !== C_NULL
    request = UCXRequest(user_data)
    request.status = status
    notify(request)
    API.ucp_request_free(req)
    nothing
end

function put!(ep::UCXEndpoint, request, data::Ptr, nbytes, remote_addr, rkey)
    cb = @cfunction(send_callback, Cvoid, (Ptr{Cvoid}, API.ucs_status_t, Ptr{Cvoid}))
    ptr = ucp_put_nb(ep, data, nbytes, remote_addr, rkey, cb)
    return handle_request(request, ptr)
end

function put!(ep::UCXEndpoint, buffer, nbytes, remote_addr, rkey)
    request = UCXRequest(ep, buffer) # rooted through ep.worker
    GC.@preserve buffer begin
        data = pointer(buffer)
        put!(ep, request, data, nbytes, remote_addr, rkey)
    end
end

7 of 26

Introspection and staged metaprogramming

Each stage of compilation can be inspected and, via metaprogramming, generated:

  • Code/AST: @code_lowered, @edit, @which; generate with macros, string macros, generated functions, Cassette.jl passes
  • Typed IR: @code_warntype, @code_typed optimize=false, @code_typed
  • LLVM IR: @code_llvm optimize=false, @code_llvm; generate with llvmcall, LLVM.jl
  • Assembly: @code_native; generate with LLVM.jl @asmcall

Julia: my favorite LLVM frontend
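These entry points are ordinary functions and macros; a minimal sketch of walking down the stack for a trivial function (standard Base/InteractiveUtils tooling, nothing beyond what the slide lists):

```julia
f(x) = 2x + 1

# Typed IR: inference has already resolved the return type.
ci, rettype = only(code_typed(f, (Int,)))
println(rettype)   # Int64 on a 64-bit system

# Lowered AST, LLVM IR, and assembly are shown by the matching macros:
#   @code_lowered f(1)
#   @code_llvm f(1)
#   @code_native f(1)
```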

8 of 26

Extend: Adding capabilities

  • Easy (and performant!) user extensibility through the JIT
  • Access to automatic differentiation for UQ/ML/...
    • Together with Enzyme (AD for Julia/C++/Fortran/...), cross-language AD is feasible
    • Checkpointing.jl: seamless integration of checkpointing for AD
  • Outer-loop/explorative work

The PSAAP-III Center (cesmix.mit.edu) is an example of using Julia for the outer loop, while developing new capabilities.

9 of 26

10 of 26

Automatic Gradient Synthesis (language-independent)

  • Compiler plugin for LLVM that performs reverse-mode automatic differentiation
  • Running AD after the compiler's optimization passes offers clear performance benefits: 4.2× speedup
  • Language-independent AD (C, C++, Fortran, Julia, Rust, Swift, etc.)
  • Simulation code should not be rewritten in machine-learning DSLs, but instead be integrated with AD
  • Spin-off from "representing parallelism in the compiler"

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients
Moses and Churavy, NeurIPS 2020

[Figure: % slowdown vs. Enzyme]

https://enzyme.mit.edu/

11 of 26

LAMMPS.jl — Prototype for extendability

using LAMMPS

# ...

command(lmp, "fix julia_lj all external pf/callback 1 1")

function compute_force(rsq, itype, jtype)
    coeff = coefficients[itype][jtype]
    r2inv = 1.0/rsq
    r6inv = r2inv^3
    lj1 = coeff[1]
    lj2 = coeff[2]
    return (r6inv * (lj1*r6inv - lj2))*r2inv
end

function compute_energy(rsq, itype, jtype)
    # ...
    return (r6inv * (lj3*r6inv - lj4))
end

# Register external fix
lj = LAMMPS.PairExternal(lmp, "julia_lj", "zero", compute_force, compute_energy, cutoff)

12 of 26

Extinguish

  • You can write fast, scalable, portable Julia code
  • Large-scale HPC applications like Climate Models being built in Julia

eth-cscs/ImplicitGlobalGrid.jl

13 of 26

Julia gets its Power from Extensible Compiler Design

Language design meets efficient execution:

  • Julia: Dynamism and Performance Reconciled by Design (doi:10.1145/3276490)
  • Effective Extensible Programming: Unleashing Julia on GPUs (doi:10.1109/TPDS.2018.2872064)

[Diagram: AST → IR → xPU back ends (CPU, GPU)]

14 of 26

Why Julia for HPC?
Walks like Python, talks like Lisp, runs like Fortran

Rich GPU ecosystem: CUDA.jl, oneAPI.jl, Metal.jl, AMDGPU.jl all feed the same AST → IR → xPU back-end pipeline.

HPC suffers from the many-language problem: domain experts and performance engineers use different programming languages, creating a communication and collaboration bottleneck.

https://www.nature.com/articles/d41586-019-02310-3

15 of 26

Magic of Julia

Abstraction, Specialization, and Multiple Dispatch

  1. Abstraction to obtain generic behavior: encode behavior in the type domain, e.g. transpose(A::Matrix{Float64})::Transpose{Float64,Matrix{Float64}}
  2. Specialization of functions to produce optimal code
  3. Multiple dispatch to select optimized behavior

rand(N, M) * rand(K, M)'   # Matrix * Transpose{Matrix}

compiles to

function mul!(C::Matrix{T}, A::Matrix{T}, tB::Transpose{T,Matrix{T}}, a, b) where {T<:BlasFloat}
    B = tB.parent  # unwrap the lazy transpose; no data was copied
    gemm_wrapper!(C, 'N', 'T', A, B, MulAddMul(a, b))
end

Did I really need to move memory for that transpose? No, I did not! A*B' is the dot product of every row of A with every row of B.
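The lazy wrapper can be observed directly at the REPL; a small sketch using only the LinearAlgebra standard library:

```julia
using LinearAlgebra

A  = rand(3, 4)
At = transpose(A)     # lazy Transpose wrapper; no memory is moved

At isa Transpose      # true: the transpose lives in the type domain
At[2, 1] == A[1, 2]   # true: indexing swaps coordinates on the fly
parent(At) === A      # true: same underlying buffer
```

Because `Transpose` is an ordinary type, dispatch can pick the `'T'` BLAS path above without ever materializing the transposed matrix.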

16 of 26

Parallel programming with 3-character changes: the array-type programming model

Array types encode where memory resides and how code is executed:

A = Matrix{Float64}(undef, 64, 32)     # CPU (Intel, IBM, Apple)
A = CuMatrix{Float64}(undef, 64, 32)   # NVIDIA (CUDA) GPU
A = ROCMatrix{Float64}(undef, 64, 32)  # AMD (ROCm) GPU

Distribute(Blocks(16, 4), A)           # across the nodes of a cluster;
                                       # orthogonal to, and composes with, the types above
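Because the array type is just another type to dispatch on, generic code runs unchanged across these backends. A CPU-only sketch (the GPU types above require the matching hardware; the function name is ours for illustration):

```julia
# Generic kernel: broadcasting dispatches on the array type, so the same
# function works for Matrix, CuMatrix, ROCMatrix, ... without changes.
axpy_generic!(y, a, x) = (y .= a .* x .+ y; y)

y = zeros(4)
x = ones(4)
axpy_generic!(y, 2.0, x)   # y == [2.0, 2.0, 2.0, 2.0]

# On a GPU, only the construction site changes:
#   y = CuArray(zeros(4)); x = CuArray(ones(4))
```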

17 of 26

Array programming

using LinearAlgebra

loss(w, b, x, y) = sum(abs2, y - (w*x .+ b)) / size(y, 2)
loss∇w(w, b, x, y) = ...
lossdb(w, b, x, y) = ...

function train(w, b, x, y; lr=.1)
    w -= lmul!(lr, loss∇w(w, b, x, y))
    b -= lr * lossdb(w, b, x, y)
    return w, b
end

n = 100; p = 10
x = randn(n, p)'
y = sum(x[1:5, :]; dims=1) .+ randn(n)'*0.1
w = 0.0001*randn(1, p)
b = 0.0

for i = 1:50
    w, b = train(w, b, x, y)
end

# Same code on the GPU:
x = CuArray(x)
y = CuArray(y)
w = CuArray(w)
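The gradients are elided on the slide; a hypothetical fill-in (hand-derived for this squared-error loss, not the talk's actual code) might look like:

```julia
# Hand-derived gradients for loss(w,b,x,y) = sum(abs2, y - (w*x .+ b)) / m
# where m = size(y, 2). Names and derivation are ours for illustration.
loss(w, b, x, y)   = sum(abs2, y - (w*x .+ b)) / size(y, 2)
loss∇w(w, b, x, y) = -2 .* ((y .- (w*x .+ b)) * x') ./ size(y, 2)
lossdb(w, b, x, y) = -2 * sum(y .- (w*x .+ b)) / size(y, 2)

# Shapes: w is 1×p, x is p×n, y is 1×n, so loss∇w returns 1×p like w.
w = randn(1, 3); b = 0.1
x = randn(3, 5); y = randn(1, 5)
g = loss∇w(w, b, x, y)
```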


18 of 26

Sustain

Examples:

  • Working with Sherry Li (LBNL) to provide flexible bindings to SuperLU
  • PETSc.jl — Community effort to expose PETSc
  • MAGMA.jl
  • LAMMPS.jl

19 of 26

Julia’s development cycle

  • 0.0 — 2009
  • 0.1 — Feb 14, 2012
  • 0.2 — Nov 18, 2013
  • 0.3 — Aug 20, 2014
  • 0.4 — Oct 09, 2015
  • 0.5 — Sep 19, 2016
  • 0.6 — Jun 19, 2017
  • 0.7 — Aug 8, 2018
  • 1.0 — Aug 8, 2018
  • 1.1 — Jan 21 2019
  • 1.2 — Aug 19, 2019
  • 1.3 — Nov 26, 2019
  • 1.4 — Mar 21, 2020
  • 1.5 — Aug 01, 2020
  • 1.6 — Mar 24, 2021 (LTS)
  • 1.7 — Nov 30, 2021
  • 1.8 — Aug 17, 2022
  • 1.9 — May 07, 2023

https://julialang.org/blog/2023/04/julia-1.9-highlights

https://julialang.org/blog/2019/08/release-process

[Timeline figure annotating the releases with milestone projects: Celeste.jl, CLIMA, CUDA, CESMIX, AMDGPU, oneAPI, Metal.jl]

20 of 26

Ecosystem verification

CI/Benchmarking

  • PkgEval: nightly + on-demand compiler verification
  • Uses the package manager & information from the "General" registry
  • Compares two states and highlights changes
  • Needs manual examination to evaluate results

Packages all have similar structure.

95% of Julia packages in the registry had some form of CI (youtube.com/watch?v=9YWwiFbaRx8)

Packages often test against Julia nightly

  • JuliaLab is hosting CI for GPU codes. (NVIDIA/AMD/Intel/Apple)
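That uniform structure is a registry convention, and it is what makes registry-wide testing like PkgEval tractable; a generic sketch (package name hypothetical):

```
MyPackage/
├── Project.toml        # name, UUID, version, [compat] bounds
├── src/MyPackage.jl    # module entry point
└── test/runtests.jl    # run by Pkg.test() and by PkgEval
```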

Benchmarks:

  • Surprisingly challenging
  • Performance tracking for Julia Base: nightly + on-demand
  • Could be much better; we need better visibility into trends and more measured metrics
  • Bespoke benchmarks for specialty concerns: GC, compilation time, latency

21 of 26

22 of 26

Who develops Julia?

https://julialang.org/blog/2019/02/julia-entities/

JuliaLab@MIT: Mixed CS/Applied Mathematics research lab

  • From GPU&HPC to runtime and compiler improvements
  • Automatic differentiation
  • SciML/Differential Equations

JuliaHub (née JuliaComputing)

  • Products: PUMAS; CEDAR; JuliaSim; JuliaHub

Community of individuals, research groups and companies

23 of 26

Industry, academic, and national lab partners have brought these improved compiler tools to production

Automating parallelism and performance is transforming all collaborations

Faster Drug Development

More efficient batteries

Energy Efficient Buildings

Climate modeling for improved agriculture

24 of 26

JuliaCon: Yearly user and developer meetup

25th to 29th of July, 2023
Cambridge, MA

https://juliacon.org/2023

25 of 26

Monthly HPC user group call

  • Open to all / Birds of a feather style
  • Provides a repeated point of contact for folks who want to use Julia in HPC
  • Concentrated development efforts: example MPI.jl

4th Tuesday of the month at 2pm ET (next: May 23rd)
Details: https://julialang.org/community/ — Events

26 of 26
