1 of 26

Embrace... Extend... Extinguish...

Valentin Churavy
vchuravy@mit.edu

2 of 26

3 of 26

Yet another high-level language?

Dynamically typed, high-level syntax

Open-source, permissive license

Built-in package manager

Interactive development

julia> function mandel(z)
           c = z
           maxiter = 80
           for n = 1:maxiter
               if abs(z) > 2
                   return n-1
               end
               z = z^2 + c
           end
           return maxiter
       end

julia> mandel(complex(.3, -.6))
14

4 of 26

Yet another high-level language?

Typical features

Dynamically typed, high-level syntax

Open-source, permissive license

Built-in package manager

Interactive development

Unusual features

Great performance!

JIT compilation in an AOT style ("just-ahead-of-time")

Most of Julia is written in Julia

Reflection and metaprogramming

5 of 26

Embrace...

Julia is built on the shoulders of giants.

  • LLVM as a core piece of compiler infrastructure
  • Easy interoperability
    • C/Fortran
    • Python/R/...
    • Not so easy: C++
  • @cfunction creates a C-function pointer to use as a callback
    • Currently (≤ v1.8) it can only be called from a Julia-managed thread (except for very carefully crafted exceptions)
    • Julia v1.9 adds support for callbacks from arbitrary foreign threads
    • Calling Julia from C with MPI: https://github.com/omlins/libdiffusion
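Both directions of the interop fit in a few lines; the classic libc `qsort` example (mirroring the one in the Julia manual, not tied to the talk's codebase) has Julia call C, and C call back into a Julia comparison function through an `@cfunction` pointer:

```julia
# Julia callback that C will invoke; ::Cint return type matches
# qsort's expected int(*)(const void*, const void*) comparator.
function mycompare(a, b)::Cint
    return (a < b) ? -1 : ((a > b) ? +1 : 0)
end

# Create a C function pointer from the Julia function.
cmp_c = @cfunction(mycompare, Cint, (Ref{Cdouble}, Ref{Cdouble}))

A = [3.0, 1.0, 2.0]
# Call libc qsort, passing the Julia-backed callback.
ccall(:qsort, Cvoid, (Ptr{Cdouble}, Csize_t, Csize_t, Ptr{Cvoid}),
      A, length(A), sizeof(eltype(A)), cmp_c)
# A is now [1.0, 2.0, 3.0]
```

Before v1.9 this pattern was safe precisely because `qsort` invokes the callback on the calling (Julia-managed) thread; callbacks from foreign threads needed the v1.9 support mentioned above.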

6 of 26

Example UCX.jl

function ucp_put_nb(ep, buffer, length, remote_addr, rkey, cb)
    ccall(
        (:ucp_put_nb, libucp),
        ucs_status_ptr_t,
        (ucp_ep_h, Ptr{Cvoid}, Csize_t, UInt64, ucp_rkey_h, ucp_send_callback_t),
        ep, buffer, length, remote_addr, rkey, cb)
end

function send_callback(req::Ptr{Cvoid}, status::API.ucs_status_t, user_data::Ptr{Cvoid})
    @assert user_data !== C_NULL
    request = UCXRequest(user_data)
    request.status = status
    notify(request)
    API.ucp_request_free(req)
    nothing
end

function put!(ep::UCXEndpoint, request, data::Ptr, nbytes, remote_addr, rkey)
    cb = @cfunction(send_callback, Cvoid, (Ptr{Cvoid}, API.ucs_status_t, Ptr{Cvoid}))
    ptr = ucp_put_nb(ep, data, nbytes, remote_addr, rkey, cb)
    return handle_request(request, ptr)
end

function put!(ep::UCXEndpoint, buffer, nbytes, remote_addr, rkey)
    request = UCXRequest(ep, buffer) # rooted through ep.worker
    GC.@preserve buffer begin
        data = pointer(buffer)
        put!(ep, request, data, nbytes, remote_addr, rkey)
    end
end

7 of 26

Introspection and staged metaprogramming

Each stage of compilation can be inspected and, via metaprogramming, generated:

  • Code/AST: @code_lowered, @edit, @which; generate with macros, string macros, generated functions, Cassette.jl passes
  • Typed IR: @code_warntype, @code_typed optimize=false, @code_typed
  • LLVM IR: @code_llvm optimize=false, @code_llvm; generate with llvmcall, LLVM.jl
  • Assembly: @code_native; generate with LLVM.jl @asmcall

Julia: my favorite LLVM frontend
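These entry points are ordinary functions and macros; a minimal sketch of walking down the stack for a trivial function (standard Base/InteractiveUtils tooling, nothing beyond what the slide lists):

```julia
f(x) = 2x + 1

# Typed IR: inference has already resolved the return type.
ci, rettype = only(code_typed(f, (Int,)))
println(rettype)   # Int64 on a 64-bit system

# Lowered AST, LLVM IR, and assembly are shown by the matching macros:
#   @code_lowered f(1)
#   @code_llvm f(1)
#   @code_native f(1)
```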

8 of 26

Extend: Adding capabilities

  • Easy (and performant!) user extensibility through the JIT
  • Access to automatic differentiation for UQ/ML/...
    • Together with Enzyme (AD for Julia/C++/Fortran/...), cross-language AD is feasible
    • Checkpointing.jl: seamless integration of checkpointing for AD
  • Outer-loop/explorative work

The PSAAP-III Center (cesmix.mit.edu) is an example of using Julia for the outer loop, while developing new capabilities.

9 of 26

10 of 26

Automatic Gradient Synthesis (language-independent)

  • Compiler plugin for LLVM that performs reverse-mode automatic differentiation
  • Running AD after the compiler's optimization passes offers clear performance benefits: 4.2× speedup
  • Language-independent AD (C, C++, Fortran, Julia, Rust, Swift, etc.)
  • Simulation code should not be rewritten in machine-learning DSLs, but instead be integrated with AD
  • Spin-off from "representing parallelism in the compiler"

Instead of Rewriting Foreign Code for Machine Learning, Automatically Synthesize Fast Gradients
Moses and Churavy, NeurIPS 2020

[Figure: % slowdown vs. Enzyme]

https://enzyme.mit.edu/

11 of 26

LAMMPS.jl — Prototype for extendability

using LAMMPS

# ...

command(lmp, "fix julia_lj all external pf/callback 1 1")

function compute_force(rsq, itype, jtype)
    coeff = coefficients[itype][jtype]
    r2inv = 1.0/rsq
    r6inv = r2inv^3
    lj1 = coeff[1]
    lj2 = coeff[2]
    return (r6inv * (lj1*r6inv - lj2))*r2inv
end

function compute_energy(rsq, itype, jtype)
    # ...
    return (r6inv * (lj3*r6inv - lj4))
end

# Register external fix
lj = LAMMPS.PairExternal(lmp, "julia_lj", "zero", compute_force, compute_energy, cutoff)

12 of 26

Extinguish

  • You can write fast, scalable, portable Julia code
  • Large-scale HPC applications like Climate Models being built in Julia

eth-cscs/ImplicitGlobalGrid.jl

13 of 26

Julia gets its Power from Extensible Compiler Design

Language design meets efficient execution:

  • Julia: Dynamism and Performance Reconciled by Design (doi:10.1145/3276490)
  • Effective Extensible Programming: Unleashing Julia on GPUs (doi:10.1109/TPDS.2018.2872064)

[Diagram: AST → IR → xPU back ends (CPU, GPU)]

14 of 26

Why Julia for HPC?
Walks like Python, talks like Lisp, runs like Fortran

Rich GPU ecosystem: CUDA.jl, oneAPI.jl, Metal.jl, AMDGPU.jl all feed the same AST → IR → xPU back-end pipeline.

HPC suffers from the many-language problem: domain experts and performance engineers use different programming languages, creating a communication and collaboration bottleneck.

https://www.nature.com/articles/d41586-019-02310-3

15 of 26

Magic of Julia

Abstraction, Specialization, and Multiple Dispatch

  1. Abstraction to obtain generic behavior: encode behavior in the type domain, e.g. transpose(A::Matrix{Float64})::Transpose{Float64,Matrix{Float64}}
  2. Specialization of functions to produce optimal code
  3. Multiple dispatch to select optimized behavior

rand(N, M) * rand(K, M)'   # Matrix * Transpose{Matrix}

compiles to

function mul!(C::Matrix{T}, A::Matrix{T}, tB::Transpose{T,Matrix{T}}, a, b) where {T<:BlasFloat}
    B = tB.parent  # unwrap the lazy transpose; no data was copied
    gemm_wrapper!(C, 'N', 'T', A, B, MulAddMul(a, b))
end

Did I really need to move memory for that transpose? No, I did not! A*B' is the dot product of every row of A with every row of B.
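The lazy wrapper can be observed directly at the REPL; a small sketch using only the LinearAlgebra standard library:

```julia
using LinearAlgebra

A  = rand(3, 4)
At = transpose(A)     # lazy Transpose wrapper; no memory is moved

At isa Transpose      # true: the transpose lives in the type domain
At[2, 1] == A[1, 2]   # true: indexing swaps coordinates on the fly
parent(At) === A      # true: same underlying buffer
```

Because `Transpose` is an ordinary type, dispatch can pick the `'T'` BLAS path above without ever materializing the transposed matrix.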

16 of 26

Parallel programming with 3-character changes: the array-type programming model

Array types encode where memory resides and how code is executed:

A = Matrix{Float64}(undef, 64, 32)     # CPU (Intel, IBM, Apple)
A = CuMatrix{Float64}(undef, 64, 32)   # NVIDIA (CUDA) GPU
A = ROCMatrix{Float64}(undef, 64, 32)  # AMD (ROCm) GPU

Distribute(Blocks(16, 4), A)           # across the nodes of a cluster;
                                       # orthogonal to, and composes with, the types above
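Because the array type is just another type to dispatch on, generic code runs unchanged across these backends. A CPU-only sketch (the GPU types above require the matching hardware; the function name is ours for illustration):

```julia
# Generic kernel: broadcasting dispatches on the array type, so the same
# function works for Matrix, CuMatrix, ROCMatrix, ... without changes.
axpy_generic!(y, a, x) = (y .= a .* x .+ y; y)

y = zeros(4)
x = ones(4)
axpy_generic!(y, 2.0, x)   # y == [2.0, 2.0, 2.0, 2.0]

# On a GPU, only the construction site changes:
#   y = CuArray(zeros(4)); x = CuArray(ones(4))
```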

17 of 26

Array programming

using LinearAlgebra

loss(w, b, x, y) = sum(abs2, y - (w*x .+ b)) / size(y, 2)
loss∇w(w, b, x, y) = ...
lossdb(w, b, x, y) = ...

function train(w, b, x, y; lr=.1)
    w -= lmul!(lr, loss∇w(w, b, x, y))
    b -= lr * lossdb(w, b, x, y)
    return w, b
end

n = 100; p = 10
x = randn(n, p)'
y = sum(x[1:5, :]; dims=1) .+ randn(n)'*0.1
w = 0.0001*randn(1, p)
b = 0.0

for i = 1:50
    w, b = train(w, b, x, y)
end

# Same code on the GPU:
x = CuArray(x)
y = CuArray(y)
w = CuArray(w)
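The gradients are elided on the slide; a hypothetical fill-in (hand-derived for this squared-error loss, not the talk's actual code) might look like:

```julia
# Hand-derived gradients for loss(w,b,x,y) = sum(abs2, y - (w*x .+ b)) / m
# where m = size(y, 2). Names and derivation are ours for illustration.
loss(w, b, x, y)   = sum(abs2, y - (w*x .+ b)) / size(y, 2)
loss∇w(w, b, x, y) = -2 .* ((y .- (w*x .+ b)) * x') ./ size(y, 2)
lossdb(w, b, x, y) = -2 * sum(y .- (w*x .+ b)) / size(y, 2)

# Shapes: w is 1×p, x is p×n, y is 1×n, so loss∇w returns 1×p like w.
w = randn(1, 3); b = 0.1
x = randn(3, 5); y = randn(1, 5)
g = loss∇w(w, b, x, y)
```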


18 of 26

Sustain

Examples:

  • Working with Sherry Li (LBNL) to provide flexible bindings to SuperLU
  • PETSc.jl — Community effort to expose PETSc
  • MAGMA.jl
  • LAMMPS.jl

19 of 26

Julia’s development cycle

  • 0.0 — 2009
  • 0.1 — Feb 14, 2012
  • 0.2 — Nov 18, 2013
  • 0.3 — Aug 20, 2014
  • 0.4 — Oct 09, 2015
  • 0.5 — Sep 19, 2016
  • 0.6 — Jun 19, 2017
  • 0.7 — Aug 8, 2018
  • 1.0 — Aug 8, 2018
  • 1.1 — Jan 21 2019
  • 1.2 — Aug 19, 2019
  • 1.3 — Nov 26, 2019
  • 1.4 — Mar 21, 2020
  • 1.5 — Aug 01, 2020
  • 1.6 — Mar 24, 2021 (LTS)
  • 1.7 — Nov 30, 2021
  • 1.8 — Aug 17, 2022
  • 1.9 — May 07, 2023

https://julialang.org/blog/2023/04/julia-1.9-highlights

https://julialang.org/blog/2019/08/release-process

[Timeline figure annotating the releases with milestone projects: Celeste.jl, CLIMA, CUDA, CESMIX, AMDGPU, oneAPI, Metal.jl]

20 of 26

Ecosystem verification

CI/Benchmarking

  • PkgEval: nightly + on-demand compiler verification
  • Uses the package manager & information from the "General" registry
  • Compares two states and highlights changes
  • Needs manual examination to evaluate results

Packages all have similar structure.

95% of Julia packages in the registry had some form of CI (youtube.com/watch?v=9YWwiFbaRx8)

Packages often test against Julia nightly

  • JuliaLab is hosting CI for GPU codes. (NVIDIA/AMD/Intel/Apple)
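That uniform structure is a registry convention, and it is what makes registry-wide testing like PkgEval tractable; a generic sketch (package name hypothetical):

```
MyPackage/
├── Project.toml        # name, UUID, version, [compat] bounds
├── src/MyPackage.jl    # module entry point
└── test/runtests.jl    # run by Pkg.test() and by PkgEval
```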

Benchmarks:

  • Surprisingly challenging
  • Performance tracking for Julia Base: nightly + on-demand
  • Could be much better; we need better visibility into trends and more measured metrics
  • Bespoke benchmarks for specialty concerns: GC, compilation time, latency

21 of 26

22 of 26

Who develops Julia?

https://julialang.org/blog/2019/02/julia-entities/

JuliaLab@MIT: Mixed CS/Applied Mathematics research lab

  • From GPU&HPC to runtime and compiler improvements
  • Automatic differentiation
  • SciML/Differential Equations

JuliaHub (née JuliaComputing)

  • Products: PUMAS; CEDAR; JuliaSim; JuliaHub

Community of individuals, research groups and companies

23 of 26

Industry, academic, and national lab partners have brought these improved compiler tools to production

Automating parallelism and performance is transforming all collaborations

Faster Drug Development

More efficient batteries

Energy Efficient Buildings

Climate modeling for improved agriculture

24 of 26

JuliaCon: Yearly user and developer meetup

25th to 29th of July, 2023
Cambridge, MA

https://juliacon.org/2023

25 of 26

Monthly HPC user group call

  • Open to all / Birds of a feather style
  • Provides a repeated point of contact for folks who want to use Julia in HPC
  • Concentrated development efforts: example MPI.jl

4th Tuesday of the month at 2pm ET (next: May 23rd)
Details: https://julialang.org/community/ — Events

26 of 26
