Dask and Genomics

Dask Life Science Workshop

Dask Distributed Summit, May 19-21 2021

Tom White

UK Biobank scale:

10M variants x 500k samples
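
A quick back-of-envelope calculation (a sketch, not from the slides) shows what this scale implies for storage, and why the dtype choices discussed later matter:

```python
# Storage for a UK Biobank-scale genotype matrix (10M variants x 500k
# samples), assuming one byte per genotype call (int8).
n_variants = 10_000_000
n_samples = 500_000
bytes_int8 = n_variants * n_samples      # 5e12 bytes, about 5 TB
bytes_float32 = bytes_int8 * 4           # about 20 TB if stored as float32
print(f"{bytes_int8 / 1e12:.0f} TB (int8) vs {bytes_float32 / 1e12:.0f} TB (float32)")
```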

Statistical genetics toolkit in Python

Why sgkit?

  • high-quality native PyData library for statistical and population genetics
  • scale down - run on a single machine (like PLINK)
  • scale out - run on a cluster (like Hail)
  • from Oxford Big Data Institute (scikit-allel, tskit) and Related Sciences

5 Dask Challenges

1. Representing missing data for ints

  • Genotype data uses small ints (e.g. 0, 1) plus a “missing” value
  • Storing as np.float32 (so NaN can mark missing) uses 4x as much memory as needed
  • Dask, NumPy, and CuPy don't support masked arrays uniformly
  • As a workaround we use a sentinel value (-1)
  • This needs special-casing (e.g. in Numba)

Wish: support for masked arrays in CuPy, or a float8 dtype
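
A minimal NumPy sketch of the sentinel workaround described above (the array values are made up for illustration):

```python
import numpy as np

# Hypothetical genotype call matrix: rows are variants, columns are samples.
# Small ints 0/1/2 are allele counts; the sentinel -1 marks a missing call.
calls = np.array([[0, 1, -1],
                  [2, -1, 0]], dtype=np.int8)

# Every statistic has to special-case the sentinel, e.g. a per-variant
# allele sum that ignores missing calls:
valid = calls != -1
allele_sum = np.where(valid, calls, 0).sum(axis=1)

# The float alternative (NaN for missing) costs 4x the memory of int8:
as_float = np.where(valid, calls, np.nan).astype(np.float32)
```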

2. Optimizing chunking for performance

  • Getting the chunking right is critical to performance
  • We have to manually examine the shape of every intermediate array
  • Often the best chunking strategy is counterintuitive
  • Dask rechunking is unpredictable in time and memory use, so we use the rechunker library instead

Wish: better tools for choosing chunking; incorporate rechunker's ideas into Dask
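
A toy Dask sketch of how chunk layout shapes a computation (array sizes are illustrative; `rechunk` is shown for contrast):

```python
import dask.array as da

# Stand-in for a (variants x samples) genotype array. With chunks spanning
# all samples, a per-variant reduction needs no communication between
# chunks; sample-wise chunking would force a shuffle first.
x = da.zeros((10_000, 2_000), dtype="int8", chunks=(1_000, 2_000))
per_variant = x.sum(axis=1)

# Manually inspecting intermediate chunking, as the slide describes:
print(x.chunks)            # ((1000, ..., 1000), (2000,))
print(per_variant.chunks)  # ((1000, ..., 1000),)

# Rechunking changes the layout; for large on-disk arrays the standalone
# rechunker library does this with bounded memory.
y = x.rechunk((10_000, 200))
```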

3. Scaling linear algebra operations

Wish: a SLAB-like scalable linear algebra benchmark

https://github.com/ADALabUCSD/SLAB
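
As an illustration of one linear algebra pattern that already scales well (a sketch, not sgkit code): tall-skinny normal equations, where a large matrix of samples by covariates reduces to a tiny Gram matrix via a tree reduction over row chunks:

```python
import dask.array as da

# X is (samples x covariates); X.T @ X is only (covariates x covariates),
# so Dask can compute it as a reduction over X's row chunks without ever
# materializing X on one machine.
X = da.random.random((100_000, 25), chunks=(10_000, 25))
gram = (X.T @ X).compute()
print(gram.shape)  # (25, 25)
```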

4. Reasoning about Dask execution

  • The task-level view gives an overwhelming amount of detail
  • We have to manually downsample to a subset of the input data to see what is going on

Wish: high-level visualization of computation (including chunking)

https://github.com/dask/dask/issues/7141
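
A sketch of the manual workaround: compare the task counts of the full graph and a sliced-down version (`__dask_graph__` returns the task graph as a mapping, so `len` counts tasks):

```python
import dask.array as da

x = da.zeros((10_000, 10_000), chunks=(1_000, 1_000))
y = (x + 1).sum(axis=0)

# The task-level graph has hundreds of nodes - too much detail to eyeball.
n_tasks = len(y.__dask_graph__())
print(n_tasks)

# Manual workaround from the slide: slice down to a few chunks and inspect
# that much smaller graph (or call .visualize() on it, which needs graphviz).
small = (x[:2_000, :2_000] + 1).sum(axis=0)
print(len(small.__dask_graph__()))
```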

5. Choosing a stable Dask release

  • Some versions and combinations of Dask + Distributed were unusable for us
  • Problems often only become apparent at scale
  • We want “blessed” releases we can rely on

Wish: “blessed” stable releases and quality testing by the vendors, in partnership with the community
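
One mitigation until then is pinning a matched pair of releases that you have validated yourself at scale; the versions below are purely illustrative, not a recommendation:

```shell
# Illustrative only: pin Dask and Distributed to a matched pair that has
# been tested together, instead of tracking the latest release.
pip install "dask==2021.5.0" "distributed==2021.5.0"
```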