Dask and Genomics

Dask Life Science Workshop

Dask Distributed Summit, May 19-21 2021

Tom White

UK Biobank scale:

10M variants x 500k samples
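
A quick back-of-envelope calculation (a sketch, not from the slides) shows what this scale implies for storage, and why the dtype choices discussed later matter:

```python
# Storage for a UK Biobank-scale genotype matrix (10M variants x 500k
# samples), assuming one byte per genotype call (int8).
n_variants = 10_000_000
n_samples = 500_000
bytes_int8 = n_variants * n_samples      # 5e12 bytes, about 5 TB
bytes_float32 = bytes_int8 * 4           # about 20 TB if stored as float32
print(f"{bytes_int8 / 1e12:.0f} TB (int8) vs {bytes_float32 / 1e12:.0f} TB (float32)")
```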

Statistical genetics toolkit in Python

Why sgkit?

  • high-quality native PyData library for statistical and population genetics
  • scale down - run on a single machine (like PLINK)
  • scale out - run on a cluster (like Hail)
  • from Oxford Big Data Institute (scikit-allel, tskit) and Related Sciences

5 Dask Challenges

1. Representing missing data for ints

  • Genotype data uses small ints (e.g. 0, 1) plus a “missing” value
  • Storing as np.float32 (so NaN can mark missing) uses 4x as much memory as needed
  • Dask, NumPy, and CuPy don't support masked arrays uniformly
  • As a workaround we use a sentinel value (-1)
  • This needs special-casing (e.g. in Numba)

Wish: support for masked arrays in CuPy, or a float8 dtype
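
A minimal NumPy sketch of the sentinel workaround described above (the array values are made up for illustration):

```python
import numpy as np

# Hypothetical genotype call matrix: rows are variants, columns are samples.
# Small ints 0/1/2 are allele counts; the sentinel -1 marks a missing call.
calls = np.array([[0, 1, -1],
                  [2, -1, 0]], dtype=np.int8)

# Every statistic has to special-case the sentinel, e.g. a per-variant
# allele sum that ignores missing calls:
valid = calls != -1
allele_sum = np.where(valid, calls, 0).sum(axis=1)

# The float alternative (NaN for missing) costs 4x the memory of int8:
as_float = np.where(valid, calls, np.nan).astype(np.float32)
```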

2. Optimizing chunking for performance

  • Getting the chunking right is critical to performance
  • We have to manually examine the shape of every intermediate array
  • Often the best chunking strategy is counterintuitive
  • Dask rechunking is unpredictable in time and memory use, so we use the rechunker library instead

Wish: better tools for choosing chunking; incorporate rechunker's ideas into Dask
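
A toy Dask sketch of how chunk layout shapes a computation (array sizes are illustrative; `rechunk` is shown for contrast):

```python
import dask.array as da

# Stand-in for a (variants x samples) genotype array. With chunks spanning
# all samples, a per-variant reduction needs no communication between
# chunks; sample-wise chunking would force a shuffle first.
x = da.zeros((10_000, 2_000), dtype="int8", chunks=(1_000, 2_000))
per_variant = x.sum(axis=1)

# Manually inspecting intermediate chunking, as the slide describes:
print(x.chunks)            # ((1000, ..., 1000), (2000,))
print(per_variant.chunks)  # ((1000, ..., 1000),)

# Rechunking changes the layout; for large on-disk arrays the standalone
# rechunker library does this with bounded memory.
y = x.rechunk((10_000, 200))
```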

3. Scaling linear algebra operations

Wish: a SLAB-like scalable linear algebra benchmark

https://github.com/ADALabUCSD/SLAB
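
As an illustration of one linear algebra pattern that already scales well (a sketch, not sgkit code): tall-skinny normal equations, where a large matrix of samples by covariates reduces to a tiny Gram matrix via a tree reduction over row chunks:

```python
import dask.array as da

# X is (samples x covariates); X.T @ X is only (covariates x covariates),
# so Dask can compute it as a reduction over X's row chunks without ever
# materializing X on one machine.
X = da.random.random((100_000, 25), chunks=(10_000, 25))
gram = (X.T @ X).compute()
print(gram.shape)  # (25, 25)
```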

4. Reasoning about Dask execution

  • The task-level view gives an overwhelming amount of detail
  • We have to manually downsample to a subset of the input data to see what is going on

Wish: high-level visualization of computation (including chunking)

https://github.com/dask/dask/issues/7141
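
A sketch of the manual workaround: compare the task counts of the full graph and a sliced-down version (`__dask_graph__` returns the task graph as a mapping, so `len` counts tasks):

```python
import dask.array as da

x = da.zeros((10_000, 10_000), chunks=(1_000, 1_000))
y = (x + 1).sum(axis=0)

# The task-level graph has hundreds of nodes - too much detail to eyeball.
n_tasks = len(y.__dask_graph__())
print(n_tasks)

# Manual workaround from the slide: slice down to a few chunks and inspect
# that much smaller graph (or call .visualize() on it, which needs graphviz).
small = (x[:2_000, :2_000] + 1).sum(axis=0)
print(len(small.__dask_graph__()))
```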

5. Choosing a stable Dask release

  • Some versions and combinations of Dask + Distributed were unusable for us
  • Problems often only become apparent at scale
  • We want “blessed” releases we can rely on

Wish: “blessed” stable releases and quality testing by the vendors, in partnership with the community
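
One mitigation until then is pinning a matched pair of releases that you have validated yourself at scale; the versions below are purely illustrative, not a recommendation:

```shell
# Illustrative only: pin Dask and Distributed to a matched pair that has
# been tested together, instead of tracking the latest release.
pip install "dask==2021.5.0" "distributed==2021.5.0"
```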