1 of 44

Tapestry: Compressed Sensing for Pooled Testing of COVID-19 Samples

Ajit Rajwade

In collaboration with:

Manoj Gopalkrishnan

Sabyasachi Ghosh, Rishi Agarwal, Mohammad Ali Rehan, Shreya Pathak, Yash Gupta, Pratyush Agarwal, Ritika Goyal, Sarthak Consul, Nimay Gupta, Prantik Chakraborty, Ritesh Goenka, Krishna Vemula, Akanksha Vyas

Interaction/collaboration with Profs. Dror Baron and Chau-Wai Wong (NCSU) , Chandra Murthy (IISc), Krishna Narayanan (TAMU)

2 of 44

Skeleton of the Talk

  • Introduction and Motivation
  • Computational Problem
  • Noise Model
  • Algorithms
  • Matrix Design
  • Results
  • Discussion: Practical Aspects
  • Discussion: Previous Work
  • Summary
  • More prior knowledge: contact tracing

2

3 of 44

Introduction and Motivation

  • The COVID-19 pandemic began in Wuhan, China.

  • COVID-19 has infected more than 139 million people worldwide

  • RT-PCR (Reverse Transcription Polymerase Chain Reaction) is the most popular method for testing a person for COVID-19

  • Dearth of resources for widespread testing: time, skilled manpower, reagents, testing kits, etc.

3

4 of 44

Introduction and Motivation

  • RT-PCR: naso- or oro-pharyngeal swab taken, mixed in liquid medium, tested in RT-PCR machine

  • Can we pool (mix) subsets of n samples and test the pools to save resources?

  • #pools (m) < #samples (n)

  • Equal portions of participating samples are taken for creating any pool.

  • A negative pool test implies all contributing samples are negative (non-infected) .

  • More work to be done if the pool tests positive (infected).

4

5 of 44

Computational problem

5

y

x

A

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

Aij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people

m x 1

m x n, m << n

n x 1

Estimate x given y and A

η

Noise vector

m x 1

6 of 44

Designed pooling

7 of 44

RT-PCR Process

  • Viral RNA molecules in the liquid medium converted to cDNA – reverse transcription

  • DNA primers complementary to viral cDNA are then added

  • These primers attach themselves to sections of the cDNA (if the virus is present in the sample).

  • The cDNA then undergoes exponential amplification in the RT-PCR machine.

7

8 of 44

RT-PCR Process

  • The exponential amplification occurs during cycles of alternate heating and cooling in an RT-PCR machine.

  • cDNA produces fluorescence in response to markers – observable on a computer screen.

  • The time at which fluorescence exceeds a machine-specified threshold is called the cycle threshold (CT) – measured by the RT-PCR machine.

8

9 of 44

RT-PCR Process

  • Smaller CT value = greater viral load.

  • Typical CT values range from 16 to 40; can detect even single molecules (CT ~ 40)

  • Single RT-PCR test: 3-4 hours (non-adaptive tests are desirable)

9

10 of 44

Noise Model: RT-PCR

10

zi is the noisy value of initial viral count of the i-th pool, 1 <= i <= m and noisy observed value (𝜏i) of time at which fluorescence is observed. The corresponding noiseless values are Aix and ti respectively where x is the sparse vector of initial viral loads.

x is the vector of n viral loads. If the viral loads are high, then w is approximately equal to x elementwise as the standard deviation of a Poisson distribution with large mean is relatively negligible (as it equals the square root of the mean)

11 of 44

Noise model: RT-PCR

11

y

x

A

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

Aij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people

m x 1

m x n, m << n

n x 1

Estimate x given y and A

η

Noise vector

m x 1

log

log

12 of 44

RT-PCR: 0-valued measurements

  • If the viral load zi in the i-th pool is 0, the fluorescence threshold is never crossed (negative test).

  • In this case, all participating samples xj (i.e. for which Aij = 1) are non-infected and can be removed from further computation.

  • If viral loads in infected samples are high, there is no conflation between negative and positive tests (negative result is noiseless).

12

13 of 44

RT-PCR: 0-valued measurements

  • “Removing negative samples” ≈ combinatorial group testing algorithm called COMP (Combinatorial Orthogonal Matching Pursuit)

  • But COMP regards the other samples to be positive (possible false positives)

  • After discarding negative samples, we may also get a few sure positives – positive tests with only one non-discarded contributing sample.

13

14 of 44

Compressed Sensing Algorithms

  • We now have a smaller-sized y, A, x after discarding the negatives in x, and pruning y, A accordingly.

14

y

x

A

m' x 1

m’ x n’

n’ x 1

η

m’ x 1

log

log

15 of 44

Compressed Sensing Algorithms

  • Run after the “COMP-step”
  • Non-negative LASSO

  • Non-negative Orthogonal Matching Pursuit

  • Sparse Bayesian Learning (E-M algorithm)

  • Brute-force search (for small k only)

15

16 of 44

Mixing matrix design

  • Many methods exist:
  • (*) Minimize mutual coherence of A
  • (#) Maximize mutual information between Ax and x [requires estimating high-D PDFs from representative signals]
  • ($) Minimize MMSE assuming Gaussian noise in y and GMM for x [requires GMM fitting on representative signals].

16

(*) Abdolghosemi et al, “On the optimization of the measurement matrix for CS”

(#) L Wang et al, “Information-theoretic measurement matrix design”

($) Renna et al, “Reconstruction of Signals Drawn from a Gaussian Mixture via Noisy Compressive Measurements”

17 of 44

Mixing matrix design

  • Constraints on A (size m x n):
  • Should allow for unique recovery of sparse x from A and noisy y : no sparse vector except 0 should lie in nullspace(A)
  • Non-negative
  • Binary and sparse (for ease of sample pooling in the lab)
  • Should be flexible, i.e. should adapt easily to slightly different sizes

17

18 of 44

Mixing matrix design: Expanders

  • Expander matrices: obey RIP-1, i.e. preserve norm of k-sparse vectors

18

Every sufficiently small subset of samples should participate in a sufficiently large number of pools

19 of 44

Mixing matrix design: Expanders

  • A randomly generated left-regular graph with left-degree d is an expander – with high probability.

  • Can be converted into a graph/matrix that is left-regular as well as right-regular with almost same parameters.

  • But difficult to verify whether a given matrix is an expander.

19

20 of 44

Mixing matrix design: deterministic matrices

  • Many methods for deterministic sensing matrix design (*)
  • We use Steiner triple matrices of size m x n where n = C(m,2)/3
  • Each column has at most 3 ones (out of m elements)
  • No two columns have a dot-product more than 1 (good mutual coherence)

20

(*) Devore, Deterministic constructions of compressed sensing matrices, 2007

(*) Naidu et al, Deterministic CS matrices: construction via Euler squares, 2016

21 of 44

Mixing matrix design: deterministic matrices

  • In particular, we use Kirkman triples – a type of Steiner triples where each block of consecutive m/3 columns (starting from index 0) sum up to 1.

  • Kirkman triple matrices have flexible sizes: n can be any integer multiple of m/3 and still each pool has the same number of samples.

21

22 of 44

Results

  • Synthetic/simulated data – refer our arxiv paper

  • Real data from labs at NCBS and Harvard, obtained by artificially injecting viral RNA copies in samples

  • Pools created using our designed mixing matrix A.

  • Results on blinded experiments (no idea of ground truth before reporting of results, or during creation of pools by technicians)

22

23 of 44

Results: figures of Merit

  • RMSE between estimated and true viral load

  • Sensitivity = #correctly detected positives/#true positives

  • Specificity = #correctly detected negatives/#true negatives

23

24 of 44

24

k = # of infected samples out of n

25 of 44

25

Experiment in a lab at Harvard, using a liquid handling robot

26 of 44

Clinical Trials with COVID19 samples

  • 72 samples in 36 tests with upto 12 positives
  • 105 samples in 45 tests with upto 15 positives
  • 320 samples in 48 tests with 5 positives
  • 500 samples in 60 tests with 10 positives
  • 961 samples in 93 tests with 10 positives

  • Very small number of false positives, no false negatives

27 of 44

A word about Dorfman Pooling

27

  • Testing in two stages (one after the other)
  • Adaptive method
  • First round: create (nk)0.5 pools of size (n/k)0.5 each; k = #infected people
  • Second round: individual testing of each person in a positive pool
  • Two of rounds of RT-PCR = time consuming

28 of 44

28

Comparing COMP-SBL (one-stage) with two-stage Dorfman pooling

k = # of infected samples out of n

29 of 44

Outputs

  • We have built a system called Tapestry.

  • Phone App called Byom.

  • Papers on medarxiv and arxiv.

  • Work cited in newsletters of IEEE Signal Processing Society and SIAM.

  • David Donoho’s talk at SIAM

29

30 of 44

31 of 44

Summary of Tapestry

  • Tapestry is non-adaptive, i.e. single stage (unlike Dorfman)

  • Tapestry has in-built methods to estimate number of infected people (k) from the RT-PCR test results on pools (resort to COMP if k is too high)

  • Tapestry handles noisy measurements.

  • Unlike some binary group-testing methods, Tapestry measures viral loads

31

32 of 44

Summary of Tapestry

  • Algorithmic parameters auto-tuned by cross-validation

  • Algorithms do not use any prior knowledge of k

  • In our design of A, each sample contributes to exactly 3 pools. Each pool contains contributions from ~40 samples.

32

33 of 44

Discussion: Practical Aspects

  • How many measurements? O(k log n) for random matrices, O(k2) for deterministic

  • Empirically, we see that k log n measurements are good enough for many of our results

  • How to predict minimum number of required measurements?

33

34 of 44

Discussion: Practical Aspects

  • Can you guarantee no false negatives for non-adaptive pooled tests with minimal false positives?

  • More accurate noise models?

34

35 of 44

Discussion: Related Work

  • (#) Work from Noam Shental’s group (LASSO with Reed-Solomon codes)

  • (*) Work from W. Xu’s group (synthetic data only)

  • ($) Binary PCR work by Dror Baron

  • (&) Work by Krishna Narayanan: two stage, adaptive

35

(#) N. Shental, S. Levy, S. Skorniakov, V. Wuvshet, Y. Shemer-Avni, A. Porgador, and T. Hertz, “Efficient high throughput SARS-CoV-2 testing to detect asymptomatic carriers

(*) J. Yi, R. Mudumbai, and W. Xu, “Low-cost and high-throughput testing of covid-19 viruses and antibodies via compressed sensing: System concepts and computational experiments

($) J. Zhu, K. Rivera, and D. Baron, "Noisy Pooled PCR for Virus Testing

(&) Heiderzadeh and Narayanan, “Two-stage adaptive pooling with RT-qPCR for COVID-19 screening

36 of 44

More prior knowledge: families

36

y

x

A

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

Aij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people - divided into disjoint blocks (eg: families)

m x 1

m x n, m << n

n x 1

Estimate x given y and A

η

Noise vector

m x 1

log

log

Discussions with Dror Baron.

37 of 44

Group sparsity

37

L1 norm is a softened version of the L0 norm – which considers sparsity of x [we assumed that the number of infected individuals is small]

The group sparsity norm considers the number of infected families and assumes it to be small. Counting the number of infected families is called the group-L0 norm but it leads to NP-hard problems (just like the L0 norm). So we soften it to the group-L1 norm. NOTE: xGi is a vector that contains the viral load of all members of the i-th family. It is a subvector of x.

38 of 44

38

# infected people --> # infected families

Fully Infected Families

39 of 44

39

Partially Infected Families

40 of 44

Even more prior knowledge: contact tracing

40

y

x

A

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

Aij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people - divided into blocks (eg: families) and you have contact tracing information

m x 1

m x n, m << n

n x 1

Estimate x given y and A

η

Noise vector

m x 1

log

log

A

B

D

C

E

F

G

41 of 44

Even more prior knowledge: contact tracing

41

42 of 44

More details about the contact tracing work

  • Ritesh Goenka, Shu-Jie Cao, Chau-Wai Wong, Ajit Rajwade, Dror Baron,"Contact Tracing Enhances the Efficiency of COVID-19 Group Testing", ICASSP 2021 (accepted to Special session on "Data Science Methods for COVID-19"). Arxiv version available.

  • Some results on false positive rate (FPR) and false negative rate (FNR) on next slide.

42

43 of 44

43

44 of 44

Thanks for listening! �Questions?�Website: http://www.cse.iitb.ac.in/~ajitvr