1 of 44

Tapestry: Compressed Sensing for Pooled Testing of COVID-19 Samples

Ajit Rajwade

In collaboration with:

Manoj Gopalkrishnan

Sabyasachi Ghosh, Rishi Agarwal, Mohammad Ali Rehan, Shreya Pathak, Yash Gupta, Pratyush Agarwal, Ritika Goyal, Sarthak Consul, Nimay Gupta, Prantik Chakraborty, Ritesh Goenka, Krishna Vemula, Akanksha Vyas

Interaction/collaboration with Profs. Dror Baron and Chau-Wai Wong (NCSU) , Chandra Murthy (IISc), Krishna Narayanan (TAMU)

2 of 44

Skeleton of the Talk

Introduction and Motivation
Computational Problem
Noise Model
Algorithms
Matrix Design
Results
Discussion: Practical Aspects
Discussion: Previous Work
Summary
More prior knowledge: contact tracing

3 of 44

Introduction and Motivation

The COVID-19 pandemic began in Wuhan, China.

COVID-19 has infected more than 139 million people worldwide

RT-PCR (Reverse Transcription Polymerase Chain Reaction) is the most popular method for testing a person for COVID-19

Dearth of resources for widespread testing: time, skilled manpower, reagents, testing kits, etc.

4 of 44

Introduction and Motivation

RT-PCR: naso- or oro-pharyngeal swab taken, mixed in liquid medium, tested in RT-PCR machine

Can we pool (mix) subsets of n samples and test the pools to save resources?

#pools (m) < #samples (n)

Equal portions of participating samples are taken for creating any pool.

A negative pool test implies all contributing samples are negative (non-infected) .

More work to be done if the pool tests positive (infected).

5 of 44

Computational problem

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

A_ij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people

m x 1

m x n, m << n

n x 1

Estimate x given y and A

Noise vector

m x 1

6 of 44

Designed pooling

7 of 44

RT-PCR Process

Viral RNA molecules in the liquid medium converted to cDNA – reverse transcription

DNA primers complementary to viral cDNA are then added

These primers attach themselves to sections of the cDNA (if the virus is present in the sample).

The cDNA then undergoes exponential amplification in the RT-PCR machine.

8 of 44

RT-PCR Process

The exponential amplification occurs during cycles of alternate heating and cooling in an RT-PCR machine.

cDNA produces fluorescence in response to markers – observable on a computer screen.

The time at which fluorescence exceeds a machine-specified threshold is called the cycle threshold (CT) – measured by the RT-PCR machine.

Thermofisher document

9 of 44

RT-PCR Process

Smaller CT value = greater viral load.

Typical CT values range from 16 to 40; can detect even single molecules (CT ~ 40)

Single RT-PCR test: 3-4 hours (non-adaptive tests are desirable)

10 of 44

Noise Model: RT-PCR

zi is the noisy value of initial viral count of the i-th pool, 1 <= i <= m and noisy observed value (𝜏_i) of time at which fluorescence is observed. The corresponding noiseless values are Aⁱx and t_i respectively where x is the sparse vector of initial viral loads.

x is the vector of n viral loads. If the viral loads are high, then w is approximately equal to x elementwise as the standard deviation of a Poisson distribution with large mean is relatively negligible (as it equals the square root of the mean)

11 of 44

Noise model: RT-PCR

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

A_ij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people

m x 1

m x n, m << n

n x 1

Estimate x given y and A

Noise vector

m x 1

log

12 of 44

RT-PCR: 0-valued measurements

If the viral load z_i in the i-th pool is 0, the fluorescence threshold is never crossed (negative test).

In this case, all participating samples x_j (i.e. for which A_ij = 1) are non-infected and can be removed from further computation.

If viral loads in infected samples are high, there is no conflation between negative and positive tests (negative result is noiseless).

13 of 44

RT-PCR: 0-valued measurements

“Removing negative samples” ≈ combinatorial group testing algorithm called COMP (Combinatorial Orthogonal Matching Pursuit)

But COMP regards the other samples to be positive (possible false positives)

After discarding negative samples, we may also get a few sure positives – positive tests with only one non-discarded contributing sample.

14 of 44

Compressed Sensing Algorithms

We now have a smaller-sized y, A, x after discarding the negatives in x, and pruning y, A accordingly.

m' x 1

m’ x n’

n’ x 1

m’ x 1

log

15 of 44

Compressed Sensing Algorithms

Run after the “COMP-step”
Non-negative LASSO

Non-negative Orthogonal Matching Pursuit

Sparse Bayesian Learning (E-M algorithm)

Brute-force search (for small k only)

16 of 44

Mixing matrix design

Many methods exist:
(*) Minimize mutual coherence of A
(#) Maximize mutual information between Ax and x [requires estimating high-D PDFs from representative signals]
($) Minimize MMSE assuming Gaussian noise in y and GMM for x [requires GMM fitting on representative signals].

(*) Abdolghosemi et al, “On the optimization of the measurement matrix for CS”

(#) L Wang et al, “Information-theoretic measurement matrix design”

($) Renna et al, “Reconstruction of Signals Drawn from a Gaussian Mixture via Noisy Compressive Measurements”

17 of 44

Mixing matrix design

Constraints on A (size m x n):
Should allow for unique recovery of sparse x from A and noisy y : no sparse vector except 0 should lie in nullspace(A)
Non-negative
Binary and sparse (for ease of sample pooling in the lab)
Should be flexible, i.e. should adapt easily to slightly different sizes

18 of 44

Mixing matrix design: Expanders

Expander matrices: obey RIP-1, i.e. preserve norm of k-sparse vectors

Every sufficiently small subset of samples should participate in a sufficiently large number of pools

19 of 44

Mixing matrix design: Expanders

A randomly generated left-regular graph with left-degree d is an expander – with high probability.

Can be converted into a graph/matrix that is left-regular as well as right-regular with almost same parameters.

But difficult to verify whether a given matrix is an expander.

20 of 44

Mixing matrix design: deterministic matrices

Many methods for deterministic sensing matrix design (*)
We use Steiner triple matrices of size m x n where n = C(m,2)/3
Each column has at most 3 ones (out of m elements)
No two columns have a dot-product more than 1 (good mutual coherence)

(*) Devore, Deterministic constructions of compressed sensing matrices, 2007

(*) Naidu et al, Deterministic CS matrices: construction via Euler squares, 2016

21 of 44

Mixing matrix design: deterministic matrices

In particular, we use Kirkman triples – a type of Steiner triples where each block of consecutive m/3 columns (starting from index 0) sum up to 1.

Kirkman triple matrices have flexible sizes: n can be any integer multiple of m/3 and still each pool has the same number of samples.

22 of 44

Results

Synthetic/simulated data – refer our arxiv paper

Real data from labs at NCBS and Harvard, obtained by artificially injecting viral RNA copies in samples

Pools created using our designed mixing matrix A.

Results on blinded experiments (no idea of ground truth before reporting of results, or during creation of pools by technicians)

23 of 44

Results: figures of Merit

RMSE between estimated and true viral load

Sensitivity = #correctly detected positives/#true positives

Specificity = #correctly detected negatives/#true negatives

24 of 44

k = # of infected samples out of n

25 of 44

Experiment in a lab at Harvard, using a liquid handling robot

26 of 44

Clinical Trials with COVID19 samples

72 samples in 36 tests with upto 12 positives
105 samples in 45 tests with upto 15 positives
320 samples in 48 tests with 5 positives
500 samples in 60 tests with 10 positives
961 samples in 93 tests with 10 positives

Very small number of false positives, no false negatives

27 of 44

A word about Dorfman Pooling

Testing in two stages (one after the other)
Adaptive method
First round: create (nk)^0.5 pools of size (n/k)^0.5 each; k = #infected people
Second round: individual testing of each person in a positive pool
Two of rounds of RT-PCR = time consuming

28 of 44

Comparing COMP-SBL (one-stage) with two-stage Dorfman pooling

k = # of infected samples out of n

29 of 44

Outputs

We have built a system called Tapestry.

Phone App called Byom.

Papers on medarxiv and arxiv.

Work cited in newsletters of IEEE Signal Processing Society and SIAM.

David Donoho’s talk at SIAM

31 of 44

Summary of Tapestry

Tapestry is non-adaptive, i.e. single stage (unlike Dorfman)

Tapestry has in-built methods to estimate number of infected people (k) from the RT-PCR test results on pools (resort to COMP if k is too high)

Tapestry handles noisy measurements.

Unlike some binary group-testing methods, Tapestry measures viral loads

32 of 44

Summary of Tapestry

Algorithmic parameters auto-tuned by cross-validation

Algorithms do not use any prior knowledge of k

In our design of A, each sample contributes to exactly 3 pools. Each pool contains contributions from ~40 samples.

33 of 44

Discussion: Practical Aspects

How many measurements? O(k log n) for random matrices, O(k²) for deterministic

Empirically, we see that k log n measurements are good enough for many of our results

How to predict minimum number of required measurements?

34 of 44

Discussion: Practical Aspects

Can you guarantee no false negatives for non-adaptive pooled tests with minimal false positives?

More accurate noise models?

35 of 44

Discussion: Related Work

(#) Work from Noam Shental’s group (LASSO with Reed-Solomon codes)

(*) Work from W. Xu’s group (synthetic data only)

($) Binary PCR work by Dror Baron

(&) Work by Krishna Narayanan: two stage, adaptive

(#) N. Shental, S. Levy, S. Skorniakov, V. Wuvshet, Y. Shemer-Avni, A. Porgador, and T. Hertz, “Efficient high throughput SARS-CoV-2 testing to detect asymptomatic carriers”

(*) J. Yi, R. Mudumbai, and W. Xu, “Low-cost and high-throughput testing of covid-19 viruses and antibodies via compressed sensing: System concepts and computational experiments”

($) J. Zhu, K. Rivera, and D. Baron, "Noisy Pooled PCR for Virus Testing"

(&) Heiderzadeh and Narayanan, “Two-stage adaptive pooling with RT-qPCR for COVID-19 screening”

36 of 44

More prior knowledge: families

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

A_ij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people - divided into disjoint blocks (eg: families)

m x 1

m x n, m << n

n x 1

Estimate x given y and A

Noise vector

m x 1

log

Discussions with Dror Baron.

37 of 44

Group sparsity

L1 norm is a softened version of the L0 norm – which considers sparsity of x [we assumed that the number of infected individuals is small]

The group sparsity norm considers the number of infected families and assumes it to be small. Counting the number of infected families is called the group-L0 norm but it leads to NP-hard problems (just like the L0 norm). So we soften it to the group-L1 norm. NOTE: xGi is a vector that contains the viral load of all members of the i-th family. It is a subvector of x.

38 of 44

# infected people --> # infected families

Fully Infected Families

39 of 44

Partially Infected Families

40 of 44

Even more prior knowledge: contact tracing

Vector of observed quantities proportional to viral loads in m pools

m << n

Known mixing matrix:

A_ij = 1 if j-th sample contributes to i-th pool, and 0 otherwise

Sparse vector of unknown viral loads in the samples of n different people - divided into blocks (eg: families) and you have contact tracing information

m x 1

m x n, m << n

n x 1

Estimate x given y and A

Noise vector

m x 1

log

41 of 44

Even more prior knowledge: contact tracing

42 of 44

More details about the contact tracing work

Ritesh Goenka, Shu-Jie Cao, Chau-Wai Wong, Ajit Rajwade, Dror Baron,"Contact Tracing Enhances the Efficiency of COVID-19 Group Testing", ICASSP 2021 (accepted to Special session on "Data Science Methods for COVID-19"). Arxiv version available.

Some results on false positive rate (FPR) and false negative rate (FNR) on next slide.

1 of 44

2 of 44

3 of 44

4 of 44

5 of 44

6 of 44

7 of 44

8 of 44

9 of 44

10 of 44

11 of 44

12 of 44

13 of 44

14 of 44

15 of 44

16 of 44

17 of 44

18 of 44

19 of 44

20 of 44

21 of 44

22 of 44

23 of 44

24 of 44

25 of 44

26 of 44

27 of 44

28 of 44

29 of 44

30 of 44

31 of 44

32 of 44

33 of 44

34 of 44

35 of 44

36 of 44

37 of 44

38 of 44

39 of 44

40 of 44

41 of 44

42 of 44

43 of 44

44 of 44