Author: Robert Jacobson <[email protected]>
Acknowledgements: Thanks to Andy Rhyne for fruitfull discussions and valuable feedback. All errors are my own.
A variety of authors have investigated the benefits of sample pooling to test populations for COVID-19 infection. Of primary interest is reducing the consumption of testing resources, which are a scarce resource during the COVID-19 pandemic, and the accuracy and efficiency of estimating the number of infected individuals. This note discusses the limitations imposed by mathematics on the reduction in number of tests needed to be performed using this testing method and bounds on the number of infected individuals in terms of the number of infected pools. Other limitations related to testing supplies (consumption of pipets, conical tubes, etc.), equipment availability, protocol complexity, and so forth are not discussed. The test for infection is assumed to be 100% effective and 100% sensitive.
In naive sample testing, individuals in a population are tested for infection by taking a sample from every individual and testing each sample in isolation. In a population of
In (one-dimensional) sample pooling, samples from multiple individuals are combined into a single pool, and the pool is tested once. A negative result indicates that no individual in the pool is infected. We call such a pool a negative pool. A positive result indicates that at least one individual in the pool is infected. We call such a pool a positive pool. In the event of a positive result, the individuals in the pool must be retested individually to determine which individuals in the pool are infected. Variations on this method are considered.
The word sample is used in a biological sense, referring to a fluid or tissue sample from an individual that will be tested for infection. Both sample pooling and naive sample testing use samples from every individual in the population. These testing methods should not be confused with random sampling, which only takes samples from a subset of the population.
In one-dimensional sample pooling, the population of
Fig 1: One-dimensional sample pooling.
One test is used per pool. In this example, it takes 12 tests to test all 12 pools. Each individual in the positive pools are tested in isolation to identify which individuals are infected. The total number of tests performed is
Observe that naive sample testing is equivalent to the degenerate sample pooling cases of
In general, the number of tests required by one-dimensional sample pooling is
Pr$$\displaystyle (k=i)=\frac{g-1}{g}\cdot \frac{g-2}{g}\cdot ;\cdots; \cdot \frac{g-i+1}{g} = \frac{g!}{g^{i}(g-i)!}$$.
Fig 2: The probability that no two infected individuals share a pool increases as the number of pools increases relative to the number of infected individuals. In this graph the number of infected individuals is fixed at
The real distribution of people infected with COVID-19 is not random, and Pr$(k=i)$ should not be assumed to follow this formula. If the population in the sense used in this paper is considered to be a single qPCR well plate, then randomizing the locations of the samples is still not sufficient, as an estimate on
One-dimensional sample pooling can be improved by simultaneously pooling the columns as well as the rows. We expand on the previous example as follows.
Fig 3: Two-dimensional sample pooling.
An infected individual can only exist at the intersection of a positive row pool and a positive column pool. However, not every intersection of a positive row pool and a positive column pool need be an infected individual. For example, using row-major notation, we see an infected individual at position
This method can be generalized to
Fig. 4: A three-dimensional representation of the previous examples. There are 18 points at the intersection of three positive pools but only 3 infected individuals.
Instead of testing pools along different dimensions in parallel, testing along dimensions in sequence and removing every negative pool as it is discovered reduces the size of pools along dimensions not yet tested. This does not reduce the number of tests required, as any negative individual will not lie at the intersection of
We ask, before any pools are tested, what is the maximum number of tests possible that might have to be performed to determine which individuals are infected? The number of tests required by
$T = g_1+g_2+g_3+\cdots + g_d + k_1\cdot k_2 \cdot k_3 \cdot ; \cdots ; \cdot k_d $
where
Now suppose we have tested the pools and thus determined the values of each
In two dimensional sample pooling, no retesting is necessary if the retest set contains only one sample, as the single sample is by definition the only candidate positive sample. We can similarly elide retests for some cases of larger retests and in higher dimensions.
Consider the example of two-dimensional sample pooling illustrated below. The red circles represent infected individuals. After testing the pools, represented by the squares and triangles, two row pools and two column pools are positive. ( Only the retest set is shown.)
Fig. 5: The pools, represented by the squares and triangles, are all positive, whereas there are only two infected individuals.
The retest set is the set of circles lying at the intersection of a positive row and positive column, which is every circle. The minimum number of infected individuals needed to have both row pools be positive is two, one for each row. Similarly, there must be at least two infected individuals to have two column pools be positive. Without retesting, we cannot determine whether any particular individual is infected. The naive approach is to test them all, which requires four tests. Suppose, however, that we happen to test Circle 2 first, resulting in a negative outcome, and then Circle 3, giving another negative outcome. We would then know two particular circles of the four are negative and that there must be at least two positive circles. We conclude, therefore, that the remaining two circles must be positive without having to test them. (Indeed, we could determine both Circle 1 and Circle 4 are positive after finding Circle 2 to be negative, but we would still need to retest Circle 3.)
Fig. 6: The catalog of all orderings of retesting a retest set of four points. Observe that all orderings can be obtained from the first three by compositions of reflections and rotations. All orderings allow for either six or seven tests to be elided.
Figure 6 lists every order in which to retest a retest set of four points followed by a graphical depiction of that ordering followed by every combination of positive and negative samples that can result in a four point retest set annotated with which tests can be elided under that ordering. Observe that while there are
In the idealized model of this paper, the probability discussion in Bounds on the number of required tests for one-dimensional sample pooling indicates that the last two columns are the most likely of the four point retest sets. If real-world experience of test outcomes matches this behavior, then choosing an ordering which elides the maximal number of cases in the last two columns will reduce the number of retests by
The rule for eliding tests in a four point retest set in two dimensions can be stated as follows:
When a point tests negative, every point adjacent to that point (excluding diagonals) must be positive and thus need not be tested. With two-dimensional sample pooling and sufficiently small
$i$ , 3/8 of the retests can be elided.
When there are more than four samples in the retest set the efficacy of this optimization is decreased. In the four retest point two dimensional case, knowledge about yet untested samples is a consequence of the fact that there must be at least one positive sample in a two-sample positive pool, and one of the samples has been ruled out. In a three-sample positive pool, two samples must be ruled out to conclude that the remaining sample must be positive. In general, in a
This optimization is similarly weakened in higher dimensions for the same reason: there are more samples to rule out before one can conclude that the remaining sample in a positive pool is the only positive sample in the pool. For example, in Figure 7 below three of the four points within a pool (represented by a plane) such as the three blue points in the rear blue pool must be ruled out before the remaining point can be assumed to be infected.
Fig. 7: The three blue points must be ruled out to conclude that the remaining red point is infected when sampling the rear blue pool.
As before, under the most optimistic assumptions,
The number of infected individuals can be estimated by stopping the testing protocol after testing pools in the first
TODO.
—Include “cutoff points” of when scheme is no longer advantageous. Should have its own section?
- probability of having a retest set of size
$r$ .
We assume the test for infection is perfect, i.e. that the test is 100% effective and 100% sensitive. The table below gives the definition of the variables used.
Variable | Meaning | Formula | How It’s Determined |
---|---|---|---|
Number of individuals in the population being tested | Chosen | ||
Number of pools (one-dimensional pooling). | Chosen | ||
Size of each pool (one-dimensional pooling) | Derived from chosen values of |
||
Number of dimensions | Chosen | ||
Size of each pool along dimension |
Chosen | ||
Number of positive pools (one-dimensional pooling) | Empirically | ||
Number of positive pools among the total number of pools |
Empirically | ||
Number of infected individuals within the population | Empirically | ||
Number of tests required for the given application of the testing protocol | Chosen | ||
Maximum number of tests that can be required for the given testing protocol | $T_{\text{max}}=\begin{cases} \sum_{j=1}^d g_j + i^d& i<g_j \forall j \ \sum_{j=1}^d g_j + N & i=N\ \sum_{j=1}^d g_j + \prod_{j=1}^d \text{min}(i, g_j) & \text{generally} \end{cases}$ | Derived from test design and empirical value of |
|
Minimum number of tests that can be required for the given testing protocol | Derived from chosen values of |
||
- The total number of tests required for
$d$ -dimensional pooling is$T_{\text{max}}=\sum_{j=1}^d g_j + \prod_{j=1}^d \text{min}(i, g_j)$ $=\begin{cases} \sum_{j=1}^d g_j + i^d& \text{if }i<g_j; \forall j \ \sum_{j=1}^d g_j + N & \text{if }i=N \end{cases}$. - As a consequence of (1), pooled sample testing is of greatest benefit when the number of infected individuals is small (close to 0).
- As a consequence of (1), pooled sample testing requires more tests than naive sample testing if the number of infected individuals is large (close to
$N$ ). - Dimensions must be tested in order from smallest to largest to detect the absence of infected individuals after
$\text{min}_j ;g_j$ tests, which is optimal. - The number of tests used on pools is the total number of pools, which is
$\sum_{j=1}^d g_j$ . - The number of retests of individuals required is $\prod_{j=1}^d \text{min}(i, g_j) = \begin{cases} i^d& \text{if }i<g_j;\forall j \ N & \text{if }i=N \end{cases}$.