1 of 40

Decision Theory in Practice: Validation, Accuracy, and Trade-offs

By Vera Wilde, Ph.D.

2 of 40

About me

Academic:

Ph.D. in Politics, University of Virginia (NSF-supported)
Postdoctoral research: UCLA Psychology (Prejudice & Violence Lab), Harvard Kennedy School (Malcolm Wiener Center for Inequality)
Research focus: Bias and neutrality in tech-mediated decision-making (security, medicine, welfare)
Postdoc work on U.S. national policing database initiative

Civil society:

FOIA activism including a Knight Foundation-supported lawsuit resulting in a U.S. Circuit Court precedent (Sack v. DoD)
Co-creator of iBorderCtrl.no, cited as impactful science advocacy by Access Now and EDRi; engaged with CCC
Collaboration with media like McClatchy Newspapers and Wired

3 of 40

4 of 40

The Validation Problem

“There’s a major difference between asking people about something that you can verify and asking them about something that you can’t…”

- Stephen Fienberg, 2009 interview

Image: The American Academy of Political and Social Science, https://www.aapss.org/fellow/stephen-e-fienberg/.

I’m going to let this guy explain more. This is Stephen Fienberg.

Introduce Fienberg. He was a public statistician: a professor of statistics and social science at Carnegie Mellon University and a member of the National Academy of Sciences, among numerous other positions and honors. His work to make forensics in the U.S. more scientific led to major reform efforts translating statistical knowledge into courtroom practice, and so into saved lives for many wrongly accused or even wrongly convicted people. (See also the Innocence Project in memoriam: https://innocenceproject.org/innocence-project-remembers-stephen-fienberg/.)

Stephen Fienberg said of next-generation “lie detection” programs like FAST (Future Attribute Screening Technology, a 2008 DHS program that purported to read travelers’ minds with their physiology)...

<<… well, at the airport, we want to get the man or woman who wants to get on the airplane and blow it up. I’m all for that. And if I knew a good way to do it, I’d be willing to surrender some liberties if there was really a device that I had some confidence in that worked. I’m willing to allow false positives if I’m really catching people; I haven’t seen the evidence.

There’s a major difference between asking people about something that you can verify and asking them about something that you can’t. I was sitting at the table last night with a group of colleagues from another committee, and they said, “So how would you validate this?” and I said “I don't know.”

That's part of the problem. I don’t know what’s the experiment I could conduct that’s double-blind that would give me enough data to conclude that this works or it doesn’t work, and I don’t know how to plan for it, and I’m not hearing any of these people saying that they’re worrying about that. They’re using it in some trial program, but I can’t imagine they’re using it in the context where I want to see it evaluated, and I find that pretty frightening.>>

- Stephen Fienberg, 2009 interview

***

Witch hunt analogy.

***

The validation problem carries far in MaSLoPP (mass screenings for low-prevalence problems). You might think cancer screenings are easier than security screenings, because at least you can image, you can do a biopsy, you can know whether someone has cancer in a way that you can’t examine a slice of someone’s psyche and know whether they’re a criminal. But even in the medical realm, there’s persistent inferential uncertainty about whether mass screenings like mammography, PSA, and colonoscopy save lives. All-cause mortality benefits are arguably what we should care about, according to medical scholars like Prasad, Bewley, and Welch, and they’re really hard to establish. The best available current evidence doesn’t establish them for PSA, mammography, or colonoscopy mass screenings. And in a given case, we can’t know whether a particular cancer was one of the many destined never to kill or even bother you, or one of the lethal ones.

That uncertainty highlights the importance of considering secondary screening harms…

5 of 40

The validation problem is about uncertainty.

Some mass screenings or interventions, you never know what to infer from the info they produce.

Others, you do.

Prenatal testing success stories:

HIV example, longstanding success reducing vertical transmission.

England’s recent elimination of hepatitis B vertical transmission is another example (https://nationalscreening.blog.gov.uk/2024/08/15/england-achieves-who-certification-for-eliminating-mother-to-child-transmission-of-hepatitis-b; covered at https://wildetruth.substack.com/p/links-0831).

These kinds of interventions work for two reasons:

(1) Because uncertainty about test classifications (i.e., what’s a true or false positive or negative result) can be disambiguated by further testing/intervention. And (2) harms from those secondary screenings/interventions are negligible (i.e., are minimal and offset by net gains accruing to the same people who risk the possible harms).

Versus in a lot of policy problem contexts, uncertainty raises the scientific problem of test validation under real-world conditions (which often we cannot solve).

What do I mean by that?

I mean that there’s no way of assessing a test’s accuracy and error in the real world, as opposed to under lab conditions. This is especially a problem for things like most crimes, where you don’t know what the true population incidence is or whether someone committed them (usually) – and we don’t have a psychic X-ray to look inside people’s heads and find out.

As opposed to diseases, where you can do things like count cancer deaths in countries with and without screenings, and compare those counts with counts of non-lethal malignancies of the same type and data on treatment outcomes over time to get a pretty good sense of how well early detection works to save lives relative to how often the thing we’re detecting really kills people.

So persistent inferential uncertainty tends to be particularly problematic in security or other settings where dedicated attackers have incentives to game the system and lie, like espionage, terrorism, or child abuse screenings. But it’s also a problem for many health screenings where it might at first seem that these biological questions are empirically cut and dried.

For example, in prenatal testing, typical gestational diabetes screenings produce huge numbers of false positives, whereas it’s much more difficult to prove someone does NOT have GD; most tests leave some persistent inferential uncertainty, without available second-order sure things.

Mammography for breast cancer and PSA testing for prostate cancer share the same structure, where if what we care about is cancers that will cause preventable deaths, just identifying a cancer doesn’t tell us whether it would go on to be deadly (usually not in the case of DCIS - ductal carcinoma in situ, or cancer in your milk ducts - which doesn’t usually progress; same for most prostate cancer). Nor does IDing a cancer tell us whether it would go on to be deadly but incurable anyway.

So testing for cancer that would go on to have clinical and practical significance while being treatable is a much higher bar than just IDing cancer. This is why we look for cancer-specific or (better yet) all-cause mortality benefits from these tests. Because otherwise it’s not clear that we have inferred the right thing from the result, even with what might seem like second-order sure thing tests like biopsies.

***

By contrast, in prenatal screening for hepatitis B, you can test the mother, selectively vaccinate newborns of positive moms, screen positive mothers’ 1 year olds, and routinely vaccinate all kids.

This is how England recently eliminated vertical transmission of hepatitis B. :)

According to the UK NSC, this success story demonstrates “that nationally managed screening is not just about testing, but an end-to-end pathway that involves many different health professionals and services, often across different settings. No test on its own helps people.” (Tsitsi Muchayingeyi for the UK National Screening Committee blog, https://nationalscreening.blog.gov.uk/2024/08/15/england-achieves-who-certification-for-eliminating-mother-to-child-transmission-of-hepatitis-b/).

6 of 40

Binary Screening Test Results

Confusion matrices for binary screening tests like the ones we’re looking at show four outcomes: true positives, false positives, true negatives, and false negatives.

But they risk hiding the potentially persistent uncertainty about which category a particular case/person really belongs in, in the real world.

***

A mathematical theorem called Bayes’ Rule says: the probability of A given B, or the posterior probability of A given B, is equal to the probability of B given A times the probability of A divided by the probability of B. It’s derived from the definition of conditional probability and is about conditioning probability calculations based on evidence. It’s particularly useful when we have evidence that different groups of people are going to have different outcomes.
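Written out, the rule just stated looks like this. The first equation is the rule itself; the second expands the denominator with the law of total probability, which is how it is usually applied to screening. Reading A as “has the condition” and B as “tests positive” is an illustrative labeling added here, not notation used elsewhere in these notes.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\, P(A) + P(B \mid \neg A)\, P(\neg A)
```

With that labeling, P(A) is the base rate, P(B | A) is the true-positive rate, and P(B | not A) is the false-positive rate, so a small base rate can keep P(A | B) low even when the test looks accurate.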

[pulling from SCQ, boil it down https://www.scq.ubc.ca/refugee-screening-a-brief-introduction-and-a-request-for-equipment/]

More specifically, Bayes Rule gives us a logic for thinking about probabilities that takes base rates (or rates of occurrence in a larger population) into account by breaking events down into four categories. Any screening, be it for a medical condition, crime, or something else, produces:

1. false positives (people who are healthy or innocent, but the screening says aren’t),

2. true positives (people who are ill or guilty, and the screening says so),

3. false negatives (people who are ill or guilty, but the screening says are fine), and

4. true negatives (people who are healthy or innocent, and the screening says so).

It’s generally safe to assume all of these categories are populated with non-zero values, because existing tools for assessing wellness or investigating crime only correctly identify ill/guilty people at rates above chance but below perfection. That looks like this…

This means that for very rare illnesses or crimes (problems with low base rates), such as syphilis or espionage/terrorism, screening tests may produce very large numbers of false positives—misidentifying lots of people as possibly ill/guilty. This is the case with a common syphilis test, for example. One in three “positives” might not actually have the disease; out of those false positives, a disproportionate number have an unrelated disease instead (lupus).

But people tend to ignore false positives and focus only on true positives when they think about the accuracy rates of screenings. This selective attention produces a logical error called the base rate fallacy. As a species, we have poor intuition about rare events—especially when given information about a specific case of possibly detecting them.

Because of the salience of incidence, Bayes’ rule is required to reason about screening tests, and native intuitions tend to be poor; this may help explain why overconfidence increases more than prediction accuracy for experts (Piattelli-Palmarini).

The same basic problem applies to screenings for extremely rare crimes such as espionage/terrorism…

7 of 40

Case Study: Chat Control

Scaled up from Fienberg et al’s 2003 NAS polygraph report, Table S-1: https://nap.nationalacademies.org/read/10420/chapter/2#5.

Chat Control is Pirate MEP Patrick Breyer’s nickname for the EU’s proposed Child Sexual Abuse Regulation or CSAR, which would entail client-side scanning of all digital communications for child sexual abuse material.

The National Academy of Sciences polygraph table scales up to demonstrate the problem with Chat Control and other programs that share this structure: even with high accuracy and low false positive rates, you still wind up with an overwhelmingly large number of false positives, just because of the sheer volume.

In presenting info this way, Fienberg was also applying the insight from a seminal 1995 decision science paper, “How to Improve Bayesian Reasoning Without Instruction: Frequency Formats,” by Gerd Gigerenzer and Ulrich Hoffrage (Psychological Review, 102(4) 1995, 684-704): getting mathematical info in frequency format, rather than probability or percent format, helps people have better Bayesian statistical intuitions about outcome spreads. It’s not really surprising that we do a much better job assessing risk with body counts, which we evolved seeing, than with probabilities, which are evolutionarily novel. But it’s important that risk analysts acknowledge that the two formats are not psychologically identical, even though they’re mathematically identical; the format makes a huge difference in usability. So this kind of table is like UX for math.

Assuming a 1/1000 child sexual abuse material [CSAM] base rate, an 80% detection threshold, and a .90 accuracy index, scanning 10 billion messages – a fraction of daily European emails – would generate over 1.5 billion false positives. There’s a 99.5% probability (1,598,000,000/1,606,000,000) that a message flagged as abusive would be innocent. There’s an almost 16% chance (1,598,000,000/9,990,000,000) an innocent message would get flagged. Yet, 20% of the time, abuse would evade detection. Meanwhile, there would be almost 200 messages mistakenly flagged for each abusive message found (1,598,000,000 false positives / 8,000,000 abusive messages).
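The arithmetic in the paragraph above can be re-derived in a few lines. This is a minimal sketch that takes the message count, base rate, detection rate, and false-positive count directly from that paragraph (the false-positive count is the scaled-up NAS Table S-1 figure, not something computed here); the variable names are illustrative.

```python
# Inputs quoted in the paragraph above (suspicious mode).
messages = 10_000_000_000          # messages scanned
base_rate_denominator = 1000       # 1 in 1000 messages assumed abusive
detection_percent = 80             # share of abusive messages flagged
false_positives = 1_598_000_000    # scaled up from NAS Table S-1, taken as given

# Derived counts, using integer arithmetic to avoid rounding noise.
abusive = messages // base_rate_denominator
innocent = messages - abusive
true_positives = abusive * detection_percent // 100
false_negatives = abusive - true_positives
flagged = true_positives + false_positives

# The three ratios discussed in the text.
share_of_flags_innocent = false_positives / flagged      # P(innocent | flagged)
false_positive_rate = false_positives / innocent         # P(flagged | innocent)
false_alarms_per_hit = false_positives / true_positives  # bycatch per abusive message found

print(f"flagged messages:            {flagged:,}")
print(f"share of flags innocent:     {share_of_flags_innocent:.1%}")
print(f"false-positive rate:         {false_positive_rate:.1%}")
print(f"missed abusive messages:     {false_negatives:,}")
print(f"false alarms per true hit:   {false_alarms_per_hit:.2f}")
```

Keeping everything as counts until the last step mirrors the frequency-format presentation of the table: the ratios fall out of the four cells rather than the other way around.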

And that’s just for the tough version of the algorithm set to detect the majority of abuse (suspicious mode). We could also run it in friendly mode, to minimize false positives instead, and get the alternate outcomes spread shown below that one here. This is the problem… It’s not even that this structure, mass screenings for low-prevalence problems, offers one bad option…

8 of 40

Case Study: Chat Control

9 of 40

Irresolvable Tension

Binary classifications yield four types of results.
Maximizing true-positive rates and minimizing false-positive rates are in tension.
Both have implications for practical outcomes of interest (e.g., security, health, research/information quality).

Talk of striking a balance misses this point: many mass screenings for low-prevalence problems will never work as intended, because the problems they target are too rare, our uncertainty about whether they exist in a given case or not is too persistent in the real world, and secondary screenings to disambiguate true from false alarms cause too much harm for mass use to accrue net benefit.

This is due to the accuracy-error trade-off or dilemma, the empirical trade-off in any signal detection problem between sensitivity and specificity, or true and false negatives and positives — empirical outcome spreads that do not map directly onto values like security and liberty.

***

h/t Jacob Goldstein-Greenwood (via Clay Ford), https://data.library.virginia.edu/roc-curves-and-auc-for-models-used-for-binary-classification.

10 of 40

Why the Trade-Off?

Idealized Holy Grail model.
Probabilistic signal detection realities.
Tech just categorizes.
No exit from universal mathematical laws.

This is a picture from a series I painted in grad school of psychic X-rays (tools that purport to see inside your head/heart). Lie detection works from this conception of science. There are perfect cues, but mostly we have probabilistic associations, which imply the accuracy-error trade-off.

No exit from this statistical reality through tech is possible, because tech doesn’t affect the world, it just categorizes it. This is not a technology problem. The structure of the world is like this.

Increasing accuracy doesn’t solve it; imperfection persists.

Probabilistic association is the core of the problem, where we want to use the info in these imperfect cues, but we have to be careful about the contexts in which we use them and the inferences we make from them.

Imperfection management is notoriously difficult for us humans, but perfect solutions are usually bad ones. E.g., I can catch all the spies by locking everybody up, or guarantee that we don’t harm any innocents by ignoring the problem. The trade-off arises by negotiating between those two extremes. At any point between those two extremes, we’re catching more and more and producing more bycatch, or catching fewer and reducing bycatch.
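The two extremes and the ground between them can be sketched with a toy signal detection model. Everything here is a hypothetical illustration: the cue score is assumed to be normally distributed with mean 0 for innocent people and mean 1.5 for guilty people, and the thresholds are arbitrary. Only the qualitative pattern matters.

```python
from statistics import NormalDist

# Hypothetical score distributions for an imperfect cue (assumed, not from any dataset).
innocent = NormalDist(mu=0.0, sigma=1.0)
guilty = NormalDist(mu=1.5, sigma=1.0)


def rates(threshold):
    """Return (true-positive rate, false-positive rate) when flagging scores above threshold."""
    tpr = 1 - guilty.cdf(threshold)    # guilty people correctly flagged
    fpr = 1 - innocent.cdf(threshold)  # innocent people wrongly flagged (bycatch)
    return tpr, fpr


# From "lock everybody up" (very low threshold) to "ignore the problem" (very high threshold).
thresholds = [-10.0, 0.0, 0.75, 1.5, 10.0]
results = [(t, *rates(t)) for t in thresholds]

for t, tpr, fpr in results:
    print(f"threshold={t:6.2f}  TPR={tpr:.3f}  FPR={fpr:.3f}")
```

Moving the threshold only slides along the curve: every step that catches more guilty people also flags more innocent ones. Separating the two distributions further (a better cue) bends the curve but does not remove the overlap unless the cue is perfect.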

So the probabilistic nature of most cues structures the world; and increasing accuracy doesn’t solve the problem because it can’t change the structure. Tech can’t escape universal mathematical laws.

This inescapable accuracy-error trade-off is why we need to worry about massive societal damages to exactly the values we are seeking to protect through mass (population) surveillance (screening) programs — an end that usually ends up at safety when you ask “why” enough times (also in programs outside of that domain, like cancer screenings). The ironic danger in this excessive safety-seeking presents most starkly under common conditions of rarity (when the problem at issue is low-prevalence), persistent uncertainty (when we can’t know for sure whether the test results are right), and secondary screening harms (when we do harm trying to figure it out). While rarity implicates the common cognitive bias known as the base rate fallacy, uncertainty raises the scientific problem of test validation under real-world conditions (which often we cannot solve); and the issue of secondary screening harms raises the problem of measuring benefits and harms.

***

May Stroom talk 2023 nts -

There are different conceptions of science, and one that a lot of lay people and many scientists work from is the Holy Grail model.

// hidden ovulation

Lie detection as a science - literal lens

truth-telling/deception // hidden ovulation

Searching for the Holy Grail of a unique sign, response, or “biomarker” (as the creators of the AI “lie detector” iBC called it) of lying.

[iBC received € 4 501 877,50 (over 4.5 million euros) in Horizon 2020 EU research and innovation funding. It built a “deception detection” component into a new border check system being tested at non-EU borders. That component was originally called Silent Talker, and was a technology developed between 2000 and 2002 by a team of computer scientists at Manchester Metropolitan University - Janet Rothwell, Zuhair Bandar, James D. O’Shea, and David McLean]

Through the science lens, the search for a holy grail of a unique lie response to detect assumed truthfulness/deception is something like hidden ovulation: with enough biomarkers, we have a good enough proxy of hidden processes to know when women are fertile, even though ovulation in humans is hidden.

***

Dec 2023 nts -

There are a great many areas in science that have advanced beyond lie detection, which is widely regarded as pseudoscience. So why isn’t it possible to have some tech that detects all the true cases and never misclassifies?

Perfect diagnostic cues do exist, but they are not what we’re talking about here. Usually, clues are only probabilistically associated with the thing of interest. The trade-off arises from the imperfect association between the cues we might use and the things we’re trying to see but can’t.

The techno-solutionist myth often comes up in conversation when I explain this problem. A lot of people are very hopeful that AI will solve it. But no tech can escape the universal laws of mathematics. Tech is still manmade, we still make mistakes, signals are imperfect cues.

Tech can’t escape this trade-off, because it doesn’t affect the world, it just categorizes it. This is not a technology problem. The structure of the world is like this.

Probabilistic association is the core of it, where we want to use the info in these imperfect cues, but we have to be careful about the contexts in which we use them and the inferences we make from them.

***

11 of 40

So in one view, statisticians/science communicators have to debunk techno-solutionism by explaining the implications of probability theory for mass screenings for low-prevalence problems, or mass interventions where we can’t solve the validation problem, and secondary screening costs are high/risk net harm.

In that view…

Myth: Once we have better tech, accuracy will tip the scales and there will be no possible net harm from such programs. Better tech will solve the problems of mass screenings/interventions in terms of false pos, false neg, reverse causality, and zero-sum resource expenditure here taking away from targeted investigations, screenings, or care.

When it’s acknowledged that we do indeed have to worry about false pos harm, we usually hear talk of striking a balance. Liberty v security (reification bias; pervasive lack of transparent policy efficacy data to validate what are almost certainly gross mismappings) + talk in which people assume that better tech will, one day, if it hasn’t already, tip the scales…

Statisticians tend to argue that this misconception reflects a lack of understanding about the implications of probability theory, and specifically the mathematical theorem known as Bayes’ theorem or Bayes’ rule.

***

The source of the term “solutionism,” “technological solutionism,” or “techno-solutionism” is Evgeny Morozov, “To Save Everything, Click Here: The Folly of Technological Solutionism,” PublicAffairs, New York, 2013. For this citation I am indebted to Bettina Berendt’s reference in “AI for the Common Good?” Paladyn, Journal of Behavioral Robotics, 2019, https://www.degruyter.com/document/doi/10.1515/pjbr-2019-0004/html?lang=en.

12 of 40

What’s the Problem?

Net harm possible.
Probability theory implies this when:

Rarity.
Uncertainty.
Harm.

Image: Sherkiya Wedgeworth, CC Attribution-NonComm. 4.0 Int’l Lic.

Net harm possible.

Safety-seeking can backfire.

Bayes’ theorem - under certain common conditions.

***

We can’t protect everyone from everything, and we can do harm trying. Armed with new technologies to promote public interests including health, safety, and truth, we hit old limitations from universal mathematical laws whose implications are not widely recognised. Thus, as technologies advance, proposed programs keep hitting the same wall, known in probability theory as Bayes’ theorem. Bayes’ theorem says the probability of an event changes depending on the subgroup.

When applied to mass screening for low-prevalence problems, Bayes’ rule implies that these programs backfire under certain conditions:

Rarity. Even with high accuracy and low false positive rates, the sheer volume involved in screening entire populations for low-prevalence problems generates either many false positives, or many false negatives (the accuracy-error trade-off).

Uncertainty. When real-world scientific testing cannot validate results, inferential uncertainty persists.

Harm. When secondary screening to reduce uncertainty carries serious costs and risks, resultant harm may outweigh benefit for the overwhelming majority of affected people.

That’s why, as former Sandia senior scientist Al Zelicoff wrote in The Washington Post, the NAS polygraph report concluded mass screening of lab scientists would be “Worse Than Worthless.” https://www.washingtonpost.com/archive/opinions/2003/05/27/polygraphs-worse-than-worthless/dcc1bd64-dd0e-4b0d-802e-61f652594d7a/?utm_term=.93c42d3fffbf.

The same goes for mass screenings for low-prevalence problems as a class: according to the implications of probability theory, they risk net damaging exactly the values and people they’re ostensibly intended to protect.

Image: https://commons.wikimedia.org/wiki/File:Polygraph_Test.jpg; citing source https://federalsoup.com/articles/2019/06/26/bill-would-omit-polygraph-requirement-for-certain-cbp-applicants.aspx.

13 of 40

14 of 40

15 of 40

Case Study: UTIs in Primary Care

Table 1. Summary statistics for laboratory tests and initial antibiotic prescribing

Year  | All tested                               | Positive test            | Negative test
      | N       Bacterial rate  Prescribing rate | N       Prescribing rate | N       Prescribing rate
2010  | 17,513  0.37            0.39             | 6,411   0.60             | 11,102  0.27
2011  | 21,237  0.39            0.39             | 8,305   0.60             | 12,932  0.25
2012  | 27,169  0.39            0.39             | 10,510  0.61             | 16,659  0.25
Total | 65,919  0.38            0.39             | 25,226  0.61             | 40,693  0.26

16 of 40

So is that it? Is everything known/solved?

It’s not tech that allows us to escape the accuracy-error trade-off. It’s that we get an ML insight into when to use algorithmic decision-making vs when people do a better job. So it’s how we use the tech that allows us to decrease at least type II errors… But…

What was the goal? Policymakers wanted to lower antibiotic resistance. There are two paths to resistance: overuse and underuse. Appendix G: “When physicians treat instantly and bacteria are found, these have one to five percentage points lower resistance levels than when physicians decide to wait and bacteria are found.” So they decreased the false positive error rate, but what about the false negative rate?

We don’t know, and it matters…

They say they achieve a 20.3% decrease in overprescribing with the middle-risk range delegation approach (section 6.2; abstract). They also say (Table 2) counterfactual outcomes for 2011 and 2012 show 17.9% [95% CIs 15.1-20.8] change in overprescribing.

They say there’s initial underprescribing of 39% and overprescribing of 26% (section 3.4). They don’t say how much/whether they also decrease underprescribing.
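The 39% and 26% figures can be recovered from the Total row of Table 1, under one reading of the definitions: underprescribing as the share of positive tests with no initial prescription, and overprescribing as the share of negative tests with an initial prescription. That reading is an assumption made here because it reproduces the quoted numbers; the paper’s section 3.4 is the authority on the exact definitions.

```python
# Total row of Table 1 (2010-2012).
n_positive = 25_226
n_negative = 40_693
prescribing_rate_positive = 0.61  # initial prescription given, test later positive
prescribing_rate_negative = 0.26  # initial prescription given, test later negative

n_tested = n_positive + n_negative

# Assumed definitions (see lead-in): errors relative to the later lab result.
underprescribing = 1 - prescribing_rate_positive  # bacterial, but not treated initially
overprescribing = prescribing_rate_negative       # not bacterial, but treated initially

# Consistency checks against the "All tested" columns of the same row.
bacterial_rate = n_positive / n_tested
overall_prescribing_rate = (
    prescribing_rate_positive * n_positive + prescribing_rate_negative * n_negative
) / n_tested

print(f"underprescribing:         {underprescribing:.2f}")
print(f"overprescribing:          {overprescribing:.2f}")
print(f"bacterial rate:           {bacterial_rate:.2f}")
print(f"overall prescribing rate: {overall_prescribing_rate:.2f}")
```

Laying the two error rates side by side makes the open question concrete: a reported drop in the second number says nothing by itself about what happened to the first.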

So there is a cost to false negatives that accrues to the value we care about most, antibiotic resistance.

We might also care about whether false negatives increase while false positives decrease, and in what subgroups…

(Possible normative concerns: patient QOL, complications, fairness implications.)

17 of 40

Bias-variance trade-off

Image: The American Academy of Political and Social Science, https://www.aapss.org/fellow/stephen-e-fienberg/.

A typical problem in AI and other areas where scientific modeling is used to predict outcomes: the accuracy in the lab is way higher than the accuracy in the field. Why?

Because, as with accuracy and error in probability theory, we’re generally stuck trading off between bias and variance in fitting a ML model.
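For reference, the textbook form of this trade-off under squared-error loss is sketched below. The notation is assumed for illustration: y = f(x) + ε is the outcome with noise ε of mean zero and variance σ², and f-hat is the fitted model, with expectations taken over training samples and noise. Classification losses do not decompose this cleanly, so treat it as the standard regression version of the idea rather than a statement about any specific system discussed here.

```latex
\mathbb{E}\!\left[\bigl(y - \hat{f}(x)\bigr)^{2}\right]
= \underbrace{\bigl(\mathbb{E}[\hat{f}(x)] - f(x)\bigr)^{2}}_{\text{bias}^{2}}
+ \underbrace{\mathbb{E}\!\left[\bigl(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\bigr)^{2}\right]}_{\text{variance}}
+ \underbrace{\sigma^{2}}_{\text{irreducible noise}}
```

More flexible models tend to lower the bias term while raising the variance term, which is one reason a model tuned on a narrow lab sample can look accurate there and degrade in the field.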

This typical lab-field accuracy cliff-face tends to create concerns about fairness, since often the lab sample was a convenience CS one dominated by white males, as in iBorderCtrl’s training dataset. We don’t know much about Chat Control’s training data to assess this; such nontransparency problems are endemic, not just in security but also, e.g., in educational and research integrity contexts where the software being used to assess “trustworthiness” is proprietary. But it’s safe to say the real world usually has more variance than the lab due to all sorts of selection effects…

And this variance often involves subgroups along traditional axes of ascriptive discrimination or bias in that other sense of the term (taste-based discrimination/prejudice). One plausible reason is that, by definition, minorities tend to be less sampled, creating less representative subgroup datasets even from representative samples. There are also all sorts of response biases implicated here. For example, in health and other research you usually see healthy volunteer bias, in which people who are unwell, are disabled, have mental health problems, or are less well off socio-economically or less well-educated are all less likely to participate in research as volunteers, even when that research targets everyone. (Example: The UK Biobank database targeted 9 million NHS patients, but wound up with a highly selected sample. See citations here: https://wildetruth.substack.com/p/designer-baby-maybe.)

These sorts of selection effects raise questions about advancing overall accuracy at the cost of compromising subgroup fairness in algorithmic decision-making…

Here’s an example where we care about this tension…

18 of 40

Consider the case of mammography for early breast cancer detection. The Harding Center for Risk Literacy summarizes mammography’s benefits and risks according to the best available evidence here.

(There is also a great analogue in PSA testing for early prostate cancer screening for men, with a similar Harding fact box and similar takeaway.)

The benefit: In 1,000 women who participated in mammography screening for around 11 years, researchers estimate that one breast cancer death was prevented. (Gøtzsche estimated it at one in 2,000 over 10 years.) It’s not clear that this translates to an all-cause mortality benefit. In other words, mammography has not been demonstrated to save lives. It’s fair game to home in on that one saved life from breast cancer; it’s also fair game to wait for the all-cause mortality jury to come back, although then you’ll probably be waiting for a long time. Mammography may be one of the best-studied medical interventions ever (10 randomized trials at Welch et al’s writing), and we still don’t have enough data to know if it net saves lives, just that it net saves about one life in 1,000 from breast cancer death.

The risk: In turn, 100 of 1,000 mammography screened women had false alarms, often leading to biopsies — and 5 with non-progressive breast cancer unnecessarily had partial or complete removal of a breast. That usually means DCIS (Ductal Carcinoma in Situ). These biopsies may contribute to the spread of cancer cells (we don’t know). These mastectomies are major surgery, with all its risks (including death). There are concerns that radiation exposure from repeated mammography screening may contribute to cancer incidence and deaths. Similarly, there are also concerns that mammography compression and needle biopsy may contribute to the spread of some cancers (we don’t know).

So it’s the same story as with other mass screenings for low-prevalence problems, where the common (false positives) overwhelms the rare (true positives). But, notably, in this context that means secondary screening harms affect 100/1000 women – 10%.
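Put in frequency format, the benefit-harm comparison above reduces to a few ratios. This sketch uses only the per-1,000 figures quoted above from the Harding Center fact box; the variable names are illustrative, and the ratios are simple divisions rather than estimates from any new data.

```python
# Per 1,000 women screened for around 11 years (figures quoted in the text above).
screened = 1000
breast_cancer_deaths_prevented = 1
false_alarms = 100
unnecessary_breast_removals = 5

# Share of screened women experiencing a false alarm.
false_alarm_share = false_alarms / screened

# Harms expressed per breast cancer death prevented.
false_alarms_per_death_prevented = false_alarms / breast_cancer_deaths_prevented
removals_per_death_prevented = unnecessary_breast_removals / breast_cancer_deaths_prevented

print(f"false alarm share:                {false_alarm_share:.0%}")
print(f"false alarms per death prevented: {false_alarms_per_death_prevented:.0f}")
print(f"removals per death prevented:     {removals_per_death_prevented:.0f}")
```

Substituting Gøtzsche’s estimate of one death prevented per 2,000 women over 10 years would need matching harm figures for that denominator, so it is deliberately left out here.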

***

A brief meta-science note on the multiple discourses involved here, and their usual mistakes: The cited article in the last instance, Förnvik et al’s “Detection of circulating tumor cells and circulating tumor DNA before and after mammographic breast compression in a cohort of breast cancer patients scheduled for neoadjuvant treatment” in Breast Cancer Research and Treatment, 177, p. 447–455 (2019), makes classic mistakes in interpreting statistical significance, but the evidence it reports suggests that compression-related spread could be a problem: We don’t know. Some case reports are concerning. Nonetheless, there are a lot of “misinformation” debunking search results stating in no uncertain terms that mammography compression does not contribute to cancer spread, when the reality is that we don’t know. This appears to be another instance of uncertainty aversion contributing to misinformation in misinformation discourse itself, as is also common, e.g., in the Covid and abortion discourses (no surprise, since calling something “misinformation” in the first place is a political act). This reflects a pervasive pattern of ambiguous evidence, dichotomaniacal scientific misinterpretation, and parrot-like misrepresentation of uncertainty as certainty in misinformation discourse. The pattern recurs in the case of the risk of spread from needle biopsy, as in many other contexts. Experts often tell subjects to trust them, not worry, and undergo the procedures that they implement. But they have structural incentives to do this.

19 of 40

20 of 40

21 of 40

22 of 40

So several leading experts, like evidence-based medicine pioneer Susan Bewley and leading medical methodologist Peter Gøtzsche, hold that mass mammography doesn’t meet the evidentiary bar for establishing net benefit as opposed to risking net harm, and the Harding Center fact box shows why. But the general trend is to do age-based mass screening anyway, and you tend to see pushes for expansion instead of retraction of these programs.

The UK is a good example: the NHS offers age-based mass screening, plus additional screening for high-risk women. And the UK National Screening Committee (UK NSC) recently held a workshop for experts to discuss more personalized breast cancer screening based on risk. They evaluated different models for advancing from a mass to a targeted subgroup screening program…

This is the abstract from a recent journal article about that workshop: https://bmcproc.biomedcentral.com/articles/10.1186/s12919-024-00306-0

One of the concerns they addressed in the workshop has to do with accuracy vs fairness in subgroups. The subgroups they seem particularly worried about are racial/ethnic ones. Why might that be?

(Some reasons:

Healthy volunteer bias;
less screening participation;
smaller numbers/power concerns;
normative social justice concerns.)

This raises the crucial question of increasing accuracy through risk stratification FOR WHOM. Fairness issues tend to be implicated in these sorts of models, if only because we can generally make accuracy better for some groups (like the ethnic majority, using national registry data or whatnot) and not necessarily for others.

***

But those subgroups are only some of the more affected ones. If we care about cancer deaths instead of cancer cases as the outcome, which we arguably should if we want to avoid overdiagnosis, then there’s another subgroup we really care about that’s not mentioned in this meeting report…

23 of 40

How bad are these mortality rates? Very bad. Some typical mortality rate estimates: 34% for postpartum breast cancer (PPBC) diagnosed within five years of last birth, and 22% for PPBC diagnosed up to 10 years after birth.

https://wildetruth.substack.com/p/show-me-the-mommies-part-1

It’s totally normal in the medical literature for analysts to adjust for whether women have given birth recently, and how recently. But mothers are at substantially increased risk of lethal breast cancer. Modeling away this information is a different strategy from zooming in on this at-risk subgroup. There’s a case for zooming in, instead: Postpartum breast cancer (PPBC) is around 2-5x more likely to metastasize and thus kill than other types. How do we know?

In a pooled analysis of 15 prospective studies, Nichols et al 2018 found childbirth substantially increased women's breast cancer risk, peaking around five years postpartum, with the association reversing in sign only around 24 years after birth.

In a retrospective cohort study of 619 women aged ≤45 years diagnosed with breast cancer during 1981-2011 (the Colorado Young Women’s Breast Cancer Cohort), Callihan et al 2013 found cases diagnosed <5 years postpartum had substantially increased risk of distant recurrence (31.1 versus 14.8 percent) compared to nulliparous cases, and substantially lower five-year overall survival (65.8 percent versus 98 percent). Even after adjustment for biologic subtype, stage, and year of diagnosis, PPBC was more likely to metastasize and kill.

In a cohort study of 701 women aged ≤45 diagnosed with breast cancer (also using cases from the Colorado cohort, but now diagnosed between 1981 and 2014), Goddard et al 2019 replicated and extended these findings to show that PPBC diagnosed within 10 years of birth was also associated with increased lymphovascular invasion, lymph node involvement, and metastasis risk: around 2x higher distant metastasis risk on average compared to nulliparous cases, going up to 3.5-5x higher risk for cases diagnosed at stage I or II, and remaining significant after adjustment for other predictive features.

***

[Could mention the name change from PABC to PPBC, the different stories about what/when/why, Pepper Schedin’s group’s work on lactation remodeling and cancer cell invasion of lymphatic latticework tissue, and the irony, since this is a breastfeeding -> cancer story and the conventional wisdom says the opposite. Again, the point is mechanisms matter.]

***

So let’s briefly take a closer look at the models the UK NSC reviewed.

24 of 40

Now, the models the UK NSC reviewed seem to be interested in age (as is standard), and it’s also standard to account for reproductive history in doing further risk stratification. But passing up on the causal modeling part of that accounting is a big mistake, because accounting for age becomes complicated once we also consider parity.

This matters practically because the PPBC effect takes about 25 years to reverse – affecting the current UK mammography screening age range of 50-70. So this methodological problem already afflicts current UK risk-stratified breast cancer screening, and also threatens efforts to improve it with more risk stratification.

Again, the problem is that we need to diagram causality before running analyses to visualize how age and parity contribute to deaths.

***

One of the other things you might notice in the models they’re comparing is that they ask a range of different primary research questions about risk stratification. Pashayan is asking about optimal risk thresholds to improve the cost-benefit balance of risk-targeted screening. Bhatt is asking about optimal screening strategies by starting and ending age, frequency of screening, and different baselines (no screening or current screening). Hill compares the cost-effectiveness of 8 different proposals for risk-stratified screening vs current or no national screening. Antoniou is about improving personalization.

This is another instantiation of what must be a now-familiar problem: decisionmakers (policymakers, practitioners, stakeholders, whoever you want to zoom in on) don’t necessarily agree on what they’re maximizing or how to balance different values.

This is actually the same problem on another level. Because if we agree we care about reducing deaths (not cases), then we have to causally model deaths, and that means risk stratifying for metastases, one of the main risk factors for which is parity – but in complex relation to age…

25 of 40

https://wildetruth.substack.com/p/rethinking-risk

We can’t just code age and parity as risk factors, throw them in a model, and have a nice day without making further assumptions about how the two causally relate.

The reason is that they manifest opposing base rate trends: Mothers face heightened risks of deadly breast cancer that rise with age at first birth — older first-time mothers risk more postpartum breast cancer cases, and those cases have this heightened lethality risk compared to postmenopausal BC. But, at the same time, younger mothers diagnosed with postpartum breast cancer suffer still higher metastasis rates and fatality likelihoods by case.

This makes the literature sound paradoxical (because it is imprecise and usually lacking in causal diagramming): it seems PPBC can’t be all at once highly lethal across the board, more common in older mothers, and yet more lethal in younger and more recent mothers.

The moral of the story: Subgroup comparisons are harder than they seem.

***

So what’s going on?

A probable causal story here is that cancer risks increase with age anyway (as immune systems get worse at chasing down aberrant cells). But, by the same token, probably the younger you are when you have a detected cancer, the more virulent it had to be, to overcome comparatively strong homeostasis to be detected. So older moms have more PPBC cases, but younger moms have more lethal ones.

Meanwhile, pregnancy creates a carcinogenic hormonal and chemical environment (with estrogen and/or supplemental folic acid perhaps contributing to cancer cell growth, and progesterone suppressing normal immune activation against it). At the same time, it usually triggers a breast tissue remodeling process that includes both constructive differentiation (maybe a long-term net positive for cancer risk because differentiated tissue is less dangerous vis-a-vis cancer) and destructive/reconstructive weaning programs (which appear to be very dangerous vis-a-vis cancer, like other wound healing processes, sometimes giving cancer cells the deadly opportunity to seed lymphatic tissue). So PPBC is particularly lethal for women across the board, compared to the postmenopausal breast cancers most people have in mind when they go in for mammograms. Some of each type are being caught in the UK’s screening program for 50-70 year olds, because PPBC risks have a long tail (decades). But the timing of the PPBC also matters: Both PPBC risk and its associated lethality spike after birth before decreasing for decades.

Overall, this suggests it may be suboptimal to look at breast cancer risk by age category, if what we really care about is death risk. That may skew the death risk calculus in exactly the wrong direction, since older women carry higher cancer risks, but younger moms who are diagnosed will have some of the worst prognoses. It may be useful to think of this problem in terms of two subgroup effects that vary by age: (1) the base rate of breast cancer increases with age, while (2) the base rate of death given breast cancer seems to decrease with it. In other words, the Scylla of more diagnosed breast cancer cases threatens older mothers, while the Charybdis of more fatal diagnoses threatens younger mothers.
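
To make the two opposing subgroup effects concrete, here is a minimal Python sketch. All numbers and names (`AGE_BANDS`, `deaths_per_1000`, `highest`) are hypothetical and chosen only for illustration; they are not estimates from Nichols, Callihan, Goddard, or any other study. It shows how ranking groups by diagnosed cases and ranking them by expected deaths can point in opposite directions.

```python
# Illustrative sketch only: every number below is made up for the example.
# incidence_per_1000: diagnosed breast cancer cases per 1,000 mothers
# case_fatality: assumed probability of death given a diagnosis
AGE_BANDS = {
    "younger mothers": {"incidence_per_1000": 2.0, "case_fatality": 0.30},
    "older mothers": {"incidence_per_1000": 5.0, "case_fatality": 0.10},
}


def deaths_per_1000(band):
    """Expected deaths per 1,000 = cases per 1,000 x P(death | case)."""
    return band["incidence_per_1000"] * band["case_fatality"]


def highest(metric):
    """Return the name of the age band with the largest value of `metric`."""
    return max(AGE_BANDS, key=lambda name: metric(AGE_BANDS[name]))


most_cases = highest(lambda band: band["incidence_per_1000"])
most_lethal = highest(lambda band: band["case_fatality"])
most_deaths = highest(deaths_per_1000)

if __name__ == "__main__":
    print("Most diagnosed cases:", most_cases)
    print("Highest death risk given diagnosis:", most_lethal)
    print("Most expected deaths:", most_deaths)
```

With these made-up inputs, the older band has more cases but the younger band has more expected deaths, so stratifying on case risk alone would prioritize the wrong group if deaths are the outcome of interest. Different inputs can reverse that, which is the point: the answer depends on both base rates.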

***

The methodological point is, you have to diagram causality first in order to stratify screening appropriately, because causality (even just in relation to simple demographics like age and parity) can get complicated fast. Otherwise, you are making models that risk introducing more bias than they correct for, such as collider-stratification bias (https://wildetruth.substack.com/p/simpsons-paradox-and-existential).
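
For readers unfamiliar with collider-stratification bias, a generic simulation may help. The sketch below is not a model of breast cancer screening: the variables `x`, `y`, and `selected`, the threshold, and the sample size are all assumptions made for illustration. Two causes are generated independently; both raise the chance of being selected into a stratum (the collider); within that stratum they appear negatively correlated even though neither affects the other.

```python
import random


def pearson(a, b):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def simulate(n=20_000, threshold=1.0, seed=0):
    """Return (correlation in everyone, correlation within the selected stratum)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]  # hypothetical cause 1
    y = [rng.gauss(0, 1) for _ in range(n)]  # hypothetical cause 2, independent of x
    # Collider: selection into the stratum depends on both causes plus noise.
    selected = [xi + yi + rng.gauss(0, 1) > threshold for xi, yi in zip(x, y)]
    x_sel = [xi for xi, keep in zip(x, selected) if keep]
    y_sel = [yi for yi, keep in zip(y, selected) if keep]
    return pearson(x, y), pearson(x_sel, y_sel)


if __name__ == "__main__":
    r_all, r_selected = simulate()
    print(f"correlation in full population: {r_all:+.3f}")
    print(f"correlation within selected stratum: {r_selected:+.3f}")
```

The design choice here is to stratify on a variable that both causes feed into. Drawing the diagram first is what tells you that this variable is a collider and that conditioning on it manufactures an association.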

But this doesn’t appear to have been done: there is not a DAG (directed acyclic graph) in sight in the published, publicly available models the UK NSC reviewed.

The committee also doesn’t appear to have addressed concerns that screenings may unintentionally, causally contribute to cancer cases and/or cancer progression (e.g., through mammography compression or follow-up needle biopsy circulating cancer cells to other nearby tissues – e.g., lymphatic tissues – they can then invade, or through radiation exposure — particularly as it accumulates from repeated screenings). This possible iatrogenesis complicates calculations of net costs and benefits. This is not a fringe concern, though it is often dismissed as misinformation (see the meta-science note in https://wildetruth.substack.com/p/book-review-overdiagnosed).

***

There are also difficult, important racial/ethnic subgroup stories in the breast cancer case study that the UK NSC does a better job nodding to:

Healthy (privileged?) volunteer bias extends to race/ethnicity; minorities less likely to participate in screening, be represented in data used for (gah) polygenic risk scoring.
This matters: “Another consideration for a risk-stratified NHS BSP is that women in the most deprived groups are generally less likely to participate in breast screening (relative risk (RR) 0.89 for the most deprived groups compared with the least deprived) [50] but are more likely to die from BC [1].”
But… they don’t come out and acknowledge that this may be because of racial/ethnic subgroup vulnerabilities to more lethal cancer subtypes in the first place (so screening may not matter, or may not matter as much – it doesn’t help patients survive longer if we just catch more untreatable cancers). E.g., some black women (possibly of Western versus Eastern African ancestry) are at heightened risk for particularly lethal triple-negative breast cancer.

26 of 40

Persistent Uncertainties

When and how do we need to worry about reverse causality – screening/intervention contributing to exactly the problem it seeks to mitigate?
What about acceptability of algorithmic decision-making?
What about Ullrich’s concerns about strategic actors trying to game when they know there’s a threshold – equilibrium effects?
Where could there be heterogeneity in how people will use the tech (e.g., automation bias and its opposite)?
Under what conditions can we solve the validation problem? How generalizable are Ullrich’s results?

On reverse causality in the mammography case (https://wildetruth.substack.com/p/book-review-overdiagnosed):

There are concerns that radiation exposure from repeated mammography screening may contribute to cancer incidence and deaths. Similarly, there are also concerns that mammography compression may contribute to the spread of some cancers (we don’t know).

***

The UK NSC meeting report on acceptability (https://bmcproc.biomedcentral.com/articles/10.1186/s12919-024-00306-0):

Acceptability

A woman’s categorisation may change over time if she has repeat risk assessments. Further evaluation is needed to see whether repeat assessments and changes in screening offered are preferable to a one-off risk assessment.

The meeting considered the likelihood of a risk-stratified screening programme leading to increased anxiety and to confusion for women whose categorisation changed due to a repeated risk assessment (caused by some of the modifiable risk factors changing). Such concerns may be unfounded, even for women informed that they are at high risk [40].

Concerns were raised about the acceptability of safely extending screening intervals for women at lower risk, with one study showing 51% of women saying they would not accept less screening and 37% saying they would not accept stopping screening altogether [41].

There is some existing UK evidence supporting the need to undertake additional acceptability research [42] related to less frequent screening for low-risk women. Both women and health care professionals are less enthusiastic about reduced screening in low risk but in favour of more screening for high risk. Any change to the programme must consider safety and acceptability alongside clinical and cost effectiveness and consider how to communicate to women offered screening, for example by developing decision aids and counselling programmes [43].

There are still uncertainties about the effects that introducing risk stratification might have on screening uptake, as conducting a risk assessment may discourage some women from screening. It will be important to consider the impact of implementing AI on the accuracy, time to report, cost, and acceptability to women undergoing tailored screening.

27 of 40

Image: Cristian Faezi & Omar Vidal.

This is a vaquita caught in a fishing net. It’s a species of porpoise native to the northern Gulf of California, it’s endangered, and it is the world’s rarest marine mammal – there may be as few as 10 left in the world (https://www.worldwildlife.org/species/vaquita).

Bycatch is a big problem, and not just for vaquitas. It kills about 38 million tons of marine animals annually, or 40% of fish that are caught (https://www.fishforward.eu/en/project/by-catch/). For obvious reasons (the size of the net holes), it’s worse for some fishing operations than for others. E.g., in shrimp fishing, 1 kg of shrimp produces 5-20 kg of bycatch. Annually, bycatch kills about 300,000 small whales and dolphins, 250,000 endangered turtles, and 300,000 birds.

***

Image from https://eia-international.org/blog/victory-for-maui-dolphins-as-us-court-bans-some-fish-imports-from-new-zealand-over-fishing-practices.

28 of 40

Application	Target	Bycatch
Trawling	Tuna	Dolphin
Polygraph	Spies, Terrorists	Non-spies, non-terrorists
iBorderCtrl	Bad crossings	Innocent crossings
ChatControl	CSAM	Innocent coms
Asymptomatic cancer screenings	Deaths	Healthy people
Lifestyle diseases	Big problems	Mild cases
Advanced medical imaging	See problems	Harmless anomalies
Educational ethics	Plagiarism and AI use in writing	Innocent students
Misinformation	Provably wrong	Ambiguity, dissent
Disinformation	Hostile propaganda	Counterpoint

Under bycatch: Talk about specific harms. In the polygraph case, take false positives: these investigations can go into all kinds of personal terrain; there aren’t the protections people might imagine for religious, political, or sexual behavior, orientation, or belief; these investigations can tear people apart. With iBorderCtrl and Chat Control, even worse. Then there are false negatives: a false sense of security. Those are positive or direct harms; we should also consider the negative or indirect harms resulting from limited resources being used for mass instead of targeted screening and intervention. E.g., investigative resources used up on false positives can’t be used to investigate specific spy, migration, or child abuse concerns. In addition to resource constraints, perverse incentives (quotas, quantification, uncertainty aversion) and cognitive bias also drive societal adoption of this type of trawling program and the problems it causes.

What follows (next slide) is my explanation of why these all have the same mechanism.

Lifestyle diseases - thresholding changes driven by industry - milder cases of lifestyle diseases like hypertension, diabetes, high cholesterol, and osteoporosis.

Imaging advances. E.g., CT, MRI, ultrasound -> incidentalomas, meniscal damage (40%), gallstones (10%).

***

Mass screenings for low-prevalence problems are common across diverse domains. Here are a few you might be familiar with. They all share the same mathematical structure. They’re sometimes called signal detection problems. Signal detection is the most abstract framing of any problem where we are trying to figure out whether a signal is there in some noise, or not.

Security:

Chat Control was Pirate MEP Patrick Breyer’s name for the provision in an EU CSAM law that would require client-side scanning of digital communications.
iBorderCtrl was an EU Horizon 2020-funded automated border security system that was tested on non-EU citizens at non-Schengen (external) borders a few years ago.
Polygraphs (aka lie detectors) are 20th-century devices that collect physiological recordings (usually cardiovascular, respiratory, and skin conductance), monitored through software, within an interrogation framework structured to test for lie responses, or sometimes for guilty knowledge responses, which may work better.

Medicine:

Asymptomatic cancer screenings. E.g., mammography, PSA testing, colonoscopy.
Threshold changes. E.g., hypertension, diabetes, high cholesterol, and osteoporosis.
Imaging advances. E.g., CT, MRI, ultrasound -> incidentalomas, meniscal damage (40%), gallstones (10%).

Education:

Vanderbilt University announced on August 16, 2023 that it “decided to disable Turnitin’s AI detection tool for the foreseeable future.” https://www.vanderbilt.edu/brightspace/2023/08/16/guidance-on-ai-detection-and-why-were-disabling-turnitins-ai-detector/ Turnitin is a U.S. company that sells software licenses to unis and high schools to screen student assignments for plagiarism and AI use in writing. It sounds like Vanderbilt still uses the software for plagiarism screening, just not its AI detection tool.

Social media:

Governments have delegated screening for digital misinformation (wrong facts) to tech companies, while employing a mixture of governmental and corporate tactics against disinformation (usually construed as hostile government misinformation) - like blocking RT in huge swaths of the world.
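
The shared mathematical structure of these screening problems can be sketched with Bayes’ rule. The function name `screening_outcomes` and the inputs (population, prevalence, sensitivity, specificity) are hypothetical values chosen for illustration, not figures for any of the programs listed above.

```python
def screening_outcomes(population, prevalence, sensitivity, specificity):
    """Expected confusion-matrix counts for a mass screening program.

    prevalence: share of the population that truly has the condition
    sensitivity: P(test positive | condition present)
    specificity: P(test negative | condition absent)
    """
    affected = population * prevalence
    unaffected = population - affected
    true_pos = affected * sensitivity
    false_neg = affected - true_pos
    true_neg = unaffected * specificity
    false_pos = unaffected - true_neg
    # Positive predictive value: share of flagged cases that are real targets.
    ppv = true_pos / (true_pos + false_pos)
    return {
        "true_pos": true_pos,
        "false_pos": false_pos,
        "false_neg": false_neg,
        "true_neg": true_neg,
        "ppv": ppv,
    }


if __name__ == "__main__":
    # Assumed inputs: a rare target and a test that is right 90% of the time.
    result = screening_outcomes(
        population=100_000, prevalence=0.001, sensitivity=0.90, specificity=0.90
    )
    for name, value in result.items():
        print(f"{name}: {value:.4g}")
```

Under these assumed inputs, false positives (bycatch) outnumber true positives (targets) by roughly a hundred to one, even though the test is correct 90% of the time in each group. In this sketch, a tenfold increase in prevalence (as in targeted rather than mass screening) shrinks that ratio about tenfold.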

29 of 40

Education:

Vanderbilt University announced on August 16, 2023 that it “decided to disable Turnitin’s AI detection tool for the foreseeable future.” https://www.vanderbilt.edu/brightspace/2023/08/16/guidance-on-ai-detection-and-why-were-disabling-turnitins-ai-detector/ Turnitin is a U.S. company that sells software licenses to unis and high schools to screen student assignments for plagiarism and AI use in writing. It sounds like Vanderbilt still uses the software for plagiarism screening, just not its AI detection tool.
Specifically for AI use in writing, there’s research suggesting AI designed to catch it is biased against non-native English speakers: https://www.sciencedirect.com/science/article/pii/S2666389923001307.
Digital Science is a tech company that recently acquired Ripeta, Leslie McIntosh’s “Trust Markers” software designed to catch suspicious research. Information is unavailable about the tech’s accuracy, reliability, or method of validation. At least one of the intended use cases is mass screenings for low-prevalence problems — checking research for trust markers and intervening “before publication, if needed” (Figure 2, p. 3) at the institutional level (https://wildetruth.substack.com/p/algorithmic-hype-perverse-incentives).

30 of 40

31 of 40

Elena feedback: Draw out earlier harms of biopsies - anxiety, disfigurement, infection, death; uncertainty about needle biopsies, compression, radiation increasing cancer risk; uncertainty about net death risk/benefit.

***

Mass screenings for low-prevalence problems are common across diverse domains. Here are a few you might be familiar with. They all share the same mathematical structure. They’re sometimes called signal detection problems. Signal detection is the most abstract description of any problem where we are trying to figure out whether or not a signal is there in some noise.

Security:

Chat Control was Pirate MEP Patrick Breyer’s name for the provision in an EU CSAM law that would require client-side scanning of digital communications.
iBorderCtrl was an EU Horizon 2020-funded automated border security system that was tested on non-EU citizens at non-Schengen (external) borders a few years ago.
Polygraphs (aka lie detectors) are 20th-century devices that collect physiological recordings, usually cardiovascular, respiratory, and skin conductance, monitored through software within an interrogation framework structured to test for lie responses or, sometimes, for guilty knowledge responses, which may work better.

Medicine:

Asymptomatic cancer screenings. E.g., mammography, PSA testing, colonoscopy.
Threshold changes. E.g., hypertension, diabetes, high cholesterol, and osteoporosis. (Adding 09/27/24: gestational diabetes. Already had it in talk on slide 9.)
Imaging advances. E.g., CT, MRI, ultrasound -> incidentalomas, meniscal damage (40%), gallstones (10%).

Education:

Vanderbilt University announced on August 16, 2023 that it “decided to disable Turnitin’s AI detection tool for the foreseeable future.” https://www.vanderbilt.edu/brightspace/2023/08/16/guidance-on-ai-detection-and-why-were-disabling-turnitins-ai-detector/ Turnitin is a U.S. company that sells software licenses to unis and high schools to screen student assignments for plagiarism and AI use in writing. It sounds like Vanderbilt still uses the software for plagiarism screening, just not its AI detection tool.
Specifically for AI use in writing, there’s research suggesting AI designed to catch it is biased against non-native English speakers: https://www.sciencedirect.com/science/article/pii/S2666389923001307.
Digital Science is a tech company that recently acquired Ripeta, Leslie McIntosh’s “Trust Markers” software designed to catch suspicious research. Information is unavailable about the tech’s accuracy, reliability, or method of validation. At least one of the intended use cases is mass screenings for low-prevalence problems — checking research for trust markers and intervening “before publication, if needed” (Figure 2, p. 3) at the institutional level (https://wildetruth.substack.com/p/algorithmic-hype-perverse-incentives).

Social media:

Governments have delegated screening for digital misinformation (wrong facts) to tech companies, while employing a mixture of governmental and corporate tactics against disinformation (usually construed as hostile government misinformation) - like blocking RT in huge swaths of the world.

32 of 40

Application	Target	Bycatch
Trawling	Tuna	Dolphin
Polygraph	Spies, Terrorists	Non-spies, non-terrorists
iBorderCtrl	Bad crossings	Innocent crossings
ChatControl	CSAM	Innocent coms
Asymptomatic cancer screenings	Deaths	Healthy people
Lifestyle diseases	Big problems	Mild cases
Advanced medical imaging	See problems	Harmless anomalies
Educational ethics	Plagiarism and AI use in writing	Innocent students
Misinformation	Provably wrong	Ambiguity, dissent
Disinformation	Hostile propaganda	Ambiguity, dissent


33 of 40

A Dangerous Structure

Mass screenings for low-prevalence problems (MaSLoPP)
Signal detection
Shared structure.
Shared problems.

Image: Russell Lee, 1942, public domain. Coolidge, Pinal County, Arizona. Casa Grande Farms, FSA (Farm Security Administration) project. Pigs at a feed trough.

My magic trick: These look very different - but underneath, they share a common mathematical structure.

These are all signal detection problems. Security, medicine, education, digital com platforms - authority or expert trying to pick out a signal in noise.

The shared-structure insight comes from signal detection theory. The classic citation is Green and Swets 1966 (http://andrei.gorea.free.fr/Teaching_fichiers/SDT%20and%20Psytchophysics.pdf). The theory originated in U.S. Army research to measure radar operators’ performance. The “signal” refers to whatever the test is trying to detect. It could be the presence of a crime, disease, or some other low-prevalence thing that we might want to screen for across these very diverse contexts.
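
The shared structure can be made concrete with natural frequencies, the format Gigerenzer and Hoffrage recommend. The sketch below is illustrative only: the function names and every input (a population of 100,000, a prevalence of 1 in 1,000, and 90% sensitivity and specificity) are assumptions chosen for round numbers, not figures from any program discussed here.

```python
def screening_counts(population, prevalence, sensitivity, specificity):
    """Expected (TP, FN, FP, TN) counts for one round of mass screening."""
    affected = population * prevalence
    unaffected = population - affected
    true_pos = affected * sensitivity        # signals correctly flagged
    false_neg = affected - true_pos          # signals missed
    true_neg = unaffected * specificity      # noise correctly cleared
    false_pos = unaffected - true_neg        # noise wrongly flagged (the bycatch)
    return true_pos, false_neg, false_pos, true_neg


def positive_predictive_value(true_pos, false_pos):
    """Share of positive results that are real signals."""
    return true_pos / (true_pos + false_pos)


# Hypothetical inputs, chosen only for round numbers
tp, fn, fp, tn = screening_counts(100_000, 0.001, 0.90, 0.90)
print(f"true positives:  {tp:,.0f}")
print(f"false positives: {fp:,.0f}")
print(f"false negatives: {fn:,.0f}")
print(f"share of positives that are real: {positive_predictive_value(tp, fp):.1%}")
```

With these assumed inputs, about 90 real signals sit among roughly 9,990 false alarms, so fewer than 1 in 100 positives is a true one. The same arithmetic applies whether the "signal" is a spy, a tumor, or a plagiarized essay.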


34 of 40

So that’s a bit about how the structure scales up in the abstract to different kinds of programs and technologies. What about where this class of dangerously structured programs stops?

There are a lot of controversial programs that are mass preventive interventions (not screenings) for low-prevalence problems. So they do something instead of testing for something. Notably, mandatory Covid vaccination and boosting in children and young people is one; the discredited DARE program in the US and other places is another.

Random screening also often has problems.

And in terms of scaling up bycatch, laws writ large are basically all bycatch, almost all of the time. The law to not murder applies to all of us, and hopefully none of us is a murderer.

To pick a more mundane example, there’s a lot of bureaucracy around medical records - data releases about offices keeping the data, some offices won’t email them, only mail or fax them, or make you come in - because of strong data protection laws in Germany due to the Nazi and Stasi legacies. But when I want my bloodwork and can’t get it without going out when it’s below freezing to pay for and pick up a physical printout, that’s bycatch.

One of the reasons to extend the logic out this far is to make the point that the argument isn’t that these programs are always bad — we need laws. Sometimes we want them, sometimes we don’t. It’s not that I hate screening. It’s that we need the net harm/benefit analysis to know which programs help instead of hurting society.

My focus is on this smallest subset of this structure, mass screenings for low-prevalence problems, because these screening contexts make for especially clear logic and math, because the outcomes are binary, and the outcome of interest is already defined by the test structure – unlike in Covid vaccination, where it could be disputed what the outcome of interest is.
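
A net harm/benefit analysis of a binary screening outcome can be sketched as a weighted count of the outcome cells. Everything below is an assumption for illustration: the function name, the arbitrary harm and benefit weights, and the two sets of counts, which contrast the same hypothetical test (90% sensitive, 90% specific) applied to 100,000 people at 0.1% prevalence and to 1,000 people at 10% prevalence.

```python
def net_benefit(true_pos, false_pos, false_neg,
                benefit_per_tp, harm_per_fp, harm_per_fn):
    """Weighted sum of outcomes, in arbitrary units (weights are assumptions)."""
    return (true_pos * benefit_per_tp
            - false_pos * harm_per_fp
            - false_neg * harm_per_fn)


# Hypothetical weights: catching one real case is worth 50 units,
# each false alarm costs 1 unit, each missed case costs 5 units.
weights = dict(benefit_per_tp=50, harm_per_fp=1, harm_per_fn=5)

# Mass screening: 100,000 people, 0.1% prevalence
mass = net_benefit(true_pos=90, false_pos=9_990, false_neg=10, **weights)

# Targeted screening: 1,000 people, 10% prevalence
targeted = net_benefit(true_pos=90, false_pos=90, false_neg=10, **weights)

print(f"mass screening net benefit:     {mass}")
print(f"targeted screening net benefit: {targeted}")
```

Choosing the weights is the hard and contested part. The sketch only shows that, under these assumed weights, the same test can flip from net benefit to net harm as prevalence falls.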

35 of 40

Validation problem spectrum

Chat Control

(unsolved)

UTIs in Danish primary care

36 of 40

Validation problem spectrum

Chat Control

(unsolved)

UTIs in Danish primary care

Policy problems:

Border security - Online promotion of terrorism
Scientific integrity - Educational ethics
Diabetes - Breast cancer - Hypertension
(Vaccine) misinformation - (Ukraine) disinformation
Opioid prescriptions - Lupus diagnosis

Here are 11 policy problems. Where are they on the scale? What do you think?

***

Lupus case –

You don’t have to read a statistics textbook to get why ignoring uncertainty is bad science that hurts people. At the human level, I key this into some experiences with and later critiques of rheumatology and diagnostic cut-offs closing sick people out of needed treatment for no good reason. (3 approaches to uncertainty Springer chapter: https://verawil.de/shame-name-give-up-the-game-three-approaches-to-uncertainty.)

It’s not just AI. Human decision-makers do this, too. But here’s an AI example:

When Batu et al used Adamichou et al’s lupus diagnostic aid (SLE Risk Probability Index, SLERPI) on kids, they raised the diagnostic threshold to ostensibly improve accuracy (better specificity; fewer false-positives), accepting more false-negatives (worse sensitivity) in the process. But in real life, clinicians should probably treat patients who score a 7 using this tool the same as they should treat those scoring an 8. Increasing the tool’s accuracy like this looks good on paper, but stands to hurt people in practice.
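
The threshold trade-off described above can be sketched with a toy example. The scores below are invented for illustration and are not data from either study; the thresholds 7 and 8 simply echo the scores mentioned in the note and are not claimed to be the tools' actual cut-offs.

```python
def sensitivity_specificity(case_scores, noncase_scores, threshold):
    """Treat score >= threshold as positive; return (sensitivity, specificity)."""
    true_pos = sum(score >= threshold for score in case_scores)
    true_neg = sum(score < threshold for score in noncase_scores)
    return true_pos / len(case_scores), true_neg / len(noncase_scores)


# Hypothetical risk scores (NOT real patient data)
case_scores = [5, 7, 7, 8, 9, 10, 11, 12, 13, 14]     # people with the condition
noncase_scores = [1, 2, 2, 3, 4, 5, 6, 7, 7, 8]       # people without it

for threshold in (7, 8):
    sens, spec = sensitivity_specificity(case_scores, noncase_scores, threshold)
    print(f"threshold {threshold}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

In this toy data, moving the threshold from 7 to 8 raises specificity from 70% to 90% and drops sensitivity from 90% to 70%. The two cases scoring exactly 7 are no less sick than they were before; they have only been moved to the other side of the line.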

Both Adamichou et al and Batu et al also used patient data excluding uncertain diagnoses and assuming lupus is a binary instead of a continuum (e.g., with Antiphospholipid Syndrome toward one pole and Undifferentiated Connective Tissue Disorder NOS, or “lupus lite,” toward the other). This inflates accuracy and excludes a lot of people who could benefit from better care. You don’t necessarily want to treat possible false-negatives any differently from how you treat true or false-positives, when it comes to offering low-risk treatments and preventive care to sick kids. Ignoring uncertainty, instead of thinking about how to give a large number of people with uncertain diagnoses better care, is especially bad in rheumatology, a field notorious for difficult diagnoses.
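
How excluding uncertain diagnoses can inflate measured accuracy is easy to show with a toy validation set. The records below are invented, and the sketch makes a strong simplifying assumption: that the true status of the uncertain cases is known, which in practice is exactly what is missing.

```python
# Each record: (tool_says_positive, truly_has_condition, diagnosis_was_clear_cut)
# Hypothetical records, invented for illustration only.
records = [
    (True, True, True), (True, True, True), (True, True, True), (True, True, True),
    (False, False, True), (False, False, True), (False, False, True), (False, False, True),
    # Borderline patients, where the tool does much worse:
    (False, True, False), (True, False, False), (False, True, False), (True, True, False),
]


def accuracy(rows):
    """Share of records where the tool's call matches the true status."""
    return sum(pred == truth for pred, truth, _ in rows) / len(rows)


clear_only = [row for row in records if row[2]]

print(f"accuracy, clear-cut cases only: {accuracy(clear_only):.0%}")
print(f"accuracy, all patients:         {accuracy(records):.0%}")
```

In this toy set the tool looks perfect on the clear-cut cases and is right only three times out of four once the borderline patients are counted, and those are the patients for whom a diagnostic aid matters most.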

But trying to deny or categorize away uncertainty in other contexts does a world of preventable harm, too.

37 of 40

References

Signal detection theory and psychophysics, by Green & Swets. John Wiley, 1966.

“How to Improve Bayesian Reasoning Without Instruction: Frequency Formats,” Gerd Gigerenzer & Ulrich Hoffrage, Psychological Review, 102(4), October 1995.

The Polygraph and Lie Detection, Stephen Fienberg et al, National Academies Press, 2002.

“The Need for Cognitive Science in Methodology,” Sander Greenland, American Journal of Epidemiology, Vol. 186, No. 6, 15 September 2017, p. 639–645.

38 of 40

References

Bayesian Inference in Statistical Analysis, by Box & Tiao, especially the aphorism about point estimates (relevant, e.g., to quoted accuracy rates of screening tests): “To the idea that people like to have a single number we answer that usually they shouldn’t get it,” p. 310.

Statistical Rethinking, by Richard McElreath, especially the vampire example in Chapter 3, “Sampling the Imaginary.”

Inevitable Illusions: How mistakes of reason rule our minds, by Massimo Piattelli-Palmarini, especially Chapters 6, “The Fallacy of Near Certainty” (Bayes’ rule is required to reason about screening tests, and native intuitions tend to be poor) and 7, “The Seven Deadly Sins” (e.g., overconfidence increases more than prediction accuracy for experts, anchoring works, and untrained statistical intuitions tend to be wrong).

39 of 40

References

Michael A. Ribers & Hannes Ullrich, “Complementarities between algorithmic and human decision-making: The case of antibiotic prescribing,” Quantitative Marketing and Economics, 2024.

Harding Center for Risk Literacy, “Early detection of breast cancer by mammography screening,” Fact Box, https://www.hardingcenter.de/en/transfer-and-impact/fact-boxes/early-detection-of-cancer/early-detection-of-breast-cancer-by-mammography-screening.

“Risk stratification in breast screening workshop,” Andrew Anderson, Cristina Visintin, Antonis Antoniou, Nora Pashayan, Fiona J. Gilbert, Allan Hackshaw, Rikesh Bhatt, Harry Hill, Stuart Wright, Katherine Payne, Gabriel Rogers, Bethany Shinkins, Sian Taylor-Phillips & Rosalind Given-Wilson, BMC Proceedings, Vol. 18, No. 22, 2024.

40 of 40

References

Overdiagnosed: Making People Sick in The Pursuit of Health, H. Gilbert Welch, Lisa M. Schwartz, and Steven Woloshin (MDs; Beacon Press, 2011).

“Quantifying Biases in Causal Models: Classical Confounding vs Collider-Stratification Bias,” Sander Greenland, Epidemiology 14(3):p 300-306, May 2003.

“Causal Diagrams,” by M. Maria Glymour and Sander Greenland, Chapter 12.