1 of 35

Evidence Practices in Software Science

Andreas Stefik, Ph.D.

2 of 35

Introduction

  • Many areas of science are becoming increasingly empirical
  • In the next few classes, we will be discussing empiricism and state-of-the-art methods in software science. This includes:
    • A variety of methodological procedures
    • The history of evidence gathering and the specific lessons we have learned over time
    • Major mistakes made in history (e.g., ethics issues)
    • We will read at least one modern study (which will be on the exam)

3 of 35

Subjectivity: is it a Problem?

  • Some believe that the world we live in can be understood largely through personal experiences
  • There are, however, some problems with this point of view:
    • Issues with Perception
    • Issues with Communication
    • Issues with Bias

4 of 35

Perception

  • Humans perceive in different ways
  • Optical illusions
    • Can be correctly interpreted in multiple ways
    • Can influence what we think we know about a phenomenon
  • Consider the work of Proffitt et al. [1]
    • Humans have trouble perceiving even something like distance
    • Small changes (e.g., wearing a backpack) impact our judgement

[1] Dennis R. Proffitt, Jeanine Stefanucci, Tom Banton, and William Epstein. The role of effort in perceiving distance. Psychological Science, 14(2):106–112, 2003.

5 of 35

Communication

  • For science, we have to communicate our perceptions to others
  • The problem is that studies show:
    • Communication can influence others
    • We can "plant" false memories in others
    • We have fallible memories about what we perceived
  • This has implications for the legal system (e.g., witnesses)

Consider a study by Braun, Ellis, and Loftus [2] on the topic. I summarize it as follows:

  • They noticed advertisements played on nostalgia and personal experience
  • People were asked to report about a trip to Disneyland
  • They were either shown, or not shown, an advertisement
  • They were told they had met either Mickey Mouse or Bugs Bunny (the latter is impossible, since Bugs Bunny is not a Disney character)
  • People reported that even the impossible experiences actually happened

[2] Kathryn A. Braun, Rhiannon Ellis, and Elizabeth F. Loftus. Make my memory: How advertising can change our memories of the past. Psychology and Marketing, 19(1):1–23, 2002.

6 of 35

Bias, Trickery, and Propaganda

  • So far, we have assumed people are honest
  • Unfortunately, they often are not
    • Fraud has been a major problem in society for generations
    • Many areas of science, and obviously politics, have people attempting to deceive
    • Tricking people turns out to be easy

Consider the Pepsi vs. Coke experiment by Woolfolk, Castellan, and Brooks [3]

  • Ask people to choose their favorite soda
  • Label the cups intentionally incorrectly
  • Turns out, people claim their preference is the cup with the wrong label

[3] Mary E. Woolfolk, William Castellan, and Charles I. Brooks. Pepsi versus coke: Labels, not tastes, prevail. Psychological Reports, 52(1):185–186, 1983.

7 of 35

Group Assignment 1:

Consider a claim you have heard in Software Engineering. Is the claim subjective, and are there potential problems with it?

8 of 35

Over time, Scientists began to Doubt Subjectivity

  • In 1721, Boston was confronted with a Smallpox epidemic
    • The epidemic killed 844 people in Boston, out of 5,889 who caught the disease [5]
    • A minister named Cotton Mather promoted "inoculation"
    • At the time, there were no methods for resolving debates on whether the practice was effective
  • When people just "disagree," we observe "rhetoric"

Example claims:

William Douglass claimed it was a "wicked and criminal practice"

Mather countered: "I have read that thousands of lives have been saved by inoculation, and not one of thousands has miscarried by it. This is related by wise & learned men who would not have imposed on the world a false narrative."

9 of 35

The Story of Homeopathy

  • Two doctors in the early 19th century disagreed on the practice of homeopathy
  • Von Hoven wrote a scathing letter about it to a newspaper, which ended with:

"Homeopathy is an atrocity to me, I consider its downfall a blessing for mankind"

  • Reuter, a homeopath, wrote his own rhetoric, saying von Hoven was lying

Friedrich Wilhelm von Hoven (1759-1838)

10 of 35

What about early Discussions of Programming Languages?

  • In the 1970s, the Department of Defense wanted progress made on the programming language wars:
    • They wanted to know what was desirable in a language
    • Digital computer software was costing the DOD roughly $3.5 billion annually
  • Problem was, work by Kaijanaho shows that before 1976, no studies existed

Hence, without evidence, we got competing rhetoric:

David Gries claimed:

"Our first speaker talked about having variables which can contain procedure bodies as objects. The question is [...] whether or not it is a reasonable thing to have in the language."

Ichbiah then claimed:

"It seems to me that [...] forms of abstract data types are now quite well-known. [...] I terribly disagree with Gries [...] when he say that encapsulated data types are not yet within the state of the art,"

[4] John H. Williams and D. A. Fisher, editors. Design and Implementation of Programming Languages: Proceedings of a DOD Sponsored Workshop, Ithaca, Oct., 1976. Springer-Verlag, Berlin, Heidelberg, 1977.

11 of 35

From Subjectivity to Falsification

  • One major problem with subjective statements is that we can just deny that they are true
  • Non-deniability is crucial:
    • Counting the dead
    • Quality of life metrics
    • Measuring programmer productivity
    • Dollars Spent on a project

12 of 35

Introducing the "Control Group"

  • Let's go back to Boston
  • In this case, two groups were compared:
    • Those that received inoculation
    • Those that did not
  • Results showed inoculation significantly reduced deaths

Blake described the result:

Altogether, since April, 5,889 people, of whom 844 died, had had the smallpox. This one disease caused more than three-fourths of all the deaths in Boston during the year of the epidemic. During the same period Boylston inoculated 242 persons, with 6 deaths [5].

[5] John B. Blake. The inoculation controversy in boston: 1721-1722. The New England Quarterly, 25(4):489–506, 1952.
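
The comparison in Blake's figures can be made concrete with a little arithmetic. The sketch below computes the fatality rate in each group and their ratio. It assumes the two counts Blake reports describe separate groups, and the function and variable names are illustrative only. It is a descriptive comparison, not a formal statistical test.

```python
# Counts quoted from Blake [5] for the 1721 Boston epidemic.
natural_cases, natural_deaths = 5889, 844
inoculated_cases, inoculated_deaths = 242, 6


def fatality_rate(deaths, cases):
    """Fraction of cases that ended in death."""
    return deaths / cases


natural_rate = fatality_rate(natural_deaths, natural_cases)
inoculated_rate = fatality_rate(inoculated_deaths, inoculated_cases)

# Ratio below 1 means the inoculated group died at a lower rate.
risk_ratio = inoculated_rate / natural_rate

print(f"Natural smallpox fatality rate: {natural_rate:.1%}")
print(f"Inoculated fatality rate:       {inoculated_rate:.1%}")
print(f"Risk ratio (inoculated/natural): {risk_ratio:.2f}")
```

Roughly 14% of those who caught smallpox naturally died, against roughly 2.5% of those inoculated. Unlike rhetoric, these counts are hard to simply deny.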

13 of 35

The Precision Problem

  • Problem is, suppose we have an idea about software developer productivity
  • Let's say we think our "new fancy tool or procedure" will improve productivity by 20%
  • We have a control group and measure the results
  • Let's suppose everyone in the control takes exactly 100.0 seconds to complete tasks
  • In the experimental group, let's suppose the data looks like this

Seconds
81.54
76.52
77.04
84.79
83.23
81.01

Did we Confirm or Refute our Belief?
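
One way to make the question concrete is to ask whether the predicted time of 80.0 seconds (20% faster than the control's 100.0) is plausible given the spread of the six measurements. The sketch below uses only the numbers on this slide and computes the mean, the sample standard deviation, and a 95% confidence interval. The critical value 2.571 is the standard two-sided t value for 5 degrees of freedom. The choice of a t-based interval is an assumption made for illustration, not necessarily the analysis intended here.

```python
import math
import statistics

control_seconds = 100.0
predicted_seconds = control_seconds * (1 - 0.20)  # a 20% improvement -> 80.0

experimental = [81.54, 76.52, 77.04, 84.79, 83.23, 81.01]

mean = statistics.mean(experimental)
sd = statistics.stdev(experimental)  # sample standard deviation (n - 1)
standard_error = sd / math.sqrt(len(experimental))

# Two-sided 95% critical value of Student's t with 5 degrees of freedom.
T_CRIT_DF5 = 2.571

ci_low = mean - T_CRIT_DF5 * standard_error
ci_high = mean + T_CRIT_DF5 * standard_error
improvement = (control_seconds - mean) / control_seconds

print(f"Mean: {mean:.2f} s, improvement: {improvement:.1%}")
print(f"95% CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
print("Prediction inside interval:", ci_low <= predicted_seconds <= ci_high)
```

The observed improvement is a bit under 20%, yet the interval still contains 80.0. The average alone neither confirms nor refutes the belief, which is why the spread of the data matters.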

14 of 35

The "Numerical Method"

  • Pierre-Charles Alexandre Louis had an idea: let's track people using averages
  • He applied this to the concept "blood-letting"
  • His data suggested the practice was increasing mortality
  • That said, he misinterpreted his data significantly, in part because averages are "not good enough"

Picture From https://en.wikipedia.org/wiki/Pierre_Charles_Alexandre_Louis

15 of 35

Modern Science Often Uses Statistics

  • Software Engineering today casts its data in the language of statistics. You cannot understand the field without it
  • This began with the "Chi-squared" test (χ²) (1900)
  • This was followed in 1908 by the "Student's t-test" from William Sealy Gosset, publishing under a pseudonym
  • However, the most significant advances of the century in this field were largely made by Ronald Fisher (1890 - 1962)
  • He formalized the relationship between measures of
    • Central tendency (e.g., average)
    • Dispersion (e.g., variance)
  • When studying software developers, we use these same basic techniques today

16 of 35

Example Fisher-Style Experiment

  • 120 Software Developers are in a room
  • They are randomly assigned to solve a problem using one of three programming languages
    • A
    • B
    • C
  • Using Fisher's "ANOVA," a statistical test, we obtain the following
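
The computation behind a one-way ANOVA can be sketched as follows. The F statistic compares the variation between group averages with the variation inside each group. The completion times below are randomly generated placeholders, not data from any real study, and the group means, the spread of 15 seconds, and the function name `one_way_anova_f` are all assumptions made for illustration. Turning F into a p-value requires the F distribution, which is left out here.

```python
import random
import statistics


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples."""
    all_values = [x for group in groups for x in group]
    grand_mean = statistics.mean(all_values)
    means = [statistics.mean(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, means)
    )
    # Variation of each observation around its own group mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Randomly assign 120 developers to languages A, B, and C (40 each).
developers = list(range(120))
rng.shuffle(developers)
assignment = {"A": developers[:40], "B": developers[40:80], "C": developers[80:]}

# HYPOTHETICAL mean completion times in seconds, for illustration only.
HYPOTHETICAL_MEANS = {"A": 100.0, "B": 95.0, "C": 90.0}
times = {
    lang: [rng.gauss(HYPOTHETICAL_MEANS[lang], 15.0) for _ in devs]
    for lang, devs in assignment.items()
}

f_stat, df_b, df_w = one_way_anova_f(list(times.values()))
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the languages differ by more than the developer-to-developer noise would suggest. An F near 1 means the group differences look like noise.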

17 of 35

Managing Trust (or lack Thereof)

  • In experiments, we often consider three major issues of "trust"
    • Did the experimenter manipulate who was assigned to the groups?
    • Did the people under study deceive themselves and believe things that were not true?
    • Did the experimenter compare their tool/treatment against something fair?
  • Manipulation can happen on purpose or by accident
  • The first study to take this on was the Nuremberg Salt Test (1835)
  • We have three standard procedures for dealing with these:
    • Randomization
    • Blinding
    • Control Groups
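
A minimal sketch of how randomization and blinding might be combined in a software study is shown below. The function `assign_blinded`, the participant names, and the condition labels are all hypothetical. The idea is that a seeded shuffle decides who goes where, and the people running the sessions see only opaque group codes. The key that maps codes to conditions would be held by a third party until the analysis is finished.

```python
import random


def assign_blinded(participants, conditions, seed):
    """Randomly split participants into coded groups.

    Returns (assignment, key). `assignment` maps each participant to an
    opaque group code; `key` maps each code to its real condition.
    """
    rng = random.Random(seed)  # recorded seed lets others audit the split

    # Randomization: the shuffle, not the experimenter, orders people.
    shuffled = list(participants)
    rng.shuffle(shuffled)

    # Blinding: shuffle the conditions so a code reveals nothing by itself.
    hidden = list(conditions)
    rng.shuffle(hidden)
    codes = [f"group-{i + 1}" for i in range(len(hidden))]
    key = dict(zip(codes, hidden))

    # Deal participants round-robin so group sizes differ by at most one.
    assignment = {p: codes[i % len(codes)] for i, p in enumerate(shuffled)}
    return assignment, key


# Hypothetical participants and conditions, for illustration only.
participants = [f"dev-{n:02d}" for n in range(1, 13)]
assignment, key = assign_blinded(participants, ["new tool", "control"], seed=2024)

for person in participants:
    print(person, assignment[person])  # what the session runner sees
```

Including "control" as one of the conditions covers the third procedure: the new tool is compared against a baseline chosen before the data comes in.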

Friedrich Wilhelm von Hoven (1759-1838)

18 of 35

Group Assignment 2:

What potential Issues of "Trust" exist in Software Engineering and how can we protect against them in experiments?

19 of 35

Experimentation in the 20th Century

20 of 35

Early "Reporting" Standards hoped to Distinguish Fact from Fraud

  • In the early 20th century, the U.S. Congress passed several laws to combat fraud:
    • Biologics Control Act of 1902
    • Pure Food and Drug Act of 1906
    • Federal Food, Drug, and Cosmetic Act of 1938, which greatly expanded the FDA's authority
  • While not a complete history, the broad goal of the legislation was to:
    • Force companies to state what was in their medications
    • Decide what the rules should be in prescribing drugs
    • Be careful not to limit the freedom of individual doctors
    • By 1938, for the first time, drugs had to be tested for safety before being sold

Lesson:

Sometimes reporting requirements have changed because of adjustments to the law.

21 of 35

Scholars at the Time did not Trust their Generation's Science

After the Biologics Control Act of 1902, some therapeutic reformers doubted that individual doctors would, or perhaps even could, make sensible decisions: “Unfortunately, however, the physician's training is likely to be such that he cannot distinguish the rank fraud from the efficacious remedy, honestly made and sold [7].”

Discoveries were welcome, and medical drugs may have been backed by chemical theories or treatments, but this did not mean they had a positive impact on people or communities: “We cannot blame manufacturing chemists for finding new things or advertising them as cleverly as possible. That they and the nostrum vendor are surprisingly successful in selling their wares is largely our fault [6].”

[6] N.S. Davis, “Effect of Proprietary Literature on Medical Men” JAMA 46 (May 5, 1906)

[7] W.A. Puckner, “The Nostrum from the point of view of the Pharmacist,” JAMA 46 (May 5, 1906)

22 of 35

By 1918, it was clear we needed Better Standardization and Reporting

Even well-funded Institutes had tremendous difficulty in standardizing almost anything.

“Even at the well-endowed Rockefeller Institute, it proved difficult to allow each investigator control over the resources desired for coordinated laboratory and clinical studies of new treatments. Clinical investigators at other institutions found it even more difficult to accomplish their scientific aims, because they lacked the means to compel cooperation of others. Researchers at the Russell Sage Institute of Pathology, for example, which aspired to do in studying metabolic disorders what the Rockefeller Institute had accomplished in researching infectious disease, were severely handicapped by shortages of funds and a lack of control over clinical material [8].”

[8] Harry Marks, The Progress of Experiment: Science and Therapeutic reform in the United States, 1900 - 1990 (United Kingdom: Cambridge University Press, 1997).

23 of 35

Reformers like John Stokes started the "Cooperative Research Study"

  • Stokes' studies had four key innovations. He tried to standardize:
    • Selection Procedures
    • Classification Procedures
    • Treatment Procedures
    • Patient Evaluation Procedures
  • This was a major breakthrough, but lacked an enforcement mechanism until World War 2

John H. Stokes in 1937 discussing Syphilis with his students. Picture from: https://www.youtube.com/watch?v=bXMJifagmbA

Note: This is not directly related to the “Tuskegee Syphilis Experiment,” the “Guatemala syphilis experiment,” or the “Terre Haute prison experiments.”

24 of 35

During World War 2, Standardization was an Active Goal

  • President Roosevelt started the Office of Scientific Research and Development in 1941
  • The goals of the committees:
    • Conduct reliable studies
    • Protect soldiers during wartime
    • Incentivize scientists to follow standard procedures through government grants
  • Despite this, by 1943, half of the data from studies on penicillin and syphilis had to be discarded

John F. Mahoney (1889 - 1957), studied penicillin and syphilis

25 of 35

With Bradford Hill and Richard Doll, the Randomized Controlled Trial was Born

  • Bradford Hill pioneered evaluating causality
  • The core innovations combined:
    • Epidemiological and lab studies
    • Data-centric, Ronald Fisher-like statistical conventions
    • Multivariate models that rule out alternative hypotheses
    • A variety of different kinds of data gathering mechanisms
    • Strongly incentivized standardization procedures

[9] Richard Doll and Austin Bradford Hill, Smoking and Carcinoma of the Lung, British Medical Journal, 1950.

26 of 35

Bradford Hill-style studies became Common (and the Law followed)

  • After Bradford Hill pioneered the modern randomized controlled trial, the technique proliferated widely
    • Usage of RCTs increased by 11.2% per year [10]
    • Saw collaboration agreements amongst research institutions
    • Had Strong anti-fraud procedures built in
    • Could use field or lab data, so long as shared procedures were followed
  • The original work is fascinating!

[10] Tsay, M., & Yang, Y. (2005). Bibliometric analysis of the literature of randomized controlled trials. Journal of the Medical Library Association, 93(4), 450–458.

27 of 35

Replication Issues Remained a Large Challenge

  • In some cases, replication was impossible
    • Studies were not always reported the same way (reporting)
    • In one famous study, the drug tolbutamide appeared to cause deaths (ethics) [11]
  • Reporting and standardization were fixable!
  • The tolbutamide study has been debated for decades and is a discussion all on its own

[11] Winegrad AI, Davidson JK, Ricketts HT, Sprague RG, Hurd JB, Fajans SS, Ellenberg M, Scoville AB, Grinshaw WH, Hardin RC. The University Group Diabetes Program Study Pertaining to Phenformin. JAMA. 1971;217(6):817. doi:10.1001/jama.1971.03190060055014

28 of 35

Eventually, the FDA Categorized Studies

  • Multiple stages that start small to mitigate risks
  • The FDA has approximately the following guidelines for sample sizes [12]
    • Phase 1: 20 - 80
    • Phase 2: A few dozen to 300
    • Phase 3: Several hundred to about 3,000
  • Scholars state in a publication what kind of study it is. The assumptions are then known

[12] https://www.fda.gov/drugs/resourcesforyou/consumers/ucm143534.htm

29 of 35

By the mid-1990s, Scientists developed the "CONSORT" standard

  • There is a standard checklist of how to report a study. It is simple
  • This was a major milestone of 20th century medicine
  • It has been endorsed by 600 biomedical journals
  • It standardizes reporting across generations and journals

30 of 35

Studies on Evidence Standards Show they Improve Reporting

  • The CONSORT standards have been empirically evaluated, to test their impact on reporting
  • Overall, the results are good, but not perfect:
    • Approximately 50 empirical studies have evaluated the evidence standards, which led to a systematic review (Cochrane)
    • Overall, they lead to better reporting, but endorsing journals still need reporting improvement
    • One review evaluated 16,604 studies in making this claim [8]
  • The history here is complicated, but the standards are not!

Key Finding:

Evidence Standards Improve Reporting of Empirical Studies in Academic Journals.

31 of 35

The What Works Clearinghouse (WWC) and CONSORT are very Different (and there are others)

WWC

  • A variety of study paradigms
  • Reviewer Certification
  • Complicated rulebook
  • Slick website categorizing studies
  • Rules for evaluation change (to my eyes) in somewhat ad-hoc ways
  • Education focused
  • By law, studies are mapped to "tiers of evidence"

CONSORT

  • Randomized Controlled Trials Only
  • No reviewer certification
  • Simple checklist
  • Studies on the evidence standard show it is effective
  • Journals optionally accept the standard
  • General, but "sort of" medical focused
  • Strong evidence in the literature that it works

32 of 35

Evidence Standards almost always Standardize Reporting

33 of 35

Sections are directly Mapped to the Standards

34 of 35

Group Assignment 3:

In Software Engineering, what changes to the law would you recommend, and what impact could they have on ethics, evidence, or other issues?

35 of 35

Summary

  • We have gone on a whirlwind tour of evidence practices over the centuries
  • Overall, we learned that
    • Many experimentation strategies are designed to reduce the "need" for trust
    • Standard procedures like controls, randomization, and blinding are bare-bones and common
    • The 20th century saw significant changes to the law, which improved science
    • The "randomized controlled trial" contains many checks and balances
    • Modern "scientific evidence standards" can be used to map a series of comparable studies
  • With these ideas in mind, we will be reading a modern study in software engineering. Try to keep in mind historical context as you read