1 of 57

Studying Programming through Making

Amy J. Ko, Ph.D.

University of Washington

2 of 57

About me

I started as an undergrad in CS and Psychology (1998-2002), and stumbled into research with Margaret Burnett, who was studying spreadsheet testing.

I did my Ph.D. at Carnegie Mellon University, with HCI researcher Brad Myers (2002-2008). Later, I started publishing in Software Engineering with the help of other mentors at CMU, such as Jonathan Aldrich and Gail Murphy.

I then joined UW’s Information School, mostly publishing tools and studies in HCI, SE, and Computing Education, always about programming.

These days, I think of myself as a Computing Education researcher who applies PL, SE, and HCI methods and ideas to the learning and teaching of computing.

3 of 57

About this talk

How can we learn things about programming through the things we make?

If you have a PL background, you’re probably used to proofs, and perhaps some benchmark studies.

But there are other ways of knowing.

  • Observing what people do
  • Asking people about their practices
  • Theorizing about behavior, cognition, social contexts

Most of academia, including HCI, uses these ways to make progress. PL should too.

4 of 57

Learning objectives

“User studies” are about more than checking a box for publication; they’re a way of deciding what to make and understanding the significance of what you’ve made.

One study is not enough to understand something; if it takes dozens to believe a drug or vaccine is safe, why would one be enough in CS?

Study design is situated and highly complex. Take it slowly.

HCI is more than user studies—it’s a massive research area with hundreds of subareas and tens of thousands of researchers, inventing the future of interactive computing and understanding its impacts. Studying programming is a sliver of it.

5 of 57

About the rest of this talk

I’ll teach these ideas by telling the story of my dissertation (2002-2008) in five parts:

  • Part I: “What is debugging???”
  • Part II: “Aha!”
  • Part III: “How do I know if it works?”
  • Part IV: “How would this really work…”
  • Part V: “Does this really work?”

We’ll then return to the learning objectives to reflect, and I’ll share some resources to learn more.

6 of 57

Part I

“What is debugging???”

7 of 57

My interests when I started grad school

I was vaguely interested in “making programming easier.”

I didn’t really know what that meant. All I knew was that I wanted to build things that helped programmers productively express their ideas.

My advisor suggested I go watch some people program and see if I could find opportunities to make something.

(Strangely, he didn’t suggest I read, but I did anyway—hundreds of papers about the psychology of programming, leveraging a literature review his former Ph.D. student had published in his dissertation. That was equally important.)

8 of 57

My first semester

I decided to watch artists, sound engineers, and developers using Alice—a programming language and IDE for building interactive 3D virtual worlds—as they tried to implement a behavior.

They would write a line of code and test it; it wouldn’t work, and they’d get frustrated, confused, and lost. No matter how careful they were, their programs were molded through iteration.

A student using Alice to build a Monsters, Inc. interactive game with his team, which included a sound engineering student and an industrial design student.

9 of 57

A realization

After about 30 hours of watching people of all kinds try to write Alice programs, a few things became very clear:

  1. Debugging is not peripheral to programming; it dominates programming.
  2. Debugging, at least as my participants did it, was driven by guesswork (“maybe this is the problem”).
  3. People’s guesses were usually wrong; it took them ages to find out, and then to think of another guess to check.
  4. The only sound judgments were about whether the output was wrong, and sometimes even those were wrong.

“Wait, why did Sully move…”

“Maybe it was the do together I just wrote?”

“Let me try undoing that…”

[5 minutes later] “No, it still happens”

“Maybe it was another event? Let’s disable…”

[5 minutes later] “No, it still happens…”

Amy J. Ko (2003). A Contextual Inquiry of Expert Programmers in an Event-Based Programming Environment. ACM Conference on Human Factors in Computing Systems (CHI), 1036-1037.

10 of 57

Testing my hypothesis

I decided to watch more closely. I had several students come in and try to build something with Alice, and recorded them.

After analyzing 12 hours of video, the sequence was clear:

  1. A developer creates a defect.
  2. Long after the defect was created, they notice a failure.
  3. Recency bias shapes their guesses, but the defects were rarely recent.
  4. Eventually, after laborious guesswork and correction of their mental model of what they’d built, they discover the defect.

A participant creates a defect without realizing it, notices a failure 30 minutes later, and spends the rest of the 90-minute session trying to localize it.

Amy J. Ko, Brad A. Myers (2003). Development and Evaluation of a Model of Programming Errors. IEEE Symposia on Human-Centric Computing Languages and Environments (VL/HCC), 7-14.

11 of 57

Note: I hadn’t thought about tools at all yet. In HCI, understanding problems requires stepping back from making and thinking critically about contexts and tasks.

12 of 57

Part II

“Aha!”

13 of 57

Implications

What did all this mean for tools?

  1. Most debugging tools assume that a developer knows what code might be faulty. But I’d observed that this assumption was often wrong.
  2. Debugging tools that facilitate stepping (breakpoints, time travel) are only useful to the extent that a developer has a good hypothesis. They often don’t.
  3. The only thing that developers did know was that some output is wrong. Any tool that assumes a developer knows more than that about a defect would be garbage in, garbage out, only amplifying a bad premise.

14 of 57

Aha!

What if a debugging tool could start from unwanted output—the only reliable information a developer has—and automatically identify the things that caused it, presenting them to a developer to inspect?

  • But how would developers identify the output?
  • And how would the system identify the causes?
  • And how would the system present the causes to help a developer carefully analyze them?

15 of 57

Sketching

My mind spun with ideas about how a developer and a debugging tool might dialog with each other:

  • “Pointing” to output to indicate it
  • Asking questions about it: “Why are you here?”, “Why didn’t you appear, output?”, “Why aren’t you 5 instead of 2?”
  • The system telling a developer a story: “First, this happened, then this happened, and then this; do any of those look wrong?”

Early prototypes of Whyline “answers”, with wild flailing about filters, temporal masks, and other overly complicated ideas

16 of 57

Discovering slicing

Looking for ways to make these possible, I remembered a 1981 paper I’d read called Program Slicing, by Mark Weiser. Mark had made observations similar to mine, but had come to a different “Aha!”

As conceived, it was a batch process, in which one would specify a variable and get a set of lines of code that influence that variable, statically or dynamically.

The handful of studies on it showed that slices were too big and too hard to comprehend. But what if batch slicing was just a bad interface for reasoning about slices?
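For intuition, here is a tiny, self-contained Java example of the idea (my illustration, not Weiser’s). The slicing criterion is the value of sum at the final print; the slice is the subset of lines that could influence it, and a dynamic slice narrows this to the lines that actually influenced it in one recorded execution:

    public class SliceExample {
        public static void main(String[] args) {
            int n = 10;       // in the slice: n controls the loop below
            int sum = 0;      // in the slice: initializes the criterion variable
            int product = 1;  // NOT in the slice: never influences sum
            for (int i = 1; i <= n; i++) {  // in the slice: controls sum's updates
                sum += i;      // in the slice: directly computes sum
                product *= i;  // NOT in the slice of sum
            }
            System.out.println(sum);     // slicing criterion: why is this 55?
            System.out.println(product); // irrelevant to the criterion
        }
    }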

17 of 57

The Whyline

The following design emerged:

  • Identify lines of code that generate output.
  • Create a menu of those lines of code, organized by objects in the Alice world, presenting them as questions.
  • Allow developers to choose an output statement.
  • Incrementally compute a dynamic slice on the code using an execution trace.
  • Allow developers to interactively select which parts of the incremental slice seem faulty, using their knowledge of correct program behavior to search through the slice.

My last high-fidelity prototype.
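To make that design concrete, here is a minimal sketch of the pipeline in Java. Every type and method name below is a hypothetical simplification of what the Alice implementation actually did, and the sketch omits “why didn’t” questions entirely:

    import java.util.ArrayList;
    import java.util.List;

    class WhylineSketch {

        /** One trace record: which statement ran, what output it produced (if any),
         *  and the indices of earlier events it depended on. */
        record TraceEvent(int statementId, String output, List<Integer> dependencies) {}

        /** Steps 1-3: build a menu of questions from output statements that executed. */
        static List<String> buildQuestionMenu(List<TraceEvent> trace) {
            List<String> questions = new ArrayList<>();
            for (TraceEvent e : trace) {
                if (e.output() != null) {
                    questions.add("Why did " + e.output() + " happen?");
                }
            }
            return questions;
        }

        /** Step 4: answer a question by walking dependencies backwards from the
         *  chosen output event, i.e., a (very simplified) dynamic slice. */
        static List<TraceEvent> answer(List<TraceEvent> trace, TraceEvent chosen) {
            List<TraceEvent> slice = new ArrayList<>();
            collect(trace, chosen, slice);
            return slice; // step 5: the developer inspects these causes interactively
        }

        private static void collect(List<TraceEvent> trace, TraceEvent event,
                                    List<TraceEvent> slice) {
            if (slice.contains(event)) return; // don't revisit shared causes
            slice.add(event);
            for (int dep : event.dependencies()) {
                collect(trace, trace.get(dep), slice);
            }
        }
    }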

18 of 57

The implementation

Lots of work to do:

  • I replaced the runtime to generate an execution trace sufficient for slicing
  • I changed the IDE to expose output statements that had executed in the trace
  • I created a visualization of the slice to convey causality

I showed it to my lab mates and my advisor, and their reaction was: “Why hasn’t debugging always worked this way?”

The Whyline for Alice, showing a question about why a particular output statement did not execute, and an answer computed as a backwards dynamic slice on the execution of that line and its values at a given time.

19 of 57

Notice that I hadn’t thought about evaluation at all yet. Making requires stepping back from evaluating.

20 of 57

Part III

“How do I know if it works?”

21 of 57

Informal feedback

After building my prototype, I got plenty of informal feedback:

  • My advisor thought it was “cool”
  • My lab mates wished they had it for their preferred language and IDE
  • The Alice team wanted to merge my branch of Alice

All good signs. But would it actually help with debugging? How could I know?

22 of 57

Study idea #1: benchmark comparison

Many prior debugging tools had used this approach. I could:

  • Organize all of the real defects I’d seen developers create in Alice.
  • Compare the amount of “work” to identify the defect with the Whyline and other methods.
  • Show that the Whyline takes less “work”

I found this unsatisfying—how could I possibly emulate the work involved in debugging, when I’d demonstrated in prior studies that it was so completely dependent on a developer’s prior knowledge, the sequence of their actions, and other context?

23 of 57

Study idea #2: task-based evaluation

I could:

  • Select a defective program
  • Run a controlled experiment to compare the time it took for developers to localize the defect with and without the Whyline
  • Show that the Whyline caused significantly faster debugging

This would be fine, but it would be highly dependent on which task I selected. Moreover, my prior work had shown that debugging time was highly variable, so there was a risk this variation would mask the effects of the tool.

24 of 57

Study idea #3: spec-based evaluation

I could:

  • Give developers a specification with six mostly orthogonal requirements
  • Let them introduce and debug defects organically, hoping they would introduce similar defects within each requirement
  • Match and compare organic debugging scenarios by time

This had the advantage of being more ecologically valid. The distinct tasks also allowed me to measure task progress. But it also posed some risks, because participants could create wildly different defects, making them incomparable.

25 of 57

The spec

I took the risk, and tried to minimize it by generating a specification with six tasks that were minimally mutually dependent and constrained enough in implementation that developers were likely to make the same mistakes.

I brought 9 participants into the lab, gave 5 the Whyline and the rest the latest version of Alice, offered them a 15-minute tutorial on Alice, and gave them 90 minutes to make a simple Pac-Man game.

  1. Pac must always move. His direction should change in response to the arrow keys.
  2. Ghost must move in random directions half of the time and directly towards Pac the other half.
  3. If Ghost is chasing and touches Pac, Pac must flatten and stop moving forever.
  4. If Pac eats the big dot, Ghost must run away for 5 seconds, then return to chasing.
  5. If Pac touches running ghost, Ghost must flatten and stop for 5 seconds, then chase again.
  6. If Pac eats all of the dots, Ghost must stop and Pac must hop indefinitely.

26 of 57

The results

It worked. I found six distinct defects that occurred in both groups, compared their times, and found that Whyline debugging times were 8x faster (10-40 seconds versus 49-330 seconds), p<.05.
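(As an aside on the analysis: with samples this small, one defensible way to test such a difference is a permutation test. The sketch below uses made-up times purely for illustration; the real data and statistics are in the paper cited on this slide.)

    import java.util.Arrays;
    import java.util.Random;

    class PermutationTestSketch {
        public static void main(String[] args) {
            double[] whyline = {10, 14, 22, 31, 40};    // hypothetical times (seconds)
            double[] control = {49, 95, 120, 210, 330}; // hypothetical times (seconds)
            double observed = mean(control) - mean(whyline);

            // Pool all times, then repeatedly re-deal them into two arbitrary groups.
            double[] all = new double[whyline.length + control.length];
            System.arraycopy(whyline, 0, all, 0, whyline.length);
            System.arraycopy(control, 0, all, whyline.length, control.length);

            Random rng = new Random(42);
            int extreme = 0, trials = 100_000;
            for (int t = 0; t < trials; t++) {
                shuffle(all, rng);
                double diff = mean(Arrays.copyOfRange(all, whyline.length, all.length))
                            - mean(Arrays.copyOfRange(all, 0, whyline.length));
                if (diff >= observed) extreme++; // as extreme as the observed difference?
            }
            System.out.printf("one-sided p ~= %.4f%n", (double) extreme / trials);
        }

        static double mean(double[] xs) {
            double sum = 0;
            for (double x : xs) sum += x;
            return sum / xs.length;
        }

        static void shuffle(double[] xs, Random rng) { // Fisher-Yates
            for (int i = xs.length - 1; i > 0; i--) {
                int j = rng.nextInt(i + 1);
                double tmp = xs[i]; xs[i] = xs[j]; xs[j] = tmp;
            }
        }
    }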

Upon inspecting video recordings of their debugging, I found that this effect was due to the Whyline preventing long investigations of wrong hypotheses.

This was the first evidence that the Whyline was helpful for real bugs, created and debugged by real developers, with minimal training.

Participants were particularly enamored with the question menu, and its ability to instantly link the question in their head to the code it corresponded to. “Gosh, that’s really intuitive. Can you make this for Java?” - participant 5

Amy J. Ko, Brad A. Myers (2004). Designing the Whyline: A Debugging Interface for Asking Questions about Program Behavior. ACM Conference on Human Factors in Computing Systems (CHI), 151-158.

27 of 57

Caveats

This was by no means a perfect study:

  • Small sample
  • Small programs
  • Highly constrained, simple specification
  • Novice programmers
  • Intentionally simple programming language

At least within HCI, though, evaluations are meant to help shape ideas, not to “prove” them useful.

Even with these limitations, a small formative study was enough to suggest that the idea had promise and that others should explore it.

The study also began to reveal why the Whyline worked: linking the question in a developer’s mind to the code and execution history that concerns it was a powerful way of searching for defects.

28 of 57

Note that the goal of the study wasn’t to prove that the Whyline was better, but to discover whether it was, and why.

29 of 57

Part IV

“How would this really work…”

30 of 57

But what about [insert language here]?

Publishing the CHI paper led to a lot of questions.

  • Fans of the work wondered about other programming languages (one example below).
  • Software engineering researchers wondered about scale, complexity.
  • Programming language researchers wondered about the risk of helping too much, weakening developers’ “debugging muscles.”
  • Microsoft, Apple, Adobe, Mozilla, and others wondered about the implications for their language stacks.

An inmate at Allegheny County Jail wondered: what about Perl? Apache configurations? Java?

31 of 57

Pause

I was only in my first year of grad school.

I wasn’t ready to commit to the Whyline being the entirety of my dissertation, nor was I ready to graduate.

I decided to play for a few years, and think about these questions while gaining other expertise.

  • I studied API learning (Six Learning Barriers in End-User Programming Systems)
  • I studied program understanding in IDEs (Eliciting Design Requirements for Maintenance-Oriented IDEs: A Detailed Study of Corrective and Perfective Maintenance Tasks)
  • I designed a programming language (Citrus: A Language and Toolkit for Simplifying the Creation of Structured Editors for Code and Data)
  • I studied interruptions (Examining Task Engagement in Sensor-Based Statistical Models of Human Interruptibility)
  • I built novel code editors (Barista: An Implementation Framework for Enabling New Tools, Interaction Techniques and Views for Code Editors)
  • I did a Whyline for user interfaces (Answering Why and Why Not Questions in User Interfaces)
  • I studied developers’ information seeking at Microsoft (Information Needs in Collocated Software Development Teams)

32 of 57

What I learned during 3 years on pause

The architecture of programming language stacks

The architecture of IDEs

A range of qualitative methods

How professional software developers do their work

How to publish in software engineering venues

All of these shaped my understanding of what a Whyline for a general purpose programming language would need to be, helping me gain research method skills, identify requirements, and better understand software engineering.

33 of 57

My thesis proposal

  • A summary of everything that had been done in debugging and program understanding
  • A summary of everything I had learned in my prior work
  • Mockups of what I proposed to build
  • Pseudocode for algorithms I proposed to invent.
  • A study design plan for precisely evaluating the effects of the tool on debugging time and task completion.
  • An 18-month timeline for the work.
  • Work that was out of scope for my dissertation.

Early Java Whyline prototypes.

34 of 57

9 months of engineering, design, life

Rebuilt a JVM to do bytecode instrumentation

Built an execution history format with real-time compression, organized for incremental loading and random access

Built an incremental static and dynamic analysis engine (call graphs, slicing, question answering)

Built an output history view to recreate output history for interrogation

Built a repository and history viewer, to support questions, answers, browsing, searching

Built a breakpoint debugger for comparison.

Weekly usability testing to verify that all interfaces were usable by experienced Java developers (software engineering master’s students down the hall)

Weekly usefulness testing on a growing corpus of defective open source Java projects to surface new requirements about question and answer presentation.
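For flavor, here is a sketch of two ingredients such an execution history format needs: variable-length integers to keep events small, and a periodic block index so a UI can seek into the middle of a long trace without reading all of it. This is my illustration; the Java Whyline’s actual encoding differed:

    import java.io.ByteArrayOutputStream;
    import java.util.ArrayList;
    import java.util.List;

    class TraceWriter {
        private static final int BLOCK_SIZE = 1024; // events per indexed block (assumed)
        private final ByteArrayOutputStream out = new ByteArrayOutputStream();
        private final List<Integer> blockOffsets = new ArrayList<>(); // byte offset of each block
        private int eventCount = 0;

        /** Record one "statement executed" event. */
        void writeEvent(int statementId, int threadId) {
            if (eventCount % BLOCK_SIZE == 0) {
                blockOffsets.add(out.size()); // index entry: where this block begins
            }
            writeVarInt(statementId);
            writeVarInt(threadId);
            eventCount++;
        }

        /** Encode small non-negative ints (the common case) in fewer bytes. */
        private void writeVarInt(int value) {
            while ((value & ~0x7F) != 0) {
                out.write((value & 0x7F) | 0x80); // low 7 bits plus a continuation bit
                value >>>= 7;
            }
            out.write(value);
        }
    }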

I was also getting divorced at the time; programming was a good escape, but not a very healthy one. It took me years to process that trauma in more healthy ways.

Amy J. Ko, Brad A. Myers (2010). Extracting and Answering Why and Why Not Questions about Java Program Output. ACM Transactions on Software Engineering and Methodology, 22(2), Article 4.

35 of 57

Release candidate

After 9 months, I had a prototype that was stable, fast, usable, and (appeared to be) useful on a corpus of defects.

There were some things I was certain of: 1) it was novel, 2) it was a huge amount of engineering work, 3) I was really proud of it.

There was one big thing I was uncertain of: would it actually be useful to Java developers?

The Java Whyline release candidate, showing the debugging of a defect in a painting program.

36 of 57

Note that evaluation of usability and usefulness was iterative, and embedded in the design process—not an afterthought.

37 of 57

Part V

“Does this really work?”

38 of 57

Critiques of the Alice Whyline evaluation

Small sample, small programs, novice programmers, simple language—all things that had dissatisfied the software engineering community.

This time, I needed to aim for scale and complexity, with experienced developers.

I wanted to do something more naturalistic, as with the Alice study, but I’d learned on hiatus how skeptical software engineering researchers were of anything other than a controlled experiment.

I decided to go with a controlled experiment, to appease reviewers, and not risk my upcoming academic job search.

39 of 57

Study design

Between subjects

  • A control group that used a breakpoint debugger, built within the Whyline stack to minimize confounding factors
  • A treatment group that received the Whyline

Training introduced each group to the features of their respective tool.

20 participants, split evenly, recruited from the master’s program in software engineering, whose students were all experienced developers from industry.

Tasks were based on the ArgoUML project, a 150K LOC software architecture design tool. Two real defects reported in its bug repository:

  • Removing a deprecated checkbox and associated functionality.
  • A drop-down menu that was missing identical class names.

Participants were told that they were new hires who had just been assigned these bugs as warm-up tasks. They were asked to write a change recommendation for a patch and told to prioritize correctness over speed.

40 of 57

It was useful

With the Whyline, participants were far more successful at writing a correct change recommendation, and much faster on one task.

  • “I love it!”
  • “This is really great!”
  • “I think this will really help.”
  • “This is really going to reduce the burden on programmers. This is great, when can I get this for C?”
  • “It’s so nice and straight and simple...”
  • “My god, this is so cool.”
  • “This is very nice.”

41 of 57

But why?

Whyline participants:

  • Spent more of their time inspecting more relevant files and functions
  • Avoided unproductive guesswork (e.g., text searches, hypothesis testing)
  • Spent more time carefully reading the semantics of relevant code

A theory emerged:

  • Strategies shape whether a developer debugs productively
  • Tools shape the strategies that are possible and their various costs and risks
  • Breakpoint debuggers promote unproductive, risky strategies
  • The Whyline promoted productive, empirically-grounded strategies

42 of 57

Aftermath

I presented the work in 2008 and won a distinguished paper award; this helped me get several offers in academia and industry.

It led to several consulting gigs with language maintainers (e.g., Microsoft, Mozilla, Adobe, Apple), and to changes in their language stacks to support tracing (2009-2015).

The design, engineering, and evaluation work helped sharpen my skills along many dimensions.

In 2018, I won a most influential paper award for the work, and got to reflect on what I’d learned.

I had the honor of giving a fancy talk for my most influential paper award. It’s available at http://faculty.uw.edu/ajko/talks

43 of 57

Reflection

44 of 57

“User studies” are about discovery

I often hear PL researchers use the phrase “user study” as if it’s merely a subtask of publishing.

My dissertation story shows that “user studies” are about much more: each time I watched a developer work, I discovered opportunities, requirements, and constraints for making, as well as explanations of the value of what I’d made, and evidence of its limitations.

Observing people use tools and languages is essential to making sense of the significance of what we make. More strongly: if you don’t observe people using what you’ve made, you don’t really know its significance.

45 of 57

One study is never enough

My dissertation story showed that one study is not enough to understand the Whyline. Two studies weren’t enough either: even the highly controlled experiment that wrapped up my work was limited to one code base, 20 developers, and only Java.

I honestly still don’t know when the Whyline is useful, 12 years later, even though I’m sure it’s useful sometimes.

That doesn’t mean that evaluations are worthless. They just incrementally contribute to an accumulation of evidence. There’s no better time to start accumulating than now.

46 of 57

Study design is situated, complex

Situated — I did the study that made sense for the tool I’d created, for the resources I had, for the people I had access to, for the skills I had at the time, and for the audiences I was speaking to. There is no best study design; there’s just the study design that you’re most capable of doing, that most effectively answers your question, and that serves your other goals.

Complex — I had to account for hundreds of factors, and doing so necessarily requires careful, slow, thoughtful iteration, test runs (known as piloting), and lots of time.

Do study design slowly, deliberately, and with the help of an expert. It is always more complex than you think.

47 of 57

HCI is about more than evaluation

Many people in PL appear to view HCI narrowly as concerned with user studies and user interfaces. My dissertation story should show that HCI is integral at every step of inventing a tool—before, during, and after.

Therefore, don’t ask an HCI researcher to design your evaluation after you’ve built a tool—that’s too late, and it disregards their expertise, which is just as much about the invention of novel forms of interactive computing as it is about critically evaluating the impact of those forms on the world.

48 of 57

Guidelines

49 of 57

If study design is hard, how do you learn?

Start by reading. There are so many good resources! They will teach you:

  • Major decisions to make
  • Caveats, confounds, and challenges to anticipate
  • Factors to plan around
  • How to refine your design

Here are some examples...

50 of 57

How to avoid meaningless evaluations

Greenberg, S., & Buxton, B. (2008, April). Usability evaluation considered harmful (some of the time). In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 111-120).

Explains how the choice of evaluation methodology should arise from the actual problem or research question under consideration, not from a rote expectation that every paper needs a usability evaluation.

The risks of evaluating too soon, too narrowly.

51 of 57

Designing meaningful research questions

Amy J. Ko, Sally Fincher (2019). A Study Design Design Process. The Cambridge Handbook of Computing Education Research (Sally Fincher, Anthony Robins, Eds.).

Describes how to design coherent research questions, either standalone ones, or ones that concern the evaluation of a tool.

A process for refining a research question.

52 of 57

Using theory to inform evaluations

Nelson, G. L., & Ko, A. J. (2018, August). On use of theory in computing education research. In Proceedings of the 2018 ACM Conference on International Computing Education Research (pp. 31-39).

Explains the importance of using theory to inform design and evaluation, but also the perils of relying on it too much.

Theories are powerful, but with great power comes great responsibility.

53 of 57

Designing controlled experiments

Amy J. Ko, Thomas LaToza, Margaret M. Burnett (2013). A Practical Guide to Controlled Experiments of Software Engineering Tools with Human Participants. Empirical Software Engineering, 110-141.

Describes nearly everything you need to know to design a good experiment to study a developer tool

Also explains why you should probably not design an experiment.

A diagram of the structure of a controlled experiment

54 of 57

Preventing usability problems in studies

Amy J. Ko, Margaret M. Burnett, Thomas R.G. Green, Karen J. Rothermel, Curtis R. Cook (2002). Improving the Design of Visual Programming Language Experiments Using Cognitive Walkthroughs. Journal of Visual Languages and Computing, 13(5), 517-544.

Explains how to verify, without piloting, that developers won’t encounter usability problems when trying to use your tool.

An adapted Cognitive Walkthrough process for detecting usability problems in your study.

55 of 57

Avoiding participant response bias

Dell, N., Vaidyanathan, V., Medhi, I., Cutrell, E., & Thies, W. (2012, May). “Yours is better!”: Participant response bias in HCI. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1321-1330).

Explains the risks of evaluating your own tools.

Participants are more likely to say they like what you made if you’re the one asking.

56 of 57

Consider this the beginning of your learning

Knowing anything about programming is hard!

  • Keep reading
  • Keep practicing
  • Keep reflecting

We’ll be here to help.

57 of 57

Questions?

Amy J. Ko, Ph.D.

  • “User studies” are part of discovery
  • One study is not enough
  • Study design is complex, don’t rush it
  • HCI ≠ “user studies”