Studying Programming through Making
Amy J. Ko, Ph.D.
University of Washington
About me
I started as an undergrad in CS and Psychology (1998-2002), and stumbled into research with Margaret Burnett, who was studying spreadsheet testing.
I did my Ph.D. at Carnegie Mellon University, with HCI researcher Brad Myers (2002-2008). Later, I started publishing in Software Engineering with the help of other mentors at CMU, such as Jonathan Aldrich and Gail Murphy.
I then joined UW’s Information School, mostly publishing tools and studies in HCI, SE, and Computing Education, always about programming.
These days, I think of myself as a Computing Education researcher, who applies PL, SE, and HCI methods and ideas to the learning and teaching of computing.
About this talk
How can we learn things about programming through the things we make?
If you have a PL background, you’re probably used to proofs, and perhaps some benchmark studies.
But there are other ways of knowing.
Most of academia, including HCI, uses these ways to make progress. PL should too.
Learning objectives
“User studies” are about more than checking a box for publication; they’re a way of deciding what to make and understanding the significance of what you’ve made.
One study is not enough to understand something; if it takes dozens to believe a drug or vaccine is safe, why would one be enough in CS?
Study design is situated and highly complex. Take it slowly.
HCI is more than user studies—it’s a massive research area with hundreds of subareas and tens of thousands of researchers, inventing the future of interactive computing and understanding its impacts. Studying programming is a sliver of it.
About the rest of this talk
I’ll teach these ideas by telling the story of my dissertation (2002-2008) in five parts:
We’ll then return to the learning objectives to reflect, and I’ll share some resources to learn more.
Part I
“What is debugging???”
My interests when I started grad school
Broadly, I was vaguely interested in “making programming easier.”
I didn’t really know what that meant. All I knew was that I wanted to build things that helped programmers productively express their ideas.
My advisor suggested I go watch some people program and see if I could find opportunities to make something.
(Strangely, he didn’t suggest I read, but I did anyway—hundreds of papers about the psychology of programming, leveraging a literature review his former Ph.D. student had published in his dissertation. That was equally important.)
My first semester
I decided to watch artists, sound engineers, and developers using Alice—a programming language and IDE for building interactive 3D virtual worlds—as they tried to implement a behavior.
They would write a line of code, test it, and it wouldn’t work; they’d get frustrated, confused, and lost. No matter how careful they were, their programs were molded through iteration.
A student using Alice to build a Monsters, Inc. interactive game with his team, which included a sound engineering student and an industrial design student.
A realization
After about 30 hours of watching people of all kinds try to write Alice programs, a few things became very clear:
“Wait, why did Sully move…”
“Maybe it was the do together I just wrote?”
“Let me try undoing that…”
[5 minutes later] “No, it still happens”
“Maybe it was another event? Let’s disable…”
[5 minutes later] “No, it still happens…”
Amy J. Ko (2003). A Contextual Inquiry of Expert Programmers in an Event-Based Programming Environment. ACM Conference on Human Factors in Computing Systems (CHI), 1036-1037.
Testing my hypothesis
I decided to watch more closely. I had several students come in and try to build something with Alice, and recorded them.
After analyzing 12 hours of video, the sequence was clear:
A participant creates a defect without realizing it, notices a failure 30 minutes later, and spends the rest of the 90-minute session trying to localize it.
Amy J. Ko, Brad A. Myers (2003)
Development and Evaluation of a Model of Programming Errors.
IEEE Symposia on Human-Centric Computing Languages and Environments (VL/HCC), 7-14.
Note: I hadn’t thought about tools at all yet. In HCI, understanding problems requires stepping back from making and thinking critically about contexts and tasks.
Part II
“Aha!”
Implications
What did all this mean for tools?
Aha!
What if a debugging tool could start from unwanted output—the only reliable information a developer has—and automatically identify the things that caused it, presenting them to a developer to inspect?
Sketching
My mind spun with ideas about how a developer and a debugging tool might dialog with each other:
Early prototypes of Whyline “answers”, with wild flailing about filters, temporal masks, and other overly complicated ideas
Discovering slicing
Looking for ways to make these ideas possible, I remembered a 1981 paper I’d read called Program Slicing, by Mark Weiser. Mark had made observations similar to mine, but had come to a different “Aha!”
As conceived, it was a batch process, in which one would specify a variable and get a set of lines of code that influence that variable, statically or dynamically.
The handful of studies on it showed that slices were too big, too hard to comprehend. But what if this was just a bad interface for reasoning about slices?
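To make this concrete, here is a minimal, hypothetical Java sketch of what a backward dynamic slice contains. The program and variable names are my own illustration, not code from Alice or the Whyline; the comments mark which statements would and would not appear in a slice on the final value of y.

```java
// A tiny, hypothetical program to illustrate a backward dynamic slice.
// Question: why did `y` end up 0 on the last line of this particular run?
public class SliceExample {
    public static void main(String[] args) {
        int a = 2;              // in the slice: flows into the `x` used by the branch
        int b = 3;              // not in the slice: never influences `y`
        int x = a * 5;          // in the slice: this value decides the branch below
        if (x > 20) {           // in the slice: the branch controls which value reaches `y`
            b = b + 1;          // not in the slice
        } else {
            x = 0;              // in the slice: the value that actually reaches `y`
        }
        int y = x;              // the slicing criterion: y == 0 here
        System.out.println(y);  // the unwanted output a Whyline-style question starts from
    }
}
```

Weiser’s slicers computed sets like this in batch from a chosen variable; the Whyline’s bet was that the same information becomes far more usable when a developer reaches it by asking questions about visible output.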
The Whyline
The following design emerged:
My last high fidelity prototype.
The implementation
Lots of work to do:
I showed it to my lab mates, my advisor, and their reaction was—”Why hasn’t debugging always worked this way?”
The Whyline for Alice, showing a question about why a particular output statement did not execute, and an answer computed as a backwards dynamic slice on the execution of that line and its values at a given time.
Notice that I hadn’t thought about evaluation at all yet. Making requires stepping back from evaluating.
Part III
“How do I know if it works?”
Informal feedback
After building my prototype, I got plenty of informal feedback:
All good signs. But would it actually help with debugging? How could I know?
Study idea #1: benchmark comparison
Many prior debugging tools had used this approach. I could:
I found this unsatisfying—how could I possibly emulate the work involved in debugging, when I’d demonstrated in prior studies that it was so completely dependent on a developer’s prior knowledge, the sequence of their actions, and other context?
Study idea #2: task-based evaluation
I could:
This would be fine, but it would be highly dependent on which task I selected. Moreover, my prior work had shown that debugging time was highly variable, so there was a risk this variation would mask the effects of the tool.
Study idea #3: spec-based evaluation
I could:
This had the advantage of being more ecologically valid. The distinct tasks also allowed me to measure task progress. But it also posed some risks, because participants could create wildly different defects, making them incomparable.
The spec
I took the risk, and tried to minimize it by writing a specification with six tasks that were minimally interdependent, and constrained enough in implementation that developers were likely to make the same mistakes.
I brought 9 participants into the lab, gave 5 the Whyline and 5 the latest version of Alice, offered them a 15-minute tutorial on Alice, and gave them 90 minutes to make a simple Pac-Man game.
The results
It worked. I found six distinct defects that occurred in both groups, compared their times, and found that Whyline debugging times were about 8x faster (10-40 seconds versus 49-330 seconds), p<.05.
Upon inspecting video recordings of their debugging, I found that this effect was due to the Whyline preventing long investigations of wrong hypotheses.
This was the first evidence that the Whyline was helpful for real bugs, created and debugged by real developers, with minimal training.
Participants were particularly enamored with the question menu, and its ability to instantly link the question in their head to the code it corresponded to. “Gosh, that’s really intuitive. Can you make this for Java?” - participant 5
Ko, A. J., & Myers, B. A. (2004). Designing the Whyline: a debugging interface for asking questions about program behavior. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 151-158).
Caveats
This was by no means a perfect study →
At least within HCI, though, evaluations are to help shape ideas, not to “prove” them useful.
Even with these limitations, a small formative study was enough to suggest to others that the idea had promise and that others should explore it.
The study also began to reveal why it worked: linking the question in a developer’s mind to the code and execution history that concerns it was a powerful way of searching for defects.
Small sample
Small programs
Highly-constrained, simple specification
Novice programmers
Intentionally simple programming language
Note that the goal of the study wasn’t to prove that the Whyline was better, but to discover whether it was, and why.
Part IV
“How would this really work…”
But what about [insert language here]?
Publishing the CHI paper led to a lot of questions.
An inmate at Allegheny County Jail wondered, what about Perl? Apache configurations? Java?
Pause
I was only in my first year of grad school.
I wasn’t ready to commit to the Whyline being the entirety of my dissertation, nor was I ready to graduate.
I decided to play for a few years, and think about these questions while gaining other expertise.
What I learned during 3 years on pause
The architecture of programming language stacks
The architecture of IDEs
A range of qualitative methods
How professional software developers do their work
How to publish in software engineering venues
All of these shaped my understanding of what a Whyline for a general purpose programming language would need to be, helping me gain research method skills, identify requirements, and better understand software engineering.
My thesis proposal
Early Java Whyline prototypes.
9 months of engineering, design, life
Rebuilt a JVM to do bytecode instrumentation
Built an execution history format with real-time compression, organized for incremental loading and random access (a simplified sketch of one trace record appears after this list)
Built an incremental static and dynamic analysis engine (call graphs, slicing, question answering)
Built an output history view to recreate output history for interrogation
Built a repository and history viewer, to support questions, answers, browsing, searching
Built a breakpoint debugger for comparison.
Weekly usability testing to verify that all interfaces were usable by experienced Java developers (software engineering master’s students down the hall)
Weekly usefulness testing on a growing corpus of defective open source Java projects to surface new requirements about question and answer presentation.
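To make the items above concrete, here is a minimal sketch of what one record in such an execution history might look like, assuming a design like the one described in this list: fixed-size events appended by instrumented bytecode, ordered so they can be compressed as they are written, loaded incrementally, accessed randomly, and walked backwards for slicing. The names and fields are my own hypothetical illustration, not the Java Whyline’s actual trace format.

```java
// Hypothetical kinds of events an instrumented program might record.
enum EventKind { ASSIGNMENT, INVOCATION, RETURN, BRANCH, OUTPUT }

// One compact, fixed-size event appended by instrumented bytecode.
// Because events are totally ordered by eventId, a file of these can be
// compressed as it is written, loaded incrementally, accessed randomly,
// and traversed backwards from an OUTPUT event through the ASSIGNMENT
// and BRANCH events that caused it (a backwards dynamic slice).
record TraceEvent(
    long eventId,       // position in the total order of the recorded execution
    int threadId,       // which thread produced the event
    int instructionId,  // which instrumented instruction (maps back to source)
    EventKind kind,     // what happened at that instruction
    long value          // the value produced, if any, encoded as raw bits
) {}
```

A real trace needs far more than this, but even a simplified record hints at why the engineering above took months: answering a single question means searching and slicing over a very long stream of such events.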
I was also getting divorced at the time; programming was a good escape, but not a very healthy one. It took me years to process that trauma in more healthy ways.
Amy J. Ko, Brad A. Myers (2010)
Extracting and Answering Why and Why Not Questions about Java Program Output
ACM Transactions on Software Engineering and Methodology, 22(2), Article 4.
Release candidate
After 9 months, I had a prototype that was stable, fast, usable, and (appeared to be) useful on a corpus of defects.
There were some things I was certain of: 1) it was novel, 2) it was a huge amount of engineering work, 3) I was really proud of it.
There was one big thing I was uncertain of: would it actually be useful to Java developers?
The Java Whyline release candidate, showing the debugging of a defect in a painting program.
Note that evaluation of usability and usefulness was iterative, and embedded in the design process—not an afterthought.
Part V
“Does this really work?”
Critiques of the Alice Whyline evaluation
Small sample, small programs, novice programmers, simple language—all things that had dissatisfied the software engineering community.
This time, I needed to aim for scale and complexity, with experienced developers.
I wanted to do something more naturalistic, as with the Alice study, but I’d learned on hiatus how skeptical software engineering researchers were of anything other than a controlled experiment.
I decided to go with a controlled experiment, to appease reviewers, and not risk my upcoming academic job search.
Study design
Between subjects
Training introduced each group to the features of their respective tool.
20 participants, split evenly, recruited from the master’s program in software engineering, whose students were all experienced developers from industry.
Tasks were based on the ArgoUML project, a 150K LOC software architecture design tool. Two real defects reported in its bug repository:
Participants were told that they were new hires who had just been assigned these bugs as their warm-up tasks. They were to write a change recommendation for a patch and were told to prioritize correctness over speed.
It was useful
With the Whyline, participants were far more successful at writing a correct change recommendation, and much faster on one task.
But why?
Whyline participants:
A theory emerged:
Aftermath
I presented the work in 2008 and won a distinguished paper award; this helped me get several offers in academia and industry.
It led to several consulting gigs with language maintainers (e.g., Microsoft, Mozilla, Adobe, Apple), and much impact on their language stacks to support tracing (2009-2015)
The design, engineering, and evaluation work helped sharpen my skills along many dimensions.
In 2018, I won a most influential paper award for the work, and got to reflect on what I’d learned.
I had the honor of giving a fancy talk for my most influential paper award. It’s available at http://faculty.uw.edu/ajko/talks
Reflection
“User studies” are about discovery
I often hear PL researchers use the phrase “user study” as if it’s merely a subtask of publishing.
My dissertation story shows that “user studies” are about much more: each time I watched a developer work, I discovered opportunities, requirements, and constraints for making, as well as explanations of the value of what I’d made, and evidence of its limitations.
Observing people use tools and languages is essential to making sense of the significance of what we make. More strongly: if you don’t observe people using what you’ve made, you don’t really know its significance.
One study is never enough
My dissertation story showed that one study is not enough to understand the Whyline. Two studies weren’t enough either: even the highly controlled experiment that wrapped up my work was limited to one code base, 20 developers, and only Java.
I honestly still don’t know when the Whyline is useful, 12 years later, even though I’m sure it’s useful sometimes.
That doesn’t mean that evaluations are worthless. They just incrementally contribute to an accumulation of evidence. There’s no better time to start accumulating than now.
Study design is situated, complex
Situated — I did the study that made sense for the tool I’d created, for the resources I had, for the people I had access to, for the skills I had at the time, for the audiences I was speaking to. There is no best study design, there’s just the study design that you’re most capable of doing, that most effectively answers your question, and that serves your other goals.
Complex — I had to account for hundreds of factors, and this necessarily requires careful, slow, thoughtful iteration, test runs (known as piloting), and lots of time.
Do study design slowly, deliberately, and with the help of an expert. It is always more complex than you think.
HCI is about more than evaluation
Many people in PL appear to view HCI narrowly as concerned with user studies and user interfaces. My dissertation story should show that HCI is integral at every step of inventing a tool—before, during, and after.
Therefore, don’t ask an HCI researcher to design your evaluation after you’ve built a tool—that’s too late, and disregards their expertise, which is just as much about the invention of novel forms of interactive computing as it is about critically evaluating the impact of those forms on the world.
Guidelines
If study design is hard, how do you learn?
Start by reading. There are so many good resources! They will teach you:
Here are some examples...
How to avoid meaningless evaluations
Greenberg, S., & Buxton, B. (2008, April). Usability evaluation considered harmful (some of the time). In Proceedings of the SIGCHI conference on Human factors in computing systems (pp. 111-120).
Explains how the choice of evaluation methodology should arise from the actual problem or research question under consideration, not from a ritualized expectation that every paper include a usability evaluation.
The risks of evaluating too soon, too narrowly.
Designing meaningful research questions
Amy J. Ko, Sally Fincher (2019). A Study Design Design Process. The Cambridge Handbook of Computing Education Research (Sally Fincher, Anthony Robins, Eds.).
Describes how to design coherent research questions, either standalone ones, or ones that concern the evaluation of a tool.
A process for refining a research question.
Using theory to inform evaluations
Nelson, G. L., & Ko, A. J. (2018, August). On use of theory in computing education research. In Proceedings of the 2018 ACM Conference on International Computing Education Research (pp. 31-39).
Explains the importance of using theory to inform design and evaluation, but also the perils of relying on it too much.
Theories are powerful, but with great power comes great responsibility.
Designing controlled experiments
Amy J. Ko, Thomas LaToza, Margaret M. Burnett (2013). A Practical Guide to Controlled Experiments of Software Engineering Tools with Human Participants. Empirical Software Engineering, 110-141.
Describes nearly everything you need to know to design a good experiment to study a developer tool.
Also explains why you should probably not design an experiment.
A diagram of the structure of a controlled experiment
Preventing usability problems in studies
Amy J. Ko, Margaret M. Burnett, Thomas R.G. Green, Karen J. Rothermel, Curtis R. Cook (2002). Improving the Design of Visual Programming Language Experiments Using Cognitive Walkthroughs. Journal of Visual Languages and Computing, 13(5), 517-544.
Explains how to verify, without piloting, that developers won’t encounter usability problems when trying to use your tool.
An adapted Cognitive Walkthrough process for detecting usability problems in your study.
Avoiding participant response bias
Dell, N., Vaidyanathan, V., Medhi, I., Cutrell, E., & Thies, W. (2012, May). “Yours is better!” Participant response bias in HCI. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (pp. 1321-1330).
Explains the risks of evaluating your own tools.
Participants are more likely to say they like what you made if you’re the one asking.
Consider this the beginning of your learning
Knowing anything about programming is hard!
We’ll be here to help.
Questions?
Amy J. Ko, Ph.D.