1 of 44

Scaling Manual Code Review

with codePost

April 8th, 2021

Jérémie Lumbroso, Princeton University

James Evans, codePost

2 of 44

Grading code could be like reading an essay

3 of 44

This workshop is interactive

  • We want your questions!
    • You may raise your hand on Zoom
      • You will be unmuted and able to ask your question
    • You may ask your question (possibly anonymously) on https://sli.do
      • Or upvote questions by others!
      • Event code is #7481
  • We want this to be a wonderful experience for you, so please speak up!

4 of 44

I don’t have the time

&

I don’t have the resources

1.

5 of 44

CS2 grading at Princeton circa 2014 (1)

  • weekly programming assignments
  • assignments by Sedgewick & Wayne
  • (same as Coursera Algorithms)
  • ~130 students, 6 sections
  • 1 instructor, 2 faculty section leaders, 3 grad TAs, 4–5 undergrad grading assistants
  • expansive legacy autograder tests
    • some exposed to students
    • rest used for grading/diagnostics
  • applied deductions, graded out of 40 pts
  • no solution code (plagiarism!)

39/40 “good code”

6 of 44

CS2 grading at Princeton circa 2014 (2)

  • Lots of paper
    • Time wasted printing
    • Tracking physical location of submission
    • Destroying old exams
  • Grading
    • Applying complex rubric consistently
    • “Assessing worth of student”
    • No pedagogy, no feedback
  • Many documents and tools, many authors, contradictory instructions: grading as a logistical challenge

...

*-----------------------------------------------------------

Running 8 total tests.

A point in an m-by-m grid means that it is of the form (i/m, j/m),

where i and j are integers between 0 and m

Test 1: insert n random points; check size() and isEmpty() after each insertion

(size may be less than n because of duplicates)

* 5 random points in a 1-by-1 grid

* 50 random points in a 8-by-8 grid

* 100 random points in a 16-by-16 grid

* 1000 random points in a 128-by-128 grid

* 5000 random points in a 1024-by-1024 grid

* 50000 random points in a 65536-by-65536 grid

==> passed

Test 2: insert n random points; check contains() with random query points

* 1 random points in a 1-by-1 grid

* 10 random points in a 4-by-4 grid

* 20 random points in a 8-by-8 grid

* 10000 random points in a 128-by-128 grid

* 100000 random points in a 1024-by-1024 grid

* 100000 random points in a 65536-by-65536 grid

==> passed

Test 3: insert random points; check nearest() with random query points

* 10 random points in a 4-by-4 grid

* 15 random points in a 8-by-8 grid

* 20 random points in a 16-by-16 grid

* 100 random points in a 32-by-32 grid

* 10000 random points in a 65536-by-65536 grid

==> passed

Test 4: insert random points; check range() with random query rectangles

* 2 random points and random rectangles in a 2-by-2 grid

* 10 random points and random rectangles in a 4-by-4 grid

* 20 random points and random rectangles in a 8-by-8 grid

...

autograder output (from the submission server)

...

* contains() / get() broken

[ -5 get not implemented or hopelessly flawed ]

[ -3 because of using reference equality instead of equals() ]

[ -3 because of testing only x-coordinates, but not y-coordinates ]

[ -3 because 2-way logic for (x < p.x) and (x > p.x) but no (x == p.x)

common symptom = incorrect drawing for circle.txt ]

[ -1 can't handle when root is null or other NullPointerException ]

[ -1 not handling (xmin == xmax) ]

[ -1 get works, but not contains ]

* range()

[ -5 not implemented or hopelessly flawed ]

[ -3 major flaws ]

[ -1 if only fails when N = 0 or no points in range ]

* nearest()

[ -8 not implemented or hopelessly flawed ]

[ -1 if fails only when N = 0 ]

[ -3 nearest only goes down insert path so doesn't always find correct

answer but sure is fast! ]

[ -3 nearest always tries left/bottom path first ]

[ -3 pruning is done incorrectly causing wrong answer sometimes ]

[ -2 if exception for corner case ]

...

rubric (“grading sheet”)

7 of 44

CS2 grading at Princeton circa 2014 (3)

Problems for students

  • No/little feedback, and autograder output is laconic
  • Rubric/deductions appear arbitrary
  • Since the solution is not released (plagiarism concerns), no improvement is possible

Problems for instructors

  • Bulk of time lost in logistics (compiling, printing, assigning to graders, tracking submissions as they are graded, pregrading, entering grades in LMS, processing late submissions)
  • Limited oversight of graders’ work
  • No/limited insights on students’ work

Problems for graders (possibly the instructors themselves)

  • Bulk of time lost in repetitive work (flipping through 5-page rubric, filling in grading sheet, adding points up, handwriting terse comments)
  • Adversarial work: Find everything that is wrong with the student’s work
  • No time to read code!!! Assembly-line work
  • Lots of different moving parts to master

Feedback I was the proudest of (in Fall 2014)

8 of 44

2014

2016

2019

9 of 44

“Resources haven’t changed, but our tools and process have”

audience:
  • Fall 2014: ~120 students
  • Spring 2020: ~300 students

labor:
  • Fall 2014: 1 instructor, 2 co-lead faculty section leaders, 3 grad students, 4-5 undergrad grading assistants
  • Spring 2020: 1 lead faculty coordinator + 30-50 undergrad grading assistants

breakdown:
  • Fall 2014: 5 hrs running autograder; 10 hrs printing + stapling; 2 hrs dispatching to graders; 60 hrs grading (~6 hrs/person); 3 hrs collecting graded work; 2 hrs redistributing
  • Spring 2020: 2 hrs preparing grading lesson; 1 hr teaching graders; 30-70 hrs grading (~1-2 hrs/person); 10 hrs writing explanations (only once); 1-2 hrs auditing class-wide work

total:
  • Fall 2014: 82 hours → ~40 min/student
  • Spring 2020: 35-85 hours → ~6-17 min/student

summary:
  • Fall 2014: output is a grade + a handful of words; time is spent moving paper around and looking through the rubric
  • Spring 2020: output is appropriate, assignment-targeted explanations + custom feedback on code; time is spent reading code, honoring the student, and improving pedagogy

10 of 44

11 of 44

12 of 44

21st century code grading toolbox

  • Limit/eliminate “manual transfer operations” (students → submission server → autograder → printer, printer → graders, graders → …)
  • Autograder:
    • Tries to ensure student code compiles
    • Helps students avoid obvious problems; helps weaker students make progress
    • Trade-off between time to write a test, and usefulness of test
  • Rubric:
    • Provides direction to human graders
    • Helps ensure consistency of grading
  • “Explanations”: Instructor-authored paragraphs shown to students; linked to rubric items, they provide the bulk (by volume) of the feedback students receive
  • Custom comments: Individualized comments, left by graders, which both reward students and help address individual code problems

13 of 44

Rather be doing this… … or writing this?

14 of 44

Next steps

  • What is code review / code quality?
    • Why is autograding alone not sufficient?
    • Who does code review? Why is it essential?
  • Preamble: Getting students to submit reviewable code
    • How to help students submit code that can be reviewed
    • What information can be extracted from a submission before human graders see it?
  • Strategies for scaling code review
    • Which techniques work when grading alone (instructor only)
    • How to leverage (and quality-check) a larger staff (instructor + TAs)
  • Live codePost exercise for participants [1 hour hands on]

15 of 44

What is code review / code quality?

2.

16 of 44

Why code review?

public static int dayOfYear(int month, int dayOfMonth, int year) {
    if (month == 2) {
        dayOfMonth += 31;
    } else if (month == 3) {
        dayOfMonth += 59;
    } else if (month == 4) {
        dayOfMonth += 90;
    } else if (month == 5) {
        dayOfMonth += 31 + 28 + 31 + 30;
    } else if (month == 6) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31;
    } else if (month == 7) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30;
    } else if (month == 8) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31;
    } else if (month == 9) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31;
    } else if (month == 10) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30;
    } else if (month == 11) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31;
    } else if (month == 12) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31 + 31;
    }
    return dayOfMonth;
}

Some “correct” code

Two discussion questions:

  • What is wrong with this code?
  • What tests could you write to detect these problems? (one possible set of probes is sketched below)

Source: https://web.mit.edu/6.005/www/fa15/classes/04-code-review/
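One possible answer to the second question, sketched in Python purely to enumerate probe inputs against a standard-library oracle; in this Java course the same cases would become JUnit assertions against dayOfYear:

# Probe cases for dayOfYear, with Python's datetime as the oracle.
# (Sketch only; the assignment's actual tests would be JUnit assertions in Java.)
from datetime import date

def expected_day_of_year(month, day, year):
    # Reference answer from the standard library.
    return date(year, month, day).timetuple().tm_yday

# Cases chosen to expose the problems discussed above:
probes = [
    (3, 1, 2016),    # leap year: expected 61, but the code ignores `year` and returns 60
    (3, 1, 2017),    # non-leap year: the code happens to be right here
    (12, 31, 2015),  # last day of the year: expected 365
    (12, 31, 2016),  # last day of a leap year: expected 366
]
for month, day, year in probes:
    print((month, day, year), "->", expected_day_of_year(month, day, year))

# Input validation is a separate family of tests: dayOfYear(13, 1, 2015) or
# dayOfYear(2, 30, 2015) should be rejected, yet the code silently returns a value.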

17 of 44

Why code review

  • Code review is ubiquitous in industry
    • Helps ensure code hygiene: maintainability, human-readability. Correct code != good production code
    • Allows for discussion and triage of correctness issues
  • Case study:
    • At codePost, ~25% of development time is dedicated to code review
    • Important but rarely taught skills we focus on:
      • Assuming someone other than the original author will maintain the code you write
      • Writing specific, actionable comments about others’ code
      • Reacting constructively, not defensively, to suggestions about code, even correct code

18 of 44

What makes code especially hard to review?

  • Code that doesn’t compile or contains syntax errors
    • This code will fail all automated tests
    • Debugging this code (by finding the errors) can be extremely labor-intensive, crowding out more meaningful feedback
  • Code that doesn’t adhere to a specified API
    • Failed tests might not expose bugs
    • Graders cannot lean on the pattern recognition they build up from other submissions, which makes the code harder to explore
  • Code with wacky style
    • Extra long lines, huge blocks of code, bad indentation, etc., make reading code tedious

19 of 44

Making code review easier

  • One way to avoid this type of code: incentivize students to submit “reviewable” code
    • Feedback loop: Create automated tests to check for the above symptoms, and expose these tests to students at the point of submission
    • Gamify: Group these tests into a group called “Level 1 requirements” (or something to indicate that they represent the most basic requirements)
    • Incentivize: Attach point values to these tests (a sketch of such pre-checks follows below)

Level 1 requirements exposed to students in codePost
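A rough illustration of such pre-submission checks, not codePost's actual test harness: the file name KdTree.java, the required method names (echoing the rubric excerpt earlier), and the 80-character limit are assumptions for a Java assignment.

# Sketch of "Level 1" pre-submission checks for a Java assignment.
# Assumptions: the submission contains KdTree.java, must compile with javac,
# and must expose the methods named in REQUIRED_API; adapt to your own spec.
import re
import subprocess
import sys
from pathlib import Path

REQUIRED_API = ["insert", "contains", "range", "nearest"]  # from the assignment spec
MAX_LINE_LENGTH = 80

def check_compiles(java_file: Path) -> bool:
    # Level 1a: the code must compile.
    result = subprocess.run(["javac", str(java_file)], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"FAIL compile: {result.stderr.strip()[:200]}")
        return False
    return True

def check_api(java_file: Path) -> bool:
    # Level 1b: the specified API must be present (crude textual check).
    source = java_file.read_text()
    missing = [m for m in REQUIRED_API if not re.search(rf"\b{m}\s*\(", source)]
    if missing:
        print(f"FAIL api: missing method(s) {missing}")
        return False
    return True

def check_style(java_file: Path) -> bool:
    # Level 1c: basic readability (line length only, here).
    long_lines = [i for i, line in enumerate(java_file.read_text().splitlines(), 1)
                  if len(line) > MAX_LINE_LENGTH]
    if long_lines:
        print(f"FAIL style: lines over {MAX_LINE_LENGTH} chars: {long_lines[:5]}")
        return False
    return True

if __name__ == "__main__":
    target = Path(sys.argv[1] if len(sys.argv) > 1 else "KdTree.java")
    ok = all([check_compiles(target), check_api(target), check_style(target)])
    sys.exit(0 if ok else 1)

Each check maps to one exposed test, so students see exactly which “Level 1 requirement” they are missing before a human ever reads the code.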

20 of 44

Personalized feedback workflows

3.

21 of 44

Personal Feedback Workflow: Disclaimer

This section will present a concrete, efficient personal feedback workflow:

  • techniques for instructors alone
    • these readily transfer to a group/distributed setting
  • and how to leverage (and quality-check) a larger staff (instructor + TAs)

but all examples are based on my workflow in Princeton’s CS1:

  • 300 submissions
  • I manage 30-70 undergraduate graders over a period of 1-3 hours
  • the main advantage is parallelization and speed, but this could be done with a smaller number of full-time TAs

22 of 44

An important distinction

In codePost, there are two complementary notions for comments:

  • Rubric comments belong to a rubric
    • instantiated by the graders
    • everything about them is controlled centrally (and retroactively) by the instructor:
      • grader description,
      • student explanation,
      • point delta
    • they also contain a small part that is filled in by the grader (the customization of the comment)
  • Custom comments are discretionary comments left by graders

Both notions are important for quality control and for efficiency at scale

23 of 44

This is a rubric comment

What the grader typically sees:

What the student sees:

grader[-facing] caption

(written once)

student[-facing] “explanation”

(written once)

“customization”

(written by grader, each time comment is applied)
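To make the split concrete, here is a small sketch of how the shared parts of a rubric comment and the per-application customization combine into what the student reads; the field names are illustrative, not codePost's actual schema.

# Illustrative data model for the two comment types (field names are assumptions).
from dataclasses import dataclass
from typing import Optional

@dataclass
class RubricComment:
    grader_caption: str        # written once; what graders see in the rubric window
    student_explanation: str   # written once; what students see
    point_delta: float         # controlled centrally (and retroactively) by the instructor

@dataclass
class AppliedComment:
    rubric_comment: Optional[RubricComment]  # None means a custom (discretionary) comment
    customization: str                       # written by the grader each time it is applied

    def student_view(self) -> str:
        # What the student ends up reading for this comment.
        if self.rubric_comment is None:
            return self.customization
        return (self.rubric_comment.student_explanation + "\n" + self.customization).strip()

# Example rubric item; the explanation and deduction can be revised later,
# and every application of the comment follows along.
pruning = RubricComment(
    grader_caption="nearest(): pruning done incorrectly",
    student_explanation="nearest() must still visit a subtree whose bounding box "
                        "could contain a closer point.",
    point_delta=-3.0,
)
print(AppliedComment(pruning, "See line 42: the right subtree is never visited.").student_view())

The payoff: editing student_explanation or point_delta once updates every submission the rubric comment was applied to.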

24 of 44

Individual Scenario:

Grading exam or new assignment

no existing rubric

single instructor doing the grading

25 of 44

Broad outline

To grade the assignments, you can follow three steps:

Tag First, Explain Later

  • Step 1: Grade submissions, and create the rubric as you go using the in-line collaborative rubric feature (but alone)
  • Step 2: Once you have tagged your submissions, you explore your data set and use the combined examples to help you write an explanation for each rubric item.

Iterative Rubric Creation

  • Step 3: If you left custom comments in your submissions, you may audit them to see if you can merge some to become rubric comments

The rubric is the

26 of 44

Step 1: create rubric

  • As you go along, you can either
    • add custom comments (if you think the comment is unique)
    • create a rubric comment as described here
  • This will build the rubric for your assignment and keep every submission linked to the corresponding rubric items


27 of 44

Step 2: Explain!

Add explanations to rubric items; adjust deductions

Have fun and go crazy! You won’t ever have to do it again

28 of 44

Step 3: Audit

You can audit the custom comments after grading

  • to make sure some shouldn’t be rubric comments instead (consistency)
  • to see if there are similar custom comments that would suggest creating a rubric comment (efficiency); see the sketch below
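One way to run this audit at scale is a quick similarity pass over an export of the comments. The sketch below assumes a file comments.json holding a list of records shaped like the dataset excerpt at the end of this deck (custom comments have rubric_comment set to null) and uses only the standard library:

# Sketch: flag pairs of similar custom comments that may deserve a rubric comment.
# Assumes comments.json is a list of records like the dataset excerpt shown later.
import json
from difflib import SequenceMatcher
from itertools import combinations

with open("comments.json") as f:
    records = json.load(f)

# Keep only custom comments (those not linked to a rubric comment).
custom = [r for r in records if r.get("rubric_comment") is None]

SIMILARITY_THRESHOLD = 0.75
# Pairwise comparison is quadratic, which is fine for a few hundred comments.
for a, b in combinations(custom, 2):
    text_a = a["comment"]["content"]
    text_b = b["comment"]["content"]
    ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    if ratio >= SIMILARITY_THRESHOLD:
        print(f"{ratio:.2f}  #{a['comment_id']} ~ #{b['comment_id']}")
        print(f"    {text_a[:60]!r}")
        print(f"    {text_b[:60]!r}")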

29 of 44

COS 126 audit in Spring 2021

30 of 44

Staff Scenario:

Grading existing assignment

pre-existing deductive rubric

instructor with staff of TAs

31 of 44

Rubrics for COS126

Dan Leyzberg and course staff

  • deductive
  • on 4 pts
  • (same normalization as exams)
  • roughly correspond to certain learning objectives

assuming this can’t be changed (time, hierarchy, legacy, etc.)

We will show how to apply it and give feedback with a team of TAs

32 of 44

Context

The rubric has been entered for the staff of graders to use:

  • They can apply the rubric comments, and optionally add their customization
  • They are encouraged to provide personal feedback as custom comments

We have already shown how to audit custom comments, but rubric comments can also be checked

33 of 44

Step 1: Applying comments from the rubric

When you have a rubric predefined, it appears (with grader-specific captions if available) and is ready to be applied

rubric window

rubric comment (without customization)

custom comment (currently empty)

34 of 44

RUBRIC EXPLORER

35 of 44

Step 2: Exploring

  • Explore every application of each rubric comment
  • Able to look at how each rubric item was applied
  • Can be used to write explanations, and to audit graders

36 of 44

Bonus miscellaneous

Mining the Rubric Dataset

using the scale of your class in your favor

37 of 44

Iteration via student feedback

  • Improve your rubric by soliciting feedback from students
  • Things to catch:
    • Unclear explanations
    • ....

  • Bonus: use last year’s data to improve this year’s teaching
    • Distribution of rubric comments (combined with comprehension scores) can point to learning breakdowns, so you can tweak the curriculum (see the sketch below)
    • Can leverage previous applications of rubric comments to train new staff (and students!)
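A sketch of that rubric-comment distribution, computed from exported records; comments.json is an assumed export shaped like the dataset excerpt at the end of this deck:

# Sketch: which rubric items hit the largest share of submissions?
# A rubric comment applied to a large fraction of the class may signal a
# class-wide learning breakdown worth addressing in lecture.
import json
from collections import defaultdict

with open("comments.json") as f:
    records = json.load(f)

all_submissions = {r["submission_id"] for r in records}
submissions_hit = defaultdict(set)  # rubric comment id -> submissions it was applied to
for r in records:
    if r.get("rubric_comment") is not None:
        submissions_hit[r["rubric_comment"]].add(r["submission_id"])

for rubric_id, subs in sorted(submissions_hit.items(), key=lambda kv: -len(kv[1])):
    share = len(subs) / len(all_submissions)
    print(f"rubric comment {rubric_id}: {len(subs)} submissions ({share:.0%})")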

38 of 44

Ensure fairness

  • What does fairness mean for grading?
    • Avoid conflicts of interest
    • Consistent scoring
  • Avoid conflicts of interest with anonymous grading mode
    • Added benefit of removing unconscious bias from grading process, besides explicit conflicts of interest
  • Consistent scoring
    • Much easier to adjudicate if TAs are grading random submissions: otherwise, you may need to account for systematic deviations in submission quality by TA
    • Data to assess fairness across TAs (computable from an export, as sketched below):
      • Average score awarded
      • Average score awarded, normalized for automated test failures
      • Frequency of rubric comment usage
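A sketch of these per-grader statistics over exported comment records; comments.json is an assumed export shaped like the dataset excerpt at the end of this deck, and point_delta, rubric_comment, and ratio_test_passed are fields of that dataset:

# Sketch: per-grader statistics from exported comment records -- average point
# delta, how often rubric comments are used, and the test pass rate of the
# submissions each grader happened to grade (to normalize for submission quality).
import json
from collections import defaultdict

with open("comments.json") as f:
    records = json.load(f)

by_grader = defaultdict(list)
for r in records:
    by_grader[r["grader"]].append(r)

for grader, recs in sorted(by_grader.items()):
    avg_delta = sum(r["point_delta"] for r in recs) / len(recs)
    rubric_share = sum(1 for r in recs if r.get("rubric_comment") is not None) / len(recs)
    avg_pass = sum(r["statistics"]["ratio_test_passed"] for r in recs) / len(recs)
    print(f"{grader}: avg delta {avg_delta:+.2f}, "
          f"rubric comments {rubric_share:.0%}, "
          f"avg test pass rate {avg_pass:.0%} (n={len(recs)})")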

39 of 44

Ensure quality

  • Hard problem: what makes a good code review?
    • Feedback quantity: lots of comments
    • Feedback quality: specific, actionable, reference student code, use rubrics
  • How to enforce:
    • Rubric-only mode: in this mode, graders can’t create custom comments, and are instead forced to use the rubric.
    • Instruction text: nudge graders to personalize rubric comments in specific ways.
  • How to measure
    • e.g., compute feedback metrics from comments exported via the codePost API (see the sketch below)
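A sketch of rough feedback-quality indicators per grader, again from an assumed comments.json export shaped like the dataset excerpt at the end of this deck; wordcount, uses_code, and uses_learner_tokens are fields of that dataset:

# Sketch: how much graders write, and how often they reference the student's
# own code or identifiers (proxies for specific, actionable feedback).
import json
from collections import defaultdict

with open("comments.json") as f:
    records = json.load(f)

by_grader = defaultdict(list)
for r in records:
    by_grader[r["grader"]].append(r)

for grader, recs in sorted(by_grader.items()):
    avg_words = sum(r["comment"]["wordcount"] for r in recs) / len(recs)
    uses_code = sum(r["indicators"]["uses_code"] for r in recs) / len(recs)
    uses_tokens = sum(r["indicators"]["uses_learner_tokens"] for r in recs) / len(recs)
    print(f"{grader}: {avg_words:.0f} words/comment, "
          f"code snippets in {uses_code:.0%}, "
          f"student identifiers in {uses_tokens:.0%} of comments")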

40 of 44

Live exercise for participants

facilitated by James Evans

4.

41 of 44

API, SDK and beyond

5.

42 of 44

codePost has an open API and a Python SDK
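A minimal sketch of connecting to it (pip install codepost). The entry points used here follow the SDK's quick-start as best I recall it; the course name and period are placeholders, and names should be checked against the current documentation:

# Minimal sketch using the codePost Python SDK (pip install codepost).
# NOTE: configure_api_key and course.list_available follow the SDK's quick-start;
# the course name and period below are placeholders -- verify exact names and
# attributes against the current SDK documentation.
import codepost

codepost.configure_api_key("YOUR_API_KEY")

# Retrieve the courses this API key has access to.
courses = codepost.course.list_available(name="COS126", period="S2021")
for course in courses:
    print(course.name, course.period)

The earlier auditing sketches assume a JSON export of comments; such an export could be scripted on top of this SDK.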

43 of 44

Dataset of the comments

{
  "assignment": {
    "id": 2763,
    "name": "Hello"
  },
  "submission_id": 122350,
  "comment_id": 285902,
  "grader": "xxxxxxxx@princeton.edu",
  "point_delta": 0.0,
  "rubric_comment": null,
  "feedback": 0,
  "comment": {
    "code_blobs": [
      {
        "language": "java",
        "code": "\nboolean isOrdered = ((a < b) && (b < c)) || ((a > b) && (b > c))\n"
      }
    ],
    "content": "you can declare and initialize the boolean in one statement:\n```\nboolean isOrdered = ((a < b) && (b < c)) || ((a > b) && (b > c))\n```",
    "length": 133,
    "wordcount": 30
  },
  "location": {
    "filename": "Ordered.java",
    "extension": ".java",
    "start_line": 5,
    "start_column": 0,
    "end_line": 6,
    "end_column": 65
  },
  "tests": {
    "total": 29,
    "passed": 28,
    "failed": [3609]
  },
  "variables": {
    "file": ["args", "b", "isOrdered", "a", "c"],
    "comment": ["isOrdered", "a", "c", "b"],
    "coincidence": ["b", "isOrdered", "a", "c"],
    "overlap": true
  },
  "indicators": {
    "uses_rubric_comment": false,
    "uses_code": true,
    "uses_learner_tokens": true
  },
  "statistics": {
    "ratio_code": 49.62406015037594,
    "ratio_test_passed": 0.9655172413793104
  }
}

44 of 44

THANK YOU

to you +

to the organizers of SIGCSE 2020

and board