1 of 44

Scaling Manual Code Review

with codePost

April 8th, 2021

Jérémie Lumbroso, Princeton University

James Evans, codePost

2 of 44

Grading code could be like reading an essay

3 of 44

This workshop is interactive

  • We want your questions!
    • You may raise your hand on Zoom
      • You will be unmuted and able to ask your question
    • You may ask your question (possibly anonymously) on https://sli.do
      • Or upvote questions by others!
      • Event code is #7481
  • We want this to be a wonderful experience for you, so please speak up!

4 of 44

I don’t have the time

&

I don’t have the resources

1.

5 of 44

CS2 grading at Princeton circa 2014 (1)

  • weekly programming assignments
  • assignments by Sedgewick & Wayne
  • (same as Coursera Algorithms)
  • ~130 students, 6 sections
  • 1 instructor, 2 faculty section leaders, 3 grad TAs, 4–5 undergrad grading assistants
  • expansive legacy autograder tests
    • some exposed to students
    • rest used for grading/diagnostics
  • applied deductions, graded out of 40 pts
  • no solution code (plagiarism!)

39/40 “good code”

6 of 44

CS2 grading at Princeton circa 2014 (2)

  • Lots of paper
    • Time wasted printing
    • Tracking physical location of submission
    • Destroying old exams
  • Grading
    • Applying complex rubric consistently
    • “Assessing worth of student”
    • No pedagogy, no feedback
  • Many documents and tools, many authors, contradictory instructions: grading as a logistical challenge

...

*-----------------------------------------------------------

Running 8 total tests.

A point in an m-by-m grid means that it is of the form (i/m, j/m),

where i and j are integers between 0 and m

Test 1: insert n random points; check size() and isEmpty() after each insertion

(size may be less than n because of duplicates)

* 5 random points in a 1-by-1 grid

* 50 random points in a 8-by-8 grid

* 100 random points in a 16-by-16 grid

* 1000 random points in a 128-by-128 grid

* 5000 random points in a 1024-by-1024 grid

* 50000 random points in a 65536-by-65536 grid

==> passed

Test 2: insert n random points; check contains() with random query points

* 1 random points in a 1-by-1 grid

* 10 random points in a 4-by-4 grid

* 20 random points in a 8-by-8 grid

* 10000 random points in a 128-by-128 grid

* 100000 random points in a 1024-by-1024 grid

* 100000 random points in a 65536-by-65536 grid

==> passed

Test 3: insert random points; check nearest() with random query points

* 10 random points in a 4-by-4 grid

* 15 random points in a 8-by-8 grid

* 20 random points in a 16-by-16 grid

* 100 random points in a 32-by-32 grid

* 10000 random points in a 65536-by-65536 grid

==> passed

Test 4: insert random points; check range() with random query rectangles

* 2 random points and random rectangles in a 2-by-2 grid

* 10 random points and random rectangles in a 4-by-4 grid

* 20 random points and random rectangles in a 8-by-8 grid

...

autograder output (from the submission server)

...

* contains() / get() broken

[ -5 get not implemented or hopelessly flawed ]

[ -3 because of using reference equality instead of equals() ]

[ -3 because of testing only x-coordinates, but not y-coordinates ]

[ -3 because 2-way logic for (x < p.x) and (x > p.x) but no (x == p.x)

common symptom = incorrect drawing for circle.txt ]

[ -1 can't handle when root is null or other NullPointerException ]

[ -1 not handling (xmin == xmax) ]

[ -1 get works, but not contains ]

* range()

[ -5 not implemented or hopelessly flawed ]

[ -3 major flaws ]

[ -1 if only fails when N = 0 or no points in range ]

* nearest()

[ -8 not implemented or hopelessly flawed ]

[ -1 if fails only when N = 0 ]

[ -3 nearest only goes down insert path so doesn't always find correct

answer but sure is fast! ]

[ -3 nearest always tries left/bottom path first ]

[ -3 pruning is done incorrectly causing wrong answer sometimes ]

[ -2 if exception for corner case ]

...

rubric (“grading sheet”)

7 of 44

CS2 grading at Princeton circa 2014 (3)

Problems for students

  • No/little feedback, and autograder output is laconic
  • Rubric/deductions appear arbitrary
  • Since the solution is not released (plagiarism concerns), no improvement is possible

Problems for instructors

  • Bulk of time lost in logistics (compiling, printing, assigning to graders, tracking submissions as they are graded, pregrading, entering grades in LMS, processing late submissions)
  • Limited oversight of graders’ work
  • No/limited insights on students’ work

Problems for graders (possibly the instructors themselves)

  • Bulk of time lost in repetitive work (flipping through 5-page rubric, filling in grading sheet, adding points up, handwriting terse comments)
  • Adversarial work: Find everything that is wrong with the student’s work
  • No time to read code!!! Assembly-line work
  • Lots of different moving parts to master

Feedback I was the proudest of (in Fall 2014)

8 of 44

2014

2016

2019

9 of 44

“Resources haven’t changed, but our tools and process have”

audience:
  • Fall 2014: ~120 students
  • Spring 2020: ~300 students

labor:
  • Fall 2014: 1 instructor, 2 co-lead faculty section leaders, 3 grad students, 4-5 undergrad grading assistants
  • Spring 2020: 1 lead faculty coordinator + 30-50 undergrad grading assistants

breakdown:
  • Fall 2014: 5 hrs running autograder; 10 hrs printing + stapling; 2 hrs dispatching to graders; 60 hrs grading (~6 hrs/person); 3 hrs collecting graded work; 2 hrs redistributing
  • Spring 2020: 2 hrs preparing grading lesson; 1 hr teaching graders; 30-70 hrs grading (~1-2 hrs/person); 10 hrs writing explanations (only once); 1-2 hrs auditing class-wide work

total:
  • Fall 2014: 82 hours → ~40 min/student
  • Spring 2020: 35-85 hours → ~6-17 min/student

summary:
  • Fall 2014: output is a grade + a handful of words; time is spent moving paper around and looking through the rubric
  • Spring 2020: output is appropriate, assignment-targeted explanations + custom feedback on code; time is spent reading code, honoring the student, and improving pedagogy

10 of 44

11 of 44

12 of 44

21st century code grading toolbox

  • Limit/eliminate “manual transfer operations” (students → submission server → autograder → printer, printer → graders, graders → …)
  • Autograder:
    • Tries to ensure student code compiles
    • Helps students avoid obvious problems; helps weaker students make progress
    • Trade-off between time to write a test, and usefulness of test
  • Rubric:
    • Provides direction to human graders
    • Helps ensure consistency of grading
  • “Explanations”: Instructor-authored paragraphs shown to students; linked to rubric items, they provide the bulk (by volume) of the feedback students receive
  • Custom comments: Individualized comments, left by graders, which both reward students and help address individual code problems

13 of 44

Rather be doing this… … or writing this?

14 of 44

Next steps

  • What is code review / code quality?
    • Why is autograding alone not sufficient?
    • Who does code review? Why is it essential?
  • Preamble: Getting students to submit reviewable code
    • How to help students submit code that can be reviewed
    • What information can be extracted from a submission before human graders see it?
  • Strategies for scaling code review
    • Which techniques work when grading alone (instructor only)
    • How to leverage (and quality-check) a larger staff (instructor + TAs)
  • Live codePost exercise for participants [1 hour hands on]

15 of 44

What is code review / code quality?

2.

16 of 44

Why code review?

public static int dayOfYear(int month, int dayOfMonth, int year) {
    if (month == 2) {
        dayOfMonth += 31;
    } else if (month == 3) {
        dayOfMonth += 59;
    } else if (month == 4) {
        dayOfMonth += 90;
    } else if (month == 5) {
        dayOfMonth += 31 + 28 + 31 + 30;
    } else if (month == 6) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31;
    } else if (month == 7) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30;
    } else if (month == 8) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31;
    } else if (month == 9) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31;
    } else if (month == 10) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30;
    } else if (month == 11) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31;
    } else if (month == 12) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31 + 31;
    }
    return dayOfMonth;
}

Some “correct” code

Two discussion questions:

  • What is wrong with this code?
  • What tests could you write to detect these problems? (one possible set of probes is sketched below)

Source: https://web.mit.edu/6.005/www/fa15/classes/04-code-review/
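One possible answer to the second question, sketched in Python purely to enumerate probe inputs against a standard-library oracle; in this Java course the same cases would become JUnit assertions against dayOfYear:

# Probe cases for dayOfYear, with Python's datetime as the oracle.
# (Sketch only; the assignment's actual tests would be JUnit assertions in Java.)
from datetime import date

def expected_day_of_year(month, day, year):
    # Reference answer from the standard library.
    return date(year, month, day).timetuple().tm_yday

# Cases chosen to expose the problems discussed above:
probes = [
    (3, 1, 2016),    # leap year: expected 61, but the code ignores `year` and returns 60
    (3, 1, 2017),    # non-leap year: the code happens to be right here
    (12, 31, 2015),  # last day of the year: expected 365
    (12, 31, 2016),  # last day of a leap year: expected 366
]
for month, day, year in probes:
    print((month, day, year), "->", expected_day_of_year(month, day, year))

# Input validation is a separate family of tests: dayOfYear(13, 1, 2015) or
# dayOfYear(2, 30, 2015) should be rejected, yet the code silently returns a value.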

17 of 44

Why code review

  • Code review is ubiquitous in industry
    • Helps ensure code hygiene: maintainability, human-readability. Correct code != good production code
    • Allows for discussion and triage of correctness issues
  • Case study:
    • At codePost, ~25% of development time is dedicated to code review
    • Important but rarely taught skills we focus on:
      • Assuming someone other than the original author will maintain the code you write
      • Writing specific, actionable comments about others’ code
      • Reacting constructively, not defensively, to suggestions about code, even correct code

18 of 44

What makes code especially hard to review?

  • Code that doesn’t compile or contains syntax errors
    • This code will fail all automated tests
    • Debugging this code (by finding the errors) can be extremely labor-intensive, crowding out more meaningful feedback
  • Code that doesn’t adhere to a specified API
    • Failed tests might not expose bugs
    • Graders cannot lean on the pattern recognition they build up from other submissions, which makes the code harder to explore
  • Code with wacky style
    • Extra long lines, huge blocks of code, bad indentation, etc., make reading code tedious

19 of 44

Making code review easier

  • One way to avoid this type of code: incentivize students to submit “reviewable” code
    • Feedback loop: Create automated tests to check for the above symptoms, and expose these tests to students at the point of submission
    • Gamify: Group these tests into a group called “Level 1 requirements” (or something to indicate that they represent the most basic requirements)
    • Incentivize: Attach point values to these tests (a sketch of such pre-checks follows below)

Level 1 requirements exposed to students in codePost
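A rough illustration of such pre-submission checks, not codePost's actual test harness: the file name KdTree.java, the required method names (echoing the rubric excerpt earlier), and the 80-character limit are assumptions for a Java assignment.

# Sketch of "Level 1" pre-submission checks for a Java assignment.
# Assumptions: the submission contains KdTree.java, must compile with javac,
# and must expose the methods named in REQUIRED_API; adapt to your own spec.
import re
import subprocess
import sys
from pathlib import Path

REQUIRED_API = ["insert", "contains", "range", "nearest"]  # from the assignment spec
MAX_LINE_LENGTH = 80

def check_compiles(java_file: Path) -> bool:
    # Level 1a: the code must compile.
    result = subprocess.run(["javac", str(java_file)], capture_output=True, text=True)
    if result.returncode != 0:
        print(f"FAIL compile: {result.stderr.strip()[:200]}")
        return False
    return True

def check_api(java_file: Path) -> bool:
    # Level 1b: the specified API must be present (crude textual check).
    source = java_file.read_text()
    missing = [m for m in REQUIRED_API if not re.search(rf"\b{m}\s*\(", source)]
    if missing:
        print(f"FAIL api: missing method(s) {missing}")
        return False
    return True

def check_style(java_file: Path) -> bool:
    # Level 1c: basic readability (line length only, here).
    long_lines = [i for i, line in enumerate(java_file.read_text().splitlines(), 1)
                  if len(line) > MAX_LINE_LENGTH]
    if long_lines:
        print(f"FAIL style: lines over {MAX_LINE_LENGTH} chars: {long_lines[:5]}")
        return False
    return True

if __name__ == "__main__":
    target = Path(sys.argv[1] if len(sys.argv) > 1 else "KdTree.java")
    ok = all([check_compiles(target), check_api(target), check_style(target)])
    sys.exit(0 if ok else 1)

Each check maps to one exposed test, so students see exactly which “Level 1 requirement” they are missing before a human ever reads the code.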

20 of 44

Personalized feedback workflows

3.

21 of 44

Personal Feedback Workflow: Disclaimer

This section will present a concrete, efficient personal feedback workflow:

  • techniques for instructors alone
    • these readily transfer to a group/distributed setting
  • and how to leverage (and quality-check) a larger staff (instructor + TAs)

but all examples are based on my workflow in Princeton’s CS1:

  • 300 submissions
  • I manage 30-70 undergraduate graders over a period of 1-3 hours
  • the main advantage is parallelization and speed, but this could be done with a smaller number of full-time TAs

22 of 44

An important distinction

In codePost, there are two complementary notions for comments:

  • Rubric comments belong to a rubric
    • instantiated by the graders
    • everything about them is controlled centrally (and retroactively) by the instructor:
      • grader description,
      • student explanation,
      • point delta
    • they also contain a small part that is filled in by the grader (the customization of the comment)
  • Custom comments are discretionary comments left by graders

Both notions are important for quality control and for efficiency at scale

23 of 44

This is a rubric comment

What the grader typically sees:

What the student sees:

grader[-facing] caption

(written once)

student[-facing] “explanation”

(written once)

“customization”

(written by grader, each time comment is applied)
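To make the split concrete, here is a small sketch of how the shared parts of a rubric comment and the per-application customization combine into what the student reads; the field names are illustrative, not codePost's actual schema.

# Illustrative data model for the two comment types (field names are assumptions).
from dataclasses import dataclass
from typing import Optional

@dataclass
class RubricComment:
    grader_caption: str        # written once; what graders see in the rubric window
    student_explanation: str   # written once; what students see
    point_delta: float         # controlled centrally (and retroactively) by the instructor

@dataclass
class AppliedComment:
    rubric_comment: Optional[RubricComment]  # None means a custom (discretionary) comment
    customization: str                       # written by the grader each time it is applied

    def student_view(self) -> str:
        # What the student ends up reading for this comment.
        if self.rubric_comment is None:
            return self.customization
        return (self.rubric_comment.student_explanation + "\n" + self.customization).strip()

# Example rubric item; the explanation and deduction can be revised later,
# and every application of the comment follows along.
pruning = RubricComment(
    grader_caption="nearest(): pruning done incorrectly",
    student_explanation="nearest() must still visit a subtree whose bounding box "
                        "could contain a closer point.",
    point_delta=-3.0,
)
print(AppliedComment(pruning, "See line 42: the right subtree is never visited.").student_view())

The payoff: editing student_explanation or point_delta once updates every submission the rubric comment was applied to.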

24 of 44

Individual Scenario:

Grading exam or new assignment

no existing rubric

single instructor doing the grading

25 of 44

Broad outline

To grade the assignments, you can follow three steps:

Tag First, Explain Later

  • Step 1: Grade submissions, and create the rubric as you go using the in-line collaborative rubric feature (but alone)
  • Step 2: Once you have tagged your submissions, you explore your data set and use the combined examples to help you write an explanation for each rubric item.

Iterative Rubric Creation

  • Step 3: If you left custom comments in your submissions, you may audit them to see if you can merge some to become rubric comments

The rubric is the

26 of 44

Step 1: create rubric

  • As you go along, you can either
    • add custom comments (if you think the comment is unique)
    • create a rubric comment as described here
  • This will build the rubric for your assignment and keep every submission linked to the corresponding rubric items


27 of 44

Step 2: Explain!

Add explanations to rubric items; adjust deductions

Have fun and go crazy! You won’t ever have to do it again

28 of 44

Step 3: Audit

You can audit the custom comments after grading

  • to make sure some shouldn’t be rubric comments instead (consistency)
  • to see if there are similar custom comments that would suggest creating a rubric comment (efficiency); see the sketch below
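One way to run this audit at scale is a quick similarity pass over an export of the comments. The sketch below assumes a file comments.json holding a list of records shaped like the dataset excerpt at the end of this deck (custom comments have rubric_comment set to null) and uses only the standard library:

# Sketch: flag pairs of similar custom comments that may deserve a rubric comment.
# Assumes comments.json is a list of records like the dataset excerpt shown later.
import json
from difflib import SequenceMatcher
from itertools import combinations

with open("comments.json") as f:
    records = json.load(f)

# Keep only custom comments (those not linked to a rubric comment).
custom = [r for r in records if r.get("rubric_comment") is None]

SIMILARITY_THRESHOLD = 0.75
# Pairwise comparison is quadratic, which is fine for a few hundred comments.
for a, b in combinations(custom, 2):
    text_a = a["comment"]["content"]
    text_b = b["comment"]["content"]
    ratio = SequenceMatcher(None, text_a.lower(), text_b.lower()).ratio()
    if ratio >= SIMILARITY_THRESHOLD:
        print(f"{ratio:.2f}  #{a['comment_id']} ~ #{b['comment_id']}")
        print(f"    {text_a[:60]!r}")
        print(f"    {text_b[:60]!r}")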

29 of 44

COS 126 audit in Spring 2021

30 of 44

Staff Scenario:

Grading existing assignment

pre-existing deductive rubric

instructor with staff of TAs

31 of 44

Rubrics for COS126

Dan Leyzberg and course staff

  • deductive
  • on 4 pts
  • (same normalization as exams)
  • roughly correspond to certain learning objectives

assuming this can’t be changed (time, hierarchy, legacy, etc.)

We will show how to apply it and give feedback with a team of TAs

32 of 44

Context

The rubric has been entered for the staff of graders to use:

  • They can apply the rubric comments, and optionally add their customization
  • They are encouraged to provide personal feedback as custom comments

We have already shown how to audit custom comments, but rubric comments can also be checked

33 of 44

Step 1: Applying comments from the rubric

When you have a rubric predefined, it appears (with grader-specific captions if available) and is ready to be applied

rubric window

rubric comment (without customization)

custom comment (currently empty)

34 of 44

RUBRIC EXPLORER

35 of 44

Step 2: Exploring

  • Explore every application of each rubric comment
  • Able to look at how each rubric item was applied
  • Can be used to write explanations, and to audit graders

36 of 44

Bonus miscellaneous

Mining the Rubric Dataset

using the scale of your class in your favor

37 of 44

Iteration via student feedback

  • Improve your rubric by soliciting feedback from students
  • Things to catch:
    • Unclear explanations
    • ....

  • Bonus: use last year’s data to improve this year’s teaching
    • Distribution of rubric comments (combined with comprehension scores) can point to learning breakdowns, so you can tweak the curriculum (see the sketch below)
    • Can leverage previous applications of rubric comments to train new staff (and students!)
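A sketch of that rubric-comment distribution, computed from exported records; comments.json is an assumed export shaped like the dataset excerpt at the end of this deck:

# Sketch: which rubric items hit the largest share of submissions?
# A rubric comment applied to a large fraction of the class may signal a
# class-wide learning breakdown worth addressing in lecture.
import json
from collections import defaultdict

with open("comments.json") as f:
    records = json.load(f)

all_submissions = {r["submission_id"] for r in records}
submissions_hit = defaultdict(set)  # rubric comment id -> submissions it was applied to
for r in records:
    if r.get("rubric_comment") is not None:
        submissions_hit[r["rubric_comment"]].add(r["submission_id"])

for rubric_id, subs in sorted(submissions_hit.items(), key=lambda kv: -len(kv[1])):
    share = len(subs) / len(all_submissions)
    print(f"rubric comment {rubric_id}: {len(subs)} submissions ({share:.0%})")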

38 of 44

Ensure fairness

  • What does fairness mean for grading?
    • Avoid conflicts of interest
    • Consistent scoring
  • Avoid conflicts of interest with anonymous grading mode
    • Added benefit of removing unconscious bias from grading process, besides explicit conflicts of interest
  • Consistent scoring
    • Much easier to adjudicate if TAs are grading random submissions: otherwise, you may need to account for systematic deviations in submission quality by TA
    • Data to assess fairness across TAs (computable from an export, as sketched below):
      • Average score awarded
      • Average score awarded, normalized for automated test failures
      • Frequency of rubric comment usage
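A sketch of these per-grader statistics over exported comment records; comments.json is an assumed export shaped like the dataset excerpt at the end of this deck, and point_delta, rubric_comment, and ratio_test_passed are fields of that dataset:

# Sketch: per-grader statistics from exported comment records -- average point
# delta, how often rubric comments are used, and the test pass rate of the
# submissions each grader happened to grade (to normalize for submission quality).
import json
from collections import defaultdict

with open("comments.json") as f:
    records = json.load(f)

by_grader = defaultdict(list)
for r in records:
    by_grader[r["grader"]].append(r)

for grader, recs in sorted(by_grader.items()):
    avg_delta = sum(r["point_delta"] for r in recs) / len(recs)
    rubric_share = sum(1 for r in recs if r.get("rubric_comment") is not None) / len(recs)
    avg_pass = sum(r["statistics"]["ratio_test_passed"] for r in recs) / len(recs)
    print(f"{grader}: avg delta {avg_delta:+.2f}, "
          f"rubric comments {rubric_share:.0%}, "
          f"avg test pass rate {avg_pass:.0%} (n={len(recs)})")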

39 of 44

Ensure quality

  • Hard problem: what makes a good code review?
    • Feedback quantity: lots of comments
    • Feedback quality: specific, actionable, reference student code, use rubrics
  • How to enforce:
    • Rubric-only mode: in this mode, graders can’t create custom comments, and are instead forced to use the rubric.
    • Instruction text: nudge graders to personalize rubric comments in specific ways.
  • How to measure
    • e.g., compute feedback metrics from comments exported via the codePost API (see the sketch below)
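A sketch of rough feedback-quality indicators per grader, again from an assumed comments.json export shaped like the dataset excerpt at the end of this deck; wordcount, uses_code, and uses_learner_tokens are fields of that dataset:

# Sketch: how much graders write, and how often they reference the student's
# own code or identifiers (proxies for specific, actionable feedback).
import json
from collections import defaultdict

with open("comments.json") as f:
    records = json.load(f)

by_grader = defaultdict(list)
for r in records:
    by_grader[r["grader"]].append(r)

for grader, recs in sorted(by_grader.items()):
    avg_words = sum(r["comment"]["wordcount"] for r in recs) / len(recs)
    uses_code = sum(r["indicators"]["uses_code"] for r in recs) / len(recs)
    uses_tokens = sum(r["indicators"]["uses_learner_tokens"] for r in recs) / len(recs)
    print(f"{grader}: {avg_words:.0f} words/comment, "
          f"code snippets in {uses_code:.0%}, "
          f"student identifiers in {uses_tokens:.0%} of comments")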

40 of 44

Live exercise for participants

facilitated by James Evans

4.

41 of 44

API, SDK and beyond

5.

42 of 44

codePost has an open API and a Python SDK
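A minimal sketch of connecting to it (pip install codepost). The entry points used here follow the SDK's quick-start as best I recall it; the course name and period are placeholders, and names should be checked against the current documentation:

# Minimal sketch using the codePost Python SDK (pip install codepost).
# NOTE: configure_api_key and course.list_available follow the SDK's quick-start;
# the course name and period below are placeholders -- verify exact names and
# attributes against the current SDK documentation.
import codepost

codepost.configure_api_key("YOUR_API_KEY")

# Retrieve the courses this API key has access to.
courses = codepost.course.list_available(name="COS126", period="S2021")
for course in courses:
    print(course.name, course.period)

The earlier auditing sketches assume a JSON export of comments; such an export could be scripted on top of this SDK.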

43 of 44

Dataset of the comments

{
  "assignment": {
    "id": 2763,
    "name": "Hello"
  },
  "submission_id": 122350,
  "comment_id": 285902,
  "grader": "xxxxxxxx@princeton.edu",
  "point_delta": 0.0,
  "rubric_comment": null,
  "feedback": 0,
  "comment": {
    "code_blobs": [
      {
        "language": "java",
        "code": "\nboolean isOrdered = ((a < b) && (b < c)) || ((a > b) && (b > c))\n"
      }
    ],
    "content": "you can declare and initialize the boolean in one statement:\n```\nboolean isOrdered = ((a < b) && (b < c)) || ((a > b) && (b > c))\n```",
    "length": 133,
    "wordcount": 30
  },
  "location": {
    "filename": "Ordered.java",
    "extension": ".java",
    "start_line": 5,
    "start_column": 0,
    "end_line": 6,
    "end_column": 65
  },
  "tests": {
    "total": 29,
    "passed": 28,
    "failed": [3609]
  },
  "variables": {
    "file": ["args", "b", "isOrdered", "a", "c"],
    "comment": ["isOrdered", "a", "c", "b"],
    "coincidence": ["b", "isOrdered", "a", "c"],
    "overlap": true
  },
  "indicators": {
    "uses_rubric_comment": false,
    "uses_code": true,
    "uses_learner_tokens": true
  },
  "statistics": {
    "ratio_code": 49.62406015037594,
    "ratio_test_passed": 0.9655172413793104
  }
}

44 of 44

THANK YOU

to you +

to the organizers of SIGCSE 2020

and board