Scaling Manual Code Review
with codePost
April 8th, 2021
Jérémie Lumbroso, Princeton University
James Evans, codePost
Grading code could be like reading an essay
This workshop is interactive
“I don’t have the time”
&
“I don’t have the resources”
1.
CS2 grading at Princeton circa 2014 (1)
39/40 “good code”
CS2 grading at Princeton circa 2014 (2)
...
*-----------------------------------------------------------
Running 8 total tests.
A point in an m-by-m grid means that it is of the form (i/m, j/m),
where i and j are integers between 0 and m
Test 1: insert n random points; check size() and isEmpty() after each insertion
(size may be less than n because of duplicates)
* 5 random points in a 1-by-1 grid
* 50 random points in a 8-by-8 grid
* 100 random points in a 16-by-16 grid
* 1000 random points in a 128-by-128 grid
* 5000 random points in a 1024-by-1024 grid
* 50000 random points in a 65536-by-65536 grid
==> passed
Test 2: insert n random points; check contains() with random query points
* 1 random points in a 1-by-1 grid
* 10 random points in a 4-by-4 grid
* 20 random points in a 8-by-8 grid
* 10000 random points in a 128-by-128 grid
* 100000 random points in a 1024-by-1024 grid
* 100000 random points in a 65536-by-65536 grid
==> passed
Test 3: insert random points; check nearest() with random query points
* 10 random points in a 4-by-4 grid
* 15 random points in a 8-by-8 grid
* 20 random points in a 16-by-16 grid
* 100 random points in a 32-by-32 grid
* 10000 random points in a 65536-by-65536 grid
==> passed
Test 4: insert random points; check range() with random query rectangles
* 2 random points and random rectangles in a 2-by-2 grid
* 10 random points and random rectangles in a 4-by-4 grid
* 20 random points and random rectangles in a 8-by-8 grid
...
autograder output
submission server
...
* contains() / get() broken
[ -5 get not implemented or hopelessly flawed ]
[ -3 because of using reference equality instead of equals() ]
[ -3 because of testing only x-coordinates, but not y-coordinates ]
[ -3 because 2-way logic for (x < p.x) and (x > p.x) but no (x == p.x)
common symptom = incorrect drawing for circle.txt ]
[ -1 can't handle when root is null or other NullPointerException ]
[ -1 not handling (xmin == xmax) ]
[ -1 get works, but not contains ]
* range()
[ -5 not implemented or hopelessly flawed ]
[ -3 major flaws ]
[ -1 if only fails when N = 0 or no points in range ]
* nearest()
[ -8 not implemented or hopelessly flawed ]
[ -1 if fails only when N = 0 ]
[ -3 nearest only goes down insert path so doesn't always find correct
answer but sure is fast! ]
[ -3 nearest always tries left/bottom path first ]
[ -3 pruning is done incorrectly causing wrong answer sometimes ]
[ -2 if exception for corner case ]
...
rubric
“grading sheet”
CS2 grading at Princeton circa 2014 (3)
Problems for students
Problems for instructors
Problems for graders (= possibly the instructors themselves)
Feedback I was the proudest of (in Fall 2014)
2014
2016
2019
“Resources haven’t changed but our tool and process have changed”
~120 students (Fall 2014)
1 instructor, 2 co-lead faculty section leaders, 3 grad students, 4-5 undergrad grading assistants
5 hrs running autograder
10 hrs printing + stapling
2 hrs dispatching to graders
60 hrs grading (~6 hrs/person)
3 hrs collecting graded work
2 hrs redistributing
82 hours → ~40 min/student
output is a grade + handful of words
time is spent moving paper around and looking through the rubric
~300 students (Spring 2020)
1 lead faculty coordinator +
30-50 undergrad grading assistants
2 hrs preparing grading lesson
1 hr teaching graders
30-70 hrs grading (~1-2 hrs/person)
10 hrs writing explanations (only once)
1-2 hrs auditing class-wide work
35-85 hours → ~6-17 min/student
output is appropriate assignment-targeted explanations + custom feedback on code
time is spent reading code, honoring students, and improving pedagogy
audience:
labor:
breakdown:
total:
summary:
21st century code grading toolbox
Rather be doing this… … or writing this?
Next steps
What is code review / code quality?
2.
Why code review?
public static int dayOfYear(int month, int dayOfMonth, int year) {
    if (month == 2) {
        dayOfMonth += 31;
    } else if (month == 3) {
        dayOfMonth += 59;
    } else if (month == 4) {
        dayOfMonth += 90;
    } else if (month == 5) {
        dayOfMonth += 31 + 28 + 31 + 30;
    } else if (month == 6) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31;
    } else if (month == 7) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30;
    } else if (month == 8) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31;
    } else if (month == 9) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31;
    } else if (month == 10) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30;
    } else if (month == 11) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31;
    } else if (month == 12) {
        dayOfMonth += 31 + 28 + 31 + 30 + 31 + 30 + 31 + 31 + 30 + 31 + 31;
    }
    return dayOfMonth;
}
Some “correct” code
Two discussion questions:
Source: https://web.mit.edu/6.005/www/fa15/classes/04-code-review/
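One direction a reviewer might push: the if/else ladder hides its month sums, so errors in the arithmetic are hard to spot by eye. A table-driven rewrite makes each month's length checkable against a calendar. This is a hypothetical sketch (in Python, not from the slides); like the original, it ignores leap years.

```python
# Days in months 1..12 for a non-leap year -- each entry is easy to
# verify against a calendar, unlike the inlined sums above.
DAYS_IN_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

def day_of_year(month, day_of_month, year):
    """Return the 1-based day of the year (leap years ignored)."""
    return sum(DAYS_IN_MONTH[:month - 1]) + day_of_month

print(day_of_year(3, 1, 2021))   # 60 (Jan 31 + Feb 28 + 1)
```

The point for discussion is not this particular rewrite, but that "correct" code can still fail review on reviewability grounds.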
Why code review
What makes code especially hard to review?
Making code review easier
Level 1 requirements exposed to students in codePost
Personalized feedback workflows
3.
Personal Feedback Workflow: Disclaimer
This section will present a concrete, efficient personal feedback workflow,
but all examples are based on my workflow in Princeton’s CS1:
An important distinction
In codePost, there are two complementary notions for comments:
These notions are important both for quality control and for efficiency at scale
This is a rubric comment
What the grader typically sees:
What the student sees:
grader[-facing] caption
(written once)
student[-facing] “explanation”
(written once)
“customization”
(written by grader, each time comment is applied)
Individual Scenario:
Grading exam or new assignment
no existing rubric
single instructor doing the grading
Broad outline
To grade the assignments, you can follow three steps:
“Tag First, Explain Later”
“Iterative Rubric Creation”
The rubric is the
Step 1: create rubric
1.
2.
3.
Step 2: Explain!
Add explanations to rubric items; adjust deductions
Have fun and go crazy! You won’t ever have to do it again
Step 3: Audit
You can audit the custom comments after grading
COS 126 audit in Spring 2021
Staff Scenario:
Grading existing assignment
pre-existing deductive rubric
instructor with staff of TAs
Rubrics for COS126
Dan Leyzberg and course staff
assuming this can’t be changed (time, hierarchy, legacy, etc.)
We will show how to apply and give feedback with team of TAs
Context
The rubric has been entered for the staff of graders to use:
We have already shown how to audit custom comments, but rubric comments can also be checked
Step 1: Applying comments from the rubric
When you have a rubric predefined, it appears (with grader-specific captions if available) and is ready to be applied
rubric window
rubric comment (without customization)
custom comment (currently empty)
RUBRIC
EXPLORER
Step 2: Exploring
Bonus miscellaneous
Mining the Rubric Dataset
using the scale of
your class in your favor
Iteration via student feedback
Ensure fairness
Ensure quality
Live exercise for participants
facilitated by James Evans
4.
API, SDK and beyond
5.
codePost has an open API and a Python SDK
Dataset of the comments
{
"assignment": {
"id": 2763,
"name": "Hello"
},
"submission_id": 122350,
"comment_id": 285902,
"grader": "xxxxxxxx@princeton.edu",
"point_delta": 0.0,
"rubric_comment": null,
"feedback": 0,
"comment": {
"code_blobs": [
{
"language": "java",
"code": "\nboolean isOrdered = ((a < b) && (b < c)) || ((a > b) && (b > c))\n"
}
],
"content": "you can declare and initialize the boolean in one statement:\n```\nboolean isOrdered = ((a < b) && (b < c)) || ((a > b) && (b > c))\n```",
"length": 133,
"wordcount": 30
},
"location": {
"filename": "Ordered.java",
"extension": ".java",
"start_line": 5,
"start_column": 0,
"end_line": 6,
"end_column": 65
},
"tests": {
"total": 29,
"passed": 28,
"failed": [
3609
]
},
"variables": {
"file": [
"args",
"b",
"isOrdered",
"a",
"c"
],
"comment": [
"isOrdered",
"a",
"c",
"b"
],
"coincidence": [
"b",
"isOrdered",
"a",
"c"
],
"overlap": true
},
"indicators": {
"uses_rubric_comment": false,
"uses_code": true,
"uses_learner_tokens": true
},
"statistics": {
"ratio_code": 49.62406015037594,
"ratio_test_passed": 0.9655172413793104
}
}
THANK YOU
to you +
to the organizers of SIGCSE 2020
and board