The Heightened Responsibility of CSS
Is it possible to be ethical?
Open science
Replicability
Ensuring data quality
Data for the public good
Things we care about that the IRB doesn’t necessarily care about
Systematic Content Analysis of Litigation EventS (SCALES)
Insight from millions of court records is functionally inaccessible.
The problem:
$.10/page + nonuniform local practices + no meaningful bulk access
free, but limited in scope or coverage
costly + coverage often unclear + no bulk access
× hundreds of state and local courts
SCALES aims to bring transparency to the systems and processes of the U.S. courts, ensuring that they are fair, efficient, and accurate.
SCALES team
SOCIAL
SCIENCE
ENGINEERING
JOURNALISM
LAW
SCALES Outcomes
© SCALES OKN
What SCALES is
An “open” knowledge network (OKN) would be available to all stakeholders, including the researchers who will help push this technology further. An OKN requires a nonproprietary, public–private development effort that spans the entire data science community and results in an open, shared infrastructure.
What SCALES does
Platform enables systematic analysis of court records
What SCALES has
More than 750,000 civil and criminal docket sheets
Court documents from more than 12,800 cases
SCALES tools
https://github.com/scales-okn
Software automatically runs PACER in a browser to download cases
Can scrape:
Data extraction and transformation
HTML
JSON
Extracts and keys:
Understand and enrich data
Entity Recognition & Disambiguation
Relation to case
Entity Type
Judge recognition and disambiguation
IFP decisions: case study
Other settings, e.g., bond decisions?
Named Entity Recognition (NER)
Amy Joan St. Eve
Amy St. Eve
A.J. St. Eve
St. Eve
Honorable
District Judge
Judge
D.J.
Rules-Based Pruning
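The name variants and titles above suggest what rules-based pruning could look like in practice. The following is a minimal sketch, not the SCALES pipeline itself: the title list is taken from the slide, while the function names `prune_titles` and `surname_key` and the surname-prefix rule are hypothetical choices made for illustration.

```python
import re

# Titles shown on the slide; a real list would be much longer (assumption).
TITLES = ["honorable", "district judge", "judge", "d.j."]

# Try longer titles first so "district judge" is removed before "judge".
_TITLE_RE = re.compile(
    r"\b(?:"
    + "|".join(re.escape(t) for t in sorted(TITLES, key=len, reverse=True))
    + r")(?=\s|,|$)",
    re.IGNORECASE,
)

# Hypothetical set of tokens treated as part of a multi-word surname.
SURNAME_PREFIXES = {"st.", "de", "van"}


def prune_titles(mention: str) -> str:
    """Remove judicial titles and stray commas from a raw mention."""
    stripped = _TITLE_RE.sub(" ", mention)
    return " ".join(stripped.replace(",", " ").split())


def surname_key(mention: str) -> str:
    """Reduce a mention to a lowercase surname key for grouping variants."""
    tokens = prune_titles(mention).split()
    if not tokens:
        return ""
    if len(tokens) >= 2 and tokens[-2].lower() in SURNAME_PREFIXES:
        return " ".join(tokens[-2:]).lower()
    return tokens[-1].lower()


variants = [
    "Honorable Amy Joan St. Eve",
    "District Judge Amy St. Eve",
    "A.J. St. Eve, D.J.",
    "Judge St. Eve",
]
keys = {surname_key(v) for v in variants}
```

A surname key alone would merge different judges who share a last name, so a real disambiguation step would also need court, date, and first-name or initial evidence.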
Custom NER pipeline
Precision compared to spaCy’s out-of-the-box NER model for 2017 dockets
PRESIDE: 98.6%
spaCy en_core_web_lg: 85.9%
using spaCy’s ‘PERSON’ tag
spaCy v.3 trained from scratch
Lawyer and Law Firm Disambiguation
Corporate Party Disambiguation
Visualizing litigation (civil example*)
Complaint
Defendant’s MTD granted in full
Settlement
Discovery
Defendant’s MSJ
Defendant’s MSJ granted in full
Settlement: Consent Decree
Trial verdict: Plaintiff
Default judgment
Defendant’s MTD denied in full
Settlement
Defendant’s MTD granted in part
Settlement of remaining claims
*criminal in progress.
Defendant’s MTD
Litigation event ontology
Goal: Enable users to answer the following types of questions*:
Plus layer in court, judge, party, lawyer, and claim attributes.
*Do better than the Federal Judicial Center’s Integrated Database
MVP litigation events
Case beginning
Answer
Discovery beginning
Trial beginning
How to choose which events to include?
MVP litigation events (cont.)
Dispositive events (claim, case, or party)
Notice of Appeal
Satyrn
GUI
Make analysis simple
Available on GitHub
Published and available under the GPL
https://github.com/scales-okn
The General Public License gives users four freedoms: to run, to study, to share, and to modify the software
Documentation site
Available through SCALES main site: https://scales-okn.org/
Ethical concerns at SCALES
What data is germane to the public interest?
What is the responsibility of SCALES as a data architect?
PUBLIC RIGHT TO DATA
de jure vs. de facto public data
Therefore, managerialized rights shift the burden for protection to the exploited or potentially exploited party
Quantifying Failure
Why merging PACER and USSC data is probably doomed*
*If you care about accuracy and are using non-anomalous cases
JustFair Data Loss
Schanzenbach & Tiller 2007: matched USSC data with PACER records based on the date and length of the sentence and, when necessary, the amount of any fine, the offense type, and the Hispanic ethnicity of the defendant
Yang 2014: matched on sentencing year, sentencing month, offense type, sentence length in months, probation length in months, amount of monetary fine, whether the case ended by trial or plea agreement, and whether the case resulted in a life sentence; required an intermediate merge through the Transactional Records Access Clearinghouse (TRAC)
Ciocanel et al. 2020: merged USSC data to Federal Judicial Center data (which has docket numbers), then merged to PACER by matching district + docket number to judge initials
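Each of the approaches above matches records on a combination of sentencing attributes, which only works when that combination identifies a single record. One way to gauge the risk before merging is to measure how many records share their merge key with at least one other record. The sketch below uses made-up records and hypothetical field names; it is not drawn from the USSC or PACER schemas.

```python
from collections import Counter

# Made-up records for illustration only; field names are hypothetical.
records = [
    {"district": "ilnd", "sentence_date": "2016-03-01", "months": 60, "offense": "drug"},
    {"district": "ilnd", "sentence_date": "2016-03-01", "months": 60, "offense": "drug"},
    {"district": "ilnd", "sentence_date": "2016-03-01", "months": 24, "offense": "fraud"},
    {"district": "txsd", "sentence_date": "2016-03-02", "months": 12, "offense": "immigration"},
]

KEY_FIELDS = ("district", "sentence_date", "months", "offense")


def ambiguous_share(rows, key_fields):
    """Fraction of rows whose merge key is shared with another row."""
    if not rows:
        return 0.0
    keys = [tuple(row[field] for field in key_fields) for row in rows]
    counts = Counter(keys)
    ambiguous = sum(1 for key in keys if counts[key] > 1)
    return ambiguous / len(rows)


share_full_key = ambiguous_share(records, KEY_FIELDS)
share_district_only = ambiguous_share(records, ("district",))
```

Rows with a shared key cannot be matched one-to-one without further information, so a high share signals that the merge will either drop records or link them incorrectly.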
Our options: assumptions & problems
Our options: variables to merge ‘easily’
How bad is it?
Crime categories with heavy duplication disproportionately impact Black and Hispanic defendants
Top 4 charge categories with duplicates constitute 73% of all crimes
State of the field
Data connections & the future
Sweeney 2002
Surveillance
Replication
19% of citing claims either failed to include important nuances of results (9.3%) or completely mischaracterized findings from prior research altogether (9.5%).
Open science
Diamond
https://oabooks-toolkit.org/lifecycle/article/13868103-green-gold-diamond-different-models-for-open-access-books
Gold
Green
Black
Hybrid
https://blogs.openbookpublishers.com/green-gold-diamond-black-what-does-it-all-mean/