1 of 52

The Heightened Responsibility of CSS

Is it possible to be ethical?

2 of 52

3 of 52

Open science

Replicability

Ensuring data quality

Data for the public good

Things we care about that the IRB doesn’t necessarily care about

4 of 52

Systematic Content Analysis of Litigation EventS (SCALES)

5 of 52

🤷‍♀️ Insight from millions of court records is functionally inaccessible.

The problem:

$.10/page + nonuniform local practices + no meaningful bulk access

free, but limited in scope or coverage

costly + coverage often unclear + no bulk access

x hundreds of state and local courts

6 of 52

SCALES aims to bring transparency

to the systems and processes

of the U.S. courts,

ensuring that they are

fair, efficient, and accurate

7 of 52

SCALES team

SOCIAL

SCIENCE

ENGINEERING

JOURNALISM

LAW

8 of 52

SCALES Outcomes

© SCALES OKN

9 of 52

What SCALES is

An “open” knowledge network (OKN) would be available to all stakeholders, including the researchers who will help push this technology further. An OKN requires a nonproprietary, public–private development effort that spans the entire data science community and results in an open, shared infrastructure.

10 of 52

11 of 52

What SCALES does

Platform enables systematic analysis of court records

12 of 52

What SCALES has

  • Docket sheets for all criminal and civil cases filed in 2016 and 2017 in all 94 U.S. district courts
  • Docket sheets for all criminal and civil cases filed in 2002 – March 2021 in the Northern District of Illinois
    • Downloaded as of Fall 2020

  • All documents filed in N.D. Ill. in 2016

~750,000+ civil and criminal docket sheets

Court docs from ~12,800+ cases

13 of 52

SCALES tools

https://github.com/scales-okn

Software automatically runs PACER in a browser to download cases

Can scrape:

  • Queries
  • Dockets
  • Case summaries
  • Documents

14 of 52

Data extraction and transformation

HTML

JSON

Extracts and keys:

  • Case header information (nature of suit, judge, filing dates)
  • Parties (including address) and charges
  • Lawyers (address, phone, pro hac status)
  • Docket entries and documents if requested
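As a rough illustration of what the keyed JSON output could look like, here is a minimal sketch. Every class and field name below is a hypothetical stand-in chosen to mirror the bullets above; it is not the actual SCALES schema, and the values are placeholders.

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical record types mirroring the slide's bullets; not the SCALES schema.

@dataclass
class Lawyer:
    name: str
    address: str = ""
    phone: str = ""
    pro_hac_vice: bool = False

@dataclass
class Party:
    name: str
    role: str
    address: str = ""
    charges: list = field(default_factory=list)
    lawyers: list = field(default_factory=list)

@dataclass
class DocketEntry:
    date_filed: str
    text: str
    documents: list = field(default_factory=list)  # filled only if requested

@dataclass
class Case:
    case_id: str
    nature_of_suit: str
    judge: str
    filing_date: str
    parties: list = field(default_factory=list)
    docket: list = field(default_factory=list)

def to_json(case: Case) -> str:
    """Serialize a parsed case, nested records included, to a JSON string."""
    return json.dumps(asdict(case), indent=2)

# Placeholder values for illustration only.
example_case = Case(
    case_id="example-0001",
    nature_of_suit="example nature of suit",
    judge="Doe",
    filing_date="2016-01-04",
    parties=[
        Party(
            name="Example Plaintiff",
            role="plaintiff",
            lawyers=[Lawyer(name="Example Lawyer", pro_hac_vice=True)],
        )
    ],
    docket=[DocketEntry(date_filed="2016-01-04", text="COMPLAINT filed")],
)

print(to_json(example_case))
```

Nesting lawyers under parties and entries under the case keeps each record self-contained, which is one reasonable layout for downstream analysis.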

15 of 52

Understand and enrich data

16 of 52

Understand and enrich data

Entity Recognition & Disambiguation

Relation to case

Entity Type

17 of 52

Judge recognition and disambiguation

IFP decisions: case study

Other settings,

e.g. bond decisions?

18 of 52

Named Entity Recognition (NER)

Amy Joan St. Eve

Amy St. Eve

A.J. St. Eve

St. Eve

Honorable

District Judge

Judge

D.J.

  • 159,000 name variants
  • 40 honorific variants
  • 10 million+ string combinations

Rules-Based Pruning

  1. Order signed by District Judge Doe
  2. Preliminary hearing held in chambers of Doe, District Judge. All parties advised...
  3. Plaintiff Doe’s motion for extension of time.
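The three examples above suggest one possible pruning rule: keep a surname match only when an honorific sits directly before or after it. The sketch below is an assumption about how such a rule might look, not the SCALES implementation; `HONORIFICS` uses only the four variants shown on this slide, and `looks_like_judge` is a hypothetical helper name.

```python
import re

# The four honorific variants shown on the slide (the real list has about 40).
HONORIFICS = ["Honorable", "District Judge", "Judge", "D.J."]

def looks_like_judge(entry: str, surname: str) -> bool:
    """Return True if `surname` appears directly next to an honorific.

    Two patterns are checked:
      honorific + surname      e.g. "District Judge Doe"
      surname, honorific       e.g. "Doe, District Judge"
    """
    name = re.escape(surname)
    # Longest titles first so "District Judge" is tried before "Judge".
    titles = "|".join(
        re.escape(h) for h in sorted(HONORIFICS, key=len, reverse=True)
    )
    before = rf"(?<!\w)(?:{titles})\s+{name}\b"
    after = rf"\b{name},\s*(?:{titles})"
    return bool(re.search(before, entry) or re.search(after, entry))

examples = [
    "Order signed by District Judge Doe",
    "Preliminary hearing held in chambers of Doe, District Judge. All parties advised...",
    "Plaintiff Doe’s motion for extension of time.",
]
for text in examples:
    print(looks_like_judge(text, "Doe"), "|", text)
```

Under this rule the first two entries are kept as judge mentions and the third is pruned, because "Doe" there is preceded by "Plaintiff" rather than an honorific.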

19 of 52

Custom NER pipeline

Precision compared to spaCy’s out-of-the-box NER model for 2017 dockets

PRESIDE

98.6%

spaCy en_core_web_lg

85.9%

using spaCy’s ‘PERSON’ tag

spaCy v.3 trained from scratch
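The slide does not say how precision was measured. For reference, the standard definition is the share of predicted entities that are correct, which can be sketched as below. The span representation (entry index plus matched text) and the toy data are assumptions made purely for illustration.

```python
def precision(predicted: set, gold: set) -> float:
    """Fraction of predicted spans that also appear in the gold annotations."""
    if not predicted:
        raise ValueError("precision is undefined when nothing is predicted")
    return len(predicted & gold) / len(predicted)

# Toy spans as (docket_entry_index, matched_text); not real evaluation data.
gold_spans = {(0, "Doe"), (1, "Doe")}
predicted_spans = {(0, "Doe"), (1, "Doe"), (2, "Doe")}  # entry 2 is a false positive

toy_precision = precision(predicted_spans, gold_spans)
print(f"{toy_precision:.3f}")
```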

20 of 52

Lawyer and Law Firm Disambiguation

21 of 52

Corporate Party Disambiguation

22 of 52

Visualizing litigation (civil example*)

Complaint

Defendant’s MTD granted in full

Settlement

Discovery

Defendant’s MSJ

Defendant’s MSJ granted in full

Settlement: Consent Decree

Trial verdict: Plaintiff

Default judgment

Defendant’s MTD denied in full

Settlement

Defendant’s MTD granted in part

Settlement of remaining claims

*criminal in progress.

Defendant’s MTD

23 of 52

Litigation event ontology

Goal: Enable users to answer the following types of questions*:

  • What events happen in litigation?
  • How do cases and claims conclude?
  • What is the pathway by which cases travel from beginning to end?

Plus layer in court, judge, party, lawyer, and claim attributes.

*Do better than the Federal Judicial Center’s Integrated Database

24 of 52

MVP litigation events

Case beginning

  • Complaint
  • Notice of Removal
  • Writ of Habeas Corpus
  • Inbound transfer
  • Other

Answer

Discovery beginning

  • Scheduling conference?
  • Initial disclosures?
  • Work backward from MSJ?

Trial beginning

How to choose which events to include?

  • Balance comprehensive classification of each docket entry (“boil the ocean”) with MVP achievability.
  • Beginning → middle (few events) → end
  • Build in user search options as backstop.

25 of 52

MVP litigation events (cont.)

    • Trial
      • Verdict (jury)
      • Findings of fact and conclusions of law (bench trial)

  • Party resolution
      • Settlement
      • Rule 68
      • Voluntary dismissal
      • Party-provided judgment, e.g. consent decree

  • Granted/partially granted dispositive motions
      • Motion to dismiss
      • Motion for judgment on the pleadings
      • Motion for summary judgment
      • Motion for a judgment as a matter of law
  • Default judgment
  • Terminating sanctions and sua sponte dismissals
  • Outbound transfer or remand

Dispositive events (claim, case, or party)

Notice of Appeal
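One way to picture the ontology is as an enumeration of event types plus a case "pathway" expressed as an ordered list of events. The sketch below is a hypothetical illustration covering only a handful of the events listed above; the names `Event`, `DISPOSITIVE`, and `first_dispositive` are assumptions, not SCALES identifiers.

```python
from enum import Enum, auto

class Event(Enum):
    # A small, illustrative subset of the MVP events.
    COMPLAINT = auto()
    NOTICE_OF_REMOVAL = auto()
    ANSWER = auto()
    MTD_DENIED = auto()
    MTD_GRANTED = auto()
    MSJ_GRANTED = auto()
    SETTLEMENT = auto()
    VOLUNTARY_DISMISSAL = auto()
    DEFAULT_JUDGMENT = auto()
    JURY_VERDICT = auto()
    NOTICE_OF_APPEAL = auto()

# Events treated as dispositive in this sketch.
DISPOSITIVE = frozenset({
    Event.MTD_GRANTED,
    Event.MSJ_GRANTED,
    Event.SETTLEMENT,
    Event.VOLUNTARY_DISMISSAL,
    Event.DEFAULT_JUDGMENT,
    Event.JURY_VERDICT,
})

def first_dispositive(pathway):
    """Return the first dispositive event in an ordered pathway, or None."""
    return next((event for event in pathway if event in DISPOSITIVE), None)

# One civil pathway of the kind shown on the visualization slide:
# complaint, motion to dismiss denied, settlement.
pathway = [Event.COMPLAINT, Event.MTD_DENIED, Event.SETTLEMENT]
print(first_dispositive(pathway))
```

A real ontology would also need to record whether an event disposes of a claim, a party, or the whole case, which this sketch leaves out.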

26 of 52

Satyrn

27 of 52

GUI

28 of 52

Make analysis simple

29 of 52

Available on GitHub

Published under the GNU General Public License (GPL)

https://github.com/scales-okn

The GPL gives users four freedoms: to run, to study, to share, and to modify the software

30 of 52

Documentation site

Available through SCALES main site: https://scales-okn.org/

31 of 52

Sign up for Satyrn:

http://satyrn.scales-okn.org/sign-up

32 of 52

Ethical concerns at SCALES

33 of 52

What data is germane to the public interest?

What is the responsibility of SCALES as a data architect?

34 of 52

PUBLIC RIGHT TO DATA

  • In the United States, the public has a constitutionally protected right to access court proceedings and court records
  • Craig v. Harney (1947), Richmond Newspapers, Inc. v. Virginia (1980) & Globe Newspaper Co. v. Superior Court (1982)
    • The First Amendment gives the public a constitutional right of access to criminal trials
  • Associated Press v. District Court (1983)
    • Shifted focus to the right to examine court records

35 of 52

de jure vs. de facto public data

  • Information can be de jure public, but practically almost impossible to obtain
  • Veracity and practicality of the term ‘public record’ (see Salzmann 2000)

  • Managerialized rights framework

    • Characterized by a group of people whose rights have been violated, but who are not vindicating those rights
    • These rights are not self-enforcing, and are enforced differently depending on the parties

Therefore, managerialized rights shift the burden for protection to the exploited or potentially exploited party

36 of 52

37 of 52

Quantifying Failure

Why merging PACER and USSC data is probably doomed

***if you care about accuracy and are using non-anomalous cases

38 of 52

JustFair Data Loss

  • Start with 1,265,688 USSC records
  • Match 804,128 with their FJC data
    • Using district, sentencing month, probation, restitution, fine, sentencing year
    • Unique only
  • Match 524,393 to PACER by matching district + docket number

  • Total lost: 741,295 or ~59% of data
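The loss figures above can be checked with a few lines of arithmetic. The variable names are illustrative; the three counts come straight from the bullets.

```python
# Record counts taken from the bullets above.
ussc_records = 1_265_688
matched_to_fjc = 804_128
matched_to_pacer = 524_393

# Records dropped at each stage of the merge.
lost_at_fjc_step = ussc_records - matched_to_fjc
lost_at_pacer_step = matched_to_fjc - matched_to_pacer

# Overall loss relative to the starting USSC records.
total_lost = ussc_records - matched_to_pacer
share_lost = total_lost / ussc_records

print(f"lost matching to FJC:   {lost_at_fjc_step:,}")
print(f"lost matching to PACER: {lost_at_pacer_step:,}")
print(f"total lost:             {total_lost:,} ({share_lost:.1%})")
```

Splitting the total this way shows that the larger drop happens at the first step, the USSC-to-FJC match, rather than at the PACER match.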

Schanzenbach & Tiller 2007: matched USSC data with PACER records based on the date and length of the sentence and, when necessary, the amount of any fine, the offense type, and the Hispanic ethnicity of the defendant

Yang 2014: matched on sentencing year, sentencing month, offense type, sentence length in months, probation length in months, amount of monetary fine, whether the case ended by trial or plea agreement, and whether the case resulted in a life sentence; required an intermediate merge to Transactional Records Access Clearinghouse (TRAC) data

Ciocanel et al. 2020: merged USSC data to Federal Judicial Center data (which has docket numbers), then merged to PACER by matching district + docket number to judge initials

39 of 52

Our options: assumptions & problems

  • Assumption: All criminal cases are in PACER
    • FALSE
      • S&T (Schanzenbach & Tiller) removed all non-sentencing cases and still found ~14% simply missing 5–6 years after sentencing
  • Assumption: The data are all correct
    • FALSE

  • Problem: no ground truth for match quality
    • Other groups either manually make a small one or restrict only to unique cases
    • No one has dealt with the issue of PACER not having all the cases in the first place
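The "restrict only to unique cases" strategy can be sketched as follows: link two datasets on a composite key, but only where that key occurs exactly once on each side. This is a minimal illustration with toy records and hypothetical field names, not any group's actual matching code.

```python
from collections import Counter

def unique_key_matches(left, right, key_fields):
    """Pair records whose key is unique in `left` and unique in `right`."""
    def key(record):
        return tuple(record[f] for f in key_fields)

    left_counts = Counter(key(r) for r in left)
    right_counts = Counter(key(r) for r in right)
    right_by_key = {key(r): r for r in right}

    return [
        (rec, right_by_key[key(rec)])
        for rec in left
        if left_counts[key(rec)] == 1 and right_counts.get(key(rec), 0) == 1
    ]

# Toy data; field names and values are placeholders.
sentencing_records = [
    {"district": "A", "docket": "16-001", "id": "a"},
    {"district": "A", "docket": "16-002", "id": "b"},
    {"district": "A", "docket": "16-002", "id": "c"},  # duplicate key: dropped
    {"district": "B", "docket": "16-001", "id": "d"},  # absent on the right: dropped
]
court_records = [
    {"district": "A", "docket": "16-001", "case": "X"},
    {"district": "A", "docket": "16-002", "case": "Y"},
]

matches = unique_key_matches(sentencing_records, court_records, ["district", "docket"])
print(len(matches), "of", len(sentencing_records), "records matched")
```

Note what the sketch cannot do: a record missing from the right-hand dataset is silently dropped, and nothing confirms that a unique match is a correct one, which is the ground-truth problem described above.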

40 of 52

Our options: variables to merge ‘easily’

  • Month, year, total months sentenced, number of counts, probation (if any), restitution, district

  • Senttot, senttot0, nocounts, sentmon, probatn, amtrest, totrest, district

41 of 52

How bad is it?

  • Sample: USSC data for 2017 (66,873 observations)
  • Variables in our code so far (year, month, counts, months sentenced): uniquely identify 5,496 of 66,873 observations (about 8%)
  • All variables except district: uniquely identify 14,110 of 66,873 (about 21%)
  • All variables including district: uniquely identify 34,258 of 66,873 (about 51%)
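Counts like these come from asking how many records carry a key combination that no other record shares. A minimal sketch of that check follows, using four of the variable names from the previous slide with made-up toy values.

```python
from collections import Counter

def count_uniquely_identified(records, key_fields):
    """Count records whose combination of key_fields occurs exactly once."""
    keys = [tuple(r[f] for f in key_fields) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1)

# Toy records; the values are invented for illustration.
toy_records = [
    {"sentmon": 1, "senttot": 24, "nocounts": 1, "district": "A"},
    {"sentmon": 1, "senttot": 24, "nocounts": 1, "district": "B"},
    {"sentmon": 2, "senttot": 60, "nocounts": 2, "district": "A"},
    {"sentmon": 2, "senttot": 60, "nocounts": 2, "district": "A"},
]

without_district = count_uniquely_identified(
    toy_records, ["sentmon", "senttot", "nocounts"]
)
with_district = count_uniquely_identified(
    toy_records, ["sentmon", "senttot", "nocounts", "district"]
)
print(without_district, with_district)
```

Adding a variable can only split groups of duplicates further, so the unique count never decreases as variables are added. Records that agree on every variable stay indistinguishable no matter how many are used.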

42 of 52

Crime categories with heavy duplication disproportionately impact Black and Hispanic defendants

Top 4 charge categories with duplicates constitute 73% of all crimes

43 of 52

44 of 52

State of the field

45 of 52

Data connections & the future

46 of 52

Data connections & the future

Sweeney 2002

47 of 52

Data connections & the future

Sweeney 2002

48 of 52

Data connections & the future

Sweeney 2002

49 of 52

Surveillance

  • 130,000 images
  • Used AI to guess who is gay
  • Claimed to be correct 81–91% of the time at identifying sexuality

50 of 52

Replication

51 of 52

Replication

19% of citing claims either omitted important nuances of the results (9.3%) or completely mischaracterized findings from prior research (9.5%).

52 of 52

Open science

Diamond

https://oabooks-toolkit.org/lifecycle/article/13868103-green-gold-diamond-different-models-for-open-access-books

Gold

Green

Black

Hybrid

https://blogs.openbookpublishers.com/green-gold-diamond-black-what-does-it-all-mean/