
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset

Santosh T.Y.S.S, Nina Baumgartner, Matthias Stürmer, Matthias Grabmair, Joel Niklaus


Need for Explainable Legal Judgment Prediction (LJP)

  • LJP: determine a case's outcome from its facts description
  • Deep learning methods predict solely from the case facts, bypassing the interpretable legal reasoning process
  • Significant risk when models rely on factors that are predictive but lack legal relevance or involve sensitive attributes
  • Such reliance leads to unjust and biased outcomes and undermines the principles of fairness and equal treatment within the legal system
  • Models need to be analyzed from an explainability standpoint to enhance trust


Our Contributions: Explainability & Bias

  • Explainability: rationales for 108 Swiss cases at a fine-grained sub-sentence level
    • Labels: supports judgment, opposes judgment, and neutral
    • Perturbation-based occlusion (see the sketch after this list)
      • Remove the rationales from the facts and measure the change in prediction confidence
  • Bias
    • The Federal Supreme Court of Switzerland handles cases arising from lower courts
    • Test bed to assess how much models rely on these lower court names; we measure this bias through lower court insertion (LCI)
    • Insert other lower court names into each case and measure the change in prediction confidence scores
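Both perturbations reduce to measuring a confidence change between a baseline facts text and a perturbed one. A minimal sketch of that idea, assuming a hypothetical `predict_proba` wrapper around a trained SJP classifier (not the paper's released code):

```python
# Minimal sketch of the perturbation idea behind occlusion and LCI.
# `predict_proba` is a hypothetical wrapper returning the model's
# confidence in its baseline prediction for a given facts string.

def confidence_change(predict_proba, baseline_facts: str, perturbed_facts: str) -> float:
    """Positive values mean the perturbed text contributed to the prediction."""
    return predict_proba(baseline_facts) - predict_proba(perturbed_facts)

def occlude(facts: str, rationale_span: str) -> str:
    """Occlusion: remove an annotated rationale span from the facts."""
    return facts.replace(rationale_span, "")

def insert_lower_court(facts: str, actual_court: str, other_court: str) -> str:
    """LCI: swap the actual lower court mention for another court's name."""
    return facts.replace(actual_court, other_court)
```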


The SJP Dataset (Niklaus et al., 2021)

  • 85,000 cases from the Federal Supreme Court of Switzerland (FSCS)
  • 2000-2020, chronologically split into training (2000-2014), validation (2015-2016), and test (2017-2020); a toy split sketch follows this list
  • Written in three languages:
    • German (50K)
    • French (31K)
    • Italian (4K)
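As a toy illustration of the chronological split, assuming each case record carries a decision-year field (the field name is an assumption, not the released schema):

```python
# Toy chronological split; the `year` field name is an assumption.
cases = [{"id": i, "year": 2000 + i % 21} for i in range(100)]  # stand-in data

train = [c for c in cases if 2000 <= c["year"] <= 2014]
val   = [c for c in cases if 2015 <= c["year"] <= 2016]
test  = [c for c in cases if 2017 <= c["year"] <= 2020]
```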


Explanation Rationales & Lower Court

  • Annotations for 108 cases from the validation and test sets
  • Equal proportions across the three languages and legal areas
  • 3 legal experts: 2 law students, 1 lawyer
  • Annotation task (an illustrative record layout follows this list)
    • Annotate sentences or sub-sentences in the facts that "support" or "oppose" the final judgment
      • The additional "oppose" label, unlike in previous works, accounts for perspectivism in legal reasoning
    • Annotate the remaining sentences as "neutral": not a rationale label, but it assists in segmenting the legal text into sentences
    • Annotate the lower court mentions in the facts
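A hypothetical layout of one annotated record; field names, spans, and values are illustrative, not the released dataset schema:

```python
# Hypothetical layout of one annotated case; all field names, spans,
# and values are illustrative, not the released dataset schema.
annotated_case = {
    "case_id": "case-0001",
    "language": "de",
    "facts": "...",  # full facts description
    "rationales": [
        {"span": (120, 185), "label": "supports_judgment"},  # sub-sentence span
        {"span": (402, 455), "label": "opposes_judgment"},
    ],
    "neutral_spans": [(0, 119)],    # remaining segmented sentences
    "lower_court_span": (10, 48),   # mention of the lower court in the facts
}
```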


Inter-Annotator Agreement (IAA)

Agreement on the "Lower Court" and "Supports Judgment" categories is notably high compared to "Opposes Judgment".

The lower agreement on "Opposes Judgment" can be attributed to the difficulty of identifying opposing spans.
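The slides do not name the agreement metric; as a sketch, pairwise Cohen's kappa over token- or span-level labels is one common way such agreement could be quantified (the annotators and labels below are toy examples):

```python
# Sketch of pairwise inter-annotator agreement; the slides do not
# specify the metric, so Cohen's kappa is used here as one common
# choice. The annotator names and labels are toy examples.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotator_labels = {
    "law_student_1": ["supports", "neutral", "opposes", "neutral"],
    "law_student_2": ["supports", "neutral", "neutral", "neutral"],
    "lawyer":        ["supports", "opposes", "opposes", "neutral"],
}

for a, b in combinations(annotator_labels, 2):
    kappa = cohen_kappa_score(annotator_labels[a], annotator_labels[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```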


Occlusion & LCI Dataset

  • Occlusion dataset
    • Create instances by occluding 1, 2, 3, or 4 sentences that share the same label (opposes/supports/neutral)
    • Pair each occluded instance with the baseline (the actual text) to measure the difference in prediction probability between them
  • LCI dataset
    • Derive a counterfactual test set by replacing the actual lower court mention with other lower court names
    • Pair each counterfactual with the baseline to measure the change between them (a generation sketch of both datasets follows this list)
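A sketch of how the paired instances could be generated; function and variable names are illustrative, not the paper's code:

```python
# Sketch of paired-instance generation for occlusion and LCI;
# names are illustrative, not the paper's released code.
from itertools import combinations

def occlusion_instances(sentences, labels, target_label, max_n=4):
    """Yield (baseline, occluded) pairs with 1..max_n same-label sentences removed."""
    idxs = [i for i, lab in enumerate(labels) if lab == target_label]
    baseline = " ".join(sentences)
    for n in range(1, max_n + 1):
        for combo in combinations(idxs, n):
            occluded = " ".join(s for i, s in enumerate(sentences) if i not in combo)
            yield baseline, occluded

def lci_instances(facts, actual_court, all_courts):
    """Yield (baseline, counterfactual) pairs with the lower court mention swapped."""
    for court in all_courts:
        if court != actual_court:
            yield facts, facts.replace(actual_court, court)
```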


Models

  • Hierarchical models to deal with longer inputs
    • Monolingual models
      • GermanBERT (German), CamemBERT (French), UmBERTo (Italian)
    • Multilingual
      • XLM-RoBERTa
    • Mono-/multilingual with data augmentation (DA)
      • EasyNMT to obtain translated training data
    • Joint training on all languages

Metrics

  • Explainability using occlusion (see the scoring sketch after this list)
    • Difference between the temperature-scaled confidence on the baseline and on the occluded instance
      • negative: "opposes"
      • positive: "supports"
      • no change: "neutral"
    • Report the F1-score for each label
  • LCI fairness
    • Change in the confidence score; report the averages of the positive and negative changes separately
      • positive: pro-dismissal
      • negative: pro-approval
    • Flip ratio: fraction of cases whose predicted label flips from the baseline when another lower court name is inserted
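A minimal sketch of the occlusion label assignment and the LCI flip ratio; the tolerance `eps` for "no change" and all names are assumptions for illustration:

```python
# Sketch of occlusion scoring and LCI flip ratio; the tolerance `eps`
# for "no change" and all names are illustrative assumptions.
from sklearn.metrics import f1_score

def occlusion_label(p_baseline: float, p_occluded: float, eps: float = 0.01) -> str:
    """Map the temperature-scaled confidence difference to a rationale label."""
    diff = p_baseline - p_occluded
    if diff > eps:
        return "supports"   # removal lowered confidence: span supported the judgment
    if diff < -eps:
        return "opposes"    # removal raised confidence: span opposed the judgment
    return "neutral"        # no meaningful change

def lci_flip_ratio(baseline_preds, perturbed_preds) -> float:
    """Fraction of LCI instances whose predicted label flips from the baseline."""
    flips = sum(b != p for b, p in zip(baseline_preds, perturbed_preds))
    return flips / len(baseline_preds)

# Per-label F1 against the expert rationales (illustrative usage):
# f1_score(gold, pred, labels=["supports", "opposes", "neutral"], average=None)
```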


Occlusion Explainability

  • Better scores for "supports" than for "neutral" and "opposes"
  • French models do better on "supports judgment" but do not do well on the other labels
  • Multilingual models and joint training improve the scores mainly for "supports", and for "neutral" in some cases
  • DA helped improve explainability for mono- and multilingual models, but it did not help in the case of joint training


LCI Bias

  • Lower court insertion changes the confidence score by around 5% in both directions
    • Such changes can flip the predicted label
  • With the DA component, the bias increased further
  • The joint training model improved prediction performance on Italian
    • but at the cost of increased bias scores for Italian, with a higher flip rate indicating representational bias in the dataset


Conclusion

  • A rationale dataset of 108 trilingual cases, annotated at a fine-grained level with supporting and opposing factors for SJP
  • A perturbation-based occlusion dataset to assess explainability
    • Lower explainability scores across models indicate that current models do not predict right for the right reasons
  • Lower court bias assessed via the LCI test
    • Inserting an average of 7 tokens has the potential to flip predictions