
Towards Explainability and Fairness in Swiss Judgement Prediction: Benchmarking on a Multilingual Dataset

Santosh T.Y.S.S, Nina Baumgartner, Matthias Stürmer, Matthias Grabmair, Joel Niklaus


Need for Explainable Legal Judgment Prediction (LJP)

  • LJP: determine a case's outcome from its facts description
  • Deep learning methods predict solely from the case facts, bypassing the interpretable legal reasoning process
  • Significant risk when models rely on factors that are predictive but lack legal relevance or involve sensitive attributes
  • Such reliance leads to unjust and biased outcomes and undermines the principles of fairness and equal treatment within the legal system
  • Models need to be analyzed from an explainability standpoint to enhance trust


Our Contributions: Explainability & Bias

  • Explainability: rationales for 108 Swiss cases at a fine-grained sub-sentence level
    • Labels: supports judgment, opposes judgment, and neutral
    • Perturbation-based occlusion (see the sketch after this list)
      • Remove the rationales from the facts and measure the change in prediction confidence
  • Bias
    • The Federal Supreme Court of Switzerland handles cases arising from lower courts
    • Test bed to assess how much models rely on these lower court names; we measure this bias through lower court insertion (LCI)
    • Insert other lower court names into each case and measure the change in prediction confidence scores
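Both perturbations reduce to measuring a confidence change between a baseline facts text and a perturbed one. A minimal sketch of that idea, assuming a hypothetical `predict_proba` wrapper around a trained SJP classifier (not the paper's released code):

```python
# Minimal sketch of the perturbation idea behind occlusion and LCI.
# `predict_proba` is a hypothetical wrapper returning the model's
# confidence in its baseline prediction for a given facts string.

def confidence_change(predict_proba, baseline_facts: str, perturbed_facts: str) -> float:
    """Positive values mean the perturbed text contributed to the prediction."""
    return predict_proba(baseline_facts) - predict_proba(perturbed_facts)

def occlude(facts: str, rationale_span: str) -> str:
    """Occlusion: remove an annotated rationale span from the facts."""
    return facts.replace(rationale_span, "")

def insert_lower_court(facts: str, actual_court: str, other_court: str) -> str:
    """LCI: swap the actual lower court mention for another court's name."""
    return facts.replace(actual_court, other_court)
```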


The SJP Dataset (Niklaus et al., 2021)

  • 85,000 cases from the Federal Supreme Court of Switzerland (FSCS)
  • 2000-2020, chronologically split into training (2000-2014), validation (2015-2016), and test (2017-2020); a toy split sketch follows this list
  • Written in three languages:
    • German (50K)
    • French (31K)
    • Italian (4K)
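As a toy illustration of the chronological split, assuming each case record carries a decision-year field (the field name is an assumption, not the released schema):

```python
# Toy chronological split; the `year` field name is an assumption.
cases = [{"id": i, "year": 2000 + i % 21} for i in range(100)]  # stand-in data

train = [c for c in cases if 2000 <= c["year"] <= 2014]
val   = [c for c in cases if 2015 <= c["year"] <= 2016]
test  = [c for c in cases if 2017 <= c["year"] <= 2020]
```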


Explanation Rationales & Lower Court

  • Annotations for 108 cases from the validation and test sets
  • Equal proportions across the three languages and legal areas
  • 3 legal experts: 2 law students, 1 lawyer
  • Annotation task (an illustrative record layout follows this list)
    • Annotate sentences or sub-sentences in the facts that "support" or "oppose" the final judgment
      • The additional "oppose" label, unlike in previous works, accounts for perspectivism in legal reasoning
    • Annotate the remaining sentences as "neutral": not a rationale label, but it assists in segmenting the legal text into sentences
    • Annotate the lower court mentions in the facts
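A hypothetical layout of one annotated record; field names, spans, and values are illustrative, not the released dataset schema:

```python
# Hypothetical layout of one annotated case; all field names, spans,
# and values are illustrative, not the released dataset schema.
annotated_case = {
    "case_id": "case-0001",
    "language": "de",
    "facts": "...",  # full facts description
    "rationales": [
        {"span": (120, 185), "label": "supports_judgment"},  # sub-sentence span
        {"span": (402, 455), "label": "opposes_judgment"},
    ],
    "neutral_spans": [(0, 119)],    # remaining segmented sentences
    "lower_court_span": (10, 48),   # mention of the lower court in the facts
}
```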


Inter-Annotator Agreement (IAA)

Agreement on the "Lower Court" and "Supports Judgment" categories is notably high compared to "Opposes Judgment".

The lower agreement on "Opposes Judgment" can be attributed to the difficulty of identifying opposing spans.
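The slides do not name the agreement metric; as a sketch, pairwise Cohen's kappa over token- or span-level labels is one common way such agreement could be quantified (the annotators and labels below are toy examples):

```python
# Sketch of pairwise inter-annotator agreement; the slides do not
# specify the metric, so Cohen's kappa is used here as one common
# choice. The annotator names and labels are toy examples.
from itertools import combinations
from sklearn.metrics import cohen_kappa_score

annotator_labels = {
    "law_student_1": ["supports", "neutral", "opposes", "neutral"],
    "law_student_2": ["supports", "neutral", "neutral", "neutral"],
    "lawyer":        ["supports", "opposes", "opposes", "neutral"],
}

for a, b in combinations(annotator_labels, 2):
    kappa = cohen_kappa_score(annotator_labels[a], annotator_labels[b])
    print(f"{a} vs {b}: kappa = {kappa:.2f}")
```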


Occlusion & LCI Dataset

  • Occlusion dataset
    • Create instances by occluding 1, 2, 3, or 4 sentences that share the same label (opposes/supports/neutral)
    • Pair each occluded instance with the baseline (the actual text) to measure the difference in prediction probability between them
  • LCI dataset
    • Derive a counterfactual test set by replacing the actual lower court mention with other lower court names
    • Pair each counterfactual with the baseline to measure the change between them (a generation sketch of both datasets follows this list)
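A sketch of how the paired instances could be generated; function and variable names are illustrative, not the paper's code:

```python
# Sketch of paired-instance generation for occlusion and LCI;
# names are illustrative, not the paper's released code.
from itertools import combinations

def occlusion_instances(sentences, labels, target_label, max_n=4):
    """Yield (baseline, occluded) pairs with 1..max_n same-label sentences removed."""
    idxs = [i for i, lab in enumerate(labels) if lab == target_label]
    baseline = " ".join(sentences)
    for n in range(1, max_n + 1):
        for combo in combinations(idxs, n):
            occluded = " ".join(s for i, s in enumerate(sentences) if i not in combo)
            yield baseline, occluded

def lci_instances(facts, actual_court, all_courts):
    """Yield (baseline, counterfactual) pairs with the lower court mention swapped."""
    for court in all_courts:
        if court != actual_court:
            yield facts, facts.replace(actual_court, court)
```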


Models

  • Hierarchical models to deal with longer inputs
    • Monolingual models
      • GermanBERT (German), CamemBERT (French), UmBERTo (Italian)
    • Multilingual
      • XLM-RoBERTa
    • Mono-/multilingual with data augmentation (DA)
      • EasyNMT to obtain translated training data
    • Joint training on all languages

Metrics

  • Explainability using occlusion (see the scoring sketch after this list)
    • Difference between the temperature-scaled confidence on the baseline and on the occluded instance
      • negative: "opposes"
      • positive: "supports"
      • no change: "neutral"
    • Report the F1-score for each label
  • LCI fairness
    • Change in the confidence score; report the averages of the positive and negative changes separately
      • positive: pro-dismissal
      • negative: pro-approval
    • Flip ratio: fraction of cases whose predicted label flips from the baseline when another lower court name is inserted
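A minimal sketch of the occlusion label assignment and the LCI flip ratio; the tolerance `eps` for "no change" and all names are assumptions for illustration:

```python
# Sketch of occlusion scoring and LCI flip ratio; the tolerance `eps`
# for "no change" and all names are illustrative assumptions.
from sklearn.metrics import f1_score

def occlusion_label(p_baseline: float, p_occluded: float, eps: float = 0.01) -> str:
    """Map the temperature-scaled confidence difference to a rationale label."""
    diff = p_baseline - p_occluded
    if diff > eps:
        return "supports"   # removal lowered confidence: span supported the judgment
    if diff < -eps:
        return "opposes"    # removal raised confidence: span opposed the judgment
    return "neutral"        # no meaningful change

def lci_flip_ratio(baseline_preds, perturbed_preds) -> float:
    """Fraction of LCI instances whose predicted label flips from the baseline."""
    flips = sum(b != p for b, p in zip(baseline_preds, perturbed_preds))
    return flips / len(baseline_preds)

# Per-label F1 against the expert rationales (illustrative usage):
# f1_score(gold, pred, labels=["supports", "opposes", "neutral"], average=None)
```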


Occlusion Explainability

  • Better scores for "supports" than for "neutral" and "opposes"
  • French models do better on "supports judgment" but do not do well on the other labels
  • Multilingual models and joint training improve the scores mainly for "supports", and for "neutral" in some cases
  • DA helped improve explainability for mono- and multilingual models, but it did not help in the case of joint training


LCI Bias

  • Lower court insertion changes the confidence score by around 5% in both directions
    • Such changes can flip the predicted label
  • With the DA component, the bias increased further
  • The joint training model improved prediction performance on Italian
    • but at the cost of increased bias scores for Italian, with a higher flip rate indicating representational bias in the dataset


Conclusion

  • A rationale dataset of 108 trilingual cases, annotated at a fine-grained level with supporting and opposing factors for SJP
  • A perturbation-based occlusion dataset to assess explainability
    • Lower explainability scores across models indicate that current models do not predict right for the right reasons
  • Lower court bias assessed via the LCI test
    • Inserting an average of 7 tokens has the potential to flip predictions