LCN Blogs

The rule of law – evaluating fairness in legal text processing

Cassandra Zhou


Reading time: four minutes

— A. V. Dicey, 1885

The rule of law

The definition of the rule of law seems pretty straightforward: the law treats every citizen in the same way. In a perfect world, everything would revolve around this idea of “fairness”.

The law might be absolute, but people are the ones who apply it in real life – and people tend to be biased. According to Angwin et al. (2016), Black defendants in the US are almost twice as likely as white defendants to be “mislabelled” as being at high risk of reoffending.

In China, Wang et al. reported severe fairness gaps across gender last year when examining recent criminal judgement patterns. We might not notice such biases in our daily lives, but they are detrimental to the long-term development of our society and to the upholding of the rule of law.

FairLex is a multilingual benchmark recently developed by Chalkidis et al., in a collaboration between Danish and Chinese scholars, to address such problems.

Their fairness benchmarks cover:

  • four jurisdictions: the EU, the US, Switzerland and China;
  • five languages: English, German, French, Italian and Chinese; and
  • five fairness attributes: gender, age, region, language and legal area.

The models evaluated on the FairLex benchmarks are hierarchical BERT-based models similar to those used in the authors' previous work on European Court of Human Rights (ECtHR) cases – they are essentially classifiers. They serve slightly different purposes depending on the jurisdiction.

The EU

Given the facts of a case from the ECtHR public database of 11,000 cases, the model predicts which Article(s) of the European Convention on Human Rights (ECHR) were violated.

The US

The US model learns from historical cases in the Supreme Court Database. Given the court's opinion, the model predicts the legal area into which the subject matter falls.


Switzerland

The Swiss model learns from more than 85,000 decisions recorded in the Swiss-Judgment-Prediction dataset, each written in German, French or Italian. Given the facts of a case, it predicts the outcome as either “approval” or “dismissal”. We usually call classifiers with only two possible outcomes “binary classifiers”.


China

The Chinese model learns from the Chinese AI and Law challenge dataset, which comprises over one million criminal cases. Given a fresh set of facts, the model must predict the relevant criminal articles violated, the criminal charge, the imprisonment term and any monetary penalty.

The results reveal significant fairness gaps in multiple attributes across all jurisdictions.

Here are some examples in terms of the accuracy of prediction by the models:

  • For the EU – the outcomes predicted for male plaintiffs are a whopping 54% more accurate than those predicted for female plaintiffs;
  • For the US – the outcomes predicted when the defendants are private individuals are 21% more accurate than when the defendants are organisations;
  • For Switzerland – the outcomes predicted when the cases are recorded in French are 30% more accurate than those recorded in Italian; and
  • For China – the prediction accuracy is 92% when the defendants are male and 8% when they are female.

A computational model relies on cold, hard algorithms unburdened by pre-loaded personal biases, so its performance should be roughly the same regardless of the attributes of the defendant, the plaintiff or the material facts, at least within the same jurisdiction.

For example, if the prediction accuracy when the plaintiff is male is 80%, then the model should achieve a similar accuracy when the plaintiff is female.
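
To make that comparison concrete, here is a minimal Python sketch of how accuracy could be measured separately for each group and then compared. It is only an illustration: the function names and the toy labels are made up for this post and are not taken from the FairLex code or data.

```python
from collections import defaultdict


def group_accuracy(y_true, y_pred, groups):
    """Return {group: accuracy}, with predictions split by an attribute such as gender."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}


def accuracy_gap(per_group):
    """Absolute difference between the best-served and worst-served groups."""
    return max(per_group.values()) - min(per_group.values())


# Toy data, invented for illustration: five cases per group, 1 = violation found.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0, 0, 0]
groups = ["male"] * 5 + ["female"] * 5

per_group = group_accuracy(y_true, y_pred, groups)
gap = accuracy_gap(per_group)
print(per_group, round(gap, 2))
```

With these toy numbers the model gets four out of five cases right for one group and three out of five for the other, so the gap is 20 percentage points. A fair model would keep that gap close to zero.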

Yet astonishing disparities still exist. In their defence, the authors of FairLex have experimented with quite a few “group robust” algorithms that are supposed to be insensitive to demographic attributes. Alongside standard ERM (Vapnik, 1992) as the baseline, the algorithms include Group DRO (Sagawa et al., 2020) and V-REx (Krueger et al., 2020), among others.

The group-robust algorithms were all developed relatively recently, but the authors of FairLex could not identify any one of them that performs better than all the others. It looks like a good starting point to improve upon in future studies!
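
As a rough illustration of how these training objectives differ, the Python sketch below combines per-group losses in three ways. It is a simplified sketch rather than the FairLex implementation: the function names, the toy loss values and the `penalty` weight are assumptions. Group DRO is reduced to its worst-group objective (the published method optimises it with adaptive group weights), and V-REx is shown as the sum of group risks plus a penalty on their variance, using the population variance as an assumed choice.

```python
from statistics import mean, pvariance


def erm_objective(group_losses):
    """ERM: average loss over all examples, ignoring group membership."""
    all_losses = [loss for losses in group_losses.values() for loss in losses]
    return mean(all_losses)


def group_dro_objective(group_losses):
    """Group DRO (simplified): the average loss of the worst-off group."""
    return max(mean(losses) for losses in group_losses.values())


def vrex_objective(group_losses, penalty=1.0):
    """V-REx (simplified): sum of group risks plus a penalty on their variance."""
    risks = [mean(losses) for losses in group_losses.values()]
    return sum(risks) + penalty * pvariance(risks)


# Toy per-example losses, invented for illustration: group B is smaller and harder.
group_losses = {"A": [0.2, 0.4, 0.3, 0.3], "B": [0.9, 0.7]}

erm = erm_objective(group_losses)
dro = group_dro_objective(group_losses)
vrex = vrex_objective(group_losses, penalty=1.0)
print(round(erm, 4), round(dro, 4), round(vrex, 4))
```

In this toy setting ERM is dominated by the larger, easier group, whereas the worst-group objective looks only at group B and the variance penalty grows as the two group risks drift apart.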

Another potential point for improvement is how we label our datasets. For example, the authors have simplified gender representation by dividing people into males and females. This classification is, of course, no longer applicable in real life!