1 of 9

Automated Fact Checking Based on Czech

Wikipedia

Tomáš Mlynář

Bachelor’s thesis presentation

Supervisor: Ing. Herbert Ullrich

2 of 9

Assignment

Combine previous results from AIC to build a showcase application.

  • Explore SOTA methods in NLI and document retrieval.
  • Acquire Czech data for Wikipedia-based fact-checking.
  • Train NLI and document retrieval models.
  • Integrate into initial version of the fact-checking pipeline.
  • Evaluate models and pipeline.
  • Create a prototype showcase application.


3 of 9

Dataset

  • FEVER dataset
  • Wikipedia dump
  • Translation
  • Mapping of articles
  • Unfold evidence sets
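The last step can be sketched in plain Python. This is only an illustration of the idea, "unfolding" a claim annotated with several alternative evidence sets into one flat example per set; the field names (`claim`, `label`, `evidence_sets`) are hypothetical, not the actual dataset schema:

```python
def unfold_evidence_sets(annotated_claims):
    """Turn each claim with several alternative evidence sets into
    one flat (claim, label, evidence) example per evidence set.
    Field names here are illustrative, not the real FEVER schema."""
    examples = []
    for item in annotated_claims:
        for evidence in item["evidence_sets"]:
            examples.append({
                "claim": item["claim"],
                "label": item["label"],
                "evidence": evidence,
            })
    return examples
```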


4 of 9

Filtering

  • Predict new labels and scores
  • Temperature scaling
  • Optimize thresholds: one for F1, one for precision, and a fixed 0.7
  • Filter
  • New annotations for 1 % of the data points
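The two central steps above, temperature scaling of the classifier's scores and the threshold search, can be sketched in plain Python. This is a minimal illustration of the technique, not the thesis code:

```python
import math

def softmax_with_temperature(logits, T):
    """Temperature scaling: divide logits by T before the softmax.
    T > 1 flattens the distribution (less overconfident scores)."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def best_f1_threshold(scores, labels, candidates):
    """Sweep candidate thresholds and keep the one maximizing F1.
    `labels` are booleans: True = the data point should be kept."""
    best_t, best_f1 = None, -1.0
    for t in candidates:
        preds = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
```

The same sweep with precision instead of F1 yields the precision-optimized threshold; the third cutoff is simply fixed.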


5 of 9

Document Retrieval

  • Sparse: Anserini (BM25), the traditional baseline
  • Hybrid: Anserini + cross-encoder
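The sparse baseline's scoring function can be sketched as a toy BM25 in plain Python (using Anserini's default parameters k1 = 0.9, b = 0.4). The real pipeline queries Anserini's Lucene index; the hybrid variant then rescores the BM25 candidates with a cross-encoder:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=0.9, b=0.4):
    """Toy BM25 over tokenized documents; k1/b match Anserini's defaults.
    Returns one score per document for the given query terms."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    df = Counter(t for d in docs for t in set(d))  # document frequencies
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf * tf[t] * (k1 + 1) / norm
        scores.append(s)
    return scores
```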


6 of 9

Natural Language Inference

  • XLM-RoBERTa Large fine-tuned on SQuAD2 by deepset
  • models available on Hugging Face
  • F1 macro scores compared across models (rows) and datasets (columns)
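The reported metric can be stated compactly: macro F1 averages the per-class F1 over the verdict labels, so each class counts equally regardless of its frequency. A minimal sketch (not the evaluation code used in the thesis):

```python
def f1_macro(y_true, y_pred, labels):
    """Macro-averaged F1 over the given labels
    (e.g. SUPPORTS / REFUTES / NOT ENOUGH INFO)."""
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```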


7 of 9

Prototype Showcase Application


8 of 9

Conclusion

Thesis Accomplishments:

  • Explored state-of-the-art methods for NLI and document retrieval.
  • Preprocessed Wikipedia dump from Hugging Face.
  • Localized FEVER claims using the implemented translator and a translation model.
  • Created new Czech dataset available on Hugging Face.
  • Applied noise filtering at three optimized thresholds (with limited success).
  • Finetuned and evaluated NLI models on the new datasets (available on Hugging Face).
  • Implemented and evaluated Anserini and Hybrid document retrievers.
  • Integrated NLI and document retrieval models into a pipeline.
  • Developed a prototype showcase application.


9 of 9

Thank you for your attention!
