1 of 20

Relation Extraction

Aeirya Mohammadi

2 of 20

Relation

A predicate over entities e1, …, en

RDF: (subject, property, object) triples

Relational graphs:

  • Knowledge graphs
  • Semantic graphs
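
A minimal Python sketch tying these together: relations as (subject, property, object) triples, read as labeled edges of a toy knowledge graph (the entities are illustrative):

  # Relations as (subject, property, object) triples
  triples = [
      ("Marie Curie", "born_in", "Warsaw"),
      ("Warsaw", "capital_of", "Poland"),
  ]

  # A knowledge graph: the same triples viewed as labeled edges
  graph = {}
  for subj, prop, obj in triples:
      graph.setdefault(subj, []).append((prop, obj))

  print(graph["Marie Curie"])  # [('born_in', 'Warsaw')]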

3 of 20

Methods

Traditional: hand-crafted features (POS tags, regexes)

Graph-based: graphs built from the features above

Neural: CNN, GCN, RNN (BiLSTM), RSN

Deep: Transformers (T5, BERT)

And now, LLMs

4 of 20

Transformers

The most prominent approach

BERT can be used for:

  • Better input encodings for other methods
  • Classification of relations and entity types

T5 for seq2seq formulations of RE, as in REBEL (state of the art)
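
A hedged sketch of the seq2seq formulation, assuming the Babelscape/rebel-large checkpoint from the REBEL paper (the exact linearization format of the output may vary by checkpoint):

  from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

  # REBEL frames RE as seq2seq: raw text in, linearized triplets out
  tokenizer = AutoTokenizer.from_pretrained("Babelscape/rebel-large")
  model = AutoModelForSeq2SeqLM.from_pretrained("Babelscape/rebel-large")

  inputs = tokenizer("Marie Curie was born in Warsaw.", return_tensors="pt")
  outputs = model.generate(**inputs, max_length=128, num_beams=3)
  # The decoded string marks triplets with special tokens such as <triplet>, <subj>, <obj>
  print(tokenizer.decode(outputs[0], skip_special_tokens=False))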

5 of 20

LLMs

LLMs can do relation extraction out of the box.

Fine-tuning takes a lot of compute resources

Two other interesting options: instruction tuning and in-context learning
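
A minimal in-context learning sketch: a few-shot prompt that can be sent to any instruction-following LLM (the example sentences and relations are illustrative, not taken from any dataset):

  # A few-shot prompt for in-context relation extraction
  prompt_template = """Extract (subject, relation, object) triples from the sentence.

  Sentence: Marie Curie was born in Warsaw.
  Triples: (Marie Curie, place of birth, Warsaw)

  Sentence: Tehran is the capital of Iran.
  Triples: (Tehran, capital of, Iran)

  Sentence: {sentence}
  Triples:"""

  prompt = prompt_template.format(sentence="The Eiffel Tower is located in Paris.")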

6 of 20

PiVE*

Iteratively improves the LLM's output

Uses a verifier, a T5 model fine-tuned on RE datasets.

Online and offline modes
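
A rough sketch of the loop in Python; llm() and verifier() are placeholder callables standing in for the LLM and the fine-tuned T5 verifier:

  def pive(text, llm, verifier, max_iters=3):
      """Iteratively refine LLM-extracted triples with a verifier model."""
      prompt = f"Extract (subject, relation, object) triples:\n{text}"
      triples = llm(prompt)
      for _ in range(max_iters):
          # The verifier proposes a missing or corrected triple, or None if satisfied
          correction = verifier(text, triples)
          if correction is None:
              break
          # Feed the suggestion back into the prompt and regenerate
          prompt += f"\nAlso make sure to include: {correction}"
          triples = llm(prompt)
      return triples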

* Prompting with Iterative Verification Improving Graph-based Generative Capability of LLMs

7 of 20

Datasets

SMiLER (multilingual, includes Farsi)

REBEL

DocRED, REDFM, …

GenWiki, WebNLG, CoNLL04, NYT

Re-TACRED (Relation classification)

T-REx: built with an old entity-linking tool

8 of 20

Available Persian datasets

PARLEX (available on the FarsBase website)

And that’s it!

Could not find links for RePersian, …

9 of 20

PARLEX

The first Persian dataset for relation extraction.

Bilingual dataset (a direct translation of SemEval-2010 Task 8)

But it contains only sentence-level examples.

Size: 4 MB

Available at: FarsBase

10 of 20

SMiLER

By Samsung

Cons:

  • Very sparse data for Farsi
  • Mediocre F1 when both entities and the relation label must be predicted correctly

(F1 comparison on the next slide)

11 of 20

F1 (figure)

12 of 20

Making a Persian dataset

1. Automatic extraction using ready-made tools like cRocoDiLe

2. Implement RePersian

3. Using GPT-4 prompts to generate data (see the sketch after this list)

4. Translating existing datasets
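
A hedged sketch of option 3 with the OpenAI Python client (the prompt wording and output schema are my own illustration):

  from openai import OpenAI

  client = OpenAI()  # expects OPENAI_API_KEY in the environment

  prompt = (
      "Write one Persian sentence and the (subject, relation, object) "
      "triples it expresses. Answer as JSON with keys 'sentence' and 'triples'."
  )
  response = client.chat.completions.create(
      model="gpt-4",
      messages=[{"role": "user", "content": prompt}],
  )
  print(response.choices[0].message.content)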

13 of 20

Translation

We can leverage existing translation models.

ParsBERT is better than pretrained mT5 for text summarization.

For the translation task, we need to check whether t5-fa competes with mBART.
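
A minimal sketch of English-to-Farsi translation with mBART-50, assuming the facebook/mbart-large-50-many-to-many-mmt checkpoint:

  from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

  name = "facebook/mbart-large-50-many-to-many-mmt"
  model = MBartForConditionalGeneration.from_pretrained(name)
  tokenizer = MBart50TokenizerFast.from_pretrained(name)

  tokenizer.src_lang = "en_XX"  # source language: English
  inputs = tokenizer("Marie Curie was born in Warsaw.", return_tensors="pt")
  # Force the decoder to start generating in Farsi
  generated = model.generate(**inputs,
                             forced_bos_token_id=tokenizer.lang_code_to_id["fa_IR"])
  print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])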

14 of 20

Distant Supervision

The distant supervision paradigm is described as follows:

"If two entities participate in a relation, any sentence that contains those two entities might express that relation."

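A toy sketch of that assumption turned into a labeling rule (the KB and corpus below are placeholders):

  # Toy KB of (subject, relation, object) triples
  kb = [("Marie Curie", "born_in", "Warsaw"),
        ("Tehran", "capital_of", "Iran")]

  corpus = ["Marie Curie was born in Warsaw in 1867.",
            "Tehran, the capital of Iran, is a large city."]

  # Distant labeling: a sentence containing both entities gets the relation
  silver = [(sentence, relation)
            for sentence in corpus
            for subject, relation, obj in kb
            if subject in sentence and obj in sentence]
  print(silver)  # noisy "silver" labels, not gold
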
15 of 20

Datasets: Gold and Silver

Larger datasets are distantly supervised; their validation and test splits are then verified either (1) manually or (2) automatically.

16 of 20

A Distant Supervised Approach for Relation Extraction in Farsi Texts

17 of 20

NER

How do we find entities? Wikipedia (URL) → DBpedia (see the sketch below)

Knowledge bases: FarsBase (a Persian Freebase), Wikidata
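
A minimal sketch of the Wikipedia-to-DBpedia step: DBpedia resource URIs mirror English Wikipedia article titles, so an entity's linked article URL maps directly to a KB node:

  # Map a Wikipedia article URL to the corresponding DBpedia resource URI
  def wikipedia_to_dbpedia(url: str) -> str:
      title = url.rsplit("/", 1)[-1]  # e.g. "Marie_Curie"
      return f"http://dbpedia.org/resource/{title}"

  print(wikipedia_to_dbpedia("https://en.wikipedia.org/wiki/Marie_Curie"))
  # http://dbpedia.org/resource/Marie_Curie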

18 of 20

Shortcomings of RE methods:

Coreference handling, missing multiple or nested relations, computational cost, the need for lots of data

19 of 20

My Project Proposal: PiVE for Persian

  1. Verifier: google/flan-t5-small
  2. LLM: mistralai/Mistral-7B-Instruct-v0.2
    1. Running the LLM on Colab and calling it through an API
  3. Dataset:
    • PARLEX for sentence-level FT
    • GPT4 generated data
      1. Auto-verifier NLI module: xlm-roberta-large-xnli (see the sketch after this list)
      2. Manual annotation of the wrong ones
    • Gather more data with the help of:
      • Knowledge graph
      • Verifier (trained gradually)
      • Wikipedia, Wikidata, DBpedia
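
A hedged sketch of the NLI auto-verifier, assuming the joeddav/xlm-roberta-large-xnli checkpoint on Hugging Face (the triple verbalization and the exact label names are assumptions to check against the model config):

  from transformers import pipeline

  # NLI as an auto-verifier: does the sentence entail the verbalized triple?
  nli = pipeline("text-classification", model="joeddav/xlm-roberta-large-xnli")

  sentence = "ماری کوری در ورشو به دنیا آمد."  # "Marie Curie was born in Warsaw."
  hypothesis = "محل تولد ماری کوری ورشو است."  # naive verbalization of the triple

  result = nli({"text": sentence, "text_pair": hypothesis})
  keep_triple = result["label"] == "entailment"  # keep only entailed triples
  print(result, keep_triple)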

20 of 20

Version 0

  • Fine-tune Flan-T5 on 20 GPT-4-generated examples (see the sketch after this list)
  • Datasets:
    • PARLEX
    • 20 translated sentences from HIQ
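
A minimal fine-tuning sketch for Version 0; the dataset contents and training arguments below are placeholders:

  from datasets import Dataset
  from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                            DataCollatorForSeq2Seq, Seq2SeqTrainer,
                            Seq2SeqTrainingArguments)

  tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
  model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

  # Stand-in for the ~20 GPT-4-generated / translated examples
  pairs = [{"text": "Extract triples: Marie Curie was born in Warsaw.",
            "triples": "(Marie Curie, place of birth, Warsaw)"}]

  def preprocess(batch):
      enc = tokenizer(batch["text"], truncation=True)
      enc["labels"] = tokenizer(text_target=batch["triples"], truncation=True)["input_ids"]
      return enc

  ds = Dataset.from_list(pairs).map(preprocess, batched=True,
                                    remove_columns=["text", "triples"])

  trainer = Seq2SeqTrainer(
      model=model,
      args=Seq2SeqTrainingArguments(output_dir="v0", num_train_epochs=5),
      train_dataset=ds,
      data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
  )
  trainer.train()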