---
license: cc-by-nc-4.0
language:
- en
datasets:
- google/trueteacher
- anli
- cnn_dailymail
tags:
- natural-language-inference
- news-articles-summarization
---

# **TrueTeacher**

This is a **Factual Consistency Evaluation** model, introduced in the [TrueTeacher paper (Gekhman et al., 2023)](https://aclanthology.org/2023.emnlp-main.127.pdf).

## Model Details

The model is optimized for evaluating factual consistency in **summarization**.

It is the main model from the paper (see "T5-11B w. ANLI + TrueTeacher full" in Table 1), based on **T5-11B** [(Raffel et al., 2020)](https://jmlr.org/papers/volume21/20-074/20-074.pdf) fine-tuned on a mixture of the following datasets:
- [TrueTeacher](https://huggingface.co/datasets/google/trueteacher) ([Gekhman et al., 2023](https://arxiv.org/pdf/2305.11171.pdf))
- [ANLI](https://huggingface.co/datasets/anli) ([Nie et al., 2020](https://aclanthology.org/2020.acl-main.441.pdf))

The TrueTeacher dataset contains model-generated summaries of articles from the train split of the **CNN/DailyMail** dataset [(Hermann et al., 2015)](https://proceedings.neurips.cc/paper_files/paper/2015/file/afdec7005cc9f14302cd0474fd0f3c96-Paper.pdf) 
which are annotated for factual consistency using **FLAN-PaLM 540B** [(Chung et al., 2022)](https://arxiv.org/pdf/2210.11416.pdf).
The summaries were generated by summarization models trained on the **XSum** dataset [(Narayan et al., 2018)](https://aclanthology.org/D18-1206.pdf).

The input format for the model is: "premise: GROUNDING_DOCUMENT hypothesis: HYPOTHESIS_SUMMARY".
To accommodate the input length of common summarization datasets, we recommend setting **max_length** to **2048**.

The model predicts a binary label ('1' - Factually Consistent, '0' - Factually Inconsistent).
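
For illustration, a minimal sketch of preparing a single input (the `document` and `summary` strings here are placeholders, not from the paper):

```python
from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained('google/t5_11b_trueteacher_and_anli')

document = 'A long news article ...'  # placeholder grounding document
summary = 'A candidate summary ...'   # placeholder summary to evaluate

# Combine into the expected "premise: ... hypothesis: ..." format; inputs
# longer than max_length are truncated from the end of the sequence.
input_ids = tokenizer(
    f'premise: {document} hypothesis: {summary}',
    return_tensors='pt',
    truncation=True,
    max_length=2048).input_ids
```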

## Evaluation results

This model achieves the following ROC AUC results on the summarization subset of the [TRUE benchmark (Honovich et al., 2022)](https://arxiv.org/pdf/2204.04991.pdf):

| **MNBM** | **QAGS-X** | **FRANK** | **SummEval** | **QAGS-C** | **Average** |
|----------|-----------|-----------|--------------|-----------|-------------|
| 78.1     | 89.4      | 93.6      | 88.5         | 89.4      | 87.8        |
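
As a quick illustration of the metric (not part of the original card): given model scores like those produced in the scoring example below and binary human labels, ROC AUC can be computed with scikit-learn:

```python
from sklearn.metrics import roc_auc_score

# Toy values for illustration only: 1 = consistent, 0 = inconsistent.
labels = [1, 0, 1, 1, 0]                 # human annotations
scores = [0.93, 0.11, 0.87, 0.64, 0.22]  # model probabilities for '1'
print(roc_auc_score(labels, scores))     # 1.0 here: the scores rank perfectly
```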


## Intended Use

This model is intended for research use (**non-commercial**) in English.

The recommended use case is evaluating factual consistency in summarization.

## Out-of-scope use

Any use cases which violate the **cc-by-nc-4.0** license.

Usage in languages other than English.

## Usage examples

#### Classification
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '1'),
                             ('the cat is shiny', '0')]:
  # Format the input as "premise: ... hypothesis: ..." and truncate to 2048 tokens.
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
  # The model generates '1' (consistent) or '0' (inconsistent).
  outputs = model.generate(input_ids)
  result = tokenizer.decode(outputs[0], skip_special_tokens=True)
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'result: {result} (expected: {expected})\n')
```
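
Note that T5-11B requires substantial memory. One way to reduce the footprint (a sketch assuming a recent `transformers` and the `accelerate` package; not part of the original card) is to load the weights in half precision and let `accelerate` place them across available devices:

```python
import torch
from transformers import T5ForConditionalGeneration

model = T5ForConditionalGeneration.from_pretrained(
    'google/t5_11b_trueteacher_and_anli',
    torch_dtype=torch.bfloat16,  # half-precision weights
    device_map='auto')           # requires the accelerate package
```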

#### Scoring
```python
from transformers import T5ForConditionalGeneration
from transformers import T5Tokenizer
import torch

model_path = 'google/t5_11b_trueteacher_and_anli'
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

premise = 'the sun is shining'
for hypothesis, expected in [('the sun is out in the sky', '>> 0.5'),
                             ('the cat is shiny', '<< 0.5')]:
  input_ids = tokenizer(
      f'premise: {premise} hypothesis: {hypothesis}',
      return_tensors='pt',
      truncation=True,
      max_length=2048).input_ids
  # Run a single decoder step, starting from the pad token, to obtain the
  # logits of the first generated token.
  decoder_input_ids = torch.tensor([[tokenizer.pad_token_id]])
  outputs = model(input_ids=input_ids, decoder_input_ids=decoder_input_ids)
  logits = outputs.logits
  probs = torch.softmax(logits[0], dim=-1)
  # The probability assigned to the token '1' is the consistency score.
  one_token_id = tokenizer('1').input_ids[0]
  entailment_prob = probs[0, one_token_id].item()
  print(f'premise: {premise}')
  print(f'hypothesis: {hypothesis}')
  print(f'score: {entailment_prob:.3f} (expected: {expected})\n')
```
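
To score many pairs at once, the loop above can be batched; below is a sketch reusing the `model` and `tokenizer` from the scoring example (`score_pairs` is a hypothetical helper name, and the batching details are our assumption, not from the paper):

```python
def score_pairs(premises, hypotheses, batch_size=8):
  """Returns P('1') for each (premise, hypothesis) pair."""
  one_token_id = tokenizer('1').input_ids[0]
  scores = []
  for i in range(0, len(premises), batch_size):
    texts = [f'premise: {p} hypothesis: {h}'
             for p, h in zip(premises[i:i + batch_size],
                             hypotheses[i:i + batch_size])]
    inputs = tokenizer(texts, return_tensors='pt', padding=True,
                       truncation=True, max_length=2048)
    # One decoder step per example, starting from the pad token.
    decoder_input_ids = torch.full(
        (inputs.input_ids.shape[0], 1), tokenizer.pad_token_id)
    with torch.no_grad():
      logits = model(input_ids=inputs.input_ids,
                     attention_mask=inputs.attention_mask,
                     decoder_input_ids=decoder_input_ids).logits
    probs = torch.softmax(logits[:, 0, :], dim=-1)
    scores.extend(probs[:, one_token_id].tolist())
  return scores

print(score_pairs(['the sun is shining'] * 2,
                  ['the sun is out in the sky', 'the cat is shiny']))
```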

## Citation

If you use this model for a research publication, please cite the TrueTeacher paper (using the BibTeX entry below), as well as the ANLI, CNN/DailyMail, XSum, T5 and FLAN papers mentioned above.

```bibtex
@misc{gekhman2023trueteacher,
      title={TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models}, 
      author={Zorik Gekhman and Jonathan Herzig and Roee Aharoni and Chen Elkind and Idan Szpektor},
      year={2023},
      eprint={2305.11171},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```