Update README.md

This reward model is finetuned from [llama3_8b_instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) using the [hendrydong/preference_700K](https://huggingface.co/datasets/hendrydong/preference_700K) dataset.

A distilled Bradley–Terry (BT) model using the features of this GRM can be found at [Ray2333/GRM-llama3-8B-distill](https://huggingface.co/Ray2333/GRM-llama3-8B-distill).

## Evaluation

We evaluate GRM on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench), where it improves the **SOTA 8B Bradley–Terry model**'s average score from 84.7 to 87.0.

| Model | Average | Chat | Chat Hard | Safety | Reasoning |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|
| **Ray2333/GRM-llama3-8B-sftreg** (Ours, 8B) | 87.0 | 98.6 | 67.8 | 89.4 | 92.3 |
| [**Ray2333/GRM-llama3-8B-distill**](https://huggingface.co/Ray2333/GRM-llama3-8B-distill) (Ours, 8B) | 86.1 | 98.3 | 68.4 | 86.1 | 91.3 |
| openai/gpt-4-0125-preview | 85.9 | 95.3 | 74.3 | 87.2 | 86.9 |
| sfairXC/FsfairX-LLaMA3-RM-v0.1 (8B) | 84.7 | 99.4 | 65.1 | 87.8 | 86.4 |
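
## Usage

The card's usage snippet is only partially visible in this diff (just the tail of its `with torch.no_grad():` block), so the sketch below shows one plausible way to score a prompt–response pair with this reward model. It assumes the checkpoint can be loaded through the standard `transformers` sequence-classification head with a single reward logit; if the model card's own snippet differs, prefer that.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: the reward head is exposed as a single-logit classification head.
model_id = "Ray2333/GRM-llama3-8B-sftreg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.float16, device_map="auto"
)

# Format a single-turn conversation with the Llama-3 chat template.
messages = [
    {"role": "user", "content": "Explain why the sky is blue."},
    {"role": "assistant", "content": "Sunlight scatters off air molecules, and blue light scatters the most."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # The scalar logit serves as the reward score for the assistant response.
    reward = model(**inputs).logits[0, 0].item()
print(f"reward: {reward:.3f}")
```

Since [Ray2333/GRM-llama3-8B-distill](https://huggingface.co/Ray2333/GRM-llama3-8B-distill) is a plain BT reward model distilled from these features, the same scoring loop should work with its model id swapped in.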

## Citation

If you find this model helpful for your research, please cite GRM:

```
@article{yang2024regularizing,
  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
  journal={arXiv preprint arXiv:2406.10216},
  year={2024}
}
```