Ray2333 committed on
Commit
df5f163
1 Parent(s): 8b72edc

Update README.md

Files changed (1)
  1. README.md +12 -1
README.md CHANGED
@@ -14,6 +14,7 @@ The introduced text generation regularization markedly improves the accuracy of
 
 This reward model is finetuned from [llama3_8b_instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) using the [hendrydong/preference_700K](https://huggingface.co/datasets/hendrydong/preference_700K) dataset.
 
+A distilled BT model using the features of this GRM can be found at [Ray2333/GRM-llama3-8B-distill](https://huggingface.co/Ray2333/GRM-llama3-8B-distill).
 
 ## Evaluation
 We evaluate GRM on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench), which improves the **SOTA 8B Bradley–Terry model**'s average score from 84.7 to 87.0.
@@ -22,6 +23,7 @@ We evaluate GRM on the [reward model benchmark](https://huggingface.co/spaces/al
 | Model | Average | Chat | Chat Hard | Safety | Reasoning |
 |:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|
 | **Ray2333/GRM-llama3-8B-sftreg** (Ours, 8B) | 87.0 | 98.6 | 67.8 | 89.4 | 92.3 |
+| [**Ray2333/GRM-llama3-8B-distill**](https://huggingface.co/Ray2333/GRM-llama3-8B-distill) (Ours, 8B) | 86.1 | 98.3 | 68.4 | 86.1 | 91.3 |
 | openai/gpt-4-0125-preview | 85.9 | 95.3 | 74.3 | 87.2 | 86.9 |
 | sfairXC/FsfairX-LLaMA3-RM-v0.1 (8B) | 84.7 | 99.4 | 65.1 | 87.8 | 86.4 |
@@ -55,6 +57,15 @@ with torch.no_grad():
 ```
 
-## To be added ...
+## Citation
+If you find this model helpful for your research, please cite GRM:
+```
+@article{yang2024regularizing,
+  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
+  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
+  journal={arXiv preprint arXiv:2406.10216},
+  year={2024}
+}
+```
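For context on the usage hunk above, which ends inside a scoring snippet at `with torch.no_grad():`, here is a minimal, hypothetical sketch of how a reward model such as Ray2333/GRM-llama3-8B-sftreg is typically queried with `transformers`. The `AutoModelForSequenceClassification` loading path, the `trust_remote_code=True` flag, and the example messages are illustrative assumptions, not taken from this commit.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical usage sketch: assumes the model loads through the standard
# sequence-classification API and returns a single scalar reward logit.
model_name = "Ray2333/GRM-llama3-8B-sftreg"
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    trust_remote_code=True,  # assumption: the GRM head may ship as custom code
).to(device)

# Score one prompt/response pair; a higher scalar means "more preferred".
messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, return_tensors="pt"
).to(device)

with torch.no_grad():
    reward = model(input_ids).logits[0].item()
print(f"reward: {reward:.4f}")
```

To compare two candidate responses to the same prompt, score each one this way and prefer the higher reward; that pairwise use is what the Bradley–Terry baselines in the table above model directly.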