Update README.md

This reward model is finetuned from [llama3_8b_instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) using the [hendrydong/preference_700K](https://huggingface.co/datasets/hendrydong/preference_700K) dataset.

A distilled Bradley–Terry (BT) model using the features of this GRM can be found at [Ray2333/GRM-llama3-8B-distill](https://huggingface.co/Ray2333/GRM-llama3-8B-distill).

## Evaluation

We evaluate GRM on the [reward model benchmark](https://huggingface.co/spaces/allenai/reward-bench), where it improves the **SOTA 8B Bradley–Terry model**'s average score from 84.7 to 87.0.

| Model | Average | Chat | Chat Hard | Safety | Reasoning |
|:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|
| **Ray2333/GRM-llama3-8B-sftreg** (Ours, 8B) | 87.0 | 98.6 | 67.8 | 89.4 | 92.3 |
| [**Ray2333/GRM-llama3-8B-distill**](https://huggingface.co/Ray2333/GRM-llama3-8B-distill) (Ours, 8B) | 86.1 | 98.3 | 68.4 | 86.1 | 91.3 |
| openai/gpt-4-0125-preview | 85.9 | 95.3 | 74.3 | 87.2 | 86.9 |
| sfairXC/FsfairX-LLaMA3-RM-v0.1 (8B) | 84.7 | 99.4 | 65.1 | 87.8 | 86.4 |
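
## Usage

The card's usage snippet is only partially visible in this diff (just the tail of its `with torch.no_grad():` block), so the sketch below shows one plausible way to score a prompt–response pair with this reward model. It assumes the checkpoint can be loaded through the standard `transformers` sequence-classification head with a single reward logit; if the model card's own snippet differs, prefer that.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumption: the reward head is exposed as a single-logit classification head.
model_id = "Ray2333/GRM-llama3-8B-sftreg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, num_labels=1, torch_dtype=torch.float16, device_map="auto"
)

# Format a single-turn conversation with the Llama-3 chat template.
messages = [
    {"role": "user", "content": "Explain why the sky is blue."},
    {"role": "assistant", "content": "Sunlight scatters off air molecules, and blue light scatters the most."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # The scalar logit serves as the reward score for the assistant response.
    reward = model(**inputs).logits[0, 0].item()
print(f"reward: {reward:.3f}")
```

Since [Ray2333/GRM-llama3-8B-distill](https://huggingface.co/Ray2333/GRM-llama3-8B-distill) is a plain BT reward model distilled from these features, the same scoring loop should work with its model id swapped in.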

## Citation

If you find this model helpful for your research, please cite GRM:

```
@article{yang2024regularizing,
  title={Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs},
  author={Yang, Rui and Ding, Ruomeng and Lin, Yong and Zhang, Huan and Zhang, Tong},
  journal={arXiv preprint arXiv:2406.10216},
  year={2024}
}
```