Update README.md
README.md
CHANGED
@@ -6,11 +6,11 @@ pipeline_tag: text-classification
 ---
 
 # Introduction
-The Generalizable Reward Model (GRM) aims to enhance the generalization ability of reward models for LLMs
+The Generalizable Reward Model (GRM) aims to enhance the generalization ability of reward models for LLMs through regularizing the hidden states.
 
 Paper: [Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs](https://arxiv.org/abs/2406.10216).
 
-The introduced regularization
+The introduced text generation regularization markedly improves the accuracy of learned reward models across a variety of out-of-distribution tasks and effectively alleviates the over-optimization issue in RLHF (even with corrupted preference data), offering a more reliable and robust preference learning paradigm.
 
 This reward model is finetuned from [llama3_8b_instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) using the [hendrydong/preference_700K](https://huggingface.co/datasets/hendrydong/preference_700K) dataset.
 
@@ -22,7 +22,7 @@ We evaluate GRM on the [reward model benchmark](https://huggingface.co/spaces/al
 | Model | Average | Chat | Chat Hard | Safety | Reasoning |
 |:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|
 | **Ray2333/GRM-llama3-8B-sftreg** (Ours, 8B) | 87.0 | 98.6 | 67.8 | 89.4 | 92.3 |
-| openai/gpt-4-0125-preview
+| openai/gpt-4-0125-preview | 85.9 | 95.3 | 74.3 | 87.2 | 86.9 |
 | sfairXC/FsfairX-LLaMA3-RM-v0.1 (8B) | 84.7 | 99.4 | 65.1 | 87.8 | 86.4 |
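The preference dataset referenced above pairs a chosen and a rejected response per prompt, and reward models trained on such data typically optimize a Bradley-Terry objective over the scalar reward margin. A minimal sketch of the resulting preference probability (illustrative helper, not code from this repository):

```python
import math

def bradley_terry_prob(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the chosen response is preferred, given scalar rewards.

    Standard Bradley-Terry formulation: sigmoid of the reward margin.
    (Illustrative helper; not part of the GRM codebase.)
    """
    return 1.0 / (1.0 + math.exp(reward_rejected - reward_chosen))

# Equal rewards -> indifference between the two responses
print(bradley_terry_prob(1.0, 1.0))  # 0.5
# A larger reward margin -> higher confidence the chosen response wins
print(bradley_terry_prob(3.0, 1.0))
```

During reward-model training, the negative log of this probability over the preference pairs is minimized, which pushes the model to assign higher scalar rewards to chosen responses than to rejected ones.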