Ray2333 committed
Commit: fcde6fc (1 parent: 686fb74)

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -6,11 +6,11 @@ pipeline_tag: text-classification
 ---
 
 # Introduction
-The Generalizable Reward Model (GRM) aims to enhance the generalization ability of reward models for LLMs via regularizing the hidden states.
+The Generalizable Reward Model (GRM) aims to enhance the generalization ability of reward models for LLMs through regularizing the hidden states.
 
 Paper: [Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs](https://arxiv.org/abs/2406.10216).
 
-The introduced regularization technique markedly improves the accuracy of learned reward models across a variety of out-of-distribution tasks and effectively alleviate the over-optimization issue in RLHF, offering a more reliable and robust preference learning paradigm.
+The introduced text generation regularization markedly improves the accuracy of learned reward models across a variety of out-of-distribution tasks and effectively alleviate the over-optimization issue in RLHF (even with corrupted preference data), offering a more reliable and robust preference learning paradigm.
 
 This reward model is finetuned from [llama3_8b_instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) using the [hendrydong/preference_700K](https://huggingface.co/datasets/hendrydong/preference_700K) dataset.
 
@@ -22,7 +22,7 @@ We evaluate GRM on the [reward model benchmark](https://huggingface.co/spaces/al
 | Model | Average | Chat | Chat Hard | Safety | Reasoning |
 |:-------------------------:|:-------------:|:---------:|:---------:|:--------:|:-----------:|
 | **Ray2333/GRM-llama3-8B-sftreg**(Ours, 8B) | 87.0 | 98.6 | 67.8 | 89.4 |92.3 |
-| openai/gpt-4-0125-preview (8B) | 85.9 | 95.3 | 74.3 | 87.2 | 86.9 |
+| openai/gpt-4-0125-preview | 85.9 | 95.3 | 74.3 | 87.2 | 86.9 |
 | sfairXC/FsfairX-LLaMA3-RM-v0.1 (8B) | 84.7 | 99.4 | 65.1 | 87.8 | 86.4 |
 
 
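
The introduction added in this commit refers to "text generation regularization" of the hidden states. As a rough illustration only (the loss names, the `alpha` weight, and the function signature below are assumptions for this sketch, not code from this repository or the paper's release), the "sftreg" idea can be pictured as a Bradley-Terry preference loss plus an SFT-style language-modeling loss computed through the LM head on the chosen response:

```python
import torch.nn.functional as F

def grm_sft_reg_loss(reward_chosen, reward_rejected,
                     lm_logits_chosen, chosen_labels, alpha=0.01):
    """Conceptual GRM-style objective (sketch only; see the paper/repo
    for the actual training code and loss weighting)."""
    # Standard pairwise Bradley-Terry preference loss on scalar rewards.
    preference_loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()

    # SFT text-generation regularizer: next-token cross-entropy through the
    # LM head on the chosen response, which regularizes the shared hidden states.
    shift_logits = lm_logits_chosen[:, :-1, :]
    shift_labels = chosen_labels[:, 1:]
    sft_loss = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,  # mask prompt/padding tokens
    )

    # alpha is a placeholder trade-off weight, not the paper's setting.
    return preference_loss + alpha * sft_loss
```

In this sketch, the regularizer keeps the shared hidden states useful for language modeling rather than letting them specialize purely to reward prediction, which is the intuition behind the out-of-distribution robustness claimed in the updated introduction.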
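Since the card's metadata declares `pipeline_tag: text-classification`, a minimal scoring sketch follows. It assumes the checkpoint loads with a standard `AutoModelForSequenceClassification` head emitting a single reward logit; the sftreg architecture may require the loading code given in the model card, so treat this as illustrative only.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "Ray2333/GRM-llama3-8B-sftreg"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Score one (prompt, response) pair using the Llama-3 chat template.
messages = [
    {"role": "user", "content": "How do I bake bread?"},
    {"role": "assistant", "content": "Mix flour, water, yeast and salt, knead, proof, then bake at 230°C."},
]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    # Assumes a single output label; higher score = preferred response.
    reward = model(input_ids).logits[0].item()
print(f"reward: {reward:.3f}")
```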