Wenboz committed
Commit d61a622
1 Parent(s): 0461d1d

Update README.md

Files changed (1)
  1. README.md +1 -195
README.md CHANGED
@@ -1,195 +1 @@
---
license: llama3
---

# Absolute-Rating Multi-Objective Reward Model (ArmoRM) with Mixture-of-Experts (MoE) Aggregation of Reward Objectives

+ **Authors** (* indicates equal contribution)

  [Haoxiang Wang*](https://haoxiang-wang.github.io/), [Wei Xiong*](https://weixiongust.github.io/WeiXiongUST/index.html), [Tengyang Xie](https://tengyangxie.github.io/), [Han Zhao](https://hanzhaoml.github.io/), [Tong Zhang](https://tongzhang-ml.org/)

+ **Blog**: https://rlhflow.github.io/posts/2024-05-29-multi-objective-reward-modeling/
+ **Tech Report**: https://arxiv.org/abs/2406.12845
+ **Model**: [ArmoRM-Llama3-8B-v0.1](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1)
  + Finetuned from model: [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1)
+ **Code Repository**: https://github.com/RLHFlow/RLHF-Reward-Modeling/
+ **Architecture**

<p align="center">
  <img width="800" alt="ArmoRM architecture: multi-objective reward heads aggregated by an MoE gating layer" src="https://github.com/RLHFlow/RLHFlow.github.io/blob/main/assets/ArmoRM-MoE.png?raw=true">
</p>

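To make the figure concrete, here is a minimal, illustrative sketch of the ArmoRM + MoE idea: a regression head predicts one absolute rating per reward objective, and a gating network conditioned on the prompt produces mixing weights that collapse those ratings into a single preference score. This is an expository toy (layer sizes, module names, and the omission of the reward transform matrix are assumptions), not the released implementation; see the demo code below for the actual model outputs.

```python
# Toy sketch of ArmoRM-style MoE aggregation. All dimensions and module names
# here are illustrative assumptions, not the released RLHFlow implementation.
import torch
import torch.nn as nn

class ArmoRMToyHead(nn.Module):
    def __init__(self, hidden_size: int = 4096, num_objectives: int = 19):
        super().__init__()
        # Absolute-rating regression head: one scalar reward per objective.
        self.regression = nn.Linear(hidden_size, num_objectives, bias=False)
        # Prompt-conditioned gating network producing mixing weights (the MoE gate).
        self.gating = nn.Sequential(
            nn.Linear(hidden_size, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_objectives),
            nn.Softmax(dim=-1),
        )

    def forward(self, prompt_hidden: torch.Tensor, response_hidden: torch.Tensor):
        rewards = self.regression(response_hidden)   # (batch, num_objectives)
        weights = self.gating(prompt_hidden)         # (batch, num_objectives)
        score = (weights * rewards).sum(dim=-1)      # (batch,) scalar preference score
        return rewards, weights, score

# Toy usage with random tensors standing in for the LLM backbone's hidden states.
head = ArmoRMToyHead()
prompt_h, response_h = torch.randn(2, 4096), torch.randn(2, 4096)
rewards, weights, score = head(prompt_h, response_h)
print(rewards.shape, weights.shape, score.shape)
```
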
## RewardBench Leaderboard

Bold numbers mark the best score in each column.

| Model | Base Model | Method | Score | Chat | Chat Hard | Safety | Reasoning | Prior Sets (0.5 weight) |
|:------|:-----------|:------:|:------|:-----|:----------|:-------|:----------|:------------------------|
| ArmoRM-Llama3-8B-v0.1 | Llama-3 8B | ArmoRM + MoE | **89.0** | 96.9 | **76.8** | 92.2 | 97.3 | 74.3 |
| Cohere May 2024 | Unknown | Unknown | 88.3 | 96.4 | 71.3 | **92.7** | **97.7** | **78.2** |
| [pair-preference-model](https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B) | Llama-3 8B | [SliC-HF](https://arxiv.org/abs/2305.10425) | 85.7 | 98.3 | 65.8 | 89.7 | 94.7 | 74.6 |
| GPT-4 Turbo (0125 version) | GPT-4 Turbo | LLM-as-a-Judge | 84.3 | 95.3 | 74.3 | 87.2 | 86.9 | 70.9 |
| [FsfairX-LLaMA3-RM-v0.1](https://huggingface.co/sfairXC/FsfairX-LLaMA3-RM-v0.1) | Llama-3 8B | Bradley-Terry | 83.6 | **99.4** | 65.1 | 87.8 | 86.4 | 74.9 |
| [Starling-RM-34B](https://huggingface.co/Nexusflow/Starling-RM-34B) | Yi-34B | Bradley-Terry | 81.4 | 96.9 | 57.2 | 88.2 | 88.5 | 71.4 |

## Demo Code
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda"
path = "RLHFlow/ArmoRM-Llama3-8B-v0.1"
model = AutoModelForSequenceClassification.from_pretrained(
    path, device_map=device, trust_remote_code=True, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(path, use_fast=True)

# We load a random sample from the validation set of the HelpSteer dataset
prompt = 'What are some synonyms for the word "beautiful"?'
response = "Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant"
messages = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(device)
with torch.no_grad():
    output = model(input_ids)

# Multi-objective rewards for the response
multi_obj_rewards = output.rewards.cpu().float()
# The gating layer's output is conditioned on the prompt
gating_output = output.gating_output.cpu().float()
# The preference score for the response, aggregated from the
# multi-objective rewards with the gating layer
preference_score = output.score.cpu().float()
# We apply a transformation matrix to the multi-objective rewards
# before multiplying with the gating layer's output. This mainly aims
# at reducing the verbosity bias of the original reward objectives
obj_transform = model.reward_transform_matrix.data.cpu().float()
# The final coefficients assigned to each reward objective
multi_obj_coeffs = gating_output @ obj_transform.T
# The preference score is the linear combination of the multi-objective rewards with
# the multi-objective coefficients, which can be verified by the following assertion
assert torch.isclose(torch.sum(multi_obj_rewards * multi_obj_coeffs, dim=1), preference_score, atol=1e-3)
# Find the top-K reward objectives with coefficients of the highest magnitude
K = 3
top_obj_dims = torch.argsort(torch.abs(multi_obj_coeffs), dim=1, descending=True)[:, :K]
top_obj_coeffs = torch.gather(multi_obj_coeffs, dim=1, index=top_obj_dims)

# The attributes of the 19 reward objectives
attributes = ['helpsteer-helpfulness', 'helpsteer-correctness', 'helpsteer-coherence',
              'helpsteer-complexity', 'helpsteer-verbosity', 'ultrafeedback-overall_score',
              'ultrafeedback-instruction_following', 'ultrafeedback-truthfulness',
              'ultrafeedback-honesty', 'ultrafeedback-helpfulness', 'beavertails-is_safe',
              'prometheus-score', 'argilla-overall_quality', 'argilla-judge_lm', 'code-complexity',
              'code-style', 'code-explanation', 'code-instruction-following', 'code-readability']

example_index = 0
for i in range(K):
    attribute = attributes[top_obj_dims[example_index, i].item()]
    coeff = top_obj_coeffs[example_index, i].item()
    print(f"{attribute}: {round(coeff, 5)}")
# code-complexity: 0.19922
# helpsteer-verbosity: -0.10864
# ultrafeedback-instruction_following: 0.07861

# The actual rewards of this example from the HelpSteer dataset
# are [3,3,4,2,2] for the five helpsteer objectives:
# helpfulness, correctness, coherence, complexity, verbosity
# We can linearly transform our predicted rewards to the
# original reward space to compare with the ground truth
helpsteer_rewards_pred = multi_obj_rewards[0, :5] * 5 - 0.5
print(helpsteer_rewards_pred)
# [2.78125   2.859375  3.484375  1.3847656 1.296875 ]
```

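For reference, the aggregation that the assertion above verifies can be written compactly. With prompt-conditioned gating output $g(x)$, the reward transform matrix $W$ (`model.reward_transform_matrix`), and the 19 multi-objective rewards $r$:

$$
c = g(x)\, W^{\top}, \qquad s = c \cdot r = \sum_{k=1}^{19} c_k\, r_k,
$$

where $s$ is the scalar preference score returned in `output.score` and $c$ are the per-objective coefficients (`multi_obj_coeffs`) inspected in the snippet.
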
## Easy-to-Use Pipeline

```python
from typing import Dict, List

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer


class ArmoRMPipeline:
    def __init__(self, model_id, device_map="auto", torch_dtype=torch.bfloat16,
                 truncation=True, trust_remote_code=False, max_length=4096):
        self.model = AutoModelForSequenceClassification.from_pretrained(
            model_id,
            device_map=device_map,
            trust_remote_code=trust_remote_code,
            torch_dtype=torch_dtype,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(
            model_id,
            use_fast=True,
        )
        self.truncation = truncation
        self.device = self.model.device
        self.max_length = max_length

    def __call__(self, messages: List[Dict[str, str]]) -> Dict[str, float]:
        """
        messages: OpenAI-style chat messages to be scored
        Note: no batching is used; because of length differences, batching would
        require padding every example to the longest one, which is inefficient.
        Returns: a dictionary with the preference score (between 0 and 1)
        """
        input_ids = self.tokenizer.apply_chat_template(
            messages,
            return_tensors="pt",
            padding=True,
            truncation=self.truncation,
            max_length=self.max_length,
        ).to(self.device)
        with torch.no_grad():
            output = self.model(input_ids)
        score = output.score.float().item()
        return {"score": score}


# Create the reward model pipeline
prompt = 'What are some synonyms for the word "beautiful"?'
rm = ArmoRMPipeline("RLHFlow/ArmoRM-Llama3-8B-v0.1", trust_remote_code=True)

# Score the messages
response1 = 'Nicely, Beautifully, Handsome, Stunning, Wonderful, Gorgeous, Pretty, Stunning, Elegant'
score1 = rm([{"role": "user", "content": prompt}, {"role": "assistant", "content": response1}])
print(score1)

response2 = '''Certainly! Here are some synonyms for the word "beautiful":

1. Gorgeous
2. Lovely
3. Stunning
4. Attractive
5. Pretty
6. Elegant
7. Exquisite
8. Handsome
9. Charming
10. Alluring
11. Radiant
12. Magnificent
13. Graceful
14. Enchanting
15. Dazzling

These synonyms can be used in various contexts to convey the idea of beauty.'''
score2 = rm([{"role": "user", "content": prompt}, {"role": "assistant", "content": response2}])
print(score2)

response3 = 'Sorry i cannot answer this.'
score3 = rm([{"role": "user", "content": prompt}, {"role": "assistant", "content": response3}])
print(score3)
```

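Because the pipeline scores one conversation at a time, comparing several candidate responses just means calling it in a loop. The helper below is an illustrative convenience wrapper around the `ArmoRMPipeline` instance defined above; the function name and return format are our own choices, not part of the model card.

```python
# Illustrative helper around the ArmoRMPipeline defined above; not part of the
# released code. It reuses rm, prompt, and response1/2/3 from the previous snippet.
def rank_responses(rm, prompt, responses):
    """Score each candidate response to the same prompt and sort best-first."""
    scored = []
    for response in responses:
        messages = [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
        scored.append((rm(messages)["score"], response))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

for score, response in rank_responses(rm, prompt, [response1, response2, response3]):
    print(f"{score:.4f}  {response[:50]!r}")
```
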
## Citation

If you find this work useful for your research, please consider citing:
```
@article{ArmoRM,
  title={Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts},
  author={Haoxiang Wang and Wei Xiong and Tengyang Xie and Han Zhao and Tong Zhang},
  journal={arXiv preprint arXiv:2406.12845},
  year={2024},
}

@inproceedings{wang2024arithmetic,
  title={Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards},
  author={Haoxiang Wang and Yong Lin and Wei Xiong and Rui Yang and Shizhe Diao and Shuang Qiu and Han Zhao and Tong Zhang},
  year={2024},
  booktitle={ACL},
}
```
The second entry, "[Arithmetic Control of LLMs for Diverse User Preferences: Directional Preference Alignment with Multi-Objective Rewards](https://arxiv.org/abs/2402.18571)", is another recent work of ours that trains a multi-objective reward model and applies it to LLM alignment; it motivated the development of the present work.

+ Clone repo "RLHFlow/ArmoRM-Llama3-8B-v0.1"
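
A minimal sketch of one way to perform the clone named in the added line above, assuming `huggingface_hub` is installed (`pip install huggingface_hub`); `git lfs install && git clone https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1` is an equivalent alternative.

```python
# One possible way to download the repository named above; this snippet is a
# suggestion, not an official instruction from the model card.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="RLHFlow/ArmoRM-Llama3-8B-v0.1")
print(f"Repository files are in: {local_dir}")
```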