Collaboration?

#10
by dnhkng - opened

We've been waiting for this since we heard the news! I'll do some RLHF and some merging.

These models sometimes change, and old pipelines may not work as well, so let's evaluate them first and then do the rest by upscaling.

Yeah, it will take roughly 4 days to run RYS on Qwen2.5. Did you also try to fine-tune RYS-Large-Base (the model based on Qwen2)?

That might be a better process than the previous Qwen2 -> Calme -> Calme-RYS -> Calme-RYS-2.4.

Any chance of getting Qwen2.5 32B and 14B versions? It would be amazing! How much GPU power is needed to tune those models? I have 4x3090 = 96 GB of VRAM, but I think that's too little for this tuning task.

I can try. The RYS models show improvement on larger models, from what I think is an emergent phenomenon. I doubt the 14B model will show improvement, but there's a chance the 32B will!

@dnhkng I can't speak much about the exact technique, but is there another example where an RYS model ended up scoring higher than the original model? (besides the one that is based on the Calme model)

For small models, I see no improvement in benchmarks. I have tested most of the models, and the effects are interesting. Although I don't modify the weights, only the layer configuration, I do see massive changes in 'personality'. For example, sometimes the models get silly, tell jokes, and start laughing (and never stop). Other variants start using more and more 'flowery' language and a huge vocabulary of rare words.

For the Llama3-8B models, the RYS model's 'security' breaks with simple logic. If you ask for information on a forbidden topic, you will get a refusal. But if you point out that the information is freely available on the internet, the model will often agree that the restriction is silly and provide the information! (Again, no weights are modified!!!)

As far as benchmarks go, I saw improvements on Miqu (https://eqbench.com/ - miiqu-f16), then I focused on Qwen2. I assume the models are not 'optimal' as there is 'scarring' from the layer transformation. I'm working on fixing that in the next release. This might also 'fix' the issue with smaller models not showing improved benchmarks.
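
To make the "layer configuration, not weights" idea concrete, here is a minimal sketch of what a layer-replication self-merge can look like with Hugging Face transformers. This is only an illustration of the general technique, not the actual RYS recipe; the model name and layer ranges are placeholders.

```python
# Minimal sketch of a layer-replication self-merge (depth upscaling).
# Placeholder model and layer ranges; NOT the actual RYS layout.
import copy

import torch
from torch import nn
from transformers import AutoModelForCausalLM

model_id = "Qwen/Qwen2-72B-Instruct"  # any Llama/Qwen-style decoder model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Keep layers 0-39, repeat layers 20-39, then continue from 40 to the end.
# No weights are changed; decoder blocks are only duplicated and re-ordered.
slices = [(0, 40), (20, 40), (40, model.config.num_hidden_layers)]

new_layers = nn.ModuleList(
    copy.deepcopy(model.model.layers[i])
    for start, end in slices
    for i in range(start, end)
)

# Re-index the attention blocks so the KV cache lines up with the new depth.
for new_idx, layer in enumerate(new_layers):
    layer.self_attn.layer_idx = new_idx

model.model.layers = new_layers
model.config.num_hidden_layers = len(new_layers)

model.save_pretrained("qwen2-72b-self-merge")
```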

Although I don't modify the weights, only the layer configuration, I do see massive changes in 'personality'.

This happens when the model becomes unhinged due to the layer manipulation. We frequently clone and replicate layers, and the resulting behavior can be mistaken for an improvement in creative writing. However, the truth is that nobody would use it for storytelling, as it often goes off the rails.

As far as benchmarks go, I saw improvements on Miqu (https://eqbench.com/ - miiqu-f16), then I focused on Qwen2. I assume the models are not 'optimal' as there is 'scarring' from the layer transformation. I'm working on fixing that in the next release.

I have followed your experiments on Llama-3 70B, Llama-3.1 70B, and Qwen2 70B. As far as I can see on the Leaderboard, upscaling these models failed to improve the overall quality compared to the original. This is expected when replicating layers without further training.

I appreciate the focus on studying layers, but if success is only achieved once, we may need to consider post-training to refine the model. (I have locally reproduced RYS Large via mergekit, and that method only works with the Calme fine-tune; the other models showed a significant drop in quality.)
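
For reference, a passthrough self-merge of this kind can be reproduced with a small mergekit config; the sketch below (driven from Python for convenience) uses placeholder layer ranges, not the actual RYS-Large slices, which are not spelled out in this thread.

```python
# Hedged sketch of reproducing a passthrough self-merge with mergekit
# (https://github.com/arcee-ai/mergekit). The slice layout is a placeholder.
import subprocess
from pathlib import Path

config = """\
slices:
  - sources:
      - model: Qwen/Qwen2-72B-Instruct
        layer_range: [0, 40]
  - sources:
      - model: Qwen/Qwen2-72B-Instruct
        layer_range: [20, 40]
  - sources:
      - model: Qwen/Qwen2-72B-Instruct
        layer_range: [40, 80]
merge_method: passthrough
dtype: bfloat16
"""

Path("self_merge.yml").write_text(config)

# mergekit-yaml is the CLI entry point installed with the mergekit package.
subprocess.run(
    ["mergekit-yaml", "self_merge.yml", "./qwen2-72b-self-merge"],
    check=True,
)
```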

Finally, RYS-Large-Base is on the Leaderboard. The eval failed over and over; I had to resubmit it three times! (I submitted it well over a month ago.)

It is an un-fine-tuned self-merge of Qwen2-72B. It shows improvement on several benchmarks over the base model. Of note, it scores slightly higher than the Calme version of Qwen2 that RYS-Large is based on. I think this is evidence that the methods are truly orthogonal, i.e. both methods improve on Qwen2 independently, and even more so when combined.

RYS-Qwen2.5-72B and RYS-Qwen2.5-32B should be uploaded by the end of the week.

PS. Can you get Malay-style roti pratha in France?
