Title: Disconcerting Perception Surrounding Invisible "Merge" Due to Open LLM Leaderboard Defaults

#629
by gate369 - opened

The current setup of Hugging Face's Open LLM Leaderboard, wherein the "Merge/moerge" option is hidden by default upon loading, creates a subtle yet potentially misleading association for its users. Despite my models consistently attaining top positions, hiding merges by default, bundled alongside the options for flagged and erased models, may foster an unwarranted negative perception of model merges.

In an ideal scenario, maintaining transparency and equal emphasis on each option would reinforce a neutral understanding of the "Merge" feature. As it stands, however, the default selection attaches an unintended stigma to a practice that should neither be seen as sinister nor tainted by association with negative actions. I think if mixture-of-experts models are shown automatically, merge/moerge should be as well, or MoE should be hidden by default too. I hate to complain, but I've achieved number 1 a bunch of times and it's kind of been overshadowed by my models being hidden by default for having the "merge" tag in the model's README.

gate369 changed discussion title from I negative implications "Merge/moerge " option being hidden by default to Title: Disconcerting Perception Surrounding Invisible "Merge" Due to Open LLM Leaderboard Defaults
deleted

@222limin I'm neither staff, nor even that knowledgeable about how LLMs work, so take what I'm about to say with a grain of salt.

However, as an LLM hobbyist and tester I can assure you that the test scores of mergers are not remotely accurate. For example, the Mistral 7b Einstein v4 only scores 66.7, but performs comparably to mergers scoring around 76. And the much larger Mixtral 8x7b and Nous-Hermes2 Yi-34b LLMs both score lower than 76 despite being clearly superior overall to any Mistral 7b.

The issue seems to be that when you merge LLMs there's inadvertent additive test contamination, hence the scores keep going up sans commensurate performance gains.

I agree that it's unfortunate that this stigmatizes mergers, most of which were made with good intentions, and my go-to LLM is actually a merger (Trinity v1), but the primary point of the leaderboard is to objectively evaluate performance, which is no longer the case after merging.

Again, the 100s of Mistral 7b mergers currently on the leaderboard scoring higher than Mixtral and Yi-34b are nowhere near matching their performance.

So none of these scores have been accurate? How is it tricking the evaluation into giving it a higher score? I know you aren't staff, but this is interesting. Huge curveball.

deleted

@222limin I don't know. Like I said, I'm not an expert. However, you can search past discussions about this, and participate in the discussion at the top of this page about contamination (mostly about mergers), in the link below.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/472

An example of an old discussion is the following. It's about starting a separate leaderboard for mergers.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/473

What appears to be happening (a theory held by some, including myself) is that there's a little test contamination in every LLM, so merging them adds this contamination together, artificially boosting the test scores of mergers above those of the individual models they were merged from.

@Phil337 so I've noticed contamination when just merging models without fine-tuning. However, what about the method Maxime Labonne proposed of continuously merging and then fine-tuning the same model? Does the problem still persist even when fine-tuning is included in the process?
thanks for the links imma check them out.

deleted

@gate369 That's over my head.

I started to read A Beginner’s Guide to LLM Fine-Tuning by Maxime Labonne (until it asked me to create an account to read further). I don't see how merging the same model while repeatedly fine-tuning with the same data set could possibly raise or artificially inflate scores.

@Phil337 love and hate Medium haha. I found this: https://medium.com/towards-data-science/merge-large-language-models-with-mergekit-2118fb392b54 . I can't find the article I was referencing; he's written so many about merging in general lol.

ANYWAYS, after going thru an existential crisis over this and doing a ton of research, it seems like the contamination comes from merging already merged models. Too many ingredients in the soup. That being said, it seems like people need to be more responsible about which models they merge in the future. It also seems like the Open LLM Leaderboard's tests should be kept private... though I digress. I appreciate you enlightening me to all this, man.

deleted

@gate369 Glad I could help, and thanks for the link. I skimmed it and bookmarked it for later. It's a good reference because it covers all the varying techniques with simple explanations, including how Solar 10.7 was made.

Also, private testing is one of the things being explored. See the discussion below. It's something about crowdsourced, rotating private questions to evaluate LLMs.

https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/481

Open LLM Leaderboard org

Hi!
There are several issues which led us to hide merges, but they all come down to the same root problem: it's almost impossible to get a clear view of the model lineage.

When you fine-tune a llama model, for example, you know what is in your fine-tuning data, and you are using a base pretrained model. You have "two" data-related operations, in a sense: the initial pretraining and the later fine-tuning. However, for merges, it's very hard to know precisely what they contain (data-wise), as we now have merges of merges of merges of deleted models etc - so 1) it's hard to reconstruct their history (I often see models that merge several models fine-tuned on almost the same dataset) and 2) they are very easy to accidentally contaminate since it's hard to know the data contents.

This led to the observation that merged models, after a couple of iterations (merge/finetune/merge/dpo/...), often have academic results that do not translate well to actual use. That's why they are hidden by default.
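To make the lineage problem a bit more concrete, here is a minimal sketch in Python. It is only an illustration, not how the leaderboard works: the `LINEAGE` map, the model names, and both helper functions are hypothetical, and it assumes the parent records contain no cycles. It walks a model's parents to show how quickly a "merge of merges" piles up ancestors, and how a single parent with no record (a deleted model, for example) leaves part of the data history unknown.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lineage map: each model lists the models it was merged or
# fine-tuned from. Base models have no parents. All names are made up.
LINEAGE: Dict[str, List[str]] = {
    "base-7b": [],
    "ft-a": ["base-7b"],
    "ft-b": ["base-7b"],
    "merge-1": ["ft-a", "ft-b"],
    "merge-2": ["merge-1", "ft-deleted"],  # "ft-deleted" has no record
    "merge-3": ["merge-2", "merge-1"],
}


def trace_lineage(model: str, lineage: Dict[str, List[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (known_ancestors, unknown_ancestors) of `model`."""
    known: Set[str] = set()
    unknown: Set[str] = set()
    stack = list(lineage.get(model, []))
    while stack:
        current = stack.pop()
        if current in known or current in unknown:
            continue  # already visited
        if current not in lineage:
            unknown.add(current)  # no record: its data contents are unknowable
            continue
        known.add(current)
        stack.extend(lineage[current])
    return known, unknown


def merge_depth(model: str, lineage: Dict[str, List[str]]) -> int:
    """Length of the longest parent chain below `model` (unknown models count as 0)."""
    parents = lineage.get(model, [])
    if not parents:
        return 0
    return 1 + max(merge_depth(parent, lineage) for parent in parents)


if __name__ == "__main__":
    known, unknown = trace_lineage("merge-3", LINEAGE)
    print("known ancestors:", sorted(known))
    print("unknown ancestors:", sorted(unknown))
    print("depth:", merge_depth("merge-3", LINEAGE))
```

In this toy example, one missing record is enough to make the data history of every later merge unverifiable, which is the situation described above.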

A number of other solutions are being explored, from private or rotating test sets to more complex evaluations.

clefourrier changed discussion status to closed
