Looks like it's not as good as Qwen2.5 7B

#5
by MonolithFoundation

Many metrics fall behind.

I wish there was an official online hub to test Ministral. Even Mistral's free trial doesn't include it as an option. It's performing very badly for me on my PC and in the available HF space, but that could be due to a configuration issue.

Anyway, I have strong doubts about Qwen2.5 7b being better just because of the metrics. I extensively tested Qwen2.5 7b, and 72b as well, both offline (quantized) and full float online (e.g. on LMsys), and they made the same errors everywhere. Q2.5 really is very bad outside of coding and math.

By far the number one AI complaint from the general population is hallucinations, especially when it comes to very popular common knowledge, and Qwen2.5 trained on so many coding and math tokens that it significantly scrambled its previously learned information. For example, Qwen2.5 72b scored 68.4 on my world knowledge test, vs 85.9 for Qwen2 72b.

So Qwen's obsession with the metrics is making the metrics meaningless, and I understand why other companies wouldn't compare against them. Qwen isn't playing fair. As mentioned above, they're suppressing the majority of world knowledge in order to train on more data that overlaps the STEM, coding, and math tests.

Additionally, since the major LLM tests are multiple choice, they're not very useful for evaluating LLMs. That is, multiple choice tests provide the answer to choose from, jogging the memory of an LLM that could otherwise not accurately retrieve the right answer. Sure enough, Qwen2.5 scored higher on the MMLU than Qwen2, but when asked to retrieve the information in real-world Q&A (no right answer to choose from) it succeeded less frequently than Qwen2. So basically Qwen2.5 under-trained on a large amount of data, capturing it well enough to pick the right answer out of a lineup (multiple choice MMLU tests), but not well enough to retrieve it directly.

In short, Qwen2.5 7b is REALLY REALLY bad as a general purpose LLM, while being unusually good at coding and math for its size. So some people (mainly coders) think it's great, when it's really so bad it's unusable: little more than a hallucination generator for very basic world knowledge.

@phil111 What is the best small model in your opinion? In the 7b-9b range, for general use of course.

@lixbo This will be unoriginal, but it's Llama 3.1 8b and Gemma 2 9b.

But I use Llama 3.1 8b by default, primarily because it has less censorship, including less information stripped from the corpus with censorship in mind. L3.1 also has broader abilities, such as writing better poems. However, Gemma 2 9b can write slightly better stories and is overall a little more powerful/intelligent.

In my testing Llama 3.1 8b scored 69.7 vs 69.1 for Gemma 2 (if not for censorship, G2 would have scored slightly higher than L3.1). For context, L3.1 70b scored 88.5, Qwen2 7b 54.1, InternLM2.5 7b 45.9, and so on. To be fair, I don't test math and coding since the majority of tests cover those things, and that's where models like Qwen shine.

What's generally good about L3.1 & G2 is instruction following. For example, if you make a very custom story request (e.g. numerous story inclusions and exclusions), models like Phi3.5, InternLM, and Qwen lean more towards regurgitating stories they were trained on vs writing a more original story that aligns with your prompted desires. Consequently, to the casual observer they may appear more eloquent, and this may make them score higher on emotional intelligence benchmarks, but it's really just cheating (story regurgitation). I should also add that Phi is so censored and filled with blind spots, even the larger versions, that it's utterly useless as a general purpose LLM, hence inferior to Qwen2 7b. And Qwen2.5 7b lost a lot of the world knowledge it previously had, so it's far too ignorant of most things to be useful beyond more academic tasks like STEM, coding, and math.

@phil111 Thank you very much, I should keep L3.1 then. Also, have you tried any finetunes or merges of it?

@lixbo Yes, I tried numerous fine-tunes, primarily in an attempt to get around censorship (e.g. Dolphin & abliterated), but they all perform much worse at things like instruction following, broad abilities, and knowledge retrieval. Plus jailbreaks, system prompts, and clever prompting (e.g. "I'm doing research") have always gotten around any unreasonable censorship. I don't ask about illegal things like how to make meth.

I'm not sure whether Qwen2.5 was trained on data overlapping the public benchmarks or not, but in the multi-modal field, where an LLM may need to understand images or audio, Qwen2.5 7B is currently the best model of all. Llama 3.1 8b is far behind in performance.

In my experience, Qwen2.5 7B could still be the best model under 9b in the world.

@MonolithFoundation Don't get me wrong, Qwen2.5 7b has a lot of power. I didn't mean to imply they cheated on the tests (e.g. trained on contaminated data), which they didn't.

My point was only that they took the world's most popular knowledge and discarded the bulk of it because it was non-academic (e.g. pop culture), and also spent far fewer compute hours training on it, in order to train more on academic tokens. As a result they boosted test scores like the MMLU, and the model is better at select things like math and coding.

However, they shouldn't be applauded for doing this. Anyone, including Meta and Google, could have done the same in order to boost the academic test scores at the expense of general world knowledge and abilities. They chose not to, and rightfully so, because the result would no longer be a general purpose instruct/chat LLM that normal people would use.

Qwen2 7b scored extremely low on my world knowledge test, and Qwen2.5 7b scored even lower still. It's so profoundly ignorant of very popular things that it's effectively useless to the general population outside the circle jerk of early-adopter coding nerds.

@phil111 Could you please share your test data and setup (like the temperature, top_p, how many times you ran each question, etc.)?

@jlzhou When testing I always use GGUF Q4_K_M at temp 0 (top-p 1, min-p 1, top-k 0, repeat penalty 1.08 over the last 150 tokens). This consistently produces the highest test scores. And the system prompt is a generic "You are a helpful assistant."

I sometimes re-run the test with my preferred temp 0.3 (top-p 1, min-p 0.6, top-k 0). I found this provides the maximum amount of variability without notably increasing hallucinations (a <2% drop in the final test scores compared to temp 0).
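For reference, here's a minimal sketch of roughly those sampler settings using llama-cpp-python. The specific runner, model filename, and context size are just illustrative assumptions, not my exact harness; any llama.cpp frontend that exposes the same sampler options should behave similarly.

```python
# Minimal sketch, assuming llama-cpp-python as the runner and a local Q4_K_M GGUF file.
# min_p support requires a reasonably recent llama-cpp-python release.
from llama_cpp import Llama

llm = Llama(
    model_path="qwen2.5-7b-instruct.Q4_K_M.gguf",  # hypothetical local filename
    n_ctx=4096,                                    # assumed context size
    last_n_tokens_size=150,                        # repeat-penalty window of 150 tokens
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who portrayed Frasier's ex-wife and mother of his son in the sitcom Frasier?"},
    ],
    temperature=0.0,      # temp 0 run; for the variability run use temperature=0.3 and min_p=0.6
    top_p=1.0,
    min_p=1.0,
    top_k=0,              # 0 disables top-k in llama.cpp
    repeat_penalty=1.08,
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```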

I also frequently run problematic prompts with the original full float AI model (e.g. on LMsys) to help rule out configuration or llama.cpp compatibility issues.

Anyways, as for the test itself, I prefer to keep most of the details private. But it's basically everything that can't be tested, or isn't tested well, by automated LLM evaluators.

A few examples of what can't be tested: (1) "The following limerick doesn’t rhyme. Re-write the limerick so it adheres to the standard AABBA rhyming scheme of a limerick..."; (2) "Make a simple list of 9 single word synonyms for extra." (weaker LLMs at temp 0 will start repeating words, include phrases, switch to antonyms...); and (3) "Repeat the following paragraph word for word while fixing any spelling and grammar errors you come across, then list all corrections you made when done...".

And an example of what can't be tested well is academic/STEM knowledge via the MMLU. Multiple choice knowledge tests don't make much sense: in real-world use cases people almost never provide the answer when trying to retrieve information. As a consequence, very small (or large but under-trained) LLMs, despite having comparable or higher MMLU scores than SOTA models, are reliably far worse at returning the desired information unless the correct answer is available to jog their memories (multiple choice format).

Example: All the large quality models, including GPT4o, Sonnet 3.5, Gemini, and even Mixtral 8x7b, return "Thorne–Żytkow object (TZO)" in response to one of my prompts, but the small or under-trained LLMs almost never get it right, returning something completely unrelated like hypernova. But when the answer is shown to them (multiple choice format) they can usually get it right. And even when they come close they commonly make errors (e.g. Thorne-Zytkowski star in the case of Qwen2.5 72b).

Anyways, the large bulk of the test is just questions about the world's most popular information (known to countless millions of people), including movies, music, shows, games, celebrities & sports. Primarily because this is almost completely overlooked by the standardized tests (e.g. the MMLU), yet is absolutely essential for a general purpose LLM to serve the >95% of the world's population who are more likely to chat about popular topics of interest, ask questions about popular things (who acted in this movie, who sang this song...), or write fan fiction... than to write code or solve math/logic problems.

Most of these pop culture questions come in three varieties (list, retrieve, and trick).

Example List: What are the 6 main characters, and the actors who portrayed them, on the TV show Friends? Don't add details, just list them. And what year did the show first air?

Example Retrieve: Who portrayed Frasier’s ex-wife and mother of his son in the sitcom Frasier?

Example Trick: Tom Cruise was married twice. Which two women did he marry? (He was actually married three times.)

That's basically it. It's a simple test. But it clearly shows how some open source LLMs, in their attempt to match the scores of large proprietary models with fewer parameters, are turning into empty shells, such as Qwen2.5 72b, which scored 68.4 on my test vs 85.9 for Qwen2 72b. Not only do they hallucinate like crazy when trying to recall the world's most popular knowledge, which proprietary models with comparable MMLU scores answer correctly nearly 100% of the time, they are also far more likely to be unable to accurately retrieve the information covered by the MMLU unless it's presented in the multiple choice format.
