Clémentine Fourrier

clefourrier

http://clefourrier.github.io

AI & ML interests

None yet

Articles

Introducing the Open FinLLM Leaderboard

19 days ago

• 55

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Jun 18

• 36

Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5000B tokens and 11 languages

May 24

• 24

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

May 24

• 21

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Apr 19

• 106

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Apr 16

• 14

Introducing the Chatbot Guardrails Arena

Mar 21

• 4

Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes?

Mar 5

• 4

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Feb 27

• 32

Introducing the Red-Teaming Resistance Leaderboard

Feb 23

• 12

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

Feb 20

• 3

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Feb 2

• 2

Introducing the Enterprise Scenarios Leaderboard: a Leaderboard for Real World Use Cases

Jan 31

• 3

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

Jan 29

• 14

A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard

Jan 12

• 6

2023, year of open LLMs

Dec 18, 2023

• 5

Open LLM Leaderboard: DROP deep dive

Dec 1, 2023

• 3

Overview of natively supported quantization schemes in 🤗 Transformers

Sep 12, 2023

• 10

What's going on with the Open LLM Leaderboard?

Jun 23, 2023

• 18

Introduction to Graph Machine Learning

Jan 3, 2023

• 15

Organizations

clefourrier's activity

posted an update 6 months ago

Post

5021

In basic chatbots, errors are annoyances. In medical LLMs, errors can have life-threatening consequences 🩸

It's therefore vital to benchmark/follow advances in medical LLMs before even thinking about deployment.

This is why a small research team introduced a medical LLM leaderboard, to get reproducible and comparable results between LLMs, and allow everyone to follow advances in the field.

openlifescienceai/open_medical_llm_leaderboard

Congrats to @aaditya and @pminervini !
Learn more in the blog: https://huggingface.co/blog/leaderboard-medicalllm

posted an update 6 months ago

Post

3998

Contamination free code evaluations with LiveCodeBench! 🖥️

LiveCodeBench is a new leaderboard, which contains:
- complete code evaluations (on code generation, self repair, code execution, tests)
- my favorite feature: problem selection by publication date 📅

This feature means that you can get model scores averaged only on new problems that are outside the training data. This means... contamination-free code evals! 🚀
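As a rough illustration of date-based problem selection, here is a minimal Python sketch. It is not LiveCodeBench's code: the `results` records and the `score_after` helper are hypothetical, and the sketch assumes each problem carries a publication date and a pass/fail outcome.

```python
from datetime import date

# Hypothetical per-problem results: (publication date, did the model solve it?).
results = [
    (date(2023, 5, 1), True),
    (date(2023, 9, 15), True),
    (date(2024, 1, 10), False),
    (date(2024, 2, 20), True),
]

def score_after(results, cutoff):
    """Average pass rate over problems published strictly after `cutoff`."""
    recent = [passed for published, passed in results if published > cutoff]
    if not recent:
        return None  # no problem is newer than the cutoff
    return sum(recent) / len(recent)

# Score on every problem vs. only problems newer than an assumed training cutoff.
overall = score_after(results, date.min)
fresh = score_after(results, date(2023, 12, 31))
```

Setting the cutoff to a model's (assumed) training-data date keeps only problems the model cannot have seen during training.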

Check it out!

Blog: https://huggingface.co/blog/leaderboard-livecodebench
Leaderboard: livecodebench/leaderboard

Congrats to @StringChaos @minimario @xu3kev @kingh0730 and @FanjiaYan for the super cool leaderboard!

posted an update 6 months ago

Post

2197

🆕 Evaluate your RL agents - who's best at Atari?🏆

The new RL leaderboard evaluates agents in 87 possible environments (from Atari 🎮 to motion control simulations🚶and more)!

When you submit your model, it's run and evaluated in real time - and the leaderboard displays small videos of the best model's run, which is super fun to watch! ✨

Kudos to @qgallouedec for creating and maintaining the leaderboard!
Let's find out which agent is the best at games! 🚀

open-rl-leaderboard/leaderboard

posted an update 7 months ago

Post

2207

Fun fact about evaluation, part 2!

How much do scores change depending on prompt format choice?

Using different prompts (all present in the literature, from "Prompt question?" to "Question: prompt question?\nChoices: enumeration of all choices\nAnswer: "), we get a score range of...

10 points for a single model!
Keep in mind that we only changed the prompt, not the evaluation subsets, etc.
Again, this confirms that evaluation results reported without their details are basically bullshit.

Prompt format on the x axis, all these evals look at the logprob of either "choice A/choice B..." or "A/B...".

Incidentally, it also changes model rankings - so a "best" model might only be best on one type of prompt...
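To make the idea of "prompt format" concrete, here is a small Python sketch of three formats of increasing detail. The `build_prompt` function and its style names are hypothetical; the exact templates used in the experiment are not given in the post.

```python
def build_prompt(question, choices, style):
    """Render one multiple-choice question in one of three illustrative formats."""
    if style == "bare":
        # Just the question text, nothing else.
        return question
    if style == "question":
        return f"Question: {question}\nAnswer:"
    if style == "question_choices":
        # Enumerate the choices with letters before asking for the answer.
        listed = "\n".join(
            f"{letter}. {choice}" for letter, choice in zip("ABCD", choices)
        )
        return f"Question: {question}\nChoices:\n{listed}\nAnswer:"
    raise ValueError(f"unknown style: {style}")

example = build_prompt("2+2?", ["3", "4"], "question_choices")
```

The question and choices are identical across styles; only the surrounding text changes, which is the single variable the post describes varying.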

replied to their post 7 months ago

Ha shoot, missed it! I restarted the space

replied to ggbetz's post 7 months ago

Can you send me an email (to [email protected]) so I can add you to our slack? Will make it easier to collab :)

replied to ggbetz's post 7 months ago

Congratulations, this is very cool!
Would you want to collaborate on a blog post featured on HF to introduce your leaderboard to the community?

replied to their post 7 months ago

The most likely explanation is that some examples are better/worse for specific models. There were some interesting discussions on twitter if you're curious :)

replied to their post 7 months ago

I have no idea what you are talking about by "resetting the context window". Every question is asked independently of the previous ones, and has its own context and context window.

None of these evaluations are usually done manually (we use lm_eval for the Open LLM Leaderboard for ex), but people rarely report their precise evaluation setup; papers often contain things like "we evaluate MMLU in 5-shot and it gives us this amazing SOTA result", without providing the precise evaluation script, the prompt, etc.

As we can see in this post, even an extremely minor change to an evaluation prompt (providing the same fixed few-shot samples in one order or another) can have a sizable impact on the results. This is why people really should only report results together with their evaluation scripts.

posted an update 7 months ago

Post

2344

Fun fact about evaluation!

Did you know that, if you evaluate the same model, with the same prompt formatting & the same fixed few-shot examples, only changing
♻️the order in which the few shot examples are added to the prompt ♻️
you get a difference of up to 3 points in evaluation score?

I did a small experiment using some MMLU subsets on the best performing 7B and lower pretrained models from the leaderboard.

I tried 8 different prompting methods (containing more or less information, such as just the question, or Question: question, or Question: question Choices: ..., see the x axis) that are commonly used in evaluation.

I then compared the results for all these methods, in 5-shot, during 2 runs. The *only difference* between the first and second run being that the samples used in few-shot are not introduced in the same order.
For example, run one would have been "A B C D E Current sample", vs, in run 2, "D C E A B Current sample".
All the other experiment parameters stayed exactly the same.

As you can see in the attached picture, you get a difference of up to 3 points between the 2 orderings of the few-shot samples.

So, when just changing *the order of the few shot samples* can change your results by several points, what is the impact of all other "minimal" and unreported prompting changes?

-> Any kind of model score, provided without an evaluation script for reproducibility, is basically bullshit (or coms).
-> This is why we need reproducible evaluation in a fair and exactly similar setup, using evaluation suites such as lm_eval from the Harness, lighteval from HF, or the Open LLM Leaderboard.
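The two-run setup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: `few_shot_prompt`, the placeholder samples, and the "Question:/Answer:" layout are hypothetical, not the actual evaluation script.

```python
def few_shot_prompt(shots, current_question, order):
    """Build a prompt from the same fixed shots, arranged in the given order."""
    blocks = [f"Question: {shots[i][0]}\nAnswer: {shots[i][1]}" for i in order]
    blocks.append(f"Question: {current_question}\nAnswer:")
    return "\n\n".join(blocks)

# Five fixed few-shot samples, named A to E as in the post (placeholder content).
shots = [(f"question {name}", f"answer {name}") for name in "ABCDE"]

# Run 1 uses "A B C D E", run 2 uses "D C E A B"; nothing else changes.
run_1 = few_shot_prompt(shots, "current sample", [0, 1, 2, 3, 4])
run_2 = few_shot_prompt(shots, "current sample", [3, 2, 4, 0, 1])
```

Both prompts contain exactly the same samples and end with the same question, so any score difference between the runs comes from ordering alone.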

4 replies

posted an update 7 months ago

Post

2009

Are you looking for the perfect leaderboard/arena for your use case? 👀

There's a new tool for this!
https://huggingface.co/spaces/leaderboards/LeaderboardFinder

Select your modality, language, task... then search! 🔍
Some categories of interest:
- does the leaderboard accept submissions?
- is the test set private or public?
- is it using an automatic metric, human evaluators, or llm as a judge?

The spaces list is built from space metadata, and reloaded every hour.

Enjoy!

replied to giux78's post 7 months ago

Shouldn't be linked to the tier if it's just parsing a file - I'm unsure, maybe @pngwn would know?

replied to giux78's post 7 months ago

We have so many applicants that the one thing that is important (for me as a hiring manager) is the motivation letter (for the first stage at least) - do you understand what the internship is about? have you used (in this case) the open llm leaderboard, or another leaderboard? do you actually have evaluation experience? do you know about the OSS ML ecosystem? (having OSS experience is obviously a bonus)

Meh motivation letters fell into 2 categories:
- "here is my complete life story about how I started into ML from a comment my grandma told me when I was 8" (too long, hard for me to see which parts of your experience are relevant, not matching our posting so hard to see if you are a fit)
- a nice rephrase of the keywords in the offer (super specific to the offer but not backed up by any evidence or experience).

replied to giux78's post 7 months ago

We're reaching the offer part of the process atm ^^

replied to giux78's post 7 months ago

Congratulations, it's a very cool leaderboard!
It could make your leaderboard more findable to add leaderboard as a tag in your readme :)

posted an update 7 months ago

Post

1524

How talkative is your chatbot about your internal data? 😬

As more chatbots get deployed in production, with access to internal databases, we need to make sure they don't leak private information to anyone interacting with them.

The Lighthouz AI team therefore introduced the Chatbot Guardrails Arena to stress test models and see how well guarded your private information is.
Anyone can try to make models reveal information they should not share 😈
(which is quite fun to do for the strongest models)!

The votes will then be gathered to create an Elo ranking of the safest models with respect to PII.
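For readers unfamiliar with Elo, here is a minimal Python sketch of the standard Elo update applied to a single pairwise vote. The arena's actual rating scheme is not described in the post, so the K-factor of 32, the starting rating of 1000, and the model names are all assumptions.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(ratings, winner, loser, k=32):
    """Update both ratings in place after one vote in favor of `winner`."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1 - exp_win)  # larger gain when the win was unexpected
    ratings[winner] += delta
    ratings[loser] -= delta

# Hypothetical models starting from the same rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
update_elo(ratings, "model_a", "model_b")  # one vote: model_a guarded better
```

Repeating this update over many votes yields a ranking in which models that win more often, especially against highly rated opponents, rise to the top.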

In the future, with the support of the community, this arena could inform the safety choices that companies make when choosing models and guardrails based on their resistance to adversarial attacks.
It's also a good way to easily demonstrate the limitations of current systems!

Check out the arena: lighthouzai/guardrails-arena
Learn more in the blog: https://huggingface.co/blog/arena-lighthouz

replied to osanseviero's post 8 months ago

Have you read the works of @irenesolaiman ?
https://huggingface.co/papers/2302.04844

posted an update 8 months ago

Post

🔥 New multimodal leaderboard on the hub: ConTextual!

Many situations require models to parse images containing text: maps, web pages, real world pictures, memes, ... 🖼️
So how do you evaluate performance on this task?

The ConTextual team introduced a brand new dataset of instructions and images, to test LMM (large multimodal models) reasoning capabilities, and an associated leaderboard (with a private test set).

This is super exciting imo because it has the potential to be a good benchmark both for multimodal models and for assistants' vision capabilities, thanks to the instructions in the dataset.

Congrats to @rohan598 , @hbXNov , @kaiweichang and @violetpeng !!

Learn more in the blog: https://huggingface.co/blog/leaderboard-contextual
Leaderboard: ucla-contextual/contextual_leaderboard

posted an update 8 months ago

Post

First big community contribution on our evaluation suite, lighteval ⛅️

@Ali-C137 added 3 evaluation tasks in Arabic:
- ACVA, a benchmark about Arabic culture
- MMLU, translated
- Exams, translated
(datasets provided/translated by the AceGPT team)

Congrats to them!
https://github.com/huggingface/lighteval/pull/44

1 reply

posted an update 8 months ago

Post

New base pretrained models on the Open LLM Leaderboard!

Two new OSS models by Google, who's getting back in the game 😎
The 7B is 2nd on the leaderboard, and better than Mistral (notably on GSM8K, aka math).

google/gemma-7b
google/gemma-2b

Check more results on the leaderboard https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard

posted an update 8 months ago

Post

🔥 New LLM leaderboard blog: Open Ko LLM!

One of the oldest leaderboards on the hub, it has already evaluated more than 1000 models! It uses Korean translations of MMLU, ARC, HellaSwag, TruthfulQA, and a new dataset, Korean CommonGen, about specific common sense alignment.

upstage/open-ko-llm-leaderboard

What's interesting about this leaderboard is how it drove LLM development in Korea, with on average about 4 submissions/models per day since it started!
Really looking forward to seeing similar initiatives in other languages, to help high-quality models emerge outside of "just English" (for the other 2/3rds of the world).

Read more about the leaderboard in the intro blog: https://huggingface.co/blog/leaderboards-on-the-hub-upstage
Congrats to @Chanjun , @hunkim and the Upstage team!

replied to DavidVivancos's post 9 months ago

Hi! Amazing to see a leaderboard for brain signals!
Are the scores self reported?

replied to yuchenlin's post 9 months ago

This is super cool!
Would love to feature your work in our "Leaderboards on the Hub" series (https://huggingface.co/blog?tag=leaderboard) once you get enough data to actually publish a leaderboard :)

posted an update 9 months ago

Post

🔥 New LLM leaderboard on the hub: NPHardEval!

It uses questions of logic, of different mathematical complexities, as a proxy for reasoning abilities. It notably removes questions relying on arithmetic, to really focus on logical abilities.
What's interesting imo is the potential to really study a model's performance at different levels of complexity.

Bonus: Since the questions can be generated automatically, it's going to be dynamic, updated monthly! 🚀
NPHardEval/NPHardEval-leaderboard

Read more about how their questions are generated in the intro blog: https://huggingface.co/blog/leaderboards-on-the-hub-nphardeval

Congrats to @lizhouf , @wenyueH , @hyfrankl and their teams!

replied to their post 9 months ago

Thanks a lot, edited the post :)

posted an update 9 months ago

Post

🔥 New LLM leaderboard on the hub: an Enterprise Scenarios Leaderboard!

This work evaluates LLMs on several real world use cases (Finance documents, Legal confidentiality, Customer support, ...), which makes it grounded, and interesting for companies! 🏢
Bonus: the test set is private, so it's hard to game 🔥
PatronusAI/enterprise_scenarios_leaderboard

Side note: I discovered through this benchmark that you could evaluate "Engagingness" of an LLM, which could also be interesting for our LLM fine-tuning community out there.

Read more about their different tasks and metrics in the intro blog: https://huggingface.co/blog/leaderboards-on-the-hub-patronus

Congrats to @sunitha98 who led the leaderboard implementation, and to @rebeccaqian and @anandnk24 , all at Patronus AI !

2 replies

posted an update 9 months ago

Post

🔥 New LLM leaderboard on the hub: an LLM Hallucination Leaderboard!

Led by @pminervini , it evaluates the propensity of models to *hallucinate*, either on factuality (= say false things) or faithfulness (= ignore user instructions). This is becoming an increasingly important avenue of research, as more and more people are starting to rely on LLMs to find and search for information!
It contains 14 datasets, grouped over 7 concepts, to try to get a better overall view of when LLMs output wrong content.
hallucinations-leaderboard/leaderboard

Their introductory blog post also contains an in depth analysis of which LLMs get what wrong, which is super interesting: https://huggingface.co/blog/leaderboards-on-the-hub-hallucinations

Congrats to the team! 🚀

posted an update 9 months ago

Post

🔥 New LLM leaderboard on the hub: an LLM Safety Leaderboard!

It evaluates LLM safety, such as bias and toxicity, PII, and robustness, and is powered by DecodingTrust (outstanding paper at NeurIPS!) 🚀
AI-Secure/llm-trustworthy-leaderboard

It's great to see such initiatives emerge, trying to understand the risks and biases of LLMs, and I'm hoping other tools will follow. It should be interesting for the community of model builders (whether or not they want uncensored models ^^).

Detailed intro blog: https://huggingface.co/blog/leaderboards-on-the-hub-decodingtrust.

Congrats to the AI Secure team!

replied to their post 9 months ago

We're missing a 🪄 emoji :D

posted an update 9 months ago

Post

🏅 New top model on the GAIA benchmark!

Called FRIDAY, it's a mysterious new autonomous agent, which got quite good performances on both the public validation set *and* the private test set.
It notably passed 10 points for the val and 5 points for the test set on our hardest questions (level 3): they require taking arbitrarily long sequences of actions, using any number of tools, and accessing the world in general! ✨

The GAIA benchmark evaluates next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc) and was co authored by @gregmialz @ThomasNLG @ylecun @thomwolf and myself: gaia-benchmark/leaderboard

5 replies

Articles

Introducing the Open FinLLM Leaderboard

BigCodeBench: Benchmarking Large Language Models on Solving Practical and Challenging Programming Tasks

Falcon 2: An 11B parameter pretrained language model and VLM, trained on over 5000B tokens and 11 languages

CyberSecEval 2 - A Comprehensive Evaluation Framework for Cybersecurity Risks and Capabilities of Large Language Models

Let's talk about LLM evaluation

Introducing the Open Arabic LLM Leaderboard

Introducing the Open Leaderboard for Hebrew LLMs!

Bringing the Artificial Analysis LLM Performance Leaderboard to Hugging Face

Improving Prompt Consistency with Structured Generations

Introducing the Open Chain of Thought Leaderboard

The Open Medical-LLM Leaderboard: Benchmarking Large Language Models in Healthcare

Introducing the LiveCodeBench Leaderboard - Holistic and Contamination-Free Evaluation of Code LLMs

Introducing the Chatbot Guardrails Arena

Introducing ConTextual: How well can your Multimodal model jointly reason over text and image in text-rich scenes?

TTS Arena: Benchmarking Text-to-Speech Models in the Wild

Introducing the Red-Teaming Resistance Leaderboard

Introducing the Open Ko-LLM Leaderboard: Leading the Korean LLM Evaluation Ecosystem

NPHardEval Leaderboard: Unveiling the Reasoning Abilities of Large Language Models through Complexity Classes and Dynamic Updates

Introducing the Enterprise Scenarios Leaderboard: a Leaderboard for Real World Use Cases

The Hallucinations Leaderboard, an Open Effort to Measure Hallucinations in Large Language Models

A guide to setting up your own Hugging Face leaderboard: an end-to-end example with Vectara's hallucination leaderboard

2023, year of open LLMs

Open LLM Leaderboard: DROP deep dive

Overview of natively supported quantization schemes in 🤗 Transformers

What's going on with the Open LLM Leaderboard?

Introduction to Graph Machine Learning
