Question about the "until token" of the Hugging Face inference method.

#674
by kimdeokgi - opened

During gsm8k inference, generation is truncated at the ":" character.

[For example]
actual reasoning

  • First, we need to find the third typing speed, which is 52 + 5 = 57 WPM.\nTo find the average, we add the three speeds and divide by 3: (47 + 52 + 57) / 3 = 156 / 3 = 52 WPM.\n#### 52

HuggingFace reasoning

  • First, we need to find the third typing speed, which is 52 + 5 = 57 WPM.\nTo find the average, we add the three speeds and divide by 3:

Even though the reasoning is correct, the generation is truncated, so the answer is treated as incorrect.

From reading the code, the "until tokens" are as follows:
":", "Question:", "Question"
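To illustrate the effect, here is a minimal sketch (a hypothetical helper, not the actual Harness code) of how a list of stop strings truncates a generated continuation at the earliest match, which is why a lone ":" cuts the reasoning short:

```python
def truncate_at_stop(text: str, stop_strings: list[str]) -> str:
    """Cut `text` at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strings:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

generated = (
    "First, we need to find the third typing speed, which is 52 + 5 = 57 WPM.\n"
    "To find the average, we add the three speeds and divide by 3: "
    "(47 + 52 + 57) / 3 = 156 / 3 = 52 WPM.\n#### 52"
)
# The first ":" appears before the final answer, so everything
# after it (including the "#### 52" answer line) is dropped.
print(truncate_at_stop(generated, [":", "Question:", "Question"]))
```

With ":" in the stop list, the output ends right before "(47 + 52 + 57) / 3", and the "#### 52" answer line the scorer looks for never survives.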

Why was ":" designated as an "until token"?
If there is no problem with doing so, how can I exclude it? Please advise.

Open LLM Leaderboard org

Hi!
This is the original design of the GSM8K implementation in the Eleuther AI Harness when we started the leaderboard. Since our goal is to provide entirely reproducible evaluations, we used it as is, with the good and the less good, so anyone could reproduce our results precisely.

However, it has since been changed in the Harness, and we will update this aspect of the scoring in the next iteration of the leaderboard (coming in a month or so). For now, though, we can't re-run the more than 6K models on this evaluation at once, as we don't have the compute.

clefourrier changed discussion status to closed
