---
language:
  - en
license: apache-2.0
tags:
  - merge
model-index:
  - name: palmer-002.5
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 37.54
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=appvoid/palmer-002.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 61.84
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=appvoid/palmer-002.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 25.21
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=appvoid/palmer-002.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 40.22
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=appvoid/palmer-002.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 66.38
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=appvoid/palmer-002.5
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 1.97
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=appvoid/palmer-002.5
          name: Open LLM Leaderboard

Creative writing has never been so accessible: palmer goes beyond what was thought possible for small language models. This model is a "MErging of Experts" (MEoE) that uses palmer-002-2401 as its base and is biased toward acting as an assistant without needing any prompt. As a result of these efforts, palmer performs better than most 1B language models on most benchmarks, despite sometimes being 40% smaller than its counterparts.

| Model          | MMLU   | ARC-C  | OBQA   | HellaSwag | PIQA   | Winogrande | Average |
|----------------|--------|--------|--------|-----------|--------|------------|---------|
| tinyllama-chat | 0.2470 | 0.3285 | 0.3740 | 0.6037    | 0.7448 | 0.6022     | 0.4833  |
| zyte-1b        | 0.2397 | 0.3353 | 0.3700 | 0.6086    | 0.7541 | 0.5998     | 0.4845  |
| palmer-002.5   | 0.2534 | 0.3370 | 0.3740 | 0.6128    | 0.7486 | 0.6535     | 0.4965  |
| qwen-1-8       | 0.4536 | 0.3490 | 0.3320 | 0.5876    | 0.7307 | 0.5896     | 0.5070  |
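As a sanity check, the Average column can be reproduced from the six per-benchmark scores; judging by the numbers, the table appears to truncate (not round) the mean to four decimal places. A minimal sketch:

```python
import math

# Per-benchmark scores copied from the table above, in order:
# MMLU, ARC-C, OBQA, HellaSwag, PIQA, Winogrande
scores = {
    "tinyllama-chat": [0.2470, 0.3285, 0.3740, 0.6037, 0.7448, 0.6022],
    "zyte-1b":        [0.2397, 0.3353, 0.3700, 0.6086, 0.7541, 0.5998],
    "palmer-002.5":   [0.2534, 0.3370, 0.3740, 0.6128, 0.7486, 0.6535],
    "qwen-1-8":       [0.4536, 0.3490, 0.3320, 0.5876, 0.7307, 0.5896],
}

def average(vals, places=4):
    """Plain mean, truncated to `places` decimals (matches the table)."""
    mean = sum(vals) / len(vals)
    factor = 10 ** places
    return math.floor(mean * factor) / factor

for model, vals in scores.items():
    # Each value matches the Average column in the table above.
    print(model, average(vals))
```

Truncation rather than rounding is an inference from the published values (e.g. zyte-1b's mean is 0.48458..., listed as 0.4845), not something the card states.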

Given its compactness, this work constitutes an advancement for small language models (SLMs), making it easy to empower edge devices such as mobile phones, Raspberry Pis, and automated software/robots. Additionally, palmer-002.5 departs from the palmer family's usual philosophy, becoming a more powerful model by training on more data instead of less.

```
prompt: Reality is but
output: a dream,
And the dreams we make are our reality.

The world is a canvas, painted by our minds,
And we can make it a masterpiece.

So let us create, let us dream,
And let our imagination run wild.

For in our imagination lies our power,
To create a world that is truly our own.
```

You can support me through Ko-fi.

Note that, like any popular language model built on a transformer architecture, this model's output sometimes contains hallucinations (mistakes or false statements), so it must be used with caution in sensitive scenarios.

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric                            | Value |
|-----------------------------------|-------|
| Avg.                              | 38.86 |
| AI2 Reasoning Challenge (25-Shot) | 37.54 |
| HellaSwag (10-Shot)               | 61.84 |
| MMLU (5-Shot)                     | 25.21 |
| TruthfulQA (0-shot)               | 40.22 |
| Winogrande (5-shot)               | 66.38 |
| GSM8k (5-shot)                    | 1.97  |
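The Avg. row is consistent with a plain mean of the six leaderboard scores, this time rounded to two decimal places; a quick check:

```python
# Leaderboard scores from the table above, in order:
# ARC-C, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8k
leaderboard = [37.54, 61.84, 25.21, 40.22, 66.38, 1.97]

avg = sum(leaderboard) / len(leaderboard)
print(round(avg, 2))  # 38.86, matching the Avg. row
```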