mistral-turkish / README.md
metadata
language:
  - tr
  - en
license: apache-2.0
library_name: transformers
model-index:
  - name: alooowso
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: AI2 Reasoning Challenge (25-Shot)
          type: ai2_arc
          config: ARC-Challenge
          split: test
          args:
            num_few_shot: 25
        metrics:
          - type: acc_norm
            value: 62.97
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=notbdq/alooowso
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: HellaSwag (10-Shot)
          type: hellaswag
          split: validation
          args:
            num_few_shot: 10
        metrics:
          - type: acc_norm
            value: 84.87
            name: normalized accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=notbdq/alooowso
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: MMLU (5-Shot)
          type: cais/mmlu
          config: all
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 60.78
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=notbdq/alooowso
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: TruthfulQA (0-shot)
          type: truthful_qa
          config: multiple_choice
          split: validation
          args:
            num_few_shot: 0
        metrics:
          - type: mc2
            value: 68.18
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=notbdq/alooowso
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: Winogrande (5-shot)
          type: winogrande
          config: winogrande_xl
          split: validation
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 77.43
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=notbdq/alooowso
          name: Open LLM Leaderboard
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: GSM8k (5-shot)
          type: gsm8k
          config: main
          split: test
          args:
            num_few_shot: 5
        metrics:
          - type: acc
            value: 39.58
            name: accuracy
        source:
          url: >-
            https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=notbdq/alooowso
          name: Open LLM Leaderboard

A fine-tuned version of mistral-7b-instruct-v0-2. The fine-tuning dataset is a small, custom Turkish question-answering dataset that I made. The main goal was to adapt the model better to the Turkish language. Prompt format:

[INST]<<SYS>>
{System}
<</SYS>>
{User} [/INST] {Assistant}
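
A minimal sketch of filling in this template from Python, assuming the layout above (including its line breaks and the space before `[/INST]`) is exactly what the model expects. The helper name `build_prompt` is hypothetical and is not part of the model or of transformers; the example strings are the Turkish system and user messages used elsewhere in this README.

```python
def build_prompt(system: str, user: str) -> str:
    """Fill the [INST]<<SYS>> template from this README (hypothetical helper).

    The returned string ends at [/INST]; the model's reply is expected
    to be generated after it.
    """
    return f"[INST]<<SYS>>\n{system}\n<</SYS>>\n{user} [/INST]"


prompt = build_prompt(
    # System message: the instruction preamble from the README example.
    "Aşağıda bir görevi tanımlayan bir talimat bulunmaktadır. "
    "İsteği uygun şekilde tamamlayan bir yanıt yazın.",
    # User message: asks for Python code for a snake game.
    "python dilinde yılan oyunu için kod verin",
)
print(prompt)
```

The resulting string can then be tokenized and passed to the model for generation in the usual transformers way.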

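Example in Turkish (the system message reads "Below is an instruction that describes a task. Write a response that appropriately completes the request.", and the user asks for Python code for a snake game):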

[INST]<<SYS>>
Aşağıda bir görevi tanımlayan bir talimat bulunmaktadır. İsteği uygun şekilde tamamlayan bir yanıt yazın.
<</SYS>>
python dilinde yılan oyunu için kod verin [/INST] Aşağıda Python dili ile basit bir snake game kodu bulunmaktadır;
import os
import time

# Konfigürasyonlar
WIDTH = 20
HEIGHT = 10
snake_speed = 0.1
snake_size = 3

class Snake:
    def __init__(self):

[Output cut for brevity]

Open LLM Leaderboard Evaluation Results

Detailed results can be found here

| Metric | Value |
|---|---|
| Avg. | 65.63 |
| AI2 Reasoning Challenge (25-Shot) | 62.97 |
| HellaSwag (10-Shot) | 84.87 |
| MMLU (5-Shot) | 60.78 |
| TruthfulQA (0-shot) | 68.18 |
| Winogrande (5-shot) | 77.43 |
| GSM8k (5-shot) | 39.58 |
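
As a quick sanity check, the reported average appears to be the plain mean of the six benchmark scores. The sketch below assumes that is how the leaderboard computes it; the dictionary name `scores` is only illustrative.

```python
# Scores copied from the results listed above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 62.97,
    "HellaSwag (10-Shot)": 84.87,
    "MMLU (5-Shot)": 60.78,
    "TruthfulQA (0-shot)": 68.18,
    "Winogrande (5-shot)": 77.43,
    "GSM8k (5-shot)": 39.58,
}

# Unweighted mean over the six tasks (assumed aggregation).
average = sum(scores.values()) / len(scores)
print(f"{average:.3f}")
```

The mean of the rounded per-task values comes out at about 65.635, which agrees with the reported 65.63 to within rounding of the individual scores.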