|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|3|flexible-extract|5|exact_match|↑|0.8029|±|0.0110|
| | |strict-match|5|exact_match|↑|0.7961|±|0.0111|

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|----------------|------:|------|-----:|--------|---|-----:|---|-----:|
|kobest_boolq|1|none|5|acc|↑|0.9167|±|0.0074|
| | |none|5|f1|↑|0.9167|±|N/A|
|kobest_copa|1|none|5|acc|↑|0.7130|±|0.0143|
| | |none|5|f1|↑|0.7125|±|N/A|
|kobest_hellaswag|1|none|5|acc|↑|0.4540|±|0.0223|
| | |none|5|acc_norm|↑|0.5700|±|0.0222|
| | |none|5|f1|↑|0.4505|±|N/A|
|kobest_sentineg|1|none|5|acc|↑|0.9496|±|0.0110|
| | |none|5|f1|↑|0.9496|±|N/A|
|kobest_wic|1|none|5|acc|↑|0.7111|±|0.0128|
| | |none|5|f1|↑|0.7025|±|N/A|

|Tasks|Version|Filter|n-shot|Metric| |Value| |Stderr|
|-------------------------------------------------------|------:|------|-----:|-----------|---|-----:|---|-----:|
|kmmlu_direct_accounting|2|none|5|exact_match|↑|0.5500|±|0.0500|
|kmmlu_direct_agricultural_sciences|2|none|5|exact_match|↑|0.3680|±|0.0153|
|kmmlu_direct_aviation_engineering_and_maintenance|2|none|5|exact_match|↑|0.4670|±|0.0158|
|kmmlu_direct_biology|2|none|5|exact_match|↑|0.3740|±|0.0153|
|kmmlu_direct_chemical_engineering|2|none|5|exact_match|↑|0.4650|±|0.0158|
|kmmlu_direct_chemistry|2|none|5|exact_match|↑|0.4900|±|0.0204|
|kmmlu_direct_civil_engineering|2|none|5|exact_match|↑|0.3540|±|0.0151|
|kmmlu_direct_computer_science|2|none|5|exact_match|↑|0.7320|±|0.0140|
|kmmlu_direct_construction|2|none|5|exact_match|↑|0.3590|±|0.0152|
|kmmlu_direct_criminal_law|2|none|5|exact_match|↑|0.4250|±|0.0350|
|kmmlu_direct_ecology|2|none|5|exact_match|↑|0.4900|±|0.0158|
|kmmlu_direct_economics|2|none|5|exact_match|↑|0.6154|±|0.0428|
|kmmlu_direct_education|2|none|5|exact_match|↑|0.6900|±|0.0465|
|kmmlu_direct_electrical_engineering|2|none|5|exact_match|↑|0.3170|±|0.0147|
|kmmlu_direct_electronics_engineering|2|none|5|exact_match|↑|0.5440|±|0.0158|
|kmmlu_direct_energy_management|2|none|5|exact_match|↑|0.3960|±|0.0155|
|kmmlu_direct_environmental_science|2|none|5|exact_match|↑|0.2950|±|0.0144|
|kmmlu_direct_fashion|2|none|5|exact_match|↑|0.4660|±|0.0158|
|kmmlu_direct_food_processing|2|none|5|exact_match|↑|0.4370|±|0.0157|
|kmmlu_direct_gas_technology_and_engineering|2|none|5|exact_match|↑|0.3650|±|0.0152|
|kmmlu_direct_geomatics|2|none|5|exact_match|↑|0.3770|±|0.0153|
|kmmlu_direct_health|2|none|5|exact_match|↑|0.6200|±|0.0488|
|kmmlu_direct_industrial_engineer|2|none|5|exact_match|↑|0.4730|±|0.0158|
|kmmlu_direct_information_technology|2|none|5|exact_match|↑|0.7080|±|0.0144|
|kmmlu_direct_interior_architecture_and_design|2|none|5|exact_match|↑|0.6080|±|0.0154|
|kmmlu_direct_korean_history|2|none|5|exact_match|↑|0.3200|±|0.0469|
|kmmlu_direct_law|2|none|5|exact_match|↑|0.4730|±|0.0158|
|kmmlu_direct_machine_design_and_manufacturing|2|none|5|exact_match|↑|0.4750|±|0.0158|
|kmmlu_direct_management|2|none|5|exact_match|↑|0.6160|±|0.0154|
|kmmlu_direct_maritime_engineering|2|none|5|exact_match|↑|0.4817|±|0.0204|
|kmmlu_direct_marketing|2|none|5|exact_match|↑|0.8010|±|0.0126|
|kmmlu_direct_materials_engineering|2|none|5|exact_match|↑|0.4970|±|0.0158|
|kmmlu_direct_math|2|none|5|exact_match|↑|0.3500|±|0.0276|
|kmmlu_direct_mechanical_engineering|2|none|5|exact_match|↑|0.4040|±|0.0155|
|kmmlu_direct_nondestructive_testing|2|none|5|exact_match|↑|0.4580|±|0.0158|
|kmmlu_direct_patent|2|none|5|exact_match|↑|0.4100|±|0.0494|
|kmmlu_direct_political_science_and_sociology|2|none|5|exact_match|↑|0.5500|±|0.0288|
|kmmlu_direct_psychology|2|none|5|exact_match|↑|0.4700|±|0.0158|
|kmmlu_direct_public_safety|2|none|5|exact_match|↑|0.3680|±|0.0153|
|kmmlu_direct_railway_and_automotive_engineering|2|none|5|exact_match|↑|0.3550|±|0.0151|
|kmmlu_direct_real_estate|2|none|5|exact_match|↑|0.4650|±|0.0354|
|kmmlu_direct_refrigerating_machinery|2|none|5|exact_match|↑|0.3730|±|0.0153|
|kmmlu_direct_social_welfare|2|none|5|exact_match|↑|0.6140|±|0.0154|
|kmmlu_direct_taxation|2|none|5|exact_match|↑|0.4050|±|0.0348|
|kmmlu_direct_telecommunications_and_wireless_technology|2|none|5|exact_match|↑|0.6080|±|0.0154|

|Groups|Version|Filter|n-shot|Metric| |Value| |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu|2|none| |acc|↑|0.6755|±|0.0038|
| - humanities|2|none| |acc|↑|0.6140|±|0.0067|
| - other|2|none| |acc|↑|0.7271|±|0.0077|
| - social sciences|2|none| |acc|↑|0.7793|±|0.0073|
| - stem|2|none| |acc|↑|0.6153|±|0.0084|