Results

These results come from running `score_predictions.py` from the BabyLM evaluation pipeline on the `ELC_ParserBERT_10M_textonly_predictions.json.gz` file in this directory, which contains the model's predictions for each evaluation task.

Overall Results

Here are the average scores per section (in percent) and the macroaverage across the four sections, compared with the baseline models:

| Model | BLiMP | BLiMP Supplement | EWoK | GLUE | Macroaverage |
| --- | --- | --- | --- | --- | --- |
| BabyLlama | 69.8 | 59.5 | 50.7 | 63.3 | 60.8 |
| LTG-BERT | 60.6 | 60.8 | 48.9 | 60.3 | 57.7 |
| ELC-ParserBERT | 59.6 | 57.7 | 63.1 | 44.5 | 56.2 |
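
The macroaverage column appears consistent with an unweighted mean of the four section averages. As a minimal sketch of that arithmetic (the function name `macroaverage` is illustrative and not part of the evaluation pipeline), the ELC-ParserBERT row can be recomputed as follows:

```python
# Section averages (in percent) copied from the ELC-ParserBERT row above.
section_scores = {
    "BLiMP": 59.6,
    "BLiMP Supplement": 57.7,
    "EWoK": 63.1,
    "GLUE": 44.5,
}


def macroaverage(scores):
    """Unweighted mean of the per-section averages (an assumed definition)."""
    return sum(scores.values()) / len(scores)


macro = macroaverage(section_scores)
print(f"{macro:.1f}")  # 56.2
```

Each section counts equally here regardless of how many subtasks it contains, so the 67 BLiMP subtasks carry the same weight as the 5 BLiMP Supplement subtasks.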

The Breakdown Per Section

| GLUE subtask | Score |
| --- | --- |
| cola (MCC) | 0.042 |
| sst2 | 0.502 |
| mrpc (F1) | 0.82 |
| qqp (F1) | 0 |
| mnli | 0.357 |
| mnli-mm | 0.355 |
| qnli | 0.491 |
| rte | 0.496 |
| boolq | 0.585 |
| multirc | 0.63 |
| wsc | 0.615 |
| Average | 0.445 |

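
The GLUE average above matches an unweighted mean over the eleven subtask scores, mixing MCC, F1, and accuracy values as reported. The snippet below is only a sketch for checking that arithmetic; it is not the scoring script itself, and the dictionary is simply the table transcribed by hand:

```python
# GLUE subtask scores transcribed from the table above.
glue_scores = {
    "cola": 0.042,     # MCC
    "sst2": 0.502,
    "mrpc": 0.82,      # F1
    "qqp": 0.0,        # F1
    "mnli": 0.357,
    "mnli-mm": 0.355,
    "qnli": 0.491,
    "rte": 0.496,
    "boolq": 0.585,
    "multirc": 0.63,
    "wsc": 0.615,
}

# Unweighted mean over subtasks (assumed to be how the section average is formed).
glue_average = sum(glue_scores.values()) / len(glue_scores)
print(f"{glue_average:.3f}")  # 0.445
```

Multiplying by 100 gives the 44.5 reported in the GLUE column of the overall table. The qqp score of 0 alone lowers the section average by several points.
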

| BLiMP subtask | Score |
| --- | --- |
| adjunct_island | 0.712 |
| anaphor_gender_agreement | 0.593 |
| anaphor_number_agreement | 0.647 |
| animate_subject_passive | 0.594 |
| animate_subject_trans | 0.47 |
| causative | 0.726 |
| complex_NP_island | 0.447 |
| coordinate_structure_constraint_complex_left_branch | 0.39 |
| coordinate_structure_constraint_object_extraction | 0.806 |
| determiner_noun_agreement_1 | 0.793 |
| determiner_noun_agreement_2 | 0.936 |
| determiner_noun_agreement_irregular_1 | 0.467 |
| determiner_noun_agreement_irregular_2 | 0.394 |
| determiner_noun_agreement_with_adj_2 | 0.889 |
| determiner_noun_agreement_with_adj_irregular_1 | 0.834 |
| determiner_noun_agreement_with_adj_irregular_2 | 0.848 |
| determiner_noun_agreement_with_adjective_1 | 0.758 |
| distractor_agreement_relational_noun | 0.212 |
| distractor_agreement_relative_clause | 0.282 |
| drop_argument | 0.485 |
| ellipsis_n_bar_1 | 0.505 |
| ellipsis_n_bar_2 | 0.342 |
| existential_there_object_raising | 0.447 |
| existential_there_quantifiers_1 | 0.385 |
| existential_there_quantifiers_2 | 0.396 |
| existential_there_subject_raising | 0.476 |
| expletive_it_object_raising | 0.44 |
| inchoative | 0.527 |
| intransitive | 0.484 |
| irregular_past_participle_adjectives | 0.348 |
| irregular_past_participle_verbs | 0.594 |
| irregular_plural_subject_verb_agreement_1 | 0.634 |
| irregular_plural_subject_verb_agreement_2 | 0.687 |
| left_branch_island_echo_question | 0.634 |
| left_branch_island_simple_question | 0.615 |
| matrix_question_npi_licensor_present | 0.206 |
| npi_present_1 | 0.362 |
| npi_present_2 | 0.347 |
| only_npi_licensor_present | 0.964 |
| only_npi_scope | 0.89 |
| passive_1 | 0.514 |
| passive_2 | 0.482 |
| principle_A_c_command | 0.635 |
| principle_A_case_1 | 0.999 |
| principle_A_case_2 | 0.78 |
| principle_A_domain_1 | 0.893 |
| principle_A_domain_2 | 0.623 |
| principle_A_domain_3 | 0.556 |
| principle_A_reconstruction | 0.339 |
| regular_plural_subject_verb_agreement_1 | 0.628 |
| regular_plural_subject_verb_agreement_2 | 0.663 |
| sentential_negation_npi_licensor_present | 0.93 |
| sentential_negation_npi_scope | 0.722 |
| sentential_subject_island | 0.361 |
| superlative_quantifiers_1 | 0.702 |
| superlative_quantifiers_2 | 0.498 |
| tough_vs_raising_1 | 0.351 |
| tough_vs_raising_2 | 0.648 |
| transitive | 0.645 |
| wh_island | 0.719 |
| wh_questions_object_gap | 0.657 |
| wh_questions_subject_gap | 0.861 |
| wh_questions_subject_gap_long_distance | 0.937 |
| wh_vs_that_no_gap | 0.969 |
| wh_vs_that_no_gap_long_distance | 0.969 |
| wh_vs_that_with_gap | 0.222 |
| wh_vs_that_with_gap_long_distance | 0.063 |
| Average | 0.596 |


| BLiMP Supplement subtask | Score |
| --- | --- |
| hypernym | 0.531 |
| qa_congruence_easy | 0.641 |
| qa_congruence_tricky | 0.521 |
| subject_aux_inversion | 0.614 |
| turn_taking | 0.579 |
| Average | 0.577 |


| EWoK subtask | Score |
| --- | --- |
| agent-properties | 0.738 |
| material-dynamics | 0.81 |
| material-properties | 0.6 |
| physical-dynamics | 0.383 |
| physical-interactions | 0.599 |
| physical-relations | 0.817 |
| quantitative-properties | 0.427 |
| social-interactions | 0.565 |
| social-properties | 0.561 |
| social-relations | 0.807 |
| spatial-relations | 0.635 |
| Average | 0.631 |