“SufurElite” committed on
Commit
2bdaae1
1 Parent(s): a357992

added curriculum learning blimp results
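
A minimal sketch for inspecting the added results file, assuming it is read from the path shown in this commit and has the top-level `results` and `group_subtasks` keys visible in the diff. The helper names (`load_results`, `subtask_accuracies`, `below_chance`) are hypothetical and are not part of the evaluation harness. The 0.5 baseline is an assumption based on each item offering two choices (`sentence_good`, `sentence_bad`).

```python
import json
from pathlib import Path

# Path as it appears in this commit; adjust it to the local checkout.
RESULTS_PATH = Path(
    "checkpoint/curriculum_learning_results/curriculum_learning_blimp_results.json"
)


def load_results(path=RESULTS_PATH):
    """Read the results JSON into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def subtask_accuracies(report, group):
    """Map each subtask listed under `group` to its "acc,none" value."""
    results = report["results"]
    return {
        name: results[name]["acc,none"]
        for name in report["group_subtasks"][group]
    }


def below_chance(accuracies, chance=0.5):
    """Names of subtasks scoring under `chance`, lowest accuracy first."""
    low = [name for name, acc in accuracies.items() if acc < chance]
    return sorted(low, key=accuracies.get)
```

Calling `below_chance(subtask_accuracies(load_results(), "blimp_filtered"))` would list the filtered BLiMP subtasks on which the model prefers the ungrammatical sentence more often than the grammatical one.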

checkpoint/curriculum_learning_results/curriculum_learning_blimp_results.json ADDED
@@ -0,0 +1,2720 @@
+ {
+ "results": {
+ "blimp_supplement": {
+ "acc,none": 0.6018926304374915,
+ "acc_stderr,none": 0.0066155758006338295,
+ "alias": "blimp_supplement"
+ },
+ "blimp_supplement_hypernym": {
+ "acc,none": 0.5344418052256532,
+ "acc_stderr,none": 0.017200425915308487,
+ "alias": " - blimp_supplement_hypernym"
+ },
+ "blimp_supplement_qa_congruence_easy": {
+ "acc,none": 0.71875,
+ "acc_stderr,none": 0.05664543544843536,
+ "alias": " - blimp_supplement_qa_congruence_easy"
+ },
+ "blimp_supplement_qa_congruence_tricky": {
+ "acc,none": 0.5454545454545454,
+ "acc_stderr,none": 0.03888176921674101,
+ "alias": " - blimp_supplement_qa_congruence_tricky"
+ },
+ "blimp_supplement_subject_aux_inversion": {
+ "acc,none": 0.6679596586501164,
+ "acc_stderr,none": 0.007574249693780377,
+ "alias": " - blimp_supplement_subject_aux_inversion"
+ },
+ "blimp_supplement_turn_taking": {
+ "acc,none": 0.5428571428571428,
+ "acc_stderr,none": 0.029824051857478606,
+ "alias": " - blimp_supplement_turn_taking"
+ },
+ "blimp_filtered": {
+ "acc,none": 0.565949835634104,
+ "acc_stderr,none": 0.0018253226294923256,
+ "alias": "blimp_filtered"
+ },
+ "blimp_adjunct_island_filtered": {
+ "acc,none": 0.8060344827586207,
+ "acc_stderr,none": 0.012986711960851074,
+ "alias": " - blimp_adjunct_island_filtered"
+ },
+ "blimp_anaphor_gender_agreement_filtered": {
+ "acc,none": 0.5633367662203913,
+ "acc_stderr,none": 0.015924708611966137,
+ "alias": " - blimp_anaphor_gender_agreement_filtered"
+ },
+ "blimp_anaphor_number_agreement_filtered": {
+ "acc,none": 0.728249194414608,
+ "acc_stderr,none": 0.014587603562175177,
+ "alias": " - blimp_anaphor_number_agreement_filtered"
+ },
+ "blimp_animate_subject_passive_filtered": {
+ "acc,none": 0.6256983240223464,
+ "acc_stderr,none": 0.01618544417945717,
+ "alias": " - blimp_animate_subject_passive_filtered"
+ },
+ "blimp_animate_subject_trans_filtered": {
+ "acc,none": 0.3629469122426869,
+ "acc_stderr,none": 0.01583594209290492,
+ "alias": " - blimp_animate_subject_trans_filtered"
+ },
+ "blimp_causative_filtered": {
+ "acc,none": 0.7212713936430318,
+ "acc_stderr,none": 0.01568660993194521,
+ "alias": " - blimp_causative_filtered"
+ },
+ "blimp_complex_NP_island_filtered": {
+ "acc,none": 0.4101654846335697,
+ "acc_stderr,none": 0.016920620795502286,
+ "alias": " - blimp_complex_NP_island_filtered"
+ },
+ "blimp_coordinate_structure_constraint_complex_left_branch_filtered": {
+ "acc,none": 0.44039735099337746,
+ "acc_stderr,none": 0.016502051579516674,
+ "alias": " - blimp_coordinate_structure_constraint_complex_left_branch_filtered"
+ },
+ "blimp_coordinate_structure_constraint_object_extraction_filtered": {
+ "acc,none": 0.6691253951527925,
+ "acc_stderr,none": 0.015282039067257242,
+ "alias": " - blimp_coordinate_structure_constraint_object_extraction_filtered"
+ },
+ "blimp_determiner_noun_agreement_1_filtered": {
+ "acc,none": 0.7782561894510226,
+ "acc_stderr,none": 0.013636818388739074,
+ "alias": " - blimp_determiner_noun_agreement_1_filtered"
+ },
+ "blimp_determiner_noun_agreement_2_filtered": {
+ "acc,none": 0.9183673469387755,
+ "acc_stderr,none": 0.008978394797225727,
+ "alias": " - blimp_determiner_noun_agreement_2_filtered"
+ },
+ "blimp_determiner_noun_agreement_irregular_1_filtered": {
+ "acc,none": 0.4581497797356828,
+ "acc_stderr,none": 0.019106841978411313,
+ "alias": " - blimp_determiner_noun_agreement_irregular_1_filtered"
+ },
+ "blimp_determiner_noun_agreement_irregular_2_filtered": {
+ "acc,none": 0.4024390243902439,
+ "acc_stderr,none": 0.017135595695835195,
+ "alias": " - blimp_determiner_noun_agreement_irregular_2_filtered"
+ },
+ "blimp_determiner_noun_agreement_with_adj_2_filtered": {
+ "acc,none": 0.8522848034006376,
+ "acc_stderr,none": 0.01157286891797037,
+ "alias": " - blimp_determiner_noun_agreement_with_adj_2_filtered"
+ },
+ "blimp_determiner_noun_agreement_with_adj_irregular_1_filtered": {
+ "acc,none": 0.8259052924791086,
+ "acc_stderr,none": 0.014161143742571668,
+ "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_1_filtered"
+ },
+ "blimp_determiner_noun_agreement_with_adj_irregular_2_filtered": {
+ "acc,none": 0.8380952380952381,
+ "acc_stderr,none": 0.01271731759602976,
+ "alias": " - blimp_determiner_noun_agreement_with_adj_irregular_2_filtered"
+ },
+ "blimp_determiner_noun_agreement_with_adjective_1_filtered": {
+ "acc,none": 0.7663451232583065,
+ "acc_stderr,none": 0.013860907579350762,
+ "alias": " - blimp_determiner_noun_agreement_with_adjective_1_filtered"
+ },
+ "blimp_distractor_agreement_relational_noun_filtered": {
+ "acc,none": 0.2550761421319797,
+ "acc_stderr,none": 0.015538299767129306,
+ "alias": " - blimp_distractor_agreement_relational_noun_filtered"
+ },
+ "blimp_distractor_agreement_relative_clause_filtered": {
+ "acc,none": 0.28128587830080365,
+ "acc_stderr,none": 0.015243771404460137,
+ "alias": " - blimp_distractor_agreement_relative_clause_filtered"
+ },
+ "blimp_drop_argument_filtered": {
+ "acc,none": 0.4945652173913043,
+ "acc_stderr,none": 0.016492503758896365,
+ "alias": " - blimp_drop_argument_filtered"
+ },
+ "blimp_ellipsis_n_bar_1_filtered": {
+ "acc,none": 0.4713216957605985,
+ "acc_stderr,none": 0.017637547724110917,
+ "alias": " - blimp_ellipsis_n_bar_1_filtered"
+ },
+ "blimp_ellipsis_n_bar_2_filtered": {
+ "acc,none": 0.34299516908212563,
+ "acc_stderr,none": 0.01650728039405027,
+ "alias": " - blimp_ellipsis_n_bar_2_filtered"
+ },
+ "blimp_existential_there_object_raising_filtered": {
+ "acc,none": 0.458128078817734,
+ "acc_stderr,none": 0.017495701153044056,
+ "alias": " - blimp_existential_there_object_raising_filtered"
+ },
+ "blimp_existential_there_quantifiers_1_filtered": {
+ "acc,none": 0.34623655913978496,
+ "acc_stderr,none": 0.015609497407549041,
+ "alias": " - blimp_existential_there_quantifiers_1_filtered"
+ },
+ "blimp_existential_there_quantifiers_2_filtered": {
+ "acc,none": 0.47091108671789245,
+ "acc_stderr,none": 0.016546764735538556,
+ "alias": " - blimp_existential_there_quantifiers_2_filtered"
+ },
+ "blimp_existential_there_subject_raising_filtered": {
+ "acc,none": 0.4588744588744589,
+ "acc_stderr,none": 0.016401935840452977,
+ "alias": " - blimp_existential_there_subject_raising_filtered"
+ },
+ "blimp_expletive_it_object_raising_filtered": {
+ "acc,none": 0.4505928853754941,
+ "acc_stderr,none": 0.018071936911306062,
+ "alias": " - blimp_expletive_it_object_raising_filtered"
+ },
+ "blimp_inchoative_filtered": {
+ "acc,none": 0.552046783625731,
+ "acc_stderr,none": 0.017016699757166347,
+ "alias": " - blimp_inchoative_filtered"
+ },
+ "blimp_intransitive_filtered": {
+ "acc,none": 0.4988479262672811,
+ "acc_stderr,none": 0.016980845193638953,
+ "alias": " - blimp_intransitive_filtered"
+ },
+ "blimp_irregular_past_participle_adjectives_filtered": {
+ "acc,none": 0.3610822060353798,
+ "acc_stderr,none": 0.015502078036777407,
+ "alias": " - blimp_irregular_past_participle_adjectives_filtered"
+ },
+ "blimp_irregular_past_participle_verbs_filtered": {
+ "acc,none": 0.4745222929936306,
+ "acc_stderr,none": 0.016278359915431962,
+ "alias": " - blimp_irregular_past_participle_verbs_filtered"
+ },
+ "blimp_irregular_plural_subject_verb_agreement_1_filtered": {
+ "acc,none": 0.5907960199004975,
+ "acc_stderr,none": 0.017351256599074046,
+ "alias": " - blimp_irregular_plural_subject_verb_agreement_1_filtered"
+ },
+ "blimp_irregular_plural_subject_verb_agreement_2_filtered": {
+ "acc,none": 0.6468609865470852,
+ "acc_stderr,none": 0.016011774940164352,
+ "alias": " - blimp_irregular_plural_subject_verb_agreement_2_filtered"
+ },
+ "blimp_left_branch_island_echo_question_filtered": {
+ "acc,none": 0.5100316789862724,
+ "acc_stderr,none": 0.01625312997719913,
+ "alias": " - blimp_left_branch_island_echo_question_filtered"
+ },
+ "blimp_left_branch_island_simple_question_filtered": {
+ "acc,none": 0.5184016824395373,
+ "acc_stderr,none": 0.016211152044629508,
+ "alias": " - blimp_left_branch_island_simple_question_filtered"
+ },
+ "blimp_matrix_question_npi_licensor_present_filtered": {
+ "acc,none": 0.2906350914962325,
+ "acc_stderr,none": 0.014905099765395396,
+ "alias": " - blimp_matrix_question_npi_licensor_present_filtered"
+ },
+ "blimp_npi_present_1_filtered": {
+ "acc,none": 0.22552255225522552,
+ "acc_stderr,none": 0.013869361007529328,
+ "alias": " - blimp_npi_present_1_filtered"
+ },
+ "blimp_npi_present_2_filtered": {
+ "acc,none": 0.2461706783369803,
+ "acc_stderr,none": 0.01425670901286222,
+ "alias": " - blimp_npi_present_2_filtered"
+ },
+ "blimp_only_npi_licensor_present_filtered": {
+ "acc,none": 0.18480725623582767,
+ "acc_stderr,none": 0.013076806819440822,
+ "alias": " - blimp_only_npi_licensor_present_filtered"
+ },
+ "blimp_only_npi_scope_filtered": {
+ "acc,none": 0.8112305854241338,
+ "acc_stderr,none": 0.013534269930649458,
+ "alias": " - blimp_only_npi_scope_filtered"
+ },
+ "blimp_passive_1_filtered": {
+ "acc,none": 0.5416666666666666,
+ "acc_stderr,none": 0.017201875361661928,
+ "alias": " - blimp_passive_1_filtered"
+ },
+ "blimp_passive_2_filtered": {
+ "acc,none": 0.5027685492801772,
+ "acc_stderr,none": 0.01664792374125215,
+ "alias": " - blimp_passive_2_filtered"
+ },
+ "blimp_principle_A_c_command_filtered": {
+ "acc,none": 0.5179704016913319,
+ "acc_stderr,none": 0.01625449273385581,
+ "alias": " - blimp_principle_A_c_command_filtered"
+ },
+ "blimp_principle_A_case_1_filtered": {
+ "acc,none": 1.0,
+ "acc_stderr,none": 0.0,
+ "alias": " - blimp_principle_A_case_1_filtered"
+ },
+ "blimp_principle_A_case_2_filtered": {
+ "acc,none": 0.7300546448087432,
+ "acc_stderr,none": 0.014683937114836496,
+ "alias": " - blimp_principle_A_case_2_filtered"
+ },
+ "blimp_principle_A_domain_1_filtered": {
+ "acc,none": 0.7407002188183808,
+ "acc_stderr,none": 0.014503971003656274,
+ "alias": " - blimp_principle_A_domain_1_filtered"
+ },
+ "blimp_principle_A_domain_2_filtered": {
+ "acc,none": 0.6054644808743169,
+ "acc_stderr,none": 0.016166436151759677,
+ "alias": " - blimp_principle_A_domain_2_filtered"
+ },
+ "blimp_principle_A_domain_3_filtered": {
+ "acc,none": 0.5313496280552603,
+ "acc_stderr,none": 0.016276114885524856,
+ "alias": " - blimp_principle_A_domain_3_filtered"
+ },
+ "blimp_principle_A_reconstruction_filtered": {
+ "acc,none": 0.20992761116856257,
+ "acc_stderr,none": 0.013103269124026679,
+ "alias": " - blimp_principle_A_reconstruction_filtered"
+ },
+ "blimp_regular_plural_subject_verb_agreement_1_filtered": {
+ "acc,none": 0.5685393258426966,
+ "acc_stderr,none": 0.016611160843889996,
+ "alias": " - blimp_regular_plural_subject_verb_agreement_1_filtered"
+ },
+ "blimp_regular_plural_subject_verb_agreement_2_filtered": {
+ "acc,none": 0.6529100529100529,
+ "acc_stderr,none": 0.015493933877182859,
+ "alias": " - blimp_regular_plural_subject_verb_agreement_2_filtered"
+ },
+ "blimp_sentential_negation_npi_licensor_present_filtered": {
+ "acc,none": 0.9368879216539717,
+ "acc_stderr,none": 0.008025622361173993,
+ "alias": " - blimp_sentential_negation_npi_licensor_present_filtered"
+ },
+ "blimp_sentential_negation_npi_scope_filtered": {
+ "acc,none": 0.7210103329506314,
+ "acc_stderr,none": 0.015205656567297107,
+ "alias": " - blimp_sentential_negation_npi_scope_filtered"
+ },
+ "blimp_sentential_subject_island_filtered": {
+ "acc,none": 0.3995837669094693,
+ "acc_stderr,none": 0.01580864017884119,
+ "alias": " - blimp_sentential_subject_island_filtered"
+ },
+ "blimp_superlative_quantifiers_1_filtered": {
+ "acc,none": 0.7048008171603677,
+ "acc_stderr,none": 0.014585500871595079,
+ "alias": " - blimp_superlative_quantifiers_1_filtered"
+ },
+ "blimp_superlative_quantifiers_2_filtered": {
+ "acc,none": 0.6622718052738337,
+ "acc_stderr,none": 0.015068973779750108,
+ "alias": " - blimp_superlative_quantifiers_2_filtered"
+ },
+ "blimp_tough_vs_raising_1_filtered": {
+ "acc,none": 0.43670886075949367,
+ "acc_stderr,none": 0.01611712121619132,
+ "alias": " - blimp_tough_vs_raising_1_filtered"
+ },
+ "blimp_tough_vs_raising_2_filtered": {
+ "acc,none": 0.558695652173913,
+ "acc_stderr,none": 0.01637943787858984,
+ "alias": " - blimp_tough_vs_raising_2_filtered"
+ },
+ "blimp_transitive_filtered": {
+ "acc,none": 0.6555299539170507,
+ "acc_stderr,none": 0.01613847350012204,
+ "alias": " - blimp_transitive_filtered"
+ },
+ "blimp_wh_island_filtered": {
+ "acc,none": 0.6614583333333334,
+ "acc_stderr,none": 0.015280867377794811,
+ "alias": " - blimp_wh_island_filtered"
+ },
+ "blimp_wh_questions_object_gap_filtered": {
+ "acc,none": 0.4842840512223516,
+ "acc_stderr,none": 0.017061284331003065,
+ "alias": " - blimp_wh_questions_object_gap_filtered"
+ },
+ "blimp_wh_questions_subject_gap_filtered": {
+ "acc,none": 0.8040089086859689,
+ "acc_stderr,none": 0.01325416505242688,
+ "alias": " - blimp_wh_questions_subject_gap_filtered"
+ },
+ "blimp_wh_questions_subject_gap_long_distance_filtered": {
+ "acc,none": 0.9474912485414235,
+ "acc_stderr,none": 0.007623713502535114,
+ "alias": " - blimp_wh_questions_subject_gap_long_distance_filtered"
+ },
+ "blimp_wh_vs_that_no_gap_filtered": {
+ "acc,none": 0.9732868757259001,
+ "acc_stderr,none": 0.005498364795567471,
+ "alias": " - blimp_wh_vs_that_no_gap_filtered"
+ },
+ "blimp_wh_vs_that_no_gap_long_distance_filtered": {
+ "acc,none": 0.9622857142857143,
+ "acc_stderr,none": 0.006443906738830874,
+ "alias": " - blimp_wh_vs_that_no_gap_long_distance_filtered"
+ },
+ "blimp_wh_vs_that_with_gap_filtered": {
+ "acc,none": 0.235038084874864,
+ "acc_stderr,none": 0.013994831894415981,
+ "alias": " - blimp_wh_vs_that_with_gap_filtered"
+ },
+ "blimp_wh_vs_that_with_gap_long_distance_filtered": {
+ "acc,none": 0.06593406593406594,
+ "acc_stderr,none": 0.008231173463940311,
+ "alias": " - blimp_wh_vs_that_with_gap_long_distance_filtered"
+ }
+ },
+ "groups": {
+ "blimp_supplement": {
+ "acc,none": 0.6018926304374915,
+ "acc_stderr,none": 0.0066155758006338295,
+ "alias": "blimp_supplement"
+ },
+ "blimp_filtered": {
+ "acc,none": 0.565949835634104,
+ "acc_stderr,none": 0.0018253226294923256,
+ "alias": "blimp_filtered"
+ }
+ },
+ "group_subtasks": {
+ "blimp_filtered": [
+ "blimp_wh_vs_that_with_gap_long_distance_filtered",
+ "blimp_wh_vs_that_with_gap_filtered",
+ "blimp_wh_vs_that_no_gap_long_distance_filtered",
+ "blimp_wh_vs_that_no_gap_filtered",
+ "blimp_wh_questions_subject_gap_long_distance_filtered",
+ "blimp_wh_questions_subject_gap_filtered",
+ "blimp_wh_questions_object_gap_filtered",
+ "blimp_wh_island_filtered",
+ "blimp_transitive_filtered",
+ "blimp_tough_vs_raising_2_filtered",
+ "blimp_tough_vs_raising_1_filtered",
+ "blimp_superlative_quantifiers_2_filtered",
+ "blimp_superlative_quantifiers_1_filtered",
+ "blimp_sentential_subject_island_filtered",
+ "blimp_sentential_negation_npi_scope_filtered",
+ "blimp_sentential_negation_npi_licensor_present_filtered",
+ "blimp_regular_plural_subject_verb_agreement_2_filtered",
+ "blimp_regular_plural_subject_verb_agreement_1_filtered",
+ "blimp_principle_A_reconstruction_filtered",
+ "blimp_principle_A_domain_3_filtered",
+ "blimp_principle_A_domain_2_filtered",
+ "blimp_principle_A_domain_1_filtered",
+ "blimp_principle_A_case_2_filtered",
+ "blimp_principle_A_case_1_filtered",
+ "blimp_principle_A_c_command_filtered",
+ "blimp_passive_2_filtered",
+ "blimp_passive_1_filtered",
+ "blimp_only_npi_scope_filtered",
+ "blimp_only_npi_licensor_present_filtered",
+ "blimp_npi_present_2_filtered",
+ "blimp_npi_present_1_filtered",
+ "blimp_matrix_question_npi_licensor_present_filtered",
+ "blimp_left_branch_island_simple_question_filtered",
+ "blimp_left_branch_island_echo_question_filtered",
+ "blimp_irregular_plural_subject_verb_agreement_2_filtered",
+ "blimp_irregular_plural_subject_verb_agreement_1_filtered",
+ "blimp_irregular_past_participle_verbs_filtered",
+ "blimp_irregular_past_participle_adjectives_filtered",
+ "blimp_intransitive_filtered",
+ "blimp_inchoative_filtered",
+ "blimp_expletive_it_object_raising_filtered",
+ "blimp_existential_there_subject_raising_filtered",
+ "blimp_existential_there_quantifiers_2_filtered",
+ "blimp_existential_there_quantifiers_1_filtered",
+ "blimp_existential_there_object_raising_filtered",
+ "blimp_ellipsis_n_bar_2_filtered",
+ "blimp_ellipsis_n_bar_1_filtered",
+ "blimp_drop_argument_filtered",
+ "blimp_distractor_agreement_relative_clause_filtered",
+ "blimp_distractor_agreement_relational_noun_filtered",
+ "blimp_determiner_noun_agreement_with_adjective_1_filtered",
+ "blimp_determiner_noun_agreement_with_adj_irregular_2_filtered",
+ "blimp_determiner_noun_agreement_with_adj_irregular_1_filtered",
+ "blimp_determiner_noun_agreement_with_adj_2_filtered",
+ "blimp_determiner_noun_agreement_irregular_2_filtered",
+ "blimp_determiner_noun_agreement_irregular_1_filtered",
+ "blimp_determiner_noun_agreement_2_filtered",
+ "blimp_determiner_noun_agreement_1_filtered",
+ "blimp_coordinate_structure_constraint_object_extraction_filtered",
+ "blimp_coordinate_structure_constraint_complex_left_branch_filtered",
+ "blimp_complex_NP_island_filtered",
+ "blimp_causative_filtered",
+ "blimp_animate_subject_trans_filtered",
+ "blimp_animate_subject_passive_filtered",
+ "blimp_anaphor_number_agreement_filtered",
+ "blimp_anaphor_gender_agreement_filtered",
+ "blimp_adjunct_island_filtered"
+ ],
+ "blimp_supplement": [
+ "blimp_supplement_turn_taking",
+ "blimp_supplement_subject_aux_inversion",
+ "blimp_supplement_qa_congruence_tricky",
+ "blimp_supplement_qa_congruence_easy",
+ "blimp_supplement_hypernym"
+ ]
+ },
+ "configs": {
+ "blimp_adjunct_island_filtered": {
+ "task": "blimp_adjunct_island_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/adjunct_island.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_anaphor_gender_agreement_filtered": {
+ "task": "blimp_anaphor_gender_agreement_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/anaphor_gender_agreement.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_anaphor_number_agreement_filtered": {
+ "task": "blimp_anaphor_number_agreement_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/anaphor_number_agreement.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_animate_subject_passive_filtered": {
+ "task": "blimp_animate_subject_passive_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/animate_subject_passive.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_animate_subject_trans_filtered": {
+ "task": "blimp_animate_subject_trans_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/animate_subject_trans.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_causative_filtered": {
+ "task": "blimp_causative_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/causative.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_complex_NP_island_filtered": {
+ "task": "blimp_complex_NP_island_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/complex_NP_island.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_coordinate_structure_constraint_complex_left_branch_filtered": {
+ "task": "blimp_coordinate_structure_constraint_complex_left_branch_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/coordinate_structure_constraint_complex_left_branch.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_coordinate_structure_constraint_object_extraction_filtered": {
+ "task": "blimp_coordinate_structure_constraint_object_extraction_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/coordinate_structure_constraint_object_extraction.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_1_filtered": {
+ "task": "blimp_determiner_noun_agreement_1_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_1.jsonl"
+ },
+ "validation_split": "train",
+ "doc_to_text": "",
+ "doc_to_target": 0,
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
+ "description": "",
+ "target_delimiter": " ",
+ "fewshot_delimiter": "\n\n",
+ "num_fewshot": 0,
+ "metric_list": [
+ {
+ "metric": "acc",
+ "weight_by_size": false
+ }
+ ],
+ "output_type": "multiple_choice",
+ "repeats": 1,
+ "should_decontaminate": true,
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
+ "metadata": {
+ "version": 1.0
+ }
+ },
+ "blimp_determiner_noun_agreement_2_filtered": {
+ "task": "blimp_determiner_noun_agreement_2_filtered",
+ "group": "blimp_filtered",
+ "dataset_path": "json",
+ "dataset_kwargs": {
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_2.jsonl"
761
+ },
762
+ "validation_split": "train",
763
+ "doc_to_text": "",
764
+ "doc_to_target": 0,
765
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
766
+ "description": "",
767
+ "target_delimiter": " ",
768
+ "fewshot_delimiter": "\n\n",
769
+ "num_fewshot": 0,
770
+ "metric_list": [
771
+ {
772
+ "metric": "acc",
773
+ "weight_by_size": false
774
+ }
775
+ ],
776
+ "output_type": "multiple_choice",
777
+ "repeats": 1,
778
+ "should_decontaminate": true,
779
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
780
+ "metadata": {
781
+ "version": 1.0
782
+ }
783
+ },
784
+ "blimp_determiner_noun_agreement_irregular_1_filtered": {
785
+ "task": "blimp_determiner_noun_agreement_irregular_1_filtered",
786
+ "group": "blimp_filtered",
787
+ "dataset_path": "json",
788
+ "dataset_kwargs": {
789
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_irregular_1.jsonl"
790
+ },
791
+ "validation_split": "train",
792
+ "doc_to_text": "",
793
+ "doc_to_target": 0,
794
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
795
+ "description": "",
796
+ "target_delimiter": " ",
797
+ "fewshot_delimiter": "\n\n",
798
+ "num_fewshot": 0,
799
+ "metric_list": [
800
+ {
801
+ "metric": "acc",
802
+ "weight_by_size": false
803
+ }
804
+ ],
805
+ "output_type": "multiple_choice",
806
+ "repeats": 1,
807
+ "should_decontaminate": true,
808
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
809
+ "metadata": {
810
+ "version": 1.0
811
+ }
812
+ },
813
+ "blimp_determiner_noun_agreement_irregular_2_filtered": {
814
+ "task": "blimp_determiner_noun_agreement_irregular_2_filtered",
815
+ "group": "blimp_filtered",
816
+ "dataset_path": "json",
817
+ "dataset_kwargs": {
818
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_irregular_2.jsonl"
819
+ },
820
+ "validation_split": "train",
821
+ "doc_to_text": "",
822
+ "doc_to_target": 0,
823
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
824
+ "description": "",
825
+ "target_delimiter": " ",
826
+ "fewshot_delimiter": "\n\n",
827
+ "num_fewshot": 0,
828
+ "metric_list": [
829
+ {
830
+ "metric": "acc",
831
+ "weight_by_size": false
832
+ }
833
+ ],
834
+ "output_type": "multiple_choice",
835
+ "repeats": 1,
836
+ "should_decontaminate": true,
837
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
838
+ "metadata": {
839
+ "version": 1.0
840
+ }
841
+ },
842
+ "blimp_determiner_noun_agreement_with_adj_2_filtered": {
843
+ "task": "blimp_determiner_noun_agreement_with_adj_2_filtered",
844
+ "group": "blimp_filtered",
845
+ "dataset_path": "json",
846
+ "dataset_kwargs": {
847
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_with_adj_2.jsonl"
848
+ },
849
+ "validation_split": "train",
850
+ "doc_to_text": "",
851
+ "doc_to_target": 0,
852
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
853
+ "description": "",
854
+ "target_delimiter": " ",
855
+ "fewshot_delimiter": "\n\n",
856
+ "num_fewshot": 0,
857
+ "metric_list": [
858
+ {
859
+ "metric": "acc",
860
+ "weight_by_size": false
861
+ }
862
+ ],
863
+ "output_type": "multiple_choice",
864
+ "repeats": 1,
865
+ "should_decontaminate": true,
866
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
867
+ "metadata": {
868
+ "version": 1.0
869
+ }
870
+ },
871
+ "blimp_determiner_noun_agreement_with_adj_irregular_1_filtered": {
872
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_1_filtered",
873
+ "group": "blimp_filtered",
874
+ "dataset_path": "json",
875
+ "dataset_kwargs": {
876
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_with_adj_irregular_1.jsonl"
877
+ },
878
+ "validation_split": "train",
879
+ "doc_to_text": "",
880
+ "doc_to_target": 0,
881
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
882
+ "description": "",
883
+ "target_delimiter": " ",
884
+ "fewshot_delimiter": "\n\n",
885
+ "num_fewshot": 0,
886
+ "metric_list": [
887
+ {
888
+ "metric": "acc",
889
+ "weight_by_size": false
890
+ }
891
+ ],
892
+ "output_type": "multiple_choice",
893
+ "repeats": 1,
894
+ "should_decontaminate": true,
895
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
896
+ "metadata": {
897
+ "version": 1.0
898
+ }
899
+ },
900
+ "blimp_determiner_noun_agreement_with_adj_irregular_2_filtered": {
901
+ "task": "blimp_determiner_noun_agreement_with_adj_irregular_2_filtered",
902
+ "group": "blimp_filtered",
903
+ "dataset_path": "json",
904
+ "dataset_kwargs": {
905
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_with_adj_irregular_2.jsonl"
906
+ },
907
+ "validation_split": "train",
908
+ "doc_to_text": "",
909
+ "doc_to_target": 0,
910
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
911
+ "description": "",
912
+ "target_delimiter": " ",
913
+ "fewshot_delimiter": "\n\n",
914
+ "num_fewshot": 0,
915
+ "metric_list": [
916
+ {
917
+ "metric": "acc",
918
+ "weight_by_size": false
919
+ }
920
+ ],
921
+ "output_type": "multiple_choice",
922
+ "repeats": 1,
923
+ "should_decontaminate": true,
924
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
925
+ "metadata": {
926
+ "version": 1.0
927
+ }
928
+ },
929
+ "blimp_determiner_noun_agreement_with_adjective_1_filtered": {
930
+ "task": "blimp_determiner_noun_agreement_with_adjective_1_filtered",
931
+ "group": "blimp_filtered",
932
+ "dataset_path": "json",
933
+ "dataset_kwargs": {
934
+ "data_files": "evaluation_data/blimp_filtered/determiner_noun_agreement_with_adjective_1.jsonl"
935
+ },
936
+ "validation_split": "train",
937
+ "doc_to_text": "",
938
+ "doc_to_target": 0,
939
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
940
+ "description": "",
941
+ "target_delimiter": " ",
942
+ "fewshot_delimiter": "\n\n",
943
+ "num_fewshot": 0,
944
+ "metric_list": [
945
+ {
946
+ "metric": "acc",
947
+ "weight_by_size": false
948
+ }
949
+ ],
950
+ "output_type": "multiple_choice",
951
+ "repeats": 1,
952
+ "should_decontaminate": true,
953
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
954
+ "metadata": {
955
+ "version": 1.0
956
+ }
957
+ },
958
+ "blimp_distractor_agreement_relational_noun_filtered": {
959
+ "task": "blimp_distractor_agreement_relational_noun_filtered",
960
+ "group": "blimp_filtered",
961
+ "dataset_path": "json",
962
+ "dataset_kwargs": {
963
+ "data_files": "evaluation_data/blimp_filtered/distractor_agreement_relational_noun.jsonl"
964
+ },
965
+ "validation_split": "train",
966
+ "doc_to_text": "",
967
+ "doc_to_target": 0,
968
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
969
+ "description": "",
970
+ "target_delimiter": " ",
971
+ "fewshot_delimiter": "\n\n",
972
+ "num_fewshot": 0,
973
+ "metric_list": [
974
+ {
975
+ "metric": "acc",
976
+ "weight_by_size": false
977
+ }
978
+ ],
979
+ "output_type": "multiple_choice",
980
+ "repeats": 1,
981
+ "should_decontaminate": true,
982
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
983
+ "metadata": {
984
+ "version": 1.0
985
+ }
986
+ },
987
+ "blimp_distractor_agreement_relative_clause_filtered": {
988
+ "task": "blimp_distractor_agreement_relative_clause_filtered",
989
+ "group": "blimp_filtered",
990
+ "dataset_path": "json",
991
+ "dataset_kwargs": {
992
+ "data_files": "evaluation_data/blimp_filtered/distractor_agreement_relative_clause.jsonl"
993
+ },
994
+ "validation_split": "train",
995
+ "doc_to_text": "",
996
+ "doc_to_target": 0,
997
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
998
+ "description": "",
999
+ "target_delimiter": " ",
1000
+ "fewshot_delimiter": "\n\n",
1001
+ "num_fewshot": 0,
1002
+ "metric_list": [
1003
+ {
1004
+ "metric": "acc",
1005
+ "weight_by_size": false
1006
+ }
1007
+ ],
1008
+ "output_type": "multiple_choice",
1009
+ "repeats": 1,
1010
+ "should_decontaminate": true,
1011
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1012
+ "metadata": {
1013
+ "version": 1.0
1014
+ }
1015
+ },
1016
+ "blimp_drop_argument_filtered": {
1017
+ "task": "blimp_drop_argument_filtered",
1018
+ "group": "blimp_filtered",
1019
+ "dataset_path": "json",
1020
+ "dataset_kwargs": {
1021
+ "data_files": "evaluation_data/blimp_filtered/drop_argument.jsonl"
1022
+ },
1023
+ "validation_split": "train",
1024
+ "doc_to_text": "",
1025
+ "doc_to_target": 0,
1026
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1027
+ "description": "",
1028
+ "target_delimiter": " ",
1029
+ "fewshot_delimiter": "\n\n",
1030
+ "num_fewshot": 0,
1031
+ "metric_list": [
1032
+ {
1033
+ "metric": "acc",
1034
+ "weight_by_size": false
1035
+ }
1036
+ ],
1037
+ "output_type": "multiple_choice",
1038
+ "repeats": 1,
1039
+ "should_decontaminate": true,
1040
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1041
+ "metadata": {
1042
+ "version": 1.0
1043
+ }
1044
+ },
1045
+ "blimp_ellipsis_n_bar_1_filtered": {
1046
+ "task": "blimp_ellipsis_n_bar_1_filtered",
1047
+ "group": "blimp_filtered",
1048
+ "dataset_path": "json",
1049
+ "dataset_kwargs": {
1050
+ "data_files": "evaluation_data/blimp_filtered/ellipsis_n_bar_1.jsonl"
1051
+ },
1052
+ "validation_split": "train",
1053
+ "doc_to_text": "",
1054
+ "doc_to_target": 0,
1055
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1056
+ "description": "",
1057
+ "target_delimiter": " ",
1058
+ "fewshot_delimiter": "\n\n",
1059
+ "num_fewshot": 0,
1060
+ "metric_list": [
1061
+ {
1062
+ "metric": "acc",
1063
+ "weight_by_size": false
1064
+ }
1065
+ ],
1066
+ "output_type": "multiple_choice",
1067
+ "repeats": 1,
1068
+ "should_decontaminate": true,
1069
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1070
+ "metadata": {
1071
+ "version": 1.0
1072
+ }
1073
+ },
1074
+ "blimp_ellipsis_n_bar_2_filtered": {
1075
+ "task": "blimp_ellipsis_n_bar_2_filtered",
1076
+ "group": "blimp_filtered",
1077
+ "dataset_path": "json",
1078
+ "dataset_kwargs": {
1079
+ "data_files": "evaluation_data/blimp_filtered/ellipsis_n_bar_2.jsonl"
1080
+ },
1081
+ "validation_split": "train",
1082
+ "doc_to_text": "",
1083
+ "doc_to_target": 0,
1084
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1085
+ "description": "",
1086
+ "target_delimiter": " ",
1087
+ "fewshot_delimiter": "\n\n",
1088
+ "num_fewshot": 0,
1089
+ "metric_list": [
1090
+ {
1091
+ "metric": "acc",
1092
+ "weight_by_size": false
1093
+ }
1094
+ ],
1095
+ "output_type": "multiple_choice",
1096
+ "repeats": 1,
1097
+ "should_decontaminate": true,
1098
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1099
+ "metadata": {
1100
+ "version": 1.0
1101
+ }
1102
+ },
1103
+ "blimp_existential_there_object_raising_filtered": {
1104
+ "task": "blimp_existential_there_object_raising_filtered",
1105
+ "group": "blimp_filtered",
1106
+ "dataset_path": "json",
1107
+ "dataset_kwargs": {
1108
+ "data_files": "evaluation_data/blimp_filtered/existential_there_object_raising.jsonl"
1109
+ },
1110
+ "validation_split": "train",
1111
+ "doc_to_text": "",
1112
+ "doc_to_target": 0,
1113
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1114
+ "description": "",
1115
+ "target_delimiter": " ",
1116
+ "fewshot_delimiter": "\n\n",
1117
+ "num_fewshot": 0,
1118
+ "metric_list": [
1119
+ {
1120
+ "metric": "acc",
1121
+ "weight_by_size": false
1122
+ }
1123
+ ],
1124
+ "output_type": "multiple_choice",
1125
+ "repeats": 1,
1126
+ "should_decontaminate": true,
1127
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1128
+ "metadata": {
1129
+ "version": 1.0
1130
+ }
1131
+ },
1132
+ "blimp_existential_there_quantifiers_1_filtered": {
1133
+ "task": "blimp_existential_there_quantifiers_1_filtered",
1134
+ "group": "blimp_filtered",
1135
+ "dataset_path": "json",
1136
+ "dataset_kwargs": {
1137
+ "data_files": "evaluation_data/blimp_filtered/existential_there_quantifiers_1.jsonl"
1138
+ },
1139
+ "validation_split": "train",
1140
+ "doc_to_text": "",
1141
+ "doc_to_target": 0,
1142
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1143
+ "description": "",
1144
+ "target_delimiter": " ",
1145
+ "fewshot_delimiter": "\n\n",
1146
+ "num_fewshot": 0,
1147
+ "metric_list": [
1148
+ {
1149
+ "metric": "acc",
1150
+ "weight_by_size": false
1151
+ }
1152
+ ],
1153
+ "output_type": "multiple_choice",
1154
+ "repeats": 1,
1155
+ "should_decontaminate": true,
1156
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1157
+ "metadata": {
1158
+ "version": 1.0
1159
+ }
1160
+ },
1161
+ "blimp_existential_there_quantifiers_2_filtered": {
1162
+ "task": "blimp_existential_there_quantifiers_2_filtered",
1163
+ "group": "blimp_filtered",
1164
+ "dataset_path": "json",
1165
+ "dataset_kwargs": {
1166
+ "data_files": "evaluation_data/blimp_filtered/existential_there_quantifiers_2.jsonl"
1167
+ },
1168
+ "validation_split": "train",
1169
+ "doc_to_text": "",
1170
+ "doc_to_target": 0,
1171
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1172
+ "description": "",
1173
+ "target_delimiter": " ",
1174
+ "fewshot_delimiter": "\n\n",
1175
+ "num_fewshot": 0,
1176
+ "metric_list": [
1177
+ {
1178
+ "metric": "acc",
1179
+ "weight_by_size": false
1180
+ }
1181
+ ],
1182
+ "output_type": "multiple_choice",
1183
+ "repeats": 1,
1184
+ "should_decontaminate": true,
1185
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1186
+ "metadata": {
1187
+ "version": 1.0
1188
+ }
1189
+ },
1190
+ "blimp_existential_there_subject_raising_filtered": {
1191
+ "task": "blimp_existential_there_subject_raising_filtered",
1192
+ "group": "blimp_filtered",
1193
+ "dataset_path": "json",
1194
+ "dataset_kwargs": {
1195
+ "data_files": "evaluation_data/blimp_filtered/existential_there_subject_raising.jsonl"
1196
+ },
1197
+ "validation_split": "train",
1198
+ "doc_to_text": "",
1199
+ "doc_to_target": 0,
1200
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1201
+ "description": "",
1202
+ "target_delimiter": " ",
1203
+ "fewshot_delimiter": "\n\n",
1204
+ "num_fewshot": 0,
1205
+ "metric_list": [
1206
+ {
1207
+ "metric": "acc",
1208
+ "weight_by_size": false
1209
+ }
1210
+ ],
1211
+ "output_type": "multiple_choice",
1212
+ "repeats": 1,
1213
+ "should_decontaminate": true,
1214
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1215
+ "metadata": {
1216
+ "version": 1.0
1217
+ }
1218
+ },
1219
+ "blimp_expletive_it_object_raising_filtered": {
1220
+ "task": "blimp_expletive_it_object_raising_filtered",
1221
+ "group": "blimp_filtered",
1222
+ "dataset_path": "json",
1223
+ "dataset_kwargs": {
1224
+ "data_files": "evaluation_data/blimp_filtered/expletive_it_object_raising.jsonl"
1225
+ },
1226
+ "validation_split": "train",
1227
+ "doc_to_text": "",
1228
+ "doc_to_target": 0,
1229
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1230
+ "description": "",
1231
+ "target_delimiter": " ",
1232
+ "fewshot_delimiter": "\n\n",
1233
+ "num_fewshot": 0,
1234
+ "metric_list": [
1235
+ {
1236
+ "metric": "acc",
1237
+ "weight_by_size": false
1238
+ }
1239
+ ],
1240
+ "output_type": "multiple_choice",
1241
+ "repeats": 1,
1242
+ "should_decontaminate": true,
1243
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1244
+ "metadata": {
1245
+ "version": 1.0
1246
+ }
1247
+ },
1248
+ "blimp_inchoative_filtered": {
1249
+ "task": "blimp_inchoative_filtered",
1250
+ "group": "blimp_filtered",
1251
+ "dataset_path": "json",
1252
+ "dataset_kwargs": {
1253
+ "data_files": "evaluation_data/blimp_filtered/inchoative.jsonl"
1254
+ },
1255
+ "validation_split": "train",
1256
+ "doc_to_text": "",
1257
+ "doc_to_target": 0,
1258
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1259
+ "description": "",
1260
+ "target_delimiter": " ",
1261
+ "fewshot_delimiter": "\n\n",
1262
+ "num_fewshot": 0,
1263
+ "metric_list": [
1264
+ {
1265
+ "metric": "acc",
1266
+ "weight_by_size": false
1267
+ }
1268
+ ],
1269
+ "output_type": "multiple_choice",
1270
+ "repeats": 1,
1271
+ "should_decontaminate": true,
1272
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1273
+ "metadata": {
1274
+ "version": 1.0
1275
+ }
1276
+ },
1277
+ "blimp_intransitive_filtered": {
1278
+ "task": "blimp_intransitive_filtered",
1279
+ "group": "blimp_filtered",
1280
+ "dataset_path": "json",
1281
+ "dataset_kwargs": {
1282
+ "data_files": "evaluation_data/blimp_filtered/intransitive.jsonl"
1283
+ },
1284
+ "validation_split": "train",
1285
+ "doc_to_text": "",
1286
+ "doc_to_target": 0,
1287
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1288
+ "description": "",
1289
+ "target_delimiter": " ",
1290
+ "fewshot_delimiter": "\n\n",
1291
+ "num_fewshot": 0,
1292
+ "metric_list": [
1293
+ {
1294
+ "metric": "acc",
1295
+ "weight_by_size": false
1296
+ }
1297
+ ],
1298
+ "output_type": "multiple_choice",
1299
+ "repeats": 1,
1300
+ "should_decontaminate": true,
1301
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1302
+ "metadata": {
1303
+ "version": 1.0
1304
+ }
1305
+ },
1306
+ "blimp_irregular_past_participle_adjectives_filtered": {
1307
+ "task": "blimp_irregular_past_participle_adjectives_filtered",
1308
+ "group": "blimp_filtered",
1309
+ "dataset_path": "json",
1310
+ "dataset_kwargs": {
1311
+ "data_files": "evaluation_data/blimp_filtered/irregular_past_participle_adjectives.jsonl"
1312
+ },
1313
+ "validation_split": "train",
1314
+ "doc_to_text": "",
1315
+ "doc_to_target": 0,
1316
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1317
+ "description": "",
1318
+ "target_delimiter": " ",
1319
+ "fewshot_delimiter": "\n\n",
1320
+ "num_fewshot": 0,
1321
+ "metric_list": [
1322
+ {
1323
+ "metric": "acc",
1324
+ "weight_by_size": false
1325
+ }
1326
+ ],
1327
+ "output_type": "multiple_choice",
1328
+ "repeats": 1,
1329
+ "should_decontaminate": true,
1330
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1331
+ "metadata": {
1332
+ "version": 1.0
1333
+ }
1334
+ },
1335
+ "blimp_irregular_past_participle_verbs_filtered": {
1336
+ "task": "blimp_irregular_past_participle_verbs_filtered",
1337
+ "group": "blimp_filtered",
1338
+ "dataset_path": "json",
1339
+ "dataset_kwargs": {
1340
+ "data_files": "evaluation_data/blimp_filtered/irregular_past_participle_verbs.jsonl"
1341
+ },
1342
+ "validation_split": "train",
1343
+ "doc_to_text": "",
1344
+ "doc_to_target": 0,
1345
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1346
+ "description": "",
1347
+ "target_delimiter": " ",
1348
+ "fewshot_delimiter": "\n\n",
1349
+ "num_fewshot": 0,
1350
+ "metric_list": [
1351
+ {
1352
+ "metric": "acc",
1353
+ "weight_by_size": false
1354
+ }
1355
+ ],
1356
+ "output_type": "multiple_choice",
1357
+ "repeats": 1,
1358
+ "should_decontaminate": true,
1359
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1360
+ "metadata": {
1361
+ "version": 1.0
1362
+ }
1363
+ },
1364
+ "blimp_irregular_plural_subject_verb_agreement_1_filtered": {
1365
+ "task": "blimp_irregular_plural_subject_verb_agreement_1_filtered",
1366
+ "group": "blimp_filtered",
1367
+ "dataset_path": "json",
1368
+ "dataset_kwargs": {
1369
+ "data_files": "evaluation_data/blimp_filtered/irregular_plural_subject_verb_agreement_1.jsonl"
1370
+ },
1371
+ "validation_split": "train",
1372
+ "doc_to_text": "",
1373
+ "doc_to_target": 0,
1374
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1375
+ "description": "",
1376
+ "target_delimiter": " ",
1377
+ "fewshot_delimiter": "\n\n",
1378
+ "num_fewshot": 0,
1379
+ "metric_list": [
1380
+ {
1381
+ "metric": "acc",
1382
+ "weight_by_size": false
1383
+ }
1384
+ ],
1385
+ "output_type": "multiple_choice",
1386
+ "repeats": 1,
1387
+ "should_decontaminate": true,
1388
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1389
+ "metadata": {
1390
+ "version": 1.0
1391
+ }
1392
+ },
1393
+ "blimp_irregular_plural_subject_verb_agreement_2_filtered": {
1394
+ "task": "blimp_irregular_plural_subject_verb_agreement_2_filtered",
1395
+ "group": "blimp_filtered",
1396
+ "dataset_path": "json",
1397
+ "dataset_kwargs": {
1398
+ "data_files": "evaluation_data/blimp_filtered/irregular_plural_subject_verb_agreement_2.jsonl"
1399
+ },
1400
+ "validation_split": "train",
1401
+ "doc_to_text": "",
1402
+ "doc_to_target": 0,
1403
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1404
+ "description": "",
1405
+ "target_delimiter": " ",
1406
+ "fewshot_delimiter": "\n\n",
1407
+ "num_fewshot": 0,
1408
+ "metric_list": [
1409
+ {
1410
+ "metric": "acc",
1411
+ "weight_by_size": false
1412
+ }
1413
+ ],
1414
+ "output_type": "multiple_choice",
1415
+ "repeats": 1,
1416
+ "should_decontaminate": true,
1417
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1418
+ "metadata": {
1419
+ "version": 1.0
1420
+ }
1421
+ },
1422
+ "blimp_left_branch_island_echo_question_filtered": {
1423
+ "task": "blimp_left_branch_island_echo_question_filtered",
1424
+ "group": "blimp_filtered",
1425
+ "dataset_path": "json",
1426
+ "dataset_kwargs": {
1427
+ "data_files": "evaluation_data/blimp_filtered/left_branch_island_echo_question.jsonl"
1428
+ },
1429
+ "validation_split": "train",
1430
+ "doc_to_text": "",
1431
+ "doc_to_target": 0,
1432
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1433
+ "description": "",
1434
+ "target_delimiter": " ",
1435
+ "fewshot_delimiter": "\n\n",
1436
+ "num_fewshot": 0,
1437
+ "metric_list": [
1438
+ {
1439
+ "metric": "acc",
1440
+ "weight_by_size": false
1441
+ }
1442
+ ],
1443
+ "output_type": "multiple_choice",
1444
+ "repeats": 1,
1445
+ "should_decontaminate": true,
1446
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1447
+ "metadata": {
1448
+ "version": 1.0
1449
+ }
1450
+ },
1451
+ "blimp_left_branch_island_simple_question_filtered": {
1452
+ "task": "blimp_left_branch_island_simple_question_filtered",
1453
+ "group": "blimp_filtered",
1454
+ "dataset_path": "json",
1455
+ "dataset_kwargs": {
1456
+ "data_files": "evaluation_data/blimp_filtered/left_branch_island_simple_question.jsonl"
1457
+ },
1458
+ "validation_split": "train",
1459
+ "doc_to_text": "",
1460
+ "doc_to_target": 0,
1461
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1462
+ "description": "",
1463
+ "target_delimiter": " ",
1464
+ "fewshot_delimiter": "\n\n",
1465
+ "num_fewshot": 0,
1466
+ "metric_list": [
1467
+ {
1468
+ "metric": "acc",
1469
+ "weight_by_size": false
1470
+ }
1471
+ ],
1472
+ "output_type": "multiple_choice",
1473
+ "repeats": 1,
1474
+ "should_decontaminate": true,
1475
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1476
+ "metadata": {
1477
+ "version": 1.0
1478
+ }
1479
+ },
1480
+ "blimp_matrix_question_npi_licensor_present_filtered": {
1481
+ "task": "blimp_matrix_question_npi_licensor_present_filtered",
1482
+ "group": "blimp_filtered",
1483
+ "dataset_path": "json",
1484
+ "dataset_kwargs": {
1485
+ "data_files": "evaluation_data/blimp_filtered/matrix_question_npi_licensor_present.jsonl"
1486
+ },
1487
+ "validation_split": "train",
1488
+ "doc_to_text": "",
1489
+ "doc_to_target": 0,
1490
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1491
+ "description": "",
1492
+ "target_delimiter": " ",
1493
+ "fewshot_delimiter": "\n\n",
1494
+ "num_fewshot": 0,
1495
+ "metric_list": [
1496
+ {
1497
+ "metric": "acc",
1498
+ "weight_by_size": false
1499
+ }
1500
+ ],
1501
+ "output_type": "multiple_choice",
1502
+ "repeats": 1,
1503
+ "should_decontaminate": true,
1504
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1505
+ "metadata": {
1506
+ "version": 1.0
1507
+ }
1508
+ },
1509
+ "blimp_npi_present_1_filtered": {
1510
+ "task": "blimp_npi_present_1_filtered",
1511
+ "group": "blimp_filtered",
1512
+ "dataset_path": "json",
1513
+ "dataset_kwargs": {
1514
+ "data_files": "evaluation_data/blimp_filtered/npi_present_1.jsonl"
1515
+ },
1516
+ "validation_split": "train",
1517
+ "doc_to_text": "",
1518
+ "doc_to_target": 0,
1519
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1520
+ "description": "",
1521
+ "target_delimiter": " ",
1522
+ "fewshot_delimiter": "\n\n",
1523
+ "num_fewshot": 0,
1524
+ "metric_list": [
1525
+ {
1526
+ "metric": "acc",
1527
+ "weight_by_size": false
1528
+ }
1529
+ ],
1530
+ "output_type": "multiple_choice",
1531
+ "repeats": 1,
1532
+ "should_decontaminate": true,
1533
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1534
+ "metadata": {
1535
+ "version": 1.0
1536
+ }
1537
+ },
1538
+ "blimp_npi_present_2_filtered": {
1539
+ "task": "blimp_npi_present_2_filtered",
1540
+ "group": "blimp_filtered",
1541
+ "dataset_path": "json",
1542
+ "dataset_kwargs": {
1543
+ "data_files": "evaluation_data/blimp_filtered/npi_present_2.jsonl"
1544
+ },
1545
+ "validation_split": "train",
1546
+ "doc_to_text": "",
1547
+ "doc_to_target": 0,
1548
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1549
+ "description": "",
1550
+ "target_delimiter": " ",
1551
+ "fewshot_delimiter": "\n\n",
1552
+ "num_fewshot": 0,
1553
+ "metric_list": [
1554
+ {
1555
+ "metric": "acc",
1556
+ "weight_by_size": false
1557
+ }
1558
+ ],
1559
+ "output_type": "multiple_choice",
1560
+ "repeats": 1,
1561
+ "should_decontaminate": true,
1562
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1563
+ "metadata": {
1564
+ "version": 1.0
1565
+ }
1566
+ },
1567
+ "blimp_only_npi_licensor_present_filtered": {
1568
+ "task": "blimp_only_npi_licensor_present_filtered",
1569
+ "group": "blimp_filtered",
1570
+ "dataset_path": "json",
1571
+ "dataset_kwargs": {
1572
+ "data_files": "evaluation_data/blimp_filtered/only_npi_licensor_present.jsonl"
1573
+ },
1574
+ "validation_split": "train",
1575
+ "doc_to_text": "",
1576
+ "doc_to_target": 0,
1577
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1578
+ "description": "",
1579
+ "target_delimiter": " ",
1580
+ "fewshot_delimiter": "\n\n",
1581
+ "num_fewshot": 0,
1582
+ "metric_list": [
1583
+ {
1584
+ "metric": "acc",
1585
+ "weight_by_size": false
1586
+ }
1587
+ ],
1588
+ "output_type": "multiple_choice",
1589
+ "repeats": 1,
1590
+ "should_decontaminate": true,
1591
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1592
+ "metadata": {
1593
+ "version": 1.0
1594
+ }
1595
+ },
1596
+ "blimp_only_npi_scope_filtered": {
1597
+ "task": "blimp_only_npi_scope_filtered",
1598
+ "group": "blimp_filtered",
1599
+ "dataset_path": "json",
1600
+ "dataset_kwargs": {
1601
+ "data_files": "evaluation_data/blimp_filtered/only_npi_scope.jsonl"
1602
+ },
1603
+ "validation_split": "train",
1604
+ "doc_to_text": "",
1605
+ "doc_to_target": 0,
1606
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1607
+ "description": "",
1608
+ "target_delimiter": " ",
1609
+ "fewshot_delimiter": "\n\n",
1610
+ "num_fewshot": 0,
1611
+ "metric_list": [
1612
+ {
1613
+ "metric": "acc",
1614
+ "weight_by_size": false
1615
+ }
1616
+ ],
1617
+ "output_type": "multiple_choice",
1618
+ "repeats": 1,
1619
+ "should_decontaminate": true,
1620
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1621
+ "metadata": {
1622
+ "version": 1.0
1623
+ }
1624
+ },
1625
+ "blimp_passive_1_filtered": {
1626
+ "task": "blimp_passive_1_filtered",
1627
+ "group": "blimp_filtered",
1628
+ "dataset_path": "json",
1629
+ "dataset_kwargs": {
1630
+ "data_files": "evaluation_data/blimp_filtered/passive_1.jsonl"
1631
+ },
1632
+ "validation_split": "train",
1633
+ "doc_to_text": "",
1634
+ "doc_to_target": 0,
1635
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1636
+ "description": "",
1637
+ "target_delimiter": " ",
1638
+ "fewshot_delimiter": "\n\n",
1639
+ "num_fewshot": 0,
1640
+ "metric_list": [
1641
+ {
1642
+ "metric": "acc",
1643
+ "weight_by_size": false
1644
+ }
1645
+ ],
1646
+ "output_type": "multiple_choice",
1647
+ "repeats": 1,
1648
+ "should_decontaminate": true,
1649
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1650
+ "metadata": {
1651
+ "version": 1.0
1652
+ }
1653
+ },
1654
+ "blimp_passive_2_filtered": {
1655
+ "task": "blimp_passive_2_filtered",
1656
+ "group": "blimp_filtered",
1657
+ "dataset_path": "json",
1658
+ "dataset_kwargs": {
1659
+ "data_files": "evaluation_data/blimp_filtered/passive_2.jsonl"
1660
+ },
1661
+ "validation_split": "train",
1662
+ "doc_to_text": "",
1663
+ "doc_to_target": 0,
1664
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1665
+ "description": "",
1666
+ "target_delimiter": " ",
1667
+ "fewshot_delimiter": "\n\n",
1668
+ "num_fewshot": 0,
1669
+ "metric_list": [
1670
+ {
1671
+ "metric": "acc",
1672
+ "weight_by_size": false
1673
+ }
1674
+ ],
1675
+ "output_type": "multiple_choice",
1676
+ "repeats": 1,
1677
+ "should_decontaminate": true,
1678
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1679
+ "metadata": {
1680
+ "version": 1.0
1681
+ }
1682
+ },
1683
+ "blimp_principle_A_c_command_filtered": {
1684
+ "task": "blimp_principle_A_c_command_filtered",
1685
+ "group": "blimp_filtered",
1686
+ "dataset_path": "json",
1687
+ "dataset_kwargs": {
1688
+ "data_files": "evaluation_data/blimp_filtered/principle_A_c_command.jsonl"
1689
+ },
1690
+ "validation_split": "train",
1691
+ "doc_to_text": "",
1692
+ "doc_to_target": 0,
1693
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1694
+ "description": "",
1695
+ "target_delimiter": " ",
1696
+ "fewshot_delimiter": "\n\n",
1697
+ "num_fewshot": 0,
1698
+ "metric_list": [
1699
+ {
1700
+ "metric": "acc",
1701
+ "weight_by_size": false
1702
+ }
1703
+ ],
1704
+ "output_type": "multiple_choice",
1705
+ "repeats": 1,
1706
+ "should_decontaminate": true,
1707
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1708
+ "metadata": {
1709
+ "version": 1.0
1710
+ }
1711
+ },
1712
+ "blimp_principle_A_case_1_filtered": {
1713
+ "task": "blimp_principle_A_case_1_filtered",
1714
+ "group": "blimp_filtered",
1715
+ "dataset_path": "json",
1716
+ "dataset_kwargs": {
1717
+ "data_files": "evaluation_data/blimp_filtered/principle_A_case_1.jsonl"
1718
+ },
1719
+ "validation_split": "train",
1720
+ "doc_to_text": "",
1721
+ "doc_to_target": 0,
1722
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1723
+ "description": "",
1724
+ "target_delimiter": " ",
1725
+ "fewshot_delimiter": "\n\n",
1726
+ "num_fewshot": 0,
1727
+ "metric_list": [
1728
+ {
1729
+ "metric": "acc",
1730
+ "weight_by_size": false
1731
+ }
1732
+ ],
1733
+ "output_type": "multiple_choice",
1734
+ "repeats": 1,
1735
+ "should_decontaminate": true,
1736
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1737
+ "metadata": {
1738
+ "version": 1.0
1739
+ }
1740
+ },
1741
+ "blimp_principle_A_case_2_filtered": {
1742
+ "task": "blimp_principle_A_case_2_filtered",
1743
+ "group": "blimp_filtered",
1744
+ "dataset_path": "json",
1745
+ "dataset_kwargs": {
1746
+ "data_files": "evaluation_data/blimp_filtered/principle_A_case_2.jsonl"
1747
+ },
1748
+ "validation_split": "train",
1749
+ "doc_to_text": "",
1750
+ "doc_to_target": 0,
1751
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1752
+ "description": "",
1753
+ "target_delimiter": " ",
1754
+ "fewshot_delimiter": "\n\n",
1755
+ "num_fewshot": 0,
1756
+ "metric_list": [
1757
+ {
1758
+ "metric": "acc",
1759
+ "weight_by_size": false
1760
+ }
1761
+ ],
1762
+ "output_type": "multiple_choice",
1763
+ "repeats": 1,
1764
+ "should_decontaminate": true,
1765
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1766
+ "metadata": {
1767
+ "version": 1.0
1768
+ }
1769
+ },
1770
+ "blimp_principle_A_domain_1_filtered": {
1771
+ "task": "blimp_principle_A_domain_1_filtered",
1772
+ "group": "blimp_filtered",
1773
+ "dataset_path": "json",
1774
+ "dataset_kwargs": {
1775
+ "data_files": "evaluation_data/blimp_filtered/principle_A_domain_1.jsonl"
1776
+ },
1777
+ "validation_split": "train",
1778
+ "doc_to_text": "",
1779
+ "doc_to_target": 0,
1780
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1781
+ "description": "",
1782
+ "target_delimiter": " ",
1783
+ "fewshot_delimiter": "\n\n",
1784
+ "num_fewshot": 0,
1785
+ "metric_list": [
1786
+ {
1787
+ "metric": "acc",
1788
+ "weight_by_size": false
1789
+ }
1790
+ ],
1791
+ "output_type": "multiple_choice",
1792
+ "repeats": 1,
1793
+ "should_decontaminate": true,
1794
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1795
+ "metadata": {
1796
+ "version": 1.0
1797
+ }
1798
+ },
1799
+ "blimp_principle_A_domain_2_filtered": {
1800
+ "task": "blimp_principle_A_domain_2_filtered",
1801
+ "group": "blimp_filtered",
1802
+ "dataset_path": "json",
1803
+ "dataset_kwargs": {
1804
+ "data_files": "evaluation_data/blimp_filtered/principle_A_domain_2.jsonl"
1805
+ },
1806
+ "validation_split": "train",
1807
+ "doc_to_text": "",
1808
+ "doc_to_target": 0,
1809
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1810
+ "description": "",
1811
+ "target_delimiter": " ",
1812
+ "fewshot_delimiter": "\n\n",
1813
+ "num_fewshot": 0,
1814
+ "metric_list": [
1815
+ {
1816
+ "metric": "acc",
1817
+ "weight_by_size": false
1818
+ }
1819
+ ],
1820
+ "output_type": "multiple_choice",
1821
+ "repeats": 1,
1822
+ "should_decontaminate": true,
1823
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1824
+ "metadata": {
1825
+ "version": 1.0
1826
+ }
1827
+ },
1828
+ "blimp_principle_A_domain_3_filtered": {
1829
+ "task": "blimp_principle_A_domain_3_filtered",
1830
+ "group": "blimp_filtered",
1831
+ "dataset_path": "json",
1832
+ "dataset_kwargs": {
1833
+ "data_files": "evaluation_data/blimp_filtered/principle_A_domain_3.jsonl"
1834
+ },
1835
+ "validation_split": "train",
1836
+ "doc_to_text": "",
1837
+ "doc_to_target": 0,
1838
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1839
+ "description": "",
1840
+ "target_delimiter": " ",
1841
+ "fewshot_delimiter": "\n\n",
1842
+ "num_fewshot": 0,
1843
+ "metric_list": [
1844
+ {
1845
+ "metric": "acc",
1846
+ "weight_by_size": false
1847
+ }
1848
+ ],
1849
+ "output_type": "multiple_choice",
1850
+ "repeats": 1,
1851
+ "should_decontaminate": true,
1852
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1853
+ "metadata": {
1854
+ "version": 1.0
1855
+ }
1856
+ },
1857
+ "blimp_principle_A_reconstruction_filtered": {
1858
+ "task": "blimp_principle_A_reconstruction_filtered",
1859
+ "group": "blimp_filtered",
1860
+ "dataset_path": "json",
1861
+ "dataset_kwargs": {
1862
+ "data_files": "evaluation_data/blimp_filtered/principle_A_reconstruction.jsonl"
1863
+ },
1864
+ "validation_split": "train",
1865
+ "doc_to_text": "",
1866
+ "doc_to_target": 0,
1867
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1868
+ "description": "",
1869
+ "target_delimiter": " ",
1870
+ "fewshot_delimiter": "\n\n",
1871
+ "num_fewshot": 0,
1872
+ "metric_list": [
1873
+ {
1874
+ "metric": "acc",
1875
+ "weight_by_size": false
1876
+ }
1877
+ ],
1878
+ "output_type": "multiple_choice",
1879
+ "repeats": 1,
1880
+ "should_decontaminate": true,
1881
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1882
+ "metadata": {
1883
+ "version": 1.0
1884
+ }
1885
+ },
1886
+ "blimp_regular_plural_subject_verb_agreement_1_filtered": {
1887
+ "task": "blimp_regular_plural_subject_verb_agreement_1_filtered",
1888
+ "group": "blimp_filtered",
1889
+ "dataset_path": "json",
1890
+ "dataset_kwargs": {
1891
+ "data_files": "evaluation_data/blimp_filtered/regular_plural_subject_verb_agreement_1.jsonl"
1892
+ },
1893
+ "validation_split": "train",
1894
+ "doc_to_text": "",
1895
+ "doc_to_target": 0,
1896
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1897
+ "description": "",
1898
+ "target_delimiter": " ",
1899
+ "fewshot_delimiter": "\n\n",
1900
+ "num_fewshot": 0,
1901
+ "metric_list": [
1902
+ {
1903
+ "metric": "acc",
1904
+ "weight_by_size": false
1905
+ }
1906
+ ],
1907
+ "output_type": "multiple_choice",
1908
+ "repeats": 1,
1909
+ "should_decontaminate": true,
1910
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1911
+ "metadata": {
1912
+ "version": 1.0
1913
+ }
1914
+ },
1915
+ "blimp_regular_plural_subject_verb_agreement_2_filtered": {
1916
+ "task": "blimp_regular_plural_subject_verb_agreement_2_filtered",
1917
+ "group": "blimp_filtered",
1918
+ "dataset_path": "json",
1919
+ "dataset_kwargs": {
1920
+ "data_files": "evaluation_data/blimp_filtered/regular_plural_subject_verb_agreement_2.jsonl"
1921
+ },
1922
+ "validation_split": "train",
1923
+ "doc_to_text": "",
1924
+ "doc_to_target": 0,
1925
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1926
+ "description": "",
1927
+ "target_delimiter": " ",
1928
+ "fewshot_delimiter": "\n\n",
1929
+ "num_fewshot": 0,
1930
+ "metric_list": [
1931
+ {
1932
+ "metric": "acc",
1933
+ "weight_by_size": false
1934
+ }
1935
+ ],
1936
+ "output_type": "multiple_choice",
1937
+ "repeats": 1,
1938
+ "should_decontaminate": true,
1939
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1940
+ "metadata": {
1941
+ "version": 1.0
1942
+ }
1943
+ },
1944
+ "blimp_sentential_negation_npi_licensor_present_filtered": {
1945
+ "task": "blimp_sentential_negation_npi_licensor_present_filtered",
1946
+ "group": "blimp_filtered",
1947
+ "dataset_path": "json",
1948
+ "dataset_kwargs": {
1949
+ "data_files": "evaluation_data/blimp_filtered/sentential_negation_npi_licensor_present.jsonl"
1950
+ },
1951
+ "validation_split": "train",
1952
+ "doc_to_text": "",
1953
+ "doc_to_target": 0,
1954
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1955
+ "description": "",
1956
+ "target_delimiter": " ",
1957
+ "fewshot_delimiter": "\n\n",
1958
+ "num_fewshot": 0,
1959
+ "metric_list": [
1960
+ {
1961
+ "metric": "acc",
1962
+ "weight_by_size": false
1963
+ }
1964
+ ],
1965
+ "output_type": "multiple_choice",
1966
+ "repeats": 1,
1967
+ "should_decontaminate": true,
1968
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1969
+ "metadata": {
1970
+ "version": 1.0
1971
+ }
1972
+ },
1973
+ "blimp_sentential_negation_npi_scope_filtered": {
1974
+ "task": "blimp_sentential_negation_npi_scope_filtered",
1975
+ "group": "blimp_filtered",
1976
+ "dataset_path": "json",
1977
+ "dataset_kwargs": {
1978
+ "data_files": "evaluation_data/blimp_filtered/sentential_negation_npi_scope.jsonl"
1979
+ },
1980
+ "validation_split": "train",
1981
+ "doc_to_text": "",
1982
+ "doc_to_target": 0,
1983
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
1984
+ "description": "",
1985
+ "target_delimiter": " ",
1986
+ "fewshot_delimiter": "\n\n",
1987
+ "num_fewshot": 0,
1988
+ "metric_list": [
1989
+ {
1990
+ "metric": "acc",
1991
+ "weight_by_size": false
1992
+ }
1993
+ ],
1994
+ "output_type": "multiple_choice",
1995
+ "repeats": 1,
1996
+ "should_decontaminate": true,
1997
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
1998
+ "metadata": {
1999
+ "version": 1.0
2000
+ }
2001
+ },
2002
+ "blimp_sentential_subject_island_filtered": {
2003
+ "task": "blimp_sentential_subject_island_filtered",
2004
+ "group": "blimp_filtered",
2005
+ "dataset_path": "json",
2006
+ "dataset_kwargs": {
2007
+ "data_files": "evaluation_data/blimp_filtered/sentential_subject_island.jsonl"
2008
+ },
2009
+ "validation_split": "train",
2010
+ "doc_to_text": "",
2011
+ "doc_to_target": 0,
2012
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2013
+ "description": "",
2014
+ "target_delimiter": " ",
2015
+ "fewshot_delimiter": "\n\n",
2016
+ "num_fewshot": 0,
2017
+ "metric_list": [
2018
+ {
2019
+ "metric": "acc",
2020
+ "weight_by_size": false
2021
+ }
2022
+ ],
2023
+ "output_type": "multiple_choice",
2024
+ "repeats": 1,
2025
+ "should_decontaminate": true,
2026
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2027
+ "metadata": {
2028
+ "version": 1.0
2029
+ }
2030
+ },
2031
+ "blimp_superlative_quantifiers_1_filtered": {
2032
+ "task": "blimp_superlative_quantifiers_1_filtered",
2033
+ "group": "blimp_filtered",
2034
+ "dataset_path": "json",
2035
+ "dataset_kwargs": {
2036
+ "data_files": "evaluation_data/blimp_filtered/superlative_quantifiers_1.jsonl"
2037
+ },
2038
+ "validation_split": "train",
2039
+ "doc_to_text": "",
2040
+ "doc_to_target": 0,
2041
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2042
+ "description": "",
2043
+ "target_delimiter": " ",
2044
+ "fewshot_delimiter": "\n\n",
2045
+ "num_fewshot": 0,
2046
+ "metric_list": [
2047
+ {
2048
+ "metric": "acc",
2049
+ "weight_by_size": false
2050
+ }
2051
+ ],
2052
+ "output_type": "multiple_choice",
2053
+ "repeats": 1,
2054
+ "should_decontaminate": true,
2055
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2056
+ "metadata": {
2057
+ "version": 1.0
2058
+ }
2059
+ },
2060
+ "blimp_superlative_quantifiers_2_filtered": {
2061
+ "task": "blimp_superlative_quantifiers_2_filtered",
2062
+ "group": "blimp_filtered",
2063
+ "dataset_path": "json",
2064
+ "dataset_kwargs": {
2065
+ "data_files": "evaluation_data/blimp_filtered/superlative_quantifiers_2.jsonl"
2066
+ },
2067
+ "validation_split": "train",
2068
+ "doc_to_text": "",
2069
+ "doc_to_target": 0,
2070
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2071
+ "description": "",
2072
+ "target_delimiter": " ",
2073
+ "fewshot_delimiter": "\n\n",
2074
+ "num_fewshot": 0,
2075
+ "metric_list": [
2076
+ {
2077
+ "metric": "acc",
2078
+ "weight_by_size": false
2079
+ }
2080
+ ],
2081
+ "output_type": "multiple_choice",
2082
+ "repeats": 1,
2083
+ "should_decontaminate": true,
2084
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2085
+ "metadata": {
2086
+ "version": 1.0
2087
+ }
2088
+ },
2089
+ "blimp_supplement_hypernym": {
2090
+ "task": "blimp_supplement_hypernym",
2091
+ "group": "blimp_supplement",
2092
+ "dataset_path": "json",
2093
+ "dataset_kwargs": {
2094
+ "data_files": "evaluation_data/supplement_filtered/hypernym.jsonl"
2095
+ },
2096
+ "validation_split": "train",
2097
+ "doc_to_text": "",
2098
+ "doc_to_target": 0,
2099
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2100
+ "description": "",
2101
+ "target_delimiter": " ",
2102
+ "fewshot_delimiter": "\n\n",
2103
+ "num_fewshot": 0,
2104
+ "metric_list": [
2105
+ {
2106
+ "metric": "acc",
2107
+ "weight_by_size": false
2108
+ }
2109
+ ],
2110
+ "output_type": "multiple_choice",
2111
+ "repeats": 1,
2112
+ "should_decontaminate": true,
2113
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2114
+ "metadata": {
2115
+ "version": 1.0
2116
+ }
2117
+ },
2118
+ "blimp_supplement_qa_congruence_easy": {
2119
+ "task": "blimp_supplement_qa_congruence_easy",
2120
+ "group": "blimp_supplement",
2121
+ "dataset_path": "json",
2122
+ "dataset_kwargs": {
2123
+ "data_files": "evaluation_data/supplement_filtered/qa_congruence_easy.jsonl"
2124
+ },
2125
+ "validation_split": "train",
2126
+ "doc_to_text": "",
2127
+ "doc_to_target": 0,
2128
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2129
+ "description": "",
2130
+ "target_delimiter": " ",
2131
+ "fewshot_delimiter": "\n\n",
2132
+ "num_fewshot": 0,
2133
+ "metric_list": [
2134
+ {
2135
+ "metric": "acc",
2136
+ "weight_by_size": false
2137
+ }
2138
+ ],
2139
+ "output_type": "multiple_choice",
2140
+ "repeats": 1,
2141
+ "should_decontaminate": true,
2142
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2143
+ "metadata": {
2144
+ "version": 1.0
2145
+ }
2146
+ },
2147
+ "blimp_supplement_qa_congruence_tricky": {
2148
+ "task": "blimp_supplement_qa_congruence_tricky",
2149
+ "group": "blimp_supplement",
2150
+ "dataset_path": "json",
2151
+ "dataset_kwargs": {
2152
+ "data_files": "evaluation_data/supplement_filtered/qa_congruence_tricky.jsonl"
2153
+ },
2154
+ "validation_split": "train",
2155
+ "doc_to_text": "",
2156
+ "doc_to_target": 0,
2157
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2158
+ "description": "",
2159
+ "target_delimiter": " ",
2160
+ "fewshot_delimiter": "\n\n",
2161
+ "num_fewshot": 0,
2162
+ "metric_list": [
2163
+ {
2164
+ "metric": "acc",
2165
+ "weight_by_size": false
2166
+ }
2167
+ ],
2168
+ "output_type": "multiple_choice",
2169
+ "repeats": 1,
2170
+ "should_decontaminate": true,
2171
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2172
+ "metadata": {
2173
+ "version": 1.0
2174
+ }
2175
+ },
2176
+ "blimp_supplement_subject_aux_inversion": {
2177
+ "task": "blimp_supplement_subject_aux_inversion",
2178
+ "group": "blimp_supplement",
2179
+ "dataset_path": "json",
2180
+ "dataset_kwargs": {
2181
+ "data_files": "evaluation_data/supplement_filtered/subject_aux_inversion.jsonl"
2182
+ },
2183
+ "validation_split": "train",
2184
+ "doc_to_text": "",
2185
+ "doc_to_target": 0,
2186
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2187
+ "description": "",
2188
+ "target_delimiter": " ",
2189
+ "fewshot_delimiter": "\n\n",
2190
+ "num_fewshot": 0,
2191
+ "metric_list": [
2192
+ {
2193
+ "metric": "acc",
2194
+ "weight_by_size": false
2195
+ }
2196
+ ],
2197
+ "output_type": "multiple_choice",
2198
+ "repeats": 1,
2199
+ "should_decontaminate": true,
2200
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2201
+ "metadata": {
2202
+ "version": 1.0
2203
+ }
2204
+ },
2205
+ "blimp_supplement_turn_taking": {
2206
+ "task": "blimp_supplement_turn_taking",
2207
+ "group": "blimp_supplement",
2208
+ "dataset_path": "json",
2209
+ "dataset_kwargs": {
2210
+ "data_files": "evaluation_data/supplement_filtered/turn_taking.jsonl"
2211
+ },
2212
+ "validation_split": "train",
2213
+ "doc_to_text": "",
2214
+ "doc_to_target": 0,
2215
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2216
+ "description": "",
2217
+ "target_delimiter": " ",
2218
+ "fewshot_delimiter": "\n\n",
2219
+ "num_fewshot": 0,
2220
+ "metric_list": [
2221
+ {
2222
+ "metric": "acc",
2223
+ "weight_by_size": false
2224
+ }
2225
+ ],
2226
+ "output_type": "multiple_choice",
2227
+ "repeats": 1,
2228
+ "should_decontaminate": true,
2229
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2230
+ "metadata": {
2231
+ "version": 1.0
2232
+ }
2233
+ },
2234
+ "blimp_tough_vs_raising_1_filtered": {
2235
+ "task": "blimp_tough_vs_raising_1_filtered",
2236
+ "group": "blimp_filtered",
2237
+ "dataset_path": "json",
2238
+ "dataset_kwargs": {
2239
+ "data_files": "evaluation_data/blimp_filtered/tough_vs_raising_1.jsonl"
2240
+ },
2241
+ "validation_split": "train",
2242
+ "doc_to_text": "",
2243
+ "doc_to_target": 0,
2244
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2245
+ "description": "",
2246
+ "target_delimiter": " ",
2247
+ "fewshot_delimiter": "\n\n",
2248
+ "num_fewshot": 0,
2249
+ "metric_list": [
2250
+ {
2251
+ "metric": "acc",
2252
+ "weight_by_size": false
2253
+ }
2254
+ ],
2255
+ "output_type": "multiple_choice",
2256
+ "repeats": 1,
2257
+ "should_decontaminate": true,
2258
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2259
+ "metadata": {
2260
+ "version": 1.0
2261
+ }
2262
+ },
2263
+ "blimp_tough_vs_raising_2_filtered": {
2264
+ "task": "blimp_tough_vs_raising_2_filtered",
2265
+ "group": "blimp_filtered",
2266
+ "dataset_path": "json",
2267
+ "dataset_kwargs": {
2268
+ "data_files": "evaluation_data/blimp_filtered/tough_vs_raising_2.jsonl"
2269
+ },
2270
+ "validation_split": "train",
2271
+ "doc_to_text": "",
2272
+ "doc_to_target": 0,
2273
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2274
+ "description": "",
2275
+ "target_delimiter": " ",
2276
+ "fewshot_delimiter": "\n\n",
2277
+ "num_fewshot": 0,
2278
+ "metric_list": [
2279
+ {
2280
+ "metric": "acc",
2281
+ "weight_by_size": false
2282
+ }
2283
+ ],
2284
+ "output_type": "multiple_choice",
2285
+ "repeats": 1,
2286
+ "should_decontaminate": true,
2287
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2288
+ "metadata": {
2289
+ "version": 1.0
2290
+ }
2291
+ },
2292
+ "blimp_transitive_filtered": {
2293
+ "task": "blimp_transitive_filtered",
2294
+ "group": "blimp_filtered",
2295
+ "dataset_path": "json",
2296
+ "dataset_kwargs": {
2297
+ "data_files": "evaluation_data/blimp_filtered/transitive.jsonl"
2298
+ },
2299
+ "validation_split": "train",
2300
+ "doc_to_text": "",
2301
+ "doc_to_target": 0,
2302
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2303
+ "description": "",
2304
+ "target_delimiter": " ",
2305
+ "fewshot_delimiter": "\n\n",
2306
+ "num_fewshot": 0,
2307
+ "metric_list": [
2308
+ {
2309
+ "metric": "acc",
2310
+ "weight_by_size": false
2311
+ }
2312
+ ],
2313
+ "output_type": "multiple_choice",
2314
+ "repeats": 1,
2315
+ "should_decontaminate": true,
2316
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2317
+ "metadata": {
2318
+ "version": 1.0
2319
+ }
2320
+ },
2321
+ "blimp_wh_island_filtered": {
2322
+ "task": "blimp_wh_island_filtered",
2323
+ "group": "blimp_filtered",
2324
+ "dataset_path": "json",
2325
+ "dataset_kwargs": {
2326
+ "data_files": "evaluation_data/blimp_filtered/wh_island.jsonl"
2327
+ },
2328
+ "validation_split": "train",
2329
+ "doc_to_text": "",
2330
+ "doc_to_target": 0,
2331
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2332
+ "description": "",
2333
+ "target_delimiter": " ",
2334
+ "fewshot_delimiter": "\n\n",
2335
+ "num_fewshot": 0,
2336
+ "metric_list": [
2337
+ {
2338
+ "metric": "acc",
2339
+ "weight_by_size": false
2340
+ }
2341
+ ],
2342
+ "output_type": "multiple_choice",
2343
+ "repeats": 1,
2344
+ "should_decontaminate": true,
2345
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2346
+ "metadata": {
2347
+ "version": 1.0
2348
+ }
2349
+ },
2350
+ "blimp_wh_questions_object_gap_filtered": {
2351
+ "task": "blimp_wh_questions_object_gap_filtered",
2352
+ "group": "blimp_filtered",
2353
+ "dataset_path": "json",
2354
+ "dataset_kwargs": {
2355
+ "data_files": "evaluation_data/blimp_filtered/wh_questions_object_gap.jsonl"
2356
+ },
2357
+ "validation_split": "train",
2358
+ "doc_to_text": "",
2359
+ "doc_to_target": 0,
2360
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2361
+ "description": "",
2362
+ "target_delimiter": " ",
2363
+ "fewshot_delimiter": "\n\n",
2364
+ "num_fewshot": 0,
2365
+ "metric_list": [
2366
+ {
2367
+ "metric": "acc",
2368
+ "weight_by_size": false
2369
+ }
2370
+ ],
2371
+ "output_type": "multiple_choice",
2372
+ "repeats": 1,
2373
+ "should_decontaminate": true,
2374
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2375
+ "metadata": {
2376
+ "version": 1.0
2377
+ }
2378
+ },
2379
+ "blimp_wh_questions_subject_gap_filtered": {
2380
+ "task": "blimp_wh_questions_subject_gap_filtered",
2381
+ "group": "blimp_filtered",
2382
+ "dataset_path": "json",
2383
+ "dataset_kwargs": {
2384
+ "data_files": "evaluation_data/blimp_filtered/wh_questions_subject_gap.jsonl"
2385
+ },
2386
+ "validation_split": "train",
2387
+ "doc_to_text": "",
2388
+ "doc_to_target": 0,
2389
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2390
+ "description": "",
2391
+ "target_delimiter": " ",
2392
+ "fewshot_delimiter": "\n\n",
2393
+ "num_fewshot": 0,
2394
+ "metric_list": [
2395
+ {
2396
+ "metric": "acc",
2397
+ "weight_by_size": false
2398
+ }
2399
+ ],
2400
+ "output_type": "multiple_choice",
2401
+ "repeats": 1,
2402
+ "should_decontaminate": true,
2403
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2404
+ "metadata": {
2405
+ "version": 1.0
2406
+ }
2407
+ },
2408
+ "blimp_wh_questions_subject_gap_long_distance_filtered": {
2409
+ "task": "blimp_wh_questions_subject_gap_long_distance_filtered",
2410
+ "group": "blimp_filtered",
2411
+ "dataset_path": "json",
2412
+ "dataset_kwargs": {
2413
+ "data_files": "evaluation_data/blimp_filtered/wh_questions_subject_gap_long_distance.jsonl"
2414
+ },
2415
+ "validation_split": "train",
2416
+ "doc_to_text": "",
2417
+ "doc_to_target": 0,
2418
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2419
+ "description": "",
2420
+ "target_delimiter": " ",
2421
+ "fewshot_delimiter": "\n\n",
2422
+ "num_fewshot": 0,
2423
+ "metric_list": [
2424
+ {
2425
+ "metric": "acc",
2426
+ "weight_by_size": false
2427
+ }
2428
+ ],
2429
+ "output_type": "multiple_choice",
2430
+ "repeats": 1,
2431
+ "should_decontaminate": true,
2432
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2433
+ "metadata": {
2434
+ "version": 1.0
2435
+ }
2436
+ },
2437
+ "blimp_wh_vs_that_no_gap_filtered": {
2438
+ "task": "blimp_wh_vs_that_no_gap_filtered",
2439
+ "group": "blimp_filtered",
2440
+ "dataset_path": "json",
2441
+ "dataset_kwargs": {
2442
+ "data_files": "evaluation_data/blimp_filtered/wh_vs_that_no_gap.jsonl"
2443
+ },
2444
+ "validation_split": "train",
2445
+ "doc_to_text": "",
2446
+ "doc_to_target": 0,
2447
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2448
+ "description": "",
2449
+ "target_delimiter": " ",
2450
+ "fewshot_delimiter": "\n\n",
2451
+ "num_fewshot": 0,
2452
+ "metric_list": [
2453
+ {
2454
+ "metric": "acc",
2455
+ "weight_by_size": false
2456
+ }
2457
+ ],
2458
+ "output_type": "multiple_choice",
2459
+ "repeats": 1,
2460
+ "should_decontaminate": true,
2461
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2462
+ "metadata": {
2463
+ "version": 1.0
2464
+ }
2465
+ },
2466
+ "blimp_wh_vs_that_no_gap_long_distance_filtered": {
2467
+ "task": "blimp_wh_vs_that_no_gap_long_distance_filtered",
2468
+ "group": "blimp_filtered",
2469
+ "dataset_path": "json",
2470
+ "dataset_kwargs": {
2471
+ "data_files": "evaluation_data/blimp_filtered/wh_vs_that_no_gap_long_distance.jsonl"
2472
+ },
2473
+ "validation_split": "train",
2474
+ "doc_to_text": "",
2475
+ "doc_to_target": 0,
2476
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2477
+ "description": "",
2478
+ "target_delimiter": " ",
2479
+ "fewshot_delimiter": "\n\n",
2480
+ "num_fewshot": 0,
2481
+ "metric_list": [
2482
+ {
2483
+ "metric": "acc",
2484
+ "weight_by_size": false
2485
+ }
2486
+ ],
2487
+ "output_type": "multiple_choice",
2488
+ "repeats": 1,
2489
+ "should_decontaminate": true,
2490
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2491
+ "metadata": {
2492
+ "version": 1.0
2493
+ }
2494
+ },
2495
+ "blimp_wh_vs_that_with_gap_filtered": {
2496
+ "task": "blimp_wh_vs_that_with_gap_filtered",
2497
+ "group": "blimp_filtered",
2498
+ "dataset_path": "json",
2499
+ "dataset_kwargs": {
2500
+ "data_files": "evaluation_data/blimp_filtered/wh_vs_that_with_gap.jsonl"
2501
+ },
2502
+ "validation_split": "train",
2503
+ "doc_to_text": "",
2504
+ "doc_to_target": 0,
2505
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2506
+ "description": "",
2507
+ "target_delimiter": " ",
2508
+ "fewshot_delimiter": "\n\n",
2509
+ "num_fewshot": 0,
2510
+ "metric_list": [
2511
+ {
2512
+ "metric": "acc",
2513
+ "weight_by_size": false
2514
+ }
2515
+ ],
2516
+ "output_type": "multiple_choice",
2517
+ "repeats": 1,
2518
+ "should_decontaminate": true,
2519
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2520
+ "metadata": {
2521
+ "version": 1.0
2522
+ }
2523
+ },
2524
+ "blimp_wh_vs_that_with_gap_long_distance_filtered": {
2525
+ "task": "blimp_wh_vs_that_with_gap_long_distance_filtered",
2526
+ "group": "blimp_filtered",
2527
+ "dataset_path": "json",
2528
+ "dataset_kwargs": {
2529
+ "data_files": "evaluation_data/blimp_filtered/wh_vs_that_with_gap_long_distance.jsonl"
2530
+ },
2531
+ "validation_split": "train",
2532
+ "doc_to_text": "",
2533
+ "doc_to_target": 0,
2534
+ "doc_to_choice": "{{[sentence_good, sentence_bad]}}",
2535
+ "description": "",
2536
+ "target_delimiter": " ",
2537
+ "fewshot_delimiter": "\n\n",
2538
+ "num_fewshot": 0,
2539
+ "metric_list": [
2540
+ {
2541
+ "metric": "acc",
2542
+ "weight_by_size": false
2543
+ }
2544
+ ],
2545
+ "output_type": "multiple_choice",
2546
+ "repeats": 1,
2547
+ "should_decontaminate": true,
2548
+ "doc_to_decontamination_query": "{{sentence_good}} {{sentence_bad}}",
2549
+ "metadata": {
2550
+ "version": 1.0
2551
+ }
2552
+ }
2553
+ },
2554
+ "versions": {
2555
+ "blimp_adjunct_island_filtered": 1.0,
2556
+ "blimp_anaphor_gender_agreement_filtered": 1.0,
2557
+ "blimp_anaphor_number_agreement_filtered": 1.0,
2558
+ "blimp_animate_subject_passive_filtered": 1.0,
2559
+ "blimp_animate_subject_trans_filtered": 1.0,
2560
+ "blimp_causative_filtered": 1.0,
2561
+ "blimp_complex_NP_island_filtered": 1.0,
2562
+ "blimp_coordinate_structure_constraint_complex_left_branch_filtered": 1.0,
2563
+ "blimp_coordinate_structure_constraint_object_extraction_filtered": 1.0,
2564
+ "blimp_determiner_noun_agreement_1_filtered": 1.0,
2565
+ "blimp_determiner_noun_agreement_2_filtered": 1.0,
2566
+ "blimp_determiner_noun_agreement_irregular_1_filtered": 1.0,
2567
+ "blimp_determiner_noun_agreement_irregular_2_filtered": 1.0,
2568
+ "blimp_determiner_noun_agreement_with_adj_2_filtered": 1.0,
2569
+ "blimp_determiner_noun_agreement_with_adj_irregular_1_filtered": 1.0,
2570
+ "blimp_determiner_noun_agreement_with_adj_irregular_2_filtered": 1.0,
2571
+ "blimp_determiner_noun_agreement_with_adjective_1_filtered": 1.0,
2572
+ "blimp_distractor_agreement_relational_noun_filtered": 1.0,
2573
+ "blimp_distractor_agreement_relative_clause_filtered": 1.0,
2574
+ "blimp_drop_argument_filtered": 1.0,
2575
+ "blimp_ellipsis_n_bar_1_filtered": 1.0,
2576
+ "blimp_ellipsis_n_bar_2_filtered": 1.0,
2577
+ "blimp_existential_there_object_raising_filtered": 1.0,
2578
+ "blimp_existential_there_quantifiers_1_filtered": 1.0,
2579
+ "blimp_existential_there_quantifiers_2_filtered": 1.0,
2580
+ "blimp_existential_there_subject_raising_filtered": 1.0,
2581
+ "blimp_expletive_it_object_raising_filtered": 1.0,
2582
+ "blimp_inchoative_filtered": 1.0,
2583
+ "blimp_intransitive_filtered": 1.0,
2584
+ "blimp_irregular_past_participle_adjectives_filtered": 1.0,
2585
+ "blimp_irregular_past_participle_verbs_filtered": 1.0,
2586
+ "blimp_irregular_plural_subject_verb_agreement_1_filtered": 1.0,
2587
+ "blimp_irregular_plural_subject_verb_agreement_2_filtered": 1.0,
2588
+ "blimp_left_branch_island_echo_question_filtered": 1.0,
2589
+ "blimp_left_branch_island_simple_question_filtered": 1.0,
2590
+ "blimp_matrix_question_npi_licensor_present_filtered": 1.0,
2591
+ "blimp_npi_present_1_filtered": 1.0,
2592
+ "blimp_npi_present_2_filtered": 1.0,
2593
+ "blimp_only_npi_licensor_present_filtered": 1.0,
2594
+ "blimp_only_npi_scope_filtered": 1.0,
2595
+ "blimp_passive_1_filtered": 1.0,
2596
+ "blimp_passive_2_filtered": 1.0,
2597
+ "blimp_principle_A_c_command_filtered": 1.0,
2598
+ "blimp_principle_A_case_1_filtered": 1.0,
2599
+ "blimp_principle_A_case_2_filtered": 1.0,
2600
+ "blimp_principle_A_domain_1_filtered": 1.0,
2601
+ "blimp_principle_A_domain_2_filtered": 1.0,
2602
+ "blimp_principle_A_domain_3_filtered": 1.0,
2603
+ "blimp_principle_A_reconstruction_filtered": 1.0,
2604
+ "blimp_regular_plural_subject_verb_agreement_1_filtered": 1.0,
2605
+ "blimp_regular_plural_subject_verb_agreement_2_filtered": 1.0,
2606
+ "blimp_sentential_negation_npi_licensor_present_filtered": 1.0,
2607
+ "blimp_sentential_negation_npi_scope_filtered": 1.0,
2608
+ "blimp_sentential_subject_island_filtered": 1.0,
2609
+ "blimp_superlative_quantifiers_1_filtered": 1.0,
2610
+ "blimp_superlative_quantifiers_2_filtered": 1.0,
2611
+ "blimp_supplement_hypernym": 1.0,
2612
+ "blimp_supplement_qa_congruence_easy": 1.0,
2613
+ "blimp_supplement_qa_congruence_tricky": 1.0,
2614
+ "blimp_supplement_subject_aux_inversion": 1.0,
2615
+ "blimp_supplement_turn_taking": 1.0,
2616
+ "blimp_tough_vs_raising_1_filtered": 1.0,
2617
+ "blimp_tough_vs_raising_2_filtered": 1.0,
2618
+ "blimp_transitive_filtered": 1.0,
2619
+ "blimp_wh_island_filtered": 1.0,
2620
+ "blimp_wh_questions_object_gap_filtered": 1.0,
2621
+ "blimp_wh_questions_subject_gap_filtered": 1.0,
2622
+ "blimp_wh_questions_subject_gap_long_distance_filtered": 1.0,
2623
+ "blimp_wh_vs_that_no_gap_filtered": 1.0,
2624
+ "blimp_wh_vs_that_no_gap_long_distance_filtered": 1.0,
2625
+ "blimp_wh_vs_that_with_gap_filtered": 1.0,
2626
+ "blimp_wh_vs_that_with_gap_long_distance_filtered": 1.0
2627
+ },
2628
+ "n-shot": {
+ "blimp_adjunct_island_filtered": 0,
+ "blimp_anaphor_gender_agreement_filtered": 0,
+ "blimp_anaphor_number_agreement_filtered": 0,
+ "blimp_animate_subject_passive_filtered": 0,
+ "blimp_animate_subject_trans_filtered": 0,
+ "blimp_causative_filtered": 0,
+ "blimp_complex_NP_island_filtered": 0,
+ "blimp_coordinate_structure_constraint_complex_left_branch_filtered": 0,
+ "blimp_coordinate_structure_constraint_object_extraction_filtered": 0,
+ "blimp_determiner_noun_agreement_1_filtered": 0,
+ "blimp_determiner_noun_agreement_2_filtered": 0,
+ "blimp_determiner_noun_agreement_irregular_1_filtered": 0,
+ "blimp_determiner_noun_agreement_irregular_2_filtered": 0,
+ "blimp_determiner_noun_agreement_with_adj_2_filtered": 0,
+ "blimp_determiner_noun_agreement_with_adj_irregular_1_filtered": 0,
+ "blimp_determiner_noun_agreement_with_adj_irregular_2_filtered": 0,
+ "blimp_determiner_noun_agreement_with_adjective_1_filtered": 0,
+ "blimp_distractor_agreement_relational_noun_filtered": 0,
+ "blimp_distractor_agreement_relative_clause_filtered": 0,
+ "blimp_drop_argument_filtered": 0,
+ "blimp_ellipsis_n_bar_1_filtered": 0,
+ "blimp_ellipsis_n_bar_2_filtered": 0,
+ "blimp_existential_there_object_raising_filtered": 0,
+ "blimp_existential_there_quantifiers_1_filtered": 0,
+ "blimp_existential_there_quantifiers_2_filtered": 0,
+ "blimp_existential_there_subject_raising_filtered": 0,
+ "blimp_expletive_it_object_raising_filtered": 0,
+ "blimp_filtered": 0,
+ "blimp_inchoative_filtered": 0,
+ "blimp_intransitive_filtered": 0,
+ "blimp_irregular_past_participle_adjectives_filtered": 0,
+ "blimp_irregular_past_participle_verbs_filtered": 0,
+ "blimp_irregular_plural_subject_verb_agreement_1_filtered": 0,
+ "blimp_irregular_plural_subject_verb_agreement_2_filtered": 0,
+ "blimp_left_branch_island_echo_question_filtered": 0,
+ "blimp_left_branch_island_simple_question_filtered": 0,
+ "blimp_matrix_question_npi_licensor_present_filtered": 0,
+ "blimp_npi_present_1_filtered": 0,
+ "blimp_npi_present_2_filtered": 0,
+ "blimp_only_npi_licensor_present_filtered": 0,
+ "blimp_only_npi_scope_filtered": 0,
+ "blimp_passive_1_filtered": 0,
+ "blimp_passive_2_filtered": 0,
+ "blimp_principle_A_c_command_filtered": 0,
+ "blimp_principle_A_case_1_filtered": 0,
+ "blimp_principle_A_case_2_filtered": 0,
+ "blimp_principle_A_domain_1_filtered": 0,
+ "blimp_principle_A_domain_2_filtered": 0,
+ "blimp_principle_A_domain_3_filtered": 0,
+ "blimp_principle_A_reconstruction_filtered": 0,
+ "blimp_regular_plural_subject_verb_agreement_1_filtered": 0,
+ "blimp_regular_plural_subject_verb_agreement_2_filtered": 0,
+ "blimp_sentential_negation_npi_licensor_present_filtered": 0,
+ "blimp_sentential_negation_npi_scope_filtered": 0,
+ "blimp_sentential_subject_island_filtered": 0,
+ "blimp_superlative_quantifiers_1_filtered": 0,
+ "blimp_superlative_quantifiers_2_filtered": 0,
+ "blimp_supplement": 0,
+ "blimp_supplement_hypernym": 0,
+ "blimp_supplement_qa_congruence_easy": 0,
+ "blimp_supplement_qa_congruence_tricky": 0,
+ "blimp_supplement_subject_aux_inversion": 0,
+ "blimp_supplement_turn_taking": 0,
+ "blimp_tough_vs_raising_1_filtered": 0,
+ "blimp_tough_vs_raising_2_filtered": 0,
+ "blimp_transitive_filtered": 0,
+ "blimp_wh_island_filtered": 0,
+ "blimp_wh_questions_object_gap_filtered": 0,
+ "blimp_wh_questions_subject_gap_filtered": 0,
+ "blimp_wh_questions_subject_gap_long_distance_filtered": 0,
+ "blimp_wh_vs_that_no_gap_filtered": 0,
+ "blimp_wh_vs_that_no_gap_long_distance_filtered": 0,
+ "blimp_wh_vs_that_with_gap_filtered": 0,
+ "blimp_wh_vs_that_with_gap_long_distance_filtered": 0
+ },
+ "config": {
+ "model": "hf-mlm",
+ "model_args": "pretrained=../models/ELC_ParserBERT_10M_256_ordered/,backend=mlm,trust_remote_code=True",
+ "batch_size": "128",
+ "batch_sizes": [],
+ "device": "cuda:0",
+ "use_cache": null,
+ "limit": null,
+ "bootstrap_iters": 100000,
+ "gen_kwargs": null
+ },
+ "git_hash": "0150eb5",
+ "date": 1726583493.7720273,
+ "pretty_env_info": "PyTorch version: 2.4.1\nIs debug build: False\nCUDA used to build PyTorch: 12.1\nROCM used to build PyTorch: N/A\n\nOS: CentOS Linux release 7.9.2009 (Core) (x86_64)\nGCC version: (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)\nClang version: Could not collect\nCMake version: version 2.8.12.2\nLibc version: glibc-2.17\n\nPython version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)\nPython platform: Linux-3.10.0-1160.71.1.el7.x86_64-x86_64-with-glibc2.17\nIs CUDA available: True\nCUDA runtime version: 12.1.105\nCUDA_MODULE_LOADING set to: LAZY\nGPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe\nNvidia driver version: 545.23.08\ncuDNN version: Could not collect\nHIP runtime version: N/A\nMIOpen runtime version: N/A\nIs XNNPACK available: True\n\nCPU:\nArchitecture: x86_64\nCPU op-mode(s): 32-bit, 64-bit\nByte Order: Little Endian\nCPU(s): 128\nOn-line CPU(s) list: 0-127\nThread(s) per core: 2\nCore(s) per socket: 32\nSocket(s): 2\nNUMA node(s): 2\nVendor ID: GenuineIntel\nCPU family: 6\nModel: 106\nModel name: Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz\nStepping: 6\nCPU MHz: 2600.000\nBogoMIPS: 5200.00\nVirtualization: VT-x\nL1d cache: 48K\nL1i cache: 32K\nL2 cache: 1280K\nL3 cache: 49152K\nNUMA node0 CPU(s): 0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108,110,112,114,116,118,120,122,124,126\nNUMA node1 CPU(s): 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81,83,85,87,89,91,93,95,97,99,101,103,105,107,109,111,113,115,117,119,121,123,125,127\nFlags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 invpcid_single intel_pt ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq md_clear pconfig spec_ctrl intel_stibp flush_l1d arch_capabilities\n\nVersions of relevant libraries:\n[pip3] flake8==7.0.0\n[pip3] mypy==1.10.0\n[pip3] mypy-extensions==1.0.0\n[pip3] numpy==1.26.4\n[pip3] numpydoc==1.7.0\n[pip3] torch==2.4.1\n[pip3] torchaudio==2.4.1\n[pip3] torchvision==0.19.1\n[pip3] triton==3.0.0\n[conda] _anaconda_depends 2024.06 py311_mkl_2 \n[conda] blas 1.0 mkl \n[conda] ffmpeg 4.3 hf484d3e_0 pytorch\n[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch\n[conda] mkl 2023.1.0 h213fc3f_46344 \n[conda] mkl-service 2.4.0 py311h5eee18b_1 \n[conda] mkl_fft 1.3.10 py311h5eee18b_0 \n[conda] mkl_random 1.2.7 py311ha02d727_0 \n[conda] numpy 1.26.4 py311h08b1b3b_0 \n[conda] numpy-base 1.26.4 py311hf175353_0 \n[conda] numpydoc 1.7.0 py311h06a4308_0 \n[conda] pytorch 2.4.1 py3.11_cuda12.1_cudnn9.1.0_0 pytorch\n[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch\n[conda] pytorch-mutex 1.0 cuda pytorch\n[conda] torchaudio 2.4.1 py311_cu121 pytorch\n[conda] torchtriton 3.0.0 py311 pytorch\n[conda] torchvision 0.19.1 py311_cu121 pytorch",
+ "transformers_version": "4.44.2",
+ "upper_git_hash": null
+ }