|
--- |
|
license: apache-2.0 |
|
--- |
|
|
|
# ELC-ParserBERT |
|
|
|
This model is an adaptation of the [Every Layer Counts BERT model](<https://aclanthology.org/2023.conll-babylm.20/>) that incorporates the parser network from [StructFormer](<https://arxiv.org/abs/2012.00857>). It was trained for the [BabyLM 2024 challenge](https://babylm.github.io/index.html)'s Strict-Small track.
|
|
|
## Dataset |
|
|
|
The training data for the challenge can be accessed through OSF [here](https://osf.io/ad7qg/). This model was trained on the 10M token training dataset. |
|
|
|
### Order in Pretraining |
|
|
|
After the data is segmented, the segments are ordered by increasing difficulty according to the Flesch reading ease metric. This ordering can either be maintained (by omitting the shuffle flag during training) or discarded by allowing the data to be shuffled; this model was trained with shuffling enabled.
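The curriculum ordering described above can be sketched as follows. This is a minimal illustration, not the training code: it uses the standard Flesch reading ease formula with a rough vowel-group syllable heuristic, whereas the actual pipeline may rely on a library implementation (e.g. `textstat`).

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllable count by counting contiguous vowel groups."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch formula: higher score = easier text."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)


def order_by_difficulty(segments: list[str]) -> list[str]:
    """Sort segments easiest-first, i.e. by decreasing reading-ease score."""
    return sorted(segments, key=flesch_reading_ease, reverse=True)


segments = [
    "Comprehensive epistemological investigations necessitate considerable perseverance.",
    "The cat sat on the mat.",
]
ordered = order_by_difficulty(segments)  # the simple sentence comes first
```

Because higher Flesch scores mean easier text, sorting in descending score order yields the increasing-difficulty curriculum.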
|
|
|
## Hyperparameters |
|
|
|
### Base Model |
|
|
|
| Hyperparameter | Value | |
|
| -------------- | ----- | |
|
| Initial learning rate | 5e-3 | |
|
| Batch size | 256 | |
|
| Steps | 13495 | |
|
| shuffled | True | |
|
| attention_probs_dropout_prob | 0.1 |
|
| classifier_dropout | 0.2 | |
|
| hidden_dropout_prob | 0.1 | |
|
| hidden_size | 384 | |
|
| intermediate_size | 1024 | |
|
| layer_norm_eps | 1e-07 | |
|
| max_position_embeddings | 512 | |
|
| num_attention_heads | 6 | |
|
| num_hidden_layers | 12 | |
|
| vocab_size | 16384 | |
|
| n_parser_layers | 4 | |
|
| parser_conv_size | 9 |
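For reference, the architecture hyperparameters above can be collected into a plain dict shaped like a Hugging Face model config. The parser-network keys (`n_parser_layers`, `parser_conv_size`) follow the table; the exact config class used by the training code may name things differently.

```python
# Base-model architecture hyperparameters, as listed in the table above.
config = {
    "hidden_size": 384,
    "intermediate_size": 1024,
    "num_hidden_layers": 12,
    "num_attention_heads": 6,
    "max_position_embeddings": 512,
    "vocab_size": 16384,
    "attention_probs_dropout_prob": 0.1,
    "hidden_dropout_prob": 0.1,
    "classifier_dropout": 0.2,
    "layer_norm_eps": 1e-07,
    # Parser-network settings from StructFormer:
    "n_parser_layers": 4,
    "parser_conv_size": 9,
}

# Sanity check: each attention head spans hidden_size / num_attention_heads
# = 384 / 6 = 64 dimensions.
head_dim = config["hidden_size"] // config["num_attention_heads"]
```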
|
|
|
### Fine-tuning |
|
|
|
The fine-tuning hyperparameters were unchanged from the organizers' defaults, apart from following the patience approach of last year's ELC-BERT model. In particular:
|
|
|
| Hyperparameter | Value | |
|
| -------------- | ----- | |
|
| Initial learning rate | 5e-5 | |
|
| Batch size | 64 | |
|
| Maximum epochs | 10 | |
|
| Evaluate every (epochs) | 1 | |
|
| Patience | 10 (for CoLA, MRPC, RTE, BoolQ, MultiRC, and WSC), 100 (for MNLI, MNLI-MM, QQP, QNLI, and SST-2) | |
|
| Seed | 12 | |
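The patience mechanism above can be sketched as standard early stopping: fine-tuning halts once the validation score has failed to improve for `patience` consecutive evaluations. This is an illustrative sketch, not the evaluation pipeline's actual code.

```python
def train_with_patience(eval_scores: list[float], patience: int) -> int:
    """Return the number of evaluation rounds run before stopping.

    eval_scores: validation scores, one per evaluation (here, per epoch).
    patience: number of consecutive non-improving evaluations tolerated.
    """
    best = float("-inf")
    bad_evals = 0
    for epoch, score in enumerate(eval_scores):
        if score > best:
            best = score
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return epoch + 1  # stop after this evaluation
    return len(eval_scores)  # ran to the epoch cap


# Example: scores improve, then plateau for two evaluations (patience=2).
stop_epoch = train_with_patience([0.50, 0.60, 0.55, 0.54, 0.53], patience=2)
```

Note that with a 10-epoch cap, evaluation every epoch, and patience 10 (as used for CoLA, MRPC, RTE, BoolQ, MultiRC, and WSC), patience effectively never triggers; it only matters when the epoch budget exceeds the patience window.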
|
|
|
## Credit |
|
|
|
As mentioned above, this model is an adaptation of Every Layer Counts (ELC) BERT and StructFormer; the citations and code repositories for both can be found below.
|
|
|
* StructFormer |
|
* [StructFormer Github](<https://github.com/google-research/google-research/tree/master/structformer>) |
|
|
|
* ```bibtex |
|
@misc{shen2020structformer, |
|
title={StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling}, |
|
author={Yikang Shen and Yi Tay and Che Zheng and Dara Bahri and Donald Metzler and Aaron Courville}, |
|
year={2020}, |
|
eprint={2012.00857}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}
}
```
|
* ELC-BERT: |
|
* [ELC-BERT Github](<https://github.com/ltgoslo/elc-bert>) |
|
* [ELC-BERT 10M Hugging Face](https://huggingface.co/lgcharpe/ELC_BERT_small_baby_10M) |
|
* ```bibtex |
|
@inproceedings{georges-gabriel-charpentier-samuel-2023-layers, |
|
title = "Not all layers are equally as important: Every Layer Counts {BERT}", |
|
author = "Georges Gabriel Charpentier, Lucas and |
|
Samuel, David", |
|
editor = "Warstadt, Alex and |
|
Mueller, Aaron and |
|
Choshen, Leshem and |
|
Wilcox, Ethan and |
|
Zhuang, Chengxu and |
|
Ciro, Juan and |
|
Mosquera, Rafael and |
|
Paranjabe, Bhargavi and |
|
Williams, Adina and |
|
Linzen, Tal and |
|
Cotterell, Ryan", |
|
booktitle = "Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning", |
|
month = dec, |
|
year = "2023", |
|
address = "Singapore", |
|
publisher = "Association for Computational Linguistics", |
|
url = "https://aclanthology.org/2023.conll-babylm.20", |
|
doi = "10.18653/v1/2023.conll-babylm.20", |
|
pages = "238--252", |
|
}
```
|
|