MaCoCu/XLMR-base-MaCoCu-is

ZJaume committed on Feb 13, 2023

Commit 4abb43d (1 parent: a6b67ad)

Add model card

Files changed (1)

README.md +68 -0

	@@ -0,0 +1,68 @@

+---
+license: cc0-1.0
+language:
+- is
+tags:
+- MaCoCu
+---
+# Model description
+**XLMR-base-MaCoCu-is** is a large pre-trained language model trained on **Icelandic** texts. It was created by continuing training from the [XLM-RoBERTa-base](https://huggingface.co/xlm-roberta-base) model. It was developed as part of the [MaCoCu](https://macocu.eu/) project and only uses data that was crawled during the project. The main developer is [Jaume Zaragoza-Bernabeu](https://github.com/ZJaume) from Prompsit Language Engineering.
+XLMR-base-MaCoCu-is was trained on 4.4 GB of Icelandic text, which corresponds to 688M tokens. It was trained for 40,000 steps with a batch size of 256. It uses the same vocabulary as the original XLMR-base model.
+The training and fine-tuning procedures are described in detail in our [GitHub repo](https://github.com/macocu/LanguageModels).
+## Warning
+This model has not been fully trained because it was intended as the base of the [Bicleaner AI Icelandic model](https://huggingface.co/bitextor/bicleaner-ai-full-en-is). If you need better performance, please use [XLMR-MaCoCu-is](https://huggingface.co/MaCoCu/XLMR-MaCoCu-is).
+# How to use
+```python
+from transformers import AutoTokenizer, AutoModel, TFAutoModel
+tokenizer = AutoTokenizer.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")
+model = AutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")  # PyTorch
+# model = TFAutoModel.from_pretrained("MaCoCu/XLMR-base-MaCoCu-is")  # TensorFlow (use instead of the PyTorch line)
+```
+# Data
+For training, we used all Icelandic data that was present in the monolingual Icelandic [MaCoCu](https://macocu.eu/) corpus. After de-duplicating the data, we were left with a total of 4.4 GB of text, which equals 688M tokens.
+# Acknowledgements
+The authors received funding from the European Union’s Connecting Europe Facility
+2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).
+# Citation
+If you use this model, please cite the following paper:
+```bibtex
+@inproceedings{non-etal-2022-macocu,
+    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
+    author = "Ba{\~n}{\'o}n, Marta  and
+      Espl{\`a}-Gomis, Miquel  and
+      Forcada, Mikel L.  and
+      Garc{\'\i}a-Romero, Cristian  and
+      Kuzman, Taja  and
+      Ljube{\v{s}}i{\'c}, Nikola  and
+      van Noord, Rik  and
+      Sempere, Leopoldo Pla  and
+      Ram{\'\i}rez-S{\'a}nchez, Gema  and
+      Rupnik, Peter  and
+      Suchomel, V{\'\i}t  and
+      Toral, Antonio  and
+      van der Werff, Tobias  and
+      Zaragoza, Jaume",
+    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
+    month = jun,
+    year = "2022",
+    address = "Ghent, Belgium",
+    publisher = "European Association for Machine Translation",
+    url = "https://aclanthology.org/2022.eamt-1.41",
+    pages = "303--304"
+}
+```
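
Note on the "How to use" snippet in the added README: it loads the tokenizer and the model but stops before producing any embeddings. A common way to turn the model's per-token outputs into one vector per sentence is mean pooling that ignores padding tokens. The sketch below assumes the PyTorch backend and that pooling strategy; the model card does not prescribe either, and `mean_pool` and `embed` are hypothetical helper names introduced only for illustration.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sentence, ignoring padding positions."""
    # (batch, seq) -> (batch, seq, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)  # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)         # guard against division by zero
    return summed / counts


def embed(sentences, model_name="MaCoCu/XLMR-base-MaCoCu-is"):
    """Hypothetical helper: return one mean-pooled vector per input sentence."""
    # Imported lazily so mean_pool stays usable without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic features

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["Halló heimur", "Reykjavík er höfuðborg Íslands."])` downloads the checkpoint on first use and should return a tensor with one row per sentence. Given the warning in the README, these vectors are probably better treated as a starting point for fine-tuning than as finished sentence embeddings.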