---
language: fa
tags:
- persian
- RoBERTa
license: apache-2.0
pipeline_tag: fill-mask
mask_token: '[MASK]'
widget:
- text: 'در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم.'
extra_gated_prompt: "This model is not free. Please enter your contact information and we will reach out to you."
extra_gated_fields:
  contact information: text
---


### Tehran Language Model

Welcome to Tehran, the repository for Lifeweb's language model. The first versions of our models are all trained on our own dataset, **Divan**, which contains more than **164 million documents** and more than **10B tokens**. The dataset has been carefully normalized and deduplicated to keep it rich and comprehensive. A better dataset leads to a better model!

# Use Model

You can load the model and run it with the sample code below.

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline

# v1.0
model_name = "lifeweb-ai/tehran"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Tokenize a sample sentence to inspect how the text is split
text = "در همین لحظه که شما مشغول خواندن این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
print(tokenizer.tokenize(text))
# ['در', 'همین', 'لحظه', 'که', 'شما', 'مشغول', 'خواندن', 'این', 'متن', 'هستید،', 'میلیون', '[zwnj]', 'ها', 'دیتا', 'در', 'فضای', 'انلاین', 'در', 'حال', 'تولید', 'است', '.', 'ما', 'در', 'لایف', 'وب', 'به', 'جمع', '[zwnj]', 'اوری', '##،', 'پردازش', 'و', 'تحلیل', 'این', 'کلان', 'داده', '(', 'big', 'data', ')', 'می', '[zwnj]', 'پردازیم', '.', '.']

# Fill-mask task: predict the token hidden behind [MASK]
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = fill_mask(text)

# The pipeline returns candidates sorted by score; print the best one
print(result[0])
# {'score': 0.3825972378253937, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}
```

# Results

**Tehran** is evaluated on three downstream NLP tasks: **NER**, **Sentiment Analysis**, and **Emotion Detection**. It outperforms the other Persian language models listed in the table below in terms of accuracy and macro F1. The table reports the macro F1 score for each task, and Colab code is available for each task to use as a tutorial. All Colab runs used the same hardware: 4x 2080 Ti graphics cards.

| Model | NER (Arman) | NER (Peyma) | Sentiment (Sentipers, multi) | Sentiment (Snappfood) | Emotion (Arman) |
|---|---|---|---|---|---|
| lifeweb-ai/tehran | 71.87% | 90.79% | 63.75% | 88.74% | 77.73% |
| lifeweb-ai/shiraz | 67.62% | 86.24% | 59.17% | 88.01% | 66.97% |
| sbunlp/fabert | 71.23% | 88.53% | 58.51% | 88.60% | 72.65% |
| ViraIntelligentDataMining/AriaBERT | 69.12% | 87.15% | 59.26% | 87.96% | 69.11% |
| HooshvareLab/bert-fa-zwnj-base | 67.49% | 85.73% | 59.61% | 87.58% | 59.27% |
| HooshvareLab/roberta-fa-zwnj-base | 69.73% | 86.21% | 56.23% | 87.19% | 57.96% |
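
For readers unfamiliar with the metric, the following is a minimal sketch of how a macro F1 score can be computed from gold and predicted labels. It is an illustration only, not the evaluation code behind the table above: the function name `macro_f1` and the toy labels are assumptions, and the exact protocol used for the reported numbers (for example, entity-level versus token-level scoring for NER) is not stated here.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    # Count true positives, false positives, and false negatives per class
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1
            fn[gold] += 1

    # F1 per class is 2*TP / (2*TP + FP + FN); every class weighs the same
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with made-up sentiment labels
gold = ["pos", "pos", "neg", "neu"]
pred = ["pos", "neg", "neg", "neu"]
print(f"{macro_f1(gold, pred):.4f}")  # 0.7778
```

Because every class contributes equally to the average, macro F1 penalizes poor performance on rare classes more than accuracy does, which makes it a common choice for imbalanced label sets.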

If you have tested our models on a public dataset and would like to add your results to the table above, open a pull request or contact us. Please also make your code available online so that we can add a reference to it.

# Cite

You are welcome to use our language models in your work or research. If you do, we kindly ask you to cite them using the following entry:

```
@misc{Tehran,
  author = {Mehrdad Azizi and Reza Salehi Chegeni and Parisa Mousavi and Iman Hashemi},
  title = {[Optimizing Pre-trained BERT-based Models for Persian Language Processing]},
  year = {2024},
  publisher = {LifeWeb}
}
```

# Contributors

- Mehrdad Azizi: [**Linkedin**](https://www.linkedin.com/in/mehrdad-azizi-50839489/), [**Github**](https://github.com/mehrazi)
- Reza Salehi Chegeni: [**Linkedin**](https://www.linkedin.com/in/reza-salehi-chegeni-6988ba271/), [**Github**](https://github.com/rezasalehichegeni)
- Parisa Mousavi: [**Linkedin**](https://www.linkedin.com/in/seyede-parisa-mousavi/), [**Github**](https://github.com/Mousavi-Parisa)
- Iman Hashemi: [**Linkedin**](https://www.linkedin.com/in/iman-hashemi-403738a5), [**Github**](https://github.com/hashemiiman)
- Lifeweb: [**HuggingFace**](https://huggingface.co/lifeweb-ai), [**Official Website**](https://lifewebco.com/), [**Linkedin**](https://www.linkedin.com/company/lifewebir/mycompany/)

# Releases

**v1.0 (2024-03-09)**
First version of the **Tehran** model, trained on **Divan**.