language:
- en
- bn
library_name: transformers
license: apache-2.0
tags:
- transformers
- gemma2
- gemma
rishiraj/gemma-2-9b-bn
This repository extends the google/gemma-2-9b tokenizer by training it on Bengali text. The original tokenizer splits many Bengali words into subword fragments, which wastes tokens and obscures meaning. The extended tokenizer keeps more Bengali words intact, so the same text is encoded with fewer splits and a more meaningful representation.
Token Information
Tokenizer | Number of Tokens |
---|---|
google/gemma-2-9b | 256,000 |
rishiraj/gemma-2-9b-bn | 392,402 |
Why Fewer Tokens for Bengali?
Although Bengali is expressive and flexible, it has absorbed fewer loanwords from other languages than English has, so fewer vocabulary entries are needed to cover it.
Tokenizer Comparison
Text:
আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি
Tokenizer | Output |
---|---|
google/gemma-2-9b | ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে', '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল', 'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি'] |
rishiraj/gemma-2-9b-bn | ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি', '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি'] |
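The difference can be put into numbers as tokens per word. The sketch below is only an illustration: it hard-codes the two token lists from the comparison above rather than calling either tokenizer, and the helper name `tokens_per_word` is made up for this example.

```python
# Token lists copied from the comparison table; no tokenizer is loaded here.
text = "আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি"

base_tokens = ['আ', 'মি', '▁এক', 'জন', '▁ভ', 'াল', 'ো', '▁', 'ছে', 'লে',
               '▁এবং', '▁আম', 'ি', '▁ফ', 'ু', 'ট', 'ব', 'ল', '▁খ', 'েল',
               'তে', '▁প', 'ছ', 'ন্দ', '▁কর', 'ি']
extended_tokens = ['আমি', '▁একজন', '▁ভালো', '▁ছেলে', '▁এবং', '▁আমি',
                   '▁ফুটবল', '▁খেলতে', '▁পছন্দ', '▁করি']


def tokens_per_word(tokens, sentence):
    """Average number of tokens per whitespace-separated word (hypothetical helper)."""
    return len(tokens) / len(sentence.split())


base_ratio = tokens_per_word(base_tokens, text)
extended_ratio = tokens_per_word(extended_tokens, text)

print(f"google/gemma-2-9b:      {len(base_tokens)} tokens, {base_ratio:.1f} per word")
print(f"rishiraj/gemma-2-9b-bn: {len(extended_tokens)} tokens, {extended_ratio:.1f} per word")
```

For this one sentence the original tokenizer emits 26 tokens and the extended tokenizer emits 10, one per word. A single sentence is not a benchmark, so treat the ratio as an example rather than a general result.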
Usage
Install dependencies:
pip install transformers
Load and use the tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rishiraj/gemma-2-9b-bn")
tokens = tokenizer.tokenize("আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি")
print(tokens)
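To reproduce the comparison on your own text, something like the following sketch may help. The helper names `count_tokens` and `load_tokenizers` are hypothetical and not part of this repository; the only library calls used are `AutoTokenizer.from_pretrained` and `tokenize`. Loading google/gemma-2-9b may require accepting its license on the Hub and authenticating first.

```python
from typing import Any, Dict, Iterable


def count_tokens(text: str, tokenizers: Dict[str, Any]) -> Dict[str, int]:
    """Return how many tokens each tokenizer produces for `text`.

    `tokenizers` maps a label to any object with a `tokenize(str)` method.
    """
    return {name: len(tok.tokenize(text)) for name, tok in tokenizers.items()}


def load_tokenizers(names: Iterable[str]) -> Dict[str, Any]:
    """Load Hugging Face tokenizers by repository name (needs network access)."""
    # Imported lazily so count_tokens can be used without transformers installed.
    from transformers import AutoTokenizer

    return {name: AutoTokenizer.from_pretrained(name) for name in names}


# Example usage (downloads both tokenizers):
# tokenizers = load_tokenizers(["google/gemma-2-9b", "rishiraj/gemma-2-9b-bn"])
# print(count_tokens("আমি একজন ভালো ছেলে এবং আমি ফুটবল খেলতে পছন্দ করি", tokenizers))
```

Keeping the counting separate from the loading means the same function works with any tokenizer object, including ones loaded from a local path.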