imvladikon committed
Commit 44bfa66 • 1 Parent(s): ed4bb7e

Update README.md

Files changed (1)
  1. README.md +43 -4
README.md CHANGED
@@ -5,15 +5,54 @@ tags:
  - language model
  ---

- Checkpoint of the alephbertgimmel-base-512 from https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel
  (for testing purposes, please use the original checkpoints from the authors of this model)

- AlephBertGimmel - Modern Hebrew pretrained BERT model with a 128K token vocabulary.

- When using AlephBertGimmel, please reference:

  ```
- Eylon Guetta, Avi Shmidman, Shaltiel Shmidman, Cheyn Shmuel Shmidman, Joshua Guedalia, Moshe Koppel, Dan Bareket, Amit Seker and Reut Tsarfaty, "Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All", Nov 2022 [http://arxiv.org/abs/2211.15199]
  ```
 
  - language model
  ---

+ ## AlephBertGimmel
+ Modern Hebrew pretrained BERT model with a 128K token vocabulary.
+
+ Checkpoint of alephbertgimmel-base-512 from [alephbertgimmel](https://github.com/Dicta-Israel-Center-for-Text-Analysis/alephbertgimmel)
  (for testing purposes, please use the original checkpoints from the authors of this model)

+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM, AutoTokenizer
+
+ model = AutoModelForMaskedLM.from_pretrained("imvladikon/alephbertgimmel-base-512")
+ tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")
+
+ # "{} is a metropolis that constitutes the center of the economy"
+ text = "{} היא מטרופולין המהווה את מרכז הכלכלה"
+
+ # Encode the sentence with the slot filled by [MASK] and locate the masked position
+ input_ids = tokenizer.encode(text.format("[MASK]"), return_tensors="pt")
+ mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]
+
+ # Score the vocabulary at the masked position and keep the five most likely tokens
+ token_logits = model(input_ids).logits
+ mask_token_logits = token_logits[0, mask_token_index, :]
+ top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()
+
+ for token in top_5_tokens:
+     print(text.format(tokenizer.decode([token])))
+
+ # Top-5 completions ("the city", "Jerusalem", "Haifa", "London", "Eilat"):
+ # העיר היא מטרופולין המהווה את מרכז הכלכלה
+ # ירושלים היא מטרופולין המהווה את מרכז הכלכלה
+ # חיפה היא מטרופולין המהווה את מרכז הכלכלה
+ # לונדון היא מטרופולין המהווה את מרכז הכלכלה
+ # אילת היא מטרופולין המהווה את מרכז הכלכלה
  ```
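
The same top-5 completion can also be reproduced with the higher-level fill-mask pipeline; a minimal sketch, assuming the `imvladikon/alephbertgimmel-base-512` checkpoint name used in the snippet above:

```python
from transformers import pipeline

# Wrap the checkpoint in a fill-mask pipeline (checkpoint name taken from the example above)
fill_mask = pipeline("fill-mask", model="imvladikon/alephbertgimmel-base-512")

# Ask for the five most likely fillers of the masked slot
sentence = f"{fill_mask.tokenizer.mask_token} היא מטרופולין המהווה את מרכז הכלכלה"
for prediction in fill_mask(sentence, top_k=5):
    print(prediction["sequence"], round(prediction["score"], 3))
```

The pipeline performs tokenization, masking, and decoding internally, so it serves as a convenient sanity check against the manual top-k loop above.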

+
+ When using AlephBertGimmel, please reference:
+
+ ```bibtex
+ @misc{guetta2022large,
+   title={Large Pre-Trained Models with Extra-Large Vocabularies: A Contrastive Analysis of Hebrew BERT Models and a New One to Outperform Them All},
+   author={Eylon Guetta and Avi Shmidman and Shaltiel Shmidman and Cheyn Shmuel Shmidman and Joshua Guedalia and Moshe Koppel and Dan Bareket and Amit Seker and Reut Tsarfaty},
+   year={2022},
+   eprint={2211.15199},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL}
+ }
  ```
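
The description in this update highlights the 128K-token vocabulary. A minimal sketch for inspecting that claim locally, assuming the same `imvladikon/alephbertgimmel-base-512` checkpoint name used in the usage example:

```python
from transformers import AutoTokenizer

# Load the tokenizer for the checkpoint referenced in this README (name assumed from the usage example)
tokenizer = AutoTokenizer.from_pretrained("imvladikon/alephbertgimmel-base-512")

# vocab_size is the base vocabulary; len() also counts any added special tokens
print(tokenizer.vocab_size, len(tokenizer))  # expected to be on the order of 128K
```

The premise of the cited paper is that an extra-large vocabulary splits Hebrew words into fewer word pieces than earlier Hebrew BERT models.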