# Legal-HeBERT
Legal-HeBERT is a BERT model for the Hebrew legal and legislative domains. It is intended to advance legal NLP research and tool development in Hebrew. We release two versions of Legal-HeBERT. The first version is a fine-tuned model of [HeBERT](https://github.com/avichaychriqui/HeBERT) applied on legal and legislative documents. The second version uses [HeBERT](https://github.com/avichaychriqui/HeBERT)'s architecture guidelines to train a BERT model from scratch. <br>
We continue to collect legal data, examine different architectural designs, and build tagged datasets and legal tasks for evaluating and developing Hebrew legal tools.

## Training Data
Our training datasets are:
| Name | Hebrew Description | Size (GB) | Documents | Sentences | Words | Notes |
|---|---|---|---|---|---|---|
| The Israeli Law Book | ספר החוקים הישראלי | 0.05 | 2,338 | 293,352 | 4,851,063 | |
| Judgments of the Supreme Court | מאגר פסקי הדין של בית המשפט העליון | 0.7 | 212,348 | 5,790,138 | 79,672,415 | |
| Custody courts | החלטות בתי הדין למשמורת | 2.46 | 169,708 | 8,555,893 | 213,050,492 | |
| Law memoranda, drafts of secondary legislation and drafts of support tests that have been distributed to the public for comment | תזכירי חוק, טיוטות חקיקת משנה וטיוטות מבחני תמיכה שהופצו להערות הציבור | 0.4 | 3,291 | 294,752 | 7,218,960 | |
| Supervisors of Land Registration judgments | מאגר פסקי דין של המפקחים על רישום המקרקעין | 0.02 | 559 | 67,639 | 1,785,446 | |
| Decisions of the Labor Court - Corona | מאגר החלטות בית הדין לעניין שירות התעסוקה – קורונה | 0.001 | 146 | 3,505 | 60,195 | |
| Decisions of the Israel Lands Council | החלטות מועצת מקרקעי ישראל | | 118 | 11,283 | 162,692 | aggregate file |
| Judgments of the Disciplinary Tribunal and the Israel Police Appeals Tribunal | פסקי דין של בית הדין למשמעת ובית הדין לערעורים של משטרת ישראל | 0.02 | 54 | 83,724 | 1,743,419 | aggregate files |
| Disciplinary Appeals Committee in the Ministry of Health | ועדת ערר לדין משמעתי במשרד הבריאות | 0.004 | 252 | 21,010 | 429,807 | 465 scanned files could not be parsed |
| Attorney General's Positions | מאגר התייצבויות היועץ המשפטי לממשלה | 0.008 | 281 | 32,724 | 813,877 | |
| Legal-Opinion of the Attorney General | מאגר חוות דעת היועץ המשפטי לממשלה | 0.002 | 44 | 7,132 | 188,053 | |
| Total | | 3.665 | 389,139 | 15,161,152 | 309,976,419 | |
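
As a quick sanity check, the Total row can be reproduced from the per-corpus document, sentence, and word counts in the table:

```python
# Per-corpus (documents, sentences, words) counts, copied from the table above.
counts = [
    (2338, 293352, 4851063),        # The Israeli Law Book
    (212348, 5790138, 79672415),    # Judgments of the Supreme Court
    (169708, 8555893, 213050492),   # Custody courts
    (3291, 294752, 7218960),        # Law memoranda and drafts
    (559, 67639, 1785446),          # Supervisors of Land Registration
    (146, 3505, 60195),             # Labor Court - Corona
    (118, 11283, 162692),           # Israel Lands Council
    (54, 83724, 1743419),           # Police disciplinary tribunals
    (252, 21010, 429807),           # Ministry of Health appeals committee
    (281, 32724, 813877),           # Attorney General's Positions
    (44, 7132, 188053),             # Legal-Opinion of the Attorney General
]
docs, sents, words = (sum(col) for col in zip(*counts))
print(docs, sents, words)  # 389139 15161152 309976419
```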

We thank <b>Yair Gardin</b> for referring us to the governance data, <b>Elhanan Schwarts</b> for collecting and parsing the Israeli Law Book, and <b>Jonathan Schler</b> for collecting the judgments of the Supreme Court.

## Training process
* Vocabulary size: 50,000 tokens
* 4 epochs (~1M steps)
* Learning rate: 5e-5
* mlm_probability: 0.15
* Batch size: 32 per GPU
* Hardware: NVIDIA GeForce RTX 2080 Ti + NVIDIA GeForce RTX 3090 (1 week of training)
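
The `mlm_probability` above refers to BERT-style masked-language-model training. As a minimal stdlib-only illustration of what that objective does to a token sequence (the 80/10/10 replacement split is the standard BERT recipe, assumed here rather than stated in this card, and the toy vocabulary is made up):

```python
import random

MASK_TOKEN = "[MASK]"
VOCAB = ["law", "court", "judge", "state", "appeal"]  # toy vocabulary

def mask_tokens(tokens, mlm_probability=0.15, rng=None):
    """Return (inputs, labels): labels keep the original token at masked
    positions and are None elsewhere (positions the MLM loss ignores)."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mlm_probability:   # position selected for prediction
            labels.append(tok)
            r = rng.random()
            if r < 0.8:
                inputs.append(MASK_TOKEN)         # 80%: replace with [MASK]
            elif r < 0.9:
                inputs.append(rng.choice(VOCAB))  # 10%: random token
            else:
                inputs.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            inputs.append(tok)
    return inputs, labels

tokens = ["the", "supreme", "court", "ruled", "on", "the", "appeal"] * 100
inputs, labels = mask_tokens(tokens)
masked_fraction = sum(l is not None for l in labels) / len(tokens)
print(round(masked_fraction, 2))  # close to 0.15
```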

### Additional training settings
<b>Fine-tuned [HeBERT](https://github.com/avichaychriqui/HeBERT) model:</b> The first eight layers were frozen (as [Lee et al. (2019)](https://arxiv.org/abs/1911.03090) suggest) <br>
<b>Legal-HeBERT trained from scratch:</b> The training process is similar to [HeBERT](https://github.com/avichaychriqui/HeBERT)'s and inspired by [Chalkidis et al. (2020)](https://arxiv.org/abs/2010.02559) <br>

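The layer-freezing setting can be sketched as follows. A 12-layer stack of toy `nn.Linear` modules stands in for the real BERT-base encoder so the snippet runs without downloading anything; with an actual `transformers` BERT model, the same loop would iterate over `model.encoder.layer[:8]`:

```python
import torch.nn as nn

# Toy 12-layer stand-in for a BERT-base encoder; nn.Linear layers are
# placeholders so the sketch stays self-contained.
encoder_layers = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])

N_FROZEN = 8  # freeze the first eight layers, per Lee et al. (2019)
for layer in encoder_layers[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False  # excluded from gradient updates

# Only the last four layers remain trainable.
trainable = [i for i, layer in enumerate(encoder_layers)
             if all(p.requires_grad for p in layer.parameters())]
print(trainable)  # [8, 9, 10, 11]
```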
## How to use
The models can be found on the Hugging Face Hub and can be fine-tuned for any downstream task:
```
# !pip install transformers==4.14.1
from transformers import AutoTokenizer, AutoModel, pipeline

# Choose one of the two released models:
model_name = 'avichr/Legal-heBERT_ft'  # HeBERT fine-tuned on legal data
# model_name = 'avichr/Legal-heBERT'   # Legal-HeBERT trained from scratch

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

fill_mask = pipeline(
    "fill-mask",
    model=model_name,
)
fill_mask("הקורונה לקחה את [MASK] ולנו לא נשאר דבר.")
```
## Stay tuned!
We are still working on our models and datasets and will update this page as we progress. We are open to collaborations.
59
+
60
+ ## Contact us
61
+ [Avichay Chriqui](mailto:[email protected]), The Coller AI Lab <br>
62
+ [Inbal yahav](mailto:[email protected]), The Coller AI Lab <br>
63
+ [Ittai Bar-Siman-Tov](mailto:[email protected]), the BIU Innovation Lab for Law, Data-Science and Digital Ethics <br>
64
+
65
+ Thank you, ืชื•ื“ื”, ุดูƒุฑุง <br>