BachhoangVnist committed on
Commit
d7bdc80
1 Parent(s): e101c9c

init embedding model

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ pytorch_model.bin filter=lfs diff=lfs merge=lfs -text
+ .git/lfs/objects/cb/fa/cbfae04cc7f3063949a7d81258e185cd31249892768c15e27b70d4797f42b902 filter=lfs diff=lfs merge=lfs -text
+ model.safetensors filter=lfs diff=lfs merge=lfs -text
+ .git/lfs/objects/2d/81/2d8135cb6ff79bf4303fb0afd11808791c9123ae8ee753fccd480207e347963e filter=lfs diff=lfs merge=lfs -text
1_Pooling/config.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "word_embedding_dimension": 768,
+   "pooling_mode_cls_token": false,
+   "pooling_mode_mean_tokens": true,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false
+ }
README.md ADDED
@@ -0,0 +1,168 @@
+ ---
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
+ library_name: generic
+ language:
+ - vi
+ widget:
+ - source_sentence: Làm thế nào Đại học Bách khoa Hà Nội thu hút sinh viên quốc tế?
+   sentences:
+   - >-
+     Đại học Bách khoa Hà Nội đã phát triển các chương trình đào tạo bằng tiếng
+     Anh để làm cho việc học tại đây dễ dàng hơn cho sinh viên quốc tế.
+   - >-
+     Môi trường học tập đa dạng và sự hỗ trợ đầy đủ cho sinh viên quốc tế tại Đại
+     học Bách khoa Hà Nội giúp họ thích nghi nhanh chóng.
+   - Hà Nội có khí hậu mát mẻ vào mùa thu.
+   - Các món ăn ở Hà Nội rất ngon và đa dạng.
+ license: apache-2.0
+ ---
+
+ # bkai-foundation-models/vietnamese-bi-encoder
+
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.
+
+ We train the model on a merged training dataset that consists of:
+ - MS MARCO (translated into Vietnamese)
+ - SQuAD v2 (translated into Vietnamese)
+ - 80% of the training set from the Legal Text Retrieval Zalo 2021 challenge
+
+ We use [phobert-base-v2](https://github.com/VinAIResearch/PhoBERT) as the pre-trained backbone.
+
+ Here are the results on the remaining 20% of the training set from the Legal Text Retrieval Zalo 2021 challenge:
+
+ | Pretrained Model | Training Datasets | Acc@1 | Acc@10 | Acc@100 | Pre@10 | MRR@10 |
+ |------------------|-------------------|:-----:|:------:|:-------:|:------:|:------:|
+ | [Vietnamese-SBERT](https://huggingface.co/keepitreal/vietnamese-sbert) | - | 32.34 | 52.97 | 89.84 | 7.05 | 45.30 |
+ | PhoBERT-base-v2 | MS MARCO | 47.81 | 77.19 | 92.34 | 7.72 | 58.37 |
+ | PhoBERT-base-v2 | MS MARCO + SQuAD v2.0 + 80% Zalo | 73.28 | 93.59 | 98.85 | 9.36 | 80.73 |
+
+ ## Usage (Sentence-Transformers)
+
+ Using this model is easy when you have [sentence-transformers](https://www.SBERT.net) installed:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then you can use the model like this:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # INPUT TEXT MUST BE ALREADY WORD-SEGMENTED!
+ sentences = ["Cô ấy là một người vui_tính .", "Cô ấy cười nói suốt cả ngày ."]
+
+ model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
+ embeddings = model.encode(sentences)
+ print(embeddings)
+ ```
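+
+ The snippet above assumes the input is already word-segmented (PhoBERT-style, with the syllables of a compound word joined by underscores). If you start from raw text, one option is to segment it first with pyvi, the same segmenter used by `custom_tokenizer.py` in this repository. A minimal sketch (the example sentences are only illustrative):
+
+ ```python
+ from pyvi import ViTokenizer
+ from sentence_transformers import SentenceTransformer
+
+ raw_sentences = ["Cô ấy là một người vui tính.", "Cô ấy cười nói suốt cả ngày."]
+
+ # Word-segment the raw text so it matches what PhoBERT expects.
+ segmented_sentences = [ViTokenizer.tokenize(s) for s in raw_sentences]
+
+ model = SentenceTransformer('bkai-foundation-models/vietnamese-bi-encoder')
+ embeddings = model.encode(segmented_sentences)
+ print(embeddings.shape)  # (2, 768)
+ ```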
+
+
+ ## Usage (HuggingFace Widget)
+ The widget uses a custom pipeline on top of the default one, adding a word segmenter in front of the PhobertTokenizer, so you do not need to segment words before calling the API.
+
+ An example can be seen in the Hosted inference API.
+
+
+ ## Usage (HuggingFace Transformers)
+
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+
+
+ # Mean pooling - take the attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+ # Sentences we want embeddings for; they must be word-segmented first
+ # (e.g. with pyvi, underthesea, or RDRSegmenter)
+ sentences = ['Cô ấy là một người vui_tính .', 'Cô ấy cười nói suốt cả ngày .']
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
+ model = AutoModel.from_pretrained('bkai-foundation-models/vietnamese-bi-encoder')
+
+ # Tokenize sentences
+ encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
+
+ # Compute token embeddings
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+
+ # Perform pooling. In this case, mean pooling.
+ sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+ print("Sentence embeddings:")
+ print(sentence_embeddings)
+ ```
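+
+ Since the model is meant for semantic search, a natural follow-up is to compare the resulting embeddings. A short sketch, reusing `sentence_embeddings` from the snippet above (cosine similarity is also what `pipeline.py` in this repository uses):
+
+ ```python
+ import torch.nn.functional as F
+
+ # Normalise so the dot product equals cosine similarity.
+ normalized = F.normalize(sentence_embeddings, p=2, dim=1)
+ similarity = normalized @ normalized.T  # pairwise cosine-similarity matrix
+ print(similarity)
+ ```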
+
+ ## Training
+
+ The model was trained with the parameters:
+
+ **DataLoader**:
+
+ `torch.utils.data.dataloader.DataLoader` of length 17584 with parameters:
+
+ ```
+ {'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
+ ```
+
+ **Loss**:
+
+ `sentence_transformers.losses.MultipleNegativesRankingLoss.MultipleNegativesRankingLoss` with parameters:
+
+ ```
+ {'scale': 20.0, 'similarity_fct': 'cos_sim'}
+ ```
+
+ Parameters of the fit()-Method:
+
+ ```
+ {
+     "epochs": 15,
+     "evaluation_steps": 0,
+     "evaluator": "NoneType",
+     "max_grad_norm": 1,
+     "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
+     "optimizer_params": {
+         "lr": 2e-05
+     },
+     "scheduler": "WarmupLinear",
+     "steps_per_epoch": null,
+     "warmup_steps": 1000,
+     "weight_decay": 0.01
+ }
+ ```
+
151
+ ## Full Model Architecture
152
+
153
+ ```
154
+ SentenceTransformer(
155
+ (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: RobertaModel
156
+ (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False})
157
+ )
158
+ ```
159
+
160
+ ### Please cite our manuscript if this dataset is used for your work
161
+ ```
162
+ @article{duc2024towards,
163
+ title={Towards Comprehensive Vietnamese Retrieval-Augmented Generation and Large Language Models},
164
+ author={Nguyen Quang Duc, Le Hai Son, Nguyen Duc Nhan, Nguyen Dich Nhat Minh, Le Thanh Huong, Dinh Viet Sang},
165
+ journal={arXiv preprint arXiv:2403.01616},
166
+ year={2024}
167
+ }
168
+ ```
added_tokens.json ADDED
@@ -0,0 +1,3 @@
+ {
+   "<mask>": 64000
+ }
bpe.codes ADDED
The diff for this file is too large to render. See raw diff
 
config.json ADDED
@@ -0,0 +1,28 @@
+ {
+   "_name_or_path": "output/train_bi-encoder-mnrl-vinai-phobert-base-v2-margin_3.0-2023-08-27_23-13-25/",
+   "architectures": [
+     "RobertaModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 258,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "tokenizer_class": "PhobertTokenizer",
+   "torch_dtype": "float32",
+   "transformers_version": "4.38.2",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 64001
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "__version__": {
+     "sentence_transformers": "2.2.2",
+     "transformers": "4.32.0",
+     "pytorch": "2.0.0+cu117"
+   }
+ }
custom_tokenizer.py ADDED
@@ -0,0 +1,11 @@
+ from transformers import PhobertTokenizer
+ from pyvi import ViTokenizer
+
+
+ class CustomPhobertTokenizer(PhobertTokenizer):
+     def rdr_segment(self, text):
+         return ViTokenizer.tokenize(text)
+
+     def _tokenize(self, text):
+         segmented_text = self.rdr_segment(text)
+         return super()._tokenize(segmented_text)
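The tokenizer subclass above word-segments raw text with pyvi before PhoBERT's BPE step, so callers do not need to pre-segment their input. A hedged usage sketch, assuming it is run from the repository root so that `vocab.txt` and `bpe.codes` can be found:

```python
from custom_tokenizer import CustomPhobertTokenizer

# Load vocab.txt / bpe.codes from the repository root.
tokenizer = CustomPhobertTokenizer.from_pretrained(".")

# Raw, unsegmented input; _tokenize() segments it with pyvi first.
print(tokenizer.tokenize("Cô ấy là một người vui tính."))
```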
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:e681accadaec87e79901db0c3f68e33d996cba334633b6dd0b2483dba4f398e0
+ size 540015464
modules.json ADDED
@@ -0,0 +1,14 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   }
+ ]
pipeline.py ADDED
@@ -0,0 +1,76 @@
+ from typing import Dict, List, Union
+ import torch
+ from transformers import AutoModel
+ from custom_tokenizer import CustomPhobertTokenizer
+
+
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output[
+         0
+     ]  # First element of model_output contains all token embeddings
+     input_mask_expanded = (
+         attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     )
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
+         input_mask_expanded.sum(1), min=1e-9
+     )
+
+
+ class PreTrainedPipeline:
+     def __init__(self, path="."):
+         self.model = AutoModel.from_pretrained(path)
+         self.tokenizer = CustomPhobertTokenizer.from_pretrained(path)
+
+     def __call__(self, inputs: Dict[str, Union[str, List[str]]]) -> List[float]:
+         """
+         Args:
+             inputs (Dict[str, Union[str, List[str]]]):
+                 a dictionary containing a query sentence and a list of key sentences
+         """
+
+         # Combine the query sentence and key sentences into one list
+         sentences = [inputs["source_sentence"]] + inputs["sentences"]
+
+         # Tokenize sentences
+         encoded_input = self.tokenizer(
+             sentences, padding=True, truncation=True, return_tensors="pt"
+         )
+
+         # Compute token embeddings
+         with torch.no_grad():
+             model_output = self.model(**encoded_input)
+
+         # Perform pooling to get sentence embeddings
+         sentence_embeddings = mean_pooling(
+             model_output, encoded_input["attention_mask"]
+         )
+
+         # Separate the query embedding from the key embeddings
+         query_embedding = sentence_embeddings[0]
+         key_embeddings = sentence_embeddings[1:]
+
+         # Compute cosine similarities (or any other comparison method you prefer)
+         cosine_similarities = torch.nn.functional.cosine_similarity(
+             query_embedding.unsqueeze(0), key_embeddings
+         )
+
+         # Convert the tensor of cosine similarities to a list of floats
+         scores = cosine_similarities.tolist()
+
+         return scores
+
+
+ if __name__ == "__main__":
+     inputs = {
+         "source_sentence": "Anh ấy đang là sinh viên năm cuối",
+         "sentences": [
+             "Anh ấy học tại Đại học Bách khoa Hà Nội, chuyên ngành Khoa học máy tính",
+             "Anh ấy đang làm việc tại nhà máy sản xuất linh kiện điện tử",
+             "Anh ấy chuẩn bị đi du học nước ngoài",
+             "Anh ấy sắp mở cửa hàng bán mỹ phẩm",
+             "Nhà anh ấy có rất nhiều cây cảnh",
+         ],
+     }
+
+     pipeline = PreTrainedPipeline()
+     res = pipeline(inputs)
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:2d8135cb6ff79bf4303fb0afd11808791c9123ae8ee753fccd480207e347963e
+ size 540057065
requirements.txt ADDED
@@ -0,0 +1 @@
+ pyvi>=0.1.1
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 256,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer_config.json ADDED
@@ -0,0 +1,54 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "64000": {
+       "content": "<mask>",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": "<mask>",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "PhobertTokenizer",
+   "unk_token": "<unk>"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff