# Bad_text_classifier
## Model Introduction
This repository releases a model that classifies whether the many comments and chat messages spread across the internet contain offensive content. The model was fine-tuned on a dataset built by revising the labels of public datasets and merging them together. Please understand that the model cannot always judge every sentence correctly.
```
NOTE)
Due to copyright restrictions on the public datasets, the modified data used for model training cannot be released.
Also, the model's predictions are unrelated to the author's personal opinions.
```
## Dataset
### data label
* **0 : bad sentence**
* **1 : not bad sentence**
### datasets used
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
### dataset preprocessing
Both datasets, which were not originally binary-labeled, were relabeled into binary form; then only the label-1 (not bad sentence) rows of the Korean HateSpeech Dataset were extracted and merged into the processed Korean Unsmile Dataset.
<br>

**Some of the data labeled as clean in the Korean Unsmile Dataset were revised to 0 (bad sentence):**
* Among sentences containing "~노", those also containing "이기" or "노무" were revised to 0 (bad sentence)
* Sentences containing terms with sexual connotations, such as "좆" or "봊", were revised to 0 (bad sentence)
<br><br>
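The relabel-and-merge step described above can be sketched as follows. This is a minimal sketch: the field names `text`, `clean`, and `label` are assumptions for illustration, not the datasets' actual column names.

```PYTHON
def unsmile_to_binary(row):
    """Collapse a Korean Unsmile row to the binary scheme: 1 (not bad sentence)
    only when the assumed 'clean' flag is set, otherwise 0 (bad sentence)."""
    return 1 if row.get("clean", 0) == 1 else 0

def build_training_data(unsmile_rows, hatespeech_rows):
    """Relabel the Unsmile rows, then append only the label-1 (not bad sentence)
    rows from the Korean HateSpeech Dataset."""
    data = [(r["text"], unsmile_to_binary(r)) for r in unsmile_rows]
    data += [(r["text"], 1) for r in hatespeech_rows if r["label"] == 1]
    return data
```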
## Model Training
* Fine-tuning was performed with ElectraForSequenceClassification from huggingface transformers.
* Three publicly available Korean ELECTRA models were each fine-tuned.
### models used
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
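Loading one of these backbones for binary fine-tuning might look like the sketch below. The `--model_type` to checkpoint mapping and the exact hub names are assumptions, not something the repository specifies.

```PYTHON
def resolve_backbone(model_type):
    """Hypothetical mapping of --model_type to a Hugging Face checkpoint."""
    names = {
        0: "beomi/KcELECTRA-base",
        1: "monologg/koelectra-base-v3-discriminator",
        2: "tunib/electra-ko-base",
    }
    return names[model_type]

if __name__ == "__main__":
    # Requires `transformers`; downloads weights on first run.
    from transformers import ElectraForSequenceClassification

    model = ElectraForSequenceClassification.from_pretrained(
        resolve_backbone(0), num_labels=2  # binary: 0 bad / 1 not bad
    )
```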
### how to train?
```BASH
python codes/model_source/train_torch_sch.py \
--learning_rate=3e-06 \
--use_float_16=True \
--weight_decay=0.001 \
--base_ckpt_save_path=BASE_CKPT_SAVE_PATH \
--epochs=10 \
--batch_size=128 \
--model_type=MODEL_TYPE
```
### parameters
| parameter | type | description | default |
| ---------- | ---------- | ---------- | --------- |
| learning_rate | float | learning rate used for training | 5e-05 |
| use_float_16 | bool | whether to train with float16 (mixed precision) | False |
| weight_decay | float | weight decay lambda | None |
| base_ckpt_save_path | str | base path where trained checkpoints are saved | None |
| epochs | int | number of training epochs | 5 |
| batch_size | int | batch size used during training | 64 |
| model_type | int | selects which of the three ELECTRA models to train | 0 |
```
NOTE) The train and validation datasets can be specified in the config section inside train_torch_sch.py.
```
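For reference, the flag set in the parameter table could be parsed roughly like this. This is a sketch of the assumed command-line interface, with defaults taken from the table; the actual script's implementation may differ.

```PYTHON
import argparse

def build_parser():
    """Argument parser mirroring the parameter table (defaults from the table)."""
    p = argparse.ArgumentParser(
        description="Fine-tune a Korean ELECTRA bad-sentence classifier"
    )
    p.add_argument("--learning_rate", type=float, default=5e-05)
    # bool("False") is True in Python, so parse the string explicitly
    p.add_argument("--use_float_16", type=lambda s: s.lower() == "true", default=False)
    p.add_argument("--weight_decay", type=float, default=None)
    p.add_argument("--base_ckpt_save_path", type=str, default=None)
    p.add_argument("--epochs", type=int, default=5)
    p.add_argument("--batch_size", type=int, default=64)
    p.add_argument("--model_type", type=int, default=0)
    return p

args = build_parser().parse_args(
    ["--learning_rate=3e-06", "--use_float_16=True", "--batch_size=128"]
)
print(args.learning_rate, args.use_float_16, args.batch_size)
```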
<br>
## How to use model?
```PYTHON
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model = AutoModelForSequenceClassification.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
```
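A full prediction pass might then look like the sketch below. The label mapping follows the data label section above; `pick_label` is a helper introduced here for illustration, not part of the repository.

```PYTHON
def pick_label(logits):
    """Map a pair of logits to the README's label scheme (0: bad, 1: not bad)."""
    labels = {0: "bad sentence", 1: "not bad sentence"}
    return labels[max(range(len(logits)), key=lambda i: logits[i])]

if __name__ == "__main__":
    # Requires `torch` and `transformers`; downloads the checkpoint on first run.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    name = "JminJ/kcElectra_base_Bad_Sentence_Classifier"
    model = AutoModelForSequenceClassification.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)

    inputs = tokenizer("반갑습니다. JminJ입니다!", return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    print(pick_label(logits))
```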
<br>
## Predict model
You can enter any sentence you want to test and run predict on it.
```BASH
python codes/model_source/utils/predict.py \
--input_text=INPUT_TEXT \
--base_ckpt=BASE_CKPT
```
### parameters
| parameter | type | description | default |
| ---------- | ---------- | ---------- | --------- |
| input_text | str | user input text | "반갑습니다. JminJ입니다!" |
| base_ckpt | str | base path where trained checkpoints are saved | False |
<br>
## Model Valid Accuracy
| model | accuracy |
| ---------- | ---------- |
| kcElectra_base_fp16_wd_custom_dataset | 0.8849 |
| tunibElectra_base_fp16_wd_custom_dataset | 0.8726 |
| koElectra_base_fp16_wd_custom_dataset | 0.8434 |
```
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).
```
<br>
## Contact
* [email protected]
<br><br>
## Github
* https://github.com/JminJ/Bad_text_classifier
<br><br>
## Reference
* [Beomi/KcELECTRA](https://github.com/Beomi/KcELECTRA)
* [monologg/koELECTRA](https://github.com/monologg/KoELECTRA)
* [tunib/electra-ko-base](https://huggingface.co/tunib/electra-ko-base)
* [smilegate-ai/Korean Unsmile Dataset](https://github.com/smilegate-ai/korean_unsmile_dataset)
* [kocohub/Korean HateSpeech Dataset](https://github.com/kocohub/korean-hate-speech)
* [ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators](https://arxiv.org/abs/2003.10555)