
Bad_text_classifier

Model Introduction

์ธํ„ฐ๋„ท ์ƒ์— ํผ์ ธ์žˆ๋Š” ์—ฌ๋Ÿฌ ๋Œ“๊ธ€, ์ฑ„ํŒ…์ด ๋ฏผ๊ฐํ•œ ๋‚ด์šฉ์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ํŒ๋ณ„ํ•˜๋Š” ๋ชจ๋ธ์„ ๊ณต๊ฐœํ•ฉ๋‹ˆ๋‹ค. ํ•ด๋‹น ๋ชจ๋ธ์€ ๊ณต๊ฐœ๋ฐ์ดํ„ฐ๋ฅผ ์‚ฌ์šฉํ•ด label์„ ์ˆ˜์ •ํ•˜๊ณ  ๋ฐ์ดํ„ฐ๋“ค์„ ํ•ฉ์ณ ๊ตฌ์„ฑํ•ด finetuning์„ ์ง„ํ–‰ํ•˜์˜€์Šต๋‹ˆ๋‹ค. ํ•ด๋‹น ๋ชจ๋ธ์ด ์–ธ์ œ๋‚˜ ๋ชจ๋“  ๋ฌธ์žฅ์„ ์ •ํ™•ํžˆ ํŒ๋‹จ์ด ๊ฐ€๋Šฅํ•œ ๊ฒƒ์€ ์•„๋‹ˆ๋ผ๋Š” ์  ์–‘ํ•ดํ•ด ์ฃผ์‹œ๋ฉด ๊ฐ์‚ฌ๋“œ๋ฆฌ๊ฒ ์Šต๋‹ˆ๋‹ค.

NOTE)
Due to copyright issues with the source public datasets, the modified data used for training cannot be released.
Please also note that the model's judgments are unrelated to my personal opinions.

Dataset

data label

  • 0 : bad sentence
  • 1 : not bad sentence

Datasets used

  • Korean Unsmile Dataset
  • Korean HateSpeech Dataset

Dataset processing

๊ธฐ์กด ์ด์ง„ ๋ถ„๋ฅ˜๊ฐ€ ์•„๋‹ˆ์˜€๋˜ ๋‘ ๋ฐ์ดํ„ฐ๋ฅผ ์ด์ง„ ๋ถ„๋ฅ˜ ํ˜•ํƒœ๋กœ labeling์„ ๋‹ค์‹œ ํ•ด์ค€ ๋’ค, Korean HateSpeech Dataset์ค‘ label 1(not bad sentence)๋งŒ์„ ์ถ”๋ ค ๊ฐ€๊ณต๋œ Korean Unsmile Dataset์— ํ•ฉ์ณ ์ฃผ์—ˆ์Šต๋‹ˆ๋‹ค.

Some of the samples labeled as clean in the Korean Unsmile Dataset were re-labeled as 0 (bad sentence); a small sketch of these steps follows the list below.

  • "~๋…ธ"๊ฐ€ ํฌํ•จ๋œ ๋ฌธ์žฅ ์ค‘, "์ด๊ธฐ", "๋…ธ๋ฌด"๊ฐ€ ํฌํ•จ๋œ ๋ฐ์ดํ„ฐ๋Š” 0 (bad sentence)์œผ๋กœ ์ˆ˜์ •
  • "์ข†", "๋ดŠ" ๋“ฑ ์„ฑ ๊ด€๋ จ ๋‰˜์•™์Šค๊ฐ€ ํฌํ•จ๋œ ๋ฐ์ดํ„ฐ๋Š” 0 (bad sentence)์œผ๋กœ ์ˆ˜์ •

Model Training

  • Fine-tuning was performed with huggingface transformers' ElectraForSequenceClassification.
  • Three publicly available Korean Electra models were each fine-tuned separately.

Models used
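
The original list of base checkpoints is not reproduced here. Based on the model names in the accuracy table below (kcElectra, tunibElectra, koElectra), the sketch below shows how a base model could be selected and loaded; the Hub IDs are assumptions, not confirmed to be the exact checkpoints used.

from transformers import AutoTokenizer, ElectraForSequenceClassification

# Hypothetical Hub IDs for the three Korean Electra variants; shown only to
# illustrate how --model_type could select a base checkpoint.
MODEL_CANDIDATES = [
    "beomi/KcELECTRA-base",                      # kcElectra
    "tunib/electra-ko-base",                     # tunibElectra
    "monologg/koelectra-base-v3-discriminator",  # koElectra
]

model_type = 0  # mirrors the --model_type training argument
base_ckpt = MODEL_CANDIDATES[model_type]

tokenizer = AutoTokenizer.from_pretrained(base_ckpt)
# Binary classification head: label 0 = bad sentence, label 1 = not bad sentence.
model = ElectraForSequenceClassification.from_pretrained(base_ckpt, num_labels=2)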

How to train?

python codes/model_source/train_torch_sch.py \
    --learning_rate=3e-06 \
    --use_float_16=True \
    --weight_decay=0.001 \
    --base_save_ckpt_path=BASE_SAVE_CKPT_PATH \
    --epochs=10 \
    --batch_size=128 \
    --model_type=MODEL_TYPE

parameters

| parameter | type | description | default |
|---|---|---|---|
| learning_rate | float | learning rate used for training | 5e-05 |
| use_float_16 | bool | whether to train with float16 (mixed precision) | False |
| weight_decay | float | weight decay lambda | None |
| base_ckpt_save_path | str | base path where trained checkpoints will be saved | None |
| epochs | int | total number of training epochs | 5 |
| batch_size | int | batch size used during training | 64 |
| model_type | int | selects which Electra model to use for training | 0 |
NOTE) The train and valid datasets can be specified in the config section inside train_torch_sch.py.

How to use model?

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
tokenizer = AutoTokenizer.from_pretrained('JminJ/kcElectra_base_Bad_Sentence_Classifier')
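
Continuing from the snippet above, the short inference sketch below (not part of the original README) scores a single sentence; the example text is the default value from the predict parameters further down.

import torch

text = "๋ฐ˜๊ฐ‘์Šต๋‹ˆ๋‹ค. JminJ์ž…๋‹ˆ๋‹ค!"
inputs = tokenizer(text, return_tensors="pt", truncation=True)

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
# 0: bad sentence, 1: not bad sentence (see the label definitions above)
print("bad sentence" if pred == 0 else "not bad sentence")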

Predict model

์‚ฌ์šฉ์ž๊ฐ€ ํ…Œ์ŠคํŠธ ํ•ด๋ณด๊ณ  ์‹ถ์€ ๋ฌธ์žฅ์„ ๋„ฃ์–ด predict๋ฅผ ์ˆ˜ํ–‰ํ•ด ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

python codes/model_source/utils/predict.py \
    --input_text=INPUT_TEXT \
    --base_ckpt=BASE_CKPT

parameters

| parameter | type | description | default |
|---|---|---|---|
| input_text | str | user input text | "๋ฐ˜๊ฐ‘์Šต๋‹ˆ๋‹ค. JminJ์ž…๋‹ˆ๋‹ค!" |
| base_ckpt | str | base path where trained checkpoints are saved | False |

Model Valid Accuracy

| model | accuracy |
|---|---|
| kcElectra_base_fp16_wd_custom_dataset | 0.8849 |
| tunibElectra_base_fp16_wd_custom_dataset | 0.8726 |
| koElectra_base_fp16_wd_custom_dataset | 0.8434 |
Note)
All models were trained with the same seed, learning_rate (3e-06), weight_decay lambda (0.001), and batch_size (128).

Contact

Github

Reference