---
license: mit
base_model:
- unsloth/gemma-2-9b-bnb-4bit
pipeline_tag: text2text-generation
---
# Introduction
**Reverse Dictionary**   
A reverse dictionary works in the opposite direction of an ordinary dictionary: instead of returning the meaning when you enter a word, it returns the words that correspond to a meaning when you enter a sentence.

I used the [μš°λ¦¬λ§μƒ˜](https://github.com/songys/Dictionaries) dataset, which contains rich information for each entry, such as the word, its meanings, its part of speech, synonyms, and example sentences.
Only the words and their meanings were extracted to fit the model's input structure.
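As a rough sketch, the preprocessing could look like the following (the field names `word` and `definition` are assumptions for illustration, not the dataset's actual schema):

```python
# Minimal preprocessing sketch: reduce each dictionary entry to a
# (definition -> word) pair for the reverse-dictionary task.
# Field names ("word", "definition") are assumptions, not the real schema.

def build_example(entry: dict) -> dict:
    return {
        "instruction": "λ‹€μŒ 뜻에 ν•΄λ‹Ήν•˜λŠ” 단어λ₯Ό λ‹΅ν•˜μ„Έμš”.",  # "Answer with the word for this definition."
        "input": entry["definition"],   # the meaning, used as model input
        "output": entry["word"],        # the word, used as the target
    }

raw_entries = [
    {"word": "사과", "definition": "μ‚¬κ³Όλ‚˜λ¬΄μ˜ 열맀.",   # apple: "the fruit of the apple tree"
     "pos": "λͺ…사", "synonyms": [], "examples": []},      # extra fields are dropped
]
examples = [build_example(e) for e in raw_entries]
```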

# The process of model training
Because I worked in a Colab environment, I used [Unsloth](https://github.com/unslothai/unsloth?tab=readme-ov-file), a fine-tuning optimization library that is useful when GPU resources are limited.   

Among the models supported by Unsloth, I used the **gemma-2-9b-bnb-4bit** model. It is 4-bit quantized, and I trained it after modifying the training parameters. However, evaluation could not be performed during training due to out-of-memory errors, so the entire dataset was used for training.   
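
The overall setup follows the standard Unsloth fine-tuning pattern. As a minimal sketch (the hyperparameters and prompt template here are illustrative assumptions, not necessarily the ones actually used; see the repository linked below for the exact configuration):

```python
# Fine-tuning sketch following the standard Unsloth pattern.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",  # 4-bit quantized base model
    max_seq_length=1024,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the parameters is
# trained, which keeps memory usage within Colab limits.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
)

def to_text(ex: dict) -> dict:
    # Collapse each (definition -> word) pair into one training string;
    # the prompt template is an assumption for illustration.
    return {"text": f"{ex['instruction']}\n뜻: {ex['input']}\n단어: {ex['output']}"}

train_ds = Dataset.from_list(examples).map(to_text)  # `examples` from the sketch above

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    dataset_text_field="text",
    max_seq_length=1024,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        fp16=True,
        output_dir="outputs",
        # No eval_dataset: evaluation ran out of memory, so the whole
        # dataset was used for training only.
    ),
)
trainer.train()
```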

You can find the detailed code in the GitHub repository linked below.

# Result
An example inference is as follows:
![image/png](https://cdn-uploads.huggingface.co/production/uploads/66e500f67d43e55cfdf656af/DyxEwO41R66TKC3gn26zB.png)
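
As a hedged sketch, such an inference call could look like this (the prompt template reuses the illustrative one from the training sketch above and is an assumption; the screenshot shows the actual format):

```python
# Inference sketch; the prompt template is the illustrative one from
# the training sketch above, not necessarily the exact format used.
from unsloth import FastLanguageModel

FastLanguageModel.for_inference(model)  # switch Unsloth to fast generation mode

def generate_word(definition: str) -> str:
    prompt = f"λ‹€μŒ 뜻에 ν•΄λ‹Ήν•˜λŠ” 단어λ₯Ό λ‹΅ν•˜μ„Έμš”.\n뜻: {definition}\n단어:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=16)
    text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return text.split("단어:")[-1].strip()  # keep only the generated word

print(generate_word("μ‚¬κ³Όλ‚˜λ¬΄μ˜ 열맀."))  # expected: 사과 (apple)
```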

First, we tested 10 simple words. Each line below shows the target word followed by the Korean definition given as input (English glosses of the target words added in parentheses):
```
λΉ„ν–‰κΈ° (airplane) - 동λ ₯으둜 ν”„λ‘œνŽ λŸ¬λ₯Ό λŒλ¦¬κ±°λ‚˜ μ—°μ†Œ κ°€μŠ€λ₯Ό λ‚΄λΏœλŠ” νž˜μ— μ˜ν•˜μ—¬ μƒκΈ°λŠ” μ–‘λ ₯(ζšεŠ›)을 μ΄μš©ν•˜μ—¬ κ³΅μ€‘μœΌλ‘œ λ– μ„œ λ‚ μ•„λ‹€λ‹ˆλŠ” 항곡기

κ°€λ°© (bag) - 물건을 λ„£μ–΄ λ“€κ±°λ‚˜ λ©”κ³  닀닐 수 있게 λ§Œλ“  용ꡬ

고양이 (cat) - κ³ μ–‘μž‡κ³Όμ˜ ν•˜λ‚˜. μ›λž˜ μ•„ν”„λ¦¬μΉ΄μ˜ 리비아살쾑이λ₯Ό 길듀인 κ²ƒμœΌλ‘œ, ν„±κ³Ό μ†‘κ³³λ‹ˆκ°€ 특히 λ°œλ‹¬ν•΄μ„œ μœ‘μ‹μ„ 주둜 ν•œλ‹€. λ°œν†±μ€ 자유둭게 κ°μΆ”κ±°λ‚˜ λ“œλŸ¬λ‚Ό 수 있으며, λˆˆμ€ μ–΄λ‘μš΄ κ³³μ—μ„œλ„ 잘 λ³Ό 수 μžˆλ‹€. μ• μ™„λ™λ¬Όλ‘œλ„ μœ‘μ’…ν•˜μ—¬ μ—¬λŸ¬ ν’ˆμ’…μ΄ μžˆλ‹€.

μ˜ν™” (movie) - μΌμ •ν•œ 의미λ₯Ό κ°–κ³  μ›€μ§μ΄λŠ” λŒ€μƒμ„ μ΄¬μ˜ν•˜μ—¬ μ˜μ‚¬κΈ°λ‘œ μ˜μ‚¬λ§‰μ— μž¬ν˜„ν•˜λŠ” μ’…ν•© 예술.

μžλ™μ°¨ (car) - 원동기λ₯Ό μž₯μΉ˜ν•˜μ—¬ κ·Έ 동λ ₯으둜 바퀴λ₯Ό κ΅΄λ €μ„œ μ² κΈΈμ΄λ‚˜ κ°€μ„€λœ 선에 μ˜ν•˜μ§€ μ•„λ‹ˆν•˜κ³  λ•… μœ„λ₯Ό 움직이도둝 λ§Œλ“  μ°¨. 승용차, μŠΉν•©μžλ™μ°¨, ν™”λ¬Ό μžλ™μ°¨, 특수 μžλ™μ°¨ 및 이λ₯œμžλ™μ°¨κ°€ μžˆλ‹€.

λ°”λ‚˜λ‚˜ (banana) - 파초과의 상둝 μ—¬λŸ¬ν•΄μ‚΄μ΄ν’€. λ†’μ΄λŠ” 3~10미터이며, λ•…μ†μ˜ μ•Œμ€„κΈ°μ—μ„œ 죽순 λͺ¨μ–‘μ˜ 싹이 λ‚˜μ™€ κΈ΄ νƒ€μ›ν˜•μ˜ 녹색 잎이 8~10κ°œκ°€ λ­‰μ³λ‚˜κ³ , κΈ΄ μžŽκΉμ§€κ°€ μ„œλ‘œ 겹쳐 헛쀄기λ₯Ό μ΄λ£¨λ©΄μ„œ μžλž€λ‹€. μ΄ˆμ—¬λ¦„μ— μ»€λ‹€λž€ 꽃쀄기가 λ‚˜μ™€ 엷은 λˆ„λŸ°μƒ‰μ˜ μž”κ½ƒμ΄ 이삭 λͺ¨μ–‘μœΌλ‘œ ν”Όκ³ , μ—΄λ§€λŠ” μ‹μš©ν•œλ‹€. μ—΄λŒ€ 지방이 μ›μ‚°μ§€λ‘œ μš°λ¦¬λ‚˜λΌμ—μ„œλŠ” μ˜¨μ‹€μ—μ„œ μž¬λ°°ν•œλ‹€.

컴퓨터 (computer) - μ „μž 회둜λ₯Ό μ΄μš©ν•œ κ³ μ†μ˜ μžλ™ 계산기. 숫자 계산, μžλ™ μ œμ–΄, 데이터 처리, 사무 관리, μ–Έμ–΄λ‚˜ μ˜μƒ 정보 처리 λ”°μœ„μ— κ΄‘λ²”μœ„ν•˜κ²Œ μ΄μš©λœλ‹€.

사과 (apple) - μ‚¬κ³Όλ‚˜λ¬΄μ˜ 열맀.

μ±… (book) - 쒅이λ₯Ό μ—¬λŸ¬ μž₯ λ¬Άμ–΄ 맨 물건.

학ꡐ (school) - μΌμ •ν•œ λͺ©μ γ†κ΅κ³Ό κ³Όμ •γ†μ„€λΉ„γ†μ œλ„ 및 λ²•κ·œμ— μ˜ν•˜μ—¬ κ³„μ†μ μœΌλ‘œ ν•™μƒμ—κ²Œ κ΅μœ‘μ„ μ‹€μ‹œν•˜λŠ” κΈ°κ΄€.
```

The model guessed **7 out of 10 words** correctly, and 2 of the remaining words were answered with similar words.   
10% of the dataset was held out as a test set.
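
As a hypothetical illustration of how such an exact-match score could be computed (reusing the `generate_word` helper sketched above):

```python
# Exact-match accuracy sketch over held-out (definition, word) pairs.
def accuracy(test_pairs: list[tuple[str, str]]) -> float:
    correct = sum(generate_word(d).strip() == w for d, w in test_pairs)
    return correct / len(test_pairs)

test_pairs = [("μ‚¬κ³Όλ‚˜λ¬΄μ˜ 열맀.", "사과")]  # apple, as a tiny example
print(f"exact-match accuracy: {accuracy(test_pairs):.0%}")
```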

# References
https://github.com/teddylee777/langchain-kr/tree/main/18-FineTuning

---
**If you want to see more:**      
- GitHub : https://github.com/hyunjin-C/gemma-sprint    
- Blog : https://velog.io/@hyunjin-c/Gemma-Sprint-Gemma-2-9b-Finetuning