
Experimental-Neo-TinyStories-Korean-800K-20240819

A new model trained on new datasets while retaining an almost identical architecture to the previous version; the only architectural change is that the context length has been increased to 1024 tokens.

  • Architecture: Llama
  • Vocab size: 4096
  • Hidden size: 64
  • Layers: 5
  • Heads: 8 (MHA)
  • Context length: up to 1024 tokens
  • Parameters: 791k (F32, Safetensors)
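
For reference, below is a minimal sketch of a Transformers LlamaConfig matching the numbers above. Only the values listed in this card are taken from it; intermediate_size and any other fields are assumptions and may differ from the released checkpoint.

from transformers import LlamaConfig

# Hypothetical config mirroring the card's listed hyperparameters; fields not
# listed in the card (e.g. intermediate_size) are assumed values.
config = LlamaConfig(
    vocab_size=4096,
    hidden_size=64,
    num_hidden_layers=5,
    num_attention_heads=8,
    num_key_value_heads=8,         # MHA: key/value heads equal attention heads
    max_position_embeddings=1024,  # context length
    intermediate_size=256,         # assumed, roughly 4x the hidden size
)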

Improvements

Even though this model is exceptionally small for a language model, it generates noticeably more accurate and logical sentences than its predecessor. These improvements come from training on very simple stories generated by more powerful language models, inspired by the TinyStories paper. Instead of using a dataset translated from English, a much higher-quality dataset was obtained by generating new synthetic data following the methodology outlined in the paper.

This model was intentionally kept the same size as its predecessor to demonstrate the impact of dataset quality on model performance. In fact, despite being trained on only about 10% of the tokens used by the previous model, it exhibits significantly better performance. The dataset will be released along with a larger version of the Neo-TinyStories-Korean model once its creation and validation are complete.
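
To make the data-generation step concrete, here is a minimal sketch of the general TinyStories-style recipe (asking a stronger model for very simple stories). The client, model name, and prompt below are placeholders, not the actual setup used to build this dataset.

from openai import OpenAI  # assumed client; any capable instruction-following model works

client = OpenAI()

# Placeholder prompt in the spirit of the TinyStories recipe: request very
# short Korean stories restricted to words a young child would know.
PROMPT = (
    "Write a short story in Korean, 3 to 5 sentences long, using only very "
    "simple words that a 3- to 4-year-old child would understand."
)

def generate_story() -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": PROMPT}],
        temperature=1.0,
    )
    return response.choices[0].message.content

stories = [generate_story() for _ in range(10)]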

Generation Examples

Output of this model when given only a single <s> token:

ν‘Έλ₯΄λ₯Έ ν•˜λŠ˜ μ•„λž˜ λˆˆλΆ€μ‹  햇살이 κ°€λ“ν•œ λ‚ μ΄μ—ˆμ–΄μš”. μ•„κΈ° 곰은 μ—„λ§ˆ κ³°κ³Ό ν•¨κ»˜ 숲으둜 λ†€λŸ¬ κ°”μ–΄μš”. 숲 μ†μ—λŠ” 예쁜 꽃듀이 ν”Όμ–΄ μžˆμ—ˆκ³ , μ•„κΈ° 곰은 신이 λ‚˜μ„œ 꽃을 κ΅¬κ²½ν–ˆμ–΄μš”. μ•„κΈ° 곰은 꽃을 보며 신이 λ‚¬μ–΄μš”. κ°‘μžκΈ° μ•„κΈ° 곰은 μ—„λ§ˆ κ³°μ—κ²Œ 달렀가 "μ—„λ§ˆ, μ € 꽃 μ˜ˆλ»μš”!"라고 λ§ν–ˆμ–΄μš”. μ—„λ§ˆ 곰은 μ•„κΈ° κ³°μ—κ²Œ "그래, 예쁜 꽃이야!"라고 λ§ν•˜λ©° μ•„κΈ° 곰을 κΌ­ μ•ˆμ•„ μ˜¬λ Έμ–΄μš”. μ•„κΈ° 곰은 μ—„λ§ˆ 곰의 말을 λ“£κ³  꽃을 꺾지 μ•Šκ³  예쁘게 λ°”λΌλ³΄λŠ” 것을 λ°°μ› μ–΄μš”. μ•„κΈ° 곰은 μ—„λ§ˆ 곰의 말을 잘 λ“£λŠ” 것이 μ€‘μš”ν•˜λ‹€λŠ” 것을 μ•Œμ•˜μ–΄μš”.

Output of the previous model when given only a single <s> token:

μ˜›λ‚  μ˜›λ‚ , 큰 μˆ²μ†μ— 큰 λ‚˜λ¬΄λ“€μ΄ μž‘μ€ μˆ²μ— μ‚΄κ³  μžˆμ—ˆμŠ΅λ‹ˆλ‹€. κ·Έ λ‚˜λ¬΄λ“€μ€ 맀우 ν–‰λ³΅ν–ˆμŠ΅λ‹ˆλ‹€. μ–΄λŠ λ‚ , μž‘μ€ μƒˆλ“€μ΄ κ·Έ λ‚˜λ¬΄μ— μ™”μ–΄μš”. κ·Έ λ‚˜λ¬΄λŠ” μž‘μ€ μƒˆλ₯Ό 보고 맀우 κΈ°λ»ν–ˆμŠ΅λ‹ˆλ‹€. "μ•ˆλ…•, μž‘μ€ μƒˆμ•Ό! μ–΄λ–»κ²Œ ν•΄μ•Ό?"라고 μƒˆκ°€ λ§ν–ˆμŠ΅λ‹ˆλ‹€. "λ‚˜λŠ” 잎이 μ—†μ–΄μ„œ κ·Έλž˜μš”." μž‘μ€ μƒˆλŠ” "μ €λŠ” 크고 κ°•ν•΄μ§€λŠ”λ°, μ €λŠ” 정말 κ°•ν•΄μš”."라고 λ§ν–ˆμŠ΅λ‹ˆλ‹€. μž‘μ€ μƒˆλŠ” λ‚ μ•„μ˜¬λΌ μžŽμ„ λ¨Ήκ³  μ‹Άμ–΄ ν–ˆμŠ΅λ‹ˆλ‹€. μƒˆλŠ” μžŽμ„ 작고 μ‹Άμ–΄ ν–ˆμŠ΅λ‹ˆλ‹€. μž‘μ€ μƒˆλŠ” μžŽμ„ μ°Ύμ•„μ„œ μžŽμ„ μ°Ύμ•„ λ‚˜μ„°μŠ΅λ‹ˆλ‹€. λ§ˆμΉ¨λ‚΄, μž‘μ€ μƒˆλŠ” μžŽμ„ μ°Ύμ•„ 큰 μžŽμ„ μ°Ύμ•˜μŠ΅λ‹ˆλ‹€. μž‘μ€ μƒˆλŠ” μžŽμ‚¬κ·€λ“€μ„ μ°Ύμ•„μ„œ μžŽμ„ μ°Ύμ•„λ‚΄μ–΄ μžŽμ„ λ½‘μ•„λƒˆμŠ΅λ‹ˆλ‹€. μƒˆλŠ” μžŽμ„ λ”°μ„œ μžŽμ„ λ‹€μ‹œ λ½‘μ•„λƒˆμŠ΅λ‹ˆλ‹€. μž‘μ€ μƒˆλŠ” 맀우 ν–‰λ³΅ν–ˆμŠ΅λ‹ˆλ‹€. 그듀은 쒅일 잎으둜 λ†€μ•˜κ³ , κ·Έ λ‚  μ΄ν›„λ‘œ 그듀은 항상 ν•¨κ»˜ λ†€μ•˜μŠ΅λ‹ˆλ‹€.

Output of this model when given the start sequence "λ¨Ό μ˜›λ‚ , 고양이":

λ¨Ό μ˜›λ‚ , 고양이 ν•œ λ§ˆλ¦¬κ°€ μ‚΄μ•˜μŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” λ°₯을 λ¨Ήκ³  μ‹Άμ—ˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” λ°₯을 μ°Ύμ•„ 숲 μ†μœΌλ‘œ λ“€μ–΄κ°”μŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” λ§›μžˆλŠ” 과일을 μ°Ύμ•˜μŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” 과일을 λ¨Ήκ³  μ‹Άμ—ˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” 과일을 먹으렀고 ν–ˆμ§€λ§Œ, κ³ μ–‘μ΄λŠ” 배에 올라갈 수 μ—†μ—ˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” μŠ¬νŽμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” λ°₯을 λͺ» λ¨Ήμ—ˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” μŠ¬νŽμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” μ—„λ§ˆμ—κ²Œ λ§ν–ˆμŠ΅λ‹ˆλ‹€. "μ—„λ§ˆ, λ°₯이 μ—†μ–΄μš”!" κ³ μ–‘μ΄λŠ” μŠ¬νŽμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” μ—„λ§ˆμ—κ²Œ λ°₯을 μ£Όμ—ˆμŠ΅λ‹ˆλ‹€. μ—„λ§ˆλŠ” κ³ μ–‘μ΄μ—κ²Œ λ°₯을 μ£Όμ—ˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” λ°₯을 λ§›μžˆκ²Œ λ¨Ήμ—ˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” μ—„λ§ˆμ—κ²Œ 고맙닀고 λ§ν–ˆμŠ΅λ‹ˆλ‹€. κ³ μ–‘μ΄λŠ” μ—„λ§ˆ 말을 잘 λ“€μ—ˆμŠ΅λ‹ˆλ‹€.

Output of the previous model when given the start sequence "λ¨Ό μ˜›λ‚ , 고양이":

λ¨Ό μ˜›λ‚ , 고양이, 보라색인 μƒˆ, 톰은 κ³΅μ›μ—μ„œ 놀고 μžˆμ—ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” 큰 λ‚˜λ¬΄λ₯Ό 보고 μ˜¬λΌκ°€κ³  μ‹Άμ–΄ ν–ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” μ˜¬λΌκ°€κ³  μ˜¬λΌκ°€κΈ° μ‹œμž‘ν–ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” μ˜¬λΌκ°€μ„œ λ‚˜λ¬΄λ₯Ό 였λ₯΄κΈ° μ‹œμž‘ν–ˆμŠ΅λ‹ˆλ‹€. 톰이 올라 κ°€λ©΄μ„œ κ·ΈλŠ” 큰 λ‚˜λ¬΄λ₯Ό λ³΄μ•˜μŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄λ₯Ό 였λ₯΄κΈ° μ‹œμž‘ν–ˆμŠ΅λ‹ˆλ‹€. 톰은 λ‚˜λ¬΄λ₯Ό 였λ₯΄κΈ° μ‹œμž‘ν–ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” 손을 μž‘μ•˜μŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄ κΌ­λŒ€κΈ°κΉŒμ§€ μ˜¬λΌκ°”μŠ΅λ‹ˆλ‹€. κ·Έ λ‚˜λ¬΄ κΌ­λŒ€κΈ°μ— λ„λ‹¬ν–ˆμŠ΅λ‹ˆλ‹€. 톰은 맀우 ν–‰λ³΅ν–ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄ μ•„λž˜ 큰 λ‚˜λ¬΄λ₯Ό λ³΄μ•˜μŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄μ—μ„œ λ‚΄λ €μ˜€λ©° λ‚˜λ¬΄λ₯Ό 였λ₯Ό 수 μžˆμ—ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄λ₯Ό 였λ₯΄κ³  μ‹Άμ—ˆμŠ΅λ‹ˆλ‹€. ν•˜μ§€λ§Œ 톰은 λ‚˜λ¬΄λ₯Ό 였λ₯΄κΈ° μ‹œμž‘ν–ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” 큰 λ‚˜λ¬΄λ₯Ό 였λ₯΄κΈ° μœ„ν•΄ μ˜¬λΌκ°€ μ˜¬λΌκ°”μŠ΅λ‹ˆλ‹€. 톰은 λ‚˜λ¬΄ κΌ­λŒ€κΈ°μ— λ„λ‹¬ν–ˆμŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄λ₯Ό 였λ₯΄λŠ” 것이 λ„ˆλ¬΄ κΈ°λ»€μŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄ κΌ­λŒ€κΈ°κΉŒμ§€ λ„λ‹¬ν–ˆμŠ΅λ‹ˆλ‹€. λ‚˜λ¬΄ κΌ­λŒ€κΈ°μ—λŠ” 큰 λ‚˜λ¬΄κ°€ μžˆλŠ” 것을 λ³΄μ•˜μŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄ μœ„λ‘œ μ˜¬λΌκ°”κ³  λ‚˜λ¬΄λ₯Ό 였λ₯΄κΈ° μ‹œμž‘ν–ˆμŠ΅λ‹ˆλ‹€. λ‚˜λ¬΄λŠ” λ‚˜λ¬΄ κΌ­λŒ€κΈ°μ— μ˜¬λΌκ°”μŠ΅λ‹ˆλ‹€. 톰은 맀우 κΈ°λ»€μŠ΅λ‹ˆλ‹€. κ·ΈλŠ” λ‚˜λ¬΄μ˜ λͺ¨λ“  μΉœκ΅¬λ“€μ„ λ°©λ¬Έν•΄ μ˜€λž«λ™μ•ˆ λ‚΄λ €μ™”μŠ΅λ‹ˆλ‹€. 그듀은 ν•¨κ»˜ λ§Žμ€ 재미λ₯Ό λŠκΌˆμŠ΅λ‹ˆλ‹€.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained('north-wind/Experimental-Neo-TinyStories-Korean-800K-20240819')
tokenizer = AutoTokenizer.from_pretrained('north-wind/Experimental-Neo-TinyStories-Korean-800K-20240819')

# An empty input corresponds to generating from a single <s> (BOS) token
input_text = ''
input_ids = tokenizer(input_text, return_tensors='pt').input_ids

# Sample up to the full 1024-token context
output = model.generate(input_ids, max_length=1024, do_sample=True, temperature=0.5)
print(tokenizer.decode(output[0]))
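
To reproduce the prompted examples above, set input_text to the desired start sequence; skip_special_tokens only hides the <s> marker in the printed text.

# Prompted generation, mirroring the start-sequence examples above
input_text = 'λ¨Ό μ˜›λ‚ , 고양이'
input_ids = tokenizer(input_text, return_tensors='pt').input_ids

output = model.generate(input_ids, max_length=1024, do_sample=True, temperature=0.5)
print(tokenizer.decode(output[0], skip_special_tokens=True))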

Further plans

  • Generate more data to match the quantity of the original English TinyStories dataset
  • Apply a quality filter to the generated dataset
  • Train a larger model on the final dataset
  • Release the model and dataset