Curious to know how 8k context llama2 got trained on a 24GB GPU

#1
by Praveen92y - opened

Can you share the finetuning script? How did you get an 8k context length llama2 trained on a 24GB GPU?

I trained it on an RTX 6000 Ada, which has 48GB of VRAM. However, for this model I didn't actually perform any training at 8k context length (unlike the first airophin model); I started from another model checkpoint that had already been trained at that length.
As far as the finetuning script goes, it's basically a modified version of the QLoRA script from the original QLoRA paper. I have a version of it here (there are differences from what I used here; I may update it when I get a chance): https://github.com/bhenrym14/qlora-airoboros-longcontext/blob/main/qlora_airo.py
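For readers wondering how this fits in 24-48GB of VRAM: this is not the author's exact script, just a minimal sketch of the QLoRA-style setup it is based on (4-bit NF4 quantization via bitsandbytes plus LoRA adapters via peft). The checkpoint name, LoRA rank, and target_modules below are illustrative assumptions, not values taken from the repo.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base checkpoint in 4-bit NF4 (the core memory saving from QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",  # placeholder; the actual run started from a long-context checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach trainable low-rank adapters; only these weights are updated during finetuning.
lora_config = LoraConfig(
    r=64,                       # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative choice
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```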

Thanks @bhenrym14 for the clarification. Just wanted to confirm what you used as model_max_len in the mentioned script: 8192, or the default (i.e. 2048)? Also, can you confirm whether a similar script was used to finetune the first airophin model (bhenrym14/airophin-13b-pntk-16k-fp16)? If yes, what was the value of model_max_len there, and what GPU type and how many GPUs did you use?

This script relies on the RoPE monkey-patch to apply the desired interpolation factor (which I just hard-coded); this is because I wrote it before transformers had native support for RoPE scaling. So yes, for this model I did scale appropriately (a factor of 2) for 8192 context. Since transformers now has support, I generally edit the backbone config to include the desired scaling method and use model_max_len to control the maximum sequence length the model sees in training; this is simply so I can run larger batches without risking OOM on a couple of long samples in an otherwise shorter-sequence dataset.
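For anyone reading later, here is roughly what "edit the backbone config" looks like with the native transformers RoPE-scaling support. The checkpoint name is a placeholder, and "linear" is shown only as an example of a supported scaling type; the actual model may use a different method.

```python
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# Native RoPE scaling: interpolate positions by a factor of 2, stretching the
# 4096-token Llama 2 context window to an effective 8192 tokens.
config = AutoConfig.from_pretrained("meta-llama/Llama-2-13b-hf")  # placeholder checkpoint
config.rope_scaling = {"type": "linear", "factor": 2.0}           # "dynamic" is also supported
config.max_position_embeddings = 8192

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-13b-hf",
    config=config,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```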

For the airoboros finetune phase, I capped model_max_len at ~3000 (again, RoPE scaling is still set for 8192). I trained on a single GPU, an RTX 6000 Ada Generation.
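As a rough illustration of that cap (not the author's code): tokenizing the training data with a ~3000-token limit keeps per-batch memory bounded even though the model itself is configured for 8192. The checkpoint name and example text are placeholders.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # placeholder
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

examples = ["<instruction + response text>"]  # stand-in for the airoboros samples

# Truncate to ~3000 tokens: the model's RoPE scaling still targets 8192, but
# capping the training sequence length lets larger batches fit without risking OOM.
batch = tokenizer(
    examples,
    max_length=3000,
    truncation=True,
    padding=True,
    return_tensors="pt",
)
```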

bhenrym14 changed discussion status to closed
