great work!

#1 by MaziyarPanahi

This is great work! Given the limited number of A100/80GB GPUs and a run of only 100 minutes, it's very interesting! Just out of curiosity, did you use accelerate to launch axolotl and load the model on each GPU, or did you launch it with python and shard the model across all GPUs? (I can't find a way to use your config without getting OOM on my 4x A100/80GB.)
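To make the question concrete, the two launch styles I mean look roughly like this (the config file name is a placeholder, adjust to your setup):

```bash
# One process per GPU via accelerate (axolotl's usual multi-GPU path);
# `your_config.yml` stands in for the actual axolotl config.
accelerate launch -m axolotl.cli.train your_config.yml

# A single python process; any sharding of the model across GPUs then has to
# come from the training config itself (e.g. DeepSpeed/FSDP settings).
python -m axolotl.cli.train your_config.yml
```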

I have the exact same question. Have you gotten a better answer now, a couple of months later?

Yes, you can use DeepSpeed (zero2.json) or FSDP; both make this possible. Note that while the base model supports up to a 65k sequence length, this fine-tune maxes out at 2k. Clearly, much more compute is needed if one wants to go higher on sequence length.
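For example, a minimal DeepSpeed ZeRO-2 launch with axolotl looks roughly like this (the config name and the zero2.json path are placeholders; axolotl ships sample DeepSpeed configs you can point to, and FSDP is configured through the fsdp / fsdp_config keys in the axolotl YAML instead):

```bash
# Launch axolotl through accelerate and enable DeepSpeed ZeRO stage 2;
# the exact path to zero2.json depends on where it lives in your checkout.
accelerate launch -m axolotl.cli.train your_config.yml \
    --deepspeed deepspeed_configs/zero2.json
```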

Thank you for the answer. Have you tried fine-tuning with the full sequence length too? Just curious how much compute might be needed.

Unfortunately, I couldn't find a good SFT dataset with very long, high-quality text. That's the primary issue when it comes to long-context fine-tuning, and I am pretty sure it would require much more memory.

Yes, I'm guessing the equivalent of around 8x H100s :-)
