Need help debugging my training process

#37 opened by arthurcornelio88

Hello everyone,

A friend and I are fine-tuning the model on our own dataset. The job is too heavy for our PCs, so we moved to SageMaker. We have a few questions:

  1. First, is it normal for training to take 5 hours on an ml.g5.24xlarge instance? For testing we're using a very small dataset (ten audio files).
  2. Are all the demo files necessary? How can we better understand the parameters in demo_cfg?
  3. Is there any step we did that isn't necessary and that might be causing the heavy computation (batch sizes, GPUs, CUDA settings, etc.)?

We're attaching our whole training process below to help with the collective debugging.

a) the model architecture

archi-sagemaker.png

In Jupyter notebooks:

b) first imports

jupyter1-imports.png

c) model loading

jupyter2-modelloading.png

d) CUDA import and training command

jupyter3-cuda et prompt.png

OUR TRAINING COMMAND:

!python stable-audio-tools-sam/train.py --model-config stable_open_model_files/model_config.json --dataset-config stable_open_model_files/dataset_config.json --name rayan-training --save-dir checkpoints --pretrained-ckpt-path stable_open_model_files/model.safetensors --batch-size 16 --num-gpus 4 --strategy deepspeed
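
For context, here is the rough arithmetic behind the --batch-size and --num-gpus flags (just a back-of-envelope sketch, assuming the batch size is per GPU, as is usual with Lightning/DeepSpeed data parallelism):

```python
# Back-of-envelope numbers for the launch flags above (illustrative only).
batch_size_per_gpu = 16    # --batch-size
num_gpus = 4               # --num-gpus
dataset_files = 10         # our tiny test set

effective_batch = batch_size_per_gpu * num_gpus
print(f"effective batch per optimizer step: {effective_batch} samples")
print(f"dataset size: {dataset_files} files, i.e. smaller than a single batch")
```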

Outputs:

e) Models loaded

jupyter4-output1.png

f) Some warnings and CUDA loading

jupyter4-output2.png

g) Training in action

jupyter4-output3.png

jupyter4-output4.png

jupyter4-output5.png

h) After 5 hours with no sign of finishing, our keyboard interrupt...

jupyter4-output6.png

We can also post the SageMaker logs here if that helps.

Thanks in advance!

Hours? Hmm, with 4 GPUs... I heard they put something like 16,000 GPU hours just into the VAE.

https://github.com/yukara-ikemiya/friendly-stable-audio-tools
Give that an eyeball, maybe.

For what it's worth, your code looked okay.

After 10k steps it should drop a ckpt, keep the best 2 of those checkpoints (IIRC), and do that again every 10k steps.
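
In Lightning terms that checkpointing is roughly a ModelCheckpoint callback along these lines (a sketch from memory; the exact arguments and metric name in stable-audio-tools may differ):

```python
from pytorch_lightning.callbacks import ModelCheckpoint

# Drop a .ckpt every 10k training steps and keep the best two,
# so there is always a recent checkpoint to stop on.
ckpt_callback = ModelCheckpoint(
    dirpath="checkpoints",        # matches the --save-dir above
    every_n_train_steps=10_000,   # checkpoint cadence
    save_top_k=2,                 # keep the best 2 (IIRC)
    monitor="train/loss",         # hypothetical metric name, whatever train.py logs
)
```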

> After 5 hours with no sign of finishing, our keyboard interrupt...

It will NEVER finish on its own.

The code sets maximum epochs to 100,000 (or something equally huge). You are supposed to stop it yourself AFTER one of the 10k moments (10k, 100k, 200k... however many steps you want).

Right after it has JUST spat out a ckpt is my preferred moment.

If you exit early, something like this can save out a checkpoint manually:

```python
import os

# Define the path where you want to save the model
model_save_path = os.path.join(OUTPUT_DIR, 'final_model_checkpoint.ckpt')

# Save the model using the trainer
trainer.save_checkpoint(model_save_path)

print("Model saved successfully at:", model_save_path)
```

I believe that can, in a pinch, rip out the model trained so far. But you're better off doing what I said before, since this is just some last-ditch code I made up one time.

Hello,

Indeed, I set a lower number for max_epochs and I'm getting the model after a keyboard interrupt. Your solution is similar to mine, so thanks for confirming my hunch!
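
For anyone who finds this later, the change boils down to something like this in the Trainer setup (a rough sketch only; argument names are PyTorch Lightning's, not the exact code in train.py):

```python
import pytorch_lightning as pl

# Cap training explicitly instead of relying on the huge default epoch count,
# then let it run to completion instead of interrupting from the keyboard.
trainer = pl.Trainer(
    max_epochs=50,       # was effectively unbounded by default
    # max_steps=10_000,  # alternatively, cap by optimizer steps
    devices=4,
    strategy="deepspeed",
)
```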
