How much GPU and CPU memory do I need to fine-tune a 7B Raven model?

#21
by fubincom

Hello, we have a GPU cluster to train a 7B Raven model on our own in-domain corpus. I have 4 V100 GPUs with 32 GB of memory each, one per pod, and each pod has 48 GB of CPU memory (micro_bsz is 1, so the total batch size is 4). I have tried DeepSpeed stage 2 and stage 3 with offload, but the CPU memory isn't sufficient. If I only fine-tune the model with LoRA (micro_bsz=8), the resources are enough: it uses 23.2 GB of GPU memory and 19.2 GB of CPU memory in each pod (trainable params are 20.3 MB, with DeepSpeed stage 2). I have set grad_cp=1 and use fp16.
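For reference, the setup described above (ZeRO stage 2/3 with CPU offload, fp16, one V100 per pod) is usually wired up roughly like the sketch below when the training script builds a PyTorch Lightning Trainer, which recent versions of RWKV's train.py do. The flag names in your copy of train.py may differ, so treat this only as an illustration of where the offload knobs live, not as the script's actual code:

```python
# Illustrative only: how ZeRO stage-2/3 CPU offload is typically configured
# through a PyTorch Lightning Trainer. Flag names in train.py may differ.
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

strategy = DeepSpeedStrategy(
    stage=3,                  # 2 = shard optimizer state + grads, 3 = also shard params
    offload_optimizer=True,   # move the fp32 Adam buffers to CPU RAM
    offload_parameters=True,  # stage 3 only: also move partitioned params to CPU RAM
)

trainer = pl.Trainer(
    accelerator="gpu",
    devices=1,       # one V100 32 GB per pod
    num_nodes=4,     # 4 pods
    precision=16,    # fp16 mixed precision
    strategy=strategy,
)
```

For scale, with offload the fp32 Adam states alone take roughly 12 bytes per parameter, i.e. on the order of 80 GB for a 7B model spread across the ranks, before pinned buffers and the initial checkpoint load are counted, which is why 48 GB of CPU memory per pod can still be tight.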

So I have several questions:

    1. Is it possible to share the resources you used when you trained the 7B model?
    2. How can I set accumulate_grad_batches in the train.py script? I can see it in my log file, but I can't find where to set it (see the sketch after this list).
    3. I noticed one phenomenon when fine-tuning with LoRA (and it is the same without LoRA when I train a 1B5 model): at the beginning of training it occupies only around 42 GB of CPU memory, and then splits the data between CPU and GPU memory. Is that normal?
    4. What does grad_cp mean? Does it mean we compute the gradients again, so we don't keep them in GPU memory (but training is around 30% slower), saving around 7 GB for the 7B model? (7 GB is my own estimate and may not be correct; see the checkpointing sketch after this list.)

Are you trying to fine-tune a model on your own documents? I am trying to do something similar: train my own model on my data to get a fine-tuned model that I can then use for chat completions.
