I wonder if you have tried training the T5 11B-parameter model on a single node with 8 GPUs for the single-task full fine-tuning case? I have not been able to get past CUDA OOM errors with this repo's codebase, even with the per-device batch size set to 1 for both training and eval, on a p4d.24xlarge machine with 8 GPUs.
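For context, a rough back-of-the-envelope estimate (assuming standard fp32 Adam and plain data parallelism, where every GPU holds a full model replica — which may not match this repo's exact setup) suggests why even batch size 1 runs out of memory on 40 GB A100s:

```python
# Rough per-GPU memory estimate for full fine-tuning of an 11B-parameter
# model with Adam, assuming plain data parallelism (a full replica of
# weights, gradients, and optimizer state on each GPU) in fp32.
params = 11e9

weights = params * 4       # fp32 weights: 4 bytes per parameter
grads = params * 4         # fp32 gradients: 4 bytes per parameter
adam_states = params * 8   # Adam momentum + variance: 2 x 4 bytes per parameter

total_gb = (weights + grads + adam_states) / 1e9
print(f"~{total_gb:.0f} GB per GPU before activations")  # ~176 GB, far above 40 GB
```

Since the static state alone dwarfs a single 40 GB card, reducing the batch size cannot help by itself; the model state has to be sharded across GPUs (e.g., via DeepSpeed ZeRO or FSDP) or offloaded for full fine-tuning at this scale.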