Open slerman12 opened 3 years ago
The model should run on 4 GPUs with ~24 GB of memory each. I will change the default batch size in scripts/train_videogpt.py, as it should be something like 4 or 8 (batch size per GPU) to get a total batch size across all GPUs of around 32.
If you haven't tried it yet, I also suggest using sparse attention, as you get some memory usage reduction and speed-up when training the model.
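For the batch-size point above, here is a minimal sketch of the arithmetic. The helper name per_gpu_batch_size is made up for illustration and is not part of the VideoGPT scripts; it only shows how a target total batch size maps to the per-GPU value.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Split a target total batch size evenly across GPUs.

    Hypothetical helper for illustration only; not part of the repo.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError("total batch size must divide evenly across GPUs")
    return total_batch_size // num_gpus


# A total batch size of 32 on 4 GPUs gives 8 samples per GPU.
print(per_gpu_batch_size(32, 4))  # 8
```

With fewer GPUs, the same total would need a larger per-GPU batch and therefore more memory per device, so on limited hardware the total batch size probably has to come down as well.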
Thank you so much! I'll give that a try.
Don't want to keep prodding you, but I ran the provided Sparse Attention installation script:
sudo apt-get install llvm-9-dev
And received this trace:
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package llvm-9-dev
I tried installing llvm another way:
bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"
This worked, but the subsequent deepspeed install command did not:
Command errored out with exit status 1
Hmm, not too sure what the issue is. Have you tried running sudo apt update or sudo apt-get install before installing llvm-9-dev? This page might also have some useful information.
For the deepspeed install, do you know what the exact error was?
The trace is pretty long, but I think it was this:
csrc/sparse_attention/utils.cpp:110:90: warning: narrowing conversion of ‘H’
from ‘size_t {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
error: command '/usr/bin/gcc' failed with exit code 1
Maybe our system has some issue with gcc? I'm not too familiar with this system-level stuff.
I believe that is essentially the same error you mentioned above (failed with exit code 1). The line right above it is just a warning, not the error. The actual error should be somewhere further up in the logs.
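If the trace is too long to read by eye, something like the sketch below might help narrow it down. It assumes the build output was saved to a text file and that the compiler reports problems in the usual gcc style ("file:line:col: error: message"). The function name find_error_lines is made up for this example.

```python
import re
from typing import List, Tuple

# Matches "error:" or "fatal error:" as gcc and pip typically print them.
ERROR_PATTERN = re.compile(r"\b(fatal error|error):", re.IGNORECASE)


def find_error_lines(log_text: str) -> List[Tuple[int, str]]:
    """Return (line number, text) for lines that report an error.

    Lines that contain "warning:" are skipped, since those are
    not what makes the build fail.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if "warning:" in line.lower():
            continue
        if ERROR_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


# Example with an in-memory log; for a real trace, read the saved file instead.
sample_log = "\n".join([
    "csrc/sparse_attention/utils.cpp:110:90: warning: narrowing conversion",
    "error: command '/usr/bin/gcc' failed with exit code 1",
])
for number, text in find_error_lines(sample_log):
    print(number, text)
```

The first hit in the real log is usually the one worth posting, since later errors tend to be consequences of it.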
Have you tried looking at some of the GitHub issues on the DeepSpeed repo that might be relevant, such as this one?
One other option is to try out the Dockerfile in the other VideoGPT-related repo.
Would you happen to have a rough estimate of the kind of compute needed to run this model? Unfortunately, we have very limited compute, and I am getting memory allocation errors when running with the default settings.
Thank you for any support.