Compute compatibility? - Githubissues

wilson1yan / VideoGPT

MIT License


Compute compatibility? #6

Open slerman12 opened 3 years ago

slerman12 commented 3 years ago

Would you happen to have a rough estimate of the kind of compute needed to run this model? Unfortunately, we are subject to a very limited compute scenario and I am getting memory allocation errors when trying to run under the default settings.

Thank you for any support.

wilson1yan commented 3 years ago

The model should run on 4 GPUs with ~24GB of memory each. I will change the default batch size in scripts/train_videogpt.py, as it should be something like 4 or 8 (batch size per GPU) to get a total batch size across all GPUs of around 32.
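
As a rough illustration of the batch-size arithmetic above, here is a minimal Python sketch. The helper name `per_gpu_batch_size` is hypothetical and not part of the VideoGPT code; it only shows how a target total batch size of about 32 splits across 4 GPUs.

```python
def per_gpu_batch_size(total_batch_size: int, num_gpus: int) -> int:
    """Split a target total batch size evenly across GPUs.

    Hypothetical helper for illustration only; it is not part of VideoGPT.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if total_batch_size % num_gpus != 0:
        raise ValueError("total_batch_size must be divisible by num_gpus")
    return total_batch_size // num_gpus


if __name__ == "__main__":
    # A total batch size of 32 on 4 GPUs gives 8 samples per GPU.
    print(per_gpu_batch_size(32, 4))  # 8
```

If 8 samples per GPU still runs out of memory, dropping to 4 per GPU gives a total batch size of 16 on the same 4 GPUs.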

If you haven't tried it yet, I also suggest using sparse attention, as you get some memory usage reduction and speed-up when training the model.

slerman12 commented 3 years ago

Thank you so much! I'll give that a try.

slerman12 commented 3 years ago

Don't want to keep prodding you, but I ran the provided Sparse Attention installation script:

sudo apt-get install llvm-9-dev

And received this trace:

Reading package lists... Done
Building dependency tree    
Reading state information... Done
E: Unable to locate package llvm-9-dev

I tried installing llvm another way:

bash -c "$(wget -O - https://apt.llvm.org/llvm.sh)"

This worked, but the subsequent DeepSpeed install command did not:

Command errored out with exit status 1

wilson1yan commented 3 years ago

Hmm, not too sure what the issue is. Have you tried running sudo apt update or sudo apt-get update before installing llvm-9-dev? This page might also have some useful information.

For the deepspeed install, do you know what the exact error was?

slerman12 commented 3 years ago

The trace is pretty long, but I think it was this:

csrc/sparse_attention/utils.cpp:110:90: warning: narrowing conversion of ‘H’
 from ‘size_t {aka long unsigned int}’ to ‘long int’ inside { } [-Wnarrowing]
    error: command '/usr/bin/gcc' failed with exit code 1

Maybe our system has some issue with gcc? I'm not too familiar with this system-level stuff.

wilson1yan commented 3 years ago

I believe that is essentially the same error you mentioned above (failed with exit code 1). The line right above it is just a warning, not the error. The actual error should be somewhere further up in the logs.
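
One way to dig the real error out of a long build trace is to skip warning lines and list only the lines that mention an error. The sketch below is a generic log filter, not a DeepSpeed or pip tool; the function name `find_error_lines` and the matching rule are assumptions chosen for illustration.

```python
import re

# Matches the standalone word "error" (e.g. "error:", "fatal error:").
# It deliberately does not match "errored" or flags such as "-Werror".
ERROR_PATTERN = re.compile(r"\berror\b", re.IGNORECASE)


def find_error_lines(log_text):
    """Return (line_number, line) pairs that mention an error.

    Lines containing "warning:" are skipped, since compiler warnings
    do not by themselves make a build fail.
    """
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if "warning:" in line:
            continue
        if ERROR_PATTERN.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_errors.py build.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, line in find_error_lines(handle.read()):
            print(f"{number}: {line}")
```

The first hit in the output is usually the one worth reading. The final "failed with exit code 1" line only reports that an earlier step failed.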

Have you tried looking at some of the GitHub issues on the DeepSpeed repo that might be relevant? Such as this one.

One other option is to try out the Dockerfile in the other VideoGPT-related repo.