yfeng95 / PoseGPT


python main_chat.py <> CUDA out of memory #5

Open affromero opened 4 months ago

affromero commented 4 months ago

Hello,

Thanks!

neil0306 commented 4 months ago

Same here. I was using a single 3090 GPU and hit the OOM error with bf16, and AttributeError: 'LlamaAttention' object has no attribute 'rope_theta' with fp16.

Edit: After monitoring memory usage, I found that this model may need more than 25GB of GPU memory, and note that I did not load any images or run a forward pass. Therefore, we may need a GPU with more than 40GB of memory.
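For context, that memory floor can be sanity-checked from the parameter count alone. This is a back-of-the-envelope sketch; the ~13B backbone size is an assumption, not taken from the repo:

```python
def weight_footprint_gib(n_params: float, bytes_per_param: int) -> float:
    """GPU memory taken by the model weights alone (no activations, no KV cache)."""
    return n_params * bytes_per_param / 1024**3

# ~13B parameters at 2 bytes each (fp16/bf16) is already ~24 GiB of weights,
# consistent with the >25 GB observed before any forward pass.
print(round(weight_footprint_gib(13e9, 2), 1))  # -> 24.2
```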

neil0306 commented 4 months ago

I ran this model (with bf16) successfully on an L20 GPU (48GB), but it sits right at the edge of running out of memory (OOM).


For fp16, it used about 26GB, which is why we got OOM with a 3090 or any GPU with 24GB.

For system RAM: the model will use about 26GB, so I think >= 32GB should be a good choice.

w4230213 commented 4 months ago

Hi, would you mind sharing how you fixed the 'rope_theta' issue? I'm using a 32GB V100, which is not enough for bf16, so I tried fp16 (though I don't know why fp16 would reduce the memory cost significantly, according to your post), but hit the error above. It looks related to DeepSpeed, which I'm using: deepspeed==0.14.4, transformers==4.31.0.
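For what it's worth, `rope_theta` was only added to the LLaMA classes in later `transformers` releases, so a newer DeepSpeed can probe for it on the 4.31.0 attention modules and fail. Pinning a compatible DeepSpeed version is the clean fix; as a stopgap, the missing field can be backfilled on the config before loading. This is a hedged sketch, not code from the repo:

```python
def ensure_rope_theta(config, default: float = 10000.0):
    """Backfill `rope_theta` on older LlamaConfig objects that predate the field.

    10000.0 is the standard RoPE base used by LLaMA; newer code paths
    (e.g. DeepSpeed kernel injection) may read it unconditionally.
    """
    if not hasattr(config, "rope_theta"):
        config.rope_theta = default
    return config

# Usage sketch (names are illustrative, not the actual checkpoint):
# config = AutoConfig.from_pretrained(model_path)
# model = AutoModelForCausalLM.from_pretrained(model_path, config=ensure_rope_theta(config))
```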

neil0306 commented 4 months ago


Hi,

the deepspeed version I am using is 0.6.5.

And below are my conda env list logs for your convenience:

# packages in environment at /root/miniconda3/envs/chatpose:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
_openmp_mutex             5.1                       1_gnu    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
accelerate                0.31.0                   pypi_0    pypi
aiofiles                  23.2.1                   pypi_0    pypi
aiohttp                   3.9.5                    pypi_0    pypi
aiosignal                 1.3.1                    pypi_0    pypi
altair                    5.3.0                    pypi_0    pypi
annotated-types           0.7.0                    pypi_0    pypi
anyio                     4.4.0                    pypi_0    pypi
async-timeout             4.0.3                    pypi_0    pypi
attrs                     23.2.0                   pypi_0    pypi
bitsandbytes              0.41.1                   pypi_0    pypi
ca-certificates           2024.3.11            h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
certifi                   2024.6.2                 pypi_0    pypi
charset-normalizer        3.3.2                    pypi_0    pypi
click                     8.1.7                    pypi_0    pypi
contourpy                 1.2.1                    pypi_0    pypi
cycler                    0.12.1                   pypi_0    pypi
deepspeed                 0.6.5                    pypi_0    pypi
einops                    0.4.1                    pypi_0    pypi
exceptiongroup            1.2.1                    pypi_0    pypi
fastapi                   0.100.1                  pypi_0    pypi
ffmpy                     0.3.2                    pypi_0    pypi
filelock                  3.15.4                   pypi_0    pypi
flash-attn                2.5.9.post1              pypi_0    pypi
fonttools                 4.53.0                   pypi_0    pypi
frozenlist                1.4.1                    pypi_0    pypi
fsspec                    2024.6.0                 pypi_0    pypi
gradio                    3.39.0                   pypi_0    pypi
gradio-client             1.0.1                    pypi_0    pypi
grpcio                    1.64.1                   pypi_0    pypi
h11                       0.14.0                   pypi_0    pypi
hjson                     3.1.0                    pypi_0    pypi
httpcore                  1.0.5                    pypi_0    pypi
httpx                     0.27.0                   pypi_0    pypi
huggingface-hub           0.23.4                   pypi_0    pypi
idna                      3.7                      pypi_0    pypi
imageio                   2.34.2                   pypi_0    pypi
importlib-resources       6.4.0                    pypi_0    pypi
jinja2                    3.1.4                    pypi_0    pypi
jsonschema                4.22.0                   pypi_0    pypi
jsonschema-specifications 2023.12.1                pypi_0    pypi
kiwisolver                1.4.5                    pypi_0    pypi
lazy-loader               0.4                      pypi_0    pypi
ld_impl_linux-64          2.38                 h1181459_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libffi                    3.4.4                h6a678d5_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgcc-ng                 11.2.0               h1234567_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libgomp                   11.2.0               h1234567_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
libstdcxx-ng              11.2.0               h1234567_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
linkify-it-py             2.0.3                    pypi_0    pypi
markdown-it-py            2.2.0                    pypi_0    pypi
markdown2                 2.4.10                   pypi_0    pypi
markupsafe                2.1.5                    pypi_0    pypi
matplotlib                3.9.0                    pypi_0    pypi
mdit-py-plugins           0.3.3                    pypi_0    pypi
mdurl                     0.1.2                    pypi_0    pypi
msgpack                   1.0.8                    pypi_0    pypi
multidict                 6.0.5                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
networkx                  3.2.1                    pypi_0    pypi
ninja                     1.11.1.1                 pypi_0    pypi
numpy                     1.24.2                   pypi_0    pypi
openai                    0.27.8                   pypi_0    pypi
opencv-python             4.8.0.74                 pypi_0    pypi
openssl                   3.0.14               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
orjson                    3.10.5                   pypi_0    pypi
packaging                 24.1                     pypi_0    pypi
pandas                    2.2.2                    pypi_0    pypi
peft                      0.4.0                    pypi_0    pypi
pillow                    9.4.0                    pypi_0    pypi
pip                       24.0             py39h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
protobuf                  5.27.1                   pypi_0    pypi
psutil                    6.0.0                    pypi_0    pypi
py-cpuinfo                9.0.0                    pypi_0    pypi
pydantic                  2.7.4                    pypi_0    pypi
pydantic-core             2.18.4                   pypi_0    pypi
pydub                     0.25.1                   pypi_0    pypi
pyparsing                 3.1.2                    pypi_0    pypi
python                    3.9.19               h955ad1f_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
python-dateutil           2.9.0.post0              pypi_0    pypi
python-multipart          0.0.9                    pypi_0    pypi
pytz                      2024.1                   pypi_0    pypi
pyyaml                    6.0.1                    pypi_0    pypi
ray                       2.6.1                    pypi_0    pypi
readline                  8.2                  h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
referencing               0.35.1                   pypi_0    pypi
regex                     2024.5.15                pypi_0    pypi
requests                  2.31.0                   pypi_0    pypi
rpds-py                   0.18.1                   pypi_0    pypi
safetensors               0.4.3                    pypi_0    pypi
scikit-image              0.24.0                   pypi_0    pypi
scipy                     1.11.2                   pypi_0    pypi
semantic-version          2.10.0                   pypi_0    pypi
sentencepiece             0.2.0                    pypi_0    pypi
setuptools                69.5.1           py39h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
shortuuid                 1.0.11                   pypi_0    pypi
six                       1.16.0                   pypi_0    pypi
sniffio                   1.3.1                    pypi_0    pypi
sqlite                    3.45.3               h5eee18b_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
starlette                 0.27.0                   pypi_0    pypi
tifffile                  2024.6.18                pypi_0    pypi
tk                        8.6.14               h39e8969_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
tokenizers                0.13.3                   pypi_0    pypi
toolz                     0.12.1                   pypi_0    pypi
torch                     1.13.1+cu117             pypi_0    pypi
torchvision               0.14.1+cu117             pypi_0    pypi
tqdm                      4.64.1                   pypi_0    pypi
transformers              4.31.0                   pypi_0    pypi
typing-extensions         4.12.2                   pypi_0    pypi
tzdata                    2024.1                   pypi_0    pypi
uc-micro-py               1.0.3                    pypi_0    pypi
urllib3                   2.2.2                    pypi_0    pypi
uvicorn                   0.23.2                   pypi_0    pypi
websockets                11.0.3                   pypi_0    pypi
wheel                     0.43.0           py39h06a4308_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xz                        5.4.6                h5eee18b_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
yacs                      0.1.8                    pypi_0    pypi
yarl                      1.9.4                    pypi_0    pypi
zipp                      3.19.2                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
xiaoluo333 commented 3 months ago

Hello, may I ask what the final running output looks like?

wmj142326 commented 2 months ago

For the single-3090 out-of-memory problem, can I use two 3090s in parallel? It appears that the source code does not support multi-GPU execution?
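The repo does not appear to ship multi-GPU support, but for inference the weights can usually be sharded across two 24GB cards with transformers' `device_map="auto"` plus a `max_memory` budget. This is a hedged sketch; `model_path` is a placeholder, and whether main_chat.py's loader forwards these kwargs to `from_pretrained` is an assumption:

```python
def make_max_memory(n_gpus: int, gib_per_gpu: int, headroom_gib: int = 2) -> dict:
    """Build an accelerate-style `max_memory` map, leaving per-GPU headroom
    for activations and the KV cache."""
    budget = {i: f"{gib_per_gpu - headroom_gib}GiB" for i in range(n_gpus)}
    budget["cpu"] = "30GiB"  # spill-over target if the weights still don't fit
    return budget

# Two 24 GB 3090s:
max_memory = make_max_memory(2, 24)
print(max_memory)  # {0: '22GiB', 1: '22GiB', 'cpu': '30GiB'}

# Then (sketch only):
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, torch_dtype=torch.bfloat16,
#     device_map="auto", max_memory=max_memory)
```

Note this shards layers for inference only; it does not give you parallel training.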

xiaoTan12 commented 2 months ago

I successfully deployed bf16 (showing about 26GB of VRAM), but after entering the user interaction input, the VRAM ran out. I am using a 40GB GPU. What is the problem?

wmj142326 commented 2 months ago


I had the same problem; again, the GPU memory wasn't big enough. I switched to an A800 (80GB), ran it successfully, and got both chat and pose generation!

wchieffff commented 1 month ago

I added device_map="auto" at line 178 in main_chat.py and set precision=fp16, but it still goes OOM... Did I do something wrong?
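If `device_map="auto"` still overflows a single card, 8-bit weight loading via bitsandbytes (already pinned in the env list above) roughly halves the weight footprint. Whether main_chat.py forwards `load_in_8bit` to `from_pretrained` is an assumption, and `model_path` is a placeholder; this is a sketch, not the repo's method:

```python
# Sketch of 8-bit loading (requires bitsandbytes, supported by transformers 4.31):
# model = AutoModelForCausalLM.from_pretrained(
#     model_path, device_map="auto", load_in_8bit=True)

def quantized_footprint_gib(n_params: float, bits: int) -> float:
    """Weights-only footprint at a given quantization width."""
    return n_params * bits / 8 / 1024**3

# A ~13B model (size assumed) drops from ~24 GiB at 16-bit to ~12 GiB at 8-bit,
# weights only; activations during chat still add on top of this.
print(round(quantized_footprint_gib(13e9, 8), 1))  # -> 12.1
```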