xing0047 / cca-llava

[NeurIPS 2024] Mitigating Object Hallucination via Concentric Causal Attention
Apache License 2.0

SFT issue #4

Open ErikZ719 opened 6 days ago

ErikZ719 commented 6 days ago

Hi, I followed the default settings (pyproject.toml) to run the fine-tuning experiments on 3× RTX 3090 and it reports out of memory. Is this normal? Pre-training works fine.

```
Package  Version  Editable project location
absl-py  2.1.0
accelerate  0.26.1
aiofiles  23.2.1
altair  5.4.1
annotated-types  0.7.0
anyio  4.6.2.post1
attrs  24.2.0
bitsandbytes  0.44.1
certifi  2022.12.7
charset-normalizer  2.1.1
click  8.1.7
contourpy  1.3.0
cycler  0.12.1
deepspeed  0.13.1
docker-pycreds  0.4.0
einops  0.6.1
einops-exts  0.0.4
exceptiongroup  1.2.2
fastapi  0.115.4
ffmpy  0.4.0
filelock  3.13.1
flash-attn  2.5.8
fonttools  4.54.1
fsspec  2024.2.0
gitdb  4.0.11
GitPython  3.1.43
gradio  4.16.0
gradio_client  0.8.1
grpcio  1.67.1
h11  0.14.0
hjson  3.1.0
httpcore  0.17.3
httpx  0.24.0
huggingface-hub  0.26.2
idna  3.4
importlib_resources  6.4.5
Jinja2  3.1.3
joblib  1.4.2
jsonschema  4.23.0
jsonschema-specifications  2024.10.1
kiwisolver  1.4.7
latex2mathml  3.77.0
llava  1.2.2.post1  /root/zqy/cca-llava
Markdown  3.7
markdown-it-py  3.0.0
markdown2  2.5.1
MarkupSafe  2.1.5
matplotlib  3.9.2
mdurl  0.1.2
mpmath  1.3.0
narwhals  1.12.1
networkx  3.2.1
ninja  1.11.1.1
numpy  1.26.3
orjson  3.10.11
packaging  24.1
pandas  2.2.3
peft  0.13.2
pillow  10.2.0
pip  24.3.1
platformdirs  4.3.6
protobuf  5.28.3
psutil  6.1.0
py-cpuinfo  9.0.0
pydantic  2.9.2
pydantic_core  2.23.4
pydub  0.25.1
Pygments  2.18.0
pynvml  11.5.0
pyparsing  3.2.0
python-dateutil  2.9.0.post0
python-multipart  0.0.17
pytz  2024.2
PyYAML  6.0.2
referencing  0.35.1
regex  2024.9.11
requests  2.28.1
rich  13.9.4
rpds-py  0.20.1
ruff  0.7.2
safetensors  0.4.5
scikit-learn  1.2.2
scipy  1.14.1
semantic-version  2.10.0
sentencepiece  0.1.99
sentry-sdk  2.17.0
setproctitle  1.3.3
setuptools  75.1.0
shellingham  1.5.4
shortuuid  1.0.13
six  1.16.0
smmap  5.0.1
sniffio  1.3.1
starlette  0.41.2
svgwrite  1.4.3
sympy  1.13.1
tensorboard  2.18.0
tensorboard-data-server  0.7.2
threadpoolctl  3.5.0
timm  0.6.13
tokenizers  0.15.2
tomlkit  0.12.0
torch  2.1.1+cu121
torchaudio  2.1.1+cu121
torchvision  0.16.1+cu121
tqdm  4.66.6
transformers  4.37.2
triton  2.1.0
typer  0.12.5
typing_extensions  4.12.2
tzdata  2024.2
urllib3  1.26.13
uvicorn  0.32.0
wandb  0.18.5
wavedrom  2.0.3.post3
websockets  11.0.3
Werkzeug  3.1.1
wheel  0.44.0
xformers  0.0.23
```

xing0047 commented 6 days ago

Hi @ErikZ719

Thanks for your feedback.

As the official LLaVA repo states, if you do not have enough GPU memory for LLaVA training, please consider:

1) Use LoRA: `finetune_lora.sh`. As LLaVA indicates, 7B training fits on 8× RTX 3090 (I ran LoRA training on a 4090 and it seems to work well). Make sure `per_device_train_batch_size * gradient_accumulation_steps` matches the provided script for best reproducibility.

2) Replace `zero3.json` with `zero3_offload.json`, which offloads some parameters to CPU RAM. This slows down training. A rough launch sketch combining both adjustments is shown below.
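For reference, a minimal sketch of what such a launch could look like, assuming the script layout follows upstream LLaVA (the flag names come from LLaVA's `finetune_lora.sh`; the batch-size numbers are illustrative, not this repo's defaults, and the other required arguments from the original script are omitted for brevity):

```bash
# Sketch only: keep every other argument from finetune_lora.sh as-is
# (--model_name_or_path, --data_path, --output_dir, and so on).
# Memory-related changes shown here:
#   * zero3_offload.json moves optimizer/parameter state to CPU RAM (slower, but less VRAM)
#   * a smaller per-device batch plus more accumulation keeps the effective batch size
#     (per_device_train_batch_size * gradient_accumulation_steps) unchanged
deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3_offload.json \
    --lora_enable True \
    --per_device_train_batch_size 4 \
    --gradient_accumulation_steps 4
```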

ErikZ719 commented 6 days ago

Thank you very much for your reply. Would it be possible to share your CUDA version and `pip list` output on the 4090? I'm getting a version conflict error with my LoRA training.
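(For comparing environments, generic commands like these are enough; nothing below is specific to cca-llava:)

```bash
# Generic environment checks for comparing CUDA / PyTorch setups.
nvidia-smi                 # driver version and GPUs
nvcc --version             # CUDA toolkit version, if the toolkit is installed
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
pip list | grep -Ei "torch|transformers|deepspeed|flash|xformers|accelerate|peft"
```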

xing0047 commented 6 days ago

Hi @ErikZ719

Please check below for your reference.

ErikZ719 commented 6 days ago

What can I say! Man, thank you very much. :)