zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, qwen-vl, phi3-v etc.

Flash-attn issues #32

Closed binarybeastt closed 2 weeks ago

binarybeastt commented 2 weeks ago

Thank you for your good work.

While trying to fine-tune the interleave 0.5B model, I keep running into errors that I don't quite understand, but they're related to flash attention. For more context, I'm using 8 NVIDIA A100 GPUs.

W0830 08:50:06.518000 139927194171200 torch/distributed/run.py:779] 
W0830 08:50:06.518000 139927194171200 torch/distributed/run.py:779] *****************************************
W0830 08:50:06.518000 139927194171200 torch/distributed/run.py:779] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0830 08:50:06.518000 139927194171200 torch/distributed/run.py:779] *****************************************
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/deepspeed.py:24: FutureWarning: transformers.deepspeed module is deprecated and will be removed in a future version. Please import deepspeed modules directly from transformers.integrations
  warnings.warn(
[... the same FutureWarning repeated by each of the 8 processes ...]
[2024-08-30 08:50:13,953] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[2024-08-30 08:50:14,354] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.4
 [WARNING]  using untested triton version (3.0.0), only 1.0.0 is known to be compatible
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:49: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  def forward(ctx, input, weight, bias=None):
/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/deepspeed/runtime/zero/linear.py:67: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  def backward(ctx, grad_output):
[2024-08-30 08:50:15,838] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-08-30 08:50:15,838] [INFO] [comm.py:668:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[... the same DeepSpeed warnings and ds_accelerator / init_distributed INFO lines, repeated by the remaining 7 processes, omitted ...]
[rank5]: Traceback (most recent call last):
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1659, in _get_module
[rank5]:     return importlib.import_module("." + module_name, self.__name__)
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/importlib/__init__.py", line 126, in import_module
[rank5]:     return _bootstrap._gcd_import(name[level:], package, level)
[rank5]:   File "<frozen importlib._bootstrap>", line 1050, in _gcd_import
[rank5]:   File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
[rank5]:   File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
[rank5]:   File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
[rank5]:   File "<frozen importlib._bootstrap_external>", line 883, in exec_module
[rank5]:   File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/siglip/modeling_siglip.py", line 46, in <module>
[rank5]:     from ...modeling_flash_attention_utils import _flash_attention_forward
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/modeling_flash_attention_utils.py", line 27, in <module>
[rank5]:     from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input  # noqa
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/flash_attn/__init__.py", line 3, in <module>
[rank5]:     from flash_attn.flash_attn_interface import (
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 10, in <module>
[rank5]:     import flash_attn_2_cuda as flash_attn_cuda
[rank5]: ImportError: /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi

[rank5]: The above exception was the direct cause of the following exception:

[rank5]: Traceback (most recent call last):
[rank5]:   File "/teamspace/studios/this_studio/lmms-finetune/train.py", line 199, in <module>
[rank5]:     train()
[rank5]:   File "/teamspace/studios/this_studio/lmms-finetune/train.py", line 74, in train
[rank5]:     model, tokenizer, processor = loader.load()
[rank5]:   File "/teamspace/studios/this_studio/lmms-finetune/loaders/llava_interleave.py", line 13, in load
[rank5]:     model = LlavaForConditionalGeneration.from_pretrained(
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3828, in from_pretrained
[rank5]:     model = cls(config, *model_args, **model_kwargs)
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 506, in wrapper
[rank5]:     f(module, *args, **kwargs)
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/llava/modeling_llava.py", line 246, in __init__
[rank5]:     self.vision_tower = AutoModel.from_config(config.vision_config)
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 437, in from_config
[rank5]:     model_class = _get_model_class(config, cls._model_mapping)
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 384, in _get_model_class
[rank5]:     supported_models = model_mapping[type(config)]
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 735, in __getitem__
[rank5]:     return self._load_attr_from_module(model_type, model_name)
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 749, in _load_attr_from_module
[rank5]:     return getattribute_from_module(self._modules[module_name], attr)
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 693, in getattribute_from_module
[rank5]:     if hasattr(module, attr):
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1649, in __getattr__
[rank5]:     module = self._get_module(self._class_to_module[name])
[rank5]:   File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1661, in _get_module
[rank5]:     raise RuntimeError(
[rank5]: RuntimeError: Failed to import transformers.models.siglip.modeling_siglip because of the following error (look up to see its traceback):
[rank5]: /home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c104cuda9SetDeviceEi
[... identical tracebacks from ranks 0-4, 6, and 7 omitted ...]
Loading model, tokenizer, processor...
[2024-08-30 08:50:18,845] [INFO] [partition_parameters.py:345:__exit__] finished initializing model - num_params = 0, num_elems = 0.00B
W0830 08:50:20.342000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10857 closing signal SIGTERM
W0830 08:50:20.343000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10858 closing signal SIGTERM
W0830 08:50:20.347000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10859 closing signal SIGTERM
W0830 08:50:20.348000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10860 closing signal SIGTERM
W0830 08:50:20.348000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10861 closing signal SIGTERM
W0830 08:50:20.348000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10863 closing signal SIGTERM
W0830 08:50:20.348000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 10864 closing signal SIGTERM
E0830 08:50:20.692000 139927194171200 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 5 (pid: 10862) of binary: /home/zeus/miniconda3/envs/cloudspace/bin/python
Traceback (most recent call last):
  File "/home/zeus/miniconda3/envs/cloudspace/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
train.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-08-30_08:50:20
  host      : ip-10-192-12-54.us-east-2.compute.internal
  rank      : 5 (local_rank: 5)
  exitcode  : 1 (pid: 10862)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
zjysteven commented 2 weeks ago

Hi, from https://github.com/Dao-AILab/flash-attention/issues/667 it seems that the "undefined symbol" error is caused by an incompatibility between the installed flash-attn build and your PyTorch version. Can you try re-installing flash-attention, or see if the discussions in that issue help?

pip uninstall flash-attn
pip install --no-cache-dir --no-build-isolation flash-attn
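As a quick sanity check after reinstalling (a minimal sketch, assuming the same conda environment; this snippet is not from the original thread), you can confirm that flash-attn now imports cleanly against the installed torch build, since the undefined-symbol ImportError above is raised at import time:

# Sanity check (sketch): a clean import means the rebuilt wheel matches
# the installed torch/CUDA build; a mismatch reproduces the
# "undefined symbol" ImportError from the log above.
import torch
import flash_attn  # fails here if the wheel was built against a different torch

print("torch:", torch.__version__, "cuda:", torch.version.cuda)
print("flash-attn:", flash_attn.__version__)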
binarybeastt commented 2 weeks ago

Reinstalling flash attention worked, thank you.