xing0047 / cca-llava

[NeurIPS 2024] Mitigating Object Hallucination via Concentric Causal Attention
Apache License 2.0

Prepend issue #5

Open mzamini92 opened 1 day ago

mzamini92 commented 1 day ago

Hi. Thanks for the great work. I tried to prepend the patch by just adding

import transformers
from llava.cca_utils.cca import llamaforcausallm_forward, cca_forward

# Monkey-patch the CCA forward passes into the HF LLaMA classes
transformers.models.llama.LlamaForCausalLM.forward = llamaforcausallm_forward
transformers.models.llama.LlamaModel.forward = cca_forward

to LLaVA-UHD or EAGLE and I get:

[rank0]:     for img_token_pos in batch_img_token_pos:
[rank0]: TypeError: 'NoneType' object is not iterable

When I also modify the llava_llama.py file the same way as yours, I get:

[rank6]:   File "/llava/cca_utils/cca.py", line 342, in cca_forward
[rank6]:     torch.arange(img_token_pos + H // 2, seq_len - IMG_TOKEN_LEN + H // 2)
[rank6]: RuntimeError: upper bound and larger bound inconsistent with step sign

Did I miss anything?

yiheng003 commented 1 day ago

Thank you for your interest in our work and your feedback ^_^

CCA is technically compatible with LVLMs other than the LLaVA in this repo, such as LLaVA-UHD and EAGLE, but some modifications are needed to accommodate the change in visual feature resolution.

Our method was implemented on top of LLaVA, where the visual token sequence has length 576 (manually set here), and the 2-D visual features from this sequence form a 24-by-24 grid (set here). The visual token sequence length depends on the LVLM's design, so slight modifications are needed when adapting CCA to other models.

We suspect this mismatch is the likely cause of the errors you see. If so, adjust IMG_TOKEN_LEN, H and W to match the visual feature size in your experimental setup.
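To make the relationship concrete, here is a small illustrative sketch (not the repo's code; the constants mirror the LLaVA-1.5 defaults mentioned above, and arange_bounds is a hypothetical helper reproducing only the bounds of the torch.arange call at cca.py line 342). It shows why a visual token count that exceeds IMG_TOKEN_LEN can make the upper bound fall below the lower bound, which is exactly the "inconsistent with step sign" error:

```python
import math

# CCA assumes the visual tokens form a square H x W grid.
# These are the LLaVA-1.5 defaults; LLaVA-UHD / EAGLE differ.
IMG_TOKEN_LEN = 576
H = W = math.isqrt(IMG_TOKEN_LEN)  # 24 x 24 grid
assert H * W == IMG_TOKEN_LEN

def arange_bounds(img_token_pos, seq_len, img_token_len=IMG_TOKEN_LEN, h=H):
    """Bounds of the torch.arange call in cca_forward (cca.py line 342)."""
    lower = img_token_pos + h // 2
    upper = seq_len - img_token_len + h // 2
    return lower, upper

# With constants that match the model, the range is valid:
lo, hi = arange_bounds(img_token_pos=35, seq_len=700)
assert lo <= hi  # 47 <= 136

# But if the model actually emits more visual tokens than
# IMG_TOKEN_LEN (e.g. a higher-resolution LVLM), upper < lower,
# and torch.arange raises "upper bound and larger bound
# inconsistent with step sign".
lo, hi = arange_bounds(img_token_pos=35, seq_len=700, img_token_len=1024)
assert hi < lo  # -312 < 47
```

In short, if IMG_TOKEN_LEN, H and W are smaller than the model's true visual feature size, the arange upper bound goes negative while the lower bound stays positive, producing the RuntimeError above.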

Let us know if you have further questions on this. Thanks.