The comment is so helpful!

I'm still getting a loss of 0 after setting `--model_max_length` to 4096 or more (only with llava-onevision). Are there other reasons that could be causing this?
@sunfanyunn `model_max_length` might still be too small relative to your input. You can use this script to examine the collator's output and check whether `model_max_length` is large enough (if the decoded labels come out empty, the answer tokens were truncated away, which is what gives the loss of 0).
```python
import json
import os

from tqdm import tqdm
import torch
torch.set_printoptions(profile="full", linewidth=240)
from torch.utils.data import DataLoader
from transformers import AutoProcessor, AutoTokenizer

from datasets import LazySupervisedDataset
from collators import COLLATORS
from loaders import LOADERS
from supported_models import MODEL_HF_PATH

model_id = "llava-onevision-0.5b-ov"
model_family_id = "llava-onevision"

dataset = LazySupervisedDataset(
    data_path='./example_data/single_image.json',  # use your own data here
    image_folder='./example_data/images',
    video_folder='./example_data/videos',
    model_family_id=model_family_id,
)

_, tokenizer, processor, config = LOADERS[model_family_id](
    model_hf_path=MODEL_HF_PATH[model_id],
    model_local_path=MODEL_HF_PATH[model_id],
    compute_dtype=torch.float16,
).load(load_model=False)
tokenizer.model_max_length = 4096

collator = COLLATORS[model_family_id](
    config=config,
    processor=processor,
    tokenizer=tokenizer
)

dataloader = DataLoader(dataset, batch_size=2, collate_fn=collator)
batch = next(iter(dataloader))

print(batch["input_ids"])
print()
print(batch["labels"])
print()
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=False))
print(tokenizer.decode(
    batch["labels"][1][torch.where(batch["labels"][1] != -100)[0]],
    skip_special_tokens=True
))
```
I realized my input images (of size 1080 x 1080) are being tokenized into 7371 tokens. @zjysteven am I missing anything obvious?
This is from the LLaVA-OneVision paper, which shows that the maximum number of tokens for one image is 7290. Yours is slightly higher (probably due to a few extra marker tokens, e.g., newline tokens), but I don't think anything is wrong. This is what I meant earlier: your `model_max_length` may not be large enough.
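If you want to double-check the count yourself, here is a minimal sketch (assuming a recent transformers version in which the processor expands the `<image>` placeholder into the actual number of vision tokens; the llava-hf checkpoint below is just an example):

```python
from PIL import Image
from transformers import AutoProcessor

# Example checkpoint; any llava-onevision HF checkpoint should behave the same way.
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# Dummy 1080 x 1080 image; only its size matters for the anyres grid selection.
image = Image.new("RGB", (1080, 1080))
inputs = processor(images=image, text="<image>\nDescribe the image.", return_tensors="pt")

# Count how many <image> tokens the processor inserted into input_ids.
image_token_id = processor.tokenizer.convert_tokens_to_ids("<image>")
num_image_tokens = (inputs["input_ids"] == image_token_id).sum().item()
print(num_image_tokens)  # should be in the ~7k range for a 1080 x 1080 input
```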
I am aware, thank you! But I think my images are being processed the single-image way even when I provide multiple images.
Oh, I see your point now. I briefly browsed through HuggingFace's preprocessing code but didn't notice any place where "single image" and "multi-image" inputs are distinguished with different preprocessing; it seems to me that currently all images are processed with "anyres", which results in what you saw here.

Meanwhile, I do see that in the official implementation's training code, https://github.com/LLaVA-VL/LLaVA-NeXT/blob/79ef45a6d8b89b92d7a8525f077c3a3a9894a87d/llava/train/train.py#L1140-L1148, single-image and multi-image inputs are handled differently:
```python
if "image" in sources[0]:
    image_file = self.list_data_dict[i]["image"]
    if type(image_file) is list:
        image = [self.process_image(f) for f in image_file]
        # Handling multi images
        # overwrite to process with simple pad
        if len(image_file) > 1:
            image = [self.process_image(f, "pad") for f in image_file]
            image = [[im[0], im[1], "image"] for im in image]
```
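As I understand it, the "pad" mode in that snippet simply pads each image to a square before the standard resize, so each image contributes only the base number of vision tokens instead of the full anyres expansion. For illustration only, a sketch along the lines of the `expand2square` helper in the original repo's `llava/mm_utils.py` (not the HF code path):

```python
from PIL import Image

def expand2square(pil_img, background_color):
    """Pad a PIL image to a square canvas, centering the original content."""
    width, height = pil_img.size
    if width == height:
        return pil_img
    elif width > height:
        result = Image.new(pil_img.mode, (width, width), background_color)
        result.paste(pil_img, (0, (width - height) // 2))
        return result
    else:
        result = Image.new(pil_img.mode, (height, height), background_color)
        result.paste(pil_img, ((height - width) // 2, 0))
        return result
```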
Tagging @zucchini-nlp to see if she can kindly confirm this or has any ideas.
Hey all!
Yes, you're right: currently the HF implementation doesn't distinguish between the single-image and multi-image settings. AFAIR, inference in the original implementation also did not, unless I am missing something, as many things changed in the course of porting the model. I can check the inference code in the original repo later.

I am sorry that HF doesn't support training the same way as in the paper. The reason is that I tried to reduce complexity for the model and aimed at inference-first use cases. We can add a single- vs. multi-image distinction, but it would require padding/unpadding similar to Mllama, which means more code to maintain.

Let me see how the inference works in the original repo and I'll come back to this issue; right now I'm a bit short on bandwidth.
Thank you!
One thing I would like to share:

### After finetuning

```python
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration  # llava-onevision
from transformers import AutoProcessor, LlavaForConditionalGeneration  # llava
```
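For example, a minimal loading sketch, assuming the finetuned model and processor were saved to a local directory (the path below is a placeholder):

```python
import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

# Placeholder path: wherever your finetuned checkpoint was saved.
checkpoint = "./checkpoints/llava-onevision-0.5b-ov-finetuned"

model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(checkpoint)
```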
TL;DR: Set a large enough `model_max_length` (e.g., 2048, 4096, or even larger) when finetuning, or you will likely see the training loss always being 0.

Today we have enabled finetuning of LLaVA-Onevision in lmms-finetune. There is a quite subtle caveat, though, that's worth mentioning.

In earlier versions of transformers (I can't remember exactly, but some point before 4.45.2), `model_max_length` only counts the text tokens, without considering the vision tokens. Take LLaVA-1.5 as an example, where each image is translated into 576 tokens when sent to the LLM: if you set `model_max_length` to 128, then with a prompt that includes one image, your input sequence length will essentially be 128 - 1 + 576 = 703.
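As a quick sanity check of that arithmetic (a trivial sketch; the 576 figure is LLaVA-1.5's per-image token count mentioned above):

```python
# Effective sequence length under the older behavior, where model_max_length
# budgets only text tokens and the single <image> placeholder is later
# expanded into the image's vision tokens inside the model.
model_max_length = 128   # text-token budget
image_placeholders = 1   # the one <image> token in the prompt
tokens_per_image = 576   # LLaVA-1.5: each image becomes 576 tokens
print(model_max_length - image_placeholders + tokens_per_image)  # 703
```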
Recently, the transformers implementation started to include the vision tokens in `model_max_length`; you can see it here: https://github.com/huggingface/transformers/blob/3f06f95ebe617b192251ef756518690f5bc7ff76/src/transformers/models/llava/processing_llava.py#L143-L164. Such processing requires some arguments/keywords from the processor's config which, as of Oct 16, have not been updated for LLaVA-1.5/1.6/Interleave/Next-Video. The latest LLaVA-Onevision, however, is fully compatible with this new change, which means that `model_max_length` will include all vision tokens. As a result, remember to set a large enough `model_max_length` when finetuning ~~LLaVA-Onevision~~ every model, or you will probably see the loss being 0 all the time, since all input tokens could end up being vision tokens.

I hope I have made this clear enough, but feel free to leave questions if there are any.