zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Apache License 2.0

loss being 0 is a caveat of `model_max_length` #43

Open zjysteven opened 1 month ago

zjysteven commented 1 month ago

TL;DR: Set a large enough model_max_length (e.g., 2048, 4096, or even larger) when finetuning; otherwise you will likely see the training loss stuck at 0.

Today we have enabled finetuning of LLaVA-Onevision in lmms-finetune. There is quite a subtle caveat, though, that is worth mentioning.

In earlier versions of transformers (I can't remember exactly which, but some point before 4.45.2), model_max_length only counts the number of text tokens, without considering the vision tokens. Take LLaVA-1.5 as an example, where each image is translated into 576 tokens when sent to the LLM. This means that if you set model_max_length to 128, then with a prompt including an image, your input sequence length will essentially be 128 - 1 + 576 = 703 (the single image placeholder token is expanded into 576 vision tokens).
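
To make the arithmetic concrete, here is a tiny sketch of the older behavior (my own illustration, not code from transformers or lmms-finetune):

def effective_input_length(model_max_length: int, num_images: int, tokens_per_image: int = 576) -> int:
    # each image placeholder (counted as 1 text token) is expanded into
    # tokens_per_image vision tokens after the processor runs
    return model_max_length - num_images + num_images * tokens_per_image

print(effective_input_length(128, 1))  # 703, i.e., 128 - 1 + 576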

More recent transformers implementations have started to include the vision tokens in model_max_length; you can see it here: https://github.com/huggingface/transformers/blob/3f06f95ebe617b192251ef756518690f5bc7ff76/src/transformers/models/llava/processing_llava.py#L143-L164. Such processing requires some arguments/keywords from the processor's config, which, as of Oct 16, have not been updated for LLaVA-1.5/1.6/Interleave/Next-Video. The latest LLaVA-Onevision, however, is fully compatible with this new change, which means that model_max_length will include all vision tokens. As a result, remember to set a large enough model_max_length when finetuning every model (especially LLaVA-Onevision); otherwise you will probably see the loss being 0 all the time, because all of the input tokens that survive truncation could be vision tokens.
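
To see why the loss comes out as exactly 0: if truncation to model_max_length cuts the sequence off before the assistant response, every label is masked with -100 and no token contributes to the loss. Below is a minimal standalone sketch of that effect (my own illustration, not the lmms-finetune code; the exact loss reduction in the trainer may differ):

import torch
import torch.nn.functional as F

vocab_size, seq_len = 32000, 703
logits = torch.randn(1, seq_len, vocab_size)
labels = torch.full((1, seq_len), -100)  # everything masked: the answer was truncated away

print("supervised tokens:", (labels != -100).sum().item())  # 0 -> nothing to learn from
loss = F.cross_entropy(
    logits.view(-1, vocab_size), labels.view(-1),
    ignore_index=-100, reduction="sum",
)
print(loss.item())  # 0.0 with a sum reduction; in this thread it shows up as a training loss of 0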

I hope that I have made this clear enough, but feel free to leave questions if there are any.

98986oiuoy commented 1 month ago

The comment is so helpful!

sunfanyunn commented 1 month ago

I'm still getting a loss of 0 after setting --model_max_length to 4096 or more (only with llava-onevision). Are there other reasons that could be causing this?

zjysteven commented 1 month ago

@sunfanyunn model_max_length might still be small relative to your input. You can use this script to examine the output of the collator and see whether model_max_length is large enough.

import torch
torch.set_printoptions(profile="full", linewidth=240)
from torch.utils.data import DataLoader

# local modules from the lmms-finetune repo
from datasets import LazySupervisedDataset
from collators import COLLATORS
from loaders import LOADERS
from supported_models import MODEL_HF_PATH

model_id = "llava-onevision-0.5b-ov"
model_family_id = "llava-onevision"

dataset = LazySupervisedDataset(
    data_path='./example_data/single_image.json', # use your own data here
    image_folder='./example_data/images',
    video_folder='./example_data/videos',
    model_family_id=model_family_id,
)

# load only the tokenizer/processor/config; the model weights are not needed here
_, tokenizer, processor, config = LOADERS[model_family_id](
    model_hf_path=MODEL_HF_PATH[model_id],
    model_local_path=MODEL_HF_PATH[model_id],
    compute_dtype=torch.float16,
).load(load_model=False)
tokenizer.model_max_length = 4096  # set this to the value you use for finetuning
collator = COLLATORS[model_family_id](
    config=config,
    processor=processor,
    tokenizer=tokenizer
)

dataloader = DataLoader(dataset, batch_size=2, collate_fn=collator)

# grab one batch and inspect what the collator produces
batch = next(iter(dataloader))
print(batch["input_ids"])
print()
print(batch["labels"])
print()
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=False))
# decode only the supervised (non -100) label positions of the second sample
print(tokenizer.decode(
    batch["labels"][1][torch.where(batch["labels"][1] != -100)[0]], skip_special_tokens=True
))
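
As a small optional check one could append to the script above (my suggestion, not part of the original snippet): if the count below is 0 for a sample, the answer tokens were truncated away, that sample contributes nothing to the loss, and model_max_length should be increased.

# count how many label positions are actually supervised for each sample
num_supervised = (batch["labels"] != -100).sum(dim=1)
print("supervised tokens per sample:", num_supervised.tolist())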

sunfanyunn commented 1 month ago

I realized my input images (of size 1080 x 1080) are being tokenized into 7371 tokens. @zjysteven am I missing anything obvious?

zjysteven commented 1 month ago

[screenshot of a table from the LLaVA-Onevision paper] This is from the LLaVA-Onevision paper, which shows that the maximum number of tokens for one image is 7290. Although yours is slightly higher (probably because of a few extra marker tokens such as newline tokens), I don't think anything is wrong. This is what I meant earlier: your model_max_length may not be large enough.

sunfanyunn commented 1 month ago

I am aware, thank you! But I think my images are represented in the single-image way even when I provide multiple images.

zjysteven commented 1 month ago

Oh, I see your point now. I briefly browsed through HuggingFace's preprocessing code but didn't find a place where it distinguishes between "single-image" and "multi-image" inputs with different preprocessing; it seems to me that currently all images are processed with "anyres", which results in what you saw here.

Meanwhile, I do see that the official implementation's training code (https://github.com/LLaVA-VL/LLaVA-NeXT/blob/79ef45a6d8b89b92d7a8525f077c3a3a9894a87d/llava/train/train.py#L1140-L1148) does distinguish single-image from multi-image inputs during training:

        if "image" in sources[0]:
            image_file = self.list_data_dict[i]["image"]
            if type(image_file) is list:
                image = [self.process_image(f) for f in image_file]
                # Handling multi images
                # overwrite to process with simple pad 
                if len(image_file) > 1:
                    image = [self.process_image(f, "pad") for f in image_file]
                    image = [[im[0], im[1], "image"] for im in image]

Tagging @zucchini-nlp to see if she can kindly confirm this and has any idea.

zucchini-nlp commented 1 month ago

Hey all!

Yes, you're right, currently the HF implementation doesn't distinguish between the single- vs multi-image setting. AFAIR, inference in the original implementation also did not, unless I am missing something, as many things changed in the course of porting the model. I can check the inference in the original repo later.

I am sorry that HF doesn't support training the same way as in the paper. The reason is that I tried to reduce complexity for the model and aimed at inference-first use cases. We can add the single- vs multi-image distinction, but it would require padding/unpadding similar to Mllama, which in that case means more code to maintain.

Let me check how the inference works in the original repo and I'll come back to this issue; right now I'm a bit short on bandwidth.

sunfanyunn commented 1 month ago

Thank you!

sailfish009 commented 1 month ago

One thing I would like to share:

### After finetuning
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration  # llava-onevision
from transformers import AutoProcessor, LlavaForConditionalGeneration           # llava
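
For completeness, a loading sketch building on the tip above (the checkpoint path is a placeholder; swap in the class that matches your model family):

import torch
from transformers import AutoProcessor, LlavaOnevisionForConditionalGeneration

finetuned_path = "path/to/your/finetuned/llava-onevision"  # placeholder path
processor = AutoProcessor.from_pretrained(finetuned_path)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    finetuned_path, torch_dtype=torch.float16, device_map="auto"
)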