zjysteven / lmms-finetune

A minimal codebase for finetuning large multimodal models, supporting llava-1.5/1.6, llava-interleave, llava-next-video, llava-onevision, qwen-vl, qwen2-vl, phi3-v etc.
Apache License 2.0
150 stars 17 forks

Long Output After Finetuning #47

Open TonyJiang17 opened 5 hours ago

TonyJiang17 commented 5 hours ago

Has anyone ever run into the issue where, after finetuning, the output doesn't know when to end and only stops once max new tokens is reached? Does it have to do with the tokenizer not adding an eos token to the end?

I am specifically finetuning llava-next-video...

zjysteven commented 5 hours ago

Yeah, not having the eos token is probably the cause. Let me update it real soon. Currently I'm just using the chat template from huggingface, which does not apply the eos token, probably because it's designed for inference.
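The underlying issue can be sketched in a few lines: if eos never appears as a supervised target, the model is never trained to stop. A minimal sketch of the fix, assuming llama-style token ids (the eos id 2 and the helper name are hypothetical, not the repo's actual code):

```python
IGNORE_INDEX = -100  # label value masked out of the loss
EOS_ID = 2           # hypothetical eos token id (llama-family style)

def append_eos(input_ids, labels, eos_id=EOS_ID):
    # Append eos to both the inputs and the labels so that emitting
    # eos after the assistant turn is actually supervised.
    if not input_ids or input_ids[-1] != eos_id:
        input_ids = input_ids + [eos_id]
        labels = labels + [eos_id]
    return input_ids, labels

ids, labs = append_eos([1, 5, 9], [IGNORE_INDEX, 5, 9])
print(ids)   # [1, 5, 9, 2]
print(labs)  # [-100, 5, 9, 2]
```

The guard makes the helper idempotent, so applying it twice doesn't duplicate the eos token.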

zjysteven commented 5 hours ago

@TonyJiang17 Would you try again and let me know if it helps on llava-next-video?

TonyJiang17 commented 5 hours ago

I already made the following change locally and am training something now. Does this look correct to you? I'll let you know if there are any changes. I made this change in the loader file for llava-next-video:

processor = LlavaNextVideoProcessor.from_pretrained(self.model_hf_path, add_eos_token=True)

zjysteven commented 4 hours ago

One more change is needed, which requires a monkey patch of huggingface's apply_chat_template. They hard-coded add_special_tokens=False, so the bos and eos tokens still won't be added even with add_eos_token=True.
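The monkey-patch idea can be sketched with a stand-in class (DummyTokenizer below is purely illustrative; the real target is the apply_chat_template method on transformers tokenizers):

```python
class DummyTokenizer:
    # Stand-in for a huggingface tokenizer whose apply_chat_template
    # defaults to add_special_tokens=False internally.
    bos_token, eos_token = "<s>", "</s>"

    def apply_chat_template(self, messages, add_special_tokens=False):
        text = "".join(m["content"] for m in messages)
        if add_special_tokens:
            text = self.bos_token + text + self.eos_token
        return text

_original = DummyTokenizer.apply_chat_template

def patched_apply_chat_template(self, messages, **kwargs):
    # Override the hard-coded default so bos/eos are added during training.
    kwargs["add_special_tokens"] = True
    return _original(self, messages, **kwargs)

DummyTokenizer.apply_chat_template = patched_apply_chat_template

tok = DummyTokenizer()
print(tok.apply_chat_template([{"role": "user", "content": "hi"}]))
# <s>hi</s>
```

Patching the method on the class (rather than editing installed library code) keeps the change local to the training codebase.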

TonyJiang17 commented 4 hours ago

Oh shoot, is there nothing we could do locally to change it? Unless, I guess, we manually add them?

zjysteven commented 4 hours ago

I've pushed a fix here: https://github.com/zjysteven/lmms-finetune/commit/c95ea1a7c9dfc33ba9820d83224c3a0eb93dfa0d. It's not that many changes, and I have confirmed from the outputs of the collator that it now includes bos and eos. If you could just pull and train again, that would be great.

TonyJiang17 commented 4 hours ago

Thanks! Will try and let you know; may not be able to give an update until tomorrow though.

zjysteven commented 4 hours ago

No worries. I have a local file for checking the output of the collator, in case it helps:

import torch
torch.set_printoptions(profile="full", linewidth=240)
from torch.utils.data import DataLoader

from datasets import LazySupervisedDataset
from collators import COLLATORS
from loaders import LOADERS
from supported_models import MODEL_HF_PATH

model_id = "llava-next-video-7b"
model_family_id = "llava-next-video"

# Build the example dataset shipped with the repo
dataset = LazySupervisedDataset(
    data_path='./example_data/video.json',
    image_folder='./example_data/images',
    video_folder='./example_data/videos',
    model_family_id=model_family_id,
)

# Load tokenizer/processor/config only; skip loading model weights
_, tokenizer, processor, config = LOADERS[model_family_id](
    model_hf_path=MODEL_HF_PATH[model_id],
    model_local_path=MODEL_HF_PATH[model_id],
    compute_dtype=torch.float16,
).load(load_model=False)
tokenizer.model_max_length = 256
collator = COLLATORS[model_family_id](
    config=config,
    processor=processor,
    tokenizer=tokenizer
)

dataloader = DataLoader(dataset, batch_size=2, collate_fn=collator)

# Inspect one batch: token ids, labels, and decoded text (special tokens kept)
batch = next(iter(dataloader))
print(batch["input_ids"])
print()
print(batch["labels"])
print()
print(tokenizer.batch_decode(batch["input_ids"], skip_special_tokens=False))
print(tokenizer.decode(
    batch["labels"][1][torch.where(batch["labels"][1] != -100)[0]]
))

zjysteven commented 4 hours ago

Oh wait. It seems like the eos token is not included in the labels. One sec.
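This kind of slip can be caught with a quick check on the collator output: the last unmasked label should be the eos id. A hypothetical helper (not part of the repo), shown here on plain lists:

```python
IGNORE_INDEX = -100  # label positions excluded from the loss

def labels_end_with_eos(labels, eos_id, ignore_index=IGNORE_INDEX):
    # The last supervised (non-masked) label should be the eos token;
    # otherwise the model is never trained to stop generating.
    supervised = [t for t in labels if t != ignore_index]
    return bool(supervised) and supervised[-1] == eos_id

print(labels_end_with_eos([-100, 5, 9, 2], eos_id=2))  # True
print(labels_end_with_eos([-100, 5, 9], eos_id=2))     # False
```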

zjysteven commented 4 hours ago

Now it works. Again, please pull the latest code.