zzxslp / SoM-LLaVA

[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

How to load the som-llava model using the transformers library? #3

Closed · iz2late closed this issue 4 months ago

iz2late commented 4 months ago

I attempted to use the following code, but unfortunately, it didn't work out:

from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("zzxslp/som-llava-v1.5-13b").to('cuda').eval()
processor = AutoProcessor.from_pretrained("zzxslp/som-llava-v1.5-13b")

Is it possible to load the som-llava model directly with the Transformers library? Is this currently supported, or is the checkpoint not compatible with this approach?

zzxslp commented 4 months ago

Hi! The checkpoint is formatted the same way as in the official LLaVA repo, so you should be able to train, evaluate, and run the demo by following their instructions. We will look into integrating it with HF later.
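For reference, here is a minimal sketch of loading the original (non-HF) checkpoint through the official LLaVA codebase's model builder. The load_pretrained_model and get_model_name_from_path helpers below come from that repo, and their exact arguments may differ between LLaVA versions, so treat this as an illustration rather than the authors' prescribed recipe.

# Sketch: load the original checkpoint with the official LLaVA codebase.
# Assumes the LLaVA repo is installed; helper signatures may vary by version.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "zzxslp/som-llava-v1.5-13b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
)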

zzxslp commented 4 months ago

Hi, we've converted our model into HF format; you can access it here: https://huggingface.co/zzxslp/som-llava-v1.5-13b-hf. Here is some example code:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"

model = LlavaForConditionalGeneration.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate a short response and decode it back to text
generate_ids = model.generate(**inputs, max_new_tokens=20)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
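Since the original attempt tried to run on a GPU, here is a variant of the same example using standard Transformers options to load the weights in half precision and move everything to CUDA. The dtype and device choices are assumptions for illustration, not part of the authors' example.

import torch
from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_path = "zzxslp/som-llava-v1.5-13b-hf"

# Load weights in fp16 and place the model on the GPU to reduce memory use (assumed setup).
model = LlavaForConditionalGeneration.from_pretrained(model_path, torch_dtype=torch.float16).to("cuda")
processor = AutoProcessor.from_pretrained(model_path)

prompt = "USER: <image>\nWhat's the content of the image? ASSISTANT:"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Move the processed tensors to the same device; floating-point tensors are cast to fp16.
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda", torch.float16)

generate_ids = model.generate(**inputs, max_new_tokens=20)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)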