zzxslp / SoM-LLaVA

[COLM-2024] List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs

Annotated Images Download #1

Closed NormXU closed 4 months ago

NormXU commented 4 months ago

Thank you very much for your awesome work. Would you mind providing the annotated image download links?

zzxslp commented 4 months ago

Hi, thanks for the suggestion! It is available on HF as som_train2017.zip

NormXU commented 4 months ago

@zzxslp The GPT responses appear to be inaccurate. Could you please review the image annotations for any discrepancies?

[screenshot: 20240428-105901]

zzxslp commented 4 months ago

Hi, to be clear, the GPT-4V results are definitely noisy and not perfect (for example, sometimes GPT-4V describes an object that is merely close to the tag). We manually checked a few examples and found the results to be OK, and our trained SoM-LLaVA can generate reasonably accurate listings. See the listing example (SoM-LLaVA vs. GPT-4V) in the readme.

zzxslp commented 4 months ago

@NormXU Can you help take a look at more images? I no longer have a copy of the original images used to train the model (I no longer have access to the servers at Microsoft, and the data should have been deleted automatically), so the images provided here were re-generated by me this week. The algorithm that generates these tags is deterministic, so the image annotations should be identical across multiple runs (code is provided in the repo for a sanity check). If the inaccuracy you mean is the "tag shifting" issue (for example, tag-10 is a white dog but you think it should be the sidewalk), then that is an inherent bias of GPT-4V, which we have partially fixed by improving the input prompts.
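To illustrate why repeated runs should yield identical annotations: the repo's actual tagging code is not reproduced here, but a minimal sketch of one deterministic numbering rule (assumption: ties broken by list index, ordering by mask area) looks like this.

```python
import numpy as np

def assign_tags(masks: list[np.ndarray]) -> list[tuple[int, np.ndarray]]:
    """Assign numeric tags 1..N in a deterministic order (largest mask
    area first, index as tie-breaker), so repeated runs on the same set
    of masks always produce the same tag IDs."""
    order = sorted(range(len(masks)),
                   key=lambda i: (-int(masks[i].sum()), i))
    return [(tag + 1, masks[i]) for tag, i in enumerate(order)]
```

Because the sort key depends only on the masks themselves, any two runs over the same segmentation output number the objects identically, which is what makes the sanity check in the repo meaningful.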

NormXU commented 4 months ago

@zzxslp Thank you for your quick reply. After further review of the data, I have identified additional cases that I believe may be flawed:

[screenshot: 20240428-120701] The tag labeled 'Tag-5' indicates the wing of the aircraft, yet it appears to be placed far away from the airplane.

[screenshot: 20240428-120711] 'Tag-2' is intended to denote the zebra; however, the tag is positioned on the grass. Given that the zebra is the main subject of this image, placing the tag directly on it would be more appropriate.

While it's true that GPT-4V may exhibit bias or hallucination, I believe we can place the generated tags more precisely on or around the objects.

I've started a trial run with som_train2017 to see what results I can get. I'll keep you informed if the model fails to generate accurate listings. Thank you again for your work.

zzxslp commented 4 months ago

Hi, thanks for the verification; the tags are in the correct order. Though we've improved the text quality via demonstrations, the annotations from GPT-4V are certainly not perfect. As for your other suggestion, we found it hard to generate accurate tag placements with semantic-SAM (or similar models); let me know if there is a way to place tags more accurately (other than human annotation). We want the whole process to be automatic so we can scale to large datasets.