Failing to reproduce results

Congratulations on the arxiv submission!

I tried to reproduce the results of this paper on top of Huggingface Diffusers, based on the reference implementation provided in the preprint.

I ended up implementing like so:
Changes to txt2img
Changes to diffusers
Some explanation in tweet.

In my independent implementation: structured diffusion changes the images only slightly, and in the 10 samples * 4 prompts that I tried, never made the generations more relevant to the prompt.

structured (left) / regular (right) "two blue sheep and a red goat":

I attach the rest of my results:
A red bird and a green apple.zip A white goat standing next to two black goats.zip two blue sheep and a red goat.zip Two ripe spotted bananas are sitting inside a green bowl on a gray counter.zip

Basically, I'm wondering whether:

this is exactly the kind of difference I should expect to see (in line with the claimed 5–8% advantage)
there's a mistake in my reproduction; better results are possible

Could you possibly read my attention.py and see if it looks like a reasonable interpretation of your algorithm? I changed it substantially to make it to do more work in parallel. I think it should be equivalent, but did I miss something important?

Thanks in advance for any attention you can give this!

Hi,

Thank you for your interest in our work. We also observe that a certain amount of our images are similar to the stable diffusion outputs. As is written in the manuscript, for evaluations, we discarded 20% of the most similar pairs and randomly sampled around 1,500 image pairs for comparison. So the 5-8% improvement could be discounted when considering all images. Among the winning cases in the head-to-head comparison, there are 31% for “fewer missing objects”, 14.1% for “better-matched colors”, and 54.8% for “other attributes or details”. So it is within expectation to see many images with details enhanced (like the last one in "two blue sheep and a red goat", 00054/64, 00058/68 in "a red bird and a green apple", 00026/36 in "a white goat standing..."). I also tried the banana prompt and observed detail enhancement on the "bananas" in 3/10 cases, while the rest look similar. It could be that we randomly ran into some good initialization for Fig. 4 while in general, the improvement is not significant for the banana prompt.

For the conjunction prompt (i.e., using "and" to connect two objects), you may want to try multiple keys and a single value. Note that the single value is not the plain encoding of the original prompt but an aligned version (also see eq. 5-6). This method is more likely to generate both objects simultaneously (like Fig. 1 (right) or Fig. 5 (top right)). However, as reflected in Table 2, these prompts seem quite challenging for existing T2I models, and we still expect incomplete compositions in most cases. If you can run this codebase and fix the seed to 42, you should be able to get the following results.

multiple keys, single value (right) / single key, multiple values (middle) / regular (right) combine_images

I am not entirely sure, but your implementation looks correct. You may use the following initialized noise patterns and see if they result in the same images as above. init.zip

As mentioned in README, improvement is not guaranteed on every sample but rather system-level. Overall we find it hard to quantify images compositionally, and we are still working on metrics beyond human evaluation to improve the experiment section. You may download this batch of examples to have a better understanding of the overall performance. Please let me know if this helps and if you have further questions.

Thanks, Weixi

thanks very much for the detailed response!
okay, looking at the samples from your Google Drive: yeah, they seem to vary in a pretty similar way to the way my own results varied. that's heartening; maybe my reproduction is close or equivalent.

I tried to see if I could generate the same images as from your Google Drive (even by turning off structured diffusion and trying to match your vanilla results).

I used the 4 noised latents .pts you provided, prompt "a red car and a white sheep", 15 steps DPM-Solver++ (2M) (k-diffusion), 16-bit Unet, 32-bit latents, 32-bit sampling, stable-diffusion 1.4, diffusers.

with structured diffusion off:
0.pt 1.pt 2.pt 3.pt

these don't match the "baseline" samples from your Google Drive:
00000-0-a red car and a white sheep 00001-0-a red car and a white sheep 00002-0-a red car and a white sheep

err, well 3.pt came out like 00000-0-a red car and a white sheep.png. seems more likely to be a coincidence though.

maybe I'll have to resort to getting the reference implementation running in order to do a comparison.

thanks for mentioning "multiple keys, single value". I think I built something like that along the way, but deleted it thinking I'd misunderstood. originally I had indeed written it such that it aligned all the noun-phrase embeddings onto one prompt embedding (instead of onto several). so I can undelete that and try it out.

"many attributes with details enhanced" might explain why (my implementation of) structured diffusion upgraded my bird into a photorealistic one ("A red bird and a green apple"):

standard | structured 00001-0-a red car and a white sheep 00002-0-a red car and a white sheep

I guess next step is for me to try running the reference implementation on my machine, see if I can get the same baseline outputs, then see if I can get the same structured diffusion outputs with my algorithm. if it is indeed an equivalent implementation: it'd mean we can enjoy improved perf (does more work in parallel and fuses a lot of multiplication) and diffusers support. the downside however is that if my implementation is equivalent, then the results I got would be valid too (which didn't come close to the best results from the paper). but I haven't tried "multiple keys, single value", so that's worth exploration too.

I forgot to mention that the prompt is "a white car and a red sheep", and the provided init noise patterns correspond to the 12 images displayed in the reply, not to any images in Google Drive.

Even with the same noise initialization, there might be some other randomness that causes slight differences between the provided images and your generation results. But you should be able to get the same 12 images using the codebase here. We are also working on a huggingface demo using Gradio, and hopefully, we can make it available soon. Hope this helps!

yes, this is indeed very soon.

weixi-feng / Structured-Diffusion-Guidance

Failing to reproduce results #2