[PDF] [Project Page]
This repository contains the PyTorch implementation for the ICLR 2024 Spotlight Paper "Generating Images with 3D Annotations Using Diffusion Models" by the following authors.
Wufei Ma*, Qihao Liu*, Jiahao Wang*, Angtian Wang, Xiaoding Yuan, Yi Zhang, Zihao Xiao, Beijia Lu, Ruxiao Duan, Yongrui Qi, Adam Kortylewski, Yaoyao Liu✉, Alan Yuille
We present 3D Diffusion Style Transfer (3D-DST), a simple and effective approach for generating images with 3D annotations using diffusion models. Our method exploits ControlNet, which extends diffusion models by accepting visual prompts in addition to text prompts. We render 3D CAD models from a variety of poses and viewing directions, compute the edge maps of the rendered images, and use these edge maps as visual prompts to generate realistic images. With explicit 3D geometry control, we can easily change the 3D structure of the objects in the generated images and obtain ground-truth 3D annotations automatically. Experiments on image classification, 3D pose estimation, and 3D object detection show that 3D-DST data effectively improves model performance in both in-distribution and out-of-distribution settings.
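As a rough illustration of the visual-prompt step, the sketch below computes a Canny edge map from a rendered image with OpenCV; the file paths and Canny thresholds are placeholders, not settings from the paper.

```python
import cv2

# Hypothetical path to an image rendered from a CAD model with Blender.
rendered = cv2.imread("renders/n02690373_0001.png", cv2.IMREAD_GRAYSCALE)

# Canny edge map used as the visual prompt for ControlNet.
# The thresholds (100, 200) are illustrative defaults, not the paper's values.
edges = cv2.Canny(rendered, 100, 200)
cv2.imwrite("edges/n02690373_0001.png", edges)
```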
Besides the code to reproduce our data generation pipeline, we also release the following data to support other research projects in the community (a download sketch follows the list):
- ccvl/3D-DST-models
- ccvl/3D-DST-captions
- ccvl/3D-DST-data
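The `ccvl/...` identifiers suggest Hugging Face Hub repositories; assuming that is where the data is hosted, a minimal download sketch with `huggingface_hub` could look like this (the `repo_type` is also an assumption):

```python
from huggingface_hub import snapshot_download

# Assumes the releases are hosted as dataset repos on the Hugging Face Hub;
# switch repo_type if a repo is hosted as a model instead.
local_dir = snapshot_download(repo_id="ccvl/3D-DST-data", repo_type="dataset")
print(f"Downloaded to {local_dir}")
```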
Please check INSTALL.md for installation instructions.
Rendering images with Blender.
```bash
python3 scripts/render_synthetic_data.py \
    --data_path DST3D/train \
    --model_path /path/to/all_dst_models \
    --shapenet_path /path/to/ShapeNetCore.v2 \
    --objaverse_path /path/to/objaverse_models \
    --omniobject3d_path /path/to/OpenXD-OmniObject3D-New \
    --synsets n02690373 \
    --workers 48 \
    --num_samples 2500 \
    --disable_random_distance
```
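The renderer sweeps over poses and viewing directions; as a hedged sketch of how camera poses on a viewing sphere could be sampled inside Blender (not the repository's actual code, and the ranges are illustrative):

```python
import math
import random
import bpy

def sample_camera_pose(radius=2.5):
    """Place the scene camera at a random azimuth/elevation on a sphere
    around the origin. The ranges here are illustrative, not the paper's."""
    azimuth = random.uniform(0.0, 2.0 * math.pi)
    elevation = random.uniform(math.radians(-10), math.radians(60))
    cam = bpy.context.scene.camera
    cam.location = (
        radius * math.cos(elevation) * math.cos(azimuth),
        radius * math.cos(elevation) * math.sin(azimuth),
        radius * math.sin(elevation),
    )
    # Point the camera at the origin; the pose (azimuth, elevation, radius)
    # doubles as the ground-truth 3D annotation for the rendered image.
    direction = -cam.location
    cam.rotation_euler = direction.to_track_quat('-Z', 'Y').to_euler()
    return azimuth, elevation
```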
DST image generation with visual prompts and LLM prompts.
```bash
CUDA_VISIBLE_DEVICES=0 python3 scripts/controllable_generation.py \
    --model_name control_v11p_sd15_canny \
    --data_path DST3D \
    --data_name image_dst \
    --synsets n02690373
```
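Under the hood, this step runs a ControlNet-conditioned diffusion model; a minimal sketch with the `diffusers` library, using the same `control_v11p_sd15_canny` checkpoint but a placeholder prompt and edge-map path, might look like:

```python
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Load the Canny-conditioned ControlNet and a Stable Diffusion 1.5 base model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# Edge map rendered from a CAD model; path and prompt are placeholders.
edge_map = load_image("edges/n02690373_0001.png")
image = pipe("a photo of an airliner on a runway", image=edge_map).images[0]
image.save("dst/n02690373_0001.png")
```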
Run the K-fold Consistency Filter (KCF) on the generated images. The KCF code trains a ResNet50 pose estimation model and produces a validation loss for each sample. The results are saved as a JSON file in `--output_dir`.
```bash
CUDA_VISIBLE_DEVICES=0 python3 scripts/run_kcf_filter.py \
    --data_path DST3D/train \
    --category n02690373 \
    --output_dir exp/kcf_n02690373
```
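The per-sample losses in that JSON can then be used to discard inconsistent generations; below is a sketch of a simple percentile filter, assuming a flat `{sample_id: loss}` schema and a `losses.json` filename (both are assumptions; the actual output format may differ):

```python
import json

# Assumes the KCF output maps sample IDs to validation losses;
# the actual JSON schema in exp/kcf_n02690373 may differ.
with open("exp/kcf_n02690373/losses.json") as f:
    losses = json.load(f)

# Keep the 80% of samples with the lowest validation loss
# (the keep ratio here is illustrative, not the paper's setting).
ranked = sorted(losses.items(), key=lambda kv: kv[1])
kept = [sample_id for sample_id, _ in ranked[: int(0.8 * len(ranked))]]
print(f"Kept {len(kept)} of {len(ranked)} samples")
```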
We release our generated 3D-DST data for all 1000 classes in ImageNet-1k here. We also provide the DeiT-small models trained on our 3D-DST data.
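A released checkpoint of this kind could be loaded with `timm`; the sketch below assumes a standard DeiT-small architecture and a checkpoint stored either as a bare `state_dict` or under a `"model"` key, which may not match the released format exactly.

```python
import timm
import torch

# Build a DeiT-small backbone and load a released checkpoint.
# The checkpoint filename and its key layout are assumptions.
model = timm.create_model("deit_small_patch16_224", pretrained=False, num_classes=1000)
ckpt = torch.load("deit_small_3ddst.pth", map_location="cpu")
model.load_state_dict(ckpt.get("model", ckpt))
model.eval()
```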
Image Classification on ImageNet-200.
| model | data | acc@1 | url |
|---|---|---|---|
| DeiT-small | baseline | 81.5 | checkpoint & log |
| DeiT-small | with 3D-DST | 84.8 | checkpoint & log |
Image Classification on ImageNet-1k. We also report results on ImageNet-1k, comparing the DeiT-small baseline with 3D-DST pretraining.
| model | data | acc@1 |
|---|---|---|
| DeiT-small | baseline | 80.1 |
| DeiT-small | with 3D-DST | 81.1 |
This project is released under the MIT license. Please see the LICENSE file for more information.
If you find this repository helpful, please consider citing:
```bibtex
@inproceedings{ma2024generating,
    title={Generating Images with 3D Annotations Using Diffusion Models},
    author={Wufei Ma and Qihao Liu and Jiahao Wang and Angtian Wang and Xiaoding Yuan and Yi Zhang and Zihao Xiao and Guofeng Zhang and Beijia Lu and Ruxiao Duan and Yongrui Qi and Adam Kortylewski and Yaoyao Liu and Alan Yuille},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=XlkN11Xj6J}
}
```