xyfJASON / ctrlora

Codebase for "CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation"
Apache License 2.0

Training test! #9

toyxyz opened this issue 2 weeks ago (status: Open)

toyxyz commented 2 weeks ago

I prepared 1000 images and ran a training test. I set max steps to 1000 and the training finished in just 6 minutes, yet the result is very cool!

Do you have any tips for running training?

[image]

[image]

cchance27 commented 2 weeks ago

That's so cool!

Jerry-155 commented 2 weeks ago

File "E:\CN\ctrlora\myenv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in Have you ever had that problem?

Jerry-155 commented 2 weeks ago

Administrator@win-20231229PFV MINGW64 /e/CN/ctrlora
$ python scripts/train_ctrlora_finetune.py \
    --dataroot ./data/shoes \
    --config ./configs/ctrlora_finetune_sd15_rank128.yaml \
    --sd_ckpt ./ckpts/sd15/v1-5-pruned.ckpt \
    --cn_ckpt ./ckpts/ctrlora-basecn/ctrlora_sd15_basecn700k.ckpt \
    --name shoes1 \
    --max_steps 5000

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:118: UserWarning: You defined a validation_step but have no val_dataloader. Skipping val loop.
  rank_zero_warn("You defined a validation_step but have no val_dataloader. Skipping val loop.")
E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\trainer\configuration_validator.py:280: LightningDeprecationWarning: Base LightningModule.on_train_batch_start hook signature has changed in v1.5. The dataloader_idx argument will be removed in v1.7.
  rank_zero_deprecation(
initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:50388 (system error: 10049 - the requested address is not valid in its context).
[W ..\torch\csrc\distributed\c10d\socket.cpp:601] [c10d] The client socket has failed to connect to [kubernetes.docker.internal]:50388 (system error: 10049 - the requested address is not valid in its context).
logging improved.
Dataset size: 1163
Number of devices: 1
Batch size per device: 1
Gradient accumulation: 1
Total batch size: 1
No module 'xformers'. Proceeding without it.
ControlFinetuneLDM: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
making attention of type 'vanilla' with 512 in_channels
Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
making attention of type 'vanilla' with 512 in_channels
Loaded model config from [./configs/ctrlora_finetune_sd15_rank128.yaml]
Loaded state_dict from [./ckpts/sd15/v1-5-pruned.ckpt]
Loaded state_dict from [./ckpts/ctrlora-basecn/ctrlora_sd15_basecn700k.ckpt]
Successfully initialize SD from ./ckpts/sd15/v1-5-pruned.ckpt
Successfully initialize ControlNet from ./ckpts/ctrlora-basecn/ctrlora_sd15_basecn700k.ckpt
Traceback (most recent call last):
  File "E:\CN\ctrlora\scripts\train_ctrlora_finetune.py", line 129, in <module>
    trainer.fit(model, dataloader)
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 735, in fit
    self._call_and_handle_interrupt(
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 682, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 770, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 1131, in _run
    self.accelerator.setup_environment()
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\accelerators\gpu.py", line 39, in setup_environment
    super().setup_environment()
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 83, in setup_environment
    self.training_type_plugin.setup_environment()
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 192, in setup_environment
    self.setup_distributed()
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 279, in setup_distributed
    init_dist_connection(self.cluster_environment, self.torch_distributed_backend)
  File "E:\CN\ctrlora\myenv\lib\site-packages\pytorch_lightning\utilities\distributed.py", line 386, in init_dist_connection
    torch.distributed.init_process_group(
  File "E:\CN\ctrlora\myenv\lib\site-packages\torch\distributed\distributed_c10d.py", line 761, in init_process_group
    default_pg = _new_process_group_helper(
  File "E:\CN\ctrlora\myenv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in

toyxyz commented 2 weeks ago

File "E:\CN\ctrlora\myenv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in Have you ever had that problem?

I solved the problem with this method. https://github.com/xyfJASON/ctrlora/issues/8

Jerry-155 commented 2 weeks ago

File "E:\CN\ctrlora\myenv\lib\site-packages\torch\distributed\distributed_c10d.py", line 886, in _new_process_group_helper raise RuntimeError("Distributed package doesn't have NCCL " "built in") RuntimeError: Distributed package doesn't have NCCL built in Have you ever had that problem?文件“E:\CN\ctrlora\myenv\lib\site-packages\torch\distributed\distributed_c10d.py”,第 886 行,在 _new_process_group_helper 中引发运行时错误(“分布式包没有 NCCL ”“内置”)运行时错误:分布式包没有内置 NCCL 您遇到过这个问题吗?

I solved the problem with this method.我用这个方法解决了这个问题。 #8

Thanks I figured out how to fix it, just need to delete the strategy='ddp', that's it!
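For anyone hitting the same error on Windows: the NCCL backend is not built into the Windows PyTorch wheels, so any code path that initializes torch.distributed for DDP fails. Below is a minimal sketch of the change, assuming the Trainer in scripts/train_ctrlora_finetune.py is constructed roughly like this; every argument except `strategy='ddp'` is illustrative, so adapt it to the actual script.

```python
import pytorch_lightning as pl

# Before: DDP needs a working torch.distributed backend, and NCCL is not
# available on Windows, which raises the RuntimeError above.
# trainer = pl.Trainer(gpus=1, strategy='ddp', precision=32, max_steps=5000)

# After: drop strategy='ddp' so Lightning runs a single process on one GPU.
# torch.distributed is never initialized, so the NCCL error goes away.
trainer = pl.Trainer(gpus=1, precision=32, max_steps=5000)
```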

Jerry-155 commented 2 weeks ago

What NVIDIA graphics card did you train on?

toyxyz commented 2 weeks ago

> What NVIDIA graphics card did you train on?

I used an RTX 4090.

Jerry-155 commented 1 week ago

> What NVIDIA graphics card did you train on?
>
> I used an RTX 4090.

I have a 4080, and I can't train.

toyxyz commented 1 week ago

I trained using 1000 hand-drawn images. It's a little fuzzy because the lines aren't consistent, but it works.

[image]

toyxyz commented 1 week ago

> What NVIDIA graphics card did you train on?
>
> I used an RTX 4090.
>
> I have a 4080, and I can't train.

VRAM usage is not that high; 16 GB should work. Or reduce the batch size.
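If VRAM is still the bottleneck, the usual PyTorch Lightning knobs are a smaller per-device batch size, gradient accumulation, and mixed precision. A minimal sketch, assuming the script wires up a standard Lightning Trainer; the values below are illustrative, not the script's defaults:

```python
import pytorch_lightning as pl

# Memory-saving Trainer settings (illustrative values):
trainer = pl.Trainer(
    gpus=1,
    precision=16,               # mixed precision roughly halves activation memory
    accumulate_grad_batches=4,  # keeps the effective batch size while the per-device batch stays small
    max_steps=5000,
)
# ...then pass a DataLoader built with a smaller batch_size to trainer.fit(...)
```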

Jerry-155 commented 1 week ago

You've got a great training set. Are you from China? I want to train models that recognise different product materials and then optimise the product's light and shadow based on the material.

toyxyz commented 1 week ago

> You've got a great training set. Are you from China? I want to train models that recognise different product materials and then optimise the product's light and shadow based on the material.

You can create datasets using texture images and 3D models that use those textures.
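As a starting point, here is a hypothetical helper for assembling such a dataset of paired images. It assumes the ControlNet-style layout that `--dataroot` typically points at (a `source/` folder of condition images, a `target/` folder of ground-truth renders, and a `prompt.json` with one JSON object per line); the folder names, dataset name, and caption are assumptions, so check the repo's README for the exact format the training script expects.

```python
import json
from pathlib import Path

dataroot = Path("./data/materials")      # hypothetical dataset name
source_dir = dataroot / "source"         # condition images (e.g. texture/material renders)
target_dir = dataroot / "target"         # ground-truth renders lit the way you want

with open(dataroot / "prompt.json", "w", encoding="utf-8") as f:
    for src in sorted(source_dir.iterdir()):
        tgt = target_dir / src.name      # assumes matching filenames in both folders
        if not tgt.exists():
            continue
        record = {
            "source": f"source/{src.name}",
            "target": f"target/{src.name}",
            "prompt": "a product photo",  # replace with a real per-image caption
        }
        f.write(json.dumps(record) + "\n")
```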

crapthings commented 1 week ago

@toyxyz How do you get a 3D pose like this?

[image]

toyxyz commented 1 week ago

> @toyxyz How do you get a 3D pose like this?
>
> [image]

Rendered using Blender.
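For reference, condition images like this can be batch-rendered from a posed character with a short Blender (bpy) script. A minimal sketch, to be run inside Blender; the object name "Armature", the output path, the number of views, and the resolution are all assumptions about your scene:

```python
import math
import bpy

scene = bpy.context.scene
scene.render.resolution_x = 512
scene.render.resolution_y = 512

character = bpy.data.objects["Armature"]   # assumed name of the posed rig

for i in range(8):
    # Spin the character in 45-degree steps so each render shows a different angle.
    character.rotation_euler[2] = math.radians(45 * i)
    scene.render.filepath = f"//renders/pose_{i:03d}.png"   # path relative to the .blend file
    bpy.ops.render.render(write_still=True)
```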

LynnHo commented 1 week ago

> I trained using 1000 hand-drawn images. It's a little fuzzy because the lines aren't consistent, but it works.
>
> [image]

@toyxyz Hi, many thanks for your testing!!! By the way, how did you collect this kind of data?

toyxyz commented 1 week ago

> I trained using 1000 hand-drawn images. It's a little fuzzy because the lines aren't consistent, but it works.
>
> [image]
>
> @toyxyz Hi, many thanks for your testing!!! By the way, how did you collect this kind of data?

I drew it myself, of course!