wgcban / ChangeFormer

[IGARSS'22]: A Transformer-Based Siamese Network for Change Detection
https://www.wgcban.com/research#h.e51z61ujhqim
MIT License

How to get this model trained on multiple machines in a distributed manner (like DistributedDataParallel)? #90

Closed StorywithLove closed 10 months ago

StorywithLove commented 10 months ago

The model and the code given are amazing! They still perform well when migrated to our own dataset. On the downside, the framework doesn't take distributed training into account, and I wonder if the authors have a good way to train it on multiple machines, or whether this will be added in a future release!

wgcban commented 10 months ago

@Programming-Music Thanks for your feedback! I appreciate it. I can certainly work on adding distributed data parallel (DDP) to the codebase as future work; I will make this improvement in my free time. In the meantime, if you want to train the model on multiple GPUs, you can easily use data parallel (DP) with just a couple of lines of code changed, although it will not be as efficient as distributed data parallel (DDP).
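For readers landing here later, the couple-of-lines DP change mentioned above is a standard `torch.nn.DataParallel` wrap. This is a minimal, hedged sketch, not ChangeFormer's actual code: the `nn.Linear` stands in for the change-detection network, and the device ids are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in for the change-detection network; replace with the real model.
model = nn.Linear(8, 2)

# nn.DataParallel replicates the module onto each listed GPU on every
# forward pass and splits the batch along dim 0, so batch_size should be
# greater than 1. With fewer than two visible GPUs it is left unwrapped.
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model.cuda(), device_ids=[0, 1])

x = torch.randn(4, 8)   # batch of 4 gets split across the GPUs
out = model(x)          # outputs are gathered back onto device_ids[0]
print(out.shape)        # torch.Size([4, 2])
```

Note that DP keeps a single process and gathers gradients on GPU 0, which is why DDP (one process per GPU, gradient all-reduce) scales better across machines.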

StorywithLove commented 10 months ago

Yes, your answer is very helpful. When I set gpu_ids to 0,1 with a batch_size greater than 1 (our custom img_size is 1024, so batch_size was often set to 1 just to get it running), the model runs successfully on multiple GPUs!
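For completeness, the multi-GPU run described above would look roughly like the following; the flag names are assumed from the discussion, so check the repository's training script for the exact argument names and any other required options.

```shell
# Assumed invocation: two GPUs, img_size 1024, batch_size > 1
python main_cd.py --gpu_ids 0,1 --img_size 1024 --batch_size 2
```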

wgcban commented 10 months ago

@Programming-Music Great! Thanks for confirming!