zylo117 / Yet-Another-EfficientDet-Pytorch

A PyTorch re-implementation of the official EfficientDet with SOTA real-time performance and pretrained weights.
GNU Lesser General Public License v3.0
5.2k stars 1.27k forks

Inquiry about training memory on GTX 2080 Ti (D0~D4) #673

Open Ronald-Kray opened 3 years ago

Ronald-Kray commented 3 years ago

@zylo117 Hi, I'm testing the training memory required on a GTX 2080 Ti. Could you check the MiB usage for D0~D4 at batch size 1, please?

| Model | Model size | Training memory (batch size 1) |
| --- | --- | --- |
| EfficientDet-D0 | 15.5 MB | 1,326 MiB |
| EfficientDet-D1 | 26.4 MB | 2,068 MiB |
| EfficientDet-D2 | 32.2 MB | 2,840 MiB |
| EfficientDet-D3 | 47.7 MB | 4,824 MiB |
| EfficientDet-D4 | 81.9 MB | 7,844 MiB |

zylo117 commented 3 years ago

Training takes roughly 3 times as much memory as inference.

Ronald-Kray commented 3 years ago

@zylo117 Those numbers aren't exactly 3 times the inference memory. How can I calculate how much memory training will consume?
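For a rough answer, you can do a back-of-the-envelope estimate rather than an exact calculation: training has to hold the weights, one gradient per weight, the optimizer state (Adam/AdamW keep two extra FP32 buffers per weight), plus the forward activations retained for backprop. A minimal sketch (the function name and the activation figure below are hypothetical, for illustration only):

```python
def estimate_training_mib(num_params, activation_mib, optimizer="adamw"):
    """Rough FP32 training-memory estimate in MiB.

    num_params     : number of model parameters
    activation_mib : activation memory observed at inference (MiB);
                     training keeps these tensors for the backward pass
    """
    bytes_per_param = 4                              # FP32
    weights = num_params * bytes_per_param
    grads = weights                                  # one gradient per weight
    # Adam/AdamW hold two extra FP32 buffers (exp_avg, exp_avg_sq) per weight
    opt_state = 2 * weights if optimizer in ("adam", "adamw") else weights
    fixed_mib = (weights + grads + opt_state) / 2**20
    # Activations roughly double during training: forward tensors are
    # retained, and backward allocates comparable intermediates.
    return fixed_mib + 2 * activation_mib

# Example: EfficientDet-D0 has roughly 3.9M parameters; assume inference
# activations take ~400 MiB (a made-up figure for illustration).
print(round(estimate_training_mib(3.9e6, 400)))
```

This ignores CUDA context overhead, cuDNN workspaces, and allocator fragmentation, which is why measured numbers rarely land on a clean 3x multiple.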

Ronald-Kray commented 3 years ago

@zylo117 Hey! I have a question about memory consumption. I'm training YOLOv5 and EfficientDet on 3x GTX 2080 Ti 11 GB. Any idea why YOLOv5 can fit a larger batch size than EfficientDet, and why EfficientDet requires so much memory for training? It would help if I could calculate the allocated memory directly.

| YOLOv5 (size, FLOPs) | 1 batch (1 GPU), max batch size (1 GPU) | EfficientDet (size, FLOPs) | 1 batch (1 GPU), max batch size (1 GPU) |
| --- | --- | --- | --- |
| Yolov5s (14 MB, 17B FLOPs) | 1 batch: 1,686 MiB, max batch size: 56 (10,054 MiB) | EfficientDet-D0 (15.5 MB, 2.5B FLOPs) | 1 batch: 1,326 MiB, max batch size: 17 (10,132 MiB) |
| Yolov5m (41 MB, 51.3B FLOPs) | 1 batch: 2,010 MiB, max batch size: 29 (10,228 MiB) | EfficientDet-D1 (26.4 MB, 6B FLOPs) | 1 batch: 2,068 MiB, max batch size: 7 (9,768 MiB) |
| Yolov5l (90 MB, 115.4B FLOPs) | 1 batch: 2,528 MiB, max batch size: 17 (10,170 MiB) | EfficientDet-D2 (32.2 MB, 11B FLOPs) | 1 batch: 2,840 MiB, max batch size: 4 (9,002 MiB) |
| Yolov5x (168 MB, 218.8B FLOPs) | 1 batch: 3,168 MiB, max batch size: 11 (10,254 MiB) | EfficientDet-D3 (47.7 MB, 25B FLOPs) | 1 batch: 4,824 MiB, max batch size: 2 (8,738 MiB) |
zylo117 commented 3 years ago

No idea, but it probably has something to do with the CUDA implementation of depthwise convolution and Swish. In practice, EfficientDet consumes only about 100 MB in a naive C++ implementation.
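One way Swish inflates training memory in an autograd framework: written naively as `x * sigmoid(x)`, both the sigmoid output and the multiply inputs get saved for the backward pass. A custom backward can instead keep only `x` and recompute everything from the closed-form gradient, trading a little compute for memory. A NumPy sketch of just the math (not this repo's code):

```python
import numpy as np

# Swish: f(x) = x * sigmoid(x)
# Closed-form gradient used by memory-efficient implementations:
#   f'(x) = s(x) * (1 + x * (1 - s(x))),  where s = sigmoid
# Backward needs only x; no sigmoid/product tensors have to be stored.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x):
    return x * sigmoid(x)

def swish_grad(x):
    # Recomputed entirely from x during the backward pass.
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

# Sanity-check the closed form against a central finite difference.
x = np.linspace(-4, 4, 9)
eps = 1e-5
numeric = (swish(x + eps) - swish(x - eps)) / (2 * eps)
print(np.max(np.abs(numeric - swish_grad(x))))  # close to zero
```

In PyTorch this trick is typically wrapped in a custom `torch.autograd.Function` that saves only the input tensor in `forward` and applies the formula above in `backward`.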