Pre-processing step takes very long if processing several frames simultaneously

Hi,

I am trying to run the detector on several videos at once (the input is a batch of cv::Mat objects). The bottleneck is the preprocessing step, which is computationally heavy in its current form: each frame is cloned, converted between RGB and BGR, concatenated into the batch, and so on. Are there any suggestions on how to optimize this step for batches?

I skimmed the Python implementation, and I think it speeds up this step by using vectorized NumPy operations instead of OpenCV calls, although I am unsure whether that would make a large difference here.

Any help would be appreciated, thank you!

yasenh / libtorch-yolov5
