ybkscht / EfficientPose

Slow training using powerful GPU #38

Closed. enesdoruk closed this issue 2 years ago.

enesdoruk commented 2 years ago

Hi, I have a GTX 1080 graphics card and one epoch takes 5 minutes. When I use a Tesla V100 on Google Cloud, one epoch also takes 5 minutes. I can't understand this. How can I solve it, and what is the problem?

satpalsr commented 2 years ago

Hey @enesdoruk, I am also experiencing slow training. Any solutions, @ybkscht?

enesdoruk commented 2 years ago

The model is training on one V100 GPU with 1.5k examples, phi=3 and batch size=4. After 2 days it has only reached epoch 150. It is too slow. @ybkscht @satpalsr

ybkscht commented 2 years ago

Hi @enesdoruk ,

it sounds like the bottleneck in your case is not the GPU but instead loading, preprocessing and getting the data to your GPU fast enough (the generator part). You can try using the --multiprocessing argument, which starts multiple generator processes and should speed up your training if the generator is the bottleneck. But please note that, from my experience, using multiprocessing can cause problems, especially on Windows. You can also try setting the --workers argument higher, which starts multiple generator threads (in case multiprocessing is False). Because of the GIL the generator is not really parallelized using multiple threads, but as far as I know it can still speed up I/O (loading data from disk).
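For reference, a minimal, self-contained tf.keras sketch of what --workers and --multiprocessing typically map to under the hood. This is illustrative only and not the repository's train.py; the DummyGenerator, model and values are made up, and depending on your TensorFlow/Keras version these parameters belong to fit or fit_generator:

import numpy as np
from tensorflow import keras

# hypothetical stand-in for the EfficientPose generator (a keras.utils.Sequence)
class DummyGenerator(keras.utils.Sequence):
    def __len__(self):
        return 100

    def __getitem__(self, index):
        x = np.random.rand(4, 32).astype("float32")
        y = np.random.rand(4, 1).astype("float32")
        return x, y

model = keras.Sequential([keras.layers.Dense(1, input_shape=(32,))])
model.compile(optimizer="adam", loss="mse")

model.fit(
    DummyGenerator(),
    epochs=2,
    workers=4,                  # roughly what --workers controls: parallel batch preparation
    use_multiprocessing=False,  # roughly what --multiprocessing controls: processes instead of threads
    max_queue_size=10,          # batches buffered ahead of the GPU
)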

If your dataset is small enough to fit into your memory, you can also try loading it into memory to skip accessing the disk for each example, which is quite expensive.

So basically you should try to speed up the data loading and preprocessing part.

Sincerely, Yannick

enesdoruk commented 2 years ago

I am using 12 workers and I tried to use multiprocessing, but when multiprocessing is active, training does not start and there is no reaction at all (no error or warning). @ybkscht

enesdoruk commented 2 years ago

" If your dataset is small enough to fit into your memory you can also try this to skip accessing the disk for each example which is quiet expensive. " I don't understand this sentence. @ybkscht

enesdoruk commented 2 years ago

And I want to give one example: I used three different GPUs in separate trainings, one GTX 1080, one GTX 1660 and one Tesla V100. When I start with batch size 1, phi 0 and the same dataset, the training time is almost the same; there is only a very small difference. @ybkscht

ybkscht commented 2 years ago

The generator loads every image and annotation file from your disk when creating a new batch. This is a quite expensive operation, and it is possible that it is the bottleneck in your case. So instead of having to load every example from disk each time, you can try to load the dataset into your memory at the beginning. You can either try using a ramdisk or change the generator so that it loads the dataset into memory in the __init__ method. But as already mentioned, this only works if your dataset is small enough to fit into your memory.
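A lighter variant of the same idea (just a sketch, not what the repository currently does): cache each decoded image in RAM the first time it is read, so only the first epoch pays the disk cost. CachedLoader is a hypothetical helper name, and it still only helps if the whole dataset fits into memory:

import cv2

class CachedLoader:
    """Keeps decoded images in RAM after the first read (hypothetical helper)."""

    def __init__(self):
        self._cache = {}

    def load(self, path):
        # first access: read and decode from disk, then keep the result in memory
        if path not in self._cache:
            image = cv2.imread(path)
            self._cache[path] = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # return a copy so augmentations do not modify the cached original
        return self._cache[path].copy()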

enesdoruk commented 2 years ago

What do you think the problem with multiprocessing could be? I activated multiprocessing and waited 30 minutes, and there was no reaction at all. @ybkscht

ybkscht commented 2 years ago

I don't really know what the problem is here, but I often had, and still have, problems trying to use multiprocessing with TensorFlow, especially under Windows.

enesdoruk commented 2 years ago

Finally, which changes should I make to the generator in this project? Which file and which lines? Can you explain the changes needed for the generator loading? @ybkscht And can I set multiprocessing to True and workers greater than 0 at the same time?

ybkscht commented 2 years ago

In generators/linemod.py (and occlusion.py) the paths to all images and masks of your dataset are currently stored in lists. When the generator needs to generate a batch (the __getitem__ method of the generator base class in generators/common.py), the needed images and masks are loaded from disk using the paths stored in those lists (the load_image and load_mask methods in generators/linemod.py).

So you can try to load all images and masks in the __init__ method of linemod.py, store them in lists, and change the load_image and load_mask methods so that they only return the images from these lists instead of loading them from disk.

For example, add this in the __init__ method of LineModGenerator in linemod.py after shuffling the dataset (line 123):

self.all_images = []
for path_to_image in self.image_paths:
    #from load_image method
    image = cv2.imread(path_to_image)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    self.all_images.append(image)

self.all_masks = []
for path_to_mask in self.mask_paths:
    #from load_mask method
    mask = cv2.imread(path_to_mask)
    self.all_masks.append(mask)

And then change the load_image and load_mask methods of LineModGenerator in linemod.py:

# note: load_image and load_mask below use copy.deepcopy, so linemod.py needs `import copy` at the top if it is not already imported there
def load_image(self, image_index):
    """ Load an image at the image_index.
    """
    return copy.deepcopy(self.all_images[image_index])

def load_mask(self, image_index):
    """ Load mask at the image_index.
    """
    return copy.deepcopy(self.all_masks[image_index])

Please note that this is just some example code I wrote down quickly and didn't test if it works, so maybe you have to fix some bugs. But it should be a good starting point and give you the idea.
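As a rough sanity check for "small enough to fit into your memory", here is a quick back-of-the-envelope estimate. The 1500 examples and 640x480 image size are assumptions (LineMOD-style images); adjust them to your own dataset:

# rough RAM estimate for caching the whole dataset (assumed numbers, adjust to your data)
num_examples = 1500
bytes_per_image = 640 * 480 * 3                 # decoded uint8 RGB image, ~0.9 MB
bytes_per_mask = 640 * 480 * 3                  # mask loaded the same way as the image above
total_gb = num_examples * (bytes_per_image + bytes_per_mask) / 1024**3
print(f"~{total_gb:.1f} GB")                    # roughly 2.6 GB for these numbers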

ybkscht commented 2 years ago

If multiprocessing doesn't work you should set it to False, but don't use it with workers = 0, because then everything runs sequentially and is probably relatively slow.

enesdoruk commented 2 years ago

Thanks @ybkscht, it works. There is no huge difference, but it is a bit faster. I will try the V100 GPU next. I was using multiprocessing with workers = 0; now that I have deactivated multiprocessing, I set workers > 0.