zigvu / khajuri

Video Pipeline

Memory leak in process_video.py #107

Open eacharya opened 9 years ago

eacharya commented 9 years ago

Branch: Latest Development

Issue: During post-processing there is a ~6GB/hour memory leak. This is noticeable in longer videos after ~2 hours of operation when swap usage starts slowing everything to a halt. Note that no crash occurs in any of the processes.

This issue can be reproduced even when both RabbitWriter and JSONWriter are disabled, that is, when no output is stored from post-processing. One way to reproduce the issue easily is to run a longer video located at:

ubuntu@gpu2:/mnt/data/wc14itr/vdo-set04/roundOf16_issue98/wc14-ArgSwi$

regmiz commented 9 years ago

Looking into this. One item that seemed to help somewhat (in a shorter video) was to `return None` at https://github.com/zigvu/khajuri/blob/Development/postprocessing/task/CaffeResultPostProcess.py#L40 instead of returning the frame. Since results are only drained at the end, we end up collecting all the frames there.

I have not tried on a longer video. @eacharya can you give it a try?
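The accumulation described above can be sketched as follows. This is a hedged illustration only; `caffe_result_post_process` and `run` are hypothetical stand-ins, not khajuri's actual API:

```python
# Sketch: the pipeline keeps every task's return value until a final drain,
# so returning the full frame from the last post-process step pins all
# frames in memory for the whole run.
def caffe_result_post_process(frame, keep_frame):
    # ... real post-processing would happen here ...
    return frame if keep_frame else None

def run(frames, keep_frame):
    drained = []  # results are only collected ("drained") at the end
    for frame in frames:
        drained.append(caffe_result_post_process(frame, keep_frame))
    return drained

frames = [{"id": i, "pixels": bytearray(1_000_000)} for i in range(4)]
leaky = run(frames, keep_frame=True)    # drain holds ~4MB of pixel buffers
fixed = run(frames, keep_frame=False)   # drain holds only 4 None entries
```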

eacharya commented 9 years ago

That will probably fix the issue, since None takes so little space to store, but it also defeats the original intent of having the output of task N piped to the input of task N+1.

Will test and post results here.
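One possible middle ground, assuming downstream tasks only need the detection results and not the pixel data, is to forward a lightweight result object instead of the whole frame. The names below are illustrative, not khajuri's API:

```python
# Sketch: keep the task-N -> task-N+1 pipe intact, but strip the heavy
# payload so the end-of-run drain only retains small result dicts.
def post_process(frame):
    scores = {"frame_id": frame["frame_id"], "score": 0.9}  # real work here
    # drop the large pixel buffer; forward only the small result dict
    return scores

frame = {"frame_id": 7, "pixels": bytearray(1_000_000)}
out = post_process(frame)
print(sorted(out))  # ['frame_id', 'score']
```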

eacharya commented 9 years ago

Even with `return None` at the end of the CaffeResultPostProcess call, I am still seeing a large and increasing memory footprint. The long video can be found here:

ubuntu@gpu2:/mnt/data/wc14itr/vdo-set04/roundOf16_issue98/wc14-ArgSwi

The config in that folder has both RabbitWriter and JSONWriter turned off, so we don't need to worry about video_id, chia_version_id, storage, etc. The current run hasn't crashed (memory usage is 18GB after 4 hours), but I suspect it will crash before the end of the video. (Will update if I see otherwise.)

eacharya commented 9 years ago

With `return None`, the run above completed, but the memory leak still exists. I am guessing that with a video longer than 2 hours it will eventually get stuck again. Let me know if we want to test with a longer video; I can concatenate two videos if that helps in debugging.

regmiz commented 9 years ago

Using ThreadWorker (https://github.com/zigvu/khajuri/blob/Development/infra/Worker.py#L6) seems to stop the memory growth, but it's obviously not what we want.

So the problem is most likely not in our tasks, but in either:

  1. Pipeline or ProcessWorker, or
  2. multiprocessing itself, in the way we use the package.

Looking into it further.
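One way the multiprocessing usage itself can grow memory: every object crossing a `multiprocessing.Queue` is pickled into an internal buffer, and if results are only drained at the end, those copies accumulate (a thread-based worker shares the objects instead, which would explain the difference observed above). A minimal sketch of one mitigation, a bounded queue for backpressure; all names are illustrative, not khajuri's:

```python
# Sketch: a bounded mp.Queue blocks the producer instead of letting the
# queue (and memory) grow without bound while the consumer lags.
import multiprocessing as mp

def producer(q, n):
    for _ in range(n):
        q.put(bytearray(1024))  # stands in for a decoded frame
    q.put(None)  # sentinel: no more frames

def main():
    q = mp.Queue(maxsize=8)  # caps the number of in-flight frames
    p = mp.Process(target=producer, args=(q, 100))
    p.start()
    consumed = 0
    while True:
        item = q.get()
        if item is None:
            break
        consumed += 1
    p.join()
    return consumed

if __name__ == "__main__":
    print(main())  # prints 100
```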

eacharya commented 9 years ago

After 1.5 hrs of running, right before swap kicks in, here is the state of the memory:

    ubuntu@gpu2:~/khajuri$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda6       915G  706G  163G  82% /
    udev             16G  4.0K   16G   1% /dev
    tmpfs           3.2G  468K  3.2G   1% /run
    none            5.0M     0  5.0M   0% /run/lock
    none             16G  300K   16G   1% /run/shm
    tmpfs            30G  5.0G   26G  17% /mnt/tmp
    /dev/sda1       228M   52M  164M  25% /boot
    ubuntu@gpu2:~/khajuri$ free -m
                 total       used       free     shared    buffers     cached
    Mem:         32141      31841        299          0         40      12034
    -/+ buffers/cache:      19766      12374
    Swap:         1951        199       1752

htop shows that 20GB is used in main memory.

So there is 32 - 20 - 5 = 7GB of memory unaccounted for. It has processed up to 46631.json, implying about 31 min of video processed.
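The numbers above come from system-wide tools; to attribute growth to a specific Python call site inside a worker, the stdlib `tracemalloc` module can diff two snapshots. A minimal sketch (note it will not see allocations made inside C extensions such as Caffe; the `retained` list is a stand-in for whatever structure is holding frames):

```python
# Sketch: snapshot heap allocations before and after the suspect code path,
# then print the call sites with the largest growth.
import tracemalloc

tracemalloc.start()
before = tracemalloc.take_snapshot()

retained = []  # stands in for the structure suspected of holding frames
for _ in range(1000):
    retained.append(bytearray(10_000))  # ~10MB retained, like queued frames

after = tracemalloc.take_snapshot()
for stat in after.compare_to(before, "lineno")[:3]:
    print(stat)  # the bytearray line above should dominate the diff
```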