eacharya opened this issue 9 years ago (status: Open)
Looking into this. One thing that seemed to help somewhat (on a shorter video) was to `return None` at https://github.com/zigvu/khajuri/blob/Development/postprocessing/task/CaffeResultPostProcess.py#L40 instead of returning the frame. Since the result queue is only drained at the end, we tend to collect all the frames there.
I have not tried this on a longer video. @eacharya, can you give it a try?
That will probably fix the issue, since `None` takes so little space to store. But it probably defeats the original intent of having the output of task N piped to the input of task N+1.
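To illustrate the trade-off, here is a minimal sketch (hypothetical names, not khajuri's actual task API) of a pool-based pipeline where every non-`None` return value accumulates until the final drain:

```python
import multiprocessing

def postprocess(frame):
    """Hypothetical stand-in for a task like CaffeResultPostProcess."""
    # ... work on the frame would happen here ...
    # Returning the frame keeps it alive until the results are drained;
    # returning None frees memory but gives task N+1 nothing to consume.
    return frame

def run_pipeline(frames):
    with multiprocessing.Pool(2) as pool:
        # All returned frames collect here until iteration completes,
        # so memory grows with video length when full frames are returned.
        return [r for r in pool.imap(postprocess, frames) if r is not None]

if __name__ == "__main__":
    print(len(run_pipeline(range(8))))  # -> 8
```

Returning `None` from `postprocess` keeps this list tiny, but then the downstream task receives nothing, which is the tension described above.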
Will test and post results here.
Even with `return None` at the end of the `CaffeResultPostProcess` call, I am getting a large and increasing memory footprint. The long video can be found here:
```
ubuntu@gpu2:/mnt/data/wc14itr/vdo-set04/roundOf16_issue98/wc14-ArgSwi
```
The config in that folder has both RabbitWriter and JSONWriter turned off, so we don't need to worry about `video_id`, `chia_version_id`, storage, etc. The current run hasn't crashed yet (memory is at 18 GB after 4 hours), but I suspect it will crash before the end of the video. (Will update if I see otherwise.)
With `return None`, the run above completed, but the memory leak still exists. I am guessing that with a video longer than 2 hours it will eventually get stuck again. Let me know if we want to test with a longer video - I can concatenate two videos and try if that helps in debugging.
Using ThreadWorker (https://github.com/zigvu/khajuri/blob/Development/infra/Worker.py#L6) seems to stop the memory growth - but it's obviously not what we want.
So the problem is most likely not in our task itself but, given the ThreadWorker result, somewhere in the process-based worker infrastructure.
Looking into it further.
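If the leak lives in how process workers buffer results, one possible direction (a sketch with hypothetical names, not the actual Worker.py interface) is to drain the downstream queue as results arrive instead of only at the end:

```python
import multiprocessing

def worker(in_q, out_q):
    # Hypothetical worker loop: process items until the None sentinel,
    # forwarding each result downstream immediately.
    for item in iter(in_q.get, None):
        out_q.put(item * 2)
    out_q.put(None)

def drain_incrementally(out_q, handle):
    # Consume results as they arrive so finished frames never pile up
    # in the queue waiting for one big drain at the end of the video.
    for result in iter(out_q.get, None):
        handle(result)
```

This keeps the task-N-to-task-N+1 piping intact while bounding how many finished results sit in memory at once.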
After 1.5 hrs of running, right before getting to swap, here is the state of the memory:
```
ubuntu@gpu2:~/khajuri$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda6       915G  706G  163G  82% /
udev             16G  4.0K   16G   1% /dev
tmpfs           3.2G  468K  3.2G   1% /run
none            5.0M     0  5.0M   0% /run/lock
none             16G  300K   16G   1% /run/shm
tmpfs            30G  5.0G   26G  17% /mnt/tmp
/dev/sda1       228M   52M  164M  25% /boot
ubuntu@gpu2:~/khajuri$ free -m
             total       used       free     shared    buffers     cached
Mem:         32141      31841        299          0         40      12034
-/+ buffers/cache:      19766      12374
Swap:         1951        199       1752
```
And htop shows that 20 GB is used in main memory.
So there is 32 - 20 - 5 = 7 GB of memory unaccounted for (total, minus the process footprint, minus the ~5 GB in the /mnt/tmp tmpfs). It has processed up to 46631.json, implying about 31 minutes of video processed.
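The 31-minute estimate is consistent with a 25 fps frame-extraction rate (an assumption on my part; adjust to whatever rate the config actually uses):

```python
frames_processed = 46631  # highest JSON index seen so far
fps = 25                  # assumed extraction rate, not confirmed from the config
minutes = frames_processed / fps / 60
print(f"{minutes:.1f} min")  # -> 31.1 min
```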
Branch: Latest Development
Issue: During post-processing there is a ~6 GB/hour memory leak. This becomes noticeable on longer videos after ~2 hours of operation, when swap usage starts slowing everything to a halt. Note that no crash occurs in any of the processes.
This issue can be reproduced even when both RabbitWriter and JSONWriter are disabled - that is, when no output is stored from post-processing. One easy way to reproduce the issue is to run the longer video located at:
```
ubuntu@gpu2:/mnt/data/wc14itr/vdo-set04/roundOf16_issue98/wc14-ArgSwi
```