Closed: FreddieMatherSmartDCSIT closed this issue 10 months ago
We identified a memory leak: we were not deleting instances of the PVjob and InstrumentationEvents classes. Identified on 26 October. This will be fixed in the next tag.
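For illustration only, here is a minimal sketch of the kind of pattern that produces this leak; the class shapes, ownership model, and the JobTracker container below are assumptions, not the actual implementation. Heap-allocated jobs and their events stay reachable from a long-lived container, so nothing is freed unless finished entries are explicitly erased.

```cpp
#include <memory>
#include <unordered_map>
#include <vector>

// Hypothetical stand-ins for the real PVjob / InstrumentationEvents types.
struct InstrumentationEvent { /* ... */ };
struct PVJob {
    std::vector<std::unique_ptr<InstrumentationEvent>> events;
    bool finished = false;
};

class JobTracker {
public:
    // Jobs are created on demand and owned by the tracker.
    PVJob& jobFor(int jobId) {
        auto& slot = jobs_[jobId];
        if (!slot) slot = std::make_unique<PVJob>();
        return *slot;
    }

    // The missing cleanup: without calling this, finished jobs (and every
    // InstrumentationEvent they own) stay in jobs_ until the process exits.
    void releaseFinished() {
        for (auto it = jobs_.begin(); it != jobs_.end();) {
            if (it->second->finished)
                it = jobs_.erase(it);  // frees the PVJob and its events
            else
                ++it;
        }
    }

private:
    std::unordered_map<int, std::unique_ptr<PVJob>> jobs_;
};
```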
➜ git/munin/deploy git:(AsyncLogger2) ✗ docker stats --no-stream
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
6bed14ffb634 deploy-aeo_svdc_proc_2-1 4.43% 25.38MiB / 7.758GiB 0.32% 520MB / 564MB 0B / 0B 9
f5336bb01703 deploy-async_logger-1 22.43% 11.36MiB / 7.758GiB 0.14% 2.16GB / 333MB 229kB / 0B 6
2a64bb3187fd deploy-istore_proc-1 0.23% 5.648MiB / 7.758GiB 0.07% 1.71MB / 1.5MB 0B / 0B 9
3f5a19ccef84 deploy-aeo_svdc_proc_4-1 6.68% 28.03MiB / 7.758GiB 0.35% 520MB / 568MB 0B / 0B 9
400a71701769 deploy-aer_proc-1 11.91% 235.4MiB / 7.758GiB 2.96% 732MB / 1.61GB 770kB / 0B 9
f97561bf5d6f deploy-aeo_svdc_proc_3-1 4.38% 23.71MiB / 7.758GiB 0.30% 520MB / 567MB 328kB / 0B 9
420ccb5f718c deploy-aeo_svdc_proc_1-1 5.54% 23.5MiB / 7.758GiB 0.30% 520MB / 568MB 147kB / 0B 9
9752128d2154 deploy-kafka-1 10.67% 1.281GiB / 7.758GiB 16.51% 4.91GB / 4.99GB 1.54MB / 4.49GB 83
c1e43584da1e deploy-zookeeper-1 0.10% 190.9MiB / 7.758GiB 2.40% 386kB / 466kB 2.19MB / 2.27MB 27
ac25324107aa munin-conan-server-1 0.01% 27.42MiB / 7.758GiB 0.35% 303kB / 234kB 451kB / 389kB 1
A 2 million event run began at 15:46 and levelled out pretty quickly. This was running at 800 events/second.
We see that kind of memory profile in the shorter time frames (note that we send a linearly increasing profile of events at the start, which is why you see the slow ramp-up), but when the CPU maxes out, that's when events back up and the memory starts to spike.
Yes, understood. The memory leak we found caused (causes) exactly this behaviour. We were not deleting an element in a list that was repeatedly searched, so as the list got longer, the CPU time to search it got longer. We did not run out of memory before we ran out of CPU cycles, which caused multiple jobs to be in process simultaneously until we folded.
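To make the CPU symptom concrete, here is a hedged sketch of that pattern (the container type and names are assumed, not taken from the code): completed records are never erased from a list that is searched linearly for every event, so each lookup scans everything ever added and the per-event cost grows until processing falls behind.

```cpp
#include <algorithm>
#include <list>
#include <string>

// Illustrative only: stand-in for whatever the searched list actually holds.
struct JobRecord {
    std::string id;
    bool complete = false;
};

std::list<JobRecord> activeJobs;

// Called for every incoming event: O(n) in the size of activeJobs.
JobRecord* findJob(const std::string& id) {
    auto it = std::find_if(activeJobs.begin(), activeJobs.end(),
                           [&](const JobRecord& r) { return r.id == id; });
    return it != activeJobs.end() ? &*it : nullptr;
}

// The missing step that caused the linear CPU growth: without removing
// completed records, every findJob() call walks all jobs ever seen.
void removeCompletedJobs() {
    activeJobs.remove_if([](const JobRecord& r) { return r.complete; });
}
```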
I will be running a 24 hour test soon to discover if there are any more resource issues hiding.
Ah ok - sorry, I was just making sure we were talking about the same thing.
We identified another source of memory growth, this time in the StoredJobId class. These were growing faster than anticipated and needed to be pruned more often than every 24 hours. This has been fixed in v1.1.3.
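As a rough illustration of that kind of fix (the interval, threshold, and names here are assumptions rather than the actual v1.1.3 change), stored job IDs can be timestamped and pruned on a much shorter cadence, or whenever the store exceeds a size cap:

```cpp
#include <chrono>
#include <unordered_map>

using Clock = std::chrono::steady_clock;

// Hypothetical store of job IDs with the time each was last seen.
class StoredJobIds {
public:
    void remember(long jobId) { lastSeen_[jobId] = Clock::now(); }

    // Prune anything older than maxAge; call this far more often than
    // every 24 hours (e.g. every few minutes, or when size() exceeds a cap).
    void prune(std::chrono::seconds maxAge) {
        const auto cutoff = Clock::now() - maxAge;
        for (auto it = lastSeen_.begin(); it != lastSeen_.end();) {
            if (it->second < cutoff)
                it = lastSeen_.erase(it);
            else
                ++it;
        }
    }

    std::size_t size() const { return lastSeen_.size(); }

private:
    std::unordered_map<long, Clock::time_point> lastSeen_;
};
```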
The behaviour of the CPU measurements has been noted throughout the longer endurance runs. The CPU usage of the AEOSVDC containers increases with a steady linear trend, even when AEOSVDC appears able to process the sent events at the required rate. Event response times (time from sending to fully processed by the PV) also show this linear trend. This has been observed in runs with both 4 AEOSVDC containers (8 hour test and 24 hour test) and 8 AEOSVDC containers (24 hour test). When the CPU maxes out for all containers, the response times show a highly non-linear increase.
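For reference, the response-time figures below aggregate raw timings as mean values in one-second bins; a minimal sketch of that aggregation (the data layout is assumed) looks like:

```cpp
#include <cstdint>
#include <map>
#include <vector>

struct EventTiming {
    double sentAt;       // seconds since start of test
    double processedAt;  // seconds since start of test (fully processed by the PV)
};

// Mean response time per one-second bin, keyed by the bin's start second.
std::map<int64_t, double> meanResponseTimePerSecond(const std::vector<EventTiming>& timings) {
    std::map<int64_t, double> sum;
    std::map<int64_t, int64_t> count;
    for (const auto& t : timings) {
        const int64_t bin = static_cast<int64_t>(t.sentAt);  // 1 s bins
        sum[bin] += t.processedAt - t.sentAt;
        count[bin] += 1;
    }
    std::map<int64_t, double> mean;
    for (const auto& [bin, s] : sum) mean[bin] = s / count[bin];
    return mean;
}
```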
Evidence
8 hour run (4 AEOSVDC)
The 8 hour run had the following container numbers and events sent/s
CPU linearly increasing as below and maxing out
Response times (aggregated mean values in one second bins) linearly increasing until the CPU maxes out, then starting to increase non-linearly just before the end of the test
24 hour (4 AEOSVDC)
The 24 hour run had the following container numbers and events sent/s
CPU linearly increasing as below and maxing out after approximately 8 hours
Response times (aggregated mean values in one second bins) linearly increasing until the CPU maxes out, then increasing non-linearly until no more events are processed
24 hour (8 AEOSVDC)
The 24 hour run had the following container numbers and events sent/s
CPU linearly increasing, at a slower rate than with 4 AEOSVDC containers. It doesn't max out, but the behaviour clearly still exists; the expected trend would likely max out the CPU over another 1-2 days
Response times (aggregated mean values in one second bins) linearly increasing over the test