xtuml / munin


AEOSVDC increasing CPU #156

Closed FreddieMatherSmartDCSIT closed 10 months ago

FreddieMatherSmartDCSIT commented 11 months ago

Behaviour of the CPU measurements has been noted throughout longer endurance tests. The CPU usage of AEOSVDC containers shows a steady, linear increase even when AEOSVDC appears able to process the sent events at the required rate. Event response times (time from sending to being fully processed by the PV) show the same linear trend. This has been observed in runs with both 4 AEOSVDC containers (8 hour test and 24 hour test) and 8 AEOSVDC containers (24 hour test). When the CPU maxes out for all containers, the response times show a highly non-linear increase.

Evidence

8 hour run (4 AEOSVDC)

The 8 hour run had the following container numbers and events sent/s

CPU linearly increasing as below and maxing out

image

Response times (aggregated mean values in one-second bins) increasing linearly until the CPU maxes out, then starting to increase non-linearly just before the end of the test

image

24 hour run (4 AEOSVDC)

The 24 hour run had the following container numbers and events sent/s

CPU linearly increasing as below and maxing out after approximately 8 hours

image

Response times (aggregated mean values in one-second bins) increasing linearly until the CPU maxes out, then increasing non-linearly until no more events are processed

image

24 hour run (8 AEOSVDC)

The 24 hour run had the following container numbers and events sent/s

CPU increasing linearly, but at a slower rate than with 4 AEOSVDC containers. It doesn't max out, but this shows the behaviour still exists. On the observed trend, the CPU would likely max out over another 1-2 days.

image

Response times (aggregated mean values in one-second bins) increasing linearly over the test

cortlandstarrett commented 11 months ago

We identified a memory leak. We were not deleting instances of PVjob and InstrumentationEvents. This was identified on 26 October and will be fixed in the next tag.

cortlandstarrett commented 11 months ago
➜  git/munin/deploy git:(AsyncLogger2) ✗ docker stats --no-stream
CONTAINER ID   NAME                       CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
6bed14ffb634   deploy-aeo_svdc_proc_2-1   4.43%     25.38MiB / 7.758GiB   0.32%     520MB / 564MB     0B / 0B           9
f5336bb01703   deploy-async_logger-1      22.43%    11.36MiB / 7.758GiB   0.14%     2.16GB / 333MB    229kB / 0B        6
2a64bb3187fd   deploy-istore_proc-1       0.23%     5.648MiB / 7.758GiB   0.07%     1.71MB / 1.5MB    0B / 0B           9
3f5a19ccef84   deploy-aeo_svdc_proc_4-1   6.68%     28.03MiB / 7.758GiB   0.35%     520MB / 568MB     0B / 0B           9
400a71701769   deploy-aer_proc-1          11.91%    235.4MiB / 7.758GiB   2.96%     732MB / 1.61GB    770kB / 0B        9
f97561bf5d6f   deploy-aeo_svdc_proc_3-1   4.38%     23.71MiB / 7.758GiB   0.30%     520MB / 567MB     328kB / 0B        9
420ccb5f718c   deploy-aeo_svdc_proc_1-1   5.54%     23.5MiB / 7.758GiB    0.30%     520MB / 568MB     147kB / 0B        9
9752128d2154   deploy-kafka-1             10.67%    1.281GiB / 7.758GiB   16.51%    4.91GB / 4.99GB   1.54MB / 4.49GB   83
c1e43584da1e   deploy-zookeeper-1         0.10%     190.9MiB / 7.758GiB   2.40%     386kB / 466kB     2.19MB / 2.27MB   27
ac25324107aa   munin-conan-server-1       0.01%     27.42MiB / 7.758GiB   0.35%     303kB / 234kB     451kB / 389kB     1
cortlandstarrett commented 11 months ago
memory
cortlandstarrett commented 11 months ago

A 2 million event run began at 15:46 and leveled out pretty quickly. This was running at 800 events/second.

FreddieMatherSmartDCSIT commented 11 months ago

We see that kind of memory profile in the shorter time frames (note we send a linearly increasing profile of events at the start, which is why you see the slow ramp up)

image

but when the CPU maxes out, that's when events back up and the memory starts to spike

image

cortlandstarrett commented 11 months ago

Yes, understood. The memory leak we found caused (and still causes) exactly this behaviour. We were not deleting an element in a list that was repeatedly searched. As the list got longer, the CPU time to search it got longer. We did not run out of memory before we ran out of CPU cycles, which caused multiple jobs to be in process simultaneously until we folded.
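
For illustration only, a minimal C++ sketch of that failure mode under assumptions (the names Job and Processor are hypothetical, not the actual generated code): a per-event linear search over a container that is never pruned makes each event cost O(n) in the number of retained entries, which matches the linear CPU growth above; deleting finished entries keeps the per-event cost flat.

```cpp
#include <algorithm>
#include <list>
#include <string>

// Hypothetical stand-ins for the instances that were never deleted.
struct Job {
    std::string id;
    bool complete;
};

class Processor {
    std::list<Job> jobs;  // grows without bound if finished jobs are kept
public:
    void on_event(const std::string& job_id) {
        // Each event triggers a linear scan: O(n) in the number of retained jobs,
        // so per-event CPU cost climbs steadily as the list grows.
        auto it = std::find_if(jobs.begin(), jobs.end(),
                               [&](const Job& j) { return j.id == job_id; });
        if (it == jobs.end()) {
            jobs.push_back({job_id, false});
        } else {
            // The fix described above: delete the finished instance so the
            // searched list stays bounded and per-event CPU stays flat.
            jobs.erase(it);
        }
    }
};
```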

I will be running a 24 hour test soon to discover if there are any more resource issues hiding.

FreddieMatherSmartDCSIT commented 11 months ago

Ah ok - sorry, I was just making sure we were talking about the same thing.

cortlandstarrett commented 10 months ago

We found another source of memory expansion, in the StoredJobId class. These instances were growing faster than anticipated and needed to be pruned more often than every 24 hours. This has been fixed in v1.1.3.
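
For illustration only (a sketch under assumptions: StoredJobId is the class named above, but the cache structure, retention window, and pruning trigger here are invented, not the v1.1.3 code): pruning on every insert or on a short timer, rather than a fixed 24-hour sweep, keeps the stored ids bounded between sweeps.

```cpp
#include <chrono>
#include <string>
#include <unordered_map>

class StoredJobIdCache {
    using Clock = std::chrono::steady_clock;
    std::unordered_map<std::string, Clock::time_point> ids;  // job id -> time first seen
    std::chrono::minutes retention{60};  // assumed retention window, not the real value
public:
    void remember(const std::string& id) {
        ids.emplace(id, Clock::now());
        prune();  // prune eagerly instead of waiting for a 24-hour sweep
    }

    // Drop any id older than the retention window so the map cannot grow unbounded.
    void prune() {
        const auto now = Clock::now();
        for (auto it = ids.begin(); it != ids.end();) {
            if (now - it->second > retention) {
                it = ids.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```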