xtuml / munin

Apache License 2.0

AEOSVDC memory usage and container failures #157

Closed FreddieMatherSmartDCSIT closed 6 months ago

FreddieMatherSmartDCSIT commented 7 months ago

On longer endurance test runs of the Protocol Verifier (PV), the PV maxes out the CPU and falls behind on processing events. When this happens, the PV consumes more and more memory holding the backed-up events, and usage keeps growing as further events arrive. Once all the memory of the host box is consumed, the AEOSVDC containers start to fail one by one (each time the memory of the box fills up) and never restart.
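The backlog mechanism described above can be illustrated with a toy model. This is only a sketch: `simulate_backlog`, the rates, and the per-event size are hypothetical values chosen for illustration, not measurements from the PV or from this test run.

```python
def simulate_backlog(arrival_rate, service_rate, seconds, bytes_per_event):
    """Toy model of one consumer: return (backlog_events, backlog_bytes) per second.

    All parameters are hypothetical; none are taken from the actual test run.
    """
    backlog = 0
    history = []
    for _ in range(seconds):
        backlog += arrival_rate                # events received this second
        backlog -= min(backlog, service_rate)  # events processed this second
        history.append((backlog, backlog * bytes_per_event))
    return history


# Consumer keeps up: the backlog stays at zero.
keeping_up = simulate_backlog(arrival_rate=100, service_rate=120,
                              seconds=60, bytes_per_event=1024)

# Consumer is CPU-bound and slower than the senders: the backlog grows every second.
falling_behind = simulate_backlog(arrival_rate=100, service_rate=80,
                                  seconds=60, bytes_per_event=1024)

print(keeping_up[-1])
print(falling_behind[-1])
```

In this model, memory held by the backlog grows linearly for as long as the arrival rate exceeds the processing rate, so with no bound on the queue the only limit is the memory of the host.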

Evidence

The 24 hour run had the following container numbers and events sent/s

The cumulative events processed diverges from the cumulative events sent approximately 8 hours into the test.

image
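A divergence point like this can also be located programmatically from the two cumulative series. The following is a minimal sketch, assuming both series are sampled at the same instants; `first_divergence`, the `tolerance` parameter, and the sample numbers are hypothetical and do not come from the test data.

```python
from typing import Optional, Sequence


def first_divergence(sent: Sequence[int],
                     processed: Sequence[int],
                     tolerance: int) -> Optional[int]:
    """Return the index of the first sample where the lag (sent - processed)
    exceeds `tolerance`, or None if it never does."""
    for i, (s, p) in enumerate(zip(sent, processed)):
        if s - p > tolerance:
            return i
    return None


# Hypothetical cumulative counts, one sample per interval.
sent = [0, 100, 200, 300, 400, 500]
processed = [0, 100, 200, 290, 350, 380]

idx = first_divergence(sent, processed, tolerance=20)
print(idx)
```

A small non-zero tolerance avoids flagging the normal in-flight lag between sending an event and processing it.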

The response times start to increase non-linearly at this point, reaching a peak after which no more events are processed.

image

After approximately 8 hours, the memory usage of all AEOSVDC containers starts to increase (claiming memory from Kafka) until the max memory of the box is hit. The AEOSVDC containers then fail progressively, one by one, as they consume all the memory of the box. Eventually only one of the 4 AEOSVDC containers is left, and its memory usage continues to climb towards the max memory of the box.

image Container memory usage

image image image image CPU usage showing containers failing

cortlandstarrett commented 7 months ago

This is likely due to the memory leak identified on 26 October.

jt765487 commented 7 months ago

@cortlandstarrett do you know when a version with the fix can be provided to @FreddieMatherSmartDCSIT?

cortlandstarrett commented 7 months ago

@jt765487, we could provide it today if desired. I can give @FreddieMatherSmartDCSIT the option of running now or waiting a day or two until we have run our own 24-hour test.

FreddieMatherSmartDCSIT commented 7 months ago

@cortlandstarrett if you can provide us with a new version for defect retests, that would be great. We would need a bit of time to build the PV and get prepped for the tests, so that day or two would be useful. It's unlikely we would start the deployment process for the PV until tomorrow morning, as it is near the end of the day here, but we would likely be ready to start retests by tomorrow afternoon or evening, all being well.

(@jt765487 for your info)

cortlandstarrett commented 6 months ago

fixed in v1.1.3 (StoredJobId growth)