xtuml / munin

Apache License 2.0

Events are still sent to failed AEOSVDC containers #159

Closed: FreddieMatherSmartDCSIT closed this issue 2 months ago

FreddieMatherSmartDCSIT commented 7 months ago

During a 24-hour performance test it was observed that, after 2 AEOSVDC containers failed, the rate at which events were processed dropped dramatically. Response times of the events that were processed remained in line with those from before the failure, and the CPU of the remaining containers was never maxed out at any point in the test.

It was found that the discrepancy between the number of events processed by the PV and the number sent by the end of the test was approximately equal to the number of events that would have been processed by the failed containers. This suggests that events were still being routed to the failed containers and were expected to be processed by them, but because those containers were no longer active the events could never be processed.
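The issue doesn't show the actual routing mechanism, but the symptom matches a static assignment of events to a fixed set of containers. A minimal sketch of that failure mode (Python, with illustrative names and an assumed modulo-based partitioning, not munin's actual code):

```python
# Minimal sketch of the suspected failure mode: events are statically
# partitioned across a fixed set of containers, so events assigned to a
# failed container are simply never processed. All names and the
# modulo-based routing are illustrative assumptions.

NUM_CONTAINERS = 8
FAILED = {3, 7}  # hypothetical indices of the two failed containers

def route(event_id: int) -> int:
    """Statically assign an event to a container slot."""
    return event_id % NUM_CONTAINERS

sent = 100_000
processed = sum(1 for e in range(sent) if route(e) not in FAILED)
print(f"processed {processed} of {sent} ({processed / sent:.0%})")
# With 2 of 8 containers down, only ~6/8 = 75% of events get processed;
# the rest are assigned to containers that will never consume them.
```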

Evidence

The 24-hour run had the following container counts and events sent per second:

Events processed

The graph below shows cumulative events sent vs. processed: after approximately 17 hours, when the containers fail, the cumulative processed events diverge from the cumulative events sent.

[image: cumulative events sent vs. processed]

Response times (time processed minus time sent) stay the same even after the containers fail:

[image: response times before and after the container failures]

The total number of events sent to the PV was 43,499,973, and the number of events that failed to be processed was 3,279,078. If approximately 7 hours of the test remained when the containers failed, then, assuming the overall rate of 500 events/s was sustained, the number of events that would have been handled by the two failed containers out of the 8 is approximately:

7 * 60 * 60 * 500 * 2/8 = 3,150,000

The estimate of 3,150,000 is close to the observed 3,279,078, which indicates that events were likely still being sent to the failed containers.
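The same estimate as a quick runnable check (Python used purely for the arithmetic; the figures are taken from the test above):

```python
# Back-of-the-envelope check of the shortfall estimate.
expected_loss = 7 * 60 * 60 * 500 * 2 / 8  # 7 h * 500 events/s * 2 of 8 containers
observed_loss = 3_279_078                  # events sent to the PV but never processed

print(expected_loss)                       # 3150000.0
print(observed_loss / expected_loss)       # ~1.04, i.e. within ~4% of the estimate
```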

CPU

The CPU graphs below show that after the two AEOSVDC containers failed, the CPU of the surviving containers did not jump to compensate for the reduced container count. This suggests events were not being redistributed to the remaining containers.

[image: all containers' CPU]

[images: failed containers' CPU]

cortlandstarrett commented 7 months ago

This is operating as designed. Until we have dynamic scaling, we will not recover from a failed process.

FreddieMatherSmartDCSIT commented 7 months ago

Ok thanks for the info!

cortlandstarrett commented 2 months ago

This is resolved with the advent of Job Management.