zopefoundation / ZODB

Python object-oriented database
https://zodb-docs.readthedocs.io/
Other
682 stars 92 forks source link

`racetest` improvement #380

Closed d-maurer closed 1 year ago

d-maurer commented 1 year ago

This PR tries to facilitate the failure analysis for check_race_external_invalidate_vs_disconnect.

"https://github.com/zopefoundation/ZEO/pull/228" suffers occasionally from a failure of the test above. Unfortunately, the failure output is extremely difficult to analyze:

The PR protects the output of exception/traceback information from the concurrent threads with a lock and thereby prevents that the corresponding lines are mixed: exception output from one thread remains contiguous. The PR also checks explicitly that the test finishes within the expected time. If this fails, a speaking failure is generated. Formerly, a too slow execution was recognized by one of the threads not stopping. Finally, all threads are given a chance to stop gracefully in case of a problem -- formlerly the first thread which did not stop within the expected time cause the test to be terminated; the resulting server shutdown cause a huge amount of irrelevant ClientDisconnected exceptions.

navytux commented 1 year ago

Dieter, thanks a lot for this PR. My first quick impression is positive. I will try to read the code again afresh tonight thoroughly.

navytux commented 1 year ago

Hello Dieter. I have reviewed this PR by way of amending it via https://github.com/zopefoundation/ZODB/pull/380/commits/f2335127. Could you please have a look?

I've pushed this amendment here, but if that is not ok with you I will restore hereby branch state to what it was before and open another PR with my modifications.

d-maurer commented 1 year ago

Kirill Smelkov wrote at 2023-4-14 03:02 -0700:

... I've pushed this amendment here, but if that is not ok with you I will restore hereby branch state to what it was before and open another PR with my modifications.

You know that I do not mind when you improve my PRs in place.

I will check your modifications in the next few days and report back.

navytux commented 1 year ago

Thanks, Dieter.

P.S. there are two failures with "test did not finish within 60 seconds" on Windows: https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142450, https://github.com/zopefoundation/ZODB/actions/runs/4698624182/jobs/8331142166. Those look to be due to overload of the machines maybe. But I'm not sure 100%.

d-maurer commented 1 year ago

@navytux I (mostly) like your changes. Nevertheless, I made a few modifications - would you like to check them? We likely should add tests to verify that the new TestGroup behaves as expected in all situations.

navytux commented 1 year ago

Dieter, thanks. I'm ok with your amendments - thanks for making them and correcting things on my side.

add TestGroup tests; minor cleanup

I should hopefully be able to review new tests tonight, but I do not see them added to the git repository. Maybe you forgot to commit them?

d-maurer commented 1 year ago

Kirill Smelkov wrote at 2023-4-17 01:00 -0700:

... I should hopefully be able to review new tests tonight, but I do not see them added to the git repository. Maybe you forgot to commit them?

Indeed, I forgot the git add. Fixed now.

navytux commented 1 year ago

Thanks, Dieter.

navytux commented 1 year ago

A likely overload failure is there again on windows: https://github.com/zopefoundation/ZODB/actions/runs/4730554401/jobs/8394428237.

we need to either increase the wait time from 60 to something, or to see what's up with our windows runners.

d-maurer commented 1 year ago

Kirill Smelkov wrote at 2023-4-18 02:22 -0700:

A likely overload failure is there again on windows: https://github.com/zopefoundation/ZODB/actions/runs/4730554401/jobs/8394428237.

we need to either increase the wait time from 60 to something, or to see what's up with our windows runners.

If the only failure messages are "test did not end within ..." and maybe some few "thread did not finish", then the problem is only that we give the test insufficient time. Should we see a lot "thread did not finish" or exceptions, then the problem may be deeper.

I will look at the failures soon (tomorrow at the latest).

I assume that the occasional zodbmaster ZEO failures also are caused by insufficient time for the ...race...disconnect test.

navytux commented 1 year ago

Thanks, Dieter. As far as I see it was only "AssertionError: test did not finish within 60 seconds" this time.

d-maurer commented 1 year ago

Kirill Smelkov wrote at 2023-4-18 04:08 -0700:

Thanks, Dieter. As far as I see it was only "AssertionError: test did not finish within 60 seconds" this time.

I increased the allowed time from 60 t0 120 seconds. Tests both in this PR as well as the ZEO PR have passed with this value.

Thank you for your cooperation.

navytux commented 1 year ago

Thanks, Dieter as well. I appreciate that it is possible to do kind of pair programming we did here for good.