Closed navytux closed 3 years ago
/cc @jimfulton
Kirill Smelkov wrote at 2021-4-1 21:59 -0700:
( moving out from https://github.com/zopefoundation/ZEO/issues/155#issuecomment-811984478 since this seems to be a different problem )
Something really wrong is going on with ZEO5.
I tried to reproduce your observation but failed:
ZEO==5.2.2; ZODB==5.6.0; CPython==3.6.9
I used your zopenrace.py
with the following changes:
...
from ZEO import server, DB
...
def main():
address, stop = server()
db = DB(address)
...
N = 10000
t1 = go(T1, N)
t2 = go(T2, N)
t1.join()
t2.join()
stop()
Apparently, the problem depends on further details.
@d-maurer, thanks for feedback.
I checked a bit more from scratch and, contrary to the problem from https://github.com/zopefoundation/ZEO/issues/155 (which fails on both py2 and py3), the test program in this issue fails on CPython2, but does not fail on CPython3. The difference might be thus somehow related to trollius.
To verify I used zoperace-zeo.py which is zopenrace.py
from https://github.com/zopefoundation/ZODB/issues/290 plus the following change that you suggested:
--- a/zopenrace.py.kirr0
+++ b/zopenrace.py
@@ -87,8 +87,7 @@
"""
from __future__ import print_function
-from ZODB import DB
-from ZODB.MappingStorage import MappingStorage
+from ZEO import server, DB
import transaction
from persistent import Persistent
@@ -110,8 +109,8 @@ def __init__(self, i):
def main():
- zstor = MappingStorage()
- db = DB(zstor)
+ address, stop = server()
+ db = DB(address)
# init initializes the database with two integer objects - obj1/obj2 that are set to 0.
@@ -194,6 +193,7 @@ def t2():
t2 = go(T2, N)
t1.join()
t2.join()
+ stop()
if not all(okv):
raise AssertionError('FAIL')
For me it fails with CPython 2.7.18 on two different machines. For completeness I attach full log of one such verification.
Please note, that I hit this ZEO bug only accidentally, while trying to find a place for my ZODB test programs in wendelin.core test suite, so that those test programs are run wrt all ZODB versions we care about (ZODB4, ZODB5 and ZODB4 with MVCC backport from ZODB5 (see https://lab.nexedi.com/nexedi/ZODB/commit/8e7eab33?expanded=1), and wrt all storages (FileStorage, ZEO and NEO) as explained in this patch.
It is possible that the bug is indeed in trollius. It is also possible that the bug is in ZEO5 and only being more likely to hit with trollius. For now I did not looked into ZEO5 to verify it for concurrency-correctness at all.
Thanks again for feedback, Kirill
Kirill Smelkov wrote at 2021-4-5 01:50 -0700: @.***, thanks for feedback.
I checked a bit more from scratch and, contrary to the problem from https://github.com/zopefoundation/ZEO/issues/155 (which fails on both py2 and py3), the test program in this issue fails on CPython2, but does not fail on CPython3. The difference might be thus somehow related to trollius.
I could reproduce the problem here with Python 2 -- investigating.
Dieter Maurer wrote at 2021-4-6 07:22 +0200:
Kirill Smelkov wrote at 2021-4-5 01:50 -0700: @.***, thanks for feedback.
I checked a bit more from scratch and, contrary to the problem from https://github.com/zopefoundation/ZEO/issues/155 (which fails on both py2 and py3), the test program in this issue fails on CPython2, but does not fail on CPython3. The difference might be thus somehow related to trollius.
I could reproduce the problem here with Python 2 -- investigating.
The problem is obviously associated with a race condition: the failure happens after different rounds, rarely not all.
To understand what happens, I have activated logging
(address, stop = server(log="/tmp/zeo.log")
)
and instrumented some methods in ZEO.ClientStorage
and
ZODB.mvccadapter
:
from decorator import decorator
@decorator
def trace(f, *args, **kw):
logger.debug("%r %r %r", f, args, kw)
return f(*args, **kw)
I also stop the test as soon as the first failure is observed.
One run has shown the error already for the first modification.
Below, I show the corresponding log file.
The _invalidate_finish
comes from the modifying client,
the loadBefore
from the verifying client. The serial in the
last loadBefore
shows that it already sees the new transaction.
A possible explanation could be that the new transaction was
observable before _invalidate_finish
has been completed.
2021-04-06T09:46:48 INFO ZEO.runzeo (3694) opening storage '1' using MappingStorage
------
2021-04-06T09:46:48 INFO ZEO.StorageServer StorageServer created RW with storages: 1:RW:Mapping Storage
------
2021-04-06T09:46:48 DEBUG trollius Using selector: EpollSelector
------
2021-04-06T09:46:48 INFO ZEO.asyncio.server listening on (u'127.0.0.1', 0)
------
2021-04-06T09:46:48 DEBUG ZEO.tests.forker SERVER CREATED
------
2021-04-06T09:46:48 DEBUG ZEO.tests.forker ADDRESS SENT
------
2021-04-06T09:46:48 INFO ZEO.ClientStorage ('127.0.0.1', 37011) ClientStorage (pid=3694) created RW/normal for storage: '1'
------
2021-04-06T09:46:48 INFO ZEO.cache created temporary cache file '<fdopen>'
------
2021-04-06T09:46:48 DEBUG trollius Using selector: EpollSelector
------
2021-04-06T09:46:48 DEBUG ZEO.asyncio.client disconnected <ZEO.asyncio.client.Client object at 0x7f5503226f90> None
------
2021-04-06T09:46:48 DEBUG ZEO.asyncio.client try_connecting
------
2021-04-06T09:46:48 DEBUG ZEO.asyncio.server Accepted connection
------
2021-04-06T09:46:48 INFO ZEO.asyncio.base Connected server protocol
------
2021-04-06T09:46:48 INFO ZEO.asyncio.base Connected Protocol(('127.0.0.1', 37011), '1', 0)
------
2021-04-06T09:46:48 INFO ZEO.asyncio.server received handshake 'Z5'
------
2021-04-06T09:46:48 INFO ZEO.ClientStorage ('127.0.0.1', 37011) Connected to storage: ('localhost', 37011)
------
2021-04-06T09:46:48 DEBUG txn.140003321820992 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x00', '\x00\x00\x00\x00\x00\x00\x00\x01') {}
------
2021-04-06T09:46:48 DEBUG txn.140003321820992 commit <Connection at 7f55027b3210>
------
2021-04-06T09:46:48 DEBUG ZEO.StorageServer (127.0.0.1:37011) ('1') lock: transactions waiting: 0
------
2021-04-06T09:46:48 BLATHER ZEO.StorageServer (127.0.0.1:37011) Preparing to commit transaction: 1 objects, 96 bytes
------
2021-04-06T09:46:48 DEBUG ZEO.StorageServer (127.0.0.1:37011) ('1') unlock: transactions waiting: 0
------
2021-04-06T09:46:48 DEBUG ZODB.mvccadapter <function _invalidate_finish at 0x7f550c9ea8d0> (<ZODB.mvccadapter.MVCCAdapter object at 0x7f55027b31d0>, '\x03\xdf1\x92\xd0\x88\x02\xdd', set(['\x00\x00\x00\x00\x00\x00\x00\x00']), <ZODB.mvccadapter.MVCCAdapterInstance object at 0x7f55027b32d0>) {}
------
2021-04-06T09:46:48 DEBUG txn.140003321820992 commit
------
2021-04-06T09:46:48 DEBUG txn.140003321820992 new transaction
------
2021-04-06T09:46:48 DEBUG txn.140003321820992 commit <Connection at 7f55027b3210>
------
2021-04-06T09:46:48 DEBUG ZEO.StorageServer (127.0.0.1:37011) ('1') lock: transactions waiting: 0
------
2021-04-06T09:46:48 BLATHER ZEO.StorageServer (127.0.0.1:37011) Preparing to commit transaction: 3 objects, 291 bytes
------
2021-04-06T09:46:48 DEBUG ZEO.StorageServer (127.0.0.1:37011) ('1') unlock: transactions waiting: 0
------
2021-04-06T09:46:48 DEBUG ZODB.mvccadapter <function _invalidate_finish at 0x7f550c9ea8d0> (<ZODB.mvccadapter.MVCCAdapter object at 0x7f55027b31d0>, '\x03\xdf1\x92\xd0\x8c\xaf\xaa', set(['\x00\x00\x00\x00\x00\x00\x00\x00', '\x00\x00\x00\x00\x00\x00\x00\x01', '\x00\x00\x00\x00\x00\x00\x00\x02']), <ZODB.mvccadapter.MVCCAdapterInstance object at 0x7f55027b32d0>) {}
------
2021-04-06T09:46:48 DEBUG txn.140003321820992 commit
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG txn.140003071555328 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x00', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x00', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x02', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003071555328 commit <Connection at 7f55027b3dd0>
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG ZEO.StorageServer (127.0.0.1:37011) ('1') lock: transactions waiting: 0
------
2021-04-06T09:46:48 BLATHER ZEO.StorageServer (127.0.0.1:37011) Preparing to commit transaction: 2 objects, 194 bytes
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG ZEO.StorageServer (127.0.0.1:37011) ('1') unlock: transactions waiting: 0
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x8c\xaf\xab') {}
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 abort
------
2021-04-06T09:46:48 DEBUG txn.140003079948032 new transaction
------
2021-04-06T09:46:48 DEBUG ZODB.mvccadapter <function _invalidate_finish at 0x7f550c9ea8d0> (<ZODB.mvccadapter.MVCCAdapter object at 0x7f55027b31d0>, '\x03\xdf1\x92\xd0\x93\xdc\xbb', set(['\x00\x00\x00\x00\x00\x00\x00\x01', '\x00\x00\x00\x00\x00\x00\x00\x02']), <ZODB.mvccadapter.MVCCAdapterInstance object at 0x7f55027b3e10>) {}
------
2021-04-06T09:46:48 DEBUG ZEO.ClientStorage <function loadBefore at 0x7f550322e7d0> (<ZEO.ClientStorage.ClientStorage object at 0x7f550426f610>, '\x00\x00\x00\x00\x00\x00\x00\x01', '\x03\xdf1\x92\xd0\x93\xdc\xbc') {}
------
2021-04-06T09:46:48 DEBUG txn.140003071555328 commit
Dieter Maurer wrote at 2021-4-6 10:10 +0200:
Dieter Maurer wrote at 2021-4-6 07:22 +0200:
Kirill Smelkov wrote at 2021-4-5 01:50 -0700: @.***, thanks for feedback.
I checked a bit more from scratch and, contrary to the problem from https://github.com/zopefoundation/ZEO/issues/155 (which fails on both py2 and py3), the test program in this issue fails on CPython2, but does not fail on CPython3. The difference might be thus somehow related to trollius.
I could reproduce the problem here with Python 2 -- investigating.
The problem is understood and related to PR #307:
during the development of this PR, it was recognized that it is vital
when the value returned by lastTransaction
changes.
The lastTransaction
interface now specifies that lastTransaction
must return the last transaction for which processing of
invalidations is complete.
@jmuchemb I am not sure that the requirement above is already sufficient:
at various interfaces, the tid and the associated invalidations
are tightly coupled. The requirement wants lastTransaction
updated
after the processing of the invalidations. What happens when between
this update invalidations for a new transaction are processed?
For ZEO.ClientStorage.ClientStorage
,
the lastTransaction
value actually comes from
its cache. Its LastTid
is regularly set before invalidation processing.
Thus, ClientStorage
does not satisfy the requirement above.
@d-maurer, thanks a lot for investigating.
@jmuchemb I am not sure that the requirement above is already sufficient: at various interfaces, the tid and the associated invalidations are tightly coupled. The requirement wants
lastTransaction
updated after the processing of the invalidations. What happens when between this update invalidations for a new transaction are processed?
Since it was me to originally push for this interface change, let me say that you are right. If the follow-up invalidation is processed in between we are getting into scenario that is explained in https://github.com/zopefoundation/ZODB/issues/290 and leads to corrupt view of data when updating to next transaction.
The interface should state not only that "process invalidations" is ordered with lastTransaction via "happens before" (Inv_x < L_y for x ≤ y
), but also that non of further invalidations are processed yet (L_y < Inv_z for y < z
).
This should remove the possibility of future invalidation getting in between invalidate and lastTransaction.
By default I let @jmuchemb comment and handle interface amendment since it was his version of text to be merged into interfaces instead of https://github.com/zopefoundation/ZODB/pull/313.
Looking at https://github.com/zopefoundation/ZEO/pull/167 I'm also sorry to say that I missed that ZEO5 does not confirm to requested interface amendment. I verified it for ZEO4 (https://github.com/zopefoundation/ZODB/pull/307#discussion_r434145034), and from https://github.com/zopefoundation/ZODB/pull/307#discussion_r434166238 got the impression it is also the case for ZEO5. But I never looked inside to verifiy that by myself. Let me appologize...
Regarding https://github.com/zopefoundation/ZEO/pull/167 changes I would also suggest to move RaceConditionTests
to ZODB itself so that all storages are excercised for this bug, similarly to e.g. test in PackableStorage:
https://github.com/zopefoundation/ZODB/blob/5.6.0-19-gdad778016/src/ZODB/tests/PackableStorage.py#L103 https://github.com/zopefoundation/ZODB/blob/5.6.0-19-gdad778016/src/ZODB/tests/testFileStorage.py#L41-L57 https://github.com/zopefoundation/ZODB/blob/5.6.0-19-gdad778016/src/ZODB/tests/testDemoStorage.py#L45-L56
(see also https://github.com/zopefoundation/ZODB/issues/290#issuecomment-811998279 for similar call)
Thanks again for looking into this issue, Kirill
( moving out from https://github.com/zopefoundation/ZEO/issues/155#issuecomment-811984478 since this seems to be a different problem )
Something really wrong is going on with ZEO5.
In addition to the test program provided in https://github.com/zopefoundation/ZEO/issues/155, where there are 8 simultaneous workers each of which opens its own storage-level connection, I've tried to run the test program from https://github.com/zopefoundation/ZODB/issues/290 with minimal edit to use ZEO instead of
MappingStorage
, and it fails too.In that test program there is only one storage-level connection and two simultaneous users that are using that single storage-level connection. In other words there are no external invalidations involved at all.
https://lab.nexedi.com/kirr/wendelin.core/blob/e2dc8651/lib/tests/testprog/zopenrace.py#L102-203
This failure indicates data corruption. Quoting https://github.com/zopefoundation/ZODB/issues/290:
---- 8< ---- That in turn can lead to loose of Consistency of the database if mix of current and old data is used to process a transaction. A classic example would be bank accounts A, B and C with A<-B and A<-C transfer transactions. If transaction that handles A<-C sees stale data for A when starting its processing, it results in A loosing what it should have received from B.
...
The program simulates two clients: one (T2) constantly modifies two integer objects preserving invariant that their values stay equal. The other client (T1) constantly opens the database and verifies the invariant. T1 forces access to one of the object to always go through loading from the database, and this way if live cache becomes stale the bug is observed as invariant breakage. ---- 8< ----
Here is how it fails with ZEO:
P.S. Suggested patch from https://github.com/zopefoundation/ZEO/issues/155#issuecomment-581046248 does not help here - even though it seemingly fixes https://github.com/zopefoundation/ZEO/issues/155, it does not help to fix https://github.com/zopefoundation/ZODB/issues/290 with ZEO5.
P.P.S. ZEO
5.2.2-1-g3d90ed42
, ZODB5.6.0-18-g75c703fcf
, CPython2.7.18