zopefoundation / BTrees

BTree "the bucket being iterated changed size" #118

Closed matthchr closed 3 years ago

matthchr commented 4 years ago

Versions: Python 3.6

BTrees==4.6.0
ZODB==5.5.1

I have a BTrees.OOBTree.BTree in a ZODB which seems to have somehow become corrupted.

While our system was running we observed that the data the BTree was returning seemed suspicious (while iterating the tree, items which should have been there were not actually there). We took a snapshot of the tree and manually examined it, and determined that the structure of the BTree seems to have become corrupted.

When we try to iterate the tree (or its keys):

for key in tree.keys():
    print('key: {}'.format(key))

We get this error:

Traceback (most recent call last):
  File "/home/test.py", line 40, in <module>
    for key in tree.keys():
RuntimeError: the bucket being iterated changed size

Interestingly, as far as I can tell our live service was not getting this exception (though it definitely was also not returning all of the data it should have been). I'm not sure why there's a discrepancy between what our live service was seeing with this DB and what I see when I download it and look through it locally.

BTrees.check.check shows no issues.

Obviously, I am not modifying the tree while iterating over it in my debugging environment - but when I look deeper into the structure of the tree by iterating the buckets one by one I see this:

Bucket: BTrees.OOBTree.OOBucket([('0231ed5f-6a2a-4675-a75a-103c6818de95', <Data object at 0x7faba5d396d8 oid 0x265b9 in <Connection at 7fabaa263cc0>>), ('03697f69-500b-4f7f-8ff1-b05c7d85b6cc', <Data object at 0x7faba5d39748 oid 0x26480 in <Connection at 7fabaa263cc0>>), ('03ac8297-38bf-4148-bd4e-df43c1e3d08b', <Data object at 0x7faba5d397b8 oid 0x262b5 in <Connection at 7fabaa263cc0>>), ('03d44486-31a2-4ed7-be24-d755f3cbff2a', <Data object at 0x7faba5d39828 oid 0x25fdb in <Connection at 7fabaa263cc0>>)])
... (a bunch more buckets that all look fine)
Bucket: BTrees.OOBTree.OOBucket([('4480e440-2d21-45f8-a909-664a85330df7', <Data object at 0x7faba5d41978 oid 0x264ad in <Connection at 7fabaa263cc0>>), ('452fc9ad-2544-4385-a034-3120a69409e8', <Data object at 0x7faba5d419e8 oid 0x264c4 in <Connection at 7fabaa263cc0>>), ('4554ce23-1ef5-4ff3-bee2-eb9233243448', <Data object at 0x7faba5d41a58 oid 0x2612f in <Connection at 7fabaa263cc0>>), ('45f92417-c80f-481a-b0d7-60110d757591', <Data object at 0x7faba5d41ac8 oid 0x25ccf in <Connection at 7fabaa263cc0>>)])
Bucket: BTrees.OOBTree.OOBucket([])
Bucket: BTrees.OOBTree.OOBucket([])
Bucket: BTrees.OOBTree.OOBucket([])
Bucket: BTrees.OOBTree.OOBucket([])
Bucket: BTrees.OOBTree.OOBucket([('91660989-9343-4f72-a076-94c8fc6c155f', <Data object at 0x7faba5ce03c8 oid 0xf769 in <Connection at 7fabaa263cc0>>), ('9176559f-e650-47bf-812b-dc4e75eaf299', <Data object at 0x7faba5ce0438 oid 0x223fa in <Connection at 7fabaa263cc0>>), ('91cd51fc-e9bd-4730-843f-b2553a06e63d', <Data object at 0x7faba5ce04a8 oid 0x22e68 in <Connection at 7fabaa263cc0>>), ('91d1b461-77ba-42b6-930b-326d7b2badbf', <Data object at 0x7faba5ce0518 oid 0x26310 in <Connection at 7fabaa263cc0>>), ('92acce7e-4e15-45c0-b954-2c25f9325731', <Data object at 0x7faba5ce0588 oid 0x221b9 in <Connection at 7fabaa263cc0>>), ('92e5b508-84a1-404b-9692-da2a7696c762', <Data object at 0x7faba5ce05f8 oid 0x22d32 in <Connection at 7fabaa263cc0>>), ... )])
... (more buckets with data here)

From what I can tell, zero-size buckets in the "middle" of the bucket list are not expected, and they are what triggers the exception I'm getting. Additionally, if you look closely, it seems like a large swath of buckets is lost: there should be a relatively even distribution of GUIDs as keys (and I do see that for first digits 0-4 and 9-f), but first digits 5-8 are missing.
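
For reference, the bucket chain can be walked along these lines (a simplified sketch relying on the __getstate__ format of trees and buckets; it assumes the tree has more than one bucket):

def dump_buckets(tree):
    # For a tree with more than one bucket, tree.__getstate__() is
    # ((child0, key1, child1, ...), firstbucket); a bucket's __getstate__() is
    # ((k1, v1, k2, v2, ...), next_bucket), or ((k1, v1, ...),) for the last bucket.
    bucket = tree.__getstate__()[1]
    while bucket is not None:
        state = bucket.__getstate__()
        keys = state[0][::2]
        print('Bucket with {} keys{}'.format(len(keys), '  <-- EMPTY' if not keys else ''))
        bucket = state[1] if len(state) > 1 else None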

My questions are:

  1. Why doesn't BTrees.check.check notice this?
  2. What could cause this type of corruption?

It's possible that this is related to #68, although they look sufficiently different to me that I figured I would post a new issue. It is also possible that this only started happening in BTrees==4.6.0 as we moved to that version relatively recently (we were on BTrees==4.5.1 previously) - but given how rarely this occurs, we may just not have noticed it until now...

matthchr commented 4 years ago

With respect to the "lost" buckets, I just did r = tree['747b6e77-3fe6-4155-afe6-b217505f4215'] (which was one of the lost keys) and somehow that actually worked. I can't seem to see/find the bucket where this is stored, but if I look it up by its specific key the BTree is able to get me the value. Not sure if that adds clarity or confuses things further...

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-1-10 14:58 -0800:

...

From what I can tell, zero-size buckets in the "middle" of the bucket list are not expected, and they are what triggers the exception I'm getting. Additionally, if you look closely, it seems like a large swath of buckets is lost: there should be a relatively even distribution of GUIDs as keys (and I do see that for first digits 0-4 and 9-f), but first digits 5-8 are missing.

My questions are:

  1. Why doesn't BTrees.check.check notice this?

Likely an omission.

  1. What could cause this type of corruption?

Your report reminds me of an earlier report. That, too, reported a BTrees corruption - involving concurrency.

I suggested using dm.historical to find which transaction introduced the corruption (Zope 2 usually writes sufficient transaction metadata to get a feeling for what BTree operations could have been performed). Of course, this requires that the corruption was introduced after the last (effective) pack time.
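
Something along these lines (a minimal sketch from memory; tree stands for the suspect BTree loaded from a connection, and the entry keys shown are the ones dm.historical records):

from dm.historical import getHistory

# Each history entry describes one transaction that modified the object:
# its time, transaction id ("tid") and the recorded transaction metadata.
for entry in getHistory(tree):
    print(entry["time"], entry["tid"], entry["description"], entry["user_name"])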

matthchr commented 4 years ago

@d-maurer - Yes, after the previous issue we had, we significantly improved our locking. All BTree writes are now done under a lock - so it doesn't seem as likely that this is a simple corruption issue due to multi-threading (at least not with writes).

We also did observe that the previous issue, where the BTree would show us a key (in say, for item in tree) but be unable to access the value of that key when doing tree[key] seems to have been mitigated by locking.

There are a few possibilities we are wondering about in the same space though:

  1. We call zodb.pack() periodically, and that is not currently under a lock. Is it possible that we're introducing some sort of corruption by packing? My read of the pack documentation suggests that's probably unlikely but it's one avenue I'm investigating.
  2. Only writes are holding the lock right now, with the assumption that a write + read happening in parallel wouldn't corrupt the BTree - is that assumption flawed? Should we lock all reads and writes?

Also, from the previous issue you had suggested:

If multiple threads need access to the same persistent object, each one must get its own local copy by loading it (directly or indirectly) from a local (i.e. thread specific) ZODB connection.

Our application uses async/await as its primary form of asynchrony - which actually means really there's only one thread in play (the one running the asyncio loop). That thread is the only thread with a Connection at all (IIRC by default connections are associated to the thread they're opened on) and so I don't think there's any conflict resolution going on at that level.

I will definitely improve our logging to detect this case as well (check missed it, so we'll need some specific tests for this I think). Am I correct in assuming that there should be no empty buckets in the bucket linked list?

jamadden commented 4 years ago

we significantly improved our locking

So you must be using coarse-grained cross-process locking? Because writes can occur from any process at any time. That seems very expensive and counter-productive. Or you just have a single process...

Our application uses async/await as its primary form of asynchrony - which actually means really there's only one thread in play (the one running the asyncio loop). That thread is the only thread with a Connection at all (IIRC by default connections are associated to the thread they're opened on) and so I don't think there's any conflict resolution going on at that level.

Sorry, but if the application is sharing Persistent object instances across concurrency contexts it is unbelievably architecturally broken. That is to say, any object instance produced by a Connection must only ever be used by a single "thread" at a time (as must the Connection). Introducing any chance of doing otherwise is just not dreamed of by ZODB.

Perhaps I've misunderstood, and there are indeed extra guarantees in place in this architecture to ensure that the obvious errors of mixing concurrency and Connection don't happen. If that's the case, please forgive me.

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-1-13 13:59 -0800:

@d-maurer - Yes, after the previous issue we had, we significantly improved our locking. All BTree writes are now done under a lock - so it doesn't seem as likely that this is a simple corruption issue due to multi-threading (at least not with writes).

The ZODB was designed to be used without explicit locking. Thus, if used correctly, locking should not be necessary.

We also did observe that the previous issue, where the BTree would show us a key (in say, for item in tree) but be unable to access the value of that key when doing tree[key] seems to have been mitigated by locking.

There are a few possibilities we are wondering about in the same space though:

  1. We call zodb.pack() periodically, and that is not currently under a lock. Is it possible that we're introducing some sort of corruption by packing? My read of the pack documentation suggests that's probably unlikely but it's one avenue I'm investigating.

Packing should not introduce corruptions. If anything, it may at most lose whole transactions; it should be unable to corrupt the internal state of persistent objects.

  1. Only writes are holding the lock right now, with the assumption that a write + read happening in parallel wouldn't corrupt the BTree - is that assumption flawed? Should we lock all reads and writes?

As indicated above, no locks should be necessary.

Also, from the previous issue you had suggested:

If multiple threads need access to the same persistent object, each one must get its own local copy by loading it (directly or indirectly) from a local (i.e. thread specific) ZODB connection.

Our application uses async/await as its primary form of asynchrony - which actually means really there's only one thread in play (the one running the asyncio loop). That thread is the only thread with a Connection at all (IIRC by default connections are associated to the thread they're opened on) and so I don't think there's any conflict resolution going on at that level.

What I have formulated above for threads is actually a requirement for concurrent activities controlling a transaction. In the standard setup, a transaction is bound to a controlling thread. In your case, however, your coroutines are likely the concurrent activities controlling the transactions and must fulfill the requirement.

If the requirement is not fulfilled, you may lose important transaction features: effects of different concurrent activities may get committed together, leading to an inconsistent state. Another effect may be that some modifications by one concurrent activity are wiped out by an "abort" from a different concurrent activity, while the first concurrent activity makes further modifications that are only valid if its former modifications were effective - again leading to an inconsistent state.

The inconsistencies mentioned above are application level inconsistencies. I do not think that not fulfilling the requirement can cause data structure level inconsistencies. HOWEVER I have seen (very rare) cases where READING (!) a BTree structure from a foreign thread has caused a SIGSEGV: apparently, some (internal) "BTree" operations are performed without holding the GIL and interfering reads can be disastrous. However, coroutine concurrency should not introduce such behaviour (as it does not interrupt internal BTree operations).

matthchr commented 4 years ago

@jamadden asked:

So you must be using coarse-grained cross-process locking? Because writes can occur from any process at any time. That seems very expensive and counter-productive. Or you just have a single process...

There's only a single process, a single asyncio event loop, and a single active Connection. Locks are just in-memory locks.

Sorry, but if the application is sharing Persistent object instances across concurrency contexts it is unbelievably architecturally broken. That is to say, any object instance produced by a Connection must only ever be used by a single "thread" at a time (as must the Connection). Introducing any chance of doing otherwise is just not dreamed of by ZODB.

asyncio only provides cooperative yielding/scheduling. Are you saying that the following is not allowed:

async def test_async(zodb_connection):
    await some_async_operation()
    zodb_connection.root()['a'].persistent_thing.x = 1
    transaction.commit()

async def test_async_2(zodb_connection):
    await some_async_operation()
    zodb_connection.root()['a'].persistent_thing.x = 2
    transaction.commit()

# Example usage
await test_async(conn)
...
await test_async_2(conn)

That is to say, any object instance produced by a Connection must only ever be used by a single "thread" at a time (as must the Connection).

Provided when you say "Thread" here you mean Python thread (and not some lower level primitive or logical concept), we aren't violating this - asyncio also only ever runs coroutines on a single Python thread. Interruptions can only occur when an await is reached and the interrupting coroutine is also run on the same thread.

@d-maurer says:

If the requirement is not fulfilled, you may lose important transaction features: effects of different concurrent activities may get committed together, leading to an inconsistent state. Another effect may be that some modifications by one concurrent activity are wiped out by an "abort" from a different concurrent activity, while the first concurrent activity makes further modifications that are only valid if its former modifications were effective - again leading to an inconsistent state.

These are good points - as you say though these are application level inconsistencies which we manage:

  1. We don't ever abort currently
  2. Modification to any persistent object is accompanied by a transaction.commit() in the coroutine which made the modification
  3. There aren't any awaits between the modification of a persistent object and a transaction being committed (it's done within a with block)

Taking a step back - the entire point of doing what we're doing is that we observed frequently opening/closing Connections to be expensive. It's not tenable for our application to open a fresh connection for every single coroutine where we may make a modification, due to the overall performance impact of doing so - we open a connection, do many transactions/modifications, and then close the connection (which I believe triggers persistence to disk) at a defined time. My understanding was that this pattern, while strange, is not fundamentally against the architecture of ZODB/BTrees - is that not correct?

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-1-14 09:08 -0800:

... Taking a step back - the entire point of doing what we're doing is that we observed frequently opening/closing Connections to be expensive.

I am using the ZODB in the context of Zope. In this context, a connection (at least one, sometimes several) is opened/closed for each request. I have not observed that this causes significant overhead for request processing.

Note that "ZODB.DB.DB" uses a connection pool to make opening new connections efficient (the primary purpose is to reuse the cache associated with the connection; the open itself is fast anyway). You should see significant degradations only when you open more concurrent connections than the pool size.

... My understanding was that this pattern, while strange, is not fundamentally against the architecture of ZODB/BTrees - is that not correct?

Reading persistent objects, too, may (or may not) be important for application level consistency as explained below.

For a transactional system, the notion of "transaction isolation" is important. This "measures" the degree of transaction independence. The best is "serial": with this isolation, the execution of a set of transactions either fails or has the same effect as some serial (rather than concurrent) execution of the transactions.

The ZODB implements a weaker transaction isolation: so called "snapshot isolation". In this model, a transaction sees the state when the transaction started. Conflicts introduced by concurrent modifications to individual objects are recognized during commit and cause some affected transaction[s] to be aborted.

If you have two coroutines C1 and C2 with:

C1: ... ... read persistent object ... ... ... context switch ... ... ... read persistent object ...

and

C2: ... ... write persistent object ... ... commit ... context switch ...

then "C1" may read inconsistent data. The first "read" reads state before the commit; the second after the commit.

Such inconsistent reads may or may not be a problem for your application. In no case should they be responsible for the physical corruption you have observed.
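
To make this concrete, a toy sketch (in-memory storage, a single connection shared by two coroutines with the default thread-bound transaction manager - not your application code):

import asyncio
import transaction
from ZODB.DB import DB

db = DB(None)                      # in-memory storage, for illustration only
conn = db.open()                   # one connection shared by both coroutines
root = conn.root()
root['counter'] = 0
transaction.commit()

async def c1():
    first = root['counter']        # read before the cooperative switch
    await asyncio.sleep(0)         # context switch: c2 runs and commits
    second = root['counter']       # read after the switch
    print('c1 read', first, 'then', second)   # typically "0 then 1": an inconsistent pair

async def c2():
    root['counter'] += 1
    transaction.commit()           # the shared transaction commits everything pending

async def main():
    await asyncio.gather(c1(), c2())

asyncio.run(main())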

Recently, I fixed a BTree inconsistency in a client's system. The system has been operational for about 10 years and this was the first detected inconsistency. I tend to think that this might have been the effect of less-than-complete reliability: computers and storage media are highly reliable, but nevertheless, very rarely, errors happen. If such an error affects persistent data, its effect will remain... Maybe you, too, are seeing the effect of a random error? In your place, I would fix the corrupt BTree and start worrying only should a similar problem reoccur in a short to medium timeframe.

matthchr commented 4 years ago

We have finally been able to get a somewhat reliable repro of this issue... appreciate your leaving this issue open so long even with no activity.

We've found a way to generate load in our application which produces this issue somewhat consistently. I have shared a ZODB file with the full transaction history available (no packs have been done). I followed the advice given in #102 by @d-maurer and used dm.historical to look through the transaction history for the BTree in question.

Here's what I found (caveat that I am not an expert here so I was learning as I went):

I took a deeper look at the difference in history between the last good history entry and the first bad one and what I observed was that we removed keys from 4 buckets between the two entries. Specifically these keys:

bucket 118 differs
1 keys only in lhs {'testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-2 22F1E3E288AA12F8$job-1$abs-2-job406589010-2-task-63$0:662512081'}
bucket 119 differs
1 keys only in lhs ({'testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-2 22F1E3E288AA12F8$job-1$abs-2-job406589010-2-task-67$0:662512081'})
bucket 159 (this is the bucket that becomes corrupt) differs
1 keys only in lhs ({'testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-4 22F1E3E288A8B2FD$job-1$abs-2-job406589010-4-task-205$0:662512051'})
bucket 160 differs
2 keys only in lhs ({'testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-4 22F1E3E288A8B2FD$job-1$abs-2-job406589010-4-task-235$0:662512051', 'testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-4 22F1E3E288A8B2FD$job-1$abs-2-job406589010-4-task-234$0:662512051'})

The net of this is that basically we removed 5 keys from 4 buckets, with the key in bucket 159 being the last key. That bucket (oid b'\x00\x00\x00\x00\x00\x00\r\x91') didn't disappear like it was (I think) supposed to.

The ZODB containing the BTree in question is in this zip. There are multiple BTrees in the DB; the one with the issue is btree = conn.root()['agent.task.history'].history. I've posted the code I used to gather this data here.

Here are the details of our application and the situations in which this reproduces:

  • The application is entirely using Python async/await, so as discussed above there is some concurrency in the application but no true parallelism.

I have tried to produce a synthetic test that reproduces this issue outside of the context of our application (no PyInstaller, etc.), but I haven't been able to do so. I am hoping that somebody can take a look at the BTree in question and maybe something will jump out about what specifically is suspicious in this transaction. With that information maybe I could write a test that reproduces the problem without all of the context/cruft that our application brings, since right now the only way I have to reproduce this is a long and somewhat cumbersome process that involves running our entire application for a while, driving particular load at it, and hoping the issue reproduces.

matthchr commented 4 years ago

I also just now realized that @jamadden never actually clarified his usage of the word Thread w.r.t. the previous comments - it could be that our application is unbelievably architecturally broken; if that's the case I'd like to know!

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-8-26 15:30 -0700:

We have finally been able to get a somewhat reliable repro of this issue... appreciate your leaving this issue open so long even with no activity. ...

  • The application is entirely using Python async/await, so as discussed above there is some concurrency in the application but no true parallelism.

This is potentially problematic -- unless you ensure that each coroutine uses its own ZODB connection and the transactions are coroutine specific.

The transaction manager used by ZODB connections is by default thread-specific, not coroutine-specific. Thus, you would need to pass a special transaction manager when you open your ZODB connections.

Failure to ensure those may lead to inconsistent persistent data. I am not sure whether it can lead to a physically damaged tree.

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-8-26 15:31 -0700:

I also just now realized that @jamadden never actually clarified his usage of the word Thread w.r.t. the previous comments - it could be that our application is unbelievably architecturally broken; if that's the case I'd like to know!

"thread" has the narrow meaning of a "Python thread" (see the Python documentation for its modules _thread and threading). The coroutines associated with "async/await" are not threads in this sense.

By default, the ZODB targets a multi-thread application and makes assumptions about how threads use ZODB connections: each thread uses its own connection (and therefore its own local copies of the persistent objects), and transactions are bound to the controlling thread.

The coroutines associated with "async/await" are similar to threads. However, they, too, must satisfy the conditions formulated above for threads if they want to use the ZODB safely. In particular, they must use a special transaction manager which binds the transactions to coroutines rather than threads.
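
Schematically (a sketch only, not a drop-in for your application; handle_request stands for whatever unit of work one coroutine performs):

import transaction

async def handle_request(db):
    # One transaction manager and one connection per coroutine, so a commit or
    # abort here cannot affect work done concurrently by other coroutines.
    tm = transaction.TransactionManager()
    conn = db.open(transaction_manager=tm)
    try:
        root = conn.root()
        # ... read/modify persistent objects obtained from *this* connection only ...
        tm.commit()
    except Exception:
        tm.abort()
        raise
    finally:
        conn.close()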

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-8-26 15:30 -0700:

... I took a deeper look at the difference in history between the last good history entry and the first bad one and what I observed was that we removed keys from 4 buckets between the two entries. ... The net of this is that basically we removed 5 keys from 4 buckets, with the key in bucket 159 being the last key. That bucket (oid b'\x00\x00\x00\x00\x00\x00\r\x91') didn't disappear like it was (I think) supposed to.

You are right. The bucket should have been removed from the tree.

Your problem may have a stronger relation to #68 than originally assumed: even for the good tree (which still can be iterated over) btree._check() reports AssertionError: Bucket next pointer is damaged. Thus, your observed problem (bucket not removed) might be a consequence of an earlier bug (damaged bucket chain).

The ZODB containing the BTree in question is in this zip. There are multiple BTrees in the DB: The one with the issue is: btree = conn.root()['agent.task.history'].history I've posted the code I used to gather this data here.

For those who want to join the analysis, I used the following code:

from pprint import pprint as pp
from ZODB.DB import DB
from ZODB.FileStorage import FileStorage
s=FileStorage("agent.db")
db=DB(s)
c=db.open()
o=c.root()['agent.task.history']
o._p_activate()
t=o.__Broken_state__["history"]
from dm.historical import *
b=c[b'\x00\x00\x00\x00\x00\x00\r\x91']
bad = getObjectAt(t, b'\x03\xda\x0bj4\xfd\x99\xee') # no longer iterable
good = getObjectAt(t, b'\x03\xda\x0bj4\xfd\x99\xe0') # still iterable
from BTrees.check import *
top=crack_btree(bad, True)
bb_p = top[2][157] # predecessor of bad bucket
bb = top[2][158] # bad bucket
good._check()

d-maurer commented 4 years ago

... From what I can tell, zero-size buckets in the "middle" of the bucket list are not expected, and they are what triggers the exception I'm getting. ... My questions are:

1. Why doesn't `BTrees.check.check` notice this?

This is checked (among other things) by the _check method of BTrees, not by the check function.
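
For example (a small sketch):

from BTrees.OOBTree import OOBTree
from BTrees.check import check

t = OOBTree()
t.update({str(i): i for i in range(1000)})

check(t)     # value/ordering based sanity checks; silent when nothing is wrong
t._check()   # pointer/structure based checks (e.g. the bucket chain); raises AssertionError on damage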

d-maurer commented 4 years ago

Dieter Maurer wrote at 2020-8-28 08:42 +0200:

Matthew Christopher wrote at 2020-8-26 15:30 -0700:

... Your problem may have a stronger relation to #68 than originally assumed: even for the good tree (which still can be iterated over) btree._check() reports AssertionError: Bucket next pointer is damaged. Thus, your observed problem (bucket not removed) might be a consequence of an earlier bug (damaged bucket chain).

I have analysed this further - and indeed, the "bucket not removed" is likely a direct consequence of the damaged bucket chain: almost surely, a different bucket got removed (the one in the chain at the position of the emptied bucket).

I used the following code (following code previously posted):

gt=crack_btree(good, True)
for i in range(len(gt[1])):
  if gt[2][i].__getstate__()[1] is not gt[2][i+1]: print(i); break

The result was 157; this is the position before the emptied bucket which was not removed (as it should have been).

d-maurer commented 4 years ago

Dieter Maurer wrote at 2020-8-28 09:28 +0200:

Dieter Maurer wrote at 2020-8-28 08:42 +0200:

Matthew Christopher wrote at 2020-8-26 15:30 -0700:

... Your problem may have a stronger relation to #68 than originally assumed: even for the good tree (which still can be iterated over) btree._check() reports AssertionError: Bucket next pointer is damaged. Thus, your observed problem (bucket not removed) might be a consequence of an earlier bug (damaged bucket chain).

I have analysed this further - and indeed, the "bucket not removed" is likely a direct consequence of the damaged bucket chain: almost surely, a different bucket got removed (the one in the chain at the position of the emptied bucket).

I used the following code (following code previously posted):

gt=crack_btree(good, True)
for i in range(len(gt[1])):
 if gt[2][i].__getstate__()[1] is not gt[2][i+1]: print(i); break

The result was 157; this is the position before the emptied bucket which was not removed (as it should have been).

The "damaged bucket chain" was introduced by the following transaction:

{'description': '',
 'obj': <BTrees.OOBTree.OOBTree object at 0x7f8b3665e158 oid 0x8 in <Connection at 7f8b366cdeb8>>,
 'size': 2961,
 'tid': b'\x03\xda\x0bi 2Wf',
 'time': DateTime('2020/08/20 01:37:7.546089 GMT+2'),
 'user_name': ''}

This is history record "5186" in the reversed history obtained from dm.historical.generateBTreeHistory(t).

The previous transaction (where the bucket chain is still okay) is:

{'description': '',
 'obj': <BTrees.OOBTree.OOBTree object at 0x7f8b3665e1e0 oid 0x8 in <Connection at 7f8b366cdf60>>,
 'size': 2661,
 'tid': b'\x03\xda\x0bi\x02\xa9\xce\xbb',
 'time': DateTime('2020/08/20 01:37:0.624213 GMT+2'),
 'user_name': ''}

In the bad transaction, the bucket following the bucket with the bad nextbucket pointer was emptied, the nextbucket pointer of the previous bucket was correctly updated (and now points to the bucket following the emptied bucket) BUT the emptied bucket was not removed from the parent btree.

I have looked at the code: responsible is BTreeTemplate.c:_BTree_set. While the bucket chain update is quite far away from the data update, I have not seen how it could be possible to "forget" the data update.

I will see whether I can reproduce the behavior on my linux platform.

d-maurer commented 4 years ago

I will see whether I can reproduce the behavior on my linux platform.

I was unable to emulate the application specific modules/classes -- a requirement for the reproduction on a foreign system. @matthchr can you provide such an emulation (or even the true code)?

d-maurer commented 4 years ago

The following code checks deletion of a bucket:

from BTrees.OOBTree import OOBTree, OOBucket
from copy import deepcopy

b2 = OOBucket(dict.fromkeys(range(20, 22)))
b1 = OOBucket()
b1.__setstate__(((10, None), b2))
b0 = OOBucket()
b0.__setstate__(((0, None), b1))
t = OOBTree()
t.__setstate__(((b0, 10, b1, 20, b2), b0))
t._check()
tc = deepcopy(t)
del tc[10]
tc._check()

No "damaged bucket chain" -- at least not under Linux.

matthchr commented 4 years ago

@d-maurer thanks for the in-depth investigation. I hadn't realized that _check did something different than BTrees.check.check

One thing I noticed when walking through the history myself is that my dm.historical is showing me far fewer history entries than you seem to be seeing. You mentioned seeing 5k+, but I only see ~500. What version of dm.historical are you using? I'm using version 2.0.1, and I think that this version is missing certain history entries. I wrote a simple test where I just committed 2000 transactions (each labeled with a number), and while it had each of the first 30 or 40, after that it seemed to skip a bunch between history entries (i.e. one entry would be txn 67, the next 93).

No "damaged bucket chain" -- at least not under Linux.

I ran the code snippet you shared on my Windows environment and didn't see any issue either.

I was unable to emulate the application specific modules/classes -- a requirement for the reproduction on a foreign system. @matthchr can you provide such an emulation (or even the true code)?

I haven't been able to produce a pared-down example which can emulate this problem, and unfortunately I can't share the actual code in question.

Before we get too far, we should revisit this:

This is potentially problematic -- unless you ensure that each coroutine uses its own ZODB connection and the transactions are coroutine specific.

The transaction manager used by ZODB connections is by default thread-specific, not coroutine-specific. Thus, you would need to pass a special transaction manager when you open your ZODB connections.

Failure to ensure those may lead to inconsistent persistent data. I am not sure whether it can lead to a physically damaged tree.

We are not using a coroutine specific transaction manager when we open connections - I was (mistakenly?) under the assumption that since we weren't actually multi-threaded, we could just use the default transaction manager. As you mentioned, that defaults to a transaction per-thread - since we only have one thread there is at most a single transaction ongoing at a given time.

Can you explain a bit more about the mechanics of how this is problematic? I am failing to understand what the difference is in these two pieces of code from the perspective of ZODB/BTrees:

async def test(btree):
    await t1(btree)
    await t2(btree)
    await t3(btree)
    transaction.commit()

async def t1(btree):
    btree['foo'] = 5

async def t2(btree):
    btree['bar'] = 6

async def t3(btree):
    btree['baz'] = 7

def test2(btree):
    btree['foo'] = 5
    btree['bar'] = 6
    btree['baz'] = 7
    transaction.commit()

Obviously this example is overly simplistic because it assumes that execution in an async context is always well-ordered, which it's not. Even with a more complicated case where we're reacting to some outside event (a web request or something), such that there may be multiple calls ongoing and order is not guaranteed, I still am not seeing how that's any different than making a bunch of sequential changes and commits in a non-async context.

One issue (which I do understand), is that if we do something like this:

async def a():
    # modify some btree state (part 1)
    await <something>
    # modify some more btree state (part 2)
    transaction.commit()

and have the expectation that both part 1 and part 2 will always be committed together, that is not correct (because we awaited in the middle and something else could have committed during that await). I don't understand how this could corrupt the btree though (because as far as ZODB/BTrees is concerned we've just done a bunch of sequential commits).

Can you clarify if you think that our lack of coroutine-specific transactions is the cause of this corruption, or if you are just cautioning against that because of possible logic errors?

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-8-28 11:37 -0700:

@d-maurer thanks for the in-depth investigation. I hadn't realized that _check did something different than BTrees.check.check

One thing I noticed when walking through the history myself is that my dm.historical is showing me far fewer history entries than you seem to be seeing. You mentioned seeing 5k+, but I only see ~500.

I used generateBTreeHistory (which contained a Python 3 incompatibility -- now fixed with new version 2.0.2).

Note that each persistent object has its own history - independent of the history of its persistent subobjects. getHistory (and generateHistory) access this history. generateBTreeHistory tries to take the history of all persistent BTree and Bucket subobjects into account (but it, too, can miss some history records).

What version of dm.historical are you using?

2.0.2 (released today).

I'm using version 2.0.1, and I think that this version is missing certain history entries. I wrote a simple test where I just committed 2000 transactions (each labeled with a number), and while it had each of the first 30 or 40, after that it seemed to skip a bunch between history entries (i.e. one entry would be txn 67, the next 93).

When you made your test with a BTree, initially there is a single persistent object - and getHistory gives you the full history. When the tree becomes larger, it grows persistent subobjects. Modifications to those subobjects (alone) are not reflected in the history of the tree - only modifications of the (top-level) tree structure show up in the tree's history.

generateBTreeHistory is there to take modifications of sub tree structures into account.
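
In code terms (a sketch):

from dm.historical import getHistory, generateBTreeHistory

top_only = getHistory(t)               # revisions of the top-level tree object only
full = list(generateBTreeHistory(t))   # newest first; also merges revisions of persistent buckets/subtrees
full.reverse()                         # oldest first, as used above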

No "damaged bucket chain" -- at least not under Linux.

I ran the code snippet you shared on my Windows environment and didn't see any issue either.

I feared that: BTrees comes with extensive tests which are run on Windows as well. Almost surely, there is a test which tests bucket deletion.

I was unable to emulate the application specific modules/classes -- a requirement for the reproduction on a foreign system. @matthchr can you provide such an emulation (or even the true code)?

I haven't been able to produce a pared-down example which can emulate this problem, and unfortunately I can't share the actual code in question.

I do not need the real code - just the relevant module/class structure. Below is the emulation I have tried:

from types import ModuleType
m = ModuleType("agent.task.history")
m.task = m.history = m.defs = m
from sys import modules
modules[m.__name__] = modules["agent"] = modules["agent.task"] = modules["agent.task.defs"] = m
from persistent import Persistent
class TaskHistoryInfo(Persistent): pass

TaskHistoryInfo.__module__ = "agent.task.history"
m.TaskHistoryInfo = TaskHistoryInfo

class TaskHistoryData(Persistent): pass

TaskHistoryData.__module__ = "agent.task.history"
m.TaskHistoryData = TaskHistoryData

class TaskKind(object): pass

TaskKind.__module__ = "agent.task.history"
m.TaskKind = TaskKind

It is unfortunately not sufficient: trying to copy the last correct tree gives an "object() takes no arguments" error. Likely, the base class of one of the above classes is wrong.

Before we get too far, we should revisit this:

This is potentially problematic -- unless you ensure that each coroutine uses its own ZODB connection and the transactions are coroutine specific.

Bugs of this kind should not damage the physical structure of the tree. In your case, the physical structure is damaged: for some (still unknown) reason the bucket chain was updated but not the actual tree. Whatever you do (wrong) at the application level, you should not be able to achieve this.

The code can skip the tree update -- but only in case of errors and those should result in an exception. Maybe, under some rare cases (triggered by your tree) an error occurs which (wrongfully) is not transformed into an exception. Debugging will be necessary to find this out.

... Can you explain a bit more about the mechanics of how this is problematic?

An important ZODB aim is to make multi-thread applications (fairly) safe without the need of explicit locking. It uses two techniques for this: "transaction"s and local copies.

Each connection (logically) has its own copies of the objects in the ZODB. Modifications affect the copies, not the persistent data directly. Only when the transaction is committed are modifications transferred from the local copies to the persistent data.

When several threads share the same transaction, a commit or abort stores or rolls back not only the modifications of a single thread but those of all the threads - and usually, the thread issuing the transaction control does not know about the state of the other threads. As a consequence, modifications may be persisted (or reverted) in an uncontrolled way.

When a single connection is shared by several threads, they operate on the same local copies and can interfere with one another.

What I wrote about threads above applies also to coroutines (or other concurrent activities).
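
To illustrate the "local copies" point, a small self-contained sketch (in-memory storage, two connections with their own transaction managers):

import transaction
from ZODB.DB import DB

db = DB(None)                     # in-memory storage
tm1 = transaction.TransactionManager()
tm2 = transaction.TransactionManager()
c1 = db.open(transaction_manager=tm1)
c2 = db.open(transaction_manager=tm2)

c1.root()['x'] = 1          # modifies only c1's local copy of the root
print('x' in c2.root())     # False: c2 still works with its own (older) copy
tm1.commit()                # now the modification is transferred to the storage
tm2.begin()                 # a new transaction lets c2 see the committed state
print(c2.root()['x'])       # 1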

I am failing to understand what the difference is in these two pieces of code from the perspective of ZODB/BTrees:

async def test(btree):
    await t1(btree)
    await t2(btree)
    await t3(btree)
    transaction.commit()

async def t1(btree):
    btree['foo'] = 5

async def t2(btree):
    btree['bar'] = 6

async def t3(btree):
    btree['baz'] = 7

def test2(btree):
    btree['foo'] = 5
    btree['bar'] = 6
    btree['baz'] = 7
    transaction.commit()

Obviously this example is overly simplistic because it assumes that execution in an async context is always well-ordered, which it's not. Even with a more complicated case where we're reacting to some outside event (a web request or something), such that there may be multiple calls ongoing and order is not guaranteed, I still am not seeing how that's any different than making a bunch of sequential changes and commits in a non-async context.

Lets see:

Assume the "await t2(btree)" causes a coroutine switch and the foreign coroutine calls transaction.abort(). Then the effect of t1(btree) will be undone - a thing completely unexpected by test.

The ZODB's aim is to ensure that in a single thread/coroutine you can forget that there are other threads/coroutines. If the transactions are not thread/coroutine specific, this is obviously no longer the case.

One issue (which I do understand), is that if we do something like this:

async def a():
    # modify some btree state (part 1)
    await <something>
    # modify some more btree state (part 2)
    transaction.commit()

and have the expectation that both part 1 and part 2 will always be committed together, that is not correct (because we awaited in the middle and something else could have committed during that await). I don't understand how this could corrupt the btree though (because as far as ZODB/BTrees is concerned we've just done a bunch of sequential commits).

It should not be possible to corrupt the physical btree structure (unless a coroutine switch would happen as part of object destruction -- I believe that this is impossible).

Can you clarify if you think that our lack of coroutine-specific transactions is the cause of this corruption,

I do not think so.

Not having coroutine specific transactions (or sharing connections between coroutines) may give you inconsistencies at the application level. It should not be able to cause physical damage to the btree structure itself (because there should be no coroutine switch during a single btree operation only between btree operations).

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-8-28 11:37 -0700:

... Can you clarify if you think that our lack of coroutine-specific transactions is the cause of this corruption, or if you are just cautioning against that because of possible logic errors?

Your issue is almost surely related to #68 -- and that one almost surely was observed in a standard multi-threaded application.

What you do is not right -- but it is not the cause of the observed damage.

jamadden commented 4 years ago

because there should be no coroutine switch during a single btree operation only between btree operations

That's not out of the realm of possibility, though, is it? If the objects you're using as keys wind up switching in their __cmp__ (__eq__, __lt__) methods, then what's supposedly sequential C code, protected by the GIL, is not. Your point about object destruction is also a good one: A __del__ method implemented in Python can do just about anything, so a line of code like btree['a'] = object() could have whatever side-effect that btree['a'].__del__ does; that might include switching. I wonder if we're looking at a case of re-entrancy (a btree method switching out, part way through, and then being called again by another operation).

(I'm far more familiar with gevent than with async/await. This is one place that the explicitness of async/await comes in handy. It's probably tough to arrange for switches in those methods without some trouble and several layers. )

Your issue is almost surely related to #68 -- and that one almost surely was observed in a standard multi-threaded application.

In fact, it's actually a gevent-based application, but it's using almost exclusively the standard-library as monkey-patched by gevent, and greenlets are treated just like threads: one Connection and transaction per request per greenlet. Persistent objects are never shared between "threads" of control without a great deal of care and following some rules (rule #1: Don't do that!) At least they're not supposed to be, so after this conversation I'm wondering if that might have been the root cause of #68.

matthchr commented 4 years ago

@d-maurer Thanks for your explanation - I understand now.

@jamadden AFAIK it's not possible to actually yield control from a coroutine without an await, and I am not aware of a way to use that keyword in a method such as __eq__, __lt__, etc. Certainly we aren't doing that.

What you do is not right -- but it is not the cause of the observed damage.

Thanks for the clarification - I will look at making changes to use a connection/txn per-coroutine (I agree that there are clear flaws with what we're doing now, although we're managing the application inconsistencies that may arise from it), but good to hear that it shouldn't be causing corruption.

I do not need the real code - just the relevant module/class structure.

In that case, I may be able to help. Can you share the full snippet of code that you're trying to execute (with the mock object hierarchy + copy + usage of dm.historical)? If I get that, I can probably fix it up and share it back with you if I can get it working.

jamadden commented 4 years ago

AFAIK it's not possible to actually yield control from a coroutine without an await, and I am not aware of a way to use that keyword in a method such as __eq__, __lt__, etc. Certainly we aren't doing that.

Yeah, it wouldn't be directly done, it would probably be something buried under layers of function calls. This simple example shows control going to different coroutines while inside of an __eq__ method. If that happened during a BTree operation, and one of the switched-to coroutines was also using the BTree, I'd expect problems.

import asyncio
from BTrees import family64

async def coro2():
    print("In coro 2")
    print("Len of bt", len(bt))

async def thing():
    print("-> Enter coro 1")
    await coro2()
    print("<- Exit coro 1")

def do_thing():
    asyncio.run(thing())

class Foo:

    def __eq__(self, other):
        print("In __eq__")
        import traceback; traceback.print_stack()
        do_thing()
        return True

    __lt__ = __eq__

bt = family64.OO.BTree()
bt[Foo()] = 1
bt[Foo()] = 2

$ python /tmp/atest.py
In __eq__
  File "/tmp/atest.py", line 28, in <module>
    bt[Foo()] = 2
  File "/tmp/atest.py", line 20, in __eq__
    import traceback; traceback.print_stack()
-> Enter coro 1
In coro 2
Len of bt 1
<- Exit coro 1
$ PURE_PYTHON=1 python /tmp/atest.py
In __eq__
  File "/tmp/atest.py", line 28, in <module>
    bt[Foo()] = 2
  File "//lib/python3.8/site-packages/BTrees/_base.py", line 819, in __setitem__
    self._set(self._to_key(key), self._to_value(value))
  File "//lib/python3.8/site-packages/BTrees/_base.py", line 949, in _set
    result = child._set(key, value, ifunset)
  File "//lib/python3.8/site-packages/BTrees/_base.py", line 350, in _set
    index = self._search(key)
  File "//lib/python3.8/site-packages/BTrees/_base.py", line 125, in _search
    if k is key or k == key:
  File "/tmp/atest.py", line 20, in __eq__
    import traceback; traceback.print_stack()
-> Enter coro 1
In coro 2
Len of bt 1
<- Exit coro 1

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-8-28 15:18 -0700:

...

I do not need the real code - just the relevant module/class structure.

In that case, I may be able to help. Can you share the full snippet of code that you're trying to execute (with the mock object hierarchy + copy + usage of dm.historical)? If I get that, I can probably fix it up and share it back with you if I can get it working.

The idea is to copy the (historical) last good tree into a current connection and then perform the deletion. If this deletion reproduces the problem, it can be debugged. Below is the trial code up to the copying. You can check whether an updated emulation is sufficient by executing the code outside your normal environment (to ensure that not the real agent code is used) and verify that there is no exception.

from pprint import pprint as pp
from ZODB.DB import DB
from ZODB.FileStorage import FileStorage
from transaction import commit

from types import ModuleType
m = ModuleType("agent.task.history")
m.task = m.history = m.defs = m
from sys import modules
modules[m.__name__] = modules["agent"] = modules["agent.task"] = modules["agent.task.defs"] = m
from persistent import Persistent
class TaskHistoryInfo(Persistent): pass

TaskHistoryInfo.__module__ = "agent.task.history"
m.TaskHistoryInfo = TaskHistoryInfo

class TaskHistoryData(Persistent): pass

TaskHistoryData.__module__ = "agent.task.history"
m.TaskHistoryData = TaskHistoryData

class TaskKind(object): pass

TaskKind.__module__ = "agent.task.history"
m.TaskKind = TaskKind

s=FileStorage("agent.db")
db=DB(s)
c=db.open()
t=c.root()['agent.task.history'].history
from dm.historical import *
th=generateBTreeHistory(t)
h = list(th); h.reverse()
good = h[5185]["obj"]; bad = h[5186]["obj"]

from tempfile import TemporaryFile
with TemporaryFile() as f:
    good._p_jar.exportFile(good._p_oid, f)
    f.seek(0)
    gc = c.importFile(f)

commit()

d-maurer commented 4 years ago

Jason Madden wrote at 2020-8-28 16:08 -0700:

AFAIK it's not possible to actually yield control from a coroutine without an await, and I am not aware of a way to use that keyword in a method such as __eq__, __lt__, etc. Certainly we aren't doing that.

Yeah, it wouldn't be directly done, it would probably be something buried under layers of function calls. This simple example shows control going to different coroutines while inside of an __eq__ method.

I do not think that anyone will run the event loop nested in a special method. As for "normal" (async) coroutines, they need to be specially compiled. I believe this excludes them from being called/awaited indirectly from a C library not specially equipped to support this (such as BTrees): it would mean that the C runtime stack would need to be saved and later restored.

d-maurer commented 4 years ago

Jason Madden wrote at 2020-8-28 14:23 -0700:

because there should be no coroutine switch during a single btree operation only between btree operations

That's not out of the realm of possibility, though, is it?

My current working hypothesis: _BTree_set hits an error between the bucket chain update and the children (aka data) update. The error causes the children update to be skipped. For some (unknown) reason, the error does not cause an exception.

The error might be special for the asynchronous execution environment (to explain why the problem has not popped up previously).

We know that Python suppresses some exceptions (e.g. exceptions in __del__); maybe our observation is related to exception suppression.
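
For illustration, the kind of suppression meant here (nothing to do with BTrees itself):

class Noisy:
    def __del__(self):
        raise RuntimeError("boom during destruction")

obj = Noisy()
del obj                  # "Exception ignored in: ..." goes to stderr; nothing propagates
print("still running")   # execution simply continues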

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-8-28 15:18 -0700:

... In that case, I may be able to help.

Rethinking, it may be better that you try to reproduce the error -- the big advantage: you can use the real code (no need for an emulation).

Following the code posted previously (without the (incomplete) emulation code), you copy the last good tree. Comparing the last good and first bad tree, you determine which key you must delete in the tree copy to (hopefully) reproduce the damage.

matthchr commented 4 years ago

I believe I've fixed up your emulation code. The main issue was that TaskKind is an enum.Enum.

Full code that ran without error for me:

from ZODB.DB import DB
from ZODB.FileStorage import FileStorage
from transaction import commit
import enum

from types import ModuleType
m = ModuleType("agent.task.history")
m.task = m.history = m.defs = m
from sys import modules
modules[m.__name__] = modules["agent"] = modules["agent.task"] = modules["agent.task.defs"] = m
from persistent import Persistent
class TaskHistoryInfo(Persistent): pass

TaskHistoryInfo.__module__ = "agent.task.history"
m.TaskHistoryInfo = TaskHistoryInfo

class TaskHistoryData(Persistent): pass

TaskHistoryData.__module__ = "agent.task.history"
m.TaskHistoryData = TaskHistoryData

class TaskKind(enum.Enum):
    A = 0
    B = 1
    C = 2
    D = 3

TaskKind.__module__ = "agent.task.history"
m.TaskKind = TaskKind

s=FileStorage("agent.db")
db=DB(s)
c=db.open()
t=c.root()['agent.task.history'].history
from dm.historical import *
th=generateBTreeHistory(t)
h = list(th); h.reverse()
good = h[5185]["obj"]; bad = h[5186]["obj"]

from tempfile import TemporaryFile
with TemporaryFile() as f:
    good._p_jar.exportFile(good._p_oid, f)
    f.seek(0)
    gc = c.importFile(f)

commit()

I tried investigating how you suggested (diffing the two trees to see what changed). I used the following function:

def compare_btrees(lhs, rhs):

    lhs_keys = {key for key in lhs}
    rhs_keys = {key for key in rhs}

    keys_only_in_lhs = lhs_keys - rhs_keys
    keys_only_in_rhs = rhs_keys - lhs_keys
    if any(keys_only_in_lhs):
        print('{} keys only in lhs ({})'.format(
            len(keys_only_in_lhs), keys_only_in_lhs))
    if any(keys_only_in_rhs):
        print('{} keys only in rhs ({})'.format(
            len(keys_only_in_rhs), keys_only_in_rhs))

From what I can tell, the 5185 -> 5186 BTree diff consisted of the following change:

1 keys only in lhs ({'testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-4 22F1E3E288A8B2FD$job-1$abs-2-job406589010-4-task-182$0:662512005'})

I tried doing the following which didn't seem to trigger any errors:

with TemporaryFile() as f:
    good._p_jar.exportFile(good._p_oid, f)
    f.seek(0)
    gc = c.importFile(f)
    gc._check()
    gc.pop('testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-4 22F1E3E288A8B2FD$job-1$abs-2-job406589010-4-task-182$0:662512005')
    gc._check()

I tried on my Linux box though - I will try on my Windows box too and see if there is any difference.

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-9-3 17:01 -0700:

... I tried doing the following which didn't seem to trigger any errors:

with TemporaryFile() as f:
   good._p_jar.exportFile(good._p_oid, f)
   f.seek(0)
   gc = c.importFile(f)
   gc._check()
   gc.pop('testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-4 22F1E3E288A8B2FD$job-1$abs-2-job406589010-4-task-182$0:662512005')
   gc._check()

I tried on my Linux box though - I will try on my Windows box too and see if there is any difference.

When you try on Windows, please use the real code (not the emulation): it might be that peculiarities of the real code trigger the error (those would likely be related to destructors).

If you cannot reproduce the problem, then the error does not depend only on the data; something dynamic must be involved as well -- maybe a gc run or a context switch (even though I do not see how the latter could happen).

matthchr commented 4 years ago

I tried on my Windows machine and actually I am getting an error here (I believe on the list(th))

th=generateBTreeHistory(t)
h = list(th); h.reverse()

The error is:

C:\work\venv-test\Scripts\python.exe C:/work/src/temp2.py
Traceback (most recent call last):
  File "C:\work\venv-test\lib\site-packages\dm\historical\__init__.py", line 60, in generateHistory
    if not history: raise StopIteration()
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/work/src/temp2.py", line 53, in <module>
    h = list(th); h.reverse()
  File "C:\work\venv-test\lib\site-packages\dm\historical\__init__.py", line 129, in generateBTreeHistory
    if a.next() is None: heappop(heap)
  File "C:\work\venv-test\lib\site-packages\dm\historical\__init__.py", line 100, in next
    try: value = next(self.gen)
RuntimeError: generator raised StopIteration

Any idea why that would be?

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-9-4 11:56 -0700:

I tried on my Windows machine and actually I am getting an error here (I believe on the list(th))

th=generateBTreeHistory(t)
h = list(th); h.reverse()

The error is:

C:\work\venv-test\Scripts\python.exe C:/work/src/temp2.py
Traceback (most recent call last):
 File "C:\work\venv-test\lib\site-packages\dm\historical\__init__.py", line 60, in generateHistory
   if not history: raise StopIteration()
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
 File "C:/work/src/temp2.py", line 53, in <module>
   h = list(th); h.reverse()
 File "C:\work\venv-test\lib\site-packages\dm\historical\__init__.py", line 129, in generateBTreeHistory
   if a.next() is None: heappop(heap)
 File "C:\work\venv-test\lib\site-packages\dm\historical\__init__.py", line 100, in next
   try: value = next(self.gen)
RuntimeError: generator raised StopIteration

Any idea why that would be?

An incompatibility of dm.historical with newer Python versions.

The error does not occur with Python 3.6, it occurs with Python 3.9 (and may have been introduced with a Python version in between).

I have released dm.historical==2.0.3 which fixes this incompatibility.

jamadden commented 4 years ago

Any idea why that would be?

An incompatibility of dm.historical with newer Python versions. The error does not occur with Python 3.6, it occurs with Python 3.9 (and may have been introduced with a Python version in between).

Python 3.7 adopted PEP 479 "meaning that StopIteration exceptions raised directly or indirectly in coroutines and generators are transformed into RuntimeError exceptions."
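
A tiny illustration of the change:

def gen():
    raise StopIteration()   # before Python 3.7 this silently ended the generator
    yield                   # the yield makes this a generator function

list(gen())                 # Python 3.7+: RuntimeError: generator raised StopIteration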

matthchr commented 4 years ago

@jamadden has it -- I was on 3.6 on my Linux but 3.7 on Windows.

Even with dm.historical==2.0.3 I am still getting an error with 3.7 - @d-maurer are you sure you've fixed it in 2.0.3?

Traceback (most recent call last):
  File "C:\work\venv\lib\site-packages\dm\historical\__init__.py", line 131, in generateBTreeHistory
    raise StopIteration()
StopIteration

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:/work/temp2.py", line 53, in <module>
    h = list(th); h.reverse()
RuntimeError: generator raised StopIteration

Process finished with exit code 1

I reverted to Python 3.6 and everything passed on Windows as well, so it looks like there is probably no reliable repro... That doesn't necessarily surprise me, as this issue has been really hard for us to pin down as well (our first thoughts were things like threading issues, etc.)...

Are there next steps that we could take to try to pin down what's going on here?

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-9-4 14:52 -0700:

@jamadden has it -- I was on 3.6 on my Linux but 3.7 on Windows.

Even with dm.historical==2.0.3 I am still getting an error with 3.7 - @d-maurer are you sure you've fixed it in 2.0.3?

There has been another occurrence of the same problem. Fixed in 2.0.4.

... I reverted to Python 3.6 and everything passed on Windows as well, so it looks like there is probably no reliable repro... That doesn't necessarily surprise me, as this issue has been really hard for us to pin down as well (our first thoughts were things like threading issues, etc.)...

Are there next steps that we could take to try to pin down what's going on here?

Did you run the Windows trial with the emulation or with your real code? If it was the emulation, you could retry with the real code. The hypothesis: maybe one of the destructors (they can be executed inside the BTrees C code) does something strange (e.g. accesses the tree currently under modification).

If this does not reproduce the problem, then it seems to be truly dynamic -- which makes the analysis far more difficult.

An analysis approach could come from the following thoughts: your bucket deletion (implemented by _BTree_set in BTreeTemplate.c) consists of two phases:

  1. removal from the bucket chain (line 917),
  2. removal from the (inner node) parent (line 962ff).

The broken tree analysis indicates that the first phase was successful but the second phase did not happen (or at least was not made persistent). This suggests that something unexpected happened between line 917 and line 962. That could be an error condition or the execution of unexpected application code triggered by the deletion of an object. The only things deleted between the respective lines are the empty bucket itself and the key (which in your case seems to be a string); neither should have a destructor in your case. That leaves the error condition. One could run the application inside a debugger, put a breakpoint on the error exit, and then try to reproduce the problem. When it occurs, hopefully the breakpoint is hit and one can (again hopefully) determine what caused the error.
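To make the two phases concrete, here is a toy model -- explicitly not the real BTrees code -- of the two structures that have to stay in sync: the linked chain of buckets used for iteration, and the parent node's child list used for lookups. If phase 1 succeeds but phase 2 is lost, the parent keeps referencing empty buckets, which is exactly the broken structure observed above.

class ToyBucket:
    def __init__(self, keys):
        self.keys = list(keys)
        self.next = None                      # bucket chain used by iteration

class ToyNode:
    def __init__(self, buckets):
        self.children = list(buckets)         # the parent's child list
        for a, b in zip(buckets, buckets[1:]):
            a.next = b                        # link the bucket chain

    def remove_empty(self, bucket):
        i = self.children.index(bucket)
        # phase 1: unlink the empty bucket from the chain
        # (the real code also handles the case where the first bucket empties)
        if i > 0:
            self.children[i - 1].next = bucket.next
        # phase 2: remove it from the parent; if this step is skipped or never
        # made persistent, a later walk over the children meets size-0 buckets
        del self.children[i]

In the real implementation both steps additionally have to be recorded as persistent changes, which is where the pure-Python code later turned out to go wrong.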
matthchr commented 4 years ago

Did you run the Windows trial with the emulation or with your real code?

I tried both -- our real code doesn't define any special destructors (but I tried it anyway). I also tried 3.7 with dm.historical 2.0.4 (just to confirm that the updated Python version wasn't somehow causing anything); that also produced no repro.

One could run the application inside a debugger and put a breakpoint on the error exit and then try to reproduce the problem. When it occurs, hopefully, the breakpoint is hit and one can (again hopefully) determine what caused the error.

I'm investigating two angles right now:

  1. Running the application under a debugger and placing some breakpoints when I get a repro.
  2. Coming up with a more minimal repro that can be shared.

For 1, is there a document or instructions on performing this process? It would have to be in Windows (the only environment I can get a repro). I am somewhat familiar with windbg, but have never debugged a native Python module loaded by Python before. How can I get the symbols for the native code - are they included in the .whl for Windows? (I don't think I saw them but maybe missed when I was looking before).

d-maurer commented 4 years ago

Matthew Christopher wrote at 2020-9-15 17:16 -0700:

... I'm investigating two angles right now:

  1. Running the application under a debugger and placing some breakpoints when I get a repro.
  2. Coming up with a more minimal repro that can be shared.

For 1, is there a document or instructions on performing this process?

This depends on the debugger used. I have used gdb in the past for similar purposes. For gdb, there is a set of commands that facilitates the analysis of Python objects from the C level (e.g. printing the Python stack trace, a Python object, etc.).

BTrees debugging is particularly difficult due to the extensive use of (C preprocessor) macros. I tackled this problem in the past by running the preprocessor standalone, removing the line information it introduces, and using the result as the new source file. This ensures that you can put a breakpoint at a precise place (your debugger might not need this kind of preprocessing).

It would have to be in Windows (the only environment I can get a repro). I am somewhat familiar with windbg, but have never debugged a native Python module loaded by Python before. How can I get the symbols for the native code - are they included in the .whl for Windows?

This is quite straight forward with gdb (I do not know windbg).

Even with optimizations active, Python is typically compiled to retain debugging symbols (should this not be the case, you need to compile Python yourself). Extensions, such as BTrees, are typically compiled with the same settings as Python itself. Thus, they, too, should have debugging symbols.

With gdb (under Linux), I could:

matthchr commented 3 years ago

We were able to get a minimal repro of this issue and determine why we were only seeing the issue on Windows. You can run the repro for yourself; I've posted the code on GitHub here. The problem also repros on Linux.

The key details (this is basically contained in the readme of that repo as well):

  1. We were only seeing the problem on Windows because we were mistakenly using the Python version of BTrees rather than the native version on Windows. The bug only happens in the Python version of BTrees.
  2. The repro happens on the latest version of BTrees (4.7.2).
  3. It takes quite a while to hit the repro (see readme.md for more details). Hopefully you can look at the pattern we're using and spot a key detail that may help you improve the speed at which the repro is hit (possibly even write a deterministic test that causes it?).

Because there is no issue with the native version of BTrees we've been able to work around this problem by ensuring that we're using the native version of BTrees.
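For anyone else hitting this: one way to check which implementation is in use (assuming the current BTrees layout, where the pure-Python classes are also exported with a Py suffix) is:

import BTrees.OOBTree as OOB

# When the C extension is loaded, OOBTree and OOBTreePy are distinct classes;
# when the pure-Python fallback is active (e.g. the extension failed to build,
# or PURE_PYTHON is set in the environment), OOBTree is an alias for OOBTreePy.
print("pure-Python BTrees in use:", OOB.OOBTree is OOB.OOBTreePy)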

d-maurer commented 3 years ago

Matthew Christopher wrote at 2020-11-6 10:37 -0800:

We were able to get a minimal repro of this issue and determine why we were only seeing the issue on Windows. You can run the repro for yourself; I've posted the code on GitHub here.

Congratulations!

d-maurer commented 3 years ago

I was able to isolate (and quickly reproduce) the problem. The code below depends on your agent.db.

from pprint import pprint as pp
from ZODB.DB import DB
from ZODB.FileStorage import FileStorage
import enum

# stand-in 'agent.task.history' module so the pickled application classes
# can be unpickled here without the real application code
from types import ModuleType
m = ModuleType("agent.task.history")
m.task = m.history = m.defs = m
from sys import modules
modules[m.__name__] = modules["agent"] = modules["agent.task"] = modules["agent.task.defs"] = m
from persistent import Persistent
class TaskHistoryInfo(Persistent): pass

TaskHistoryInfo.__module__ = "agent.task.history"
m.TaskHistoryInfo = TaskHistoryInfo

class TaskHistoryData(Persistent): pass

TaskHistoryData.__module__ = "agent.task.history"
m.TaskHistoryData = TaskHistoryData

class TaskKind(enum.Enum):
  A = 0
  B = 1
  C = 2
  D = 3

TaskKind.__module__ = "agent.task.history"
m.TaskKind = TaskKind

s=FileStorage("agent.db")
db=DB(s)
c=db.open()
t=c.root()['agent.task.history'].history
from dm.historical import *
from transaction import commit
from tempfile import TemporaryFile

# getObjectAt (dm.historical): retrieve historical revisions of the tree
# (judging by the names, one already broken and one still intact)
bad = getObjectAt(t, b'\x03\xda\x0bi 2Wf')
good = getObjectAt(t, b'\x03\xda\x0bi\x02\xa9\xce\xbb')
key = 'testaccount1batchtest18 22F1F921291938C1$abs-2-job406589010-4 22F1E3E288A8B2FD$job-1$abs-2-job406589010-4-task-182$0:662512005'
# copy the intact revision into the current database via ZODB export/import
with TemporaryFile() as f:
  good._p_jar.exportFile(good._p_oid, f)
  f.seek(0)
  gc = c.importFile(f)

c.root()["good"] = gc; commit()
gc._check()
del gc[key]; gc._check() # still good
commit()
gc._p_deactivate(); gc._check() # bad: ghosting forces a reload from the database, revealing the unpersisted change

This means that the problem is not caused by the deletion operation itself. Instead, the tree structure gets broken when it is stored and reloaded -- almost surely due to a missing _p_changed = True.

And indeed: there are 2 missing _p_changed = True:

--- _base.py~   2020-03-11 09:12:06.746867179 +0100
+++ _base.py    2020-11-06 23:56:57.799589569 +0100
@@ -100,6 +100,7 @@
     def _deleteNextBucket(self):
         next = self._next
         if next is not None:
+            self._p_changed = True
             self._next = next._next

     def _search(self, key):
@@ -1034,6 +1035,7 @@
                     self._firstbucket = child._next
                     removed_first_bucket = True
             del data[index]
+            self._p_changed = True

         return removed_first_bucket, value

With these additions, the problem disappears.

d-maurer commented 3 years ago

The first _p_changed is actually not necessary: Bucket derives from Persistent, so the plain attribute assignment self._next = next._next already marks the bucket as changed. The second one is needed because del data[index] mutates a list in place, which the persistence machinery cannot detect on its own.
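For context, a small illustration (again not the BTrees code itself, just the general rule for Persistent subclasses): plain attribute assignment goes through Persistent.__setattr__ and marks the object as changed automatically, whereas mutating a stored list in place is invisible to the persistence machinery and needs an explicit _p_changed = True:

from persistent import Persistent

class Example(Persistent):
    def __init__(self):
        self._next = None
        self.data = []

ex = Example()
# (assume ex has been added to a ZODB connection and committed)

ex._next = "other"      # attribute assignment: Persistent.__setattr__ sets
                        # ex._p_changed = True automatically

ex.data.append("x")     # in-place mutation of the list: Persistent does not
ex._p_changed = True    # notice it, so without this explicit line the change
                        # would be lost on the next reload -- the del data[index]
                        # case above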