Open perrinjerome opened 8 months ago
I'm not sure what should be the behavior if restore
is called with a prev_txn
which does not contain oid.
The docs are :
If prev_txn
does not exist, then it can be ignored. This was not the case so I made a first commit to change this.
But if prev_txn
exists but does not contain oid
it might be an error case, in this version it's also ignored.
Jérome Perrin wrote at 2024-2-14 07:20 -0800:
I'm not sure what should be the behavior if
restore
is called with aprev_txn
which does not contain oid.
I have looked at the code:
The only purpose of prev_txn
is to determine prev_pos
.
If this succeeds, we know that we find data at prev_pos
and therefore we can write a backpointer to prev_pos
instead
of writing data ourself.
This means, prev_txn
is there only as an optimization.
If it does not contain the correct data, prep_pos
will be 0
and the optimazation will not be taken.
Jérome, thanks for doing this PR and adding tests for +8 and other problems.
I agree that prev_txn semantically is just a hint for copy detection on storage side. It is optimization that, unfortunately, is exposed as public API. An UNDO is the same store operation, but usually it stores content that is already there in the database. Git detects this duplication via sha1 - all internally. But in ZODB world the API was setup to help storages handle copies. In simple cases that helps, but generally I think it hurts and constraints storages unneccessaryly.
Anyway about this:
But if prev_txn exists but does not contain oid it might be an error case, in this version it's also ignored.
I agree it is an inconsistency if prev_txn exists but does not contain an oid. I think it is better report an error here, or at least log a warning. But given that frequently nobody read logs raising an error might be better. I say this because this inconsistency indicate a really potential corruption of data, and such events should not pass by silently.
For the record ZODB/go exposes the hint as
https://lab.nexedi.com/kirr/neo/-/blob/f7776fc1/go/zodb/zodb.go#L249-258
and it exposes it only because this prev_txn is already semantically there in IStorage and tools like zodb dump
need it to be able to dump storage with preserving full information, so that bit-to-bit zodb restore
is possible:
https://lab.nexedi.com/nexedi/zodbtools/-/blob/513c296c/zodbtools/zodbdump.py#L49 https://lab.nexedi.com/nexedi/zodbtools/-/blob/513c296c/zodbtools/zodbdump.py#L144-146 https://lab.nexedi.com/nexedi/zodbtools/-/blob/513c296c/zodbtools/zodbdump.py#L541-549 https://lab.nexedi.com/nexedi/zodbtools/-/blob/513c296c/zodbtools/zodbdump.py#L425-432 https://lab.nexedi.com/nexedi/zodbtools/-/blob/513c296c/zodbtools/zodbcommit.py#L91-100 https://lab.nexedi.com/nexedi/zodbtools/-/blob/513c296c/zodbtools/zodbcommit.py#L120-121 https://lab.nexedi.com/kirr/neo/-/blob/f7776fc1/go/zodb/zodbtools/dump.go#L102-103
Kirill
Thank you for the feedback and sorry for being slow to reply here.
That was also my feeling that restoring a transaction with a prev_txn
not containing the referenced oid
is not supposed to happen and could be changed into a log, warning or error. I studied this more and it's a bit more complex.
I tried to change into an error and there is a legitimate case here in https://github.com/zopefoundation/ZODB/blob/6f8878124fa8bf1cc1932a8f67da8d15b653897c/src/ZODB/tests/RecoveryStorage.py#L41-L69
If this becomes an error, that test fails with an error like this:
Error in test testRestoreAcrossPack (ZODB.tests.testFileStorage.FileStorageHexRecoveryTest)
Traceback (most recent call last):
File "/usr/local/python/3.10.13/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/usr/local/python/3.10.13/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/usr/local/python/3.10.13/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/workspaces/ZODB/src/ZODB/tests/RecoveryStorage.py", line 66, in testRestoreAcrossPack
self._dst.restore(r.oid, r.tid, r.data, '', r.data_txn,
File "/workspaces/ZODB/src/ZODB/tests/hexstorage.py", line 82, in restore
return self.base.restore(
File "/workspaces/ZODB/src/ZODB/FileStorage/FileStorage.py", line 738, in restore
panic(
File "/workspaces/ZODB/src/ZODB/FileStorage/FileStorage.py", line 93, in panic
raise CorruptedTransactionError(message % data)
ZODB.FileStorage.FileStorage.CorruptedTransactionError: oid 0x00 not found in prev_txn 0x03f6d00826efb611
What this test does is:
prev_txn
) in "Dest"and this fails, because the prev_txn in "Dest" exists, but is a "p" (pack) transaction without records.
I honestly don't fully understand what the test is doing with "copy the final transaction manually" and not using a higher level API, but this sounds like "this is more complex than just raising". Maybe this could be made a little bit more clever, tolerating the case where oid is not found because the transaction was packed, but for now, I believe we can merge this without trying to change to an error (or log or warning) - and then https://github.com/zopefoundation/ZODB/pull/395 .
Do you agree with this ? I will use "request review" and if you agree I'll apply the commits.
Thanks again for feedback.
( edit, I fixed the traceback, I used an old traceback different from the zodb dumps ) ( edit again: I feel I have misunderstood what's happening in this test, but "it's not so easy to just raise, let's handle it separately and merge this already" remains )
Jérome Perrin wrote at 2024-2-26 22:36 -0800:
... I believe we can merge this without trying to change to an error (or log or warning) - and then https://github.com/zopefoundation/ZODB/pull/395 .
I agree.
Jérome, Dieter, thanks for feedback.
Personally I would still carefully reconsider turning restore
inconsistencies into an error, but if you both think it is ok to go as it is please go ahead. I will be mostly away for the next 3 weeks and do not want to block progress. If you both think the patch is ok in its current form I trust you to make reasonable judgement even though personally I would try to do it differently.
Kirill
Kirill Smelkov wrote at 2024-3-4 00:25 -0800:
Personally I would still carefully reconsider turning
restore
inconsistencies into an error, but if you both think it is ok to go as it is please go ahead. I will be mostly away for the next 3 weeks and do not want to block progress. If you both think the patch is ok in its current form I trust you to make reasonable judgement even though personally I would try to do it differently.
As you know, I am less interested in potential problems with
low probability (than you). In may well be that a restore
inconsistency
poses problems within some decades; I would then act -- with
a concrete case.
Unless you come with such a concrete case, I would not change the current behavior - which apparently is check by a test case (and thus likely carefully considered).
Dieter, thanks for feedback.
The difference in between us is how we perceive importance of the bug here. For me a "data corruption" is the worst that could happen to a database. For an external observer data corruptions are usually hard to detect at the time when they happen, but they introduce a real change to the data. Some of such changes could be innocent, for example if we count a volume of some gas and it will be 17.3 litres instead of 17.2 that's quite tolerable and ok. Some of the changes are not so ok, for example if it will be 123 EUR in total after an exchange transaction instead of correct value of 321000 EUR. And some of the changes could really impact lives: for example if it will be the dose of radiation to emit, and due to corruption it will be 1000 Rentgens instead of 0.1 Rentgen, it makes a huge impact of the person in subject. There is no way to save life of that person with post-mortem analysis.
So when it comes to databases, I try to make dedicated attention to correctness first.
Still, as I said above, if both you and Jérome think it is ok to go as it is now, I trust you to make a reasonable choice and judgement.
Kirill
Kirill Smelkov wrote at 2024-3-4 01:21 -0800:
... So when it comes to databases, I try to make dedicated attention to correctness first.
Still, as I said above, if both you and Jérome think it is ok to go as it is now, I trust you to make a reasonable choice and judgement.
I read: you will come again in a few weeks with a concrete
case to reproduce the restore
inconsistency.
Then, we will look again (in a separate PR).
Initially, this pull request was about extracting the +8 fix from https://github.com/zopefoundation/ZODB/pull/395 . I added testRestorePrevTxnWithoutOid , which is a scenario where the +8 bug would appear (because I don't like changing code without having a test corresponding to the change) but what does not seems good now is that this test would define the behavior for a case where the behavior was not clearly defined: what should happen when restore is called with a prev_txn
which does not contain the oid
.
Before all this, it is an undefined behavior, the data of the next transaction is read as a data header, which very likely causes an unexpected error to be raised. If the transaction is that last in the file, there is no error.
In the current version of the patch, it is tolerated and ignored.
From the discussions, @navytux suggested that this should cause a proper error.
I tried to change into an error, this caused a test to fail. The test is using low level APIs, so maybe it's just the test doing something wrong, I did not try to understand more.
Now I don't feel it would be good to merge testRestorePrevTxnWithoutOid as it is, because this would somehow be a specification of how FileStorage should behave when restore is called with such a transaction, but at this point I feel we don't really know what should be the behavior. The next step for this seems to be studying that failing test. I don't really have time to start doing this now.
https://github.com/zopefoundation/ZODB/pull/395 can be merged without all this, everybody approved so let's merge https://github.com/zopefoundation/ZODB/pull/395 already.
There was another commit in this pull request 7bef1b6e0ddc7bdfe4c1e680d099e8203f95956f , I think there was no controversy for that one.
See discussions from https://github.com/zopefoundation/ZODB/pull/395#discussion_r1474127300 about removing
+ 8
fromtend