zodb / relstorage

A backend for ZODB that stores pickles in a relational database.
Other
53 stars 46 forks source link

Revert #469 and return to taking shared locks first #471

Closed jamadden closed 3 years ago

jamadden commented 3 years ago

Stress testing a large busy Zope application indicated that performance did not improve and was actually slightly worse this way.

The following is part of an internal discussion on why that could have been.

In the test run, there were only 8 UnableToLockRowsToModifyError recorded as triggering retries, vs 66 UnableToLockRowsToReadCurrentError. In at least one case, a transaction got both, first Modify then ReadCurrent on the retry.

In the other thread about BTree sizes, we ran through an analysis of all the reported ReadCurrent errors and came to the conclusion that they're all due to modifications to the BTrees involved in the IntId utility.

The hypothesis goes that many transactions wind up waiting on an exclusive lock for one of those objects, and only after waiting for some period do they discover the ReadCurrent failure and doom the transaction.

If they check for the ReadCurrent failure first, and either find one or deadlock, they doom the transaction quickly without waiting and hopefully a retry succeeds.

I think what we didn't pay close enough attention to in the original discussion about changing the lock order is that any time there was a deadlock, it always really represented a doomed transaction --- there would be no way to recover without retrying anyway. The hope was that (1) instead of letting the RDBMS kill an arbitrary transaction, we would be in control of which transaction died, thus letting the one with the most invested already proceed and commit; and (2) maybe we could do more conflict resolution. (1) didn't pan out because it turns out that if two or more processes are sitting waiting on locks and the RDBMS kills one of them, neither one really had that much of an advantage over the other --- they were basically both in the same place. (2) didn't pan out because one of the transactions was doomed anyway.

This also fixes two bugs in the PostgreSQL implementation: