zodb / relstorage

A backend for ZODB that stores pickles in a relational database.
Other
54 stars 46 forks source link

RelStorage.iterator() is non-transactional; bad things happen with stop= on HF DBs #344

Closed jamadden closed 4 years ago

jamadden commented 5 years ago

Each time it is called, it opens a new database connection to retrieve data. Two iterators opened back to back may thus produce very different data.

For both history-free and history-preserving databases, the second iterator can include transactions not in the first one.

In a history-preserving database, absent a pack, that's fine: the second iterator will be a strict superset of the first one.

But in a history-free database, when an object is changed, it is effectively deleted from its old transaction; its new state is only visible as a record in its new transaction.

That's also fine: the second iterator will include that new state (eventually).

Where it's not fine is when the stop parameter is used. HF RelStorage is only fully consistent as-of the most recent committed transaction visible to any given connection.

Suppose the first iterator is opened and returns transactions 1, 2, 3, 4, 5, where 5 is now the highest visible transaction at the time it is opened. In transaction 5, the state of object A references object C.

Open a second iterator, which lets the view of the database move forward. This time, there's a new transaction, 6: the state of A has changed, but still contains a reference to C. If you pass stop=5, though, you won't see any state for A because it's only found in transaction 6. If A was the only reference to object C, object C now appears to be garbage.

This comes up in zc.zodbdgc, which does exactly that sequence by default, or if you pass --days <non-zero>. If you pass --days 0, it only uses one iterator, but it still passes an arbitrary stop parameter, making that unsafe as well (it tries to use the current time as the value to stop at, in order to view all committed transactions consistently as-of this date, but that's insufficient and there could be committed transactions more recent than that; consider clock differences among machines).

This causes data loss.

I propose that in history-free mode, a storage should use its current load connection for its iterators, making each iterator equivalent to any other one until storage.sync() is called.