Implemented persistentmapping conflict resolution

azmeuk commented 4 years ago

This implements a simple conflict resolution strategy for PersistentMapping.

jamadden commented 4 years ago

In general I don't think this is possible. PersistentMapping objects may have persistent objects as keys (and values) and during conflict resolution it's not possible to hash those, as I'm sure you remember.

Even if that was changed, the way that PersistentReference hashes (and compares) will be very different from the way the actual objects hash and compare, so the results cannot be directly correlated. That is, a set() of PersistentReference objects might be very different from a set() of the actual objects. Two objects with different OIDs might hash and compare themselves equal, but two PersistentReferences representing the same objects would not.

Imagine two transactions, each adding a persistent object (whether key or value) to a mapping. The objects themselves compare and hash equally, so they should be merged. But because they get different OIDs by the time conflict resolution kicks in, it's impossible for the mapping to know that they are equivalent. Best case scenario, the pickle writes out two "identical" objects referring to two different OIDs and on unpicking, whichever one was listed last in the pickle is the one that gets used. But garbage collection and the like will still see both OIDs as referenced. (Worst case scenario, the dict winds up badly corrupted.)

azmeuk commented 4 years ago

Indeed, this implementation will not solve conflicts if persistent objects are used as a mapping key (although there should not be any problem with persistent objects as values, as no hashing is involved for values, only comparisons).

Though, is this a problem? This patch still brings conflict resolution in some cases (when persistent objects are not used as keys), and should keep the current behavior for others (when they are not).

jamadden commented 4 years ago

Though, is this a problem?

I think so, for a few reasons.

First, persistent objects as keys will produce incorrect results on Python 2 or (on Python 3) if something like https://github.com/zopefoundation/ZODB/pull/278 is ever merged. We're relying on the unhashability of PersistentReference to protect us by raising TypeError in that case; if for any reason it doesn't (as on Python 2 or with that PR) then this code is silently broken in ways which will lose data. That's not good.

Second, there's still a problem with persistent values. Even though they aren't hashed, they are checked for equality (com.get(k) == old.get(k):). Unless they represent the identical object, that will raise an exception, meaning we've potentially done a lot of work getting to that point for no reason. Worse, the exception raised will be logged as an error with a traceback, likely alarming and confusing many users. With the class not even implementing _p_resolveConflicts, ZODB can take a fast shortcut and avoid all that work.

I also don't understand why there's special handling for dicts. Why not special handling for lists, or sets or any other mutable data structure? This seems like a hard-to-explain arbitrary rule that's expensive to implement. It also seems like something that applications may very well not want. Suppose my values are dicts that come in from clients sending JSON: they should be atomic, all-or-nothing updates. I never want to try to merge them.

Basically, I'm concerned about the corner cases and potential for breakage and performance loss in a core data structure.

I do understand the desire, though! Sometimes its a shame to redo a transaction because some tiny little dict somewhere changed by one key. Perhaps you could publish a new data structure that's carefully documented and optimized for this conflict-resolving case, e.g.,

No persistent keys or values (or sub-components thereof because the hashing and equality constraints apply to anything used in __hash__ or __eq__ of keys or values…).
Merges dictionaries, etc

I'm not sure that it should belong here, though, because the policies around merging sub-datastructures are likely to be application specific.

d-maurer commented 4 years ago

Jason Madden wrote at 2020-9-18 09:04 -0700:

In general I don't think this is possible. PersistentMapping objects may have persistent objects as keys (and values) and during conflict resolution it's not possible to hash those, as I'm sure you remember.

Persistent objects as keys alone do not exclude conflict resolution -- example the Bucket conflict resolution in the BTrees package. This strategy is based on the fact that each of the concurrent activities has put the keys at the correct place in the data structure (using the keys native ordering). If the conflicting keys are at different places, then the conflict may be resolved without the need to compare the keys themselves.

In principle, the same could apply to PersistentMapping as well if the conflict resolution had access to the storage cells used for the mapping. In this case, conflicting keys stored in different cells could be resolved without the need to know their hash value -- as they are in the correct cells.

The current implementation does not give access to the storage cells. Thus, for the moment no conflict resolution for PersistentMapping involving persistent keys. I even believe that any persistent key in a PersistentMapping can render conflict resolution unreliable, even if this key is not involved in a conflict. Because adding/deleting may cause the underlying storage structure to be restructured and in this case, the hash recomputed. This would give wrong results as a PersistentReference hash is different from that of the corresponding persistent object.

d-maurer commented 4 years ago

Éloi Rivard wrote at 2020-9-18 09:25 -0700:

... Though, is this a problem? This patch still brings conflict resolution in some cases (when persistent objects are not used as keys), and should keep the current behavior for others (when they are not).

Conflict resolution heavily depends on the applications. Adding conflict resolution later on may break applictions which formerly worked correctly. Thus, adding conflict resolution is not necessarily good.

Example is again the Bucket conflict resolution in the BTrees package. An important application thereof is ZCatalog. This application uses code of the form if not tree: ...persistent_modify... This use disallows the resolution of a conflict where one the the conflicting parties has emptied a tree/bucket because this would make the test not tree unsafe: the conflict resolution could make the tree unempty but cannot undo the "persistent_modification".

Keep in mind that any conflict resolution is potentially dangerous. Whether a given strategy is acceptable heavily depends on the applications of the data structure.

zopefoundation / persistent

Implemented persistentmapping conflict resolution #151