zodb / relstorage

A backend for ZODB that stores pickles in a relational database.

zodbconvert: Optionally lose transaction history for performance speedup #425

Closed jamadden closed 3 years ago

jamadden commented 3 years ago

zodbconvert uses destination_storage.copyTransactionsFrom(source_storage). If the destination storage is a RelStorage, this means that it uses the IStorageIteration protocol to iterate across every transaction stored in the source, and replicates them into the destination using the full two-phase commit protocol. Something like this (pseudo-code):

for record in source.iterator():
    destination.tpc_begin(record.tid)
    for oid, data in record:
        destination.store(oid, data)
    destination.tpc_vote()
    destination.tpc_finish()

For many small transactions, running two-phase commit for each one adds a lot of overhead that shouldn't be necessary as no one should be writing to the destination storage.

If the transaction history isn't important, for example when the source or (especially) the destination is history-free, we could use the IStorageCurrentRecordIteration protocol to get just the current records in some arbitrary order. We could batch them into larger transactions and commit many objects at once. That should substantially speed up copies of large databases at the expense of losing exact TIDs. In history-free destinations, that's a minimal loss, I suspect.
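The batching idea can be sketched as follows. This is a minimal illustration, not RelStorage's implementation: `ToyDestination`, `batches`, and `copy_current_records` are hypothetical names, the toy storage keeps everything in a dict, and the simplified `tpc_*`/`store` signatures follow the pseudo-code above rather than the real ZODB storage API. Records are plain `(oid, data)` pairs; the records produced by the real protocol also carry a TID, which this sketch ignores.

```python
from itertools import islice


class ToyDestination:
    """Hypothetical in-memory stand-in for a history-free destination."""

    def __init__(self):
        self.current = {}   # oid -> latest state bytes
        self.pending = {}   # objects stored in the open transaction
        self.commits = 0    # number of completed two-phase commits

    def tpc_begin(self):
        self.pending = {}

    def store(self, oid, data):
        self.pending[oid] = data

    def tpc_vote(self):
        pass  # a real storage would check for conflicts here

    def tpc_finish(self):
        self.current.update(self.pending)
        self.pending = {}
        self.commits += 1


def batches(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def copy_current_records(records, destination, batch_size=100):
    """Copy (oid, data) pairs, committing once per batch of objects
    instead of once per source transaction."""
    copied = 0
    for batch in batches(records, batch_size):
        destination.tpc_begin()
        for oid, data in batch:
            destination.store(oid, data)
        destination.tpc_vote()
        destination.tpc_finish()
        copied += len(batch)
    return copied


# 1,050 current records copied in batches of 100 objects.
records = [(oid, b"state-%d" % oid) for oid in range(1050)]
dest = ToyDestination()
copied = copy_current_records(records, dest, batch_size=100)
print(copied, dest.commits)  # 1050 objects in 11 commits
```

The number of commits now depends only on the object count and the batch size, not on how many transactions originally wrote those objects in the source.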

I think this could possibly also be done with just IStorageIteration, but it's more complicated if the source is history-preserving and so is the destination.

We'd probably want to limit this to history-free destinations and storages that implement IStorageCurrentRecordIteration for now.

(Why not do the copy at the database table level if both storages are RelStorage? There are even pre-existing third-party tools for that, and that should have the least amount of overhead. The answer is two-fold: first, there are storage wrappers like zc.zlibstorage that want to apply a transformation to the state data and the source and destination may have different wrappers. Second, the source and destination may have different ideas about where to keep blobs, shared-on-disk vs cached-on-disk.)
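The first reason can be illustrated with a toy example. This is only a sketch under assumptions: plain `zlib` stands in for whatever transformation a wrapper such as zc.zlibstorage actually applies, the dicts stand in for database tables, and `copy_tables` and `copy_via_storage_api` are hypothetical helpers, not part of any library.

```python
import zlib


def identity(state):
    """Transformation used by a destination with no wrapper."""
    return state


def copy_tables(source_rows):
    """Table-level copy: the stored bytes are moved untouched."""
    return dict(source_rows)


def copy_via_storage_api(source_rows, untransform, transform):
    """Storage-level copy: decode with the source's wrapper, then
    re-encode with the destination's wrapper."""
    return {oid: transform(untransform(raw)) for oid, raw in source_rows.items()}


state = b"pickle bytes " * 50

# The source wrapper compresses state before it reaches the table.
source_rows = {1: zlib.compress(state)}

raw_copy = copy_tables(source_rows)
api_copy = copy_via_storage_api(source_rows, zlib.decompress, identity)

# A destination without a wrapper reads its rows as-is.
print(raw_copy[1] == state)  # False: it would see compressed bytes
print(api_copy[1] == state)  # True: the state round-trips correctly
```

Going through the storage API costs more per object, but it is the only path on which each side's wrapper gets a chance to run.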

jamadden commented 3 years ago

For copying into a history-free RelStorage, on one sample of transactions, the current code takes 1 minute to copy 14,000 transactions and around 185,000 objects, achieving a copy rate of 1.01 MB/s.

With initial modifications to make the history-free RelStorage destination batch objects (committing every 100 objects), that same one minute (from the same source data) copied 417,400 objects at a rate of 5.58 MB/s.

That's a speedup of roughly 2.3x by object count or 5.5x by throughput, depending on how you look at it. So there's definitely promise here 😄

jamadden commented 3 years ago

With the commit process streamlined, copying that same data into a history-free RelStorage destination now moves 742,924 objects in a minute, at a rate of 12.7 MB/s.
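The speedups implied by the three measurements in this thread can be worked out directly. The snippet below uses only the figures quoted in the comments; the baseline object count was given as "around 185,000", so the object-count ratios are approximate, and the stage labels are just descriptive names chosen here.

```python
# (objects copied in one minute, MB/s) for each stage reported above.
measurements = {
    "per-transaction commit": (185_000, 1.01),
    "batched (100 objects)": (417_400, 5.58),
    "streamlined commit": (742_924, 12.7),
}

base_objects, base_rate = measurements["per-transaction commit"]

speedups = {
    stage: (objects / base_objects, rate / base_rate)
    for stage, (objects, rate) in measurements.items()
}

for stage, (by_objects, by_rate) in speedups.items():
    print(f"{stage}: {by_objects:.2f}x objects/min, {by_rate:.2f}x MB/s")
```

The gap between the two ratios suggests that the average object size differed between the portions of the data each run managed to copy in its minute, so neither number alone tells the whole story.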