woocommerce / woocommerce

A customizable, open-source ecommerce platform built on WordPress. Build any commerce solution you can imagine.
https://woocommerce.com
9.41k stars 10.76k forks source link

Implement locking mechanism for data updates to work around race conditions #15780

Closed mikejolley closed 7 years ago

mikejolley commented 7 years ago

A locking mechanism would prevent 2 processes doing the same logic at the same time to data objects.

I'd propose a simple system along the lines of https://api.drupal.org/api/drupal/includes%21lock.inc/7.x We'd need:

Code implementing lock checking would first try to acquire a lock, and then wait for the lock to be available before running code.

e.g.

if ( $lock = WC_Lock::aquire( 'order_123' ) ) {
     // Get the order and do something here
     $lock->release();
}

Whilst a lock is being acquired, there may be a delay if already locked. It will only return once a lock can be made.

If code locks a process but fails to release, there will be an expires column in the database so it cannot lock for more than a certain period of time. This is a protective measure.

This would be implemented for:

So in the PayPal example, PDT or IPN code would do a lock first before running:

if ( $lock = WC_Lock::aquire(  'order_' . $order_id  ) ) {
     $order = wc_get_order( $order_id );
     $order->payment_complete();
     $lock->release();
}

If PDT came in at the same time as IPN, the 2nd request would wait for the first, and then once complete would do no update because it's already done.

@justinshreve @coderkevin @claudiosanches @claudiulodro Thoughts on this?

claudiulodro commented 7 years ago

I like this. Good idea checking Drupal to see how they handle this problem.

claudiosanches commented 7 years ago

I like this plan.

aeadvertising commented 7 years ago

ok this makes sense. looking forward to implementation so I can get all these stores fixed! :)

pmgarman commented 7 years ago

Before being merged tests should be done to see how much potential overhead this adds and documented somewhere. I can already see the edge case of a flash sale on a site even at a small scale but on poor hosting this potentially locking things up. We should know the impact of adding this so sites can plan appropriately.

mikejolley commented 7 years ago

@pmgarman note my example is specific to an order by ID. Ok that case only that ID, for an IPN process would lock?

justinshreve commented 7 years ago

I wanted to share these that I discovered the other day:

Obviously we wouldn't use the upgrader class for this, I just wanted to share these as an FYI since there is a little bit of prior art in WordPress as well. We'll still want our own class.

paplco commented 7 years ago

Following this thread as well. Began getting duplicate orders on July 6th. All have come through using WooCommerce Stripe Gateway 3.2.1. However, one interesting case where the first duplicate in a series of 4, was WooCommerce Amazon Pay Gateway 1.7.3. and the following 3 were Stripe. Thanks

zamson commented 7 years ago

Following this thread and glad it's getting attention in version 3.x and not 4.x as first suggested. My clients are getting duplicated orders since 3.x update and it's causing a lot of headache for everyone involved.

mountaincrawlie commented 7 years ago

Following - my site also has this issue - its a total nightmare as it keeps showing my stock as out of stock intermittently after some orders but not all so hard to keep track and a complete pain to have double check what I have or don't have available in stock in reality

diablodale commented 7 years ago

I agree the doubled PDT/IPNs are troublesome. I've had them for years but sell virtual products so don't have the stock issues. A solution would be great. So with good intentions...

A strong caution on this topic; implementation of locks infrastructure usually causes extreme levels of instability. Locks are usually implemented by core components (e.g. CPU and OS) because only they can be atomic. User applications like Woocommerce, Office, etc. do not have the necessary thread/process management capabilities to adequately implement locks.

A second source of instability is the usage of the implemented locks. In the original post above, @mikejolley writes

if ( $lock = WC_Lock::aquire(  'order_' . $order_id  ) ) {
     $order = wc_get_order( $order_id );
     $order->payment_complete();
     $lock->release();
}

Such an approach is troublesome. Why? Because it requires the caller of an API to know the internals of that API -- a bad thing. It requires the thousands of coders around the world to learn how to use these new locks and update all their code to use them around every call to ->payment_complete() -- a nightmare scenario.

Instead.... APIs should be black boxes; having no need to know how its implemented or needing locks. A more productive approach would be to just call the API: $order->payment_complete() It is the responsibility of the API coders to manage any access/locks internally. Only the API coders know that they need to acquire a lock for a given orderid. And as any API, the results need to be returned. For example in the class WC_Order in class-wc-order.php

public function payment_complete( $transaction_id = '' ) {
    try {
        if ( ! $this->get_id() ) {
            return false;
        }
        $lock = WC_Lock::acquire(  'order_' . $this->get_id() );
        if ( ! $lock ) {
            return false;
        }
        do_action( 'woocommerce_pre_payment_complete', $this->get_id() );
        ...

And with that approach, this means that the code currently in class-wc-gateway-paypal-response.php would look something like:

protected function payment_complete( $order, $txn_id = '', $note = '' ) {
    if ($order->payment_complete( $txn_id )) {
        $order->add_order_note( $note );
    }
    else {
        // do something when paypal has paid us, yet woo can't mark it paid
    }
}

This introduces multiple user-experience issues. What to do when any acquire lock fails/denied. Each HTTP call is atomic. So, for our example here, what is the code (and user experience) when Paypal has paid Woo via IPN or PDT yet at that moment Woo is not able to mark it as paid? We can't assume that it is one of these IPN/PDT doubles. For example, the admin could be cancelling the order using the admin UI at that same moment. The only thing we know is that an HTTP action led to $order->payment_complete() being called and that call returned false.

Finally, regarding the lock implementation, what kind of lock(s) are needed? For example:

pmgarman commented 7 years ago

I've been thinking about this a bit along with a few other things within WC, such as the email queuing which failed to work due to inconsistent hosting configurations that caused cron to just fail.

Rather than implementing a locking system what if we instead looked at a more proper queueing system that was abstracted and opt in for sites. The abstraction would allow for the queue to be run off a variety of systems (once the integrations were made) but a MySQL driven queue would likely be the most reliable queue to have out of the box - but extensions that add things like beanstalkd or redis or amazon sqs.

The real task here I believe would be to create a queue worker and figure out how to make that run reliably on generic hosting. Once that is figured out, then queues for various things could be created. Queues for inventory updates, emails, background processing as a whole for anything integrating into WC.

Inspiration: https://laravel.com/docs/5.4/queues

I use Laravels' queues often for heavy background processing and works really well. Nothing in all this seems to be needing to be done truly sequentially, we just need them to be run in an orderly fashion.

There is a chance here that someone then spins up multiple queue workers to process data and that could still create the same sort of scenario - but I think this scenario is ok because unlike the current scenario where it's just broken for some, people spinning up multiple queues they are shooting themselves in the foot and we can't stop that.

mikejolley commented 7 years ago

@diablodale

Looked at https://github.com/woocommerce/woocommerce/pull/15967? It locks in the payment complete method.

Should the locks use the Wordpress Transients API? This is a core WP API and has both native expiration and caching. The latter helps mitigate the performance overhead of locks.

Custom table to avoid filling wp_options and so we have full control of schema.

Are the locks write locks? Meaning given a lock, only one write lock can be obtained yet can grant one+ readers?

The locks in my PR are more virtual. No DB level locking is done.

@pmgarman Whilst I'm sure there are nice solutions like queues and , to solve this issue short term we just need to make sure concurrent requests cannot work on the same thing at the same time. I think the above PR can help with that.

The real task here I believe would be to create a queue worker and figure out how to make that run reliably on generic hosting.

Inevitably this leads to poor solutions like cron with it's own set of problems. e.g. if the queue was ran on concurrent requests without locking :P

diablodale commented 7 years ago

The implementation of locks in #15967 appears to me to be inadequate. This is alignment with no previous caution that locks need to be implemented by core OS level components to be successful.

For example, your new woocommerce_locks table has 'name' as the primary key and appears to be relying on the SQL engine to enforce key uniqueness. There exists a scenario that when two threads on two peer CPUs run at similar intervals that locks will not be accurately managed. It is the normal gotchas for implementing locks and best left to the volumes of doc on the internet.

Part of the core challenge is that to make adequate locks, one needs to have an atomic check/set. Looking at the current codebase, the set (implemented by the INSERT IGNORE) and related check (IF (! $lock_result)) are not together atomic. You might have more luck creating a more complex bundled transaction including the time/expire management. Still, these problems have long been discussed and solved -- google is the friend on this.

A trivial example is the two threads/processes above. They both run create_lock() at the same time with the same lock_name parameter. They both run the wpdb->query(prepare(INSERT IGNORE)) at exactly the same time. Then the SQL database chooses because it does have atomic OS support. Lets say that P1 got a "true" result from the database and P2 did not.

Next, the thread for P1 is paused for any number of reasons by apache, OS, CPU, etc. Yet, P2 continues and runs the next line of PHP code if ( ! $lock_result ) {.

P2 didn't get a 'true' from the db so it enters the big clause. P2 calls get_lock with the lock_name. That lock does exist because P1 was successful with the INSERT above. FYI, P1 thread is still paused. P2 runs the second if ( ! $lock_result ) {. $lock_result is non-false (because the lock with that name does exist) so it continues past that clause.

Next, P2 thread is paused by apache, os, cpu, etc. Both P1 and P2 are now paused. They are paused long enough that the expiration time passes.

Next, P2 is awakened from its pause (P1 is still paused). P2 runs if ( strtotime( $lock_result->expires ) > time() ) { and that is a false statement...it is expired. So P2 moves down and does the release_lock, then create_lock recursively. The boundless create_lock recursion worries me but lets ignore that for now and assume that in some deep recursion create_lock finally returns "true". And, at the top recursion this P2 thread returns TRUE as its result from create_lock(). This is bad, the lock architecture has failed. The result should be FALSE.

Yet its even worse...remember at the beginning it was P1 that was able to insert the row into the db. It should be P1 that returns TRUE. Our thread P1 is awakened from its slumber and does the first if ( ! $lock_result ) { test. Since it did get a row from the INSERT IGNORE, it skips the big clause and returns true. That means that both P1 and P2 have returned TRUE. The lock architecture has failed.

I do not have a solution for this. Only that the solution is non-trivial and best solved by core components who can guarantee atomic operation.

Next, feedback on the use of locks. I am unsure that this adequately addresses the pains people are experiencing. If the pain is PDT/IPN duplicates, then change the payment gateways so that duplicates are not possible. For example, changing the PayPal gateway to have a radio "either-or...not both" possibility for IPN and PDT. That will address the specific pain more quickly than lock architectures.

When using locks, I believe it important to consider what is being locked. For this scenario, is the lock on the order? On the stock? On the order and the stock? A new example is P1 is an admin in the admin UI. P2 is a customer making a paypal payment. At the same time:

Now we have a race condition for what the status will be, what the stock will be, etc. In that scenario, we need a lock on the order and the stock so that the state and stock can be correctly coordinated for each independent thread. Otherwise, due to slight differences in thread execution, it is possible for P1 to make the state cancelled and P2 to reduce the stock.

DavidAnderson684 commented 7 years ago

I lack the competence and time to parse everything in this thread. But I agree with @diablodale that the suggested code sample seems to involve the API user needing too much knowledge of WC's internals.

Instead of:

if ( $lock = WC_Lock::aquire(  'order_' . $order_id  ) ) {
     $order = wc_get_order( $order_id );
     $order->payment_complete();
     $lock->release();
}

Why not:

if ( $lock = WC_Order::acquire_lock($order_id) ) {
     $order = wc_get_order( $order_id );
     $order->payment_complete();
     $lock->release();
}

Under the hood, WC_Order and everything else can use WC_Lock, and WC_Lock still be available as an API for extensions to use and abuse in other ways.

diablodale commented 7 years ago

That will also cause considerable problems. The lock implementation is inadequate. Plus, the locks are not being used correctly. So its a double-negative. I write about it in detail here https://github.com/woocommerce/woocommerce/pull/15967#issuecomment-319040459

In short, the entire action must be atomic...a complete transaction and completely works or completely does-not-work.

Otherwise, what trivially occurs is a lock is acquired, life happens and the thread ends due to cpu, memory, plugin, action hook, database burp, many reasons. This results in the release never occuring leaving the lock still acquired for a [default] hour. Meanwhile, the order is in disarray with various fields set to values but not all fields. Nothing that touches that order will work for an hour. All actions touching that order will timeout, recurse until they are out of memory, etc. No Woocommerce or WP admin can fix this. Only by hand editing the new private WC-only "lock" table on the database itself can the lock be released and then whatever operation failed at the beginning...be restarted. With hope that is doesn't fail again or the whole hour freeze again occurs.

Meanwhile, the customer could be gone, PDT paypal has long stopped, leaving only IPN and its by-design retry to try again in some timeperiod.

Unfortunately, it is not a pretty picture for the current WC lock idea/implementation.

Using only IPN -or- only PDT, continues to be a ready today well tested fix for the IPN/PDT double email/stock issue.

DavidAnderson684 commented 7 years ago

@diablodale Just to clarify: I wasn't commenting on the wider picture, just on the elegance of that code fragment (I've not read the PR). I haven't analysed anything else. Though, since you mention an hour, that does sound very long. In UpdraftPlus, to try to prevent multiple backups due to WP's racey cron calling the action multiple times (or the browser repeating the HTTP request on a manual backup), we use a similar style lock (i.e. that relies on MySQL enforcing consistency), but it expires after 3 minutes. I think the largest interval between WP cron calling the same action multiple times that I've ever seen in a customer log was 11 minutes, but that was particularly due to cron's mis-design (all entries in a single database row, and easily overwritten with partially stale data by long-running tasks), and wouldn't be relevant to other things (like multiple partial refund requests on the same order due to a browser repeating the HTTP request, which is what sparked my interest in this issue in WC). Yes, PHP can die for almost any reason at any time - with >1 million installs (less than WC) I'd say that we see enough weird and wonderful logs in our support channel to be pretty confident that a blanket 1 hour lock expiry is going to cause plenty of pain. I'd suggest that it should vary on a case by case basis, depending on what the lock is for. e.g. On an incoming authorised refund request, something up to a minute. Similarly the PDT/IPN problem seems to be in that order of magnitude.

diablodale commented 7 years ago

Changing the deadlock timeout may reduce the pain of an inadequate lock infrastructure. As a general rule, doing that is a symptom of a core problem -- it is a poor approach to architect software based on guessing deadlock timeouts. It would wrap the IPN/PDT problem with a inadequate lock implementation used improperly.

Instead, solving the core problem is the approach to take. I've provided one solution for IPN/PDT. So far, no one is willing to hear it.

The idea of a lock infrastructure is a totally fine idea to explore long-term. It needs considerable time to implement so that a reliable lock infrastructure is made + the usage of that infrastructure throughout the codebase. For example, I can change the status of an order using the admin UI at the same time as an IPN/PDT payment is made. This will corrupt the stock levels and/or order data. Such concurrency problems are throughout the Woocommerce codebase and need patience and due diligence to resolve well.

Using try/catch will be needed when using a lock infrastructure. However, the current idea is to put the locks in a database table (significant performance hit). This database approach is problematic. Here is the code you wrote with a try/catch. However, it will also fail.

try
{
    $lock = WC_Lock::acquire( 'order_' . $order_id );
    $order = wc_get_order( $order_id );
    $order->payment_complete(); // responsible for its own rollback
} catch (Exception $err) {
    debugoutput( 'caught exception: ' . $err->getMessage() . "\n" );
} finally {
    if ($lock) $lock->release();
}

It is unproven that database access will be reliable after an exception. The test cases for that are boundless. The current implementation idea for locks requires a SQL database and $lock->release() needs that database. That means the above code can run, encounter a problem, throw an exception, the order rollback code may not work because the database may no longer be accessible due to php thread/stack/memory/network problem, the lock release call may not work because of the same problem. We still have not solved the problem of concurrent access to the order.

I still don't think the user of a Woo object needs to know locks. So the pseudo code should be more like

$order = wc_get_order( $order_id );
if ($order) $order->payment_complete();

The order object should internally manage its state and its concurrency. The implementation idea starts going this direction but it not complete. Putting try/catches and a rollback mechanism inside payment_complete() would continue a positive direction. However, that still does not address the core problem w/ the locks themselves.

mikejolley commented 7 years ago

I think we should stop solely thinking PDT/IPN because this is only a small example of where locking could be used. We're mostly trying to do something about concurrent requests to prevent things like stock being changed incorrectly. Whether or not disabling PDT/IPN is correct or not is not really relevant yet.

I'd encourage a quick look over the PR rather than the pseudo code in the original post because it's not identical @DavidAnderson684

Even with the queue idea from @pmgarman earlier, with the unreliable cron system in WP we'd still need a locking mechanism to prevent those queues being ran simutaniously.

Tracking locks is something I don't see how we can do differently. If not the database layer @diablodale then where? Your suggestion of transients earlier is no better; those use wp_options on most setups. At least in a custom table you can change indexes etc and control the structure fully. There are not more and no less writes to the database.

I agree 1 hour is too long for default release. This should probably be < 1 minute. "Deadlocks" would only affect the code using the same lock/acquire methods, so thats down to implementation.

diablodale commented 7 years ago

If a user-level lock solution was possible and I had the code for it, where would it be? My initial answer would be at some layer which is shared between the multiple php processes running WP and Woo. These processes may be running on different servers (e.g. 2 web + 1 db), so the layer needs to be able to span at least the 2 web servers.

The problem you are tackling is distributed locking. It is a difficult problem because not only do you have the decades old problem of locks within the same machine, but now you have cross-machine contention, network, timing differences, etc.

I would investigate technologies of distributed/shared memory and databases. I would look at memcache, redis, and atomic databases to see the pros/cons of each.

transients I suggested as an earlier starting point (not a recommendation that they alone solve the problem) because they have some qualities that you want like storing a value and a TTL. I also like that they are automatically cached by one or more layers (e.g. php within the current thread, APCU, Redis, memcache, etc.). Since this is a distributed need, apcu isn't a good direction. The core WP docs + redis and memcache need to be consulted to understand caveats for cross-webserver coherency.

I can predict that some may not want to require memcache or redis for locks; as that is a considerable change in requirements. That leads to a reduction of options to php, WP, and the database server on which WP runs. PHP won't help us because it isn't shared by all processes. That leaves then the database.

The databases approved by use for Wordpress are relational databases and support atomic transactions. With atomic transactions it is possible to have a high level of coherency. That is, it is possible to start a transaction, do work, and commit that transaction. With that simple sequence, a coordinated set of changes will occur -- all or nothing. Using the familiar IPN/PDT/Order/Stock scenario, we can (receive payment + mark paid + save paypal transaction id + change order state + reduce stock level) as an all or nothing transaction. Transactions like these need atomic locks. And those atomic locks are built into the start/commit transactions of the SQL language.

I have seen no code in the current implementation which will do the 5-part transaction above -- all or nothing. I have seen no code in the current implementation which reliably implements a lock architecture which could be used to partially simulate a loose transaction.

The proposed lock code has more writes to the database. Plainly seen in the codebase. The roundtrip costs are high. The below are the costs for a[c]quire. There are additional costs for releasing locks.

I've given extremely specific feedback (theory, style, and specific lines of code) on the proposed lock implementation and how Woo uses the lock class. I have yet to hear from anyone on the Woo team that they understand, don't understand, or acknowledge the bugs. Instead, it is strange silence. I can only repeatedly warn of the problems and be prepared to stub-out all lock calls for Woo code that I distribute to my store.

I am seeing behavior (e.g. commits and merges) that suggest there is some desire to quickly publish a new version of Woo with this. I continue to suggest solving any specific immediate pains with a reliable short-term solution while you investigate long-term if you can make Woo coherent/concurrent access.

Please see my try/catch example directly above which begin a path of better usage of locks. Rather easy fixes yet demonstrate some of the gaps in the code. I would also add constructors/destructors on the lock class to manage releasing locks and probably a[c]quiring them. The former is to reduce unclosed locks due to sloppy plugin coders + trying to catch as many error cases as possible at the user-level (vs kernel-level).

Some related links: http://www.onlamp.com/pub/a/php/2003/12/18/transaction_pitfalls.html https://www.sitepoint.com/mysql-transactions-php-emulation/ http://www.mysqltutorial.org/php-mysql-transaction/

mikejolley commented 7 years ago

No one said that PR was finished, and nothing related to this has been merged or committed to master FYI. "Strange silence" = busy on other tasks and travelling. Thanks.

diablodale commented 7 years ago

Good to hear. I'll take it to mean that the code and idea are in flux.

Another problem with the current implementation is the time(). Time needs to be controlled by the coordination master; in the current implementation it is the database. Only the database has a single clock for which all times and expirations can be created and evaluated.

https://github.com/woocommerce/woocommerce/blob/afab2d232dd2d469320bc72583024ebc642311ae/includes/class-wc-database-locks.php#L61 is not reliable code. Take the example of two web servers using the same database. The two web server clocks are not the same. It is very common for webservers to drift many minutes apart. Lets say Web2 is two minutes later than Web1.

  1. Web1 a[c]quires a lock with a one minute expiration.
  2. Web2 a[c]quires the same lock with a one minute expiration. It immediately gets the lock because its time() is two minutes later; the lock has already expired from its perspective.
  3. Web1 and Web2 corrupt the shared data.

To fix this specific issue, one must use a SQL transaction for the entire create/a[c]quire lock code. Only on the database is there a coordinated clock to correctly set/evaluate times. Only on the database are atomic transactions.

mikejolley commented 7 years ago

Closed in favour of a queue https://github.com/woocommerce/woocommerce/issues/16971