woocommerce / action-scheduler

A scalable, traceable job queue for background processing large queues of tasks in WordPress. Specifically designed for distribution in WordPress plugins (and themes) - no server access required.
https://actionscheduler.org
GNU General Public License v3.0
641 stars 117 forks source link

Diagnose intermittent deadlocks #530

Closed rrennick closed 3 years ago

rrennick commented 4 years ago

Continuing discussion from #438 on intermittent deadlocks:

WordPress database error Deadlock found when trying to get lock; try restarting transaction for query UPDATE wp_actionscheduler_actions SET attempts = attempts+1, status='in-progress', last_attempt_gmt = '2020-04-20 17:00:02', last_attempt_local = '2020-04-20 18:00:02' WHERE action_id = 1877421 made by do_action('wp_ajax_as_async_request_queue_runner'), WP_Hook->do_action, WP_Hook->apply_filters, WP_Async_Request->maybe_handle, ActionScheduler_AsyncRequest_QueueRunner->handle, do_action('action_scheduler_run_queue'), WP_Hook->do_action, WP_Hook->apply_filters, ActionScheduler_QueueRunner->run, ActionScheduler_QueueRunner->do_batch, ActionScheduler_Abstract_QueueRunner->process_action, ActionScheduler_DBStore->log_execution, QM_DB->query

and

WordPress database error Deadlock found when trying to get lock; try restarting transaction for query UPDATE wp_actionscheduler_actions SET attempts = attempts+1, status='in-progress', last_attempt_gmt = '2020-04-20 17:02:08', last_attempt_local = '2020-04-20 18:02:08' WHERE action_id = 1877473 made by do_action_ref_array('action_scheduler_run_queue'), WP_Hook->do_action, WP_Hook->apply_filters, ActionScheduler_QueueRunner->run, ActionScheduler_QueueRunner->do_batch, ActionScheduler_Abstract_QueueRunner->process_action, ActionScheduler_DBStore->log_execution
glagonikas commented 4 years ago
------------------------
LATEST DETECTED DEADLOCK
------------------------
2020-04-21 14:44:56 0x7f121c442700
* (1) TRANSACTION:
TRANSACTION 269973366, ACTIVE 4 sec updating or deleting
mysql tables in use 1, locked 1
LOCK WAIT 4 lock struct(s), heap size 1136, 3 row lock(s), undo log entries 1
MySQL thread id 2099893, OS thread handle 139742331373312, query id 218071551 xxx.xxx.xxx.xxx yyyyy Updating
UPDATE wp_actionscheduler_actions SET attempts = attempts+1, status='in-progress', last_attempt_gmt = '2020-04-21 13:44:52', last_attempt_local = '2020-04-21 14:44:52' WHERE action_id = 1890552
* (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 3524 page no 2736 n bits 672 index status of table `yyyyy`.`wp_actionscheduler_actions` trx id 269973366 lock_mode X insert intention waiting
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

* (2) TRANSACTION:
TRANSACTION 269971034, ACTIVE 202 sec fetching rows
mysql tables in use 3, locked 3
3310 lock struct(s), heap size 385232, 339898 row lock(s)
MySQL thread id 2099912, OS thread handle 139715760367360, query id 218027094 xxx.xxx.xxx.xxx yyyyy Init for update
UPDATE wp_actionscheduler_actions SET claim_id=183806, last_attempt_gmt='2020-04-21 13:41:34', last_attempt_local='2020-04-21 14:41:34' WHERE claim_id = 0 AND scheduled_date_gmt <= '2020-04-21 13:41:34' AND status='pending' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 100
* (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 3524 page no 2736 n bits 672 index status of table `yyyyy`.`wp_actionscheduler_actions` trx id 269971034 lock_mode X
Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0
 0: len 8; hex 73757072656d756d; asc supremum;;

* (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 3524 page no 2737 n bits 776 index status of table `yyyyy`.`wp_actionscheduler_actions` trx id 269971034 lock_mode X waiting
Record lock, heap no 665 PHYSICAL RECORD: n_fields 2; compact format; info bits 32
 0: len 7; hex 70656e64696e67; asc pending;;
 1: len 8; hex 00000000001cd8f8; asc         ;;

* WE ROLL BACK TRANSACTION (1)
------------
TRANSACTIONS
------------
Trx id counter 269973434
Purge done for trx's n:o < 269973434 undo n:o < 0 state: running but idle
History list length 0
LIST OF TRANSACTIONS FOR EACH SESSION:
---TRANSACTION 421217314710296, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 421217314582616, not started
0 lock struct(s), heap size 1136, 0 row lock(s)
---TRANSACTION 269973429, ACTIVE 2 sec starting index read
mysql tables in use 3, locked 3
LOCK WAIT 2 lock struct(s), heap size 1136, 1 row lock(s)
MySQL thread id 2100342, OS thread handle 139715226101504, query id 218072430 xxx.xxx.xxx.xxx yyyyy Init for update
UPDATE wp_actionscheduler_actions SET claim_id=183868, last_attempt_gmt='2020-04-21 13:44:57', last_attempt_local='2020-04-21 14:44:57' WHERE claim_id = 0 AND scheduled_date_gmt <= '2020-04-21 13:44:57' AND status='pending' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 100
------- TRX HAS BEEN WAITING 2 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 3524 page no 3281 n bits 824 index claim_id of table `yyyyy`.`wp_actionscheduler_actions` trx id 269973429 lock_mode X waiting
Record lock, heap no 2 PHYSICAL RECORD: n_fields 2; compact format; info bits 0
 0: len 8; hex 0000000000000000; asc         ;;
 1: len 8; hex 00000000001975ed; asc       u ;;
glagonikas commented 4 years ago

As mentioned on the other thread, i think the code is trying to log the execution (log_execution), before it's done claiming all jobs, so one update query bumps onto the other.

If that's the case, making the claim query faster (or smaller) or delaying log_execution might help?

According to NewRelic, the following query is quite time-consuming taking 5-10s to complete

UPDATE wp_actionscheduler_actions SET claim_id=?, last_attempt_gmt=?, last_attempt_local=? WHERE claim_id = ? AND scheduled_date_gmt <= ? AND status=? ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT ?
thenbrent commented 4 years ago

See also #457 for prior deadlock issues (that may be related).

rrennick commented 4 years ago

the following query is quite time-consuming taking 5-10s to complete

@glagonikas Thanks for the detailed info. That does look like something is blocking the query. Do you have actions that might take that amount of time to execute?

glagonikas commented 4 years ago

We don't have any actions that would fire on that stack trace, but we do have a large and busy site. We use Redis and everything is optimised, but running loads of update queries consequitively is probably is not ideal as they are not cached.

Can you direct me to the place where the code claims 100 actions and confirm log_execution is where each claimed action runs? I can then try to debug locally and see if there is any room for improvement

rrennick commented 4 years ago

Can you direct me to the place where the code claims 100 actions and confirm log_execution is where each claimed action runs?

The default batch size is 25 for Cron or Async - https://github.com/woocommerce/action-scheduler/blob/master/classes/ActionScheduler_QueueRunner.php#L131

The claim / run process starts at https://github.com/woocommerce/action-scheduler/blob/master/classes/ActionScheduler_QueueRunner.php#L154

rrennick commented 4 years ago

@glagonikas You could try lowering the batch size in your cron job to 20 or similar. CLI will still process all the pending actions, just in smaller batches. Make the change at a time when you'll be able to monitor and revert if it increases the number of deadlocks. https://github.com/woocommerce/action-scheduler/blob/master/classes/WP_CLI/ActionScheduler_WPCLI_Scheduler_command.php#L13

rrennick commented 4 years ago

A followup thought is that if your cron job is scheduled to run every minute then it could be multiple instances of the cron job competing. Cron in Linux spawns the task on the schedule regardless of the disposition of the previous iteration of the task. Here's a demo script I wrote and scheduled to run every minute. Log shows the first 190 seconds:

$ cat cron-test.sh
#!/bin/bash
#
for I in {1..20}
do
    echo $I >> /var/www/cron-test.log
    sleep 10
done
$ cat cron-test.log
1
2
3
4
5
6
1
7
2
8
3
9
4
10
5
11
6
12
1
7
13
2
8
14
3
9
15
4
10
16
5
11
17
6
12
18
1
7
13
19
2
8
14
20
thenbrent commented 4 years ago

@glagonikas are you running the Action Scheduler High Volume plugin?

glagonikas commented 4 years ago

The default batch size is 25 for Cron or Async - https://github.com/woocommerce/action-scheduler/blob/master/classes/ActionScheduler_QueueRunner.php#L131

That's odd... If you look on my deadlock details, the second query that's claiming processes, (UPDATE wp_actionscheduler_actions SET claim_id=183806,...) has a limit of 100, not 25!?!

@glagonikas are you running the Action Scheduler High Volume plugin?

No, wasn't aware of that plugin. We use the one that ships with WC, which overrides the one used by WCS.

glagonikas commented 4 years ago

That's odd... If you look on my deadlock details, the second query that's claiming processes, (UPDATE wp_actionscheduler_actions SET claim_id=183806,...) has a limit of 100, not 25!?!

OK, the increased limit comes from https://actionscheduler.org/perf/ that we've added previously. We had set the following

This obviously means that we did use the equivalent of the Action Scheduler High Volume plugin

Removing those filters and reverting to 25 didn't make much of a difference. Tried 10 but still no luck.

glagonikas commented 4 years ago

A followup thought is that if your cron job is scheduled to run every minute then it could be multiple instances of the cron job competing. Cron in Linux spawns the task on the schedule regardless of the disposition of the previous iteration of the task. Here's a demo script I wrote and scheduled to run every minute. Log shows the first 190 seconds:

Correct me if I'm wrong, but even if cron jobs double up, having the lock mechanism would prevent jobs bumping to each other. On each spawn of wp-cron.php I'd imagine you'll have a new claim_id, different rows would be claimed and then updated and jobs would work in parallel on the same INNODB table.

glagonikas commented 4 years ago

OK, so I run an explain on the claim query and noticed it states Using intersect(claim_id,status); Using where; Using filesort. This is because there are indexes on each individual column, but not on the three columns used by the update query.

Is there a particular reason the table wp_actionscheduler_actions doesn't have an index for claim_id, status, scheduled_date_gmt? This would result in Using where; Using filesort and therefore a faster query

I do understand that excessive use of keys is not a good idea, but in our case, it seems to have solved the issue and the deadlocks have gone away. I'll keep an eye on logs for the next couple of days to confirm they've stopped permanently.

According to NewRelic, adding the extra key, reduced the query time from ~1.5s to around ~1.3ms!

The High Volume filters have been removed, but I might give them another go if the keys have solved the deadlocks.

rrennick commented 4 years ago

@glagonikas Nice find. If things still look good after a couple days would you submit a PR adding the index & bumping the schema version? Schema is in https://github.com/woocommerce/action-scheduler/blob/master/classes/schema/ActionScheduler_StoreSchema.php

thenbrent commented 4 years ago

This obviously means that we did use the equivalent of the Action Scheduler High Volume plugin

@glagonikas that's largely what I was asking - have you tweaked the defaults. In particular, the action_scheduler_queue_runner_concurrent_batches increase there is what was leading to the higher concurrency and likely deadlocks (the default is 1 in AS now, whereas it used to be 5 or 10, because only WP Cron used to run actions, and only on the WP Cron request, not we have the async runner and chained, loop-back requests meaning both that after a WP Cron queue starts, it can go on for much longer in separate requests, and that it can process many more actions.)

The High Volume filters have been removed, but I might give them another go if the keys have solved the deadlocks.

If you can reinstate the filters, that would be a great help. As above, simply restoring action_scheduler_queue_runner_concurrent_batches to the default of 1 will almost certainly fix the deadlocks also.

Is there a particular reason the table wp_actionscheduler_actions doesn't have an index for claim_id, status, scheduled_date_gmt?

Almost certainly just:

  1. those columns used to come from WP Posts, so we used their defaults IIRC
  2. we hadn't seen a real need in our tests, so had the weakly held view that they they weren't justified (but based on your usage and tests, they certainly are justified).

I'm all for adding them given what we've found here.

glagonikas commented 4 years ago

Hi @thenbrent prior to adding the new index, i was getting deadlocks regardless of the setting on action_scheduler_queue_runner_concurrent_batches.

After adding the index, the deadlocks stopped. I then reinstated the filters and I haven't seen any issues, so the index seems to have fixed the issue for us.

rrennick commented 4 years ago

I added some testing code to my dev install that created 100+ actions at a time for testing a PR. I forgot to comment it out. This morning I saw deadlock errors in my debug log. AS has 700K+ actions.

I added quite a bit of logging and it looks like MySQL returns the action IDs before locking the table/rows for WRITE. The next 2 queries are the fetch & status check of the first action. The deadlock error is logged after the status check of the first action in the list of claimed actions.

slaFFik commented 4 years ago

We are getting those deadlocks quite frequently (almost daily). We have only 7k items in Scheduled Actions admin area table. We do not use high volume plugin or those filters to modify default AS behavior.

``` [05-Nov-2020 04:25:02 UTC] WordPress database error Deadlock found when trying to get lock; try restarting transaction for query UPDATE wp_actionscheduler_actions SET attempts = attempts+1, status='in-progress', last_attempt_gmt = '2020-11-05 04:25:02', last_attempt_local = '2020-11-04 23:25:02' WHERE action_id = 262412 made by do_action('wp_ajax_as_async_request_queue_runner')... ```

We use WooCommerce 4.6.1.

rrennick commented 4 years ago

@slaFFik Have you tried adding the index in #535?

slaFFik commented 4 years ago

No, we haven't tried that. We will apply that fix early next week and I will report back with the results. Thank you.

peterfabian commented 3 years ago

Hi @slaFFik , were you able to resolve your problem with the fix in the PR #535 ?

samimakela commented 3 years ago

We've also been looking into these deadlocks at our end when working with a busy WooCommerce Subscription site as well. I've been following this issue and the Pull Request #535 and I'm wondering, that was the root cause of the issue ever unambiguously identified?

We've been seeing this issue only when running concurrent batches. We have used Action Scheduler High Volume plugin for this.

Adding the index fixes the issue, but in my opinion it's very important to understand the root cause even if we seem to have found a fix.

To my understanding the UPDATE query in ActionScheduler_DBStore.php used to claim actions locks the entire wp_actionscheduler_actions table. This happens when the WHERE clause does not match any index.

This could explain why updates relating to the running queue collide with the claim_actions of a freshly created concurrent batch.

Adding an index that matches the WHERE clause of the claim_actions method helps, because it essentially prevents locking the entire table and locks only the claimed rows instead.

In my opinion this is a very important fix and clear summary of the cause is in order. I might be wrong with the diagnosis, but would much appreciate a second opinion on this.

We've been discussing this with our hosting provider and it seems that there are several other sites experiencing the deadlocks as well. In my opinion this should be tagged as a critical fix.

Finally I'd like to mention that we've also seen an intermittent extra concurrent batch. In other words when setting the maximum number of concurrent batches to e.g. 2, we've seen 3 concurrent batches running. This problem was confirmed by three unique and simultaneous batch_id values in the wp_actioncheduler_actions table. Based on the source code, this should not be possible. We have not looked into this more in-depth, but it seems that adding the index has a positive effect on this one as well. That being sais, we don't have enough data be sure yet.

References:

peterfabian commented 3 years ago

I'll try to sum up my findings while I continue working on this. Hope it would provide a bit of valuable info to people participating in this thread.

I've generated 1.32M actions and I'm still at roughly 1.2M pending ones, the actions keep running (pretty slowly for now).

I haven't seen any deadlocks, but I'm seeing transaction aborts, very slow processing, and high CPU utilization (100%+) coming from the mysqld.

Lots of the following messages in the mysql error log:

2021-03-24 11:24:56 5330 [Warning] Sort aborted, host: localhost, user: wp, thread: 5330, query: UPDATE wp_actionscheduler_actions SET claim_id=64765, last_attempt_gmt='2021-03-24 11:24:05', last_attempt_local='2021-03-24 11:24:05' WHERE claim_id = 0 AND scheduled_date_gmt <= '2021-03-24 11:24:05' AND status='pending' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25

When I increased the number of concurrent batches to 10 and tried to run 5 instances of WP CLI processing (php8.0 wp-cli.phar action-scheduler run --batch-size=100), I was constantly running into database timeouts, such as below and couldn't run the actions in parallel reliably enough, the CLI kept returning errors.

<div id="error"><p class="wpdberror"><strong>WordPress database error:</strong> [Lock wait timeout exceeded; try restarting transaction]<br /><code>UPDATE wp_actionscheduler_actions SET claim_id=59348, last_attempt_gmt=&#039;2021-03-23 11:19:17&#039;, last_attempt_local=&#039;2021-03-23 11:19:17&#039; WHERE claim_id = 0 AND scheduled_date_gmt &lt;= &#039;2021-03-23 11:19:17&#039; AND status=&#039;pending&#039; ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 100</code></p></div>Error: There was an error running the action scheduler: Unable to claim actions. Database error.

Using these concurrent runners, I tested the level of concurrency with both the new index proposed by @glagonikas and without it. If anything, I saw even larger congestion and more frequent lock timeouts with the index than without it. The actions table itself already has plenty of indices, so it seemed rather strange that this would cause a substantial difference. Though I think we'd need to be better familiar with the setup folks are using (db server vendor and version, my.cnf settings related to locking, key buffers, etc) to be able to see why adding this index had such a positive effect.

When examining the database using SHOW ENGINE INNODB STATUS, I could see that mostly the following two queries were competing for the actions table:

Both were super-slow, and waiting tens of seconds until they can acquire the lock they need, as shown by SHOW PROCESSLIST:

MariaDB [(none)]> SHOW PROCESSLIST;
+-----+-------------+-----------+---------------+---------+------+--------------------------+------------------------------------------------------------------------------------------------------+----------+
| Id  | User        | Host      | db            | Command | Time | State                    | Info                                                                                                 | Progress |
+-----+-------------+-----------+---------------+---------+------+--------------------------+------------------------------------------------------------------------------------------------------+----------+
|   1 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL                                                                                                 |    0.000 |
|   3 | system user |           | NULL          | Daemon  | NULL | InnoDB purge coordinator | NULL                                                                                                 |    0.000 |
|   2 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL                                                                                                 |    0.000 |
|   4 | system user |           | NULL          | Daemon  | NULL | InnoDB purge worker      | NULL                                                                                                 |    0.000 |
|   5 | system user |           | NULL          | Daemon  | NULL | InnoDB shutdown handler  | NULL                                                                                                 |    0.000 |
| 138 | wp          | localhost | wordpress-one | Query   |   45 | Init for update          | UPDATE wp_actionscheduler_actions SET claim_id=60176, last_attempt_gmt='2021-03-23 15:08:03', last_a |    0.000 |
| 141 | root        | localhost | NULL          | Query   |    0 | Init                     | SHOW PROCESSLIST                                                                                     |    0.000 |
| 142 | wp          | localhost | wordpress-one | Query   |   19 | Init for update          | UPDATE wp_actionscheduler_actions SET claim_id=60177, last_attempt_gmt='2021-03-23 15:08:30', last_a |    0.000 |
| 143 | wp          | localhost | wordpress-one | Query   |   19 | Init for update          | UPDATE wp_actionscheduler_actions SET claim_id=60178, last_attempt_gmt='2021-03-23 15:08:30', last_a |    0.000 |
| 144 | wp          | localhost | wordpress-one | Query   |   16 | Sending data             | SELECT a.action_id FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_at |    0.000 |
| 145 | wp          | localhost | wordpress-one | Query   |   11 | Sending data             | SELECT a.action_id FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_at |    0.000 |
+-----+-------------+-----------+---------------+---------+------+--------------------------+------------------------------------------------------------------------------------------------------+----------+
11 rows in set (0.000 sec)

(Notice the SELECTs waiting for 11 and 16 seconds to acquire locks and UPDATE queries waiting for 45 and 19 seconds.)

Looking at the SELECT first, there was nothing really suspicious about the query itself at first sight. However, there are 2 aspects I found suboptimal after digging a bit more:

MariaDB [wordpress-one]> EXPLAIN SELECT a.action_id FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_attempt_gmt <= '2021-02-20 15:24:33' ORDER BY a.scheduled_date_gmt ASC LIMIT 0, 20;
+------+-------------+-------+-------+-------------------------+--------------------+---------+------+------+-------------+
| id   | select_type | table | type  | possible_keys           | key                | key_len | ref  | rows | Extra       |
+------+-------------+-------+-------+-------------------------+--------------------+---------+------+------+-------------+
|    1 | SIMPLE      | a     | index | status,last_attempt_gmt | scheduled_date_gmt | 5       | NULL | 538  | Using where |
+------+-------------+-------+-------+-------------------------+--------------------+---------+------+------+-------------+
1 row in set (0.002 sec)

Changing the ORDER BY from scheduled_date_gmt to action_id cut the processing time for this query considerably (16.5sec to 0.4sec) and I don't think it would cause any problems with the clean up:

MariaDB [wordpress-one]> SELECT a.action_id, a.last_attempt_gmt, a.scheduled_date_gmt FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_attempt_gmt <= '2021-02-20 15:36:39' ORDER BY a.scheduled_date_gmt ASC LIMIT 0, 20;
Empty set (16.512 sec)

MariaDB [wordpress-one]> SELECT a.action_id, a.last_attempt_gmt, a.scheduled_date_gmt FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_attempt_gmt <= '2021-02-20 15:36:39' ORDER BY a.action_id ASC LIMIT 0, 20;
Empty set (0.473 sec)

I will prep a PR to add this change, as I think it's much easier to do than change how and when AS does its cleanup.

Interestingly, in my MariaDB instance, if I set the date cut-off to something closer to now, so that the query actually returns some results, it runs much faster and returns almost instantly:

MariaDB [wordpress-one]> SELECT a.action_id, a.last_attempt_gmt, a.scheduled_date_gmt FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_attempt_gmt <= '2021-03-01 15:36:39' ORDER BY a.scheduled_date_gmt ASC LIMIT 0, 20;
+-----------+---------------------+---------------------+
| action_id | last_attempt_gmt    | scheduled_date_gmt  |
+-----------+---------------------+---------------------+
|     16965 | 2021-02-23 10:59:45 | 2021-02-19 18:22:10 |
... 18 more rows
|     16984 | 2021-02-26 14:48:38 | 2021-02-25 19:41:22 |
+-----------+---------------------+---------------------+
20 rows in set (0.004 sec)

I've tested this also on MySQL 5.7 and saw very similar performance pattern:

mysql> SELECT a.action_id, a.last_attempt_gmt, a.scheduled_date_gmt FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_attempt_gmt <= '2021-02-20 15:36:39' ORDER BY a.scheduled_date_gmt ASC LIMIT 0, 20;
Empty set (12.43 sec)

mysql> SELECT a.action_id, a.last_attempt_gmt, a.scheduled_date_gmt FROM wp_actionscheduler_actions a WHERE 1=1 AND a.status='complete' AND a.last_attempt_gmt <= '2021-02-20 15:36:39' ORDER BY a.action_id ASC LIMIT 0, 20;
Empty set (0.19 sec)

I will look into the UPDATE next.

glagonikas commented 3 years ago

Hi @peterfabian good job with the investigation on this.

It's been quite a while since we first reported this issue, so we no longer have the my.ini settings, but i don't think they will be much different to our current ones.

Our setup at the time, consisted of one database server (64GB RAM i think) with the latest MariaDB and many web servers (32GB RAM) with Nginx behind a Load Balancer. At the time, they were all dedicated cloud VMs running the latest Ubuntu LTS.

I am not sure if the network latency would have anything to do with this, but query times are certainly affected if you use different machines, even if they are all in the same datacentre. On a different setup, we saw a major improvement when we introduced a dedicated trunk network cable between the application and database servers, because the query data used a different network port and not the one already used for internet access.

We currently use

action_scheduler_queue_runner_time_limit - 120
action_scheduler_queue_runner_batch_size - 100
action_scheduler_queue_runner_concurrent_batches - 10

and we have the following keys

PRIMARY KEY (`action_id`) USING BTREE,
INDEX `hook` (`hook`) USING BTREE,
INDEX `status` (`status`) USING BTREE,
INDEX `scheduled_date_gmt` (`scheduled_date_gmt`) USING BTREE,
INDEX `args` (`args`) USING BTREE,
INDEX `group_id` (`group_id`) USING BTREE,
INDEX `last_attempt_gmt` (`last_attempt_gmt`) USING BTREE,
INDEX `claim_id` (`claim_id`) USING BTREE,
INDEX `claim_id, status, scheduled_date_gmt` (`claim_id`, `status`, `scheduled_date_gmt`) USING BTREE,
INDEX `hook, status` (`hook`, `status`) USING BTREE

After adding the index, the deadlocks stopped immediately. I didn't check for any performance issues at the time and if the load or response time was increased due to this, but deadlocks have to be eliminated at any cost (if the app cannot mitigate the issues).

Maybe you can try the same tests on a multi-server setup as described before and see if this would make a difference? I have a feeling you'll see different results due to network latency and data transfer.

Our deadlocks would occur with less that 1m records.

Also, you mentioned that

Changing the ORDER BY from scheduled_date_gmt to action_id cut the processing time for this query considerably (16.5sec to 0.4sec) and I don't think it would cause any problems with the clean up:

To me, this makes sense, because you are just sorting by integers (albeit bigint) instead of datetime (or any other complicated format). I have no evidence to support this, but came across this which seems to suggest that int is faster. Regardless of performance, I see no reason not to use the action_id - it is very unlikely (impossible) to have a bigger action_id created before a smaller one.

The output of our current SHOW VARIABLES is below. (If anyone spots something that can be improved, I'd gladly onboard the comments)

``` "Variable_name" "Value" "alter_algorithm" "DEFAULT" "analyze_sample_percentage" "100.000000" "aria_block_size" "8192" "aria_checkpoint_interval" "30" "aria_checkpoint_log_activity" "1048576" "aria_encrypt_tables" "OFF" "aria_force_start_after_recovery_failures" "0" "aria_group_commit" "none" "aria_group_commit_interval" "0" "aria_log_file_size" "1073741824" "aria_log_purge_type" "immediate" "aria_max_sort_file_size" "9223372036853727232" "aria_page_checksum" "ON" "aria_pagecache_age_threshold" "300" "aria_pagecache_buffer_size" "134217728" "aria_pagecache_division_limit" "100" "aria_pagecache_file_hash_size" "512" "aria_recover_options" "BACKUP,QUICK" "aria_repair_threads" "1" "aria_sort_buffer_size" "268434432" "aria_stats_method" "nulls_unequal" "aria_sync_log_dir" "NEWFILE" "aria_used_for_temp_tables" "ON" "auto_increment_increment" "1" "auto_increment_offset" "1" "autocommit" "ON" "automatic_sp_privileges" "ON" "back_log" "350" "basedir" "/usr" "big_tables" "OFF" "bind_address" "0.0.0.0" "binlog_annotate_row_events" "ON" "binlog_cache_size" "32768" "binlog_checksum" "CRC32" "binlog_commit_wait_count" "0" "binlog_commit_wait_usec" "100000" "binlog_direct_non_transactional_updates" "OFF" "binlog_file_cache_size" "16384" "binlog_format" "MIXED" "binlog_optimize_thread_scheduling" "ON" "binlog_row_image" "FULL" "binlog_row_metadata" "NO_LOG" "binlog_stmt_cache_size" "32768" "bulk_insert_buffer_size" "8388608" "character_set_client" "utf8mb4" "character_set_connection" "utf8mb4" "character_set_database" "utf8mb4" "character_set_filesystem" "binary" "character_set_results" "utf8mb4" "character_set_server" "utf8mb4" "character_set_system" "utf8" "character_sets_dir" "/usr/share/mysql/charsets/" "check_constraint_checks" "ON" "collation_connection" "utf8mb4_general_ci" "collation_database" "utf8mb4_general_ci" "collation_server" "utf8mb4_general_ci" "column_compression_threshold" "100" "column_compression_zlib_level" "6" "column_compression_zlib_strategy" "DEFAULT_STRATEGY" "column_compression_zlib_wrap" "OFF" "completion_type" "NO_CHAIN" "concurrent_insert" "ALWAYS" "connect_timeout" "10" "core_file" "OFF" "datadir" "/var/lib/mysql/" "date_format" "%Y-%m-%d" "datetime_format" "%Y-%m-%d %H:%i:%s" "deadlock_search_depth_long" "15" "deadlock_search_depth_short" "4" "deadlock_timeout_long" "50000000" "deadlock_timeout_short" "10000" "debug_no_thread_alarm" "OFF" "default_master_connection" "" "default_password_lifetime" "0" "default_regex_flags" "" "default_storage_engine" "InnoDB" "default_tmp_storage_engine" "" "default_week_format" "0" "delay_key_write" "ON" "delayed_insert_limit" "100" "delayed_insert_timeout" "300" "delayed_queue_size" "1000" "disconnect_on_expired_password" "OFF" "div_precision_increment" "4" "encrypt_binlog" "OFF" "encrypt_tmp_disk_tables" "OFF" "encrypt_tmp_files" "OFF" "enforce_storage_engine" "" "eq_range_index_dive_limit" "200" "error_count" "0" "event_scheduler" "OFF" "expensive_subquery_limit" "100" "expire_logs_days" "7" "explicit_defaults_for_timestamp" "OFF" "external_user" "" "extra_max_connections" "1" "extra_port" "0" "flush" "OFF" "flush_time" "0" "foreign_key_checks" "ON" "ft_boolean_syntax" "+ -><()~*:""""&|" "ft_max_word_len" "84" "ft_min_word_len" "4" "ft_query_expansion_limit" "20" "ft_stopword_file" "(built-in)" "general_log" "OFF" "general_log_file" "db1.log" "group_concat_max_len" "1048576" "gtid_binlog_pos" "0-3-751803048" "gtid_binlog_state" "0-1-116281,0-3-751803048" "gtid_cleanup_batch_size" "64" "gtid_current_pos" "0-3-751803048,9-3-2" "gtid_domain_id" "0" "gtid_ignore_duplicates" "OFF" "gtid_pos_auto_engines" "" "gtid_seq_no" "0" "gtid_slave_pos" "0-1-596159241,9-3-2" "gtid_strict_mode" "OFF" "have_compress" "YES" "have_crypt" "YES" "have_dynamic_loading" "YES" "have_geometry" "YES" "have_openssl" "YES" "have_profiling" "YES" "have_query_cache" "YES" "have_rtree_keys" "YES" "have_ssl" "DISABLED" "have_symlink" "YES" "histogram_size" "254" "histogram_type" "DOUBLE_PREC_HB" "host_cache_size" "678" "hostname" "db1.gruum.com" "identity" "0" "idle_readonly_transaction_timeout" "0" "idle_transaction_timeout" "0" "idle_write_transaction_timeout" "0" "ignore_builtin_innodb" "OFF" "ignore_db_dirs" "" "in_predicate_conversion_threshold" "1000" "in_transaction" "0" "init_connect" "" "init_file" "" "init_slave" "" "innodb_adaptive_flushing" "ON" "innodb_adaptive_flushing_lwm" "10.000000" "innodb_adaptive_hash_index" "OFF" "innodb_adaptive_hash_index_parts" "8" "innodb_adaptive_max_sleep_delay" "0" "innodb_autoextend_increment" "64" "innodb_autoinc_lock_mode" "1" "innodb_background_scrub_data_check_interval" "0" "innodb_background_scrub_data_compressed" "OFF" "innodb_background_scrub_data_interval" "0" "innodb_background_scrub_data_uncompressed" "OFF" "innodb_buf_dump_status_frequency" "0" "innodb_buffer_pool_chunk_size" "8589934592" "innodb_buffer_pool_dump_at_shutdown" "ON" "innodb_buffer_pool_dump_now" "OFF" "innodb_buffer_pool_dump_pct" "25" "innodb_buffer_pool_filename" "ib_buffer_pool" "innodb_buffer_pool_instances" "1" "innodb_buffer_pool_load_abort" "OFF" "innodb_buffer_pool_load_at_startup" "ON" "innodb_buffer_pool_load_now" "OFF" "innodb_buffer_pool_size" "137438953472" "innodb_change_buffer_max_size" "25" "innodb_change_buffering" "all" "innodb_checksum_algorithm" "full_crc32" "innodb_cmp_per_index_enabled" "OFF" "innodb_commit_concurrency" "0" "innodb_compression_algorithm" "lz4" "innodb_compression_default" "OFF" "innodb_compression_failure_threshold_pct" "5" "innodb_compression_level" "6" "innodb_compression_pad_pct_max" "50" "innodb_concurrency_tickets" "0" "innodb_data_file_path" "ibdata1:12M:autoextend" "innodb_data_home_dir" "" "innodb_deadlock_detect" "ON" "innodb_default_encryption_key_id" "1" "innodb_default_row_format" "dynamic" "innodb_defragment" "OFF" "innodb_defragment_fill_factor" "0.900000" "innodb_defragment_fill_factor_n_recs" "20" "innodb_defragment_frequency" "40" "innodb_defragment_n_pages" "7" "innodb_defragment_stats_accuracy" "0" "innodb_disable_sort_file_cache" "OFF" "innodb_disallow_writes" "OFF" "innodb_doublewrite" "ON" "innodb_encrypt_log" "OFF" "innodb_encrypt_tables" "OFF" "innodb_encrypt_temporary_tables" "OFF" "innodb_encryption_rotate_key_age" "1" "innodb_encryption_rotation_iops" "100" "innodb_encryption_threads" "0" "innodb_fast_shutdown" "1" "innodb_fatal_semaphore_wait_threshold" "600" "innodb_file_format" "" "innodb_file_per_table" "ON" "innodb_fill_factor" "100" "innodb_flush_log_at_timeout" "1" "innodb_flush_log_at_trx_commit" "1" "innodb_flush_method" "fsync" "innodb_flush_neighbors" "1" "innodb_flush_sync" "ON" "innodb_flushing_avg_loops" "30" "innodb_force_load_corrupted" "OFF" "innodb_force_primary_key" "OFF" "innodb_force_recovery" "0" "innodb_ft_aux_table" "" "innodb_ft_cache_size" "8000000" "innodb_ft_enable_diag_print" "OFF" "innodb_ft_enable_stopword" "ON" "innodb_ft_max_token_size" "84" "innodb_ft_min_token_size" "3" "innodb_ft_num_word_optimize" "2000" "innodb_ft_result_cache_limit" "2000000000" "innodb_ft_server_stopword_table" "" "innodb_ft_sort_pll_degree" "2" "innodb_ft_total_cache_size" "640000000" "innodb_ft_user_stopword_table" "" "innodb_immediate_scrub_data_uncompressed" "OFF" "innodb_instant_alter_column_allowed" "add_drop_reorder" "innodb_io_capacity" "1000" "innodb_io_capacity_max" "2000" "innodb_large_prefix" "" "innodb_lock_schedule_algorithm" "fcfs" "innodb_lock_wait_timeout" "60" "innodb_log_buffer_size" "67108864" "innodb_log_checksums" "ON" "innodb_log_compressed_pages" "ON" "innodb_log_file_size" "4294967296" "innodb_log_files_in_group" "1" "innodb_log_group_home_dir" "./" "innodb_log_optimize_ddl" "OFF" "innodb_log_write_ahead_size" "8192" "innodb_lru_flush_size" "32" "innodb_lru_scan_depth" "1536" "innodb_max_dirty_pages_pct" "90.000000" "innodb_max_dirty_pages_pct_lwm" "0.000000" "innodb_max_purge_lag" "0" "innodb_max_purge_lag_delay" "0" "innodb_max_purge_lag_wait" "4294967295" "innodb_max_undo_log_size" "10485760" "innodb_monitor_disable" "" "innodb_monitor_enable" "" "innodb_monitor_reset" "" "innodb_monitor_reset_all" "" "innodb_old_blocks_pct" "37" "innodb_old_blocks_time" "1000" "innodb_online_alter_log_max_size" "134217728" "innodb_open_files" "1000" "innodb_optimize_fulltext_only" "OFF" "innodb_page_cleaners" "1" "innodb_page_size" "16384" "innodb_prefix_index_cluster_optimization" "OFF" "innodb_print_all_deadlocks" "OFF" "innodb_purge_batch_size" "300" "innodb_purge_rseg_truncate_frequency" "128" "innodb_purge_threads" "4" "innodb_random_read_ahead" "OFF" "innodb_read_ahead_threshold" "56" "innodb_read_io_threads" "32" "innodb_read_only" "OFF" "innodb_replication_delay" "0" "innodb_rollback_on_timeout" "OFF" "innodb_scrub_log" "OFF" "innodb_scrub_log_speed" "256" "innodb_sort_buffer_size" "1048576" "innodb_spin_wait_delay" "4" "innodb_stats_auto_recalc" "ON" "innodb_stats_include_delete_marked" "OFF" "innodb_stats_method" "nulls_equal" "innodb_stats_modified_counter" "0" "innodb_stats_on_metadata" "OFF" "innodb_stats_persistent" "ON" "innodb_stats_persistent_sample_pages" "20" "innodb_stats_traditional" "ON" "innodb_stats_transient_sample_pages" "8" "innodb_status_output" "OFF" "innodb_status_output_locks" "OFF" "innodb_strict_mode" "ON" "innodb_sync_array_size" "1" "innodb_sync_spin_loops" "30" "innodb_table_locks" "ON" "innodb_temp_data_file_path" "ibtmp1:12M:autoextend" "innodb_thread_concurrency" "0" "innodb_thread_sleep_delay" "0" "innodb_tmpdir" "" "innodb_undo_directory" "./" "innodb_undo_log_truncate" "OFF" "innodb_undo_logs" "128" "innodb_undo_tablespaces" "0" "innodb_use_atomic_writes" "ON" "innodb_use_native_aio" "ON" "innodb_version" "10.5.9" "innodb_write_io_threads" "32" "insert_id" "0" "interactive_timeout" "28800" "join_buffer_size" "2097152" "join_buffer_space_limit" "2097152" "join_cache_level" "2" "keep_files_on_create" "OFF" "key_buffer_size" "134217728" "key_cache_age_threshold" "300" "key_cache_block_size" "1024" "key_cache_division_limit" "100" "key_cache_file_hash_size" "512" "key_cache_segments" "0" "large_files_support" "ON" "large_page_size" "0" "large_pages" "OFF" "last_gtid" "" "last_insert_id" "0" "lc_messages" "en_US" "lc_messages_dir" "/usr/share/mysql" "lc_time_names" "en_US" "license" "GPL" "local_infile" "ON" "lock_wait_timeout" "86400" "locked_in_memory" "OFF" "log_bin" "ON" "log_bin_basename" "/var/log/mysql/mariadb-bin" "log_bin_compress" "OFF" "log_bin_compress_min_len" "256" "log_bin_index" "/var/log/mysql/mariadb-bin.index" "log_bin_trust_function_creators" "OFF" "log_disabled_statements" "sp" "log_error" "" "log_output" "FILE" "log_queries_not_using_indexes" "OFF" "log_slave_updates" "OFF" "log_slow_admin_statements" "ON" "log_slow_disabled_statements" "sp" "log_slow_filter" "admin,filesort,filesort_on_disk,filesort_priority_queue,full_join,full_scan,query_cache,query_cache_miss,tmp_table,tmp_table_on_disk" "log_slow_rate_limit" "1" "log_slow_slave_statements" "ON" "log_slow_verbosity" "query_plan" "log_tc_size" "24576" "log_warnings" "2" "long_query_time" "5.000000" "low_priority_updates" "OFF" "lower_case_file_system" "OFF" "lower_case_table_names" "0" "master_verify_checksum" "OFF" "max_allowed_packet" "1073741824" "max_binlog_cache_size" "18446744073709547520" "max_binlog_size" "1073741824" "max_binlog_stmt_cache_size" "18446744073709547520" "max_connect_errors" "100" "max_connections" "1500" "max_delayed_threads" "20" "max_digest_length" "1024" "max_error_count" "64" "max_heap_table_size" "268435456" "max_insert_delayed_threads" "20" "max_join_size" "18446744073709551615" "max_length_for_sort_data" "1024" "max_password_errors" "4294967295" "max_prepared_stmt_count" "16382" "max_recursive_iterations" "4294967295" "max_relay_log_size" "1073741824" "max_rowid_filter_size" "131072" "max_seeks_for_key" "4294967295" "max_session_mem_used" "9223372036854775807" "max_sort_length" "1024" "max_sp_recursion_depth" "0" "max_statement_time" "0.000000" "max_tmp_tables" "32" "max_user_connections" "0" "max_write_lock_count" "4294967295" "metadata_locks_cache_size" "1024" "metadata_locks_hash_instances" "8" "min_examined_row_limit" "0" "mrr_buffer_size" "262144" "myisam_block_size" "1024" "myisam_data_pointer_size" "6" "myisam_max_sort_file_size" "9223372036853727232" "myisam_mmap_size" "18446744073709551615" "myisam_recover_options" "BACKUP,QUICK" "myisam_repair_threads" "1" "myisam_sort_buffer_size" "134216704" "myisam_stats_method" "NULLS_UNEQUAL" "myisam_use_mmap" "OFF" "mysql56_temporal_format" "ON" "net_buffer_length" "16384" "net_read_timeout" "30" "net_retry_count" "10" "net_write_timeout" "60" "old" "OFF" "old_alter_table" "DEFAULT" "old_mode" "" "old_passwords" "OFF" "open_files_limit" "16384" "optimizer_max_sel_arg_weight" "32000" "optimizer_prune_level" "1" "optimizer_search_depth" "62" "optimizer_selectivity_sampling_limit" "100" "optimizer_switch" "index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=on,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on,rowid_filter=on,condition_pushdown_from_having=on,not_null_range_scan=off" "optimizer_trace" "enabled=off" "optimizer_trace_max_mem_size" "1048576" "optimizer_use_condition_selectivity" "4" "performance_schema" "ON" "performance_schema_accounts_size" "-1" "performance_schema_digests_size" "10000" "performance_schema_events_stages_history_long_size" "10000" "performance_schema_events_stages_history_size" "10" "performance_schema_events_statements_history_long_size" "10000" "performance_schema_events_statements_history_size" "10" "performance_schema_events_transactions_history_long_size" "10000" "performance_schema_events_transactions_history_size" "10" "performance_schema_events_waits_history_long_size" "10000" "performance_schema_events_waits_history_size" "10" "performance_schema_hosts_size" "-1" "performance_schema_max_cond_classes" "90" "performance_schema_max_cond_instances" "-1" "performance_schema_max_digest_length" "1024" "performance_schema_max_file_classes" "80" "performance_schema_max_file_handles" "32768" "performance_schema_max_file_instances" "-1" "performance_schema_max_index_stat" "-1" "performance_schema_max_memory_classes" "320" "performance_schema_max_metadata_locks" "-1" "performance_schema_max_mutex_classes" "210" "performance_schema_max_mutex_instances" "-1" "performance_schema_max_prepared_statements_instances" "-1" "performance_schema_max_program_instances" "-1" "performance_schema_max_rwlock_classes" "50" "performance_schema_max_rwlock_instances" "-1" "performance_schema_max_socket_classes" "10" "performance_schema_max_socket_instances" "-1" "performance_schema_max_sql_text_length" "1024" "performance_schema_max_stage_classes" "160" "performance_schema_max_statement_classes" "222" "performance_schema_max_statement_stack" "10" "performance_schema_max_table_handles" "-1" "performance_schema_max_table_instances" "-1" "performance_schema_max_table_lock_stat" "-1" "performance_schema_max_thread_classes" "50" "performance_schema_max_thread_instances" "-1" "performance_schema_session_connect_attrs_size" "512" "performance_schema_setup_actors_size" "-1" "performance_schema_setup_objects_size" "-1" "performance_schema_users_size" "-1" "pid_file" "/run/mysqld/mysqld.pid" "plugin_dir" "/usr/lib/mysql/plugin/" "plugin_maturity" "gamma" "port" "3306" "preload_buffer_size" "32768" "profiling" "OFF" "profiling_history_size" "15" "progress_report_time" "5" "protocol_version" "10" "proxy_protocol_networks" "" "proxy_user" "" "pseudo_slave_mode" "OFF" "pseudo_thread_id" "238817" "query_alloc_block_size" "16384" "query_cache_limit" "0" "query_cache_min_res_unit" "4096" "query_cache_size" "0" "query_cache_strip_comments" "OFF" "query_cache_type" "OFF" "query_cache_wlock_invalidate" "OFF" "query_prealloc_size" "24576" "rand_seed1" "834957938" "rand_seed2" "69824374" "range_alloc_block_size" "4096" "read_binlog_speed_limit" "0" "read_buffer_size" "131072" "read_only" "OFF" "read_rnd_buffer_size" "262144" "relay_log" "" "relay_log_basename" "" "relay_log_index" "" "relay_log_info_file" "relay-log.info" "relay_log_purge" "ON" "relay_log_recovery" "OFF" "relay_log_space_limit" "0" "replicate_annotate_row_events" "ON" "replicate_do_db" "" "replicate_do_table" "" "replicate_events_marked_for_skip" "REPLICATE" "replicate_ignore_db" "" "replicate_ignore_table" "" "replicate_wild_do_table" "" "replicate_wild_ignore_table" "" "report_host" "" "report_password" "" "report_port" "3306" "report_user" "" "require_secure_transport" "OFF" "rowid_merge_buff_size" "8388608" "rpl_semi_sync_master_enabled" "OFF" "rpl_semi_sync_master_timeout" "10000" "rpl_semi_sync_master_trace_level" "32" "rpl_semi_sync_master_wait_no_slave" "ON" "rpl_semi_sync_master_wait_point" "AFTER_COMMIT" "rpl_semi_sync_slave_delay_master" "OFF" "rpl_semi_sync_slave_enabled" "OFF" "rpl_semi_sync_slave_kill_conn_timeout" "5" "rpl_semi_sync_slave_trace_level" "32" "secure_auth" "ON" "secure_file_priv" "" "secure_timestamp" "NO" "server_id" "3" "session_track_schema" "ON" "session_track_state_change" "OFF" "session_track_system_variables" "autocommit,character_set_client,character_set_connection,character_set_results,time_zone" "session_track_transaction_info" "OFF" "skip_external_locking" "ON" "skip_name_resolve" "ON" "skip_networking" "OFF" "skip_parallel_replication" "OFF" "skip_replication" "OFF" "skip_show_database" "OFF" "slave_compressed_protocol" "OFF" "slave_ddl_exec_mode" "IDEMPOTENT" "slave_domain_parallel_threads" "0" "slave_exec_mode" "STRICT" "slave_load_tmpdir" "/tmp" "slave_max_allowed_packet" "1073741824" "slave_net_timeout" "60" "slave_parallel_max_queued" "131072" "slave_parallel_mode" "optimistic" "slave_parallel_threads" "0" "slave_parallel_workers" "0" "slave_run_triggers_for_rbr" "NO" "slave_skip_errors" "OFF" "slave_sql_verify_checksum" "ON" "slave_transaction_retries" "10" "slave_transaction_retry_errors" "1158,1159,1160,1161,1205,1213,1429,2013,12701" "slave_transaction_retry_interval" "0" "slave_type_conversions" "" "slow_launch_time" "2" "slow_query_log" "ON" "slow_query_log_file" "/var/log/mysql/mysql-slow.log" "socket" "/run/mysqld/mysqld.sock" "sort_buffer_size" "4194304" "sql_auto_is_null" "OFF" "sql_big_selects" "ON" "sql_buffer_result" "OFF" "sql_if_exists" "OFF" "sql_log_bin" "ON" "sql_log_off" "OFF" "sql_mode" "STRICT_TRANS_TABLES,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION" "sql_notes" "ON" "sql_quote_show_create" "ON" "sql_safe_updates" "OFF" "sql_select_limit" "18446744073709551615" "sql_slave_skip_counter" "0" "sql_warnings" "OFF" "ssl_ca" "" "ssl_capath" "" "ssl_cert" "" "ssl_cipher" "" "ssl_crl" "" "ssl_crlpath" "" "ssl_key" "" "standard_compliant_cte" "ON" "storage_engine" "InnoDB" "stored_program_cache" "256" "strict_password_validation" "ON" "sync_binlog" "0" "sync_frm" "ON" "sync_master_info" "10000" "sync_relay_log" "10000" "sync_relay_log_info" "10000" "system_time_zone" "GMT" "system_versioning_alter_history" "ERROR" "system_versioning_asof" "DEFAULT" "table_definition_cache" "400" "table_open_cache" "1200" "table_open_cache_instances" "6" "tcp_keepalive_interval" "0" "tcp_keepalive_probes" "0" "tcp_keepalive_time" "0" "tcp_nodelay" "ON" "thread_cache_size" "512" "thread_handling" "one-thread-per-connection" "thread_pool_dedicated_listener" "OFF" "thread_pool_exact_stats" "OFF" "thread_pool_idle_timeout" "60" "thread_pool_max_threads" "65536" "thread_pool_oversubscribe" "3" "thread_pool_prio_kickup_timer" "1000" "thread_pool_priority" "auto" "thread_pool_size" "64" "thread_pool_stall_limit" "500" "thread_stack" "299008" "time_format" "%H:%i:%s" "time_zone" "SYSTEM" "timestamp" "1616620944.174301" "tls_version" "TLSv1.1,TLSv1.2,TLSv1.3" "tmp_disk_table_size" "18446744073709551615" "tmp_memory_table_size" "268435456" "tmp_table_size" "268435456" "tmpdir" "/tmp" "transaction_alloc_block_size" "8192" "transaction_prealloc_size" "4096" "tx_isolation" "REPEATABLE-READ" "tx_read_only" "OFF" "unique_checks" "ON" "updatable_views_with_limit" "YES" "use_stat_tables" "PREFERABLY_FOR_QUERIES" "userstat" "OFF" "version" "10.5.9-MariaDB-1:10.5.9+maria~focal-log" "version_comment" "mariadb.org binary distribution" "version_compile_machine" "x86_64" "version_compile_os" "debian-linux-gnu" "version_malloc_library" "system" "version_source_revision" "3a8ca9096ea82ca61811450775511533d6cb1bb4" "version_ssl_library" "OpenSSL 1.1.1f 31 Mar 2020" "wait_timeout" "28800" "warning_count" "0" "wsrep_osu_method" "TOI" "wsrep_sr_store" "table" "wsrep_auto_increment_control" "ON" "wsrep_causal_reads" "OFF" "wsrep_certification_rules" "strict" "wsrep_certify_nonpk" "ON" "wsrep_cluster_address" "" "wsrep_cluster_name" "my_wsrep_cluster" "wsrep_convert_lock_to_trx" "OFF" "wsrep_data_home_dir" "/var/lib/mysql/" "wsrep_dbug_option" "" "wsrep_debug" "NONE" "wsrep_desync" "OFF" "wsrep_dirty_reads" "OFF" "wsrep_drupal_282555_workaround" "OFF" "wsrep_forced_binlog_format" "NONE" "wsrep_gtid_domain_id" "0" "wsrep_gtid_mode" "OFF" "wsrep_gtid_seq_no" "0" "wsrep_ignore_apply_errors" "7" "wsrep_load_data_splitting" "OFF" "wsrep_log_conflicts" "OFF" "wsrep_max_ws_rows" "0" "wsrep_max_ws_size" "2147483647" "wsrep_mysql_replication_bundle" "0" "wsrep_node_address" "" "wsrep_node_incoming_address" "AUTO" "wsrep_node_name" "db1.gruum.com" "wsrep_notify_cmd" "" "wsrep_on" "OFF" "wsrep_patch_version" "wsrep_26.22" "wsrep_provider" "none" "wsrep_provider_options" "" "wsrep_recover" "OFF" "wsrep_reject_queries" "NONE" "wsrep_replicate_myisam" "OFF" "wsrep_restart_slave" "OFF" "wsrep_retry_autocommit" "1" "wsrep_slave_fk_checks" "ON" "wsrep_slave_uk_checks" "OFF" "wsrep_slave_threads" "1" "wsrep_sst_auth" "" "wsrep_sst_donor" "" "wsrep_sst_donor_rejects_queries" "OFF" "wsrep_sst_method" "rsync" "wsrep_sst_receive_address" "AUTO" "wsrep_start_position" "00000000-0000-0000-0000-000000000000:-1" "wsrep_strict_ddl" "OFF" "wsrep_sync_wait" "0" "wsrep_trx_fragment_size" "0" "wsrep_trx_fragment_unit" "bytes" ```
peterfabian commented 3 years ago

As for UPDATE, things a bit more complex here. I'm evaluating the SELECT subquery to make sure the selection of rows to be updated is quick: SELECT action_id, attempts, scheduled_date_gmt FROM wp_actionscheduler_actions WHERE claim_id = 0 AND scheduled_date_gmt <= '2021-03-24 15:34:20' AND status='pending' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;

Optimization 1: don't sort on scheduled date

Now the easiest solution would be to just drop the scheduled_date_gmt ASC from the ORDER BY clause and we're done (11sec to 0.005sec):

+-----------+----------+---------------------+
| action_id | attempts | scheduled_date_gmt  |
+-----------+----------+---------------------+
|     71462 |        0 | 2021-03-18 16:41:40 |
...
|     71358 |        0 | 2021-03-18 16:42:11 |
+-----------+----------+---------------------+
25 rows in set (11.004 sec)

MariaDB [wordpress-one]> SELECT action_id, attempts, scheduled_date_gmt FROM wp_actionscheduler_actions  WHERE claim_id = 0 AND scheduled_date_gmt <= '2021-03-24 15:34:20' AND status='pending' ORDER BY attempts ASC, action_id ASC LIMIT 25;
+-----------+----------+---------------------+
| action_id | attempts | scheduled_date_gmt  |
+-----------+----------+---------------------+
|     70605 |        0 | 2021-03-18 16:42:13 |
...
|     70721 |        0 | 2021-03-18 16:42:42 |
+-----------+----------+---------------------+
25 rows in set (0.005 sec)

Theoretically speaking, I don't think the docs mention anywhere explicitly what order past actions will be run in. That being said, since this is a scheduling library and most people consider time to be somewhat linear, it's not the best solution. Especially if we can somehow make sure the invariant that action a1 scheduled for time t1 runs before action a2 scheduled for time t2 if t1 < t2 is upheld.

Optimization 2: subquery to reduce result set

One possible optimization is that it seems that query SELECT scheduled_date_gmt FROM wp_actionscheduler_actions WHERE claim_id = 0 AND status='pending' ORDER BY scheduled_date_gmt ASC LIMIT 25; is a relatively quick one.

MariaDB [wordpress-one]> select scheduled_date_gmt from wp_actionscheduler_actions where claim_id = 0 AND status='pending' order by scheduled_date_gmt ASC limit 25;
+---------------------+
| scheduled_date_gmt  |
+---------------------+
| 2021-03-18 16:42:14 |
...23 more items...
| 2021-03-18 16:42:48 |
+---------------------+
25 rows in set (0.201 sec)

The largest value in the result set gives a better upper bound on scheduled_date_gmt than looking at all actions that are scheduled before now (if there is at most 1 action scheduled for each second, then this is equivalend to the result set; if there are mutliple actions scheduled for each second, the 25 earliest ones are for sure schedled in the interval between the min and max from the subquery above). Side note: don't forget to take into account group, hook and possible query filtering.

Now once we have a better upper bound, we can make the subsequent query more exclusive and faster:

New query: 0.2s:

MariaDB [wordpress-one]> SELECT action_id, attempts, scheduled_date_gmt FROM wp_actionscheduler_actions  WHERE claim_id = 0 AND scheduled_date_gmt <= '2021-03-25 09:24:20' AND scheduled_date_gmt <= '2021-03-18 16:43:02' AND status='pending' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;
+-----------+----------+---------------------+
| action_id | attempts | scheduled_date_gmt  |
+-----------+----------+---------------------+
|     71652 |        0 | 2021-03-18 16:42:29 |
|     71657 |        0 | 2021-03-18 16:42:30 |
|     71662 |        0 | 2021-03-18 16:42:31 |
|     71667 |        0 | 2021-03-18 16:42:32 |
|     71672 |        0 | 2021-03-18 16:42:33 |
|     71677 |        0 | 2021-03-18 16:42:34 |
|     71682 |        0 | 2021-03-18 16:42:36 |
|     71687 |        0 | 2021-03-18 16:42:37 |
|     71692 |        0 | 2021-03-18 16:42:38 |
|     71697 |        0 | 2021-03-18 16:42:46 |
|     71702 |        0 | 2021-03-18 16:42:47 |
|     71707 |        0 | 2021-03-18 16:42:48 |
|     71712 |        0 | 2021-03-18 16:42:49 |
|     71717 |        0 | 2021-03-18 16:42:50 |
|     71722 |        0 | 2021-03-18 16:42:51 |
|     71727 |        0 | 2021-03-18 16:42:52 |
|     71732 |        0 | 2021-03-18 16:42:53 |
|     71737 |        0 | 2021-03-18 16:42:55 |
|     71742 |        0 | 2021-03-18 16:42:56 |
|     71747 |        0 | 2021-03-18 16:42:57 |
|     71752 |        0 | 2021-03-18 16:42:58 |
|     71757 |        0 | 2021-03-18 16:42:59 |
|     71762 |        0 | 2021-03-18 16:43:00 |
|     71767 |        0 | 2021-03-18 16:43:01 |
|     71772 |        0 | 2021-03-18 16:43:02 |
+-----------+----------+---------------------+
25 rows in set (0.205 sec)

Old query (13s):

MariaDB [wordpress-one]> SELECT action_id, attempts, scheduled_date_gmt FROM wp_actionscheduler_actions  WHERE claim_id = 0 AND scheduled_date_gmt <= '2021-03-25 09:24:20' AND status='pending' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;
+-----------+----------+---------------------+
| action_id | attempts | scheduled_date_gmt  |
+-----------+----------+---------------------+
|     71662 |        0 | 2021-03-18 16:42:31 |
|     71667 |        0 | 2021-03-18 16:42:32 |
|     71672 |        0 | 2021-03-18 16:42:33 |
|     71677 |        0 | 2021-03-18 16:42:34 |
|     71682 |        0 | 2021-03-18 16:42:36 |
|     71687 |        0 | 2021-03-18 16:42:37 |
|     71692 |        0 | 2021-03-18 16:42:38 |
|     71697 |        0 | 2021-03-18 16:42:46 |
|     71702 |        0 | 2021-03-18 16:42:47 |
|     71707 |        0 | 2021-03-18 16:42:48 |
|     71712 |        0 | 2021-03-18 16:42:49 |
|     71717 |        0 | 2021-03-18 16:42:50 |
|     71722 |        0 | 2021-03-18 16:42:51 |
|     71727 |        0 | 2021-03-18 16:42:52 |
|     71732 |        0 | 2021-03-18 16:42:53 |
|     71737 |        0 | 2021-03-18 16:42:55 |
|     71742 |        0 | 2021-03-18 16:42:56 |
|     71747 |        0 | 2021-03-18 16:42:57 |
|     71752 |        0 | 2021-03-18 16:42:58 |
|     71757 |        0 | 2021-03-18 16:42:59 |
|     71762 |        0 | 2021-03-18 16:43:00 |
|     71767 |        0 | 2021-03-18 16:43:01 |
|     71772 |        0 | 2021-03-18 16:43:02 |
|     71777 |        0 | 2021-03-18 16:43:03 |
|     70786 |        0 | 2021-03-18 16:43:04 |
+-----------+----------+---------------------+
25 rows in set (13.187 sec)

(the result set is slightly different since 2 actions got done in the meantime).

Optimization 3: better data structure for date exclusions

Now it's nice that we can schedule actions up to a second accuracy, but now that I see that actions older than a month get purged, I'm intrigued by the fact that there are only 30 days worth of timeslots. Of course, a lot depends on specific distribution of actions over that time period. I can see 2 possible extremes that we should be able to cater to in some way:

In the first case, adding a column that would count number of days since the beginning of the epoch and filter actions based on that could help quite a bit, as it would quickly eliminate 97% of actions (100% of actions within the last 30 days, each day is ±3% of actions, assuming more than 25 actions per day) with a simple index on an integer column not larger than 16 bits (and it would break in year 2149). However, this will be pretty useless for the antiuniform distribution (i.e. all actions scheduled on 1 day, rest of the month nothing).

Theoretically speaking, we could go one step more granular to hours. Hopefully, nobody wants AS to process hundreds of thousands of actions per hour and then hours since epoch is still quite okayish 19bit int.

So I've tried to add column hours_since_epoch, but the performance wasn't any better than Optimization 2 above, so I stopped going in that direction.

Optimization 4: add composite index

The solution proposed in https://github.com/woocommerce/action-scheduler/pull/535. I wasn't able to verify this brings any benefit, so I'm reluctant to merge it tbh. I've tried other possible combinations for difference composite indices, but could not see much of a positive difference, maybe even slow down.


In conclusion, for now, I'll prep a PR for approach no 2 and see where the next bottleneck lies after applying both the changes mentioned in my last 2 comments.

glagonikas commented 3 years ago

@peterfabian if you can, it would be great to run your tests on a multi-server (one db and one app server) environment to see if you get different results.

I suspect it might and if so, it will be good to remember in the future when testing performance-related improvements.

peterfabian commented 3 years ago

In my local testing on MariaDB with 5 CLI runners I was able to go from ±65 noop actions per minute before the change to ±1800 noop actions after applying the changes in #678 and #680. I will test it tomorrow on local MySQL 5.7 server and then on actual wp.com setup with a separate HTTP and DB server. In the meantime, feel free to test these PRs.

peterfabian commented 3 years ago

I'm happy to report back that the test on wordpress.com (ephemeral site) was successful. I've imported the 1.3M actions into the database and did the comparison below:

Original AS 3.1.6

I let the AS run for 20 minutes with 1 wp-cli worker, I haven't tried more because I was already getting large number of deadlocks:

New changes applied

I let the AS run for 10 minutes with 1-3 wp-cli workers (gradually increased the load).

I think it looks solid so far.

peterfabian commented 3 years ago

I'd also like to note that perhaps there's a possibility to improve \ActionScheduler_DBStore::find_action, as it's used quite widely (e.g. in as_unschedule_action and as_next_scheduled_action) and it also sorts by the scheduled_date_gmt, but that's perhaps outside of scope for this one now.

glagonikas commented 3 years ago

That's great work @peterfabian! I'm glad you went deep into this and came up with solid changes.

It's also great that you confirmed my suspicions that having a separate DB server [in some cases/scenarios] can cause completely different behaviour than you'd get on local/single server.

samimakela commented 3 years ago

@peterfabian and @glagonikas I'm super glad to see this investigated further. Great work @peterfabian with the performance optimisation.

To my understanding the optimisation in #678 and #680 will, in addition to improved performance, reduce the likelihood of deadlocks, but it won't eliminate it completely. Adding the index as would (as explained in my earlier comment). @peterfabian what's your opinion on this?

What I'm trying to say is that I would add the index to compliment the performance optimisation in 678 and 680.

We've now had the new composite index running for several weeks on a large WooCommerce Subscriptions site (with several thousand payments scheduled to run at the same time) We've had zero deadlocks since we applied the changes.

peterfabian commented 3 years ago

Hi @samimakela , I'm not really an expert on MySQL, so happy to learn and update my mental model, but I don't think you can actually eliminate the deadlocks completely if you have 2 queries asking to update the same rows which could end up waiting for each other. Based on some more reading I've done now, my understanding is that the only way to avoid deadlocks would be to create a covered index that would include all columns as the primary index, so that all the data would be directly included in index nodes and the data doesn't need to be fetched from the table itself. Worth a try maybe, but I think the change proposed in #535 does not completely eliminate deadlocks.

As for other effects of adding the extra composite index as proposed in https://github.com/woocommerce/action-scheduler/pull/535, I saw in the EXPLAIN that the query using the new index actually didn't make much of a difference, MariaDB estimated the number of examined rows even higher (when forcing the new index, otherwise sql optimizer doesn't choose the composite index for me).

Without the index:

MariaDB [wordpress-one]> EXPLAIN SELECT action_id FROM wp_actionscheduler_actions WHERE status='pending' AND claim_id = 0 AND scheduled_date_gmt <= '2021-03-27 13:04:34' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;
+------+-------------+----------------------------+------+------------------------------------+--------+---------+-------+--------+----------------------------------------------------+
| id   | select_type | table                      | type | possible_keys                      | key    | key_len | ref   | rows   | Extra                                              |
+------+-------------+----------------------------+------+------------------------------------+--------+---------+-------+--------+----------------------------------------------------+
|    1 | SIMPLE      | wp_actionscheduler_actions | ref  | status,scheduled_date_gmt,claim_id | status | 82      | const | 639595 | Using index condition; Using where; Using filesort |
+------+-------------+----------------------------+------+------------------------------------+--------+---------+-------+--------+----------------------------------------------------+

With the index, default query:

MariaDB [wordpress-one]> EXPLAIN SELECT action_id FROM wp_actionscheduler_actions WHERE status='pending' AND claim_id = 0 AND scheduled_date_gmt <= '2021-03-27 13:04:34' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;
+------+-------------+----------------------------+------+-----------------------------------------------------------------------+--------+---------+-------+--------+----------------------------------------------------+
| id   | select_type | table                      | type | possible_keys                                                         | key    | key_len | ref   | rows   | Extra                                              |
+------+-------------+----------------------------+------+-----------------------------------------------------------------------+--------+---------+-------+--------+----------------------------------------------------+
|    1 | SIMPLE      | wp_actionscheduler_actions | ref  | status,scheduled_date_gmt,claim_id,claim_id-status-scheduled_date_gmt | status | 82      | const | 639595 | Using index condition; Using where; Using filesort |
+------+-------------+----------------------------+------+-----------------------------------------------------------------------+--------+---------+-------+--------+----------------------------------------------------+
1 row in set (0.001 sec)

WIth index, giving a hint to use the new index:

MariaDB [wordpress-one]> EXPLAIN SELECT action_id FROM wp_actionscheduler_actions USE INDEX(`claim_id-status-scheduled_date_gmt`)  WHERE status='pending' AND claim_id = 0 AND scheduled_date_gmt <= '2021-03-27 13:04:34' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;
+------+-------------+----------------------------+------+------------------------------------+------------------------------------+---------+-------------+--------+----------------------------------------------------+
| id   | select_type | table                      | type | possible_keys                      | key                                | key_len | ref         | rows   | Extra                                              |
+------+-------------+----------------------------+------+------------------------------------+------------------------------------+---------+-------------+--------+----------------------------------------------------+
|    1 | SIMPLE      | wp_actionscheduler_actions | ref  | claim_id-status-scheduled_date_gmt | claim_id-status-scheduled_date_gmt | 90      | const,const | 662406 | Using index condition; Using where; Using filesort |
+------+-------------+----------------------------+------+------------------------------------+------------------------------------+---------+-------------+--------+----------------------------------------------------+
1 row in set (0.000 sec)

Just for completeness, also including the EXPLAIN for the UPDATE after adding the index, first without index hint, then with the hint:

MariaDB [wordpress-one]> EXPLAIN UPDATE wp_actionscheduler_actions SET claim_id=60232, last_attempt_gmt='2021-03-23 15:24:34', last_attempt_local='2021-03-23 15:24:34' WHERE status='pending' AND claim_id = 0 AND scheduled_date_gmt <= '2021-03-27 15:24:34' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;
+------+-------------+----------------------------+-------+-----------------------------------------------------------------------+--------------------+---------+------+--------+-----------------------------+
| id   | select_type | table                      | type  | possible_keys                                                         | key                | key_len | ref  | rows   | Extra                       |
+------+-------------+----------------------------+-------+-----------------------------------------------------------------------+--------------------+---------+------+--------+-----------------------------+
|    1 | SIMPLE      | wp_actionscheduler_actions | range | status,scheduled_date_gmt,claim_id,claim_id-status-scheduled_date_gmt | scheduled_date_gmt | 5       | NULL | 662406 | Using where; Using filesort |
+------+-------------+----------------------------+-------+-----------------------------------------------------------------------+--------------------+---------+------+--------+-----------------------------+
1 row in set (0.001 sec)

MariaDB [wordpress-one]> EXPLAIN UPDATE wp_actionscheduler_actions USE INDEX(`claim_id-status-scheduled_date_gmt`) SET claim_id=60232, last_attempt_gmt='2021-03-23 15:24:34', last_attempt_local='2021-03-23 15:24:34' WHERE status='pending' AND claim_id = 0 AND scheduled_date_gmt <= '2021-03-27 15:24:34' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;
+------+-------------+----------------------------+-------+------------------------------------+------------------------------------+---------+------+--------+-----------------------------+
| id   | select_type | table                      | type  | possible_keys                      | key                                | key_len | ref  | rows   | Extra                       |
+------+-------------+----------------------------+-------+------------------------------------+------------------------------------+---------+------+--------+-----------------------------+
|    1 | SIMPLE      | wp_actionscheduler_actions | range | claim_id-status-scheduled_date_gmt | claim_id-status-scheduled_date_gmt | 95      | NULL | 662406 | Using where; Using filesort |
+------+-------------+----------------------------+-------+------------------------------------+------------------------------------+---------+------+--------+-----------------------------+
1 row in set (0.000 sec)

I've also tried the code with the additional index on a new site on wordpress.com with 1.2M pending/1.3M total actions and got the following result in 10 minutes:

While I agree that reasoning from you folks makes sense and the 2 anecdata is already promising, ultimately, I wasn't able to replicate the positive effect of the additional index (actually it seemed worse, I wasn't even able to start any WP CLI batch processing despite multiple retries), so can't really recommend it to everyone.

However, if you'd be willing to privately share a db dump from the actions table with us, we can maybe get a better understanding of what happens on your sites. Unfortunately, I won't be able to follow up on this any time soon, as I'll be on extended leave from Monday, but maybe someone else from the team might be able to test it.

samimakela commented 3 years ago

@peterfabian thanks for taking the time to answer.

The underlying issue in my opinion is that whenever actions are claimed the whole table is locked until the query is complete. This is stated in the InnoDB documentation.

If there is another batch already running, the execution logging might bump into the query used to claim actions. Making the queries faster (as you did) reduces the likelihood of this happening, but does not completely prevent it.

As explained in my earlier answer, adding the index would mean, that only the rows matching the query are locked during the query (instead of the whole table)

Since the two queries (claiming actions and logging the execution) don't try to update the same rows in any case, adding the index essentially prevents the deadlocks completely.

As mentioned after adding the index we haven't seen any deadlocks since. We are running a WooCommerce subscriptions site, where several thousand Stripe payments are scheduled to run at the same time. We used to hundreds of deadlocks during the execution.

Creating the index is enough. You don't need to change the query itself.

You can create an index for testing purposes by running the following query:

CREATE INDEX composite_test_index ON wp_actionscheduler_actions (claim_id, status, scheduled_date_gmt);

After adding the index run the default query e.g:

EXPLAIN SELECT action_id FROM wp_actionscheduler_actions WHERE status='pending' AND claim_id = 0 AND scheduled_date_gmt <= '2021-03-28 13:04:34' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25;

The result should be as follows:

+------+-------------+----------------------------+-------+---------------------------------------------------------+----------------------+---------+------+------+---------------------------------------+
| id   | select_type | table                      | type  | possible_keys                                           | key                  | key_len | ref  | rows | Extra                                 |
+------+-------------+----------------------------+-------+---------------------------------------------------------+----------------------+---------+------+------+---------------------------------------+
|    1 | SIMPLE      | wp_actionscheduler_actions | range | status,scheduled_date_gmt,claim_id,composite_test_index | composite_test_index | 95      | NULL |    1 | Using index condition; Using filesort |
+------+-------------+----------------------------+-------+---------------------------------------------------------+----------------------+---------+------+------+---------------------------------------+

The difference to your results is that the key used for the query is the _composite_testindex while in your case the key was the status Please note that I did not make any changes to the default query.

It looks like there is something wrong with the index in your test environment. I would very much encourage to take another look at this and check that the index is actually working.

I'm happy to provide help with this one. Let me know what you need and I'll do my best to help out.

glagonikas commented 3 years ago

I'm not sure why the EXPLAIN @peterfabian gets on the default query doesn't pick the index and requires a nudge. In our case, once the index is added, the default query starts using it straight-away, which is why I proposed this as a quick fix.

Our output is identical to @samimakela (just a different index name), so ref is null, rows is 1 and Extra doesn't have "using where;" which all make sense because the index is used correctly.

Since we've added the index, we too have never seen a single deadlock, but it is possible that we simply don't do such intensive processing as in Peter's case, and might resurface when the site grows further.

I see no harm in merging Peter's changes and speed up all queries, then either merge my PR, or simply put it in the docs as a possible solution for sites that might still experience deadlocks.

I'm not sure why there aren't many others reporting deadlocks, given that WC is used across loads of websites and certainly some will be big and busy ones.

off-topic As a general thought, it might be good for WC to compile a list of large websites using it and reach out to them for feedback on performance or issues like this. (Usage Tracking won't always help because some folks like us might have this switched off). Surely others will be experiencing the same frustrations like we do and would like to see other performance improvements like custom orders table and custom products table being implemented in core. Some (like us) might even be willing to fund those changes considering the huge impact they might have on server costs and lost income caused by slow page load.

samimakela commented 3 years ago

One more thought. Adding the index should also speed up the query itself as reported by @glagonikas earlier.

Since it seems, that the index did not work correctly during the tests made by @peterfabian, also the results regarding the query performance can't be relied on.

I know I'm sounding like a broken record at this point, but I strongly feel that it would be important to revisit the issue and reconsider the index as a primary solution.

As mentioned - happy to help in any way possible.

contemplate commented 3 years ago

I want to thank all those working on this thread. We have 25k+ actions on our large server and were having Deadlock issues all the time which was affecting our site performance. I added the index and we haven't had one deadlock since. Look forward to the further improvements suggested here to Action Scheduler.

barryhughes commented 3 years ago

I know I'm sounding like a broken record at this point, but I strongly feel that it would be important to revisit the issue and reconsider the index as a primary solution.

Re being a broken record ... not at all: what you're drawing attention to is important.

Specifically in relation to the work done over here (to add a new index), I think it's worth us considering this as a further (and significant) incremental improvement, but it probably will not be the 'last word' on the matter and, likely, we are going to need to continue to refine and expand upon the menu of options that exist for running AS smoothly at scale.

glagonikas commented 3 years ago

Just an update on this, we recently got AutomateWoo and this has pushed our actions to over 1m. Roughly 900k come from AW (!).

During a large promo, where hundreds of customers were browsing and purchasing, we saw the deadlocks again... 😭

So it seems that having the index helps a lot, but as @peterfabian said, it won't eliminate the deadlocks in all scenarios.

We've had similar (and even more) traffic before adding AW and hadn't had any deadlocks since adding the index, but our table was never nearly as full.

Deadlock details ``` ------------------------ LATEST DETECTED DEADLOCK ------------------------ 2021-04-21 12:47:01 0x7f50243a5700 *** (1) TRANSACTION: TRANSACTION 357935390, ACTIVE 51 sec updating or deleting mysql tables in use 1, locked 1 LOCK WAIT 4 lock struct(s), heap size 1128, 3 row lock(s), undo log entries 1 MySQL thread id 3064098, OS thread handle 139986161121024, query id 183073701 10.0.0.2 gruum Updating UPDATE wp_actionscheduler_actions SET attempts = attempts+1, status='in-progress', last_attempt_gmt = '2021-04-21 11:46:10', last_attempt_local = '2021-04-21 12:46:10' WHERE action_id = 6058068 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 1806 page no 42332 n bits 776 index status of table `gruum`.`wp_actionscheduler_actions` trx id 357935390 lock_mode X locks gap before rec insert intention waiting Record lock, heap no 536 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 7; hex 70656e64696e67; asc pending;; 1: len 8; hex 0000000000198cc9; asc ;; *** (2) TRANSACTION: TRANSACTION 357931945, ACTIVE 104 sec fetching rows mysql tables in use 3, locked 3 23594 lock struct(s), heap size 2629752, 2272318 row lock(s) MySQL thread id 3064099, OS thread handle 139982181914368, query id 183034411 10.0.0.2 gruum init for update UPDATE wp_actionscheduler_actions SET claim_id=2763905, last_attempt_gmt='2021-04-21 11:45:17', last_attempt_local='2021-04-21 12:45:17' WHERE claim_id = 0 AND scheduled_date_gmt <= '2021-04-21 11:45:17' AND status='pending' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 1806 page no 42332 n bits 776 index status of table `gruum`.`wp_actionscheduler_actions` trx id 357931945 lock_mode X Record lock, heap no 1 PHYSICAL RECORD: n_fields 1; compact format; info bits 0 0: len 8; hex 73757072656d756d; asc supremum;; Record lock, heap no 536 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 0: len 7; hex 70656e64696e67; asc pending;; 1: len 8; hex 0000000000198cc9; asc ;; *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 1806 page no 1295 n bits 824 index status of table `gruum`.`wp_actionscheduler_actions` trx id 357931945 lock_mode X waiting Record lock, heap no 37 PHYSICAL RECORD: n_fields 2; compact format; info bits 32 0: len 7; hex 70656e64696e67; asc pending;; 1: len 8; hex 00000000005c7054; asc \pT;; *** WE ROLL BACK TRANSACTION (1) ```
barryhughes commented 3 years ago

Thanks for the update, @glagonikas — out of interest, have you had an opportunity to try this related change? I'm currently reviewing that alongside the new index PR, and so far (but I still have more testing to do) the former seems to have the greater impact in terms of reducing deadlocks.

glagonikas commented 3 years ago

hi @barryhughes i tried the code you sent me, but some deadlocks still happen. Traffic has died down however so it's not a like-for-like comparison

barryhughes commented 3 years ago

Thanks for taking it for a spin—this is all useful feedback :+1:

crstauf commented 3 years ago

This is easily the most interesting, thorough, and exciting issue I've ever read. Thanks to everyone. 🤔 😂 💯

samimakela commented 3 years ago

@glagonikas could you run EXPLAIN on the query claiming the actions?

On a quick glimpse it seems that the entire table is locked. As long as the claim actions query locks the entire table the possibility of deadlocks exits. In my opinion the only way to eliminate these deadlocks is to figure out a way to claim actions without locking the entire table.

The two queries are not updating the same rows by design. However one of them (the query used to claim actions) clearly locks more rows than it needs to.

glagonikas commented 3 years ago

Hi @samimakela

The explain is as follows

"id"    "select_type"   "table" "type"  "possible_keys" "key"   "key_len"   "ref"   "rows"  "Extra"
"1" "SIMPLE"    "wp_actionscheduler_actions"    "range" "status,scheduled_date_gmt,claim_id,claim_id, status, scheduled_date_gmt,status, scheduled_date_gmt"    "status, scheduled_date_gmt"    "87"    \N  "1" "Using where; Using filesort"

There's even an index for all three conditions (claim_id, status, scheduled_date_gmt) but doesn't seem to help, however, Ii'm not sure if having status first would help this index.

> WordPress database error Deadlock found when trying to get lock; try restarting transaction for query UPDATE wp_actionscheduler_actions SET claim_id=2777385, last_attempt_gmt='2021-04-21 15:11:05', last_attempt_local='2021-04-21 16:11:05' WHERE claim_id = 0 AND scheduled_date_gmt <= '2021-04-21 15:11:05' AND status='pending' AND scheduled_date_gmt <= '2021-04-21 07:42:34' ORDER BY attempts ASC, scheduled_date_gmt ASC, action_id ASC LIMIT 25 made by do_action_ref_array('action_scheduler_run_queue'), WP_Hook->do_action, WP_Hook->apply_filters, ActionScheduler_QueueRunner->run, ActionScheduler_QueueRunner->do_batch, ActionScheduler_DBStore->stake_claim, ActionScheduler_DBStore->claim_actions
samimakela commented 3 years ago

@glagonikas have you added any other "extra" indexes on top of the composite index combining the claim_id, status, scheduled_date_gmt?

If I'm not mistaken the query seems to have used an index containing only the status and scheduled_date_gmt, which should not exist.

glagonikas commented 3 years ago

@samimakela yes, we've added more indexes depending on what was necessary at the time. Those would have come from slow queries in slow.log or NewRelic

samimakela commented 3 years ago

@glagonikas any chance you could try this out with only one composite index (claim_id, status, scheduled_date_gmt)?

I'm not sure why the optimiser seems to have picked a different index, but unless the chosen index matches fully with the WHERE clause of the query all the rows will be locked, which opens a window for deadlocks.

glagonikas commented 3 years ago

I've removed the index status and scheduled_date_gmt and the EXPLAIN now uses the other composite index as below.

"id"    "select_type"   "table" "type"  "possible_keys" "key"   "key_len"   "ref"   "rows"  "Extra"
"1" "SIMPLE"    "wp_actionscheduler_actions"    "range" "status,scheduled_date_gmt,claim_id,claim_id, status, scheduled_date_gmt"   "claim_id, status, scheduled_date_gmt"  "95"    \N  "1" "Using where; Using filesort"

This didn't make any difference and deadlocks keep cropping up...

barryhughes commented 3 years ago

Closing this issue: we'll see how the new index impacts this issue 'in the wild' via the next AS release. If there are still ongoing problems for a sizeable cross-section of the community we can re-evaluate via a fresh issue and perhaps re-examine some of the other approaches discussed earlier on.

rogercoathup commented 3 years ago

Confirming we were also seeing the problem intermittently across our sites. Have deployed the patch today and not seen any further deadlocks -- will keep observing.