woocommerce / action-scheduler

A scalable, traceable job queue for background processing large queues of tasks in WordPress. Specifically designed for distribution in WordPress plugins (and themes) - no server access required.
https://actionscheduler.org
GNU General Public License v3.0
641 stars 117 forks source link

[Docs] Example for splitting up a large DB update #556

Open helgatheviking opened 4 years ago

helgatheviking commented 4 years ago

Could y'all add an example in the docs of how to break a MYSQL task down into small chunks that the AS then processes in batches. Lets assume you have an enormous database and you want to limit the updates to I don't know... 20 at a time or something. How do you keep the task rescheduling itself so as to prevent moving on to the next task until it is finished?

rrennick commented 4 years ago

@helgatheviking Thanks for the suggestion. That would make a good addition to the documentation.

The migration works this way. Since we don't have that doc yet, two files you could look at are https://github.com/woocommerce/action-scheduler/blob/master/classes/migration/Scheduler.php and https://github.com/woocommerce/action-scheduler/blob/master/classes/migration/Runner.php

helgatheviking commented 4 years ago

Thanks @rrennick I think I have been missing 2 parts... 1. the update task appears to launch a sub-task that does the actual DB updates in batches. and 2. unscheduling the update task when all the sub-tasks are done. I'll tinker with that today and see what I get.

helgatheviking commented 4 years ago

@rrennick Here's a non-destructive, minimal example. I didn't fully understand how to use the Store so some doc there would be cool, but I did figure out the gist of it.

NB: Needs error logging on to see any result:

/**
 * Schedule the update that needs to be batched.
 */
function kia_test_batch_schedule() {

    $next_scheduled_date = WC()->queue()->get_next( 'kia_run_batch_update', null, 'kia-group' );

    if ( $next_scheduled_date || get_option( 'kia_run_upgrade_once', false ) ) {
        return;
    }

    // Schedule the update.     
    WC()->queue()->schedule_single(
        time(),
        'kia_run_batch_update',
        array(
            'step' => 1
        ),
        'kia-group'
    );

}
add_action( 'admin_init', 'kia_test_batch_schedule' );

/**
 * Batch process all product posts
 *
 * @param  int $step - count of interations
 */
function kia_process_batch( $step ) { 

    global $wpdb;

    $batch_size = 20;
    $step = $step ? intval( $step ) : 1;
    $offset = $batch_size * ( $step - 1 );

    $posts = $wpdb->get_results( 
        $wpdb->prepare( 
            "SELECT ID FROM {$wpdb->posts} AS posts
            WHERE posts.post_type = 'product'
            LIMIT 20
            OFFSET %d",
        $offset )
    );

    if ( ! empty( $posts ) ) {

        error_log( 'starting batch step ' . $step );

        foreach ( $posts as $id ) {
            error_log( 'process post ' .  $id->ID );
        }

        error_log( 'end batch ' . $step );

        // Increment step counter.
        $step++;

        // Schedule the next batch.     
        WC()->queue()->schedule_single(
            time(),
            'kia_run_batch_update',
            array(
                'step'=> $step
            ),
            'kia-group'
        );

    } else {

        error_log( 'update finished' );

        // Update DB when finished.
        update_option( 'kia_run_upgrade_once', 'done' );
    }

}
add_action( 'kia_run_batch_update', 'kia_process_batch', 10, 2 );

Related to this, it could be a cool feature for AS to have some kind of schedule_batch() function.... if in your callback you return a boolean for some kind of completion status.... so return false it auto-schedules itself and if you return true it knows all the batches are done.

This next thing I'm stuck on is that I am borrowing heavily from Woo's core install class.... https://github.com/woocommerce/woocommerce/blob/master/includes/class-wc-install.php#L145-L150

And would like to define an array of updates that then get scheduled. That's working great... my confusion/question is taking

'wc_update_400_reset_action_scheduler_migration_status',
'wc_update_400_db_version'

as an example. The wc_update_400_db_version is run (And the DB is updated) before the action schedule migration is complete. There's no way to block the next action from running until all iterations of one action are done? Is there even a need for that? If it's in the queue it will get completed right? I think I am overthinking this, but would appreciate any insight!

rrennick commented 4 years ago

There's no way to block the next action from running until all iterations of one action are done? Is there even a need for that?

We implemented action dependencies in WooCommerce Admin to ensure the order data is imported. It's possible that could be ported to a future version of AS.

it could be a cool feature for AS to have some kind of schedule_batch() function.... if in your callback you return a boolean for some kind of completion status....

This is an interesting idea. The initial barrier for it is that do_action doesn't return a value but we might be able to create a BatchAction class that handles the return value some other way.

helgatheviking commented 4 years ago

I just noticed in WooCommerce's install class that the scheduled callbacks are wrapped in a before/action functions here

public static function run_update_callback( $callback ) {
        include_once dirname( __FILE__ ) . '/wc-update-functions.php';

        if ( is_callable( $callback ) ) {
            self::run_update_callback_start( $callback );
            $result = (bool) call_user_func( $callback );
            self::run_update_callback_end( $callback, $result );
        }
    }

with the result being sent to run_update_callback_end() after function and being used there to re-schedule. That's realllllly close to what I was describing, except that a bool can't be used to determine the next step/offset.

Is there really a need to block an action from happening if the data in the latter action is not dependent on the data in the former action? Yesterday I was worried about that, but today I can't see a real downside to updating the database version option while the batching still continues.

Anyway, this has been super fascinating to dig in to. It's very cool work! And I really appreciate your feedback in helping me getting it running.

NicolaModugno commented 9 months ago

Hello everyone. Sorry if I’m jumping into this thread, I hope it’s not a problem.

I was wondering how to set up a continuous update system for WooCommerce products using the Action Scheduler. The import must be cyclical: a set of data is requested, the products are updated, a new set of data is requested, and so on.

I would like to understand the best way to manage a single set of data (restarting the process for new data is not a problem):

1) schedule a task that imports a limited number of records (e.g., 20) and, at the end, schedule an additional task if there are still records to be processed; 2) schedule a repetitive task that handles a limited number of records if present (otherwise it ends without doing anything); 3) schedule a single task per record.

To be honest I don't like the idea 1... The problem is that importing few records per cycle means “losing time,” while importing too many records risks falling into timeout errors. In addition, in case of unexpected errors, the entire process gets blocked.

With idea 2, it would be necessary to manage any concurrency of scheduled jobs, even if it is feasible.

From what I understand at a theoretical level of the Action Scheduler, I think that following idea 3 is the best way. This way, I think that:

Can someone give me some input, please? Thank you very much!

jorgeatorres commented 9 months ago

Hi @NicolaModugno!

There are some other things to consider. For instance, if the processing that takes place for each record is very light, then maybe processing in batches would be better than scheduling one job per record. You could pass a subset of the records to process as argument to the scheduled job and maybe split out things in various jobs if necessary, each handling just a small subset of the records (not necessarily all that need processing).

Still, scheduling one job per record should work correctly and it's definitely easier (at least from the scheduling perspective) and we agree that it sounds like the best course of action in this case, without knowing all the details.

I apologize if the response is a bit ambiguous but usually figuring out the best way to utilize A-S depends on a lot of factors and finding the right balance between batch processing and individual jobs involves some testing and adjusting.