I was imagining the batch numbers in the Jenkins console as the way we could track progress. Since recalculating the tsvectors is basically idempotent (we can set_content as many times as we like and it'll be the same, hope I used that word right :D), we could just back up a batch or two and use the start: X option to restart at pkey X into the process (where X is the id, batch number * 1000).

I need to update the task to take at least that start value as an arg to do this, but I think it'd be less heavyweight than a DB. Maybe I allow the task to take all of start, end, and batch_size so that we can test how long it takes to process however many we want, wherever in the table we want.
Turns out the finish option to find_in_batches was added in Rails 5, so no dice on that one, but just running the smaller board task would only process 3k or so rows, though with much smaller text sizes.
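For concreteness, something like this is what I'm picturing for the arg handling — a rough sketch only, where the finish bound is a hypothetical arg handled manually since Rails 4.2's find_in_batches doesn't support it:

# rough sketch; the finish handling is hypothetical because find_in_batches
# has no finish: option until Rails 5
namespace :search do
  task :update_comment_tsvectors, [:start, :finish, :batch_size] => :environment do |_t, args|
    args.with_defaults(start: 0, finish: nil, batch_size: 1000)
    finish = args[:finish] && args[:finish].to_i

    Comment.find_in_batches(start: args[:start].to_i, batch_size: args[:batch_size].to_i).with_index do |comments, batch|
      puts "Processing comment group ##{batch}"
      comments.each { |comm| comm.searchable.set_content }
      # stop manually once the batch passes the requested end id
      break if finish && comments.last.id >= finish
      sleep(0.01) # throttle
    end
  end
end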
> I was imagining the batch numbers in the Jenkins console as the way we could track progress. Since recalculating the tsvectors is basically idempotent (we can set_content as many times as we like and it'll be the same, hope I used that word right :D), we could just back up a batch or two and use the start: X option to restart at pkey X into the process (where X is the id, batch number * 1000).
Yep - that strategy will work. Batch numbers won't report the actual record PK id though, so you can't use the batch number as a value for the :start param in find_in_batches (https://apidock.com/rails/v4.2.9/ActiveRecord/Batches/find_in_batches). We will need to report the record id manually in the logs if you want to harvest it from there and feed that in as a param for the next rake job run.
> I need to update the task to take at least that start value as an arg to do this, but I think it'd be less heavyweight than a DB. Maybe I allow the task to take all of start, end, and batch_size so that we can test how long it takes to process however many we want, wherever in the table we want.
TBH - manually harvesting from the logs and feeding it into the rake job params seems more manual to me than storing it in redis, where it can be looked up without manual intervention.
FWIW, this is how you can get access to redis via sidekiq:
# https://github.com/mperham/sidekiq/blob/74ccba6c68b1df31d615991fb2749fc19de8fbf7/lib/sidekiq.rb#L92
# get the connection info from sidekiq redis connection pool
Sidekiq.redis { |conn| conn.info }
That said, it might be worthwhile using a different redis db / connection here:
# https://github.com/redis/redis-rb#getting-started
redis = Redis.new(host: REDISHOST, db: 15)
# https://github.com/redis/redis-rb/blob/9fd381b18da20248a111cd9551354907c772a00a/lib/redis.rb#L859
# set the record with an expiry time in one month (2,629,746 seconds)
redis.setex("rake-task-#{rake_task_name}", 2629746, record.id)
Up to you how you want to proceed.
The batch number is being printed to the logs already via .with_index and a puts. I wasn't seeing the goal here as making this task automatically resumable, because if something goes wrong, we'll need to intervene and figure out why anyway. Not sure what redis and auto-resuming would buy us, unless you are imagining a glitch that would mean a successfully processed row on a second attempt after a failed first?
Since it doesn't matter how many times we recalculate a row, I see a 1000 row batch as atomically small enough to get us in the ballpark for debugging and checkpointing. We'd know which batch it failed on and restart on an ID in the one previous. The order's always the same, since per the docs:

> It’s not possible to set the order. That is automatically set to ascending on the primary key (“id ASC”) to make the batch ordering work.
Real talk: if you've got a feeling that making that search update rake task auto-resumable would save us a headache later, I'm on board. If a resource fails to update once, it seems like it would fail a second time and we'd need to know about it anyway, but that might not be a safe assumption.
Alternatively, maybe wrap that call with a begin/rescue and catch Exception, log the ID and the error, and move on?
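Something like this rough sketch (rescuing StandardError rather than Exception so it doesn't swallow interrupts; the log format is just an example):

comments.each do |comm|
  begin
    comm.searchable.set_content
  rescue StandardError => e
    # log the failing record id and the error, then keep going so the batch finishes
    puts "Failed to update tsvector for comment #{comm.id}: #{e.class} - #{e.message}"
  end
end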
Couple of notes: if you're running these in zsh, you'll have to escape the brackets. I.e., rake search:update_comment_tsvectors[44] fails with
zsh: no matches found: search:update_comment_tsvectors[44]
so try rake search:update_comment_tsvectors\[44\] instead.
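Quoting the whole invocation should work too, since it stops zsh from globbing the brackets:
rake 'search:update_comment_tsvectors[44]'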
Also, we might have to run them locally because it looks like the rake task runner in Jenkins hasn't been updated for Azure. Is there a new way I'm forgetting? Is it worth updating? Or is this another shared task for a screen on the Bucardo VM?
Adding the start and batch_size args is a nice addition 👍 this will be useful for manually seeding the job starting point.
args.with_defaults(start: 0, batch_size: 1000)
Comment.find_in_batches(start: args[:start].to_i, batch_size: args[:batch_size].to_i).with_index do |comments, batch|
> The batch number is being printed to the logs already via .with_index and a puts.
As it stands, the logs only contain the batch number, not the last record id we processed.
irb(main):002:0> args = { start: 0, batch_size: 5 }
=> {:start=>0, :batch_size=>5}
irb(main):003:0> Comment.find_in_batches(start: args[:start].to_i, batch_size: args[:batch_size].to_i).with_index do |comments, batch|
irb(main):004:1* puts "Processing comment group ##{batch}"
irb(main):005:1> comments.map { |comm| comm.searchable.set_content }
irb(main):006:1> sleep(0.01) # throttle
irb(main):007:1> end
Processing comment group #0
Processing comment group #1
I may be mistaken, but reading the docs for find_in_batches, the batch numbers cannot be used as the start argument offset. Rather, the start argument is the record id to start processing from, i.e. Comment.where(id: args[:start]). We will have to print the last processed id in each batch in the logs to use this for checkpointing / seeding a job start.
Here are some staging console logs showing the batch num vs record id as the start arg; note the record ids being processed each time.
irb(main):014:0> args = { start: 0, batch_size: 5 }
=> {:start=>0, :batch_size=>5}
### Starting at batch 0
irb(main):015:0> Comment.find_in_batches(start: args[:start], batch_size:args[:batch_size].to_i).with_index do |comments, batch|
irb(main):016:1* puts "Processing comment group ##{batch}"
irb(main):017:1> puts comments.map { |comm| comm.id }
irb(main):018:1> sleep(0.01) # throttle
irb(main):019:1> break
irb(main):020:1> end
Processing comment group #0
85
86
87
88
89
=> nil
irb(main):021:0> args = { start: 10, batch_size: 5 }
=> {:start=>10, :batch_size=>5}
### Starting at batch 10
irb(main):022:0> Comment.find_in_batches(start: args[:start], batch_size:args[:batch_size].to_i).with_index do |comments, batch|
irb(main):023:1* puts "Processing comment group ##{batch}"
irb(main):024:1> puts comments.map { |comm| comm.id }
irb(main):025:1> sleep(0.01) # throttle
irb(main):026:1> break
irb(main):027:1> end
Processing comment group #0
85
86
87
88
89
=> nil
irb(main):028:0> args = { start: 89, batch_size: 5 }
=> {:start=>89, :batch_size=>5}
### Starting at record ID 89 not batch num
irb(main):029:0> Comment.find_in_batches(start: args[:start], batch_size:args[:batch_size].to_i).with_index do |comments, batch|
irb(main):030:1* puts "Processing comment group ##{batch}"
irb(main):031:1> puts comments.map { |comm| comm.id }
irb(main):032:1> sleep(0.01) # throttle
irb(main):033:1> break
irb(main):034:1> end
Processing comment group #0
89
90
91
92
94
=> nil
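One way to get a usable checkpoint into the logs would be to print the last record id in each batch alongside the batch number — a sketch only, not what's in the PR yet:

Comment.find_in_batches(start: args[:start].to_i, batch_size: args[:batch_size].to_i).with_index do |comments, batch|
  # log the last id in the batch so a restart can feed it back in as :start
  puts "Processing comment group ##{batch}, last comment id #{comments.last.id}"
  comments.each { |comm| comm.searchable.set_content }
  sleep(0.01) # throttle
end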
> I wasn't seeing the goal here as making this task automatically resumable

Auto-restart isn't required; that said, I think it's a good idea, but I'll defer to you on this. The start arg seems a good approach as well.
Noting though that if we run this rake task in a container and the container goes away, we may lose the logs via kubectl logs, making them harder to get back for reuse in the rake task (I think we can get them from the AKS logs monitor via their custom query language - https://github.com/zooniverse/operations/issues/476#issue-678605097).
> because if something goes wrong, we'll need to intervene and figure out why anyway.
Yes, for sure we will want to figure it out. I don't think there will be an issue with the actual record processing 🤞 but I've been bitten in the past with data updates in rake tasks. If it does error due to data issues, we will have to intervene, fix the code / data, and resume where we left off.
My main concern with a long-lived task like this is container restarts / failures. We don't have a good way to run rake tasks right now (like migrations via jobs, https://github.com/zooniverse/operations/projects/1#card-42007513). So if the job / container we use to run the rake task fails or restarts, we will have to restart our job processing. When we restart, we don't want to reprocess records we have already updated; rather, we should use the checkpoint offset and start from there to optimize the run.
In my mind, for a restarted container / deploy etc., simply restarting the job (with its auto offset in, say, redis) is easier than scraping the logs manually and seeding the rake task, but probably not by much.
> Not sure what redis and autoresuming would buy us, unless you are imagining a glitch that would mean a successfully processed row on a second attempt after a failed first?
I'm thinking our main concern is a restarted container rather than a data error (a data error will require a code change / manual data fix). If our rake process gets killed, then we need to know the last processed db record id, and we don't currently track this.
> Since it doesn't matter how many times we recalculate a row, I see a 1000 row batch as atomically small enough to get us in the ballpark for debugging and checkpointing.
I think the main issue I see with the current PR is that we don't have anything in the logs to feed into the start arg, so we can't actually restart from where we stopped. If we don't have to worry about losing the logs and we can be sure we have the last record ID, then great. If not, then we might have to reprocess 1000s of records over and over and slow down the data fix for the search feature.
> We'd know which batch it failed on, restart on an ID in the one previous. The order's always the same, since

I don't think we can use batch this way - see points above. Happy to jump on a call to discuss, or discuss here.
> > We'd know which batch it failed on, restart on an ID in the one previous. The order's always the same, since
>
> I don't think we can use batch this way - see points above.
My thinking is this: if the last processed batch was 89, I use (last processed batch - 1) * batch_size to reprocess a minimum of rows while still ensuring we don't have to start over. So in this example, the start arg would be 88000. Maybe a quick convo would be best to make sure I'm not misunderstanding the issue with this.
I totally hear you about containers perhaps not being long-lived enough to get through this in a reasonable amount of time, and right now how this task actually gets run is still up in the air. Starting a container from Jenkins seems the likeliest, so I'll approach it from that direction.
I think I finally understand what you meant by not being able to use the batch id as a bookmark: it's relative to the starting ID, so even if we restart at batch 89 with something like find_in_batches(start: 89000), it starts on id 89,000 but the batch index starts over at 0.
So rather than make millions and millions of remote redis calls, I'm going to use the same idea of checkpointing by batch but use the id of the last resource in the batch instead. That gets stored in redis, and on start it checks for that key, then the arg passed into the rake task, then the default.
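Roughly like this — a sketch of the intent only; the key name and expiry are made up, and it reuses the Sidekiq redis connection idea from above:

# sketch of the resume logic; key name and expiry are arbitrary
CHECKPOINT_KEY = 'search:update_comment_tsvectors:last_id'.freeze
ONE_MONTH = 2_629_746 # seconds

# on start: prefer the redis checkpoint, then the rake arg, then the default
checkpoint = Sidekiq.redis { |conn| conn.get(CHECKPOINT_KEY) }
start_id = (checkpoint || args[:start] || 0).to_i

Comment.find_in_batches(start: start_id, batch_size: args[:batch_size].to_i).with_index do |comments, batch|
  puts "Processing comment group ##{batch}, last comment id #{comments.last.id}"
  comments.each { |comm| comm.searchable.set_content }
  # checkpoint once per batch (not per record) to keep the remote redis calls down
  Sidekiq.redis { |conn| conn.setex(CHECKPOINT_KEY, ONE_MONTH, comments.last.id) }
  sleep(0.01) # throttle
end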
If you can check my logic, I'll test this on staging. Thanks!
Questions:
- Should I add .unscoped? There are no default_scopes, but maybe just to be safe?
- I'm not sure how expensive the searchable_update SQL calls are, but better to do them slowly than to crush the DB.