zooniverse / panoptes

Zooniverse API to support user defined volunteer research projects
Apache License 2.0

Batch aggregation #4303

Closed zwolf closed 3 months ago

zwolf commented 6 months ago

Adds the necessary pieces to implement batch aggregation. In total, they should include:

- Aggregation table migration
- Client to connect to the aggregation service
- Refactors of: model, factory, controller, schemas, policy, serializer
- Specs of all the above
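
For a rough sense of the table changes involved, here is an illustrative sketch; it is not the migration in this PR, and the column names (beyond task_id and uuid, discussed further down) are assumptions pieced together from this thread:

```ruby
# Illustrative sketch only -- not the actual migration in this PR.
# Column names are assumptions based on the discussion below.
class RefactorAggregationsForBatchRuns < ActiveRecord::Migration[6.1]
  def change
    # drop the old per-row aggregation results blob (name and type assumed)
    remove_column :aggregations, :aggregation, :jsonb

    # batch-aggregation fields discussed in this thread
    add_reference :aggregations, :project, foreign_key: true
    add_column :aggregations, :user_id, :integer
    add_column :aggregations, :task_id, :string   # Celery task id from the aggregation service
    add_column :aggregations, :uuid, :string      # run identifier used for blob storage paths/URLs
    add_column :aggregations, :status, :string, default: 'pending'

    # one aggregation per user per workflow
    add_index :aggregations, %i[workflow_id user_id], unique: true
  end
end
```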

This is ready for review.


zwolf commented 4 months ago

I had some trouble with Doorkeeper scopes (not Pundit scopes, but Doorkeeper's own) when I was testing this API outside of the specs. Nothing else in the app scopes solely by workflow, and just adding to the optional scopes wasn't enough. Rather than keep fighting it, I switched the base scope for the Aggregation resource back to Project and added a project reference column to the table--that's all it's used for so far, but the relation is logical and Doorkeeper is happy.
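
For context, the "adding to the optional scopes" attempt refers to Doorkeeper's initializer config; a minimal sketch of that kind of change (scope names here are illustrative, not the app's actual scope list) looks like:

```ruby
# config/initializers/doorkeeper.rb -- sketch of the "just add an optional scope"
# approach. Even with a scope registered here, nothing else in the app authorizes
# solely by workflow, which is why the resource was scoped to Project instead.
Doorkeeper.configure do
  default_scopes :public
  optional_scopes :write, :project, :aggregation
end
```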

> I think it's worth documenting the flow of this to aggregations and back.

Added a page to the docs for the new model. @lcjohnso I think you wanted a ping to have a look.

> Semi-related to ^: I'm currently not entirely sure what the difference is between a task id vs the uuid vs the aggregation's id. Maybe adding a description to the schema or more description in the PR would be helpful?

Included this info in the new docs. The task_id comes from Celery via the aggregation service and represents the running background task. This id can be used to check the status of a running job via the aggregation API and to see whether it has completed; the front end could use it to poll the status of pending, long-running jobs. The uuid is generated by the aggregation run and used in filenames and folders on blob storage. It'll be used to construct the URL that the files will be downloaded directly from.
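
To make the distinction concrete, a serialized aggregation might look roughly like this (values and most field names here are illustrative, not the exact output of the serializer in this PR):

```ruby
# Rough shape of a serialized aggregation resource -- illustrative only.
{
  'aggregations' => [{
    'id'      => '101',                                   # Panoptes' own aggregation resource id
    'task_id' => 'celery-task-id-goes-here',              # Celery background task; poll the aggregation
                                                          # API with this to check pending/completed status
    'uuid'    => 'b62215dd-4b34-4e5c-9b1e-0dbd1f42f3a9',  # generated by the aggregation run; used in blob
                                                          # storage filenames/folders and the download URL
    'status'  => 'pending',
    'links'   => { 'project' => '1', 'workflow' => '2', 'user' => '3' }
  }]
}
```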

> Another thing I'm wondering about: from what I understand of the new aggregations model, it is 1 aggregation per user per workflow. What happens if a user wants to re-run an aggregation for a workflow? (Does an aggregation record on panoptes get cleaned up after a certain point? Do we have to manually remove the aggregation record from the db so that a re-run can be done? Is it even a thing to "re-run"?)

Yes, aggregations are unique per user & workflow. And true, this was missing a way to re-run an aggregation--what makes the most sense? The FE checks for an existing aggregation resource via GET api/aggregations?workflow_id=[wf_id]&user_id=[user_id]. I have now implemented DELETE on the controller and added a couple of specs to more clearly indicate the intended behavior. And though I was trying to avoid making the FE worry about it, the standard approach might be easiest: the FE modal checks for an existing agg, and if it's completed/failed, a click of "rerun" DELETEs it by id, then POSTs to create a new one, triggering a new aggregation run.
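
As a concrete sketch of that rerun flow (written in Ruby rather than the JS the FE would actually use; auth headers, error handling, and Panoptes' JSON-API media type are omitted, and the POST payload shape is an assumption):

```ruby
# Sketch of the modal's rerun flow as raw HTTP calls -- illustrative only.
require 'net/http'
require 'json'

base    = 'https://panoptes.zooniverse.org'
wf_id   = 1234   # illustrative ids
user_id = 5678

# 1. Check for an existing aggregation for this user + workflow
list_uri = URI("#{base}/api/aggregations?workflow_id=#{wf_id}&user_id=#{user_id}")
existing = JSON.parse(Net::HTTP.get(list_uri))['aggregations']&.first

if existing && %w[completed failed].include?(existing['status'])
  # 2. "Rerun" click: DELETE the finished aggregation by id...
  delete_uri = URI("#{base}/api/aggregations/#{existing['id']}")
  Net::HTTP.start(delete_uri.host, delete_uri.port, use_ssl: true) do |http|
    http.delete(delete_uri.path)
  end

  # 3. ...then POST to create a new one, which triggers a fresh aggregation run
  Net::HTTP.post(URI("#{base}/api/aggregations"),
                 { aggregations: { links: { workflow: wf_id, user: user_id } } }.to_json,
                 'Content-Type' => 'application/json')
end
```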

Talk actually does clean up data requests after a day and blocks early rerun attempts, but I don't love that approach either. For now, I think this functionality is enough to work with, but maybe I could add some kind of shortcut (via a route, a POST body short circuit, or something else?) to reduce the number of required FE calls. Thoughts?

lcjohnso commented 3 months ago

Docs look good -- thanks!

One question regarding the existing aggregations table: I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

zwolf commented 3 months ago

> I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

The plan looks like this:

I added the extra tasks to the board and plan to get them done this week.

zwolf commented 3 months ago

This is ready to merge. I truncated the aggregations table on staging in preparation:

irb(main):005:0> Aggregation.count
[primary]   (55.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 1
irb(main):007:0> Aggregation.connection.truncate(Aggregation.table_name)
[primary]   (777.3ms)  TRUNCATE TABLE "aggregations"
=> #<PG::Result:0x00007fad5f56b968 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=0>
irb(main):008:0> Aggregation.count
[primary]   (65.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 0

I plan on doing the same thing to production before this gets deployed. In the end, this feels like a data operation rather than a schema change, and not something that should exist as a migration or a repeatable rake task. Truncate is the fastest way to remove the data, which will be important on prod where there are ~1.2M rows (the newest from 2017).