zooniverse / panoptes

Zooniverse API to support user defined volunteer research projects
Apache License 2.0

Batch aggregation #4303

Closed zwolf closed 3 months ago

zwolf commented 6 months ago

Adds the necessary pieces to implement batch aggregation. In total, they should include:

- Aggregation table migration
- Client to connect to the aggregation service
- Refactors of: model, factory, controller, schemas, policy, serializer
- Specs of all the above
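
For a rough sense of the table changes involved, here is an illustrative sketch; it is not the migration in this PR, and the column names (beyond task_id and uuid, discussed further down) are assumptions pieced together from this thread:

```ruby
# Illustrative sketch only -- not the actual migration in this PR.
# Column names are assumptions based on the discussion below.
class RefactorAggregationsForBatchRuns < ActiveRecord::Migration[6.1]
  def change
    # drop the old per-row aggregation results blob (name and type assumed)
    remove_column :aggregations, :aggregation, :jsonb

    # batch-aggregation fields discussed in this thread
    add_reference :aggregations, :project, foreign_key: true
    add_column :aggregations, :user_id, :integer
    add_column :aggregations, :task_id, :string   # Celery task id from the aggregation service
    add_column :aggregations, :uuid, :string      # run identifier used for blob storage paths/URLs
    add_column :aggregations, :status, :string, default: 'pending'

    # one aggregation per user per workflow
    add_index :aggregations, %i[workflow_id user_id], unique: true
  end
end
```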

This is ready for review.


zwolf commented 4 months ago

I had some trouble with Doorkeeper scopes (not Pundit scopes, but Doorkeeper's own) when I was testing this API outside of the specs. Nothing else in the app scopes solely by workflow, and just adding to the optional scopes wasn't enough. Rather than keep fighting it, I switched the base scope for the Aggregation resource back to Project and added a project reference column to the table--that's all it's used for so far, but the relation is logical and Doorkeeper is happy.
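
For context, the "adding to the optional scopes" attempt refers to Doorkeeper's initializer config; a minimal sketch of that kind of change (scope names here are illustrative, not the app's actual scope list) looks like:

```ruby
# config/initializers/doorkeeper.rb -- sketch of the "just add an optional scope"
# approach. Even with a scope registered here, nothing else in the app authorizes
# solely by workflow, which is why the resource was scoped to Project instead.
Doorkeeper.configure do
  default_scopes :public
  optional_scopes :write, :project, :aggregation
end
```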

> I think it's worth documenting the flow of this to aggregations and back.

Added a page to the docs for the new model. @lcjohnso I think you wanted a ping to have a look.

> Semi-related to ^: I'm currently not entirely sure what the difference is between a task id vs the uuid vs the aggregation's id. Maybe adding a description to the schema or more description in the PR would be helpful?

Included this info in the new docs. The task_id comes from Celery via the aggregation service and represents the running background task. This id can be used to check the status of a running job via the aggregation API and to see whether it has completed; the front end could use it to poll the status of pending, long-running jobs. The uuid is generated by the aggregation run and used in filenames and folders on blob storage. It'll be used to construct the URL that the files will be downloaded directly from.
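
To make the distinction concrete, a serialized aggregation might look roughly like this (values and most field names here are illustrative, not the exact output of the serializer in this PR):

```ruby
# Rough shape of a serialized aggregation resource -- illustrative only.
{
  'aggregations' => [{
    'id'      => '101',                                   # Panoptes' own aggregation resource id
    'task_id' => 'celery-task-id-goes-here',              # Celery background task; poll the aggregation
                                                          # API with this to check pending/completed status
    'uuid'    => 'b62215dd-4b34-4e5c-9b1e-0dbd1f42f3a9',  # generated by the aggregation run; used in blob
                                                          # storage filenames/folders and the download URL
    'status'  => 'pending',
    'links'   => { 'project' => '1', 'workflow' => '2', 'user' => '3' }
  }]
}
```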

> Another thing I'm wondering about: from what I understand of the new aggregations model, it is 1 aggregation per user per workflow. What happens if a user wants to re-run an aggregation for a workflow? (Does an aggregation record on panoptes get cleaned up after a certain point? Do we have to manually remove the aggregation record from the db so that a re-run can be done? Is it even a thing to "re-run"?)

Yes, aggregations are unique per user & workflow. And true, this was missing a way to re-run an aggregation--what makes the most sense? The FE checks for an existing aggregation resource via GET api/aggregations?workflow_id=[wf_id]&user_id=[user_id]. I have now implemented DELETE on the controller and added a couple of specs to more clearly indicate the intended behavior. And though I was trying to avoid making the FE worry about it, the standard approach might be easiest: the FE modal checks for an existing agg, and if it's completed/failed, a click of "rerun" DELETEs it by id, then POSTs to create a new one, triggering a new aggregation run.
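
As a concrete sketch of that rerun flow (written in Ruby rather than the JS the FE would actually use; auth headers, error handling, and Panoptes' JSON-API media type are omitted, and the POST payload shape is an assumption):

```ruby
# Sketch of the modal's rerun flow as raw HTTP calls -- illustrative only.
require 'net/http'
require 'json'

base    = 'https://panoptes.zooniverse.org'
wf_id   = 1234   # illustrative ids
user_id = 5678

# 1. Check for an existing aggregation for this user + workflow
list_uri = URI("#{base}/api/aggregations?workflow_id=#{wf_id}&user_id=#{user_id}")
existing = JSON.parse(Net::HTTP.get(list_uri))['aggregations']&.first

if existing && %w[completed failed].include?(existing['status'])
  # 2. "Rerun" click: DELETE the finished aggregation by id...
  delete_uri = URI("#{base}/api/aggregations/#{existing['id']}")
  Net::HTTP.start(delete_uri.host, delete_uri.port, use_ssl: true) do |http|
    http.delete(delete_uri.path)
  end

  # 3. ...then POST to create a new one, which triggers a fresh aggregation run
  Net::HTTP.post(URI("#{base}/api/aggregations"),
                 { aggregations: { links: { workflow: wf_id, user: user_id } } }.to_json,
                 'Content-Type' => 'application/json')
end
```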

Talk actually does clean up data requests after a day and blocks early rerun attempts, but I don't love that approach either. For now, I think this functionality is enough to work with, but maybe I could add some kind of shortcut (via a route, a POST body short circuit, or something else?) to reduce the number of required FE calls. Thoughts?

lcjohnso commented 3 months ago

Docs look good -- thanks!

One question regarding the existing aggregations table: I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

zwolf commented 3 months ago

> I see you're removing the main aggregation data column in the migration, but will the >1 million rows stay in place? Or do those need to be cleaned up too? Is that happening elsewhere, or in this or a similar migration?

The plan looks like this:

I added the extra tasks to the board and plan to get them done this week.

zwolf commented 3 months ago

This is ready to merge. I truncated the aggregations table on staging in preparation:

irb(main):005:0> Aggregation.count
[primary]   (55.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 1
irb(main):007:0> Aggregation.connection.truncate(Aggregation.table_name)
[primary]   (777.3ms)  TRUNCATE TABLE "aggregations"
=> #<PG::Result:0x00007fad5f56b968 status=PGRES_COMMAND_OK ntuples=0 nfields=0 cmd_tuples=0>
irb(main):008:0> Aggregation.count
[primary]   (65.2ms)  SELECT COUNT(*) FROM "aggregations"
=> 0

I plan on doing the same thing to production before this gets deployed. In the end, this feels like a data operation rather than a schema change, and not something that should exist as a migration or a repeatable rake task. Truncate is the fastest way to remove the data, which will be important on prod where there are ~1.2M rows (the newest from 2017).