Aggregations Viewer mixes up data for multi-Task workflows

shaunanoordin commented 2 years ago

Functionality Issue

⚠️ Warning: this issue is incredibly hard to spot unless you're aware of the expected data in advance. (Scope: we're only worried about workflows with multiple Single Answer Question Tasks, specifically the Galaxy Zoo for Schools project, workflow ID 40.)

If a Workflow has multiple Tasks, then the Aggregations Viewer may not show the correct aggregations for the selected Task.

Problem is best illustrated with an example:

For this Subject, I've selected Task T3 (aka workflow.tasks[index 2] ) to view
Task T3 has only two possible answers, "yes" or "no"
However, looking at the results from Caesar, in aggregations.data.workflow.reductions[index 2], there are six aggregated answers: {0: 1, 1: 1, 2: 2, 3: 5, 4: 10, 5: 4}

This indicates that the workflow's index IDs and the aggregation's index IDs aren't synced to the same tasks.
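
One way to catch this kind of mismatch early is to compare the answer indices in a reduction against the number of answers the Task defines. This is only a minimal sketch: it assumes the Task object exposes an `answers` array and that the reduction data is a map of answer index to count, as in the example above. `reductionFitsTask` is a hypothetical helper, not something that exists in Zoo Notes.

```javascript
// Hypothetical helper: returns true when every answer index in the
// reduction data is a valid index into the task's answers array.
function reductionFitsTask(task, reductionData) {
  const answerCount = task && task.answers ? task.answers.length : 0
  return Object.keys(reductionData || {}).every((key) => {
    const index = Number(key)
    return Number.isInteger(index) && index >= 0 && index < answerCount
  })
}

// Task T3 from the example: two possible answers, "yes" or "no".
const taskT3 = { answers: [{ label: 'yes' }, { label: 'no' }] }
// What Caesar returned at reductions[index 2] for that Subject.
const reductionAtIndex2 = { 0: 1, 1: 1, 2: 2, 3: 5, 4: 10, 5: 4 }

console.log(reductionFitsTask(taskT3, reductionAtIndex2)) // false
```

A check like this can only flag reductions with too many answers. Two Tasks with the same number of answers would still be silently swapped, so it's a diagnostic aid rather than a fix.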

The problem could be any of the following:

Caesar's reduction code is providing results in a semi-random order (I'm disinclined to believe this)
The Caesar setup for WF 40 is just bad (more likely)
Zoo Notes' GraphQL aggregation-fetching code is dumb and should explicitly spell out which Task ID (T2, etc.) it wants (probably most likely)

Dev Notes

Relevant Resources

Galaxy Zoo for Schools - the project that needs this issue fixed.
🔑 Workflow 40 config
🔑 Caesar setup for WF 40
Example of a GZforSchools Subject

🔑 indicates URLs that require Zooniverse logins

Baseline Success States (i.e. examples of what's working)

The Aggregations Viewer works fine if a Workflow has only one Task. For example:

URL: https://zoo-notes.zooniverse.org/view/workflow/17096/subject/53160411
Here, Zoo-Notes pings https://caesar.zooniverse.org/graphql with the very simple query { workflow(id: 17096) { reductions(subjectId: 53160411) { data } extracts(subjectId: 53160411) { data } } } (Get all extracts & reductions for WF 17096, subject 53160411)
The returned aggregations.data.workflow.reductions[] array has only one index entry, so there's no possible confusion as to which aggregation data maps to which Task.

Status

This is a major issue preventing educational classrooms from using a large swathe of workflow types. I'm aware that Kat wishes to use Galaxy Zoo for Classrooms with Zoo Notes some time in mid-May, so a fix is ideal, though this is not a do-or-die priority. (The fallback is to use Galaxy Zoo for Classrooms without Zoo Notes.)

shaunanoordin commented 2 years ago

Tagging in @eatyourgreens for dev work, and @camallen for any graphql advice. I'll add additional dev notes on the code on the Aggregations Viewer in a minute or ten. Processing, processing...

shaunanoordin commented 2 years ago

Additional Dev Notes

Here were my thoughts going into this issue:

The problem was first noticed when, as a test, I created a Caesar aggregator for task init (workflow.tasks[index 8]), but when I looked at it on the Aggregations Viewer, the results were displaying on task T1 (workflow.tasks[index 0])
Zoo Notes has the incorrect base assumption 🐛 that the workflow.tasks and aggregations.data.results have a 1-to-1 Task mapping.

Here are some files of interest:

AggregationsStore.js has a function called fetchAggregations()
- This constructs a very simple graphQL query which, if you'll note, doesn't specify which Task IDs to fetch.
```
const query = gql`{
  workflow(id: ${workflowId}) {
    reductions(subjectId: ${subjectId}) {
      data
    }
    extracts(subjectId: ${subjectId}) {
      data
    }
  }
}`
```
The AggregationsViewer.js component actually isn't that interesting, it's just a container.
Of more interest is SingleTask.js, which displays the aggregated results for a Single Answer Question Task as a bar chart/pie chart.
- The line of interest is this: const reductionsData = reductions && reductions[selectedTaskIndex]?.data
- This line is where the incorrect base assumption 🐛 is executed in the code.
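
For comparison, a lookup by key rather than by array position might look like the sketch below. It rests on two assumptions that aren't confirmed here: that each reduction can carry a `reducerKey` field (the query would need to request it alongside `data`), and that reducer keys match Task keys. `getReductionsDataForTask` is a hypothetical name.

```javascript
// Hypothetical replacement for the index-based lookup in SingleTask.js.
// Assumes each reduction has a reducerKey and that reducer keys match Task keys.
function getReductionsDataForTask(reductions, taskKey) {
  const match = (reductions || []).find((r) => r.reducerKey === taskKey)
  return match ? match.data : undefined
}

// Reductions arriving in a different order than workflow.tasks.
const reductions = [
  { reducerKey: 'T4', data: { 0: 1, 1: 1, 2: 2 } },
  { reducerKey: 'T3', data: { 0: 12, 1: 11 } }
]

// Logs the T3 data, regardless of where it sits in the array.
console.log(getReductionsDataForTask(reductions, 'T3'))
```

With this shape, a Task that has no reducer yields `undefined` instead of borrowing another Task's data.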

Thoughts:

My current view is that the Aggregations code isn't robust enough.
⚠️ If the Caesar config is changed and/or if the graphQL aggregations-fetching code is changed to explicitly spell out the Task IDs, we need to be aware of other projects/Caesar configs that may need to be updated accordingly.
- We'll need to check with Kat on this, but as far as I know, the following are the only workflows that currently use Zoo Notes:
- Science Scribbler: Virus Factory - In Schools! (project 131134)
- Virus Picker 🔑
- Virus Classifier 🔑
- Virus Picker - Christmas Lecture 🔑
- Virus Classifier - Christmas Lecture 🔑

eatyourgreens commented 2 years ago

I don't know if this is relevant here, but the query that we use to get reductions for transcription tasks is slightly different. There's an extra reducerKey parameter. https://github.com/zooniverse/front-end-monorepo/blob/8fbe0d6a5fda63ba28e1c0d400b702aa19019b84/packages/lib-classifier/src/store/SubjectStore/Subject/TranscriptionReductions/TranscriptionReductions.js#L92-L98

```
const query = `{
  workflow(id: ${workflowId}) {
    subject_reductions(subjectId: ${subjectId}, reducerKey:"${REDUCER_KEY}")
    {
      data
    }
  }
}`
```
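
Applying the same idea to the Zoo Notes query might look like the sketch below. The `reducerKey` argument on `subject_reductions` comes from the query above; accepting the same argument on `reductions`, and an `extractorKey` argument on `extracts`, are assumptions that would need checking against the Caesar GraphQL schema. `buildAggregationsQuery` is a hypothetical name.

```javascript
// Hypothetical query builder. The reducerKey/extractorKey arguments on
// reductions/extracts are assumptions, not confirmed Caesar API.
function buildAggregationsQuery({ workflowId, subjectId, reducerKey, extractorKey }) {
  // JSON.stringify wraps the key in double quotes and escapes special characters.
  const reducerArg = reducerKey ? `, reducerKey: ${JSON.stringify(reducerKey)}` : ''
  const extractorArg = extractorKey ? `, extractorKey: ${JSON.stringify(extractorKey)}` : ''
  return `{
  workflow(id: ${workflowId}) {
    reductions(subjectId: ${subjectId}${reducerArg}) {
      data
    }
    extracts(subjectId: ${subjectId}${extractorArg}) {
      data
    }
  }
}`
}

console.log(buildAggregationsQuery({
  workflowId: 40,
  subjectId: 475202,
  reducerKey: 'T4',
  extractorKey: 'T4'
}))
```

When no keys are passed, the builder produces the same unfiltered query Zoo Notes sends today, so single-Task workflows would keep working unchanged.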

lcjohnso commented 2 years ago

In agreement with @eatyourgreens above, I think the standard assumption has been that anyone pulling extracts or reductions from Caesar will be requesting data from specific extractors or reducers by key -- example: the Caesar configs for transcription workflows all require a specific key=alice for the extractor and reducer, and that key is used to identify the data of interest (via workflowId + subjectId AND reducerKey selection).

There is no standard convention for extractor and reducer key names -- these are totally up to the discretion of the project team / configurer. The case of GZ, where each task has its own extractor and reducer with a key of the form T plus a number (e.g., T0), is useful here but should not be assumed to hold for projects in general.

eatyourgreens commented 2 years ago

Thanks for taking a look @lcjohnso. That's really useful.

To get GZ for Classrooms working, I've got no problem with setting up queries that use specific keys just for that project. @shaunanoordin @camallen do you have any thoughts?

eatyourgreens commented 2 years ago

I think the non-breaking solution would be to add the reducer key as a URL parameter, e.g. https://zoo-notes.zooniverse.org/view/workflow/40/subject/475202?reducer=T4.

When the reducer parameter is present, we query just for that key, which should give us back results for only one task.

EDIT: we probably want to pass a key with the extracts query too.
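
Reading the parameter could be as small as the sketch below. `getAggregationKeys` is a hypothetical helper, the `extractor` parameter name is an assumption, and falling back to the reducer key when no extractor key is given is only one possible choice.

```javascript
// Hypothetical helper: read optional reducer/extractor keys from a Zoo Notes URL.
function getAggregationKeys(href) {
  const params = new URL(href).searchParams
  const reducerKey = params.get('reducer') || undefined
  // Assumption: reuse the reducer key for extracts unless one is given explicitly.
  const extractorKey = params.get('extractor') || reducerKey
  return { reducerKey, extractorKey }
}

const keys = getAggregationKeys(
  'https://zoo-notes.zooniverse.org/view/workflow/40/subject/475202?reducer=T4'
)
console.log(keys.reducerKey, keys.extractorKey) // T4 T4
```

URLs without the parameter return `undefined` for both keys, which keeps existing links behaving as they do now.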

shaunanoordin commented 2 years ago

(Copy-pasting a response I wrote in Slack:)

Instead of asking the app to accept ?reducer=T1&extractor=T1, I think a better solution is to standardise the Caesar Extraction & Reduction rules, so we manually enforce the "reducer/extractor keys MUST match Task keys" rule at the Caesar config level.

This will place responsibility on the dev/techs to ensure uniformity, instead of asking the educators to learn which extractor/reducer keys apply to which workflow.
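
A convention like this could also be checked mechanically when a workflow is set up. The sketch below assumes the Task keys and the configured reducer/extractor keys have already been collected into plain arrays (how they're read from the workflow and from Caesar is left out). `findUnmatchedTaskKeys` is a hypothetical helper and the keys shown are made up for illustration.

```javascript
// Hypothetical check: which Task keys lack a matching reducer or extractor key?
function findUnmatchedTaskKeys(taskKeys, reducerKeys, extractorKeys) {
  const reducers = new Set(reducerKeys)
  const extractors = new Set(extractorKeys)
  return taskKeys.filter((key) => !reducers.has(key) || !extractors.has(key))
}

// Made-up example: T2 has an extractor but no reducer.
const unmatched = findUnmatchedTaskKeys(
  ['T0', 'T1', 'T2'],
  ['T0', 'T1'],
  ['T0', 'T1', 'T2']
)
console.log(unmatched) // [ 'T2' ]
```

An empty result would mean every Task has both a reducer and an extractor under its own key.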

This one is on me - when I set up the Virus Picker and Virus Classifier reducers/extractors, I should have made it a point to use the matching WF Task keys instead of blindly copying a template. When I set up the reducers/extractors for Galaxy Zoo, I made it a point to explicitly match reducer/extractor keys with the task keys, and I think it works much better (see PR #89)
