zooniverse / panoptes-cli

A command-line interface for Panoptes
Apache License 2.0

subject upload erroring with 4+ images tied to a subject #178

Closed mfidino closed 4 years ago

mfidino commented 4 years ago

I had some back-and-forth email with @trouille, and they suggested I open an issue here. When uploading subjects to a subject set, we have intermittently seen the upload error out when there are 4+ images tied to a subject id. In this specific case a user is trying to upload a maximum of 6 images per subject id (though I tested this with 4 and 5 images per id and the same issue occurred). The error occurs when not every subject id has an image in the fourth, fifth, or sixth image path column (depending on the total number of images). For example, running:

panoptes subject-set upload-subjects --allow-missing -m image/JPG {subject set id here} manifest_1.csv

with this manifest:

[Image: error_1]

The above call will error out with: Error: File "C:" could not be found.

Conversely, this manifest would upload:

[Image: okay_1]

None of the images are corrupted. Looking through a bit of the code for upload-subjects I'm guessing it is related to this function here: https://github.com/zooniverse/panoptes-cli/blob/cb2e6fc3a17644055102f396344f8390c3878d3f/panoptes_cli/commands/subject_set.py#L276-L284

but I'm not 100% sure.

lcjohnso commented 4 years ago

Hi @mfidino -- Let me start with a couple clarifications about the CLI's behavior:

I'd argue that the current behavior of the code is a feature, not a bug: it is better to raise errors about potentially missing media files than to unexpectedly upload subjects that are missing data.

Also note: if you had ordered your manifest differently (e.g., starting with an entry where only three image filenames were included), the code would have run successfully but in an unexpected way: the file_name_4 column would have been interpreted as text metadata, and all subjects would have been uploaded with only 3 JPG files each.

There is a workaround for your case where the number of media files varies per subject: create a new manifest file for each group of subjects depending on the number of media files (e.g., those with 2 JPG images, those with 3 JPG images, those with 4 JPG images) and use the CLI to upload each batch, one at a time. Each upload can point to the same subject set, so it just requires a little additional bookkeeping when preparing the data for upload.
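
The bookkeeping for this workaround can be scripted. Below is a minimal sketch, not part of panoptes-cli: the function name `split_manifest_by_media_count`, the output file naming, and the idea of passing the media column names explicitly are all assumptions made for illustration. It groups manifest rows by how many media columns are filled in, writes one manifest per group, and shifts each row's filenames into the first columns so that no batch contains an empty media cell.

```python
import csv
from collections import defaultdict
from pathlib import Path


def split_manifest_by_media_count(manifest_path, media_columns, out_dir):
    """Write one manifest per distinct number of media files per subject.

    Hypothetical helper (not part of panoptes-cli). `media_columns` lists the
    manifest columns that hold media filenames, in order.
    Returns a dict mapping media count -> path of the manifest written.
    """
    groups = defaultdict(list)
    with open(manifest_path, newline="") as src:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        missing = [c for c in media_columns if c not in fieldnames]
        if missing:
            raise ValueError(f"columns not found in manifest: {missing}")
        for row in reader:
            # Collect only the media cells that actually contain a filename.
            files = [row[c].strip() for c in media_columns
                     if (row.get(c) or "").strip()]
            groups[len(files)].append((row, files))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = {}
    for count, entries in sorted(groups.items()):
        # Keep only as many media columns as this batch needs.
        kept = media_columns[:count]
        header = [c for c in fieldnames if c not in media_columns or c in kept]
        out_path = out_dir / f"manifest_{count}_images.csv"
        with open(out_path, "w", newline="") as dst:
            writer = csv.DictWriter(dst, fieldnames=header)
            writer.writeheader()
            for row, files in entries:
                out_row = {c: row[c] for c in header if c not in kept}
                # Shift filenames left so every kept media column is filled.
                out_row.update(zip(kept, files))
                writer.writerow(out_row)
        written[count] = out_path
    return written
```

Each resulting file can then be uploaded with the same command as above, one batch at a time, pointing at the same subject set. Note that a subject with no media at all ends up in `manifest_0_images.csv`, which is meant for review rather than upload, and that shifting filenames left changes which column a given image sits in.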