ytsapras / robonet_site

Django RoboNet operational database.
GNU General Public License v2.0

Images being ingested multiple times #23

Closed rachel3834 closed 7 years ago

rachel3834 commented 7 years ago

Last week, Etienne found unexpected behaviour in the update_db_2 image ingestion function. We expected that it would not re-ingest an image that was already in the DB, but this appears not to be the case. I've written a management command (verify_image_ingestion.py) to check this; the attached log it produced records multiple instances of individual images in the DB.
Are we correct to think this is not the intended behaviour?

image_ingestion.txt
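
For readers following along, the kind of check described above can be sketched in a few lines. This is only an illustration and not the contents of verify_image_ingestion.py: the function name `find_duplicate_images` and the example file names are hypothetical, and a plain list stands in for the image table.

```python
from collections import Counter


def find_duplicate_images(image_names):
    """Return {image_name: count} for every name that appears more than once."""
    counts = Counter(image_names)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical image names standing in for rows of the image table.
rows = [
    "frame_a.fits",
    "frame_b.fits",
    "frame_a.fits",
    "frame_c.fits",
    "frame_a.fits",
]

duplicates = find_duplicate_images(rows)
for name, n in sorted(duplicates.items()):
    print(f"{name}: {n} entries")
```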

ytsapras commented 7 years ago

There used to be a 'catch' for image duplication in the image ingestion function but if I recall correctly there were some objections (images may come from different pipeline reductions) so I removed it. Here is what we can do:

1) Add a 'check if image exists' function before adding any new images. This runs into the problem above. But the new version of the database does not contain any 'reduced by pipeline x' field so it may be ok, assuming we are only ingesting images from a single pre-processing pipeline.
2) Etienne maintains a file with all the images that have been ingested. Any function that adds images should first check that file. This worked fine in the past but I suppose it was lost when we had the disk problems.

I am happy with either option. Just let me know.
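
For comparison, option 2 (a file listing ingested images) might look roughly like the sketch below. The ledger format (one image name per line) and the helper names `load_ingested` and `record_if_new` are assumptions made for illustration; the original file's layout is not described in this thread.

```python
from pathlib import Path


def load_ingested(ledger_path):
    """Read the ledger (assumed: one image name per line) into a set."""
    path = Path(ledger_path)
    if not path.exists():
        return set()
    return {line.strip() for line in path.read_text().splitlines() if line.strip()}


def record_if_new(ledger_path, image_name):
    """Append image_name to the ledger and return True if it was not yet listed.

    Returns False, leaving the ledger untouched, if the image is already recorded.
    """
    if image_name in load_ingested(ledger_path):
        return False
    with open(ledger_path, "a") as ledger:
        ledger.write(image_name + "\n")
    return True
```

A caller would ingest an image only when `record_if_new` returns True. The weakness mentioned above still applies: if the ledger file is lost, the record of what has been ingested goes with it.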

rachel3834 commented 7 years ago

I think option 1 is the best. After all, the purpose of the DB is being able to refer to it, rather than keep separate lists. It is true that there are some cases where we may want to add different reduction products for the same image to the DB, so I think your add_image function is behaving as it should. But in the case of reception which is handling the same reduction products (BANZAI pre-processed frames), we should have a check_image_exists function. Reception should call this first, and decide whether or not to use add_image on the basis of the output. I will develop that function now.
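
The check-then-add flow proposed here could be sketched as follows. This is a standalone illustration using an in-memory SQLite table rather than the project's Django models: the table layout, the `image_name` column, the function signatures, and the `receive_image` wrapper are all assumptions, not the actual robonet_site code.

```python
import sqlite3


def check_image_exists(conn, image_name):
    """Return True if at least one row with this image name is already stored."""
    row = conn.execute(
        "SELECT 1 FROM image WHERE image_name = ? LIMIT 1", (image_name,)
    ).fetchone()
    return row is not None


def add_image(conn, image_name):
    """Insert unconditionally, so callers may still add other reduction products."""
    conn.execute("INSERT INTO image (image_name) VALUES (?)", (image_name,))


def receive_image(conn, image_name):
    """Reception-side wrapper (hypothetical): only add images not yet in the DB."""
    if check_image_exists(conn, image_name):
        return False
    add_image(conn, image_name)
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (id INTEGER PRIMARY KEY, image_name TEXT)")

first = receive_image(conn, "frame_a.fits")   # new image, gets inserted
second = receive_image(conn, "frame_a.fits")  # same image again, skipped
n_rows = conn.execute("SELECT COUNT(*) FROM image").fetchone()[0]
```

Keeping the existence check outside `add_image` matches the reasoning above: `add_image` stays permissive, and only reception decides to skip frames it has already seen.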

rachel3834 commented 7 years ago

I've implemented a management command to check the image table for duplicates and remove them, and applied this to the operational database. Attached are log files comparing the DB contents before and after the duplicates are removed.
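
As a rough idea of what such a clean-up involves, here is a minimal sketch that keeps the earliest row for each image name and deletes the rest. It uses an in-memory SQLite table for illustration; the table and column names and the "keep the lowest id" rule are assumptions, and the actual management command may choose which duplicate to keep differently.

```python
import sqlite3


def remove_duplicate_images(conn):
    """Delete all but the lowest-id row per image_name; return rows removed."""
    cur = conn.execute(
        "DELETE FROM image "
        "WHERE id NOT IN (SELECT MIN(id) FROM image GROUP BY image_name)"
    )
    conn.commit()
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE image (id INTEGER PRIMARY KEY, image_name TEXT)")
# Hypothetical contents with repeated ingestions of the same frames.
conn.executemany(
    "INSERT INTO image (image_name) VALUES (?)",
    [("frame_a.fits",), ("frame_b.fits",), ("frame_a.fits",),
     ("frame_a.fits",), ("frame_b.fits",)],
)

removed = remove_duplicate_images(conn)
remaining = [r[0] for r in conn.execute("SELECT image_name FROM image ORDER BY id")]
```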

I've also added a function to the query_db.py module, called check_image_in_db - Etienne, you can go ahead and use this function in reception_data.py

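image_ingestion_test_before.txt https://github.com/ytsapras/robonet_site/files/1294745/image_ingestion_test_before.txt
image_ingestion_test_after.txt https://github.com/ytsapras/robonet_site/files/1294746/image_ingestion_test_after.txt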

ebachelet commented 7 years ago

Hi Rachel

Thanks a lot for implementing this. I will adapt reception_data.py.

Cheers
