yalelibrary / YUL-DC

Preliminary issue tracking for Yale University Libraries Digital Collections project

Investigate Checksum Locations #2900

Closed laurenb33 closed 2 weeks ago

laurenb33 commented 4 months ago

While working on the Nightly Job Integrity check dev work, we discovered that our migrated children in DCS don't have checksums! However, @mikeapp mentioned during standup on Friday, 7/19 that there may be checksums somewhere in Ladybird. This ticket is to investigate whether there are checksums in Ladybird. If there are, where are they, and how could they be imported to DCS? These questions are also listed below in the acceptance criteria.

Acceptance

Please investigate/answer the following questions:

Are there checksums in Ladybird?
If there are, where are they?
How could they be imported to DCS?

mikeapp commented 3 months ago

I found a spreadsheet that lists the field in Ladybird as: MDfive Checksum [fdid=306]

laurenb33 commented 3 months ago

Screenshot of File Checksums in LB SQL: Image

jillpe commented 3 months ago

Need to manually check a couple to see if they match the originals rather than the derivatives

sshetenhelm commented 4 weeks ago

@martinlovell Were you able to check out these checksums?

martinlovell commented 4 weeks ago

Not yet, I have a reminder but haven't gotten to it.

martinlovell commented 3 weeks ago

Here are the md5s, and sometimes the sha256s, from the c#_file tables. Some have "PRIMARY" and some "DERIVE" for the images. Depending on where we got the image, either may be a match. (Guess: if it's from Fedora, then it seems like it might be DERIVE. If we got the original image, then it might be PRIMARY.)

The columns in the CSV are collection, OID, label, _md5, _sha256 from the c#_file tables.

Collection1.csv Collection2.csv Collection3.csv Collection4.csv Collection6.csv Collection7.csv Collection9.csv Collection10.csv Collection11.csv Collection12.csv Collection13.csv Collection14.csv Collection15.csv Collection16.csv
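As a rough illustration of how these CSVs could be read for later comparison, here is a minimal Python sketch. It assumes the files have no header row and that the columns appear in the order given above (collection, OID, label, _md5, _sha256), with the _sha256 column empty when no value was stored. The names `ChecksumRow` and `load_checksums` are hypothetical and not part of DCS or Ladybird.

```python
import csv
from typing import Dict, NamedTuple, Optional, TextIO, Tuple


class ChecksumRow(NamedTuple):
    """One row of a collection CSV (hypothetical structure)."""
    collection: str
    oid: str
    label: str              # e.g. "PRIMARY" or "DERIVE"
    md5: str
    sha256: Optional[str]   # None when the column is empty


def load_checksums(handle: TextIO) -> Dict[Tuple[str, str], ChecksumRow]:
    """Index rows by (OID, label) so both PRIMARY and DERIVE entries are kept.

    Assumes no header row and the column order described in the comment above.
    """
    rows: Dict[Tuple[str, str], ChecksumRow] = {}
    for record in csv.reader(handle):
        if len(record) < 4:
            continue  # skip blank or incomplete lines
        collection, oid, label, md5 = (field.strip() for field in record[:4])
        sha256 = record[4].strip() if len(record) > 4 else ""
        rows[(oid, label)] = ChecksumRow(
            collection, oid, label, md5.lower(), sha256.lower() or None
        )
    return rows
```

Keying on both OID and label keeps the PRIMARY and DERIVE checksums side by side, since either one may turn out to match the file DCS holds.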

martinlovell commented 2 weeks ago

Checked a couple random oids:

c1_file._md5
  1000151.tif does not match
  1000234.tif does not match
  1102347.tif matches (8070d7a362dba2e3250907a82647002d) md5
  17171234.tif matches (a1fa1ceb6508d2c4136180a819bd4f6a) md5

c4_file._md5
  14795346.tif matches (6fdd9c105be034e8ff4c5350f2aec760)
  14795349.tif matches (cc4ad072a6e9915752c88174b6778d8a)

c9_file._md5
  15479454.tif matches (54807b7d9dc34dcde655556b0f7bcc9b)
  15479474.tif matches (c089cab30ac90b8a1b25c09f22ee8426)

c16_file._md5 and _sha256
  11400908.tif matches both (30e22d740b0db98dc94b0d49aaceb41c)
  11400912.tif matches both

So... it started out a little discouraging, but after that everything matched. For the ones that didn't match, the file size differs. (File size is also stored in Ladybird.)
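A spot check like the one above could be scripted roughly as follows. This is only a sketch under stated assumptions: the stored md5, sha256, and file size are passed in by the caller (how they are fetched from Ladybird is not shown), and the function names `file_digests` and `verify_file` are hypothetical. It compares the file size first, since the mismatches above also differed in size, and only then hashes the file.

```python
import hashlib
import os
from typing import Optional, Tuple


def file_digests(path: str, chunk_size: int = 1024 * 1024) -> Tuple[str, str]:
    """Return (md5, sha256) hex digests, reading the file in chunks."""
    md5 = hashlib.md5()
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
            sha256.update(chunk)
    return md5.hexdigest(), sha256.hexdigest()


def verify_file(
    path: str,
    expected_md5: str,
    expected_sha256: Optional[str] = None,
    expected_size: Optional[int] = None,
) -> str:
    """Compare a file against stored values (hypothetical helper)."""
    # A cheap size comparison can rule out a match before hashing a large TIFF.
    if expected_size is not None and os.path.getsize(path) != expected_size:
        return "size differs"
    md5, sha256 = file_digests(path)
    if md5 != expected_md5.lower():
        return "md5 does not match"
    if expected_sha256 and sha256 != expected_sha256.lower():
        return "sha256 does not match"
    return "matches both" if expected_sha256 else "matches"
```

Reading in chunks keeps memory use flat for large files, and computing both digests in one pass avoids reading each file twice when a sha256 is also stored.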

sshetenhelm commented 2 weeks ago

Given that matching/ingesting all migrated content with/into Preservica in the most optimized way for DCS will likely be a long-term goal, we would like to move forward with using these checksums to verify migrated files.

I'll make new tickets for future work, since this ticket was technically just to "investigate."