yalelibrary / YUL-DC

Preliminary issue tracking for Yale University Libraries Digital Collections project

Migrate 2016 OwP Objects from MS 2004 into UAT (PID 39) #2787

Open sshetenhelm opened 8 months ago

sshetenhelm commented 8 months ago

Story: Migrate 1944 parent objects from the Kissinger MS 2004 collection (PID 39) from FindIt to DCS. Parent OIDs are in the attached file: MS2004_ForMigration.csv

Acceptance

MaggieZhaoYale commented 6 months ago

Issue when creating parent objects: failed to convert PTIFF due to Expected file originals xxxx not found.

MaggieZhaoYale commented 6 months ago

@sshetenhelm It looks like an issue in UAT. I tried to create one in prod, and it went through fine.

MaggieZhaoYale commented 6 months ago

162/1944 parents still have some expected original PTIFFs not found on S3, even though these images are in the pairtree. https://collections-uat.library.yale.edu/management/batch_processes/2004

4 parents are complex parents, downloading images of their children's children ...
12481487
12481219
12479863
12479629

Three child images:
Child Object 14775661 failed to convert PTIFF due to Conversion script exited with error code 1.
---
Script exited with status 1 at line 54
Script exited with status 1 at line 58
---
(vips:14261): VIPS-WARNING **: 20:38:50.063: error in tile 0 x 2862
TIFFFillStrip: Read error at scanline 2860; got 8553 bytes, expected 21288
tiff2vips: read error

14775661 14740525 14759134

sshetenhelm commented 5 months ago

Looking at 14863855

In MGMT UAT, the Title in the JSON is

"title": [
    "Zo-Zu, Image 1"

This is a child of parent 12479629, which was created in MGMT UAT but has 0 child images generated due to missing images. The child OID for this is 14863855 in MGMT UAT; however, in Ladybird, it is 17488697.

I will review more of these failed parents to see if other similar things are happening. However, I don’t know why the child OID is different from Ladybird.

sshetenhelm commented 5 months ago

Deleted 381 parents that were actually LB children - https://collections-uat.library.yale.edu/management/batch_processes/2036

MaggieZhaoYale commented 5 months ago

The parent 12479629 has 0 child images. Its 74 children are archivalDigitized, and all have images. It looks like all OIDs of the 74 children in MGMT are different from those in Ladybird. The children of this parent in Ladybird UAT have different OIDs from those in Ladybird PROD:
https://metadata-api-uat.library.yale.edu/metadatacloud/api/1.0.1/ladybird/oid/12479629?include-children=1&mediaType=json
VS
https://metadata-api.library.yale.edu/metadatacloud/api/1.0.1/ladybird/oid/12479629?include-children=1&mediaType=json
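A minimal sketch of how the two API responses could be compared once they have been downloaded and parsed. The payload shape is an assumption: the key names `children` and `oid` are hypothetical and would need to be adjusted to whatever the MetadataCloud response actually contains. The sample payloads are made up and are not real API output.

```python
from typing import Any


def child_oids(record: dict[str, Any], children_key: str = "children",
               oid_key: str = "oid") -> set[int]:
    """Collect child OIDs from one parsed API response.

    ASSUMPTION: children are a list of dicts under `children_key`, each
    with an `oid_key` field. Both names are hypothetical.
    """
    return {int(child[oid_key]) for child in record.get(children_key, [])}


def compare_children(uat: dict[str, Any], prod: dict[str, Any]) -> dict[str, list[int]]:
    """Report which child OIDs appear in only one environment or in both."""
    uat_oids, prod_oids = child_oids(uat), child_oids(prod)
    return {
        "only_in_uat": sorted(uat_oids - prod_oids),
        "only_in_prod": sorted(prod_oids - uat_oids),
        "in_both": sorted(uat_oids & prod_oids),
    }


# Made-up sample payloads for illustration only.
uat_sample = {"oid": 1, "children": [{"oid": 101}, {"oid": 102}]}
prod_sample = {"oid": 1, "children": [{"oid": 201}, {"oid": 102}]}
report = compare_children(uat_sample, prod_sample)
print(report)
```

If `in_both` comes back empty for a parent, that would match the situation described here, where every child OID differs between the two environments.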

sshetenhelm commented 5 months ago

So, if I am understanding correctly, this means that the object is structured differently in UAT Ladybird than Prod Ladybird?

In Prod, I see:
Parent - 12479629
Child 1 - 17488697 - image
Child 2 - 17488698 - image
etc.

I don't have permissions for Ladybird UAT, although the record looks the same in FindIt UAT vs. FindIt PROD.

My inclination is to not let this single parent be a blocker for migrating the rest of the collection, so perhaps we should just roll with this one as it stands, and try it in PROD, then troubleshoot with Josh from there if it doesn't work.

sshetenhelm commented 5 months ago

The following parents have 0 child images:
12479629 (the problematic Zo-Zu parent mentioned above)
12479863 - EDIT: should have images
12481219 - EDIT: Hierarchical; will delete from UAT
12481487 - EDIT: should have images

The following parents are missing 1 child image:
12482030 - missing child order 57, oid 14775661
12482044 - missing child order 27, oid 14740525
12482081 - missing child order 140, oid 14759134

I will check Ladybird to see if images were attached to these in Ladybird/FindIt.

EDIT: The three missing children should all have tifs/jpgs attached as per Ladybird, but child 14740525's tif filesize is only 3152 kb, which is much smaller than the others.
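As a rough sketch of the file size check mentioned above, the snippet below flags files that are much smaller than the median of their siblings. The 0.5 ratio, the function name, and the file names are assumptions for illustration. Apart from the 3152 KB figure, the sizes are made up. A small file is only a hint that a scan may be truncated, not proof.

```python
from statistics import median


def flag_small_files(sizes_kb: dict[str, int], ratio: float = 0.5) -> list[str]:
    """Return names of files smaller than `ratio` times the median size.

    ASSUMPTION: `ratio` is an arbitrary threshold chosen for illustration.
    """
    if not sizes_kb:
        return []
    cutoff = median(sizes_kb.values()) * ratio
    return sorted(name for name, size in sizes_kb.items() if size < cutoff)


# Hypothetical file names; only 3152 comes from the thread.
sample_sizes = {"a.tif": 21000, "b.tif": 20500, "c.tif": 3152, "d.tif": 22000}
suspects = flag_small_files(sample_sizes)
print(suspects)
```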

sshetenhelm commented 4 months ago

ONE: @MaggieZhaoYale could you please try and pull in the following three children?
12482030 - missing child order 57, oid 14775661
12482044 - missing child order 27, oid 14740525
12482081 - missing child order 140, oid 14759134

If they won't migrate, I'll see if we can get original images from elsewhere.

TWO: We will address the three parents with 0 images in UAT when we migrate to PROD, due to discrepancies between OIDs across Ladybird UAT and PROD.

THREE: @MaggieZhaoYale , could we start pulling in text files for these objects, when you have a chance? Thank you!

MaggieZhaoYale commented 4 months ago

@sshetenhelm Although the images of the above 3 can be downloaded from Fedora, all have conversion issues, e.g. https://collections-uat.library.yale.edu/management/batch_processes/2081/parent_objects/12482081/child_objects/14759134

sshetenhelm commented 4 months ago

Pulled three child images from Preservica; they also have file issues. Have asked Josh if he knows of any other copies of these scans. We may need to bookmark these for solving in PROD later, especially if we need to have the pages re-scanned.

EDIT: Asking IT for access to full Kissinger storage share, to see if images work there.

EDIT 2: These three images are also broken in the Kissinger Storage @ Yale share, so we will need to request to have the pages re-digitized. I will make a backlog ticket to address these issues once the pages have been re-digitized, and we will move forward with the migration process with that in mind.

sshetenhelm commented 3 months ago

Working through OCR for partial text objects.

sshetenhelm commented 3 months ago

@MaggieZhaoYale Could you please re-try migrating the text files for these parents? Thank you!

MS2004-OwP-None.csv

sshetenhelm commented 2 months ago

After comparing UAT to Aspace, it looks like we have 73 outstanding MS 2004 objects that were not migrated. 1 is OwP (12482156), 72 are public.

csv: MS2004-Additions.csv
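A minimal sketch of the kind of comparison described above: find OIDs that are listed in one export but absent from the other. It assumes each export is a CSV with an `oid` column, which is a hypothetical column name. The inline sample data is made up.

```python
import csv
import io


def read_oids(csv_text: str, column: str = "oid") -> set[str]:
    """Read non-empty values of `column` from CSV text.

    ASSUMPTION: the column name "oid" is hypothetical.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {(row.get(column) or "").strip() for row in reader} - {""}


def missing_from_uat(aspace_csv: str, uat_csv: str) -> list[str]:
    """OIDs present in the ASpace export but not in the UAT export."""
    return sorted(read_oids(aspace_csv) - read_oids(uat_csv))


# Made-up sample exports for illustration only.
aspace_sample = "oid,title\n1,First\n2,Second\n3,Third\n"
uat_sample_csv = "oid\n1\n3\n"
outstanding = missing_from_uat(aspace_sample, uat_sample_csv)
print(outstanding)
```

For real files, the CSV text could be read with `open(path, newline="").read()` before being passed in.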

Outstanding actions:

MaggieZhaoYale commented 2 months ago

https://collections-uat.library.yale.edu/management/batch_processes/2202 Please ignore the 140 failed parents, which were deleted. No OCR was found for the 73 parents.

sshetenhelm commented 2 months ago

@MaggieZhaoYale First round of replacement OCR .txt files (~10,714) available at: FC_YULDCS-807001-YUL > DCS_MaggieandSummer > Kissinger_OCR > MS2004-01

sshetenhelm commented 1 month ago

Next files: FC_YULDCS-807001-YUL > DCS_MaggieandSummer > Kissinger_OCR > MS2004-02

sshetenhelm commented 1 month ago

Thank you @MaggieZhaoYale ! MS2004-03 is uploaded in the same place, and MS2004-04 should be transferred into our share by the end of day (Mon 10/14)

sshetenhelm commented 1 month ago

Thank you @MaggieZhaoYale ! Here are the next two batches; the folders are in the same location on the shared drive:

sshetenhelm commented 1 month ago

Thank you @MaggieZhaoYale !!!

We are down to 10 OwP objects that have 'None' for OCR. Not sure how I missed those but I will wrangle the files.

~ In theory ~, all the partials should be in FC_YULDCS-807001-YUL > DCS_MaggieandSummer > Kissinger_OCR > MS2004-Partials

Once we do the partials, I'll look at the entire collection again and provide both the missing Text = None and any remaining Text = Partial

sshetenhelm commented 1 month ago

Down to:
10 OwP objects with Full Text = None
4 OwP objects with Full Text = Partial

Gathering files now.

sshetenhelm commented 1 month ago

@MaggieZhaoYale - The final OCR files are in FC_YULDCS-807001-YUL > DCS_MaggieandSummer > Kissinger_OCR > MS2004-FINAL

Could you please upload them when you have a chance?

sshetenhelm commented 1 month ago

Totals

Discrepancy of 8 parents due to the following:

Corrupted image files on server; requesting/have requested rescans
12479779 - Missing child order 143, oid 15035394
12482030 - Missing child order 57, oid 14775661
12482044 - Missing child order 27, oid 14740525
12482081 - Missing child order 140, oid 14759134

Missing all images; different OIDs in Ladybird UAT vs PROD
12479629 - Box 115 | Folder 7
12479863 - Box 145 | Folder 5
12481487 - Box 213 | Folder 9

Missing some images
12482156 - Box 35 | Folder 2 - Missing images for children 74 and 83 through 93

I will pass the collection off to the stakeholder to review and ask if there is any reason why we can't provide the images for the last four parents (e.g. are images missing for a reason?)

jillpe commented 1 month ago

Waiting for feedback from others

sshetenhelm commented 5 days ago

Starting the first week of December, we will begin moving this into PROD, but under "PRIVATE" until January.