yalelibrary / YUL-DC

Preliminary issue tracking for Yale University Libraries Digital Collections project

Investigate Integrity Check on Test #2933

K8Sewell closed 1 month ago

K8Sewell commented 2 months ago

Summary

On the Test env, nearly the entire sample has failed the integrity check for several days running. On UAT the integrity check job appears to be working as expected.

Acceptance Criteria

laurenb33 commented 1 month ago

On 10/7, we discussed increasing the number of objects in the integrity check for production and limiting the objects checked in the other environments to around 100, to make sure the check is operating properly.

jpengst commented 1 month ago

Ok, I found the bug. It looks like we have thousands of parent objects in test, uat, and prod that have a nil digital_object_source. In these cases, the integrity check will not grab those parents. We added the "None" default value for digital_object_source in June 2023, so there are still quite a few records that have not been updated since. We'll probably need to shell into the server to get the exact number.

I mentioned in standup on Monday that test is consistently only grabbing 127 parent objects each time instead of 2000. This is because there are only 127 parent objects in test that have a correct "None" digital_object_source, so the integrity check is only checking those 127 objects.

I can modify the query to also include parents with a nil digital_object_source if we want? Or we can work on updating all the records with a nil digital_object_source. Thoughts?
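To make the two options concrete, here is a minimal sketch of the selection logic in plain Python. It is not the project's actual query: the records, the `oid` field, the "Other" source value, and both function names are hypothetical, and it only assumes, as described above, that the check currently matches the literal "None" value and skips records where the field was never set.

```python
# Hypothetical parent records; only the digital_object_source field name
# comes from the thread, everything else is made up for illustration.
parents = [
    {"oid": 1, "digital_object_source": "None"},
    {"oid": 2, "digital_object_source": None},    # never updated since the default was added
    {"oid": 3, "digital_object_source": "Other"}, # hypothetical non-default source
    {"oid": 4, "digital_object_source": None},
]


def eligible_strict(records):
    """Mimic a filter that matches only the literal string "None"."""
    return [r for r in records if r["digital_object_source"] == "None"]


def eligible_with_nil(records):
    """Also accept records whose source was never set (nil in Ruby, None here)."""
    return [r for r in records if r["digital_object_source"] in ("None", None)]


strict_oids = [r["oid"] for r in eligible_strict(parents)]
relaxed_oids = [r["oid"] for r in eligible_with_nil(parents)]
print(strict_oids, relaxed_oids)
```

The alternative, backfilling every nil record with "None", would leave the strict filter untouched and fix the data instead.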

jpengst commented 1 month ago

I think the network file system mounts on test fell over. We're getting a stale file handle error when trying to access the /data folder: cannot access /data/00: Stale file handle

jpengst commented 1 month ago

@thescreamingmandrake
ec2-user@ip-10-5-69-135

martinlovell commented 1 month ago

It looks like all 11 disconnected:

find /data
/data
find: ‘/data/00’: Stale file handle
/data/00
find: ‘/data/01’: Stale file handle
/data/01
find: ‘/data/02’: Stale file handle
/data/02
find: ‘/data/03’: Stale file handle
/data/03
find: ‘/data/04’: Stale file handle
/data/04
find: ‘/data/05’: Stale file handle
/data/05
find: ‘/data/06’: Stale file handle
/data/06
find: ‘/data/07’: Stale file handle
/data/07
find: ‘/data/08’: Stale file handle
/data/08
find: ‘/data/09’: Stale file handle
/data/09
find: ‘/data/10’: Stale file handle
/data/10

(I don't think data/10 is really needed, but that's the way it is.)
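For anyone who wants to script this check rather than eyeball the find output, a rough sketch follows. It assumes a stale mount surfaces in Python as an OSError with errno ESTALE when the path is stat'ed; the function name `stale_mounts` is made up, and the mount list is just /data/00 through /data/10 as seen above.

```python
import errno
import os


def stale_mounts(paths):
    """Return the paths whose stat() fails with ESTALE (stale file handle)."""
    stale = []
    for path in paths:
        try:
            os.stat(path)
        except OSError as exc:
            if exc.errno == errno.ESTALE:
                stale.append(path)
            else:
                # Missing paths, permission errors, etc. are a different problem.
                raise
    return stale


# Mount points named in the find output: /data/00 through /data/10.
mount_points = [f"/data/{i:02d}" for i in range(11)]
```

Calling `stale_mounts(mount_points)` on an affected instance should list the disconnected mounts; on a healthy one it should return an empty list.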

thescreamingmandrake commented 1 month ago

it looks like ip-10-5-69-135 was terminated and replaced with a new ec2 instance - is this still an issue? I do not see it happening on either running instance in test: ip-10-5-68-24.ec2.internal or ip-10-5-69-204.ec2.internal

thescreamingmandrake commented 1 month ago

new mounts created, ASG instruct updated - https://github.com/yalelibrary/yul-dc-camerata/pull/389