On 10/7, we discussed increasing the number of objects in the integrity check for production and limiting the objects checked in other environments to around 100 to make sure the check is operating properly.
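If the per-environment sample size is driven by configuration, the split could look something like the sketch below; the variable name is purely hypothetical and not the project's actual setting:

```sh
# Hypothetical per-environment sample sizes; the variable name is
# illustrative, not the project's real configuration key.
export INTEGRITY_CHECK_SAMPLE_SIZE=2000   # production: check more objects
export INTEGRITY_CHECK_SAMPLE_SIZE=100    # test/uat: small sample while validating
```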
Ok, I found the bug. It looks like we have thousands of parent objects in test, uat, and prod that have a nil `digital_object_source`. In these cases, the integrity check will not grab those parents. We added the "None" default value for `digital_object_source` in June 2023, so there are still quite a few records that have not been updated since. We'll probably need to shell into the server to get the exact number.
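A quick way to get that count would be a one-liner run on the app server. This is only a sketch, assuming a Rails app with a `ParentObject` model (the model name and runner invocation are assumptions, not the project's verified setup):

```sh
# Sketch: count parent objects whose digital_object_source is nil.
# Assumes a Rails app with a ParentObject model; run from the app root.
bundle exec rails runner 'puts ParentObject.where(digital_object_source: nil).count'
```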
I mentioned in standup on Monday that test is consistently grabbing only 127 parent objects each time instead of 2000. That's because there are only 127 parent objects in test with a correct "None" `digital_object_source`, so the integrity check is only checking those 127 objects.

I can modify the query to also include parents with a nil `digital_object_source` if we want, or we can work on updating all the records with a nil `digital_object_source`. Thoughts?
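For illustration, the two options could look roughly like this. This is a hedged sketch assuming an ActiveRecord `ParentObject` model, not the project's actual code:

```sh
# Option 1 - widen the query. In ActiveRecord, where(col: ["None", nil])
# generates "col = 'None' OR col IS NULL", so nil parents join the sample.
bundle exec rails runner 'puts ParentObject.where(digital_object_source: ["None", nil]).count'

# Option 2 - backfill the missing values with the "None" default.
# Note: update_all skips callbacks and validations, which is usually
# acceptable for a one-time backfill.
bundle exec rails runner 'ParentObject.where(digital_object_source: nil).update_all(digital_object_source: "None")'
```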
I think the network file system mounts on test fell over. We're getting a stale file handle error when trying to access the /data folder: `cannot access /data/00: Stale file handle`
@thescreamingmandrake
From ec2-user@ip-10-5-69-135, it looks like all 11 mounts are disconnected:

```
$ find /data
/data
find: ‘/data/00’: Stale file handle
/data/00
find: ‘/data/01’: Stale file handle
/data/01
find: ‘/data/02’: Stale file handle
/data/02
find: ‘/data/03’: Stale file handle
/data/03
find: ‘/data/04’: Stale file handle
/data/04
find: ‘/data/05’: Stale file handle
/data/05
find: ‘/data/06’: Stale file handle
/data/06
find: ‘/data/07’: Stale file handle
/data/07
find: ‘/data/08’: Stale file handle
/data/08
find: ‘/data/09’: Stale file handle
/data/09
find: ‘/data/10’: Stale file handle
/data/10
```
(I don't think /data/10 is really needed, but that's the way it is.)
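For reference, the generic manual recovery for stale NFS handles is to force-unmount and remount each mount point. This is only a sketch (it assumes the mounts are defined in /etc/fstab) and is not the fix that was actually applied here, since the instance was replaced instead:

```sh
# Generic stale-NFS-handle recovery sketch; assumes /etc/fstab entries
# exist for each mount point. Not the fix applied in this issue.
for d in /data/0{0..9} /data/10; do
  sudo umount -f -l "$d"   # force + lazy unmount the stale mount point
  sudo mount "$d"          # remount from the /etc/fstab entry
done
```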
It looks like ip-10-5-69-135 was terminated and replaced with a new EC2 instance - is this still an issue? I do not see it happening on either running instance in test: ip-10-5-68-24.ec2.internal or ip-10-5-69-204.ec2.internal.
New mounts created, ASG instructions updated - https://github.com/yalelibrary/yul-dc-camerata/pull/389
## Summary

On the Test environment, there are several days of integrity checks where nearly the entire sample failed. On UAT, the integrity check job appears to be working as expected.

## Acceptance Criteria