Open stvoutsin opened 1 year ago
Initial check on the live service and all the CephFS mounts are healthy.
@stvoutsin can you find some error messages in the logs that indicate what is causing the problem? Something we can put in an issue at Cambridge to tell them what to look for.
Just did another test deploy, and the mounts seem to work ok, but the hash value for some of the shares is set as FAILED. Is this expected?
TASK [Linking data directories] ************************************************
PLAY RECAP *********************************************************************
worker01 : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
worker02 : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
worker03 : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
zeppelin : ok=0 changed=0 unreachable=0 failed=0 skipped=1 rescued=0 ignored=0
/
Share [/data/gaia/GEDR3/GEDR3_2048_GAIASOURCE]
Count [PASS]
Hash [PASS]
Share [/data/gaia/GEDR3/GEDR3_2048_PS1_BEST_NEIGHBOURS]
Count [PASS]
Hash [PASS]
Share [/data/gaia/GEDR3/GEDR3_2048_ALLWISE_BEST_NEIGHBOURS]
Count [PASS]
Hash [PASS]
Share [/data/gaia/GEDR3/GEDR3_2048_2MASSPSC_BEST_NEIGHBOURS]
Count [PASS]
Hash [PASS]
Share [/data/gaia/GDR3/GDR3_ASTROPHYSICAL_PARAMETERS]
Count [PASS]
Hash [FAIL][null][27ef862b779049eafbc6764f57262eb8]
Share [/data/gaia/GDR3/GDR3_ASTROPHYSICAL_PARAMETERS_SUPP]
Count [PASS]
Hash [FAIL][null][cbc0ceba50979a60d83b401b5fd18a8c]
Share [/data/gaia/GDR3/GDR3_GAIASOURCE]
Count [PASS]
Hash [FAIL][null][bbfabec832404f6193ab0036a215d83b]
Share [/data/gaia/GDR3/GDR3_RVS_MEAN_SPECTRUM]
Count [PASS]
Hash [FAIL][null][d9d071325ef336ae464e281bfca1a99c]
Share [/data/gaia/GDR3/GDR3_XP_CONTINUOUS_MEAN_SPECTRUM]
Count [PASS]
Hash [FAIL][null][7d383e248ef2330acdbf468705e5915f]
Share [/data/gaia/GDR3/GDR3_XP_SAMPLED_MEAN_SPECTRUM]
Count [PASS]
Hash [FAIL][null][4a80fcb94bbc0a070b299016accc4ead]
Share [/data/gaia/GDR3/GDR3_XP_SUMMARY]
Count [PASS]
Hash [FAIL][null][fae64c68fc6f5276ae69af0288b69701]
Share [/data/gaia/GDR3/GDR3_PS1_BEST_NEIGHBOURS]
Count [PASS]
Hash [FAIL][null][36f5d1a4098a2e1a006bf5adce6228d0]
Share [/data/gaia/GDR3/GDR3_ALLWISE_BEST_NEIGHBOURS]
Count [PASS]
Hash [FAIL][null][99edd9a489d76d06190a772dc2d819fe]
Share [/data/gaia/GDR3/GDR3_2MASSPSC_BEST_NEIGHBOURS]
Count [PASS]
Hash [FAIL][null][01f469f33798bdad68a028d2749ef3a2]
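To turn a report like the one above into a short list for an issue, something along these lines might help. This is only a sketch: it assumes the report is plain text in exactly the `Share [...]` / `Count [...]` / `Hash [...]` layout pasted here, and the function name `failed_checks` is made up for illustration. The two extra bracketed fields on a failed `Hash` line are kept verbatim, since the thread does not say what they represent.

```python
import re

# Patterns for the three line types seen in the pasted report.
SHARE_RE = re.compile(r"^Share \[(?P<path>[^\]]+)\]$")
CHECK_RE = re.compile(r"^(?P<name>Count|Hash) \[(?P<result>PASS|FAIL)\](?P<rest>.*)$")
FIELD_RE = re.compile(r"\[([^\]]*)\]")


def failed_checks(report: str) -> list[dict]:
    """Return one record per failed Count/Hash line in the report text."""
    failures = []
    current_share = None
    for raw in report.splitlines():
        line = raw.strip()
        share = SHARE_RE.match(line)
        if share:
            # Remember which share the following Count/Hash lines belong to.
            current_share = share.group("path")
            continue
        check = CHECK_RE.match(line)
        if check and check.group("result") == "FAIL":
            failures.append({
                "share": current_share,
                "check": check.group("name"),
                # Extra bracketed fields, kept as-is (meaning not stated in the thread).
                "fields": FIELD_RE.findall(check.group("rest")),
            })
    return failures


# Sample lines copied from the report above.
SAMPLE = """\
Share [/data/gaia/GEDR3/GEDR3_2048_GAIASOURCE]
Count [PASS]
Hash [PASS]
Share [/data/gaia/GDR3/GDR3_GAIASOURCE]
Count [PASS]
Hash [FAIL][null][bbfabec832404f6193ab0036a215d83b]
"""

if __name__ == "__main__":
    for failure in failed_checks(SAMPLE):
        print(failure["share"], failure["check"], failure["fields"])
```

Running it over the full report would give one line per failing share, which could be pasted straight into the issue.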
Deleting shares on red seems to fail:
---- ----
Deleting shares
- Deleting share [iris-gaia-red-home-Evison]
Failed to delete share with name or ID 'iris-gaia-red-home-Evison': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-c50c4d80-327e-4591-bbe6-62f915beeb31)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-home-Evison1]
Failed to delete share with name or ID 'iris-gaia-red-home-Evison1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-a1eb23ef-e3f2-49f3-9ae0-345a2d8c37db)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-home-Reyesfan]
Failed to delete share with name or ID 'iris-gaia-red-home-Reyesfan': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-304a5f28-1ebc-4319-9faa-03f0e7ccdc65)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-home-Reyesfan1]
Failed to delete share with name or ID 'iris-gaia-red-home-Reyesfan1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-8efb0210-a02b-4fdb-b84a-bdebd7d9ab14)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Evison]
Failed to delete share with name or ID 'iris-gaia-red-user-Evison': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-be06b4da-dcd6-439e-9812-bd78a3a42ca6)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Evison1]
Failed to delete share with name or ID 'iris-gaia-red-user-Evison1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-2d90c67f-4b3c-447f-ab4a-95faf05ebe4f)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Reyesfan]
Failed to delete share with name or ID 'iris-gaia-red-user-Reyesfan': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-ded2176b-486c-4396-91a6-f4aef76478b0)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Reyesfan1]
Failed to delete share with name or ID 'iris-gaia-red-user-Reyesfan1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-484c4b48-1caf-48a5-877b-7b389c9a6f7d)
1 of 1 shares failed to delete.
In OpenStack, these appear to be stuck in a "Deleting" status.
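For raising this with Cambridge, it may be useful to pull the share names and request IDs out of the delete log. The sketch below assumes the log lines look exactly like the ones pasted above. The helper names `can_delete` and `failed_deletes` are hypothetical, and the set of deletable statuses is copied from the error message rather than from any API documentation.

```python
import re

# Statuses the error message says a share must be in before it can be deleted.
DELETABLE_STATUSES = {"available", "error", "inactive"}

# Matches the "Failed to delete share ..." lines from the pasted log.
FAIL_RE = re.compile(
    r"Failed to delete share with name or ID '(?P<name>[^']+)'.*"
    r"\(HTTP (?P<http>\d+)\) \(Request-ID: (?P<request_id>req-[0-9a-f-]+)\)"
)


def can_delete(status: str) -> bool:
    """True if a share in this status would pass the check quoted in the error."""
    return status.lower() in DELETABLE_STATUSES


def failed_deletes(log: str) -> list[dict]:
    """Return name, HTTP code and request ID for every failed delete in the log."""
    matches = (FAIL_RE.search(line) for line in log.splitlines())
    return [m.groupdict() for m in matches if m]


# Sample lines copied from the log above.
SAMPLE_LOG = """\
- Deleting share [iris-gaia-red-home-Evison]
Failed to delete share with name or ID 'iris-gaia-red-home-Evison': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-c50c4d80-327e-4591-bbe6-62f915beeb31)
1 of 1 shares failed to delete.
"""

if __name__ == "__main__":
    for entry in failed_deletes(SAMPLE_LOG):
        print(entry["name"], entry["http"], entry["request_id"])
    # A share already in "Deleting" is outside the allowed set.
    print(can_delete("Deleting"))
```

A share that is already in "Deleting" falls outside the allowed set, which would be consistent with the 403 responses above.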
> Just did another test deploy, and the mounts seem to work ok, but the hash value for some of the shares is set as FAILED. Is this expected?
What do you have in your notes from the previous deployment on green?
> Deleting shares on red seems to fail: .... In Openstack, these appear to be stuck in a "Deleting" status
This looks like a separate issue, not failing to mount (this issue) but failing to delete (see #1198). Can you add a list of the failed mounts to #1198 and then we can raise an issue with Cambridge.
This is linked to 1083#issuecomment-1397973799.
[...] strict mode they should immediately fail on any error.
[...] strict mode.
[...] relaxed mode, we could get on with development and testing while we fix things like the CephFS mounts and checksums.
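One way the strict/relaxed idea could look in practice is sketched below. Everything here is hypothetical (the `run_checks` function, the `CheckFailed` exception and the check names are invented for illustration and are not taken from the deployment scripts). The only behaviour taken from the discussion is that strict mode stops on the first error, while relaxed mode carries on and reports failures at the end.

```python
from typing import Callable, Iterable


class CheckFailed(Exception):
    """Raised in strict mode as soon as a check reports a failure."""


def run_checks(
    checks: Iterable[tuple[str, Callable[[], bool]]],
    mode: str = "strict",
) -> list[str]:
    """Run each (name, check) pair; return the names of the checks that failed.

    strict:  raise CheckFailed on the first failing check.
    relaxed: keep going and collect every failure.
    """
    if mode not in ("strict", "relaxed"):
        raise ValueError(f"unknown mode: {mode!r}")
    failed = []
    for name, check in checks:
        if check():
            continue
        if mode == "strict":
            raise CheckFailed(name)
        failed.append(name)
    return failed


# Stand-in checks; real ones would test the CephFS mounts and checksums.
checks = [
    ("mounts", lambda: True),
    ("checksums", lambda: False),
    ("later-step", lambda: True),
]

if __name__ == "__main__":
    # Relaxed mode runs every check and reports the failures at the end.
    print(run_checks(checks, mode="relaxed"))
    # Strict mode stops at the first failing check.
    try:
        run_checks(checks, mode="strict")
    except CheckFailed as err:
        print(f"stopped at: {err}")
```

Making strict the default would keep production deployments on the safe side, and relaxed would have to be requested explicitly for development and testing.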
From a recent test deploy (25/07/2023), some/most of the GDR3 science directories fail to mount: