wfau / gaia-dmp

Gaia data analysis platform
GNU General Public License v3.0

Deploy fails to mount some of the Science (Ceph) shares #1200

Open stvoutsin opened 1 year ago

stvoutsin commented 1 year ago

From a recent test deploy (25/07/2023), some/most of the GDR3 science directories fail to mount:

Share [/data/gaia/GEDR3/GEDR3_2048_GAIASOURCE]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GEDR3/GEDR3_2048_PS1_BEST_NEIGHBOURS]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GEDR3/GEDR3_2048_ALLWISE_BEST_NEIGHBOURS]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GEDR3/GEDR3_2048_2MASSPSC_BEST_NEIGHBOURS]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GDR3/GDR3_ASTROPHYSICAL_PARAMETERS]
Count [PASS]
Hash  [FAIL][null][27ef862b779049eafbc6764f57262eb8]

Share [/data/gaia/GDR3/GDR3_ASTROPHYSICAL_PARAMETERS_SUPP]
Count [PASS]
Hash  [FAIL][null][cbc0ceba50979a60d83b401b5fd18a8c]

Share [/data/gaia/GDR3/GDR3_GAIASOURCE]
Count [PASS]
Hash  [FAIL][null][bbfabec832404f6193ab0036a215d83b]

Share [/data/gaia/GDR3/GDR3_RVS_MEAN_SPECTRUM]
Count [PASS]
Hash  [FAIL][null][d9d071325ef336ae464e281bfca1a99c]

Share [/data/gaia/GDR3/GDR3_XP_CONTINUOUS_MEAN_SPECTRUM]
Count [PASS]
Hash  [FAIL][null][7d383e248ef2330acdbf468705e5915f]

Share [/data/gaia/GDR3/GDR3_XP_SAMPLED_MEAN_SPECTRUM]
Count [PASS]
Hash  [FAIL][null][4a80fcb94bbc0a070b299016accc4ead]

Share [/data/gaia/GDR3/GDR3_XP_SUMMARY]
Count [PASS]
Hash  [FAIL][null][fae64c68fc6f5276ae69af0288b69701]

Share [/data/gaia/GDR3/GDR3_PS1_BEST_NEIGHBOURS]
Count [PASS]
Hash  [FAIL][null][36f5d1a4098a2e1a006bf5adce6228d0]

Share [/data/gaia/GDR3/GDR3_ALLWISE_BEST_NEIGHBOURS]
Count [PASS]
Hash  [FAIL][null][99edd9a489d76d06190a772dc2d819fe]

Share [/data/gaia/GDR3/GDR3_2MASSPSC_BEST_NEIGHBOURS]
Count [PASS]
Hash  [FAIL][null][01f469f33798bdad68a028d2749ef3a2]

Zarquan commented 1 year ago

An initial check on the live service shows that all the CephFS mounts are healthy.

Zarquan commented 1 year ago

@stvoutsin can you find some error messages in the logs that indicate what is causing the problem? Something we can put in an issue at Cambridge to tell them what to look for.

stvoutsin commented 1 year ago

Just did another test deploy, and the mounts seem to work ok, but the hash value for some of the shares is set as FAILED. Is this expected?


TASK [Linking data directories] ************************************************

PLAY RECAP *********************************************************************
worker01                   : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
worker02                   : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
worker03                   : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   
zeppelin                   : ok=0    changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0   


Share [/data/gaia/GEDR3/GEDR3_2048_GAIASOURCE]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GEDR3/GEDR3_2048_PS1_BEST_NEIGHBOURS]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GEDR3/GEDR3_2048_ALLWISE_BEST_NEIGHBOURS]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GEDR3/GEDR3_2048_2MASSPSC_BEST_NEIGHBOURS]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GDR3/GDR3_ASTROPHYSICAL_PARAMETERS]
Count [PASS]
Hash  [FAIL][null][27ef862b779049eafbc6764f57262eb8]

Share [/data/gaia/GDR3/GDR3_ASTROPHYSICAL_PARAMETERS_SUPP]
Count [PASS]
Hash  [FAIL][null][cbc0ceba50979a60d83b401b5fd18a8c]

Share [/data/gaia/GDR3/GDR3_GAIASOURCE]
Count [PASS]
Hash  [FAIL][null][bbfabec832404f6193ab0036a215d83b]

Share [/data/gaia/GDR3/GDR3_RVS_MEAN_SPECTRUM]
Count [PASS]
Hash  [FAIL][null][d9d071325ef336ae464e281bfca1a99c]

Share [/data/gaia/GDR3/GDR3_XP_CONTINUOUS_MEAN_SPECTRUM]
Count [PASS]
Hash  [FAIL][null][7d383e248ef2330acdbf468705e5915f]

Share [/data/gaia/GDR3/GDR3_XP_SAMPLED_MEAN_SPECTRUM]
Count [PASS]
Hash  [FAIL][null][4a80fcb94bbc0a070b299016accc4ead]

Share [/data/gaia/GDR3/GDR3_XP_SUMMARY]
Count [PASS]
Hash  [FAIL][null][fae64c68fc6f5276ae69af0288b69701]

Share [/data/gaia/GDR3/GDR3_PS1_BEST_NEIGHBOURS]
Count [PASS]
Hash  [FAIL][null][36f5d1a4098a2e1a006bf5adce6228d0]

Share [/data/gaia/GDR3/GDR3_ALLWISE_BEST_NEIGHBOURS]
Count [PASS]
Hash  [FAIL][null][99edd9a489d76d06190a772dc2d819fe]

Share [/data/gaia/GDR3/GDR3_2MASSPSC_BEST_NEIGHBOURS]
Count [PASS]
Hash  [FAIL][null][01f469f33798bdad68a028d2749ef3a2]
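
To make it easier to compare runs, the check output above can be reduced to just the shares whose hash check failed. This is a rough sketch, not part of the deploy scripts: the function names `parse_share_checks` and `failed_hashes` are made up for illustration, and the parser only assumes the `Share [...]` / `Count [...]` / `Hash [...]` layout shown in the output. The thread does not say which of the two bracketed values after `FAIL` is the expected hash and which is the computed one, so the sketch returns them in the order they appear.

```python
import re

# Patterns assume the layout shown in the check output above.
SHARE_RE = re.compile(r"^Share \[(?P<path>[^\]]+)\]$")
CHECK_RE = re.compile(
    r"^(?P<name>Count|Hash)\s+\[(?P<status>PASS|FAIL)\](?P<rest>(?:\[[^\]]*\])*)$"
)


def parse_share_checks(text):
    """Return one dict per Share block: {'path': ..., 'checks': {...}}."""
    shares = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        match = SHARE_RE.match(line)
        if match:
            current = {"path": match.group("path"), "checks": {}}
            shares.append(current)
            continue
        match = CHECK_RE.match(line)
        if match and current is not None:
            # Any extra bracketed values after the status, in order of appearance.
            extras = re.findall(r"\[([^\]]*)\]", match.group("rest"))
            current["checks"][match.group("name")] = (match.group("status"), extras)
    return shares


def failed_hashes(shares):
    """Return (path, extra_values) for every share whose Hash check is FAIL."""
    return [
        (share["path"], share["checks"]["Hash"][1])
        for share in shares
        if share["checks"].get("Hash", ("PASS", []))[0] == "FAIL"
    ]


sample = """\
Share [/data/gaia/GEDR3/GEDR3_2048_GAIASOURCE]
Count [PASS]
Hash  [PASS]

Share [/data/gaia/GDR3/GDR3_GAIASOURCE]
Count [PASS]
Hash  [FAIL][null][bbfabec832404f6193ab0036a215d83b]
"""

for path, values in failed_hashes(parse_share_checks(sample)):
    print(path, values)
```

In the output above every failing share has `null` as the first value and the file count passes, which may point at a missing hash value rather than a failed mount, but that is a guess until the check script is confirmed.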

stvoutsin commented 1 year ago

Deleting shares on red seems to fail:


---- ----
Deleting shares
- Deleting share [iris-gaia-red-home-Evison]
Failed to delete share with name or ID 'iris-gaia-red-home-Evison': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-c50c4d80-327e-4591-bbe6-62f915beeb31)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-home-Evison1]
Failed to delete share with name or ID 'iris-gaia-red-home-Evison1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-a1eb23ef-e3f2-49f3-9ae0-345a2d8c37db)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-home-Reyesfan]
Failed to delete share with name or ID 'iris-gaia-red-home-Reyesfan': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-304a5f28-1ebc-4319-9faa-03f0e7ccdc65)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-home-Reyesfan1]
Failed to delete share with name or ID 'iris-gaia-red-home-Reyesfan1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-8efb0210-a02b-4fdb-b84a-bdebd7d9ab14)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Evison]
Failed to delete share with name or ID 'iris-gaia-red-user-Evison': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-be06b4da-dcd6-439e-9812-bd78a3a42ca6)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Evison1]
Failed to delete share with name or ID 'iris-gaia-red-user-Evison1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-2d90c67f-4b3c-447f-ab4a-95faf05ebe4f)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Reyesfan]
Failed to delete share with name or ID 'iris-gaia-red-user-Reyesfan': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-ded2176b-486c-4396-91a6-f4aef76478b0)
1 of 1 shares failed to delete.
- Deleting share [iris-gaia-red-user-Reyesfan1]
Failed to delete share with name or ID 'iris-gaia-red-user-Reyesfan1': Invalid share: Share status must be one of ('available', 'error', 'inactive'). (HTTP 403) (Request-ID: req-484c4b48-1caf-48a5-877b-7b389c9a6f7d)
1 of 1 shares failed to delete.

In OpenStack, these shares appear to be stuck in a "Deleting" status.
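
For reporting this upstream, the share names and request IDs can be pulled out of the delete log above. A minimal sketch, assuming the exact message format shown in the log; `failed_deletes` is a hypothetical helper name, not something from the deploy scripts.

```python
import re

# Matches the "Failed to delete share ..." lines as they appear in the log above.
FAIL_RE = re.compile(
    r"Failed to delete share with name or ID '(?P<name>[^']+)': "
    r"(?P<reason>.*?) \(HTTP (?P<http>\d+)\) "
    r"\(Request-ID: (?P<request_id>req-[0-9a-f-]+)\)"
)


def failed_deletes(log_text):
    """Return one dict per failed delete: name, reason, http, request_id."""
    return [match.groupdict() for match in FAIL_RE.finditer(log_text)]


sample = (
    "- Deleting share [iris-gaia-red-home-Evison]\n"
    "Failed to delete share with name or ID 'iris-gaia-red-home-Evison': "
    "Invalid share: Share status must be one of ('available', 'error', 'inactive'). "
    "(HTTP 403) (Request-ID: req-c50c4d80-327e-4591-bbe6-62f915beeb31)\n"
    "1 of 1 shares failed to delete.\n"
)

for entry in failed_deletes(sample):
    print(entry["name"], entry["http"], entry["request_id"])
```

The request IDs are probably the most useful part to pass on, since they identify the individual API calls that were rejected.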

Zarquan commented 1 year ago

Just did another test deploy, and the mounts seem to work ok, but the hash value for some of the shares is set as FAILED. Is this expected?

What do you have in your notes from the previous deployment on green?

Zarquan commented 1 year ago

Deleting shares on red seems to fail: .... In Openstack, these appear to be stuck in a "Deleting" status

This looks like a separate issue: not failing to mount (this issue) but failing to delete (see #1198). Can you add a list of the failed mounts to #1198, and then we can raise an issue with Cambridge?

Zarquan commented 1 year ago

This is linked to 1083#issuecomment-1397973799.