xorbitsai / xorbits

Scalable Python DS & ML, in an API compatible & lightning fast way.
https://xorbits.readthedocs.io
Apache License 2.0

ENH: Fix cuda storage transfer deadlock on multiple GPUs #788

Open luweizheng opened 1 month ago

luweizheng commented 1 month ago

Fix a storage transfer deadlock in CUDA storage on multiple GPUs. This work is based on https://github.com/xorbitsai/xorbits/pull/488.

The current transfer implementation deadlocks when Xorbits runs on multiple GPUs. StorageHandlerActor.fetch_batch invokes SenderManagerActor.send_batch_data, which in turn calls StorageHandlerActor.request_quota_with_spill. Because StorageHandlerActor method calls are guarded by a lock, this call chain back into StorageHandlerActor deadlocks.
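The call chain described above can be reproduced in miniature. The following is only a toy sketch, not the Xorbits code: it assumes the handler serializes its methods with a single non-reentrant `asyncio.Lock`, and the classes `StorageHandler` and `Sender` are hypothetical stand-ins that borrow the method names from the description. The real actors' locking may differ in detail.

```python
import asyncio


class StorageHandler:
    """Hypothetical stand-in for an actor whose methods share one lock."""

    def __init__(self):
        self._lock = asyncio.Lock()  # assumption: non-reentrant, one per handler
        self.sender = None

    async def fetch_batch(self):
        # The lock is held for the whole duration of the call ...
        async with self._lock:
            return await self.sender.send_batch_data()

    async def request_quota_with_spill(self):
        # ... so this nested call waits for a lock that is never released.
        async with self._lock:
            return "quota granted"


class Sender:
    """Hypothetical stand-in for the sender that calls back into the handler."""

    def __init__(self, handler):
        self.handler = handler

    async def send_batch_data(self):
        return await self.handler.request_quota_with_spill()


async def run_fetch(timeout=0.2):
    handler = StorageHandler()
    handler.sender = Sender(handler)
    try:
        # The timeout exists only so the demo terminates instead of hanging.
        return await asyncio.wait_for(handler.fetch_batch(), timeout)
    except asyncio.TimeoutError:
        return "deadlock"


result = asyncio.run(run_fetch())
print(result)
```

In this sketch `fetch_batch` never completes, because the inner `request_quota_with_spill` call waits on the lock that `fetch_batch` itself still holds. The timeout turns the hang into the string `"deadlock"`.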


codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 19.35484% with 75 lines in your changes missing coverage. Please review.

Project coverage is 52.76%. Comparing base (a3af71f) to head (77cf425).

| Files | Patch % | Lines |
|---|---|---|
| python/xorbits/_mars/services/storage/handler.py | 15.27% | 61 Missing :warning: |
| python/xorbits/_mars/services/storage/transfer.py | 22.22% | 14 Missing :warning: |

:exclamation: There is a different number of reports uploaded between BASE (a3af71f) and HEAD (77cf425).

HEAD has 10 uploads less than BASE

| Flag | BASE (a3af71f) | HEAD (77cf425) |
|------|------|------|
|unittests|11|1|

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##             main     #788       +/-   ##
===========================================
- Coverage   83.31%   52.76%   -30.55%
===========================================
  Files        1058     1058
  Lines       79781    79741       -40
  Branches    16493    12138     -4355
===========================================
- Hits        66467    42073    -24394
- Misses      11801    36233    +24432
+ Partials     1513     1435       -78
```

| [Flag](https://app.codecov.io/gh/xorbitsai/xorbits/pull/788/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xorbitsai) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/xorbitsai/xorbits/pull/788/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xorbitsai) | `52.76% <19.35%> (-30.49%)` | :arrow_down: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=xorbitsai#carryforward-flags-in-the-pull-request-comment) to find out more.
