Open bugaenkoleonid opened 10 months ago
Sound like a good addition, I will look into it.
However it might also be useful to investigate why / where / when it is freezing, because normally this should be fixed in the sheepit client itself. I do not know if the sheepit client has a health check also, but I would argue that it would be its job to detect a stall and do something about it.
If I recall correctly, that image does its healthcheck by watching the logs and maybe comparing timestamps.
Thing is, on older, community-driven Sheepit docker images, this tended to happen, I know because I run one constantly myself.
On the official image this does not happen anymore.
Should also work with modern docker CUDA support, if it's needed.
Anywho, a native healthcheck in the client would probably be wise to implement. Just watching logs seems primitive and error-prone.
It is a client issue, had this happen natively too when running for a long time, not sure why you think this would be docker related, because it isn't.
A health check makes perfect sense in an environment where you pay per minute, so it is a very reasonable request.
Just a hunch, used to happen every few months but it does not anymore.
I will see to it to get a health signal natively implemented in the client.
I couldn't run the official image on vast.ai
I don't wanna clutter this issue with conversation of a separate docker image.
If you want, our GitLab Issue Tracker
You can also reach out to me on Discord via @dacool
im using your new docker tag for deployment on vast.ai. Thank you so much for that, it works pretty well!
but sometimes agents hang up and freeze the entire render queue
Maybe you can use this solution in your image?