wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/

Sporadic instances of 'errorValue' int 5 | Timed out #76

Closed: wlandau closed this issue 1 year ago

wlandau commented 1 year ago

Related: #75, https://github.com/shikokuchuo/mirai/issues/53. This happens both with and without targets, although most often during targets R CMD check on GitHub Actions and in targets pipelines on my local Ubuntu machine.

I am starting to think this issue is specific to Ubuntu. My local Ubuntu installation:

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=22.04
DISTRIB_CODENAME=jammy
DISTRIB_DESCRIPTION="Ubuntu 22.04.2 LTS"

And ubuntu-latest on GitHub Actions is also 22.04.

I wrote a Dockerfile at https://github.com/ropensci/targets/blob/debug-crew/Dockerfile to set up reproducible examples on Ubuntu, included below for reference. (I suppose the version in FROM ubuntu:20.04 might need to change to 22.04 to match.)

# Use the same base image as 'ubuntu-latest'
FROM ubuntu:20.04

# Set environment variables
ENV NOT_CRAN=true
ENV TAR_NOT_CRAN=true
ENV R_KEEP_PKG_SOURCE=yes
ENV R_REMOTES_NO_ERRORS_FROM_WARNINGS=false
ENV TORCH_INSTALL=1
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US:en
ENV LC_ALL=en_US.UTF-8
ENV _R_CHECK_FORCE_SUGGESTS_=false

# Install system dependencies
RUN apt-get update && \
    apt-get install -y --no-install-recommends \
    build-essential \
    gfortran \
    cmake \
    git \
    emacs \
    ca-certificates \
    libglpk-dev \
    libglpk40 \
    libmbedtls-dev \
    curl \
    htop \
    gnupg2 \
    lsb-release \
    locales \
    libxml2-dev \
    libssl-dev \
    libcurl4-openssl-dev && \
    ln -fs /usr/share/zoneinfo/$TZ /etc/localtime && \
    dpkg-reconfigure --frontend noninteractive tzdata

# Configure locale
RUN apt-get clean && apt-get update && \
    apt-get install -y locales && \
    sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen && \
    dpkg-reconfigure --frontend=noninteractive locales && \
    update-locale LANG=en_US.UTF-8

# Install R
RUN echo "deb https://cloud.r-project.org/bin/linux/ubuntu focal-cran40/" > /etc/apt/sources.list.d/cran.list && \
    apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 && \
    apt-get update && \
    apt-get install -y --no-install-recommends r-base

# Install pandoc from r-lib/actions/setup-pandoc
RUN curl -fsSL https://github.com/jgm/pandoc/releases/download/2.11.4/pandoc-2.11.4-1-amd64.deb -o pandoc.deb && \
    dpkg -i pandoc.deb && \
    rm pandoc.deb

# Set up R environment
RUN R -e 'install.packages(c("remotes", "rcmdcheck", "languageserver", "pkgload", "testthat", "rmarkdown", "markdown", "knitr"), repos = "https://cloud.r-project.org/")'

# Install nanonext and mirai
RUN R -e 'install.packages(c("nanonext", "mirai"), repos = "https://shikokuchuo.r-universe.dev")'

# Install crew
RUN R -e 'remotes::install_github("wlandau/crew")'

# Install targets
RUN R -e 'remotes::install_github("ropensci/targets")'

# Set up the working directory
WORKDIR /workspace

@shikokuchuo, which systems do you test with?

shikokuchuo commented 1 year ago

I thought my update for 0.8.7.9003 would have fixed this. At least you are only getting the timeout (which is intentional) and not the other errors/warnings, I assume. Any ideas on how to reproduce it?

There is an existing 3-second timeout if nothing comes back from the dispatcher, which is there to prevent hanging. I have not made this tuneable, as 3 seconds should be plenty: there is not much to send, and the dispatcher is a local process using fast IPC sockets. If it does indeed produce this timeout, I'd assume the dispatcher died. Unless that's not the case?
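
(For reference, a minimal sketch of how this timeout surfaces at the client, assuming nanonext's is_error_value() helper and the daemons()$daemons status field referenced later in this thread:)

library(mirai)
library(nanonext)

# Query the dispatcher. With the fixed 3-second window described above, a
# stalled dispatcher yields an 'errorValue' (5L, "Timed out") instead of a
# status matrix.
status <- daemons()$daemons
if (is_error_value(status)) {
  message("dispatcher did not respond: ", status)
}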

EDIT: I've just pushed v0.8.7.9004. It finishes the job of the previous build, which was put together quite hastily yesterday. I don't think it directly addresses this issue, but it's worth trying just in case; it might well have fixed it.

wlandau commented 1 year ago

Again, I really appreciate your work on this! I just tried 0.8.7.9004 on the example from https://github.com/wlandau/crew/issues/75#issue-1709009643 several times, and the timeout happens far less often. However, I still see occurrences on my Ubuntu machine. One time I ran into it when the example was just starting:

# [1] 1
# Error:
# ! invalid daemons: structure(5L, class = "errorValue")

And another time after all tasks were pushed and were in the process of being popped:

...
[1] "3d7e364e8c1e7994ba448da75798b86d3d12df1e"
[1] "cd08685844d62fac3c213ee1cd643b7bd6906704"
[1] "e48993ae1e1dbad82ecd1697955c579becb2e471"
[1] "9fceae13b365a397000b4b4cae7ac70d15fc79ef"
[1] "fe58a2ffa1522d6136df58c6cee392608b0ca36f"
[1] "82f03a6a32f4b31f4dc979f0947fd12617754a84"
Error:
! invalid daemons: structure(5L, class = "errorValue")

In the latter case, there was only one task left in the crew queue, and all 5999 available results had already been popped. Calling $pop() once to start a server and then again to return the result took care of the last task. All the while, the dispatcher never stopped running: I could see it in htop, and I confirmed that its PID matched the one from rownames(daemons()$daemons).
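
(A sketch of that liveness check, assuming the ps package as a stand-in for htop; the PID comes from the row names of daemons()$daemons as described above:)

# Confirm the dispatcher process is still alive, using the PID reported by mirai.
pid <- as.integer(rownames(mirai::daemons()$daemons)[1L])
handle <- ps::ps_handle(pid)
ps::ps_is_running(handle)  # TRUE means the dispatcher has not exited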

shikokuchuo commented 1 year ago

By the way, this does not appear to be Ubuntu-specific: I have not managed to reproduce it even once on my own Ubuntu laptop. But given that you can get it on an Ubuntu machine, it is a real problem.

From what you're describing, this seems to point to something like daemons() being called in a spin loop, and since it happens at the start and at the end, possibly to do with its use in scaling.

You should consider adding msleep(1L) or even msleep(10L) calls where appropriate. There is a reason Garrett D'Amore provided this utility as part of NNG, and it is a legitimate tool where you can't solve things through synchronisation. If you are calling daemons() constantly, it also prevents the dispatcher from doing real work.
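
(A minimal sketch of that kind of pacing, assuming nanonext's msleep() and a purely illustrative polling loop around daemons():)

library(mirai)
library(nanonext)

# Illustrative polling loop: query dispatcher status, then yield briefly so
# that back-to-back daemons() calls do not starve the dispatcher of real work.
for (i in seq_len(100L)) {
  status <- daemons()$daemons
  if (!is_error_value(status)) {
    # ... act on the status matrix here (auto-scaling decisions, etc.) ...
  }
  msleep(10L)  # sleep 10 ms between queries instead of spinning
}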

Reframing the problem: I am not seeing a good reason why a daemons() request should get stuck. That's not to say it's impossible, and evidently it does get stuck, say, 0.1% of the time, but the other 99.9% of the time it works as expected. If crew does not probe the 0.1% of cases where it does get stuck, that is likely to be a more sustainable solution going forward.

shikokuchuo commented 1 year ago

Also I made a couple of small tweaks in v0.8.7.9005. It would be great to know if these are enough to eliminate the timeouts!

shikokuchuo commented 1 year ago

I've finally seen this:

Error:
! invalid daemons: structure(5L, class = "errorValue")

when I reverted the control socket back to "bus"! I'm taking a slightly different approach that should prevent these altogether.

shikokuchuo commented 1 year ago

In mirai 3c04d04 v0.8.7.9006 I've re-implemented the dispatcher control socket.

This eliminates the timeouts for me, even though it still only uses the same lightweight "bus" sockets. The issue is addressed through synchronisation techniques instead.

However, as I was not able to reproduce the timeouts with the req/rep sockets before, I still need you to confirm that this works on your Ubuntu machine.

wlandau commented 1 year ago

That certainly improves things! I can no longer reproduce the timeouts locally on Ubuntu (or at least they are now very rare), and I only got one timeout on GitHub Actions: https://github.com/ropensci/targets/actions/runs/5006076955.

Meanwhile, I implemented basic throttling in crew's auto-scaling to be nicer to the dispatcher. (Auto-scaling does not need to happen again if it already took place e.g. half a second ago.) This also has the effect of doubling the speed of my tests, particularly the one at the top of #75. So that should make timeouts even less likely.
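
(Roughly, the throttle amounts to something like the following sketch; the names here are placeholders, not crew's actual internals:)

# Illustrative throttle: skip auto-scaling if it already happened within the
# last half second. The environment and function are placeholders, not crew's API.
throttle <- new.env()
throttle$last <- -Inf

should_scale <- function(seconds = 0.5) {
  now <- as.numeric(proc.time()[["elapsed"]])
  if (now - throttle$last < seconds) {
    return(FALSE)  # auto-scaling ran too recently, skip this round
  }
  throttle$last <- now
  TRUE  # caller proceeds with the actual auto-scaling step
}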

Currently testing on GitHub Actions.

Not to propose a new direction in development, but this does get me thinking big-picture about the mirai/crew integration. mirai has a dispatcher as a separate process, which frees the user session. But in crew, because of auto-scaling and transient workers, the main R process needs to be an event loop that continuously calls push(), pop(), or wait() (similar to Gabor's task queue idea for callr). The crew event loop is even more important now that there is throttling. All this makes me wonder if crew itself could manually run an iteration of dispatch each time, say, pop() is called, rather than overburdening the mirai dispatcher. It might not be feasible given the scope and intent of mirai, but I keep wondering about it.
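
(To make that event-loop picture concrete, here is a rough sketch using the controller methods named above; treat the exact crew arguments as illustrative rather than a verbatim excerpt:)

library(crew)

# Rough sketch of the push()/pop() loop described above: the main R process
# keeps cycling so crew has regular chances to auto-scale workers and collect
# results.
controller <- crew_controller_local(workers = 2L)
controller$start()

for (i in seq_len(10L)) {
  controller$push(command = Sys.getpid())  # each push can trigger auto-scaling
}

results <- list()
while (length(results) < 10L) {
  out <- controller$pop()  # each pop is another chance to scale and collect
  if (!is.null(out)) results[[length(results) + 1L]] <- out
}

controller$terminate()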

wlandau commented 1 year ago

Unfortunately, I still got instances of 'errorValue' int 7 | Object closed for the task in the GitHub Actions targets tests. Example: https://github.com/ropensci/targets/actions/runs/5006290328/jobs/8971368720#step:9:257

shikokuchuo commented 1 year ago

Unfortunately, I still got instances of 'errorValue' int 7 | Object closed for the task in the GitHub Actions targets tests. Example: https://github.com/ropensci/targets/actions/runs/5006290328/jobs/8971368720#step:9:257

This is new to me. Have you seen this recently? It would be good to see when it started popping up. I'm not immediately sure what this is about.

Sorry, how am I meant to read the logs: is 7 | Object closed the return value of a mirai, or is there an unresolved mirai and that was the return value of a daemons() call?

Perhaps we should open a new issue to prevent this one from getting too long.

wlandau commented 1 year ago

I will open a new issue, sorry to throw so many things here at once.

wlandau commented 1 year ago

Just posted #77.

wlandau commented 1 year ago

The specific "'errorValue' int 5" messages started to creep back just after I implemented #87. At first I only saw them intermittently on GitHub Actions macOS runners, but today I saw one for the first time on my local MacBook during R CMD check. The local instance happened here in a test, which traces back to launcher$tally() and ultimately to mirai::daemons(). Good news, though: I added more detail to the error message so we can find out whether the dispatcher is still alive when this happens.

wlandau commented 1 year ago

If the dispatcher is still running when the error happens, then I might be able to implement a retry/timeout mechanism in crew as a workaround. That wouldn't work if the dispatcher has already exited, but if that's the case, maybe the root cause would be easier to solve.

wlandau commented 1 year ago

From https://github.com/wlandau/crew/actions/runs/5410657252/jobs/9832405357#step:8:167, it looks like the dispatcher is actually still running when the daemons() error happens. So maybe I should try implementing retries.
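
(A minimal sketch of that retry idea, assuming nanonext's is_error_value() helper; the wrapper name and defaults are illustrative:)

# Illustrative retry wrapper around the daemons() query: since the dispatcher
# is still alive when the errorValue appears, retrying a few times may be
# enough to get a valid status matrix back.
daemons_with_retry <- function(attempts = 5L, wait = 0.25) {
  out <- NULL
  for (i in seq_len(attempts)) {
    out <- mirai::daemons()$daemons
    if (!nanonext::is_error_value(out)) return(out)
    Sys.sleep(wait)
  }
  stop("invalid daemons after ", attempts, " attempts: ", out)
}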

wlandau commented 1 year ago

I'm not sure, but https://github.com/shikokuchuo/mirai/issues/64 might be a clue about what is happening. If the dispatcher is busy or being queried at the moment a server dials in, I wonder if that would cause problems. Unfortunately, crew needs asyncdial = FALSE as a safeguard to make sure workers shut down when they need to.

shikokuchuo commented 1 year ago

I don't think https://github.com/shikokuchuo/mirai/issues/64 is related.

Is there an actual deadlock at the dispatcher here? That is, if you just record but otherwise ignore the error at the client and retry / let things continue, does the pipeline finish?

wlandau commented 1 year ago

Is there an actual deadlock at the dispatcher here? That is, if you just record but otherwise ignore the error at the client and retry / let things continue, does the pipeline finish?

I don't know, actually. I implemented retries a day or two ago, but I haven't had time to test them properly. I'm currently testing the with-retries and without-retries branches of this repo.

wlandau commented 1 year ago

I tested 200 builds with retries and 200 without, and only one build had connection issues (in the latter group). I would expect more issues without retries, but regardless, something seems to have fixed this.