wlandau / crew

A distributed worker launcher
https://wlandau.github.io/crew/
Other
123 stars 4 forks source link

Sporadic instances of 'errorValue' int 7 |Object closed #77

Closed wlandau closed 1 year ago

wlandau commented 1 year ago

Related: #76, #75, https://github.com/shikokuchuo/mirai/issues/53. Seen in these instances in the output of mirai_task$data:

I don't recall whether I see this outside GitHub Actions.

wlandau commented 1 year ago

Sorry - how am I meant to read the logs - is 7 |Object closed the return value of a mirai, or is there an unresolved mirai and that was the return value of a daemons() call?

I'm pretty sure it's always the output of $data from an unresolved mirai task. I am not 100% sure, but daemons() seems to be returning good values lately.

wlandau commented 1 year ago

https://github.com/shikokuchuo/mirai/issues/61 seems unrelated, but I wonder if it might have the same underlying cause.

shikokuchuo commented 1 year ago

Sorry - how am I meant to read the logs - is 7 |Object closed the return value of a mirai, or is there an unresolved mirai and that was the return value of a daemons() call?

I'm pretty sure it's always the output of $data from an unresolved mirai task. I am not 100% sure, but daemons() seems to be returning good values lately.

That makes sense. In which case it seems to be that for both cases, there is an unresolved mirai, AND a mirai with 7 Object closed.

Overall, this issue doesn't worry me at all, as there is a sure backup solution, which would be to inspect all evaluated results at dispatcher, and if a 7 errorValue is encountered, then simply re-send the original task (it is still retained at this point). However this would be the 'wrong' solution as it penalises every single evaluation to catch these extremely rare occurrences. This would only be implemented if we understand what is causing this AND there is no other solution.

What is puzzling about this is - as I think you know as well - 7 Object closed means that the listener socket is closed. saisei() closes listeners, but not the socket. There does not seem to be code that does this, not at the dispatcher level at least.

However it is not just dispatcher that has listener sockets, the client also has the main socket to dispatcher.

Hence a much more plausible hypothesis is simply that a mirai got stuck possibly because #79 still needs to be implemented. That caused a timeout error, and there is a handler somewhere perhaps in the test itself that calls daemons(0) or equivalent that closes the client socket. If a previously unresolved mirai is inspected after that happens, then it resolves to 7 Object closed.

If this is the case, then this is correct behaviour and not an issue. The fact that this has not been seen outside of this test seems to point to this.

EDIT: adding simple example below

library(mirai)
daemons(1)
#> [1] 1
m <- mirai(Sys.sleep(20))
m$data
#> 'unresolved' logi NA
daemons(0)
#> [1] 0
m$data
#> 'errorValue' int 7 | Object closed

Created on 2023-05-18 with reprex v2.0.2

wlandau commented 1 year ago

That makes sense. In which case it seems to be that for both cases, there is an unresolved mirai, AND a mirai with 7 Object closed.

Yes, at this point, both problems separately manifest at the task level.

Overall, this issue doesn't worry me at all, as there is a sure backup solution, which would be to inspect all evaluated results at dispatcher, and if a 7 errorValue is encountered, then simply re-send the original task (it is still retained at this point). However this would be the 'wrong' solution as it penalises every single evaluation to catch these extremely rare occurrences. This would only be implemented if we understand what is causing this AND there is no other solution.

I agree, I hope it doesn't come to that. Inspecting tasks requires downloading the data, right? That seems inefficient. Unless there is an equivalent of .unresolved()?

What is puzzling about this is - as I think you know as well - 7 Object closed means that the listener socket is closed. saisei() closes listeners, but not the socket. There does not seem to be code that does this, not at the dispatcher level at least.

I don't think I actually fully understand yet: how is it possible to change the URL token of the listener without changing the socket / confusing the dialer?

Hence a much more plausible hypothesis is simply that a mirai got stuck possibly because https://github.com/wlandau/crew/issues/79 still needs to be implemented. That caused a timeout error, and there is a handler somewhere perhaps in the test itself that calls daemons(0) or equivalent that closes the client socket. If a previously unresolved mirai is inspected after that happens, then it resolves to 7 Object closed.

I solved #79, but I am still seeing hanging and "object closed" errors on GitHub Actions. I think this is because each of those tests only runs 1 mirai server which does not auto-scale down (it has maxtasks = Inf, idletime = Inf, etc.).

EDIT: adding simple example below

Thanks for explaining, that make sense given what you said about closed listener sockets.

wlandau commented 1 year ago

For the GitHub Actions tests in https://github.com/ropensci/targets/tree/debug-crew, I increased the length of all the polling intervals in hopes that targets could be friendlier on the crew/mirai infrastructure in the face of limited resources. However, instead it caused many more failures: https://github.com/ropensci/targets/actions/runs/5016109351/jobs/8992554758. Out of the 20 runs, there were 9 "unresolved" timeouts and 3 "object closed" errors. Maybe this means I will have better luck reproducing it locally.

wlandau commented 1 year ago

I wonder, could it be because of an outdated version of libmbedtls? On GitHub Actions nng_version() shows [1] "1.6.0pre" "mbed TLS 2.28.0" but locally it's [1] "1.6.0pre" "mbed TLS 3.4.0". I do run sudo apt-get install libmbedtls-dev and I suppose package repos could be behind.

shikokuchuo commented 1 year ago

I wonder, could it be because of an outdated version of libmbedtls? On GitHub Actions nng_version() shows [1] "1.6.0pre" "mbed TLS 2.28.0" but locally it's [1] "1.6.0pre" "mbed TLS 3.4.0". I do run sudo apt-get install libmbedtls-dev and I suppose package repos could be behind.

Unlikely, but you never know. Just remove the apt command. It will then build libmbedtls from the package source.

wlandau commented 1 year ago

Worth a shot, but unfortunately mbedtls wasn't the answer. And everything runs fine locally on my Ubuntu machine after 32 runs (8 R CMD check runs at a time).

wlandau commented 1 year ago

I think next I might try to create a Codespace from https://github.com/ropensci/targets/tree/debug-crew. I think the Dockerfile there should still work.

shikokuchuo commented 1 year ago

I agree, I hope it doesn't come to that. Inspecting tasks requires downloading the data, right? That seems inefficient. Unless there is an equivalent of .unresolved()?

No, it is not a good choice at all.

What is puzzling about this is - as I think you know as well - 7 Object closed means that the listener socket is closed. saisei() closes listeners, but not the socket. There does not seem to be code that does this, not at the dispatcher level at least.

I don't think I actually fully understand yet: how is it possible to change the URL token of the listener without changing the socket / confusing the dialer?

NNG is a socket library, everything revolves around the socket. A socket can have many dialers and listeners if you remember. We are using req/rep which is a guaranteed delivery protocol. That means when the socket at dispatcher sends a message to a server, it will keep trying until it has got a reply. If a connection fails (e.g. server times out), it will just try again when another connection is available - it doesn't care if it is the same server or a different one. The listener/dialer terminology is really just an abstraction.

Thanks for explaining, that make sense given what you said about closed listener sockets.

I am convinced it is just printing out the result after the client socket has been closed, hence we should just solve the task that doesn't complete and not worry specifically about this error message.

shikokuchuo commented 1 year ago

Worth a shot, but unfortunately mbedtls wasn't the answer. And everything runs fine locally on my Ubuntu machine after 32 runs (8 R CMD check runs at a time).

This may sound silly, but for a fair test, shouldn't the Github actions also be running 8 tests at a time then? Could it be that it just doesn't like you running so many tests at the same time? What if you tried say 10, multiple times in succession.

shikokuchuo commented 1 year ago

I wonder, could it be because of an outdated version of libmbedtls? On GitHub Actions nng_version() shows [1] "1.6.0pre" "mbed TLS 2.28.0" but locally it's [1] "1.6.0pre" "mbed TLS 3.4.0". I do run sudo apt-get install libmbedtls-dev and I suppose package repos could be behind.

Btw. mbedtls 2.28 may sound hopelessly outdated but it is the LTS branch so it is 'current'. And practically shouldn't make a difference. Although when I was benchmarking the hash functions, the newer v3 did have an edge!

shikokuchuo commented 1 year ago

Meanwhile, I have updated nanonext to 0.8.3.9002 (this rolls forward 'libnng' to a confirmed stable version), and mirai 0.8.7.9009 which tightens the bus socket sends so they can't fail silently. Please do ensure you test with these versions going forward.

shikokuchuo commented 1 year ago

nanonext 0.8.3.9003 e0436c5 and mirai 0.8.7.9010 178b1f2. I think these can possibly solve the problem.

Can you test and confirm? If successful can lead to some further simplification in mirai.

wlandau commented 1 year ago

Wow, https://github.com/ropensci/targets/actions/runs/5023527267 shows no errors or timeouts at all this time! That's down from 12 /20 failures before! This is so exciting!

I will repeat these tests and run others to confirm.

shikokuchuo commented 1 year ago

Fantastic! I was correct in my hypothesis then. These would only occur on rare occasions and I think only when using a single server... so unlikely to affect users in the wild, even without these updates.

I'm putting through some further simplifications that this allows, as I mentioned. It will be great if you can test again afterwards, make sure everything still works!

wlandau commented 1 year ago

Yes, absolutely, I will test again afterwards.

I really can't thank you enough! With assurance that these tools work in the usual test environments, I will sleep much better at night, and I will feel much better when I spread the word about our tools. As always, thank you for sticking with me.

wlandau commented 1 year ago

Update: the last 160 contiguous GitHub Actions runs were flawless!

shikokuchuo commented 1 year ago

160!!

The updates are ready: nanonext 0.8.3.9004 and mirai 0.8.7.9012.

Perhaps do one test run first before the entire 160 again :)

wlandau commented 1 year ago

The first batch of tests worked great with nanonext 0.8.3.9004 and mirai 0.8.7.9012. Scaling up the testing now.

wlandau commented 1 year ago

200 contiguous tests pass! What a huge relief! Again, thank you so much. I can now re-enable automated testing in CI.

wlandau commented 1 year ago

For posterity, in case anything like this comes up again, this is the Dockerfile I most recently used to launch GitHub Codespaces for debugging:

# Use the same base image as 'ubuntu-latest'
FROM ubuntu:22.04

# Set environment variables
ENV NOT_CRAN=true
ENV TAR_NOT_CRAN=true
ENV R_KEEP_PKG_SOURCE=yes
ENV R_REMOTES_NO_ERRORS_FROM_WARNINGS=false
ENV TORCH_INSTALL=1
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US:en
ENV LC_ALL=en_US.UTF-8
ENV _R_CHECK_FORCE_SUGGESTS_=false

# Install system dependencies
RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential
RUN apt-get install -y --no-install-recommends gfortran
RUN apt-get install -y --no-install-recommends cmake
RUN apt-get install -y --no-install-recommends git
RUN apt-get install -y --no-install-recommends emacs
RUN apt-get install -y --no-install-recommends ca-certificates
RUN apt-get install -y --no-install-recommends libglpk-dev
RUN apt-get install -y --no-install-recommends libglpk40
RUN apt-get install -y --no-install-recommends libmbedtls-dev
RUN apt-get install -y --no-install-recommends curl
RUN apt-get install -y --no-install-recommends htop
RUN apt-get install -y --no-install-recommends gnupg2
RUN apt-get install -y --no-install-recommends lsb-release
RUN apt-get install -y --no-install-recommends locales
RUN apt-get install -y --no-install-recommends libxml2-dev
RUN apt-get install -y --no-install-recommends libssl-dev
RUN apt-get install -y --no-install-recommends libcurl4-openssl-dev
RUN apt-get install -y --no-install-recommends tzdata

# Configure locale
RUN ln -fs /usr/share/zoneinfo/$TZ /etc/localtime
RUN dpkg-reconfigure --frontend noninteractive tzdata
RUN sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen
RUN dpkg-reconfigure --frontend=noninteractive locales
RUN update-locale LANG=en_US.UTF-8

# Install R
RUN echo "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" > /etc/apt/sources.list.d/cran.list
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
RUN apt-get update
RUN apt-get install -y --no-install-recommends r-base

# Install pandoc from r-lib/actions/setup-pandoc
RUN curl -fsSL https://github.com/jgm/pandoc/releases/download/2.11.4/pandoc-2.11.4-1-amd64.deb -o pandoc.deb
RUN dpkg -i pandoc.deb && rm pandoc.deb

# Set up R environment
RUN R -e 'install.packages(c("remotes", "rcmdcheck", "languageserver", "pkgload", "testthat", "rmarkdown", "markdown", "knitr"), repos = "https://cloud.r-project.org/")'

# Install nanonext and mirai
RUN R -e 'install.packages(c("nanonext", "mirai"), repos = "https://shikokuchuo.r-universe.dev")'

# Install crew
RUN R -e 'remotes::install_github("wlandau/crew")'

# Install targets
RUN R -e 'remotes::install_github("ropensci/targets")'

# Set up the working directory
WORKDIR /workspace