Closed wlandau closed 1 year ago
Sorry - how am I meant to read the logs - is 7 |Object closed the return value of a mirai, or is there an unresolved mirai and that was the return value of a daemons() call?
I'm pretty sure it's always the output of $data
from an unresolved mirai
task. I am not 100% sure, but daemons()
seems to be returning good values lately.
https://github.com/shikokuchuo/mirai/issues/61 seems unrelated, but I wonder if it might have the same underlying cause.
Sorry - how am I meant to read the logs - is 7 |Object closed the return value of a mirai, or is there an unresolved mirai and that was the return value of a daemons() call?
I'm pretty sure it's always the output of
$data
from an unresolvedmirai
task. I am not 100% sure, butdaemons()
seems to be returning good values lately.
That makes sense. In which case it seems to be that for both cases, there is an unresolved
mirai, AND a mirai with 7 Object closed
.
Overall, this issue doesn't worry me at all, as there is a sure backup solution, which would be to inspect all evaluated results at dispatcher, and if a 7 errorValue
is encountered, then simply re-send the original task (it is still retained at this point). However this would be the 'wrong' solution as it penalises every single evaluation to catch these extremely rare occurrences. This would only be implemented if we understand what is causing this AND there is no other solution.
What is puzzling about this is - as I think you know as well - 7 Object closed means that the listener socket is closed. saisei()
closes listeners, but not the socket. There does not seem to be code that does this, not at the dispatcher level at least.
However it is not just dispatcher that has listener sockets, the client also has the main socket to dispatcher.
Hence a much more plausible hypothesis is simply that a mirai got stuck possibly because #79 still needs to be implemented. That caused a timeout error, and there is a handler somewhere perhaps in the test itself that calls daemons(0)
or equivalent that closes the client socket. If a previously unresolved mirai is inspected after that happens, then it resolves to 7 Object closed
.
If this is the case, then this is correct behaviour and not an issue. The fact that this has not been seen outside of this test seems to point to this.
EDIT: adding simple example below
library(mirai)
daemons(1)
#> [1] 1
m <- mirai(Sys.sleep(20))
m$data
#> 'unresolved' logi NA
daemons(0)
#> [1] 0
m$data
#> 'errorValue' int 7 | Object closed
Created on 2023-05-18 with reprex v2.0.2
That makes sense. In which case it seems to be that for both cases, there is an unresolved mirai, AND a mirai with 7 Object closed.
Yes, at this point, both problems separately manifest at the task level.
Overall, this issue doesn't worry me at all, as there is a sure backup solution, which would be to inspect all evaluated results at dispatcher, and if a 7 errorValue is encountered, then simply re-send the original task (it is still retained at this point). However this would be the 'wrong' solution as it penalises every single evaluation to catch these extremely rare occurrences. This would only be implemented if we understand what is causing this AND there is no other solution.
I agree, I hope it doesn't come to that. Inspecting tasks requires downloading the data, right? That seems inefficient. Unless there is an equivalent of .unresolved()
?
What is puzzling about this is - as I think you know as well - 7 Object closed means that the listener socket is closed. saisei() closes listeners, but not the socket. There does not seem to be code that does this, not at the dispatcher level at least.
I don't think I actually fully understand yet: how is it possible to change the URL token of the listener without changing the socket / confusing the dialer?
Hence a much more plausible hypothesis is simply that a mirai got stuck possibly because https://github.com/wlandau/crew/issues/79 still needs to be implemented. That caused a timeout error, and there is a handler somewhere perhaps in the test itself that calls daemons(0) or equivalent that closes the client socket. If a previously unresolved mirai is inspected after that happens, then it resolves to 7 Object closed.
I solved #79, but I am still seeing hanging and "object closed" errors on GitHub Actions. I think this is because each of those tests only runs 1 mirai
server which does not auto-scale down (it has maxtasks = Inf
, idletime = Inf
, etc.).
EDIT: adding simple example below
Thanks for explaining, that make sense given what you said about closed listener sockets.
For the GitHub Actions tests in https://github.com/ropensci/targets/tree/debug-crew, I increased the length of all the polling intervals in hopes that targets
could be friendlier on the crew
/mirai
infrastructure in the face of limited resources. However, instead it caused many more failures: https://github.com/ropensci/targets/actions/runs/5016109351/jobs/8992554758. Out of the 20 runs, there were 9 "unresolved" timeouts and 3 "object closed" errors. Maybe this means I will have better luck reproducing it locally.
I wonder, could it be because of an outdated version of libmbedtls
? On GitHub Actions nng_version()
shows [1] "1.6.0pre" "mbed TLS 2.28.0"
but locally it's [1] "1.6.0pre" "mbed TLS 3.4.0"
. I do run sudo apt-get install libmbedtls-dev
and I suppose package repos could be behind.
I wonder, could it be because of an outdated version of
libmbedtls
? On GitHub Actionsnng_version()
shows[1] "1.6.0pre" "mbed TLS 2.28.0"
but locally it's[1] "1.6.0pre" "mbed TLS 3.4.0"
. I do runsudo apt-get install libmbedtls-dev
and I suppose package repos could be behind.
Unlikely, but you never know. Just remove the apt command. It will then build libmbedtls from the package source.
Worth a shot, but unfortunately mbedtls wasn't the answer. And everything runs fine locally on my Ubuntu machine after 32 runs (8 R CMD check
runs at a time).
I think next I might try to create a Codespace from https://github.com/ropensci/targets/tree/debug-crew. I think the Dockerfile there should still work.
I agree, I hope it doesn't come to that. Inspecting tasks requires downloading the data, right? That seems inefficient. Unless there is an equivalent of
.unresolved()
?
No, it is not a good choice at all.
What is puzzling about this is - as I think you know as well - 7 Object closed means that the listener socket is closed. saisei() closes listeners, but not the socket. There does not seem to be code that does this, not at the dispatcher level at least.
I don't think I actually fully understand yet: how is it possible to change the URL token of the listener without changing the socket / confusing the dialer?
NNG is a socket library, everything revolves around the socket. A socket can have many dialers and listeners if you remember. We are using req/rep which is a guaranteed delivery protocol. That means when the socket at dispatcher sends a message to a server, it will keep trying until it has got a reply. If a connection fails (e.g. server times out), it will just try again when another connection is available - it doesn't care if it is the same server or a different one. The listener/dialer terminology is really just an abstraction.
Thanks for explaining, that make sense given what you said about closed listener sockets.
I am convinced it is just printing out the result after the client socket has been closed, hence we should just solve the task that doesn't complete and not worry specifically about this error message.
Worth a shot, but unfortunately mbedtls wasn't the answer. And everything runs fine locally on my Ubuntu machine after 32 runs (8
R CMD check
runs at a time).
This may sound silly, but for a fair test, shouldn't the Github actions also be running 8 tests at a time then? Could it be that it just doesn't like you running so many tests at the same time? What if you tried say 10, multiple times in succession.
I wonder, could it be because of an outdated version of
libmbedtls
? On GitHub Actionsnng_version()
shows[1] "1.6.0pre" "mbed TLS 2.28.0"
but locally it's[1] "1.6.0pre" "mbed TLS 3.4.0"
. I do runsudo apt-get install libmbedtls-dev
and I suppose package repos could be behind.
Btw. mbedtls 2.28 may sound hopelessly outdated but it is the LTS branch so it is 'current'. And practically shouldn't make a difference. Although when I was benchmarking the hash functions, the newer v3 did have an edge!
Meanwhile, I have updated nanonext
to 0.8.3.9002 (this rolls forward 'libnng' to a confirmed stable version), and mirai
0.8.7.9009 which tightens the bus socket sends so they can't fail silently. Please do ensure you test with these versions going forward.
Wow, https://github.com/ropensci/targets/actions/runs/5023527267 shows no errors or timeouts at all this time! That's down from 12 /20 failures before! This is so exciting!
I will repeat these tests and run others to confirm.
Fantastic! I was correct in my hypothesis then. These would only occur on rare occasions and I think only when using a single server... so unlikely to affect users in the wild, even without these updates.
I'm putting through some further simplifications that this allows, as I mentioned. It will be great if you can test again afterwards, make sure everything still works!
Yes, absolutely, I will test again afterwards.
I really can't thank you enough! With assurance that these tools work in the usual test environments, I will sleep much better at night, and I will feel much better when I spread the word about our tools. As always, thank you for sticking with me.
Update: the last 160 contiguous GitHub Actions runs were flawless!
160!!
The updates are ready: nanonext
0.8.3.9004 and mirai
0.8.7.9012.
Perhaps do one test run first before the entire 160 again :)
The first batch of tests worked great with nanonext
0.8.3.9004 and mirai
0.8.7.9012. Scaling up the testing now.
200 contiguous tests pass! What a huge relief! Again, thank you so much. I can now re-enable automated testing in CI.
For posterity, in case anything like this comes up again, this is the Dockerfile
I most recently used to launch GitHub Codespaces for debugging:
# Use the same base image as 'ubuntu-latest'
FROM ubuntu:22.04
# Set environment variables
ENV NOT_CRAN=true
ENV TAR_NOT_CRAN=true
ENV R_KEEP_PKG_SOURCE=yes
ENV R_REMOTES_NO_ERRORS_FROM_WARNINGS=false
ENV TORCH_INSTALL=1
ENV DEBIAN_FRONTEND=noninteractive
ENV TZ=Etc/UTC
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US:en
ENV LC_ALL=en_US.UTF-8
ENV _R_CHECK_FORCE_SUGGESTS_=false
# Install system dependencies
RUN apt-get update
RUN apt-get install -y --no-install-recommends build-essential
RUN apt-get install -y --no-install-recommends gfortran
RUN apt-get install -y --no-install-recommends cmake
RUN apt-get install -y --no-install-recommends git
RUN apt-get install -y --no-install-recommends emacs
RUN apt-get install -y --no-install-recommends ca-certificates
RUN apt-get install -y --no-install-recommends libglpk-dev
RUN apt-get install -y --no-install-recommends libglpk40
RUN apt-get install -y --no-install-recommends libmbedtls-dev
RUN apt-get install -y --no-install-recommends curl
RUN apt-get install -y --no-install-recommends htop
RUN apt-get install -y --no-install-recommends gnupg2
RUN apt-get install -y --no-install-recommends lsb-release
RUN apt-get install -y --no-install-recommends locales
RUN apt-get install -y --no-install-recommends libxml2-dev
RUN apt-get install -y --no-install-recommends libssl-dev
RUN apt-get install -y --no-install-recommends libcurl4-openssl-dev
RUN apt-get install -y --no-install-recommends tzdata
# Configure locale
RUN ln -fs /usr/share/zoneinfo/$TZ /etc/localtime
RUN dpkg-reconfigure --frontend noninteractive tzdata
RUN sed -i -e 's/# en_US.UTF-8 UTF-8/en_US.UTF-8 UTF-8/' /etc/locale.gen
RUN dpkg-reconfigure --frontend=noninteractive locales
RUN update-locale LANG=en_US.UTF-8
# Install R
RUN echo "deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/" > /etc/apt/sources.list.d/cran.list
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9
RUN apt-get update
RUN apt-get install -y --no-install-recommends r-base
# Install pandoc from r-lib/actions/setup-pandoc
RUN curl -fsSL https://github.com/jgm/pandoc/releases/download/2.11.4/pandoc-2.11.4-1-amd64.deb -o pandoc.deb
RUN dpkg -i pandoc.deb && rm pandoc.deb
# Set up R environment
RUN R -e 'install.packages(c("remotes", "rcmdcheck", "languageserver", "pkgload", "testthat", "rmarkdown", "markdown", "knitr"), repos = "https://cloud.r-project.org/")'
# Install nanonext and mirai
RUN R -e 'install.packages(c("nanonext", "mirai"), repos = "https://shikokuchuo.r-universe.dev")'
# Install crew
RUN R -e 'remotes::install_github("wlandau/crew")'
# Install targets
RUN R -e 'remotes::install_github("ropensci/targets")'
# Set up the working directory
WORKDIR /workspace
Related: #76, #75, https://github.com/shikokuchuo/mirai/issues/53. Seen in these instances in the output of
mirai_task$data
:I don't recall whether I see this outside GitHub Actions.