jmaassen closed this issue 3 years ago
When starting the docker image manually like so:
docker run --detach --name xenon-torque --hostname xenon-torque --publish 10022:22 --cap-add SYS_RESOURCE xenonmiddleware/torque
and running the liveTest like this:
./gradlew liveTest -Dxenon.scheduler=torque -Dxenon.username=xenon -Dxenon.password=javagat -Dxenon.scheduler.location=ssh://localhost:10022 -Dxenon.scheduler.workdir=/home/xenon
the tests run successfully. So it seems there is an issue in the testing framework itself, not in the code or the docker image. Maybe the healthcheck succeeds too quickly?
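If the healthcheck really does report success before Torque is ready, one way to test that theory is to retry a readiness probe before starting the tests. The following is only a sketch: retry_until_ready is a hypothetical helper, not something provided by Xenon or Docker, and the probe in the usage comment assumes that qstat is on the PATH inside the container.

```shell
# Sketch: retry a readiness probe instead of trusting a single healthcheck.
# retry_until_ready is a hypothetical helper, not part of Xenon or Docker.
#
# Usage: retry_until_ready ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it exits 0 or until ATTEMPTS runs have failed.
retry_until_ready() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the last failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (uses the container name from the docker run command above and
# assumes qstat is available inside the container):
#   retry_until_ready 30 2 docker exec xenon-torque qstat -q
```

Probing the scheduler itself rather than just the SSH port should show whether the tests start before the batch system can accept jobs.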
Starting the docker images with docker compose:
docker-compose -f torque-5.0.0.yml up
and running the live tests in the same fashion:
./gradlew liveTest -Dxenon.scheduler=torque -Dxenon.username=xenon -Dxenon.password=javagat -Dxenon.scheduler.location=ssh://localhost:32830 -Dxenon.scheduler.workdir=/home/xenon
does not work. It results in the same error as with the integration tests.
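The port 32830 in the command above is presumably the ephemeral host port that docker-compose assigned, so it can differ between runs. A small sketch for looking it up instead of copying it by hand follows. mapped_port is a hypothetical helper, and the service name torque is an assumption based on the torque_1 prefix in the compose log.

```shell
# mapped_port is a hypothetical helper: it strips everything up to the last
# colon from an address such as "0.0.0.0:32830", leaving only the port.
mapped_port() {
  printf '%s\n' "${1##*:}"
}

# Example (assumes the compose service is named "torque"):
#   port=$(mapped_port "$(docker-compose -f torque-5.0.0.yml port torque 22)")
#   ./gradlew liveTest -Dxenon.scheduler=torque -Dxenon.scheduler.location="ssh://localhost:$port"
# The remaining -D options stay the same as in the command above.
```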
Works for me
docker --version
Docker version 20.10.2, build 2291f61
docker image inspect xenonmiddleware/torque | jq '.[0].RepoDigests'
[
"xenonmiddleware/torque@sha256:5a98982c2ad0cefc6994004ce4da69e68b8f0c4d596a9732dd7574e75f2153d4"
]
gradlew integrationTest --tests '*torque*'
Starting a Gradle Daemon, 2 incompatible Daemons could not be reused, use --status for details
Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.4.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD SUCCESSFUL in 59s
6 actionable tasks: 2 executed, 4 up-to-date
PS. xenonmiddleware/torque is the only Docker image where we don't install the scheduler/fs ourselves.
Using the liveTest command also works for me. I did have to prime the known_hosts file by logging in manually before calling gradle.
I also saw that the healthcheck is causing "2021-01-18 13:58:45,612 CRIT reaped unknown pid 5465)" messages in the docker compose log.
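Priming known_hosts, as mentioned above, can probably be scripted rather than done with a manual login. The sketch below assumes the tests read the default ~/.ssh/known_hosts file. known_hosts_name is a hypothetical helper. Note that ssh-keyscan accepts whatever key the host presents, which is only reasonable for a local test container.

```shell
# known_hosts_name is a hypothetical helper. OpenSSH stores hosts on the
# default port 22 by bare name and hosts on other ports as "[host]:port".
known_hosts_name() {
  if [ "$2" = "22" ]; then
    printf '%s\n' "$1"
  else
    printf '[%s]:%s\n' "$1" "$2"
  fi
}

# Example for the manually started container on port 10022: add the host key
# only when no entry exists yet.
#   ssh-keygen -F "$(known_hosts_name localhost 10022)" >/dev/null ||
#     ssh-keyscan -p 10022 localhost >> ~/.ssh/known_hosts
```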
The digest of xenonmiddleware/torque matches. I do have an older version of docker though: Docker version 19.03.8, build afacb8b7f0
I also see the unknown pid message, but also these (when I start docker-compose manually):
Starting docker-compose_torque_1 ... done
Attaching to docker-compose_torque_1
torque_1 | /usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
torque_1 | 'Supervisord is running as root and it is searching '
torque_1 | 2021-01-18 14:27:22,213 CRIT Supervisor running as root (no user in config file)
torque_1 | 2021-01-18 14:27:22,215 INFO supervisord started with pid 1
torque_1 | 2021-01-18 14:27:23,217 INFO spawned: 'pbsmom' with pid 16
torque_1 | 2021-01-18 14:27:23,218 INFO spawned: 'sshd' with pid 17
torque_1 | 2021-01-18 14:27:23,219 INFO spawned: 'pbssched' with pid 18
torque_1 | 2021-01-18 14:27:23,220 INFO spawned: 'pbsserver' with pid 20
torque_1 | 2021-01-18 14:27:23,220 INFO spawned: 'trqauthd' with pid 21
torque_1 | 2021-01-18 14:27:23,248 CRIT reaped unknown pid 24)
torque_1 | 2021-01-18 14:27:23,345 INFO exited: pbssched (exit status 0; not expected)
torque_1 | 2021-01-18 14:27:23,352 INFO gave up: pbssched entered FATAL state, too many start retries too quickly
torque_1 | 2021-01-18 14:27:24,444 INFO success: pbsmom entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1 | 2021-01-18 14:27:24,444 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1 | 2021-01-18 14:27:24,444 INFO success: pbsserver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1 | 2021-01-18 14:27:24,444 INFO success: trqauthd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1 | 2021-01-18 14:27:24,444 CRIT reaped unknown pid 58)
torque_1 | 2021-01-18 14:27:25,604 CRIT reaped unknown pid 77)
torque_1 | 2021-01-18 14:27:26,779 CRIT reaped unknown pid 96)
Does pbssched fail?
I have the same log message, but when I log in, the pbs_sched process is running and qsub works as expected.
Updating docker from 19.03.8 to 20.10.2 did not help, but updating docker-compose from 1.25.0 to 1.27.4 seems to have squashed this bug.
Resolved as a docker-compose version issue.
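To catch this earlier on other machines, a guard on the docker-compose version could run before the integration tests. This is a sketch only. version_at_least is a hypothetical helper that relies on GNU sort -V, and 1.27.4 is simply the version reported to work in this thread; which release between 1.25.0 and 1.27.4 actually fixed the problem is not known.

```shell
# version_at_least is a hypothetical helper; it relies on "sort -V"
# (version sort), which GNU coreutils provides.
# Succeeds when version $1 is greater than or equal to version $2.
version_at_least() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example: warn when the installed docker-compose is older than 1.27.4,
# the version reported to work in this thread.
#   installed=$(docker-compose version --short)
#   version_at_least "$installed" 1.27.4 ||
#     echo "docker-compose $installed may be affected by this issue" >&2
```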
When I run the integration tests for xenon, the Torque tests fail with the following error:
Other integration tests that use docker images (such as slurm and gridengine) seem to work as expected.