xenon-middleware / xenon

A middleware abstraction library that provides a simple programming interface to various compute and storage resources.
http://xenon-middleware.github.io/xenon/
Apache License 2.0

Torque integration tests broken? #677

Closed: jmaassen closed this issue 3 years ago

jmaassen commented 3 years ago

When I run the integration tests for xenon, the Torque tests fail with the following error:

java.lang.IllegalArgumentException: No internal port '22' for container 'torque': com.palantir.docker.compose.connection.Container$$Lambda$96/0x0000000840137840@2d2ea655
    at com.palantir.docker.compose.connection.Container.lambda$port$11(Container.java:91)

Other integration tests that use Docker images (such as slurm and gridengine) seem to work as expected.

jmaassen commented 3 years ago

When starting the docker image manually like so:

 docker run --detach --name xenon-torque --hostname xenon-torque --publish 10022:22 --cap-add SYS_RESOURCE xenonmiddleware/torque

and running the liveTest like this:

./gradlew liveTest -Dxenon.scheduler=torque -Dxenon.username=xenon -Dxenon.password=javagat -Dxenon.scheduler.location=ssh://localhost:10022 -Dxenon.scheduler.workdir=/home/xenon

the tests run successfully. So the issue seems to be in the testing framework itself, not in the code or the Docker image. Maybe the healthcheck succeeds too quickly?

jmaassen commented 3 years ago

Starting the docker images with docker compose:

docker-compose -f torque-5.0.0.yml up

and running the live tests in the same fashion:

./gradlew liveTest -Dxenon.scheduler=torque -Dxenon.username=xenon -Dxenon.password=javagat -Dxenon.scheduler.location=ssh://localhost:32830 -Dxenon.scheduler.workdir=/home/xenon 

does not work. It results in the same error as with the integration tests.
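The port in xenon.scheduler.location has to match the host port that docker-compose published for the container's port 22. Below is a minimal sketch of looking that port up instead of copying it by hand. It assumes the compose service is called torque (as the error message suggests) and that `docker-compose port` prints a host:port pair such as 0.0.0.0:32830. The helper names `parse_host_port` and `scheduler_location` are hypothetical and not part of Xenon.

```python
import subprocess


def parse_host_port(port_output: str) -> int:
    """Extract the host port from output such as '0.0.0.0:32830'."""
    lines = port_output.strip().splitlines()
    if not lines:
        raise ValueError("no port mapping reported")
    # rpartition also copes with IPv6-style hosts such as ':::32830'.
    _, sep, port = lines[0].rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"unexpected port mapping: {port_output!r}")
    return int(port)


def scheduler_location(compose_file: str, service: str = "torque",
                       internal_port: int = 22) -> str:
    """Ask docker-compose which host port maps to the service's internal port.

    Assumption: the service name 'torque' and the compose file are the ones
    used in this thread; adjust both for other setups.
    """
    result = subprocess.run(
        ["docker-compose", "-f", compose_file, "port", service, str(internal_port)],
        check=True, capture_output=True, text=True,
    )
    return f"ssh://localhost:{parse_host_port(result.stdout)}"
```

Calling scheduler_location("torque-5.0.0.yml") should give a value to pass as -Dxenon.scheduler.location. If the lookup itself fails, that would suggest the problem lies in the compose setup rather than in Xenon.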

sverhoeven commented 3 years ago

Works for me

docker --version
Docker version 20.10.2, build 2291f61

docker image inspect xenonmiddleware/torque | jq '.[0].RepoDigests'
[
  "xenonmiddleware/torque@sha256:5a98982c2ad0cefc6994004ce4da69e68b8f0c4d596a9732dd7574e75f2153d4"
]

gradlew integrationTest --tests '*torque*'
Starting a Gradle Daemon, 2 incompatible Daemons could not be reused, use --status for details

Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.4.1/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 59s
6 actionable tasks: 2 executed, 4 up-to-date

PS. xenonmiddleware/torque is the only Docker image where we don't install the scheduler/fs ourselves.

sverhoeven commented 3 years ago

Using the liveTest command also works for me. I did have to prime the known_hosts file by logging in manually before calling Gradle.

I also saw that the healthcheck is causing "2021-01-18 13:58:45,612 CRIT reaped unknown pid 5465)" messages in the docker-compose log.

jmaassen commented 3 years ago

The digest of xenonmiddleware/torque matches. I do have an older version of docker though: Docker version 19.03.8, build afacb8b7f0

jmaassen commented 3 years ago

I see the unknown pid message too, along with these (when I start docker-compose manually):

Starting docker-compose_torque_1 ... done
Attaching to docker-compose_torque_1
torque_1  | /usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
torque_1  |   'Supervisord is running as root and it is searching '
torque_1  | 2021-01-18 14:27:22,213 CRIT Supervisor running as root (no user in config file)
torque_1  | 2021-01-18 14:27:22,215 INFO supervisord started with pid 1
torque_1  | 2021-01-18 14:27:23,217 INFO spawned: 'pbsmom' with pid 16
torque_1  | 2021-01-18 14:27:23,218 INFO spawned: 'sshd' with pid 17
torque_1  | 2021-01-18 14:27:23,219 INFO spawned: 'pbssched' with pid 18
torque_1  | 2021-01-18 14:27:23,220 INFO spawned: 'pbsserver' with pid 20
torque_1  | 2021-01-18 14:27:23,220 INFO spawned: 'trqauthd' with pid 21
torque_1  | 2021-01-18 14:27:23,248 CRIT reaped unknown pid 24)
torque_1  | 2021-01-18 14:27:23,345 INFO exited: pbssched (exit status 0; not expected)
torque_1  | 2021-01-18 14:27:23,352 INFO gave up: pbssched entered FATAL state, too many start retries too quickly
torque_1  | 2021-01-18 14:27:24,444 INFO success: pbsmom entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 INFO success: pbsserver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 INFO success: trqauthd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
torque_1  | 2021-01-18 14:27:24,444 CRIT reaped unknown pid 58)
torque_1  | 2021-01-18 14:27:25,604 CRIT reaped unknown pid 77)
torque_1  | 2021-01-18 14:27:26,779 CRIT reaped unknown pid 96)

Does pbssched fail?

sverhoeven commented 3 years ago

I have the same log message, but when I log in, the pbs_sched process is running and qsub works as expected.

jmaassen commented 3 years ago

Updating docker from 19.03.8 to 20.10.2 did not help, but updating docker-compose from 1.25.0 to 1.27.4 seems to have squashed this bug.
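A guard against running the integration tests with an affected docker-compose could look roughly like the sketch below. It assumes that `docker-compose --version` prints a line containing a dotted version number, and it treats 1.27.4 as the lower bound only because that is the version that worked here; versions between 1.25.0 and 1.27.4 were not tried. The function names are hypothetical.

```python
import re

# Assumption: 1.27.4 is simply the version reported as working in this thread.
KNOWN_GOOD = (1, 27, 4)


def parse_compose_version(version_output: str) -> tuple:
    """Pull the first MAJOR.MINOR.PATCH triple out of the version output."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version found in {version_output!r}")
    return tuple(int(part) for part in match.groups())


def is_known_good(version_output: str) -> bool:
    """True if the reported version is at least the one that worked here."""
    return parse_compose_version(version_output) >= KNOWN_GOOD
```

Tuples compare element by element, so (1, 25, 0) sorts below (1, 27, 4) without any string handling.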

jmaassen commented 3 years ago

Resolved as a docker-compose version issue.