zifeo / whiz

Modern DAG/tasks runner for multi-platform monorepos with live reloading, env management, pipes, and more in a tabbed view.
https://metatype.dev/docs/reference/ecosystem?utm_source=github&utm_medium=about&utm_campaign=whiz
Mozilla Public License 2.0
19 stars 14 forks source link

feat: better CI support #98

Closed Yohe-Am closed 10 months ago

Yohe-Am commented 10 months ago

This PR fixes #83.

I've decided to go with a straight forward solution that, when the program is provided with the --exit-after flag, uses an Actor that asks the CommandActors to tell it when they're finished. Once it has the go from all its targets, it exits the program.

In order to make the tool more CI friendly, I've also introduced a --no-watch flag that globally disables reloading triggered by fs watch events. I also recommend adding, in this PR or another, code that disables the Console actor when not needed and joins all the child std pipes. The console makes little sense in a CI context.

zifeo commented 10 months ago

Also do not forget to include the flags in the readme.

Yohe-Am commented 10 months ago

Alright, I've resolved a few of the suggestions. They might bloat this PR but I have a few minor ideas for improvements in this direction:

zifeo commented 10 months ago

@Yohe-Am That makes sense. I would keep this behaviour separated from q. Alternatively, flushing the log buffer ordered would maybe be even better if a debug is necessary or checking the run (as most of the CI will not show any log with the current implementation).

Yohe-Am commented 10 months ago

Ah, the test I wrote appear to be flaky. The test asserts that the return code of all the test tasks, simple ls and echo commands, are 0. The test reads this code from the Actix::System when it's done and GrimReaperActor is responsible for setting this. It only sets 0 if all tasks it was targeted to kill exit with a 0. The problem is, sometimes, some tasks return ExitStatus::Undetermined which I've decided to interpret as a non-0 error code.

Looking through manuals and the subprocess code, this ExitStatus is only returned when something has already used waitpid on the child to completion. And waitpid is only used by the poll and wait_timeout invocations. All the new code I've added goes through the Child abstraction which avoids using poll if the child has already been waited upon to completion. Actix, at least as used in whiz, doesn't actors move across threads either. Not sure what's wrong, maybe I'm looking in the wrong place...

Any ideas?

Yohe-Am commented 10 months ago

After some investigation, the error was indeed elsewhere. I was led astray by the Child::Killed to ExitStatus::Undetermined mapping in Child::exit_status. The main issue appears to be that the child is not always dead by the time the actor processes the StdoutTerminated event and the ensure_stopped therein call ends up pulling the plug. I suppose I'll have it wait for a while for the child to die, in the manner of WaitStatus before pulling the plug. I'm thinking, poll if the child is dead every 20 millis and kill it if a second has elapsed. Lemme see...

Yohe-Am commented 10 months ago

Ayup, that seems to have fixed it. Seems to be working well now and ready to merge. I don't think the solution is a semantic violation, right? StdoutTerminated always correlates to death?

zifeo commented 10 months ago

@Yohe-Am unsure of the last (quite complex :)) commit, I believe ensure_stopped should do the same behaviour. Do you see a diff?

Yohe-Am commented 10 months ago

@zifeo ensure_stopped kills the child if it's still running. As it turns out, the child might still be running by the time StdoutTerminated is processed by the actor. That complex async bit's just giving slight breathing room (a second) for the command to complete naturally. Otherwise, this edge case robs us a chance of reading its actual error code.

zifeo commented 10 months ago

@Yohe-Am would use ensure_stop somehow when the stdout terminated event is sent simplify that edge case? The CI seems still failing (maybe os specific).

Yohe-Am commented 10 months ago

I suppose, to simplify the code, I can use the wait_timeout syscall to achieve the same behavior. It won't be async but I don't think that's a boon here. On the behavior this achieves though, I can't think of a simpler way of handling the edge case.

Yohe-Am commented 10 months ago

CI

yeah, I'm not sure why macOS bash is acting differently. hmm...

Yohe-Am commented 10 months ago

Alright, I've pushed a change that simplify that async bit.

CI, macOs

According to this, it might be the time unit. on sleep. I've pushed a change that'll hopefully get it along.

Yohe-Am commented 10 months ago

windows

oof, right. Let me add a quick cfg case for the test

Yohe-Am commented 10 months ago

Edit: nvm, Github appears to be working now.

Something appears to be wrong with Github, I'm unable to access the logs of the run.

image

I also tried accessing the raw longs and downloading the log archive but they just timeout after a while. I tried multiple browsers and I'm not behind a VPN either. The rest of the website appears functional as well. Hmm...

Yohe-Am commented 10 months ago

Maybe starting another CI run might help?

Yohe-Am commented 10 months ago

Turns out the TIMEOUT command on Windows only works in interactive shell sessions. According to this and other discussions I've found, abusing the PING program as an alternative to sleep is common under cmd and that should hopefully fix that.

Aand, after some consideration, seeing that a command as simple as sleep has quirked up all three platforms, I've just replaced it all with python invocations in the tests.