ylecuyer / puma-helper

Command-line tool for puma to display information about running requests/processes ...
MIT License

Rework the puma-helper output #18

Open spuyet opened 5 years ago

spuyet commented 5 years ago

I think that we should rework the puma-helper output; the current output is really hard to read. We should use passenger-status as an example:

$ sudo passenger-status
Version : 5.3.4
Date    : 2019-05-06 13:02:07 +0200
Instance: S0h5vRIG (nginx/1.14.0 Phusion_Passenger/5.3.4)

----------- General information -----------
Max pool size : 200
App groups    : 2
Processes     : 31
Requests in top-level queue : 0

----------- Application groups -----------
/home/smcc/api_current (production):
  App root: /home/smcc/api_current
  Requests in queue: 0
  * PID: 35276   Sessions: 0       Processed: 4610    Uptime: 2h 11m 6s
    CPU: 2%      Memory  : 315M    Last used: 0s ago
  * PID: 44452   Sessions: 0       Processed: 399     Uptime: 1h 2m 36s
    CPU: 0%      Memory  : 260M    Last used: 1s ago
  * PID: 37492   Sessions: 0       Processed: 16      Uptime: 11m 38s
    CPU: 0%      Memory  : 153M    Last used: 1s ago

VS

$ puma-helper status
WARN[May  6 17:19:42] [lds] configuration is invalid. Error: Get http://unix/stats?token=768935b7ed117cfac6c8c99da4db6bd: dial unix /tmp/puma-status-1545126439210-2886: connect: no such file or directory

---------- Global informations ----------
Version: 1.0.0 | Date: Mon, 06 May 2019 17:19:42 +0200

----------- Application groups -----------
-> answers_integration application | App root: /home/answers_integration
-> 38423 (/home/answers_integration/current/tmp/pids/puma.state) Phase: 28 | Workers: 2 | Active threads: 0[░░]2

* 0 [27771] CPU Av: 0.3% CPU Times: 12s Mem: 230.3MiB Phase: 28 Uptime: 11m29d23h Threads: 0[░]1 (Last checkin: 0s)
* 1 [27949] CPU Av: 0.2% CPU Times: 9s Mem: 230.3MiB Phase: 28 Uptime: 11m29d23h Threads: 0[░]1 (Last checkin: 5s)

-> 38424 (/home/answers_integration/current/tmp/pids/puma_api.state) Phase: 28 | Workers: 2 | Active threads: 0[░░]2

* 0 [27892] CPU Av: 0.1% CPU Times: 5s Mem: 172.1MiB Phase: 28 Uptime: 11m29d23h Threads: 0[░]1 (Last checkin: 3s)
* 1 [27961] CPU Av: 0.1% CPU Times: 5s Mem: 171.4MiB Phase: 28 Uptime: 11m29d23h Threads: 0[░]1 (Last checkin: 2s)

Here is a first draft of something a bit more readable, largely inspired by the passenger output:

Version : 1.0.0
Date    : Mon, 06 May 2019 13:02:49 +0200

----------- General information -----------
App groups   : 2
Processes    : 6

----------- Application groups -----------
smcc_api (20190419144610):
  * PID: 35276    Queued: 0       Phase  : 42     Uptime      : 2h 11m 6s
    CPU: 2%       Memory: 315M    Threads: 3/5    Last checkin: 0s ago
  * PID: 44452    Queued: 0       Phase  : 42     Uptime      : 1h 2m 36s
    CPU: 0%       Memory: 260M    Threads: 1/5    Last checkin: 3s ago
  * PID: 37492    Queued: 0       Phase  : 42     Uptime      : 11m 38s
    CPU: 0%       Memory: 153M    Threads: 2/5    Last checkin: 1s ago

smcc (20190419144610):
  * PID: 20487    Queued: 0       Phase  : 42       Uptime      : 1h 57m 57s
    CPU: 24%      Memory: 456M    Threads: 12/20    Last checkin: 1s ago
  * PID: 32417    Queued: 0       Phase  : 42       Uptime      : 1h 42m 49s
    CPU: 27%      Memory: 463M    Threads: 5/20     Last checkin: 9s ago
  * PID: 34253    Queued: 0       Phase  : 42       Uptime      : 1h 40m 13s
    CPU: 25%      Memory: 497M    Threads: 19/20    Last checkin: 0s ago    

I know that the Go dependency you're using doesn't support the current CPU percentage, and to be honest portability is a welcome feature, but current CPU usage is a mandatory one IMHO; software cannot be restricted only because of portability, especially for a tiny project like this one.

This is a first draft, comments and improvements are welcome.

Notes:

- CPU percentage => use /proc/PID/stat
- Uptime => use /proc/PID/stat too
- Queued => corresponds to the backlog value per process in puma stats
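To illustrate the /proc/PID/stat idea, here is a minimal Go sketch of a parser for one stat line (the `cpuAndStart` helper is mine, not part of puma-helper). Per proc(5), utime, stime, and starttime are fields 14, 15, and 22; the comm field can contain spaces, hence splitting after the last `)`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// cpuAndStart extracts utime+stime (clock ticks of CPU consumed) and
// starttime (ticks after boot) from one /proc/PID/stat line. The comm
// field (2) may contain spaces, so we split after the last ')'.
func cpuAndStart(statLine string) (cpuTicks, startTicks uint64, err error) {
	i := strings.LastIndexByte(statLine, ')')
	if i < 0 {
		return 0, 0, fmt.Errorf("malformed stat line")
	}
	fields := strings.Fields(statLine[i+1:])
	if len(fields) < 20 {
		return 0, 0, fmt.Errorf("too few fields in stat line")
	}
	// fields[0] is state (field 3); utime is field 14, stime field 15,
	// starttime field 22 -> indices 11, 12 and 19 here.
	utime, err := strconv.ParseUint(fields[11], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	stime, err := strconv.ParseUint(fields[12], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	start, err := strconv.ParseUint(fields[19], 10, 64)
	if err != nil {
		return 0, 0, err
	}
	return utime + stime, start, nil
}

func main() {
	// Example line, truncated to field 22 for illustration.
	line := "27771 (puma) S 1 27771 27771 0 -1 4194560 1000 0 0 0 120 30 0 0 20 0 2 0 98765"
	cpu, start, _ := cpuAndStart(line)
	fmt.Println(cpu, start) // 150 98765
}
```

CPU percentage then follows from the tick delta between two samples, and uptime from the system uptime minus starttime divided by the clock rate.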

jarthod commented 5 years ago

Yep. I don't agree with the two lines per process though; I'm the one who asked for one line, because we have a lot of processes in production and it means you can see twice as much on one console screen.

Also @w3st3ry, you (still) didn't take into account half of the things I asked for in my last feedback (which I can't see any more because I sent it via PM on Slack). I gave you the benefit of the doubt, but I guess by now I can consider that you will never do them. This is not acceptable; you can't waste our time like that. You need to at least read what we say, and if you don't want to do something, it must be written somewhere, with arguments; you can't just ignore it.

So OK, I'll spend some time again to give you (mostly) the same feedback; please do it this time:

Example output (made quickly by hand):

$ puma-helper status
Version: 1.0.0 | Date: Mon, 06 May 2019 17:19:42 +0200

38423 answers_integration           Phase: 28  Active threads: 0/2 [░░]
└ 27771 CPU: 98.4%  Mem: 1230MiB  Uptime:       9d 3h Threads: 0/1 [░]
└ 27949 CPU:  0.2%  Mem:  230MiB  Uptime: 11m 29d 23h Threads: 0/1 [░] (Last checkin: 13s) (Phase: 27)

38424 answers_integration API       Phase: 28  Active threads: 0/2 [░░]
└ 27892 CPU:  0.1%  Mem:  172MiB  Uptime: 11m 29d 23h Threads: 0/1 [░]
└ 27961 CPU:  0.1%  Mem:  171MiB  Uptime: 11m 29d 23h Threads: 0/1 [░]

WDYT @spuyet @ylecuyer?

ylecuyer commented 5 years ago

This output looks better ;)

I would like to add something about errors too. Currently, errors are at the top of the output:

$ puma-helper status
WARN[May  6 17:19:42] [lds] configuration is invalid. Error: Get http://unix/stats?token=768935b7ed117cfac6c8c99da4db6bd: dial unix /tmp/puma-status-1545126439210-2886: connect: no such file or directory

---------- Global informations ----------
Version: 1.0.0 | Date: Mon, 06 May 2019 17:19:42 +0200

We should have them at the end, if any; otherwise they are difficult to notice:

$ puma-helper status
Version: 1.0.0 | Date: Mon, 06 May 2019 17:19:42 +0200

38423 answers_integration           Phase: 28  Active threads: 0/2 [░░]
└ 27771 CPU: 98.4%  Mem: 1230MiB  Uptime:       9d 3h Threads: 0/1 [░]
└ 27949 CPU:  0.2%  Mem:  230MiB  Uptime: 11m 29d 23h Threads: 0/1 [░] (Last checkin: 13s) (Phase: 27)

38424 answers_integration API       Phase: 28  Active threads: 0/2 [░░]
└ 27892 CPU:  0.1%  Mem:  172MiB  Uptime: 11m 29d 23h Threads: 0/1 [░]
└ 27961 CPU:  0.1%  Mem:  171MiB  Uptime: 11m 29d 23h Threads: 0/1 [░]

[lds] configuration is invalid. Error: Get http://unix/stats?token=768935b7ed117cfac6c8c99da4db6bd: dial unix /tmp/puma-status-1545126439210-2886: connect: no such file or directory
spuyet commented 5 years ago
$ puma-helper status
Version: 1.0.0 | Date: Mon, 06 May 2019 17:19:42 +0200

38423 answers_integration        Load: 0/2 [░░]
 └ 27949 CPU:  0.2%  Mem:  230M  Uptime: 11m 29d 23h  Load: [░] (Last checkin: 13s) (Phase 27 != Phase app 28)
 └ 27771 CPU: 98.4%  Mem: 1230M  Uptime:       9d 3h  Load: [░]

38424 answers_integration_api     Queuing: 42          Load: 2/2 [░░]
 └ 27892 CPU:  99.8%  Mem:  172M  Uptime: 11m 29d 23h  Load: [░]
 └ 27961 CPU: 100.0%  Mem:  171M  Uptime: 11m 29d 23h  Load: [░]

[lds] configuration is invalid. Error: Get http://unix/stats?token=768935b7ed117cfac6c8c99da4db6bd: dial unix /tmp/puma-status-1545126439210-2886: connect: no such file or directory

Here is a little update:

About the CPU percentage, I will try something with /proc/PID/stat to see if we can get something reliable.

jarthod commented 5 years ago

Queuing is at worker level, no? Not application level?

spuyet commented 5 years ago

It's at worker level in puma stats, yes, but it seems more useful to me to have a queuing overview, so aggregate the values and display them at application level. WDYT?

jarthod commented 5 years ago

Yep ok

w3st3ry commented 5 years ago

In order to summarize all these changes:

Questions before adding to the todo-list:

jarthod commented 5 years ago

About the CPU percentage, what's the relation with goroutines and locks? You can iterate on all apps to get the first number, then wait 1 second, get the second number for all apps, and then print. You don't have to over-engineer this.

w3st3ry commented 5 years ago

I just proposed that kind of solution at first because it's more compliant with the philosophy of the language. BTW, after some changes it should work like you said too, and it's easier to maintain for someone who is not a full-time Gopher, I agree.

Otherwise, do you agree with the summary? @jarthod @spuyet

jarthod commented 5 years ago

A couple of typos, but otherwise yes:

> Last checkin disabled if there's an issue (> 10 sec)

It's the opposite: visible if there's an issue.

> MiB to B

→ to M

About the queue, I think it's "backlog" in the status JSON, but I'll let @spuyet confirm.

About "Removed threads number per process": I think @spuyet said we should remove the number but keep the load bar graph per worker (because it's almost always 1), but I'm not sure it's a good idea, as we can have multiple threads, in which case it won't be 1. @spuyet WDYT?

spuyet commented 5 years ago

> "Added queuing" -> How should I get this information? I see nothing related to queuing in the worker JSON, could you explain?

I explained it in the body of this issue: yes, that's the backlog value that you have per process in the status JSON output.

> About "Removed threads number per process" I think @spuyet said we should remove the number but keep the load bar graph for worker (because it's almost always 1) but I'm not sure it's a good idea as we can have multiple threads in which case it'll not be 1. @spuyet WDYT?

Yep. I don't find the number really useful, as we already have the global value; the load graph seems to be enough to me.

jarthod commented 5 years ago

Ok!

w3st3ry commented 5 years ago

Please check out https://github.com/dimelo/puma-helper/pull/21 for the UI changes. Available on the staging env.