Open portnov opened 6 months ago
After some thought, it seems the same principle is applicable to "Top SQL" sections: it would be good to add "% of DB Time" column. Moreover, that would solve known problem that "%Total" for SQL queries is not quite correct when nested queries are counted.
Hello @zubkov-andrei ! Sorry to bother you, but... any comments on this matter? :)
Hi, @portnov Excuse me for late answer, I was fixing some bugs in the last release.
Yes, we have a problem here. However I'm not sure we can implement your suggestion "as is" because of several thoughts:
But it seems, we should exclude 'idle' waits from %Total calculation. However we should think about what to do with the %Total values for idle wait events.
Hello. Thanks for the answer.
About wait sampling, probably one way to do it would be to modify pg_wait_sampling in such a way that it would record not only event and event_type columns from pg_stat_activity, but state as well. So that monitoring system could filter based on state, like select * from pg_wait_sampling_profile where state = 'active'
. This way we could more or less reliably exclude "parasite" ClientRead (and probably other) wait events.
We don't have session timing statistics before Postgres 14.
I do not have objections against having both "% Total" and "% of DB Time" columns at the same time.
Session stats is true timer values collected on session state changes, while wait sampling is near-real time sampling but it doesn't actually know the exact waiting time. I really don't want to deal with rates between such different values.
Yes, such percentage would not be exact, but imho it would be much more exact comparing to existing "%Total".
And what about the same idea (% of db time) with relation to "Top SQL" sections?
Hello.
Currently, in "Wait Sampling" section, the report shows %Total as percentage of particular event time in relation to all events which were registered. This number can be very misleading. For example, we can see something like:
and we say: wow, we have a problem with locks! But the report was taken for 10 minutes, and there were in average 100 active connections during the time, so total database time was
10*60*100 = 60000
seconds. So 90 seconds is 0.15% (tenth of a percent), not 90% at all!On the other hand,
— this looks like database was waiting for client, but in fact "Lock:tuple" took 59000 out of 60000 database time seconds, i.e. 98% of database time. High "ClientRead" can mean just idle time (lots of sessions that are connected to DB but do nothing).
Suggestion is: in "wait sampling" section, add "% of database time" column, calculated as percentage of "Waited (s)" relative to "Active" column from "Session statistics" block. As far as I understand, that "Active" column more or less corresponds to what Oracle calls "DB Time". I'm not sure, maybe current "%Total" column should be removed. Or maybe someone knows a good use for it, so maybe let there be two percentage columns. Maybe it also worth to show "% of elapsed time" column - relation of "Waited (s)" to ~"Total"~ sum of "Active" + "Idle" from "Session statistics".
Another related point that may be misleading as well is about "Client:ClientRead" event. As far as I understood, it will be almost always in top, just because of the way how pg_wait_sampling works (it collects all data from pg_stat_activity, including idle sessions). So maybe somehow format that row with gray, for example, to do not confuse people who see it occupying most of the time?