Bad performance on experimental glx backend with tabbed windows

zsolt-donca commented 4 years ago

I have very bad performance with the experimental glx backend when I have several tabbed windows open (using i3wm). I get noticeable input lag if I have multiple large windows on top of one another, such as when using the i3 tabbed layout. The performance impact can be shown with https://www.vsynctester.com/; when I have multiple large windows open, I can see that the compositor drops frames (the frame drops are not always visible in vsynctester's graphs).

I don't have the same issue when only a single window is open, or when using the "old" backends (without the --experimental-backends flag). I also don't have the issue when using the experimental xrender backend, which actually works much faster for me than the experimental glx backend.

Besides changing backends, I have noticed several ways to mitigate the issue:

activating --transparent-clipping makes the issue disappear (after switching to a newer build: picom-git), at least when I have multiple windows on top of each other (in the tabbed layout); however, I don't like this feature because it breaks some visual effects in some applications (e.g. dragging selected text in the browser appears to open a rectangular "window" to the wallpaper);
disabling the drawing of the underlying tabbed windows, as described in the ArchWiki here;
lowering my screen's resolution to Full HD 1920x1080.
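
Why the resolution matters can be estimated with simple arithmetic. The sketch below is a rough worst-case model and not a description of what picom actually does: it assumes that every stacked full-screen window is repainted in full on each frame and that each pixel takes 4 bytes. The function name and parameters are illustrative.

```python
def overdraw_bytes_per_second(width, height, windows, refresh_hz=60, bytes_per_pixel=4):
    """Worst-case bytes written per second if every window is fully repainted each frame.

    Assumption: no occlusion culling, so hidden tabbed windows cost as much as visible ones.
    """
    return width * height * windows * refresh_hz * bytes_per_pixel


# Four maximized windows, as in the tabbed layout described above.
uhd = overdraw_bytes_per_second(3840, 2160, windows=4)
fhd = overdraw_bytes_per_second(1920, 1080, windows=4)
```

Under these assumptions, 3840x2160 needs four times the write bandwidth of 1920x1080 for the same number of windows (roughly 8 GB/s versus 2 GB/s with four windows), which would be consistent with the lower resolution hiding the problem.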

Note that I have the same performance issue when using GNOME, which makes GNOME practically unusable for me at 4K resolution. Not sure if it is relevant, but I noticed that with KWin, performance is significantly worse on their OpenGL 3.x backend and much better on OpenGL 2.0.

Platform

Arch Linux x86_64

GPU, drivers, and screen setup

GPU: Intel HD Graphics P630

Drivers:

xf86-video-intel 1:2.99.917+899+gf66d3954-1
mesa 20.0.4-1

Screen setup: single 4K display 3840x2160 @ 60Hz

➜  ~ glxinfo -B
name of display: :0
display: :0  screen: 0
direct rendering: Yes
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: Intel (0x8086)
    Device: Mesa Intel(R) HD Graphics P630 (KBL GT2) (0x591d)
    Version: 20.0.4
    Accelerated: yes
    Video memory: 3072MB
    Unified memory: yes
    Preferred profile: core (0x1)
    Max core profile version: 4.6
    Max compat profile version: 4.6
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 3.2
OpenGL vendor string: Intel
OpenGL renderer string: Mesa Intel(R) HD Graphics P630 (KBL GT2)
OpenGL core profile version string: 4.6 (Core Profile) Mesa 20.0.4
OpenGL core profile shading language version string: 4.60
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile

OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.4
OpenGL shading language version string: 4.60
OpenGL context flags: (none)
OpenGL profile mask: compatibility profile

OpenGL ES profile version string: OpenGL ES 3.2 Mesa 20.0.4
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20

Note that I have an Optimus setup with an NVIDIA Quadro M1200 Mobile, with nvidia-lts 1:440.64-9, using optimus manager, but I have the performance issue when using the Intel GPU only (which I do most of the time), with the nvidia GPU powered off.

Environment

I am on i3-gaps, but I had the same fundamental performance issue on GNOME as well (obviously not while using picom/compton).

picom version

➜  ~ picom --version
v7.5

Configuration:

unredir-if-possible = false;
backend = "glx";
glx-no-stencil = true;
glx-no-rebind-pixmap = true;
vsync = true;

I am starting picom with picom --experimental-backends --no-fading-openclose.

Steps of reproduction

Open the page https://www.vsynctester.com/ in Chromium, and notice that it runs at 60 FPS without any significant frame dropping (the VSYNC label appears gray);
Open three additional maximized windows; they can be anything, such as terminals or text editors; if you are using i3 (or another tiling WM), have all four windows on the same workspace and use the tabbed layout (mod+W);
Switch to Chromium, and notice in vsynctester that frames drop; in my case, with a total of four windows, the VSYNC label appears almost only red or cyan, switching colors every couple of seconds; from this, I deduce that the compositor falls back to ~30 FPS. Note that vsynctester's metrics still claim 60 FPS.
Notes:
a. At this point, the entire system feels laggy, not just the browser, but also the text editors/terminals.
b. With fewer additional windows open, fewer frames appear dropped, and with more additional windows, more frames get dropped.
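
The deduction that the compositor drops to ~30 FPS matches a simple model of vsync. The sketch below assumes that a frame which misses a refresh deadline is shown at the next one, so the displayed rate can only be the refresh rate divided by a whole number. This is a simplified model with an illustrative function name; it is not taken from picom's code.

```python
import math


def effective_fps(frame_ms, refresh_hz=60):
    """Displayed frame rate under vsync when each frame takes frame_ms to produce.

    Assumption: a frame that misses a refresh deadline waits for the next one.
    """
    period_ms = 1000.0 / refresh_hz
    # Number of refresh intervals one frame occupies (at least one).
    intervals = max(1, math.ceil(frame_ms / period_ms))
    return refresh_hz / intervals
```

In this model a frame that takes 20 ms on a 60 Hz display just misses the 16.7 ms budget and is shown at 30 FPS, so a small slowdown per frame is enough to halve the rate.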

Expected behavior

Having a handful of background windows open shouldn't make the entire desktop feel laggy. Also, I am expecting the experimental backends to have at least the performance of the old backends.

Current Behavior

My system feels much slower on the experimental glx backend, not just when having multiple windows open.

yshui commented 4 years ago

It might be useful to get a trace of picom with apitrace, and use the profile feature of apitrace to measure the frame timings.

See https://github.com/apitrace/apitrace/blob/master/docs/USAGE.markdown for how to do this, esp. https://github.com/apitrace/apitrace/blob/master/docs/USAGE.markdown#profiling-a-trace

yshui commented 4 years ago

qapitrace can generate frame timing graphs as well. It's in Trace -> Profile

yshui commented 4 years ago

Maybe also capture traces of both the experimental and the old glx backends, and compare them.

absolutelynothelix commented 4 years ago

i feel like the root cause of the problem is

Intel HD Graphics P630 4K display

zsolt-donca commented 4 years ago

Thanks for the reply, @yshui!

I will get back to you with the results of the apitrace investigation.

zsolt-donca commented 4 years ago

@mighty9245 What do you mean? The GPU is not something that I can easily replace (other than by replacing the laptop). It seems to work fine with the old backends, so there is some proof that the GPU is capable (enough). I'd also rather not use the dedicated nvidia GPU all the time, as my laptop is already hot enough as it is (and the CPU thermal throttling often kicks in).

zsolt-donca commented 4 years ago

I've created some traces for two scenarios. I've uploaded the .trace files here - I will let the experts analyze them. I ran each scenario twice, repeating it as closely as possible: once with the old backend and once with the new backend.

For tracing the old backend, I used:

apitrace trace --api gl /usr/bin/picom

For tracing the new backends, I used:

apitrace trace --api gl /usr/bin/picom --experimental-backends

In the end, I created the following structure:

├── run1
│   ├── picom-new-backend.trace
│   └── picom-old-backend.trace
└── run2
    ├── picom-new-backend.trace
    └── picom-old-backend.trace

The scenarios:

In run1, I've started with a Chrome window and 3 additional termite windows, in the i3 split layout; this means that all the 4 windows in total were resized to fit the screen. In Chrome, I had https://www.vsynctester.com/ open. I've kept this layout for 30 seconds, then switched to tabbed layout while focusing Chrome; this meant that the termite windows went into background, and Chrome was maximized (the termite windows also being maximized in the background). At this phase while on the new backends, the vsynctester clearly showed that my overall FPS rate fell down to around 30 FPS, because the VSYNC label was almost constantly cyan or red, changing colors every 2-3 seconds only. However, the graphs on the page claimed 60 FPS; this makes me think that it's the compositor that's not performing. I've kept this also for 30 seconds, then ended the profiling.
In run2, I've started with a Chrome window and a termite window in tabbed layout (both maximized, one occluding the other), and progressively increased the load by opening new termite windows every 10 seconds. I did this test also for 60 seconds. When on the new backends, the performance got progressively worse, with the entire system lagging heavily towards the end (when I had 8 windows open in total). When the system was lagging, the VSYNC label was flashing and the animation was heavily stuttering.

If you'd like, I can also make a 60 FPS video of reproducing these scenarios with my phone, to demonstrate the issue.

yshui commented 4 years ago

@zsolt-donca Hi, can you do the profiling on your setup and post the result? If I do the profiling using your apitrace recording, what I will get is how well picom performs on my machine. So you have to do it.

zsolt-donca commented 4 years ago

@yshui I've uploaded the results of the profiling here for the above traces.

I hope I used the correct command. I executed the following for both runs:

apitrace replay --pgpu --pcpu --ppd picom-old-backend.trace > picom-old-backend.profile
apitrace replay --pgpu --pcpu --ppd picom-new-backend.trace > picom-new-backend.profile

My screen changed for the duration of the "replays" for approximately the same duration as I recorded them, but it wasn't actually replaying everything, it was only showing a couple of still frames.

I hope it was okay; if you guys need more data to work with, let me know.
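
For comparing the two .profile files, a small script can sum the GPU time per frame. The sketch below is only an assumption about the profile format: it expects a `# call ...` header line naming the columns, `call` rows with whitespace-separated values, a `frame_end` line closing each frame, and durations in nanoseconds. Check the header of your own file before relying on it. The function names and the `gpu_dura` column name are illustrative.

```python
def frame_gpu_times(lines, column="gpu_dura"):
    """Sum one numeric column of `call` rows between `frame_end` markers.

    Assumed format: a '# call no ... gpu_dura ...' header, then 'call' rows
    whose fields follow the header order, with 'frame_end' ending each frame.
    """
    index = None
    total = 0
    frames = []
    for line in lines:
        words = line.split()
        if not words:
            continue
        if words[0] == "#" and len(words) > 1 and words[1] == "call":
            # Header fields after '#' line up with the fields of a data row.
            index = words[1:].index(column)
        elif words[0] == "call" and index is not None:
            total += int(words[index])
        elif words[0] == "frame_end":
            frames.append(total)
            total = 0
    return frames


def count_over_budget(frame_ns, refresh_hz=60):
    """Count frames whose summed time exceeds one refresh interval (in ns)."""
    budget_ns = 1_000_000_000 / refresh_hz
    return sum(1 for t in frame_ns if t > budget_ns)
```

With `open("picom-new-backend.profile")` passed as `lines`, comparing `count_over_budget` for the old and new backend gives a rough idea of how many frames exceed the 16.7 ms budget. Summed call durations ignore idle time between calls, so this is a lower bound at best.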

ghost commented 4 years ago

This sounds very similar to an issue I'm having. I run i3wm and generally have a lot of tabbed windows.

Running the default config and only adding the flag to enable the experimental backends, I'm getting noticeable lag and jitter in picom with Intel UHD graphics (on an X1 Carbon Gen 7).

I've tried changing just about every variation in the config file to try to diagnose, but the only thing I can pinpoint is running with the experimental backends feature, which ultimately I want to do anyways because of kawase blur.

If there's anything I can provide to help diagnose this let me know - happy to help wherever I can.

yshui / picom