Drops and errors on system load

wwmm / easyeffects

Limiter, compressor, convolver, equalizer and auto volume and many other plugins for PipeWire applications

GNU General Public License v3.0

Drops and errors on system load #3224

Open Massimo-B opened 4 months ago

Massimo-B commented 4 months ago

EasyEffects Version

7.1.3

What package are you using?

Gentoo

Distribution

Gentoo 23.0

Describe the bug

Under higher system load I get drops and errors in pw-top when running easyeffects. This also happens when the CPU is not fully used (about 50%), but with Chrome and an MS Teams session with video running. The load average is about 20, with some btrfs workers in the background. The issue usually goes away when I kill easyeffects. I have also disabled the spectrum now, as it caused a lot of load too.

Expected Behavior

No response

Debug Log

S   ID  QUANT   RATE    WAIT    BUSY   W/Q   B/Q  ERR FORMAT           NAME
S   29      0      0    ---     ---   ---   ---     0                  Dummy-Driver
S   30      0      0    ---     ---   ---   ---     0                  Freewheel-Driver
R   75   1024  48000  31,9us   0,5us  0,00  0,00    2    S16LE 1 48000 alsa_input.usb-Microsoft_Microsoft___LifeCam_HD-3000-02.mono-fallback
R   92      1     25  19,4us   6,1us  0,00  0,00    6       F32LE 1 25  + PulseAudio Volume Control
R   76   1024  48000  73,5us 105,6us  0,00  0,00  550    S32LE 2 48000 alsa_output.pci-0000_00_1b.0.analog-stereo
R  124      1     25  28,9us  28,9us  0,00  0,00    8       F32LE 1 25  + PulseAudio Volume Control
R   77   1024  48000  65,8us   0,9us  0,00  0,00    2    S32LE 2 48000 alsa_input.pci-0000_00_1b.0.analog-stereo
R  128      1     25  34,3us  22,2us  0,00  0,00   10       F32LE 1 25  + PulseAudio Volume Control
I  269      0      0   0,0us   0,0us  ???   ???     0                  ee_test_signals
R   74   1024  48000  15,2ms   1,6us  0,71  0,00  169    S16LE 2 48000 alsa_input.usb-Samson_Technologies_Samson_C01U-00.analog-stereo
R   35      0      0   8,3us  12,7us  0,00  0,00    5     F32P 2 48000  + easyeffects_sink
R   36      0      0  33,9us  13,1us  0,00  0,00  202     F32P 2 48000  + easyeffects_source
R   33      0      0   4,5us 260,4us  0,00  0,01  8772                   + ee_soe_output_level
R   41      0      0   5,1ms   1,4ms  0,24  0,07  12328                   + ee_soe_spectrum
R   52      0      0   3,1us 263,0us  0,00  0,01  6256                   + ee_sie_output_level
R   57      0      0   3,1us   1,3ms  0,00  0,06  13722                   + ee_sie_spectrum
R   89      1     25  78,2us  20,3us  0,00  0,00    7       F32LE 1 25  + PulseAudio Volume Control
R   93      1     25  60,3us  12,0us  0,00  0,00  128       F32LE 1 25  + PulseAudio Volume Control
R  239      1     25  87,5us   8,6us  0,00  0,00    5       F32LE 1 25  + PulseAudio Volume Control
R   95      0      0   9,3us  52,4us  0,00  0,00   58    S32LE 2 48000  + alsa_output.usb-GuangZhou_FiiO_Electronics_Co._Ltd_FiiO_K5_Pro-00.analog-stereo
R  147      1     25  28,7us   9,2us  0,00  0,00   54       F32LE 1 25  + PulseAudio Volume Control
R  212      1     25  53,0us  11,4us  0,00  0,00    6       F32LE 1 25  + PulseAudio Volume Control
R  177      0      0  34,3us   5,2ms  0,00  0,24  2341                   + ee_sie_rnnoise
R  218      0      0  15,6us  12,2us  0,00  0,00  2090                   + ee_sie_echo_canceller
R  130      0      0   3,3us   7,4us  0,00  0,00  796                   + ee_sie_speex
R  185      0      0   4,0us   9,2us  0,00  0,00  2470                   + ee_sie_filter
R  138      0      0   2,8us 420,4us  0,00  0,02  1458                   + ee_sie_bass_enhancer
R  120      0      0   3,5us   6,5us  0,00  0,00  666                   + ee_sie_maximizer
R  207      0      0   3,3us   5,2ms  0,00  0,25  5255                   + ee_sie_crystalizer
R   78      0      0   4,8us 453,7us  0,00  0,02  2286                   + ee_sie_exciter
R  188      0      0   3,4us 458,1us  0,00  0,02  1202                   + ee_sie_stereo_tools
R  168      0      0   4,3us   7,4us  0,00  0,00  167                   + ee_sie_delay
R  267    512  48000  30,9us  18,2us  0,00  0,00    1    F32LE 2 48000  + Google

S   ID  QUANT   RATE    WAIT    BUSY   W/Q   B/Q  ERR FORMAT           NAME
S   29      0      0    ---     ---   ---   ---     0                  Dummy-Driver
S   30      0      0    ---     ---   ---   ---     0                  Freewheel-Driver
R   75    512  48000 106,9us   0,7us  0,01  0,00   19    S16LE 1 48000 alsa_input.usb-Microsoft_Microsoft___LifeCam_HD-3000-02.mono-fallback
R   92      1     25  70,0us   2,7us  0,01  0,00  154       F32LE 1 25  + PulseAudio Volume Control
R   77    512  48000  57,8us   0,7us  0,01  0,00   20    S32LE 2 48000 alsa_input.pci-0000_00_1b.0.analog-stereo
R  128      1     25  28,0us   9,0us  0,00  0,00  222       F32LE 1 25  + PulseAudio Volume Control
R   74    512  48000   5,6ms   1,1us  0,53  0,00  7575    S16LE 2 48000 alsa_input.usb-Samson_Technologies_Samson_C01U-00.analog-stereo
R   76      0      0  26,4us  60,6us  0,00  0,01  5553    S32LE 2 48000  + alsa_output.pci-0000_00_1b.0.analog-stereo
R  124      1     25  25,5us  19,4us  0,00  0,00  119       F32LE 1 25  + PulseAudio Volume Control
R  212      1     25  41,4us   5,2us  0,00  0,00  150       F32LE 1 25  + PulseAudio Volume Control
R  298      0      0  15,1us   4,2us  0,00  0,00   25     F32P 2 48000  + easyeffects_sink
R  123      0      0  23,1us  13,4us  0,00  0,00  5473     F32P 2 48000  + easyeffects_source
R   72      0      0   3,5us 128,6us  0,00  0,01  2022                   + ee_soe_output_level
R  127      0      0   1,4ms  10,5us  0,13  0,00  3687                   + ee_soe_spectrum
R  234      0      0   3,7us 129,4us  0,00  0,01  11492                   + ee_sie_output_level
R  136      0      0   4,4us   7,2us  0,00  0,00  1358                   + ee_sie_spectrum
R   57      1     25   5,6us   3,4us  0,00  0,00   28       F32LE 1 25  + PulseAudio Volume Control
R  113      1     25  16,9us   7,9us  0,00  0,00  502       F32LE 1 25  + PulseAudio Volume Control
R  226      0      0  41,4us   1,4ms  0,00  0,13  7664                   + ee_sie_rnnoise
R  209      0      0  24,2us  13,1us  0,00  0,00  1717                   + ee_sie_echo_canceller
R  155      0      0   5,0us   7,4us  0,00  0,00  681                   + ee_sie_speex
R  227      0      0   4,2us   6,8us  0,00  0,00  407                   + ee_sie_filter
R  224      0      0   3,3us 224,3us  0,00  0,02  2283                   + ee_sie_bass_enhancer
R  211      0      0   3,1us   6,2us  0,00  0,00  180                   + ee_sie_maximizer
R  343      0      0   3,4us   2,9ms  0,00  0,27  32625                   + ee_sie_crystalizer
R  170      0      0   6,0us 289,1us  0,00  0,03  15473                   + ee_sie_exciter
R  114      0      0   4,7us 197,4us  0,00  0,02  13265                   + ee_sie_stereo_tools
R   52      0      0  10,1us   7,7us  0,00  0,00  1401                   + ee_sie_delay
R  379    480  48000  25,9us  23,7us  0,00  0,00  175    S16LE 2 48000  + Google Chrome input
R  255    512  48000  47,8us   7,6us  0,00  0,00   35    F32LE 2 48000  + Google Chrome
R  363      1     25   4,0us   2,9us  0,00  0,00    5       F32LE 1 25  + PulseAudio Volume Control
S  259      0      0    ---     ---   ---   ---     0                  ee_test_signals

Additional Information

$ chrt -a -p `pidof easyeffects`
pid 18254's current scheduling policy: SCHED_OTHER
pid 18254's current scheduling priority: 0
pid 18261's current scheduling policy: SCHED_OTHER
pid 18261's current scheduling priority: 0
pid 18262's current scheduling policy: SCHED_OTHER
pid 18262's current scheduling priority: 0
pid 18263's current scheduling policy: SCHED_OTHER
pid 18263's current scheduling priority: 0
pid 18264's current scheduling policy: SCHED_OTHER
pid 18264's current scheduling priority: 0
pid 18265's current scheduling policy: SCHED_OTHER
pid 18265's current scheduling priority: 0
pid 18267's current scheduling policy: SCHED_BATCH
pid 18267's current scheduling priority: 0
pid 18268's current scheduling policy: SCHED_OTHER
pid 18268's current scheduling priority: 0
pid 18270's current scheduling policy: SCHED_OTHER
pid 18270's current scheduling priority: 0
pid 18271's current scheduling policy: SCHED_FIFO|SCHED_RESET_ON_FORK
pid 18271's current scheduling priority: 83
pid 18413's current scheduling policy: SCHED_OTHER
pid 18413's current scheduling priority: 0
pid 18432's current scheduling policy: SCHED_OTHER
pid 18432's current scheduling priority: 0
pid 18506's current scheduling policy: SCHED_OTHER
pid 18506's current scheduling priority: 0
pid 18799's current scheduling policy: SCHED_OTHER
pid 18799's current scheduling priority: 0
pid 18895's current scheduling policy: SCHED_OTHER
pid 18895's current scheduling priority: 0

Massimo-B commented 4 months ago

For higher loads I tried to increase the quantum. Sometimes that helped in the past, at the cost of higher latency, but with easyeffects it seems to get worse after I tinker with the quantum, for example with pw-metadata -n settings 0 clock.force-quantum 1024 or pw-metadata -n settings 0 clock.force-quantum 2048

wwmm commented 4 months ago

The error column in pw-top indicates xrun errors. In most cases an xrun means that the soundcard is not receiving audio buffers as fast as it would like. In our case this means that for some reason PipeWire is not able to deliver them in time. PipeWire/PulseAudio apps send buffers to the server and not directly to the soundcard. So for some reason the additional load in PipeWire's realtime thread is too much for it to handle on your system.
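To put numbers on the deadline involved: each graph cycle has to finish within one quantum, that is quantum / rate seconds. The sketch below only does that arithmetic on values taken from the pw-top output in this issue. The function names are made up for illustration, and it assumes W/Q and B/Q are simply WAIT and BUSY divided by the cycle duration.

```cpp
// Duration of one graph cycle in microseconds for `quantum` frames at `rate` Hz.
// Hypothetical helper, not part of PipeWire or EasyEffects.
double cycle_period_us(unsigned quantum, unsigned rate) {
  return 1e6 * static_cast<double>(quantum) / static_cast<double>(rate);
}

// Fraction of one cycle consumed by a measured time (assumed meaning of the
// W/Q and B/Q columns).
double quantum_ratio(double measured_us, unsigned quantum, unsigned rate) {
  return measured_us / cycle_period_us(quantum, rate);
}
```

With a quantum of 1024 at 48000 Hz the cycle lasts about 21.3 ms, so the 15,2ms WAIT reported for the Samson microphone is about 0.71 of a cycle, which matches the W/Q column of that row.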

There isn't really much that can be done from the client side. The obvious solution would be to make the plugin code super fast. But besides the fact that most of the plugins we use come from third-party projects, they are probably already as optimized as they can be.

Something to try would be changing some server configuration options like ALSA headroom, or changing kernel versions. This has helped some users in the past. It is also worth identifying which plugins contribute the most to the xrun errors and avoiding them on this machine.

As xrun errors depend strongly on the hardware, it is hard to suggest a direct solution. For example, on my current desktop, even with all the cores of my Ryzen 7700 under a 100% stress test, I have zero errors in pw-top while watching YouTube videos in Firefox.

wwmm commented 4 months ago

Something I have been doing for years is booting with the kernel option threadirqs. Maybe this is helping somehow to play audio at higher loads. Does it make any difference?

tleb commented 4 months ago

Hi!

There isn't really much that can be done from the client side. The obvious solution would be to make the plugin code super fast. But besides the fact that most of the plugins we use come from third-party projects, they are probably already as optimized as they can be.

I've looked at the spectrum code, as that is the plugin with the most xruns and it is implemented by EasyEffects itself.

What would you think about moving most of the processing to outside the realtime thread? Currently it does the DFT inside the RT thread and hands off a mono double buffer when done.

The idea would be that it would do the left+right average, put that to the end of a buffer and be done for the realtime thread. When the spectrum needs to be rendered, then this is copied and worked on from the thread doing the rendering.

Note: one issue of the current approach is that the DFT is done per process event. This might be much more frequent than needed if the PW quantum is really small. In that case, we would reduce work done in the RT thread but also overall CPU load by avoiding work.
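The split proposed above could look roughly like the following on the realtime side. This is only a sketch under stated assumptions, not the EasyEffects implementation: the class name MonoHistory and its methods are hypothetical, and how the rendering thread safely gets at the data is deliberately left out.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical realtime-side accumulator: the only work done per process()
// call is averaging left/right into a preallocated circular buffer.
class MonoHistory {
 public:
  // A capacity of zero is bumped to one so the modulo below stays valid.
  explicit MonoHistory(std::size_t capacity)
      : data_(capacity == 0 ? 1 : capacity, 0.0F) {}

  // Realtime thread: no allocation, no locking, no DFT.
  void push(const float* left, const float* right, std::size_t n_samples) {
    for (std::size_t i = 0; i < n_samples; ++i) {
      data_[head_] = 0.5F * (left[i] + right[i]);
      head_ = (head_ + 1) % data_.size();
    }
  }

  // Rendering side: copy the samples out, oldest first, and run the DFT on
  // the copy. Thread synchronisation is not handled in this sketch.
  std::vector<float> snapshot() const {
    std::vector<float> out(data_.size());
    for (std::size_t i = 0; i < data_.size(); ++i) {
      out[i] = data_[(head_ + i) % data_.size()];
    }
    return out;
  }

 private:
  std::vector<float> data_;
  std::size_t head_ = 0;  // next write position, which is also the oldest sample
};
```

Because the DFT would then run only when a frame is drawn, a very small quantum no longer multiplies the number of transforms per second.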

Do you have any thoughts on that? I'll be able to work on that.

A tangent: how expensive are the util::idle_add() calls? There is one at the end of setup() (called when sample rate or quantum changes) and one at the end of process(). I see an allocation then a call to g_idle_add(). That call, I do not know about.

wwmm commented 4 months ago

What would you think about moving most of the processing to outside the realtime thread?

I updated our master branch now, moving the FFT call to the main thread. Let's see if this helps weaker processors handle the extra load. The thing is that all this load was already avoided when the window was hidden. So unless all those people having xruns always keep the EE window open, I do not expect much change.

A tangent: how expensive are the util::idle_add() calls? There is one at the end of setup() (called when sample rate or quantum changes) and one at the end of process(). I see an allocation then a call to g_idle_add(). That call, I do not know about.

g_idle_add schedules the execution of a function in glib/gtk main thread.

tleb commented 4 months ago

Ah, nice! Thanks.

g_idle_add schedules the execution of a function in glib/gtk main thread.

This sounds like something that requires synchronization and might even be blocking. Nothing great in the audio processing codepath. I see it gets used by most plugins to export results.

wwmm commented 4 months ago

Ah, nice! Thanks.

g_idle_add schedules the execution of a function in glib/gtk main thread.

This sounds like something that requires synchronization and might even be blocking. Nothing great in the audio processing codepath. I see it gets used by most plugins to export results.

It does not require blocking or synchronization. And we must use it because of the usual requirement from graphical toolkits about not having other threads messing with widgets. Sooner or later a move to the main thread will have to be done.

tleb commented 4 months ago

It does not require blocking or synchronization.

The underlying implementation is idle_add_full (code). I'm counting two malloc calls in idle_source_new, one in g_source_set_callback and a mutex lock in g_source_attach.

I'm wondering if putting data in the plugin and letting the main thread access it, without creating a new idle task on each process() call, wouldn't be more efficient? I'd be down to attempt a proof-of-concept if you want.

wwmm commented 4 months ago

I'm wondering if putting data in the plugin and letting the main thread access it, without creating a new idle task on each process() call, wouldn't be more efficient? I'd be down to attempt a proof-of-concept if you want.

Besides the fact that you would have to rewrite considerable amounts of EE code, because what you propose is the opposite of what is done everywhere in EasyEffects, I am skeptical about the performance gains being worth such a radical change. Looking at perf top output, the calls to g_idle_add are not a bottleneck.

And like I said before when the window is hidden (EE in background) none of this code is operational. But some machines will still have xrun. So it is unlikely that the calls to g_idle_add are the source of problem.

wwmm commented 4 months ago

I'm wondering if putting data in the plugin and letting the main thread access it, without creating a new idle task on each process() call, wouldn't be more efficient?

This would probably require to put the main thread in some kind of polling mode. Possibly involving the insertion of our own mutex objects because now we have to worry about what is being done to the data structures inside the plugins. It does not seem very appealing.

tleb commented 4 months ago

Besides the fact that you would have to rewrite considerable amounts of EE code, because what you propose is the opposite of what is done everywhere in EasyEffects, I am skeptical about the performance gains being worth such a radical change. Looking at perf top output, the calls to g_idle_add are not a bottleneck.

Ah I just looked at spectrum and noticed the above remarks. I was thinking about contributing small improvements without aiming for a major rework. Spectrum sounded especially interesting as it runs on all GUI EE instances by default.

About perf top: the goal is not really lower CPU usage, but more predictable timings in the realtime thread. Lower CPU usage is a nice side effect. That is why I am using pw-profiler, run for a few seconds, to see what the spectrum PW node timings look like.

And like I said before when the window is hidden (EE in background) none of this code is operational. But some machines will still have xrun. So it is unlikely that the calls to g_idle_add are the source of problem.

From experiments, the Spectrum::process callback is still being run, i.e. the PW node is in the running state. Is that not expected? I see the code that returns early on the Spectrum::power signal if the chart is not visible, but I see no code that disables the PW node when hidden.

This would probably require to put the main thread in some kind of polling mode. Possibly involving the insertion of our own mutex objects because now we have to worry about what is being done to the data structures inside the plugins. It does not seem very appealing.

I was thinking more of an atomic pointer to A/B data buffers. This means no locking, but is more complex.
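For illustration, here is one well-known lock-free handoff in this spirit. Note that it is a swapped-in variant: it uses three buffers and one atomic index instead of the two A/B buffers mentioned above, because with only two buffers the writer can end up overwriting the buffer the reader is still using. The class and its names are hypothetical and are not taken from the proof-of-concept branch.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <vector>

// Single-producer / single-consumer triple buffer (illustrative sketch).
// The realtime thread owns `back_`, the GUI thread owns `front_`, and the
// atomic `middle_` holds the index of the spare buffer plus a "fresh" flag.
class TripleBuffer {
 public:
  explicit TripleBuffer(std::size_t samples) {
    // Preallocate so that neither thread ever resizes a buffer afterwards.
    for (auto& buffer : buffers_) buffer.assign(samples, 0.0F);
  }

  // Realtime thread: fill this buffer, then call publish().
  std::vector<float>& back() { return buffers_[back_]; }

  // Realtime thread: hand the filled buffer over and take the spare one.
  void publish() {
    const unsigned previous =
        middle_.exchange(back_ | kDirty, std::memory_order_acq_rel);
    back_ = previous & kIndexMask;
  }

  // GUI thread: returns true if new data was picked up into front().
  bool refresh() {
    if ((middle_.load(std::memory_order_acquire) & kDirty) == 0U) return false;
    const unsigned previous =
        middle_.exchange(front_, std::memory_order_acq_rel);
    front_ = previous & kIndexMask;
    return true;
  }

  // GUI thread: most recently picked-up data.
  const std::vector<float>& front() const { return buffers_[front_]; }

 private:
  static constexpr unsigned kDirty = 4U;      // set when the spare buffer is fresh
  static constexpr unsigned kIndexMask = 3U;  // low bits hold a buffer index

  std::array<std::vector<float>, 3> buffers_;
  std::atomic<unsigned> middle_{1U};
  unsigned back_ = 0U;   // touched only by the realtime thread
  unsigned front_ = 2U;  // touched only by the GUI thread
};
```

Neither side ever waits for the other: the realtime thread always has a buffer to write into, and the GUI thread simply keeps showing the previous data when nothing new has been published.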

I wrote a proof-of-concept as 4 commits located on this branch. Check out the commit messages for context. I haven't touched data_mutex, which might be unused, I'm not sure; it is the last big source of timing uncertainty in Spectrum::process. The before/after is something like this (pw-profiler output, 128 quantum, run for 30 seconds):

               Before        After
process calls   12022        12165
avg (µs)          120.5         45.0
stdev              20.6         17.3
min                 0            0
5th perc          107           31
25th perc         112           35
median            117           40
75th perc         127           47
95th perc         152           81
max               294          165

This is a PoC because I need to (1) sleep and (2) re-read that atomic code! One small, last note: I see how my previous messages could be read as rather arrogant, coming from someone appearing out of nowhere and writing down criticism. My desire is only to contribute to a project I enjoy using. What I want to say is, thanks for your work!

Anyway, if you have feedback, or see potential roadblocks for this stuff, I'd be happy to hear it.

wwmm commented 4 months ago

From experiments, the Spectrum::process callback is still being run, i.e. the PW node is in the running state. Is that not expected? I see the code that returns early on the Spectrum::power signal if the chart is not visible, but I see no code that disables the PW node when hidden.

The callback is still called because the plugin is still in the pipeline. But g_idle_add is not invoked if the window is not visible. The only thing that happens is the copy of the buffer received at the input ports to the output ports because the spectrum plugin is put in bypass mode when the window is closed https://github.com/wwmm/easyeffects/blob/36adff90d587f1dda12f1b943f90077dd81e97e0/src/effects_box.cpp#L498. In other words the spectrum is in passthrough mode and none of those computations should be being executed unless some weird regression happened.

wwmm commented 4 months ago

One small, last note: I see how my previous messages could be read as rather arrogant, coming from someone appearing out of nowhere and writing down criticism. My desire is only to contribute to a project I enjoy using. What I want to say is, thanks for your work!

No offense taken :-). It just seems to me that it is too much effort for too little gain, which is going to become unnoticeable next to how CPU hungry some of the plugins are. Especially when we consider that unless the window is open none of the work you are doing will take effect. And when the window is visible, the load that the spectrum drawing brings will matter more to the overall CPU usage, and to possible indirect effects on the realtime thread execution, than whether or not a C++ deque is used in the spectrum calculation, for example.

It is nice to make things faster as long as the code does not become much more complicated. But it is hard to see those microseconds gained in pw-top estimations helping hardware that is struggling to handle heavy plugins like the bass enhancer, crystalizer, multiband compressor, autogain, etc.

At the same time it is obviously not going to do harm. If you are having fun go ahead :-)

tleb commented 4 months ago

The callback is still called because the plugin is still in the pipeline. But g_idle_add is not invoked if the window is not visible. The only thing that happens is the copy of the buffer received at the input ports to the output ports because the spectrum plugin is put in bypass mode when the window is closed https://github.com/wwmm/easyeffects/blob/36adff90d587f1dda12f1b943f90077dd81e97e0/src/effects_box.cpp#L498. In other words the spectrum is in passthrough mode and none of those computations should be being executed unless some weird regression happened.

I cannot see bypass change so on my setup (Wayland, Sway) all the code is executed all the time. spectrum_chart:hide signal is not called, nor the src/effects_box.cpp:dispose function.

Indeed this should be a priority, since the work on top of it is only useful when the window is visible. I'll have a deeper look.

No offense taken :-). It just seems to me that it is too much effort for too little gain, which is going to become unnoticeable next to how CPU hungry some of the plugins are.

Well, this issue shows someone with many many xruns on spectrum, much more than on the other plugins running (output_level, rnnoise, echo_canceller, speex, filter, bass_enhancer, maximizer, crystalizer, exciter, stereo_tools, delay).

This could be explained by timing spikes. Some other plugins take more time on average but their high percentile is pretty close to the average. Spectrum appears to have a low average with a very high percentile. That's my only explanation for the behavior observed. I've tried, but I have not managed to reproduce it (even with a high loadavg, i.e. while compiling).
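The distinction between average and high-percentile timing can be made concrete with two small helpers. This is only an illustrative sketch: the function names are made up, and the nearest-rank percentile used here is an assumption, not necessarily how the pw-profiler figures in this thread were computed.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Arithmetic mean of per-cycle timings in microseconds (0 for an empty set).
double mean_us(const std::vector<double>& timings) {
  if (timings.empty()) return 0.0;
  return std::accumulate(timings.begin(), timings.end(), 0.0) /
         static_cast<double>(timings.size());
}

// Nearest-rank percentile, with p expected in the range (0, 100].
double percentile_us(std::vector<double> timings, double p) {
  if (timings.empty()) return 0.0;
  std::sort(timings.begin(), timings.end());
  const double raw_rank = std::ceil(p / 100.0 * static_cast<double>(timings.size()));
  std::size_t rank = raw_rank < 1.0 ? 1 : static_cast<std::size_t>(raw_rank);
  rank = std::min(rank, timings.size());  // clamp in case p exceeds 100
  return timings[rank - 1];
}
```

As a made-up example, ten cycles of 100 µs each and nine cycles of 10 µs plus a single 900 µs spike have almost the same mean (100 vs 99 µs), but their 95th percentiles are 100 and 900 µs. Only the second pattern risks missing a deadline.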

And when the window is visible, the load that the spectrum drawing brings will matter more to the overall CPU usage, and to possible indirect effects on the realtime thread execution, than whether or not a C++ deque is used in the spectrum calculation, for example.

The patches "precompute Hann window" and "replace input deque by a simpler vector" reduce the work required (whether it is on the realtime or the GUI thread). The patch "move all computations and drawing to a tick callback" moves this work from the realtime to the GUI thread. If it does affect CPU load, it should reduce it by avoiding allocations and scheduling into the glib event loop.
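As an illustration of what precomputing the Hann window means, here is a minimal sketch. The function names are hypothetical and the symmetric form of the window is an assumption; the actual patch may differ.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Symmetric Hann window of length n: w[k] = 0.5 * (1 - cos(2*pi*k / (n - 1))).
// Computed once, for example when the sample rate or quantum changes.
std::vector<float> make_hann_window(std::size_t n) {
  std::vector<float> window(n, 1.0F);
  if (n < 2) return window;  // degenerate sizes: leave the samples untouched
  const double two_pi = 2.0 * std::acos(-1.0);
  for (std::size_t k = 0; k < n; ++k) {
    const double phase = two_pi * static_cast<double>(k) / static_cast<double>(n - 1);
    window[k] = static_cast<float>(0.5 * (1.0 - std::cos(phase)));
  }
  return window;
}

// Per-block work is then one multiplication per sample, with no calls to cos().
void apply_window(const std::vector<float>& window, std::vector<float>& block) {
  const std::size_t n = std::min(window.size(), block.size());
  for (std::size_t k = 0; k < n; ++k) block[k] *= window[k];
}
```

The trigonometric calls happen once per window size instead of once per sample per block, which is where the saving comes from.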

I don't see how the patches can bring more overall CPU usage? And possible indirect effects over the realtime thread? There might be implications I am not aware of?

It is nice to make things faster as long as the code does not become much more complicated. But it is hard to see those microseconds gained in pw-top estimations helping hardware that is struggling to handle heavy plugins like the bass enhancer, crystalizer, multiband compressor, autogain, etc.

Well, it is microseconds on my setup but milliseconds on other setups. I do have a top-of-the-line workstation. The issue author spends 1.4ms in the realtime thread to handle one quantum.

I want to rework the last patch to make the synchronization between the realtime and GUI threads more straightforward. I agree the patches I sent are more complicated.

At the same time it is obviously not going to do harm. If you are having fun go ahead :-)

:-)

Massimo-B commented 4 months ago

Thank you all for your efforts trying to reduce the load and optimize the code. After the spectrum had the most xruns I immediately disabled spectrum, as I don't need it. I still had the xruns. I usually only use easyeffects for my microphone input in order to optimize the quality for conference calls.

As you see some of the used modules belong to the cpu-expensive ones. Here is what I use:

SpeechProcessor (with a headset microphone only. On Freehand condenser mic I additionally have NoiseReduction)
BassEnhancer
Exciter
Crystalizer
StereoTools

BassEnhancer, Exciter and Crystalizer are just nice to have, not a requirement; these might be the most expensive and I could drop them. StereoTools I only need for quickly making the left channel the main mono signal, as my mic mistakenly reports itself as stereo while only the left channel carries the amplified signal. With that setup, the easyeffects process on an idle old i7-3770 Ivy Bridge uses only about 1% CPU, absolutely affordable. Other machines have an i7-4790 Haswell, not much better...

As xrun errors depend strongly on the hardware, it is hard to suggest a direct solution. For example, on my current desktop, even with all the cores of my Ryzen 7700 under a 100% stress test, I have zero errors in pw-top while watching YouTube videos in Firefox.

Interesting. Right now I need to stop CPU-intensive tasks and IO for conference calls.

Something I have been doing for years is booting with the kernel option threadirqs. Maybe this is helping somehow to play audio at higher loads. Does it make any difference?

Interesting, going to try that soon. Any way to enable that in the kernel conf or at runtime? Searching for that option I found CONFIG_IRQ_FORCED_THREADING, is that the same? That's enabled in my current kernel.

I'm currently using self compiled Gentoo patched kernels, currently 6.6.30-gentoo. I could switch over to a distribution kernel which is a binary universal pre-configured kernel that should be sufficient for almost any hardware and usecase. Not sure if I made any serious mistakes in my custom kernel config regarding realtime processing and audio:

# zgrep -i IRQ /proc/config.gz  |grep -v "^#"
CONFIG_IRQ_WORK=y
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_IRQ_MSI_IOMMU=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
CONFIG_TRACE_IRQFLAGS_NMI_SUPPORT=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_HAVE_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SOFTIRQ_ON_OWN_STACK=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
CONFIG_SERIAL_8250_DETECT_IRQ=y

# zgrep -i PREEMPT /proc/config.gz  |grep -v "^#"
CONFIG_PREEMPT_BUILD=y
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y
CONFIG_PREEMPT_DYNAMIC=y
CONFIG_PREEMPT_RCU=y
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_DRM_I915_PREEMPT_TIMEOUT=640
CONFIG_DRM_I915_PREEMPT_TIMEOUT_COMPUTE=7500

# zgrep -i HZ /proc/config.gz  |grep -v "^#"
CONFIG_NO_HZ_COMMON=y
CONFIG_NO_HZ_FULL=y
CONFIG_HZ_300=y
CONFIG_HZ=300

# zgrep -i TICK /proc/config.gz  |grep -v "^#"
CONFIG_TICK_ONESHOT=y
CONFIG_SCHED_HRTICK=y

wwmm commented 4 months ago

I cannot see bypass change so on my setup (Wayland, Sway) all the code is executed all the time. spectrum_chart:hide signal is not called, nor the src/effects_box.cpp:dispose function.

Indeed this should be a priority, since the work on top of it is only useful when the window is visible. I'll have a deeper look.

Maybe you just did not test in a way that lets you see gtk's "destructor" being called. I tested it now on my computer (Arch Linux / KDE / Wayland) through a simple util::warning("test"); put after

if (bypass || !fftw_ready) {
    return;
}

After recompiling EasyEffects and restarting it in service mode easyeffects --gapplication-service I opened another terminal and executed easyeffects as easyeffects. No messages were printed after I closed the window.

I don't see how the patches can bring more overall CPU usage? And possible indirect effects over the realtime thread? There might be implications I am not aware of?

No. Your patches should be fine. I did not see anything in them that would make things slower. I may not have expressed myself well. What I was trying to say is that there are other operations EasyEffects has to do that are heavy and will fight for CPU time with the realtime thread. This could potentially make all the work you are doing invisible to the user. For example, drawing the spectrum in the window is demanding on the CPU. This is a place where optimizations would definitely make a noticeable difference for many people. The problem is that the heaviest operations happen inside gtk and as a result are out of our reach.

wwmm commented 4 months ago

Not sure if I made any serious mistakes in my custom kernel config regarding realtime processing and audio:

@Massimo-B so far the only difference I've noticed when compared to the standard Arch Linux kernel is that it also has CONFIG_NO_HZ=y set. The rest seems to be the same.

Searching for that option I found CONFIG_IRQ_FORCED_THREADING, is that the same? That's enabled in my current kernel.

According to these sources it is necessary to build the kernel with this flag enabled, but that alone is not enough:

https://wiki.ubuntu.com/UbuntuStudio/rtirq
https://wiki.linuxaudio.org/wiki/system_configuration

It is still necessary to pass the threadirqs boot option. Based on my previous experience with something else not related to audio, this seems to be true. Arch Linux's kernel has that flag enabled, but it makes a difference whether the boot option is specified or not.

A long time ago, when the driver for my Radeon was not very stable, the desktop completely froze when the driver crashed and threadirqs was not used. When it crashed with threaded IRQs enabled I could still use the desktop and close the programs before forcing a reboot. Things were super slow after the crash, but at least I could avoid losing data when forcing the reboot.
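Since the boot option has to be given in addition to the kernel config flag, one way to verify that the running kernel actually received it is to look at /proc/cmdline. A minimal sketch follows; the helper names are made up for illustration.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// True if `option` appears as a whitespace-separated token in `cmdline`.
// Hypothetical helper; only exact token matches count.
bool has_boot_option(const std::string& cmdline, const std::string& option) {
  std::istringstream tokens(cmdline);
  std::string token;
  while (tokens >> token) {
    if (token == option) return true;
  }
  return false;
}

// Reads the command line the running kernel was booted with.
// Returns an empty string if /proc/cmdline cannot be read.
std::string read_kernel_cmdline() {
  std::ifstream file("/proc/cmdline");
  std::string line;
  std::getline(file, line);
  return line;
}
```

Calling has_boot_option(read_kernel_cmdline(), "threadirqs") then tells whether the option was passed at boot, independently of CONFIG_IRQ_FORCED_THREADING being set in the kernel config.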

tleb commented 4 months ago

Maybe you just did not test in a way that lets you see gtk's "destructor" being called. I tested it now on my computer (Arch Linux / KDE / Wayland) through a simple util::warning("test"); put after

Ah, correct me if I'm wrong, but if the window exists but is not displayed, then bypass is not enabled. You must have EE running with no existing window ("in the taskbar") for the bypass to be active?

I thought sending the window to another workspace was enough. This is what I do usually.

I'll look at the Gtk APIs to see if we can detect that no part of the window is displayed.

GdkSurface has enter-monitor and leave-monitor but I cannot find a way to grab a reference to our window surface.
Else the frame tick stops when on another desktop so if no ticks are detected after a while we can assume we are hidden and bypass the spectrum. I've done a version of that but I still have a bug to fix before sending.
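The tick-based detection described above could be sketched as a small watchdog. This is not the version mentioned in the comment: the class is hypothetical, and it relies on the observation that frame ticks stop while the window is not shown, which may not hold on every desktop.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <limits>

// Hypothetical watchdog: the GUI thread records each frame tick, and the
// realtime thread treats the window as hidden once ticks stop arriving.
class TickWatchdog {
 public:
  using Clock = std::chrono::steady_clock;

  explicit TickWatchdog(std::chrono::milliseconds timeout)
      : timeout_ns_(std::chrono::duration_cast<std::chrono::nanoseconds>(timeout).count()) {}

  // GUI thread: call from the frame tick callback.
  void on_tick(Clock::time_point now = Clock::now()) {
    last_tick_ns_.store(to_ns(now), std::memory_order_relaxed);
  }

  // Realtime thread: true if no tick was seen within the timeout, or if no
  // tick has been recorded at all yet.
  bool looks_hidden(Clock::time_point now = Clock::now()) const {
    const std::int64_t last = last_tick_ns_.load(std::memory_order_relaxed);
    if (last == kNever) return true;
    return to_ns(now) - last > timeout_ns_;
  }

 private:
  static constexpr std::int64_t kNever = std::numeric_limits<std::int64_t>::min();

  static std::int64_t to_ns(Clock::time_point t) {
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t.time_since_epoch()).count();
  }

  std::int64_t timeout_ns_;
  std::atomic<std::int64_t> last_tick_ns_{kNever};
};
```

The process callback could then skip the spectrum work whenever looks_hidden() returns true. The timeout has to be long enough that a few dropped frames do not toggle the bypass.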

wwmm commented 4 months ago

I thought sending the window to another workspace was enough. This is what I do usually.

This won't cause the destructor to be called. The window has to be closed.

I'll look at the Gtk APIs to see if we can detect that no part of the window is displayed.

As far as I remember there is no API for that. And it makes sense that there isn't if we consider that the virtual desktop management and the knowledge about which windows are visible or not to the user is the domain of the window manager. And as fragmented as Linux is when it comes to desktops the chances that this can be implemented in a way that works for everybody are probably very small.

Else the frame tick stops when on another desktop so if no ticks are detected after a while we can assume we are hidden and bypass the spectrum. I've done a version of that but I still have a bug to fix before sending.

I wonder how well this works across different desktops. And between wayland and xorg. Considering that the main gtk doc about gtk_widget_add_tick_callback isn't even clear about this particular behavior there is the danger this approach may have unexpected behavior in some desktops. It is better to separate this implementation from the others you are working on because this probably will require more time and people for testing.

tleb commented 4 months ago

Noted, I'll make sure to submit work split across pull requests.

Massimo-B commented 4 months ago

@Massimo-B so far the only difference I've noticed when compared to the standard Arch Linux kernel is that it also has CONFIG_NO_HZ=y set. The rest seems to be the same.

From the menuconfig help:

CONFIG_NO_HZ:
This is the old config entry that enables dynticks idle.
We keep it around for a little while to enforce backward
compatibility with older config files.

Instead of this I have set:

CONFIG_NO_HZ_FULL:
Adaptively try to shutdown the tick whenever possible, even when
the CPU is running tasks. Typically this requires running a single
task on the CPU. Chances for running tickless are maximized when
the task mostly runs in userspace and has few kernel activity.

You need to fill up the nohz_full boot parameter with the
desired range of dynticks CPUs to use it. This is implemented at
the expense of some overhead in user <-> kernel transitions:
syscalls, exceptions and interrupts.

By default, without passing the nohz_full parameter, this behaves just
like NO_HZ_IDLE.

If you're a distro say Y.

What does threadirqs boot option actually do? Any documentation about that?

I'm not into the linux audio topic yet, going to compare with https://wiki.linuxaudio.org/wiki/system_configuration#the_kernel They also have some description of that threadirqs kernel option. Of course we also have the sys-kernel/liquorix-sources in some 3rd party repos but I don't think I need that for a simple desktop station doing voice calls. Comparing, what they recommend and what matches my config (6.6.30-gentoo):

CONFIG_HIGH_RES_TIMERS=y ✅
CONFIG_NO_HZ_IDLE=y ❌
CONFIG_PREEMPT=y # low-latency kernel ✅
CONFIG_PREEMPT_RT=y # real-time kernel ⭕ Missing in my kernel

According to the quoted help above, the difference between CONFIG_NO_HZ_IDLE and my CONFIG_NO_HZ_FULL should be zero as I don't pass the nohz_full parameter.
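
A comparison like the one above can be scripted. The following is a rough Python sketch, not an official tool: parse_kconfig, check_options and RECOMMENDED are hypothetical names, the option list is simply the four entries quoted above, and SAMPLE is a made-up excerpt modelled on my checklist rather than a dump of my real config.

```python
RECOMMENDED = (
    "CONFIG_HIGH_RES_TIMERS",
    "CONFIG_NO_HZ_IDLE",
    "CONFIG_PREEMPT",
    "CONFIG_PREEMPT_RT",
)

# Illustrative excerpt only (assumption), not a real config dump.
SAMPLE = """\
CONFIG_HIGH_RES_TIMERS=y
# CONFIG_NO_HZ_IDLE is not set
CONFIG_NO_HZ_FULL=y
CONFIG_PREEMPT=y
"""


def parse_kconfig(text):
    """Return {option: value} for every 'CONFIG_X=value' line."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Lines such as '# CONFIG_X is not set' are comments and are skipped.
        if line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def check_options(text, wanted=RECOMMENDED):
    """Map each wanted option to True if it is built in ('=y')."""
    options = parse_kconfig(text)
    return {name: options.get(name) == "y" for name in wanted}


if __name__ == "__main__":
    for name, ok in check_options(SAMPLE).items():
        print(f"{name}: {'set' if ok else 'missing'}")
```

On kernels built with CONFIG_IKCONFIG_PROC the running configuration could be fed in with gzip.open("/proc/config.gz", "rt").read(); otherwise the .config from the kernel source tree would have to be used.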

I understand now that CONFIG_PREEMPT_RT=y can also be achieved on generic kernels with the threadirqs boot option. I'm still not sure what the benefits and drawbacks of threaded IRQs are; I have not found any official documentation. Why isn't it enabled by default? I found a similar discussion and conclusions at https://www.reddit.com/r/linux_gaming/comments/15buy6a/known_caveats_of_the_threadirqs_boot_parameter/ It feels like trading increased CPU usage for reduced latency, which would not improve the situation if excessive CPU load is the issue.

Mitigations: OK, I have often thought about it, but I'm not going to disable them, as my machines have network and internet connections. Estimating the actual risk is beyond my capabilities. If the performance impact is an issue, I would be better off switching to a next-generation CPU.

tleb commented 4 months ago

What does threadirqs boot option actually do? Any documentation about that?

Documentation/admin-guide/kernel-parameters.txt says this: "Force threading of all interrupt handlers except those marked explicitly IRQF_NO_THREAD".

It might help latency if you have crappy drivers that have slow interrupt handlers. It will also increase CPU load by increasing overhead work to be done per IRQ. At least that is my instinct reading the above description.

This sounds pretty similar to what PREEMPT_RT attempts to do. The latter does go much, much further by turning most locking primitives into sleeping ones (it makes no sense turning IRQs into threaded ones if they still all use a raw spinlock), avoiding noirq (atomic) context whenever possible, and many other things I am not aware of.

* CONFIG_PREEMPT_RT=y # real-time kernel ⭕ Missing in my kernel

This requires external patches. It should be fully integrated into the kernel in a few releases; they have done massive work! You shouldn't need it for a desktop setup doing voice calls.

I understand now that CONFIG_PREEMPT_RT=y can also be achieved on generic kernels with the threadirqs boot option. I'm still not sure what the benefits and drawbacks of threaded IRQs are; I have not found any official documentation. Why isn't it enabled by default? I found a similar discussion and conclusions at https://www.reddit.com/r/linux_gaming/comments/15buy6a/known_caveats_of_the_threadirqs_boot_parameter/ It feels like trading increased CPU usage for reduced latency, which would not improve the situation if excessive CPU load is the issue.

No, PREEMPT_RT is not achieved on upstream kernels with threadirqs; that would ignore most of the work done by the PREEMPT_RT people. Part of that work has already been integrated upstream though, so you already benefit from it.

wwmm commented 4 months ago

It feels like trading increased CPU usage for reduced latency, which would not improve the situation if excessive CPU load is the issue

@Massimo-B at least on my computer the overhead is unnoticeable. And the threadirqs option does not require kernel recompilation. Just set it in your bootloader configuration and see what happens. If things get worse just remove it from the boot loader configuration and reboot.

Massimo-B commented 4 months ago

at least on my computer the overhead is unnoticeable. And the threadirqs option does not require kernel recompilation. Just set it in your bootloader configuration and see what happens. If things get worse just remove it from the boot loader configuration and reboot.

Recompilation is not an issue, but a reboot often is, as I lose my whole session. Is there no way to enable that feature at runtime? Is there any flag to check if it is active? I searched for something like find /proc /sys -iname "*threadirq*" ...

Where do threaded IRQs help? I guess IRQs are only used for hardware interaction, such as sound, but also graphics or disk I/O? I often have lots of btrfs workers in the background: not much CPU usage, but still a high load average. I had the feeling that even that disk I/O led to errors in EasyEffects.

Besides that, I reproduced these errors on both audio outputs I have: the internal Intel one and an external amplifier/DAC via USB 2.0. No difference. Any recommendation?

wwmm commented 4 months ago

Recompilation is not an issue, but a reboot often is, as I lose my whole session.

And how are you dealing with PipeWire updates? It is possible to restart it manually, but sometimes the desktop's volume manager gets broken in the process and a reboot is more practical.

There is no way to enable that feature at runtime?

Not that I am aware of.

Where do threaded IRQs help? I guess IRQs are only used for hardware interaction, such as sound, but also graphics or disk I/O?

They should help to reduce the kernel's latency (https://lwn.net/Articles/302043/), which in turn might have some effect on the xrun errors you are seeing. Only tests will tell if there is any difference.

Besides that, I reproduced these errors on both audio outputs I have: the internal Intel one and an external amplifier/DAC via USB 2.0. No difference. Any recommendation?

Besides trying the threadirqs option, the only other thing that comes to mind would be to change the dynamic preempt mode (https://www.phoronix.com/news/Linux-5.12-Dynamic-Preempt). Recent kernels allow this to be changed on the fly through the debugfs interface. But I would expect this option to have an even lower chance of affecting those xruns.
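
For reference, a small sketch of how the current mode could be read. It assumes that the debugfs file is /sys/kernel/debug/sched/preempt (it was sched_preempt when the feature first landed), that debugfs is mounted, that the script runs as root, and that the file lists the modes with the active one in parentheses, e.g. "none voluntary (full)". Please check those assumptions on your kernel; parse_preempt_modes is a hypothetical helper name.

```python
PREEMPT_FILE = "/sys/kernel/debug/sched/preempt"  # assumed path, may differ


def parse_preempt_modes(text):
    """Split e.g. 'none voluntary (full)' into (all modes, active mode)."""
    modes, active = [], None
    for token in text.split():
        if token.startswith("(") and token.endswith(")"):
            token = token[1:-1]  # the parenthesised entry is the active mode
            active = token
        modes.append(token)
    return modes, active


if __name__ == "__main__":
    try:
        with open(PREEMPT_FILE) as handle:
            modes, active = parse_preempt_modes(handle.read())
        print(f"available: {modes}, active: {active}")
    except OSError as error:
        # Typically a permission problem or debugfs not being mounted.
        print(f"cannot read {PREEMPT_FILE}: {error}")
```

Switching modes should then be a matter of writing one of the listed names back to the same file as root, again assuming the kernel was built with dynamic preemption.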

Massimo-B commented 4 months ago

As mentioned above, CONFIG_HAVE_PREEMPT_DYNAMIC is enabled here.

I booted with the new option and looked through dmesg and syslog to see whether there is any information about it being enabled. The only thing I find is the quoted CMDLINE:

# grep thread /var/log/everything/current 
Jul 05 07:58:52 [kernel] Command line: BOOT_IMAGE=/vmlinuz-6.6.21-gentoo root=PARTUUID= ro rootflags=subvol=volumes/root rd.vconsole.font=ter-u12n rd.vconsole.keymap=de-latin1-nodeadkeys rd.locale.LANG=de_DE.UTF-8 rd.lvm=0 rd.md=0 rd.dm=0 rd.luks.uuid=67... rd.luks.allow-discards=67... root=LABEL=gentoo rootflags=subvol=volumes/root threadirqs
Jul 05 07:58:53 [kernel] Kernel command line: BOOT_IMAGE=/vmlinuz-6.6.21-gentoo root=PARTUUID= ro rootflags=subvol=volumes/root rd.vconsole.font=ter-u12n rd.vconsole.keymap=de-latin1-nodeadkeys rd.locale.LANG=de_DE.UTF-8 rd.lvm=0 rd.md=0 rd.dm=0 rd.luks.uuid=67... rd.luks.allow-discards=67... root=LABEL=gentoo rootflags=subvol=volumes/root threadirqs
Jul 05 07:58:53 [kernel] process: using mwait in idle threads

Massimo-B commented 4 months ago

I found that with the threadirqs option enabled I have 15 processes in pgrep -alf 'irq/', one for each hardware device.
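
Both checks can be combined in a short script. This is only a sketch under a few assumptions: has_threadirqs and count_irq_threads are hypothetical helper names, the command line is taken from /proc/cmdline, and thread names are read from /proc/<pid>/comm. Forced-threaded handlers show up as kernel threads named irq/<number>-<name>, but some drivers request threaded IRQs on their own, so a few such threads can exist even without the boot option.

```python
import glob
import re

IRQ_THREAD = re.compile(r"^irq/\d+-")


def has_threadirqs(cmdline):
    """True if 'threadirqs' is present as its own word on the command line."""
    return "threadirqs" in cmdline.split()


def count_irq_threads(names):
    """Count names that look like IRQ handler threads, e.g. 'irq/27-snd_hda'."""
    return sum(1 for name in names if IRQ_THREAD.match(name))


def read_thread_names():
    """Collect process and kernel thread names from /proc/<pid>/comm."""
    names = []
    for path in glob.glob("/proc/[0-9]*/comm"):
        try:
            with open(path) as handle:
                names.append(handle.read().strip())
        except OSError:
            pass  # the process exited while we were scanning
    return names


if __name__ == "__main__":
    with open("/proc/cmdline") as handle:
        print("threadirqs on command line:", has_threadirqs(handle.read()))
    print("irq/ threads:", count_irq_threads(read_thread_names()))
```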

Massimo-B commented 4 months ago

Running very smoothly now with threadirqs, even with all 8 cores pegged at 100% and a load average of about 25 or more.

In pw-top the ERR column isn't zero, but it is not obviously increasing and there are no audible drops. Currently the highest ERR count is for the rnnoise module:

S   ID  QUANT   RATE    WAIT    BUSY   W/Q   B/Q  ERR FORMAT           NAME                                                                                                                                                                                                                                                    
R  133      0      0  34,9us 705,9us  0,00  0,03  11468                   + ee_sie_rnnoise

tleb commented 4 months ago

I'd be curious whether you could check your old setup with the master branch, now that #3231 and #3242 have been applied. Both optimise the spectrum realtime code, so the ee_*_spectrum ERR column values should increase much more slowly. I see a 5x perf improvement on average (but in absolute values on my setup we only go from 183µs to 34µs), and I expect much better worst-case performance (meaning fewer xruns).

Massimo-B commented 4 months ago

I would, but even with the old setup it was not easy to reproduce, as it happened after several days of uptime. Sometimes it got better after restarting PipeWire and EasyEffects, which itself did not always work, so I usually solved it with a reboot... I need to keep watching the issue with the new threadirqs setting.

Massimo-B commented 3 months ago

On a different machine I have the drops and errors even though I'm running threadirqs. There I'm using a Bluetooth headset. In pw-top I see that the Bluetooth input is using a very low quantum:

R 134 256 48000 896,3us 1,3us 0,17 0,00 196 S16LE 1 16000 bluez_input.08_C8_C2_1E_AA_4C.0

Other hardware has 1024, the browser process has 512. Should I increase the quantum? OK, I tried. Increasing it up to 2048 stops the ERR column from increasing, but there are still drops in the playback.
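
To put those numbers in perspective, the quantum divided by the rate is the time budget of one processing cycle, and the W/Q and B/Q columns are the WAIT and BUSY times relative to that budget. The sketch below redoes that arithmetic for the row above; quantum_period_us and parse_us are hypothetical helper names, and the parser only handles the comma-decimal "us" values shown here, not other units pw-top may print.

```python
def quantum_period_us(quantum, rate):
    """Duration of one processing cycle in microseconds."""
    return quantum / rate * 1_000_000


def parse_us(field):
    """Convert a pw-top time such as '896,3us' to microseconds as a float."""
    value = field[:-2] if field.endswith("us") else field
    return float(value.replace(",", "."))


if __name__ == "__main__":
    for quantum in (256, 512, 1024, 2048):
        period_ms = quantum_period_us(quantum, 48000) / 1000
        print(f"quantum {quantum:>4} at 48000 Hz = {period_ms:.1f} ms per cycle")

    # The Bluetooth row: WAIT 896,3us with quantum 256 at 48000 Hz.
    ratio = parse_us("896,3us") / quantum_period_us(256, 48000)
    print(f"W/Q = {ratio:.2f}")
```

With a quantum of 256 the whole graph has only about 5.3 ms per cycle, and the Bluetooth node already spends roughly 17% of that waiting, which matches the 0,17 in the W/Q column. At 2048 the budget grows to about 42.7 ms, at the cost of correspondingly higher latency. As far as I know the quantum can be forced at runtime with pw-metadata -n settings 0 clock.force-quantum 2048 and reset by setting it to 0, which makes this easy to experiment with.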

wwmm commented 3 months ago

Other hardware has 1024, the browser process has 512. Should I increase the quantum? OK, I tried. Increasing it up to 2048 stops the ERR column from increasing, but there are still drops in the playback.

If increasing the quantum helped maybe adjusting some settings like ALSA headroom in PipeWire/WirePlumber's configuration files will "fix" the issue.

Massimo-B commented 3 months ago

How? And why would that help? In https://pipewire.pages.freedesktop.org/wireplumber/daemon/configuration/alsa.html I found an api.alsa.headroom setting, which is 0 by default. Is that the setting you mean? It sounds like a workaround for bad devices; I'm not sure if that is the case here.

Massimo-B commented 3 months ago

wwmm commented 3 months ago

How? And why would that help? In https://pipewire.pages.freedesktop.org/wireplumber/daemon/configuration/alsa.html I found an api.alsa.headroom setting, which is 0 by default. Is that the setting you mean? It sounds like a workaround for bad devices; I'm not sure if that is the case here.

In the past I saw users in similar situations see improvements after tweaking the ALSA headroom. I am not sure if this was done in PipeWire's or WirePlumber's configuration files. When this parameter was exposed some years ago it was in PipeWire's files. Maybe it is only in WirePlumber now.

As explained in the link, this parameter affects the timing between hardware and software when it comes to audio buffer management. When they say "bad devices", keep in mind that this does not necessarily mean defective hardware. In most cases it will just be the driver not behaving as PipeWire expects it to. The extra delay may be what your system needs to handle audio under pressure.

There is no need to reboot and lose your session to test this. Manually restarting PipeWire and WirePlumber should be enough.