ghost commented 5 years ago

After several days Linux becomes slow/unresponsive. When xmrig-nvidia is killed, responsiveness comes back.

I think there is some kind of resource leak in xmrig-nvidia 2.14.1. I will try to narrow it down. It does not appear to be a memory leak because there is still swap space available.

Fedora 24, CUDA 9.1, NVIDIA 410.93 driver, GTX 1080, xmrig-nvidia 2.14.1.

A workaround is to restart every day. FYI more later.

Spudz76 commented 5 years ago

See the compatibility table at NVIDIA. You should either run a client compiled against CUDA 10.0 (matching the driver you're running now) or run a driver between 390 and 395 (which corresponds to CUDA 9.1).

A mismatch is backward compatible, but more like backweird... An exact match is always best and is basically the only thing fully tested by anyone. Strange leaks are the least of the expected issues when running such a mismatch (9.1->10.0 is a two-revision jump, even worse than a single one). I used to have huge issues with 8.0-GA2 compiled apps on the 9.2 level drivers (same two-step jump), and it only hurt specific GPU models (very random pitfalls, all undocumented and unknown until you hit them and wonder...)

Since you have a Pascal GTX 1080 there is not much reason to run old CUDA; I'd upgrade the compile to match your driver. If you prefer to do it only once (in the next few months), then upgrade the driver to 418.39 or newer, get the CUDA 10.1 toolkit, and compile for a full 10.1/10.1 matchup.

Otherwise best is probably to recompile on CUDA 10.0 toolkit to match the driver you're currently running (and never upgrade it beyond 418.39)
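The matching rule described above can be sketched as a small lookup. This is only an illustration: the minimum Linux driver versions below are taken from the table in NVIDIA's CUDA release notes as I remember it and should be checked against the current table, and `newest_matching_toolkit` is a hypothetical helper name, not part of xmrig-nvidia or CUDA.

```cpp
#include <string>
#include <utility>
#include <vector>

// Driver version as {major, minor}, e.g. {410, 93} for 410.93.
// A pair compares lexicographically, so {390, 116} sorts after {390, 46}.
using DriverVersion = std::pair<int, int>;

struct ToolkitRequirement {
    std::string toolkit;       // CUDA toolkit release
    DriverVersion min_driver;  // lowest Linux driver able to run it
};

// Assumed values from NVIDIA's release notes table, ordered oldest to newest.
static const std::vector<ToolkitRequirement> kRequirements = {
    {"9.0", {384, 81}},
    {"9.1", {390, 46}},
    {"9.2", {396, 26}},
    {"10.0", {410, 48}},
    {"10.1", {418, 39}},
};

// Hypothetical helper: the newest toolkit whose minimum driver is not newer
// than `driver`. Returns an empty string if the driver predates the table.
std::string newest_matching_toolkit(const DriverVersion& driver) {
    std::string best;
    for (const auto& req : kRequirements) {
        if (driver >= req.min_driver) best = req.toolkit;
    }
    return best;
}
```

For the 410.93 driver in this report, `newest_matching_toolkit({410, 93})` yields "10.0", which is the toolkit the recompile advice points at.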

Spudz76 commented 5 years ago

RedHat based distros usually have on-by-default extra garbage (oops I mean hardening and security) compared to normal distros like Debian. Having SELinux police all up in your mining operations isn't going to help anything work better, for sure.

I'd also try to hunt down and disable as much of the extra browbeating/paranoia Fedora does by default as possible, so it can't be causing the slowdown after a while (an accounting jam maybe, when it doesn't need to be counting anything at all).

Or run Debian (Ubuntu, Mint... etc) because RedHats are for governments and medical record hosting and nobody actually needs that much hardening for normal ops. And if they do they should be the ones to enable it so they understand it (not on by default and wonder how to shut it off or if it's blocking up something)

ghost commented 5 years ago

Rebuilt with cuda-10.0, running now and will report back in a few days.

i-guru commented 5 years ago

Same situation with my cards: Mint 18.2, GTX 750 Ti (arch50) and GTX 760 (arch30). Tried with CUDA 8.0, 9.0, 9.1 and 9.2. Same result.

ghost commented 5 years ago

After recompiling with cuda-10.0 my system started dragging less than 24 hours later. In top and "cat /proc/meminfo" I see that "MemAvailable" is very low. Swap space isn't being consumed, so I believe there is a physical page leak and the system is gradually running out of physical pages.

Before kill -9 on xmrig-nvidia:
MemFree: 153080 kB
MemAvailable: 29924 kB
AnonPages: 15391096 kB

After kill -9 on xmrig-nvidia:
MemFree: 15169864 kB
MemAvailable: 15090196 kB
AnonPages: 437368 kB
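For anyone who wants to watch these counters over time instead of eyeballing them, here is a minimal sketch of parsing /proc/meminfo-style text. The function names `parse_meminfo` and `available_below` are hypothetical, and the threshold is whatever you pick; nothing here comes from xmrig-nvidia itself.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Parse "Key:   value kB" lines into a map from key to the numeric value.
// Lines without a colon or without a leading number are skipped.
std::map<std::string, long long> parse_meminfo(std::istream& in) {
    std::map<std::string, long long> fields;
    std::string line;
    while (std::getline(in, line)) {
        const auto colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream rest(line.substr(colon + 1));
        long long value = 0;
        if (rest >> value) fields[line.substr(0, colon)] = value;
    }
    return fields;
}

// Hypothetical check: true when MemAvailable is present and below threshold_kb.
bool available_below(const std::map<std::string, long long>& fields,
                     long long threshold_kb) {
    const auto it = fields.find("MemAvailable");
    return it != fields.end() && it->second < threshold_kb;
}
```

On a live system you would pass a `std::ifstream` opened on /proc/meminfo and call this periodically, logging MemAvailable and AnonPages alongside the miner's uptime.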

I'll be looking for a physical page leak in the cryptonight/r code this weekend.

RandomErrorMessage commented 5 years ago

I'm also getting a severe memory leak on xmrig-nvidia with CUDA 9.2

ABOUT XMRig-NVIDIA/2.14.1 gcc/7.4.0
LIBS libuv/1.24.1 CUDA/9.20 OpenSSL/1.1.1b

Spudz76 commented 5 years ago

I may not be seeing it on my GTX970 CUDA 10.1 due to coin switching with meta-miner (relaunches the miner every several minutes to switch coins)

Although I did notice after days the cpu miner begins being unable to lock memory, so I reboot.

ghost commented 5 years ago

Take a look at xmrig-nvidia-2.14.1/src/nvidia/CryptonightR.cu line 393

(it seems cosmetic but is it?)

 #if __CUDA_ARCH__ < 350

vs

if __CUDA_ARCH__ < 350

With this change, 24 hours in, there is no apparent physical page leak.
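As background for the "is it cosmetic?" question, this sketch shows how the conditional itself is evaluated. It does not explain the leak. It relies only on standard preprocessor behavior: an identifier that is not defined as a macro is replaced by 0 inside `#if`, so in a host-only compile (where `__CUDA_ARCH__` is undefined) the test is true, while nvcc's device pass for a GTX 1080 would define it as 610 and take the other branch. The names `kLegacyPathSelected` and `legacy_path_selected` are made up for illustration.

```cpp
// In a plain C++ (host-only) compile __CUDA_ARCH__ is not defined, so the
// preprocessor treats it as 0 here and the comparison 0 < 350 is true.
#if __CUDA_ARCH__ < 350
constexpr bool kLegacyPathSelected = true;
#else
constexpr bool kLegacyPathSelected = false;
#endif

// Hypothetical accessor reporting which branch was compiled in.
constexpr bool legacy_path_selected() { return kLegacyPathSelected; }
```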

xmrig commented 5 years ago

This seems to be the same issue as #260; the only possible solution is to use CUDA 10.1. Thank you.

xmrig / xmrig-nvidia

resource leak #259
