xmrig / xmrig-cuda

NVIDIA CUDA plugin for XMRig miner
GNU General Public License v3.0

How to specify number of threads/blocks? #144

Closed Fjodor42 closed 2 years ago

Fjodor42 commented 2 years ago

Trying out an older Quadro card, I notice that the number of threads and blocks that xmrig chooses for me adds up to below half of the RAM on the card.

How would I go about adjusting the chosen parameters? Are there any rules of thumb as to whether one should increase one, the other, or both, and are their values restricted to multiples of specific numbers?

Spudz76 commented 2 years ago

The autoconfigurator uses as many threads and blocks as the memory controller can reasonably handle (it is limited by processing power, not memory usage). Using more memory only overloads the bus between the GPU and VRAM; at best it adds little speed, and more likely it removes some. Fermi had a narrow bus width and a low shader count, among other limits, all of which are hit before using more than ~1 GB. Until Pascal there wasn't really enough bus speed to use larger amounts of VRAM, although the steps along the way got better.
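To override the autoconfigured values, you edit the per-GPU thread entries in xmrig's `config.json`. A minimal sketch, assuming a recent xmrig release; the algorithm profile key (here `"cn"`) and the numbers are illustrative, so check the profile the miner writes on first run for the exact field names your version uses:

```json
{
  "cuda": {
    "enabled": true,
    "cn": [
      {
        "index": 0,
        "threads": 32,
        "blocks": 26,
        "bfactor": 6,
        "bsleep": 25
      }
    ]
  }
}
```

Deleting the profile entry makes the autoconfigurator regenerate it on the next run, which is a convenient way to get back to a known-good baseline after experimenting.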

Which chip family does it have, and which arch is it (20/21/30/32/35/37...)?

Fjodor42 commented 2 years ago

Thank you for a swift and informative answer :-)

As for your last question, it's a Quadro M4000, which, according to https://developer.nvidia.com/cuda-gpus, would seem to have Compute Capability 5.2. As I understand it, that would mean arch 52?
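The conversion discussed here is just dropping the dot. A small worked example (the function name is mine, not from xmrig):

```python
def cc_to_arch(compute_capability: str) -> int:
    """Convert a compute capability string such as '5.2' to the
    corresponding CUDA arch integer (52) by dropping the dot."""
    major, minor = compute_capability.split(".")
    return int(major) * 10 + int(minor)

print(cc_to_arch("5.2"))  # 52 -> Maxwell, e.g. Quadro M4000 or GTX 970
print(cc_to_arch("3.5"))  # 35 -> Kepler
```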

Spudz76 commented 2 years ago

That's right. Maxwells such as yours, or my GTX 970 4GB, tend to use only about 2.5 GB, because they go slower when you use more (depending on the algorithm; I'm not sure which one you tested with). So with 8 GB it will definitely never touch most of it. It's like owning a 100-acre farm whose tractor is too slow to reach the other end, so most of it goes unused. The tractor's speed is a fixed part of the Maxwell chip: you can't farm all the land because it'll be dark by the time you get all the laps done. I can't even farm 50 acres with my Maxwell. :)

It may use some more on other algorithms. Mine likes either cn-gpu (700 H/s) or autolykos2 (29.37 MH/s); those have less memory-bus usage and so could use more space / do more blocks * threads (= intensity), but they are still limited by the number of compute units / shaders. For cn-gpu you have to use the MoneroOcean fork, since mainstream dropped it a while ago. For autolykos2 I use T-Rex, but lolMiner is another decent option. The autolykos2 miner can use double the normal memory, which might go up into the 6 GB range, and it claims to be faster when it does so; since I can't fit double datasets in my 4 GB, that is unverifiable for me.

You can point these at the MoneroOcean pool and be paid raw XMR if you prefer untraceable coins and only handling one wallet/currency. The pool does the exchanging automatically regardless of which actual coin/algo you are mining, and it auto-seeks whichever one is currently worth the most given your hardware's talents.
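The blocks * threads (= intensity) relationship above also bounds memory use. A rough sketch of the arithmetic, assuming each concurrent hash needs one scratchpad and using 2 MiB per scratchpad (roughly right for classic CryptoNight variants; other algorithms differ, and fixed overhead is ignored):

```python
SCRATCHPAD_MIB = 2  # assumed per-thread scratchpad size; algorithm-dependent

def intensity(blocks: int, threads: int) -> int:
    """Intensity is simply blocks * threads: the number of concurrent hashes."""
    return blocks * threads

def vram_estimate_mib(blocks: int, threads: int,
                      scratchpad_mib: int = SCRATCHPAD_MIB) -> int:
    """Approximate VRAM occupied by the scratchpads alone."""
    return intensity(blocks, threads) * scratchpad_mib

# Illustrative Maxwell-style profile: 26 blocks x 32 threads
print(intensity(26, 32))          # 832 concurrent hashes
print(vram_estimate_mib(26, 32))  # 1664 MiB -- well under half of a 4 GB card
```

This is why raising blocks or threads past what the shaders can keep fed mostly just parks more scratchpads in VRAM without adding hashrate.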

I commonly look GPUs up on TechPowerUp, where they list the Compute Capability as the "CUDA Version" (a misnomer), and yes, you just drop the dot to convert to "CUDA_ARCH". It's sometimes more straightforward than how nvidia's site organizes things, and there are pictures, so you can be more certain it's the exact brand/variant. With the Quadros there weren't many variants or alternative brands, so it's not as much of an issue; there are maybe a hundred slightly different GTX 9xx's with various shaders disabled or other non-obvious changes.

Fjodor42 commented 2 years ago

Thank you for a thorough and illustrative explanation as well as good suggestions, @Spudz76.

As far as I am concerned, my question has been fully answered and then some :-)