yuzi-co opened this issue 5 years ago
Auto sizing for pico doesn't work well. You have to use a config file and leave "threads": null.
Then launch: the miner will write its guess for the sizing into "threads" in that config file, and crash.
Then edit the thread values it thought would work, turning them down (sometimes a lot) until it stops crashing.
Automatic sizing will not work. However, once you find a working combination you can convert back to no-config-file mode by adding --cuda-launch=TxB
with the thread/block values that end up working. Alternatively, figure out the maximum threads and set cuda-threads to that so it won't autosize too large. But I still think autosizing is broken (for pico) and doesn't even work with a limit on threads (it will just make the number of blocks far too large).
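To make the workaround concrete, here is a rough sketch of the config.json round trip. The field names follow the usual xmrig-nvidia thread entries, but the numbers below are placeholders, not known-good settings. Start with:

```json
{
    "algo": "cn-pico/trtl",
    "threads": null
}
```

After the first launch the miner fills in its (oversized) guess for "threads"; edit that generated entry down until it stops crashing, ending up with something like:

```json
{
    "algo": "cn-pico/trtl",
    "threads": [
        {
            "index": 0,
            "threads": 32,
            "blocks": 30,
            "bfactor": 6,
            "bsleep": 25
        }
    ]
}
```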
The real issue is also that there are so many variants of these cards with differing numbers of CUDA cores, especially in the GTX range, where cards share the same chip but have fewer cores enabled. The on-the-fly tuning/detection algorithm gets you within 80% of the maximum hashrate. However, the right number of threads/blocks changes with each algorithm mined. On top of that, if you are on a pool that rotates algorithms, static settings may not work when the miner is configured with "algo": "auto" and "variant": "auto" in the config.json file.
A good fix would be a lookup table that holds optimized settings for each card variant: correct block settings plus some headroom on threads. The thread count should be reduced so it doesn't run into each card's VRAM limit, bearing in mind that on Windows some VRAM is reserved for the base drivers. This memory-allocation problem is moot on cards with 4 GB or more, but you can hit it on 1-3 GB cards. The math for determining threads and blocks, as documented from xmr-stak, doesn't make much sense. You can't simply sort thread/block settings by architecture; that would be a simple 5-6 case statement. It's determined by the SMX count, the amount of free usable RAM on the GPU, and some kind of modulo divisor based on the number of CUDA cores involved in the SMX allocation.
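As a rough illustration of what such a lookup table could look like (the card entries, SMX counts, and thread/block numbers below are invented for illustration only, not measured values), keyed by card variant rather than just architecture, with per-algorithm launch settings:

```json
{
    "gpus": [
        {
            "name": "GTX 1060 3GB",
            "smx": 9,
            "vram_mb": 3072,
            "launch": {
                "cn/r": { "threads": 32, "blocks": 27 },
                "cn-pico/trtl": { "threads": 96, "blocks": 36 }
            }
        },
        {
            "name": "GTX 1060 6GB",
            "smx": 10,
            "vram_mb": 6144,
            "launch": {
                "cn/r": { "threads": 32, "blocks": 30 },
                "cn-pico/trtl": { "threads": 128, "blocks": 40 }
            }
        }
    ]
}
```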
The advantage is that you can get 25% more hashrate when you manually tune the cards by approximation and incrementally stepping the threads/blocks up or down. However, that doesn't work if you are mining on a pool that may rotate algorithms. And if a coin changes its algorithm settings, for example after a fork, you may no longer be set up for the highest possible hashrate (or the miner may crash).
xmrig-nvidia.exe -o loki.miner.rocks:4005 -u XXX -p w=YYY --cuda-devices=0 -a cn-pico/trtl (where XXX is a valid LOKI address)
[CUDA] Error gpu 0::718 "invalid configuration argument"
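For context, "invalid configuration argument" is what CUDA reports when a kernel launch requests more threads, blocks, or shared memory than the device can satisfy, which fits the oversized auto-config described above. Once working values have been found, the no-config-file workaround would look roughly like this (the 32x24 launch geometry is only a placeholder; you still have to find thread/block values that work on your card):

xmrig-nvidia.exe -o loki.miner.rocks:4005 -u XXX -p w=YYY --cuda-devices=0 -a cn-pico/trtl --cuda-launch=32x24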