Open zeratax opened 4 years ago
you should consider #47, since both are async.
Device device;
CudaStream Sone, Stwo, Sthree;
CudaStream streams[] = {Sone, Stwo, Sthree};
Kernel kernels[] = {kernel1, kernel2, kernel3};
for (size_t i{0}; i < 3; ++i) {
    kernels[i].queueupload(args, device, streams[i]);
    kernels[i].queuelaunch(args, device, streams[i]); // I guess args could be implicitly known here?
    kernels[i].queuedownload(device, streams[i]);
}
// non-blocking: the CPU can keep working while the GPU is busy (but not during download and upload??)
kernel.sync(); // blocking: the GPU is done after this
This should be equivalent to async version 1.
I'm not sure to what extent upload and download actually need the device; we need to be more explicit about the context here.
more info: https://devblogs.nvidia.com/how-overlap-data-transfers-cuda-cc/
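For reference, here is a minimal sketch of the raw CUDA pattern from that post that the proposed API above would wrap. The kernel, sizes and stream count are placeholders, not part of our API:

#include <cuda_runtime.h>

__global__ void myKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int nStreams = 3;
    const int N = 1 << 20;                       // elements per stream
    const size_t bytes = N * sizeof(float);

    float *h = nullptr, *d = nullptr;
    cudaMallocHost((void**)&h, nStreams * bytes); // pinned host memory, needed for real overlap
    cudaMalloc((void**)&d, nStreams * bytes);

    cudaStream_t streams[nStreams];
    for (int i = 0; i < nStreams; ++i) cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nStreams; ++i) {
        float* hp = h + i * N;
        float* dp = d + i * N;
        cudaMemcpyAsync(dp, hp, bytes, cudaMemcpyHostToDevice, streams[i]); // queue upload
        myKernel<<<(N + 255) / 256, 256, 0, streams[i]>>>(dp, N);           // queue launch
        cudaMemcpyAsync(hp, dp, bytes, cudaMemcpyDeviceToHost, streams[i]); // queue download
    }
    // CPU is free to do other work here.
    cudaDeviceSynchronize();                      // blocking: GPU done after this

    for (int i = 0; i < nStreams; ++i) cudaStreamDestroy(streams[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}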
I've started experimenting with streams a bit on the branch. With that, everything is executed asynchronously on a single stream. But somehow the upload and download operations are still performed synchronously (I don't really know why), so it doesn't actually gain us much.
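One likely cause (an assumption, I haven't looked at the branch in detail): if the upload/download go through plain cudaMemcpy, or through cudaMemcpyAsync on pageable host memory, the copy behaves (mostly) synchronously; overlap with compute needs page-locked (pinned) host memory. A small sketch of the difference:

#include <cuda_runtime.h>
#include <vector>

// Sketch only: contrast of pageable vs pinned host buffers for async copies.
void upload_examples(float* d_buf, size_t n, cudaStream_t stream) {
    // Pageable host memory: cudaMemcpyAsync goes through a staging buffer and
    // is not guaranteed to overlap with kernels running in other streams.
    std::vector<float> pageable(n);
    cudaMemcpyAsync(d_buf, pageable.data(), n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    // Pinned (page-locked) host memory: the copy is truly asynchronous and
    // can overlap with compute on other streams.
    float* pinned = nullptr;
    cudaMallocHost((void**)&pinned, n * sizeof(float));
    cudaMemcpyAsync(d_buf, pinned, n * sizeof(float),
                    cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFreeHost(pinned);
}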
Streams to execute kernels asynchronously. Device should probably take care of the context?
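A rough sketch of what "Device takes care of the context" could look like; this is purely a design assumption, the class and method names are made up, only the CUDA runtime calls are real:

#include <cuda_runtime.h>
#include <vector>

// Design sketch (assumption, not the actual API): Device selects its device ID
// and owns the streams created on it, so callers never touch the context.
class Device {
public:
    explicit Device(int id) : id_(id) { cudaSetDevice(id_); }

    ~Device() {
        for (cudaStream_t s : streams_) cudaStreamDestroy(s);
    }

    // Hand out streams bound to this device.
    cudaStream_t createStream() {
        cudaSetDevice(id_);            // make sure we are on this device
        cudaStream_t s;
        cudaStreamCreate(&s);
        streams_.push_back(s);
        return s;
    }

    // Blocking: wait for all queued work on this device.
    void sync() {
        cudaSetDevice(id_);
        cudaDeviceSynchronize();
    }

private:
    int id_;
    std::vector<cudaStream_t> streams_;
};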