Not sure if this needs an extra synchronization. The NVIDIA blog post linked below says:
In addition to the two calls to the generic host time-stamp function myCPUTimer(), we use the explicit synchronization barrier cudaDeviceSynchronize() to block CPU execution until all previously issued commands on the device have completed. Without this barrier, this code would measure the kernel launch time and not the kernel execution time.
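To make the quoted point concrete, here is a minimal sketch of host-side timing with and without the barrier. It is not the code from this PR: the `scale` kernel, the `HostTiming` struct, and the `timeKernelOnHost` helper are hypothetical, and `myCPUTimer()` is only assumed to be a wall-clock timer (here built on `std::chrono::steady_clock`).

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Assumed stand-in for myCPUTimer(): wall-clock time in seconds.
static double myCPUTimer() {
    using clock = std::chrono::steady_clock;
    return std::chrono::duration<double>(clock::now().time_since_epoch()).count();
}

// Hypothetical kernel, present only to give the device some work.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical result type: launch-only time and launch-plus-execution time.
struct HostTiming {
    double launch_s;
    double total_s;
};

HostTiming timeKernelOnHost(float *d_x, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaDeviceSynchronize();            // drain earlier work so it is not counted
    double t0 = myCPUTimer();
    scale<<<blocks, threads>>>(d_x, n, 2.0f);
    double t1 = myCPUTimer();           // the launch has returned; the kernel may still be running
    cudaDeviceSynchronize();            // block the host until the kernel has finished
    double t2 = myCPUTimer();

    return {t1 - t0, t2 - t0};
}

int main() {
    const int n = 1 << 24;
    float *d_x = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_x, 0, n * sizeof(float));

    HostTiming t = timeKernelOnHost(d_x, n);
    printf("launch only: %.6f s, launch + execution: %.6f s\n", t.launch_s, t.total_s);

    cudaFree(d_x);
    return 0;
}
```

Reading the timer at `t1` without the second `cudaDeviceSynchronize()` is what the quote warns about: it would capture little more than the cost of issuing the launch.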
The timing here is not done using CUDA events. We should consider adding another synchronization point (or an event) so that kernel launch time and kernel execution time are reported separately.
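For comparison, an event-based measurement could look roughly like the sketch below. It illustrates the general CUDA event pattern and is not a proposal for the exact change: the `scale` kernel and the `timeKernelWithEvents` helper are hypothetical names, and error checking is mostly omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only to give the device some work.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

// Hypothetical helper: returns the time between the two events in milliseconds.
float timeKernelWithEvents(float *d_x, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);             // recorded in the default stream, before the kernel
    scale<<<blocks, threads>>>(d_x, n, 2.0f);
    cudaEventRecord(stop);              // recorded in the default stream, after the kernel

    cudaEventSynchronize(stop);         // wait only until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int n = 1 << 24;
    float *d_x = nullptr;
    if (cudaMalloc(&d_x, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_x, 0, n * sizeof(float));

    float ms = timeKernelWithEvents(d_x, n);
    printf("time between events: %.3f ms\n", ms);

    cudaFree(d_x);
    return 0;
}
```

Because the events are timestamped on the device, this interval tracks the time spent on the GPU more closely than a host timer does, while a host timer read right after the launch would still be needed to report launch overhead on its own.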
fixes #50
https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/