use extra event instead of sum of events

fixes #50

Not sure whether this needs an extra synchronization. From the NVIDIA blog post linked below:

In addition to the two calls to the generic host time-stamp function myCPUTimer(), we use the explicit synchronization barrier cudaDeviceSynchronize() to block CPU execution until all previously issued commands on the device have completed. Without this barrier, this code would measure the kernel launch time and not the kernel execution time.

https://devblogs.nvidia.com/how-implement-performance-metrics-cuda-cc/

The blog's example uses a host timer rather than events, but we should consider adding another synchronization event so that we can differentiate between kernel launch time and kernel execution time.
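To make the launch/execution distinction concrete, here is a minimal sketch of one way it could be measured. It is not yacx code: `busyKernel`, the `CHECK` macro, and the problem size are hypothetical and exist only for illustration. The sketch assumes the CUDA runtime API, times the asynchronous launch call on the host with `std::chrono`, and times the device-side work with two events. The event interval is only an approximation of execution time, and the first launch may include one-time loading overhead, so a warm-up launch would likely be needed for meaningful numbers.

```cuda
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort on any CUDA runtime error.
#define CHECK(call)                                                    \
  do {                                                                 \
    cudaError_t err_ = (call);                                         \
    if (err_ != cudaSuccess) {                                         \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));   \
      exit(EXIT_FAILURE);                                              \
    }                                                                  \
  } while (0)

// Hypothetical kernel that only exists to keep the device busy.
__global__ void busyKernel(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    float x = data[i];
    for (int k = 0; k < 1000; ++k) x = x * 1.0001f + 0.5f;
    data[i] = x;
  }
}

int main() {
  const int n = 1 << 20;
  float *d_data = nullptr;
  CHECK(cudaMalloc(&d_data, n * sizeof(float)));
  CHECK(cudaMemset(d_data, 0, n * sizeof(float)));

  cudaEvent_t start, stop;
  CHECK(cudaEventCreate(&start));
  CHECK(cudaEventCreate(&stop));

  // Device-side interval: start event, kernel, stop event in the same stream.
  CHECK(cudaEventRecord(start));

  // Host-side interval: the launch returns before the kernel finishes,
  // so this measures launch overhead only, not execution.
  auto hostBefore = std::chrono::steady_clock::now();
  busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
  auto hostAfter = std::chrono::steady_clock::now();
  CHECK(cudaGetLastError());

  CHECK(cudaEventRecord(stop));

  // Block the host until the stop event has completed on the device.
  // Without this, cudaEventElapsedTime would fail with cudaErrorNotReady.
  CHECK(cudaEventSynchronize(stop));

  float execMs = 0.0f;
  CHECK(cudaEventElapsedTime(&execMs, start, stop));
  double launchMs =
      std::chrono::duration<double, std::milli>(hostAfter - hostBefore).count();

  printf("kernel launch (host side): %.4f ms\n", launchMs);
  printf("kernel execution (events): %.4f ms\n", execMs);

  CHECK(cudaEventDestroy(start));
  CHECK(cudaEventDestroy(stop));
  CHECK(cudaFree(d_data));
  return 0;
}
```

In this sketch `cudaEventSynchronize(stop)` plays the role that `cudaDeviceSynchronize()` plays in the blog's host-timer version: it waits only for the work recorded before `stop`, which may be enough here and avoids stalling on unrelated streams.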

zeratax / yacx

use extra event instead of sum of events #95