wdmapp / gtensor

gtensor is a C++14 header-only multi-dimensional array library for hybrid GPU development.

Using streams with gtensor #138

Closed: td-mpcdf closed this issue 2 years ago

td-mpcdf commented 3 years ago

It seems necessary to make it possible to use (CUDA) streams with gtensor. For an explicit kernel launch this could be another argument; for the assignment kernels, we should add a function set_stream(streamID) and launch the assignment kernel in this stream. A set_default_stream should set it back.
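A rough sketch of what this could look like (hypothetical interface, nothing of this exists in gtensor yet):

  gt::set_stream(stream);              // subsequent assignment kernels run on `stream`
  a = b + c;                           // assignment kernel launched on `stream`
  gt::set_default_stream();            // back to the default stream

  gt::launch<1>(shape, body, stream);  // explicit launch with a stream argument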

germasch commented 3 years ago

So what is the use case that you're actually thinking about? Passing a stream to launch wouldn't be too hard, but it'll also require wrappers for creating / destroying streams, etc -- in a way that's portable to HIP (easy) and Intel, too. And there's the rather complicated sync semantics with respect to other streams and the default stream to keep in mind. A set_stream function that essentially sets a default stream for subsequent gtensor kernel invocations also needs some more thinking to ensure thread safety if gtensor is used from within a (host-)threaded environment.

td-mpcdf commented 3 years ago

We are working on a project where we employ the StarPU framework to taskify GENE. One advantage is being able to run tasks on GPU and CPU, but the transfers and also the kernel launches need to use streams to happen asynchronously. And as we use the already ported kernels of GENE (employing gtensor), I think we need this functionality. But maybe Carlos (who is working on the project) can add more.

carlosenlopez commented 3 years ago

Hello Kai,

As Tilman mentioned, I am working on a project where we use the StarPU framework to taskify GENE. The main goal is to taskify the most expensive computations and execute the tasks on the CPU and GPU concurrently whenever possible. In StarPU we just define the tasks, and the task scheduler will choose where to execute them and will transfer all the required data automatically. That is, we do not do any data transfers explicitly; all data transfers are done by the scheduler. So far we are only using CUDA for the GPUs, but StarPU supports other accelerators.

To make data transfers to the GPU asynchronous using CUDA, StarPU requires pinning the memory and using streams when launching kernels. StarPU provides a stream for every CUDA device controlled by StarPU. For accessing these streams it provides a function

  cudaStream_t starpu_cuda_get_local_stream(void);

which returns the current worker's CUDA stream. This function is only provided for convenience so that programmers can easily use asynchronous operations within tasks without having to create a stream by hand. I am not sure, but I think that all streams are destroyed when StarPU finalizes.
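For illustration, an asynchronous copy inside a CUDA codelet would then look roughly like this (d_buf/h_buf/nbytes are placeholders; the host buffer must be pinned):

  cudaStream_t s = starpu_cuda_get_local_stream();
  cudaMemcpyAsync(d_buf, h_buf, nbytes, cudaMemcpyHostToDevice, s);
  cudaStreamSynchronize(s);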

I am new to gtensor, but I thought it would be a good idea to have a new parameter specifying, for each kernel, the stream we want to use; if not specified, the default stream is used. In my case, StarPU will be responsible for creating/destroying the streams.

germasch commented 3 years ago

Does starpu use threads? The reason I'm asking is that if you have multiple host threads offloading things to the gpu (either the same gpu in different streams, or different gpus), a global set_stream() will cause race conditions.

I don't know starpu, but another approach could be to use different host threads for each task or sequence of kernels to be offloaded, and use cuda's "per thread default stream" option, which would avoid having to set cuda streams yourself, and the associated potential races. Another advantage to this, IMO, is that it's also possible to put things in a host thread that don't work without major hassles as part of cuda streams, like MPI communication.
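For reference, a minimal sketch of that option (kernel names and arguments are placeholders; requires compiling with nvcc --default-stream per-thread):

  #include <thread>

  __global__ void kernel_a(int* p) { /* ... */ }
  __global__ void kernel_b(int* p) { /* ... */ }

  void run(int* a, int* b) {
    // with --default-stream per-thread, each host thread gets its own
    // default stream, so the two kernels can overlap without any explicit
    // stream management (and the associated potential races)
    std::thread t1([&] { kernel_a<<<1, 256>>>(a); });
    std::thread t2([&] { kernel_b<<<1, 256>>>(b); });
    t1.join();
    t2.join();
    cudaDeviceSynchronize();
  }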

bd4 commented 3 years ago

FWIW I think I can do SYCL stream emulation by creating extra queues on demand as new streams are set in the already existing global singleton class, which is currently used to emulate the library's global default device.

I don't know if HIP has a per thread default stream option? Also SYCL definitely does not, so that approach is not portable (although perhaps DPCPP has something in their compatibility tool for translating CUDA?).

I think an argument can be made that what SYCL does, with no global state and having queue objects to encapsulate a device/stream combo, is a better API - if an application wants global state, it can define a global queue. While defining it in the library is convenient, it lacks flexibility and leads to challenges when you have multiple host threads. Currently gtensor has already inherited the global state stuff from CUDA/HIP though, so I'm not sure we want to change - it basically involves passing an extra object (the queue) to every API call, or making it a method on the queue object. I guess I'm not sure if CUDA/HIP support overriding the global device though.... something to think about.
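For comparison, a minimal sketch of the SYCL model described above (SYCL 2020 syntax; dst/src are assumed to be USM device pointers):

  #include <sycl/sycl.hpp>

  void scale(sycl::queue& q, double* dst, const double* src, size_t n) {
    // the queue object encapsulates the device/"stream" combo; no global state
    q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) { dst[i] = 2.0 * src[i]; });
    q.wait();
  }

  // an application that wants global state can define a global queue itself:
  // sycl::queue q{sycl::gpu_selector_v};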

carlosenlopez commented 3 years ago

From the Starpu manual: StarPU automatically binds one thread per CPU core. It does not use SMT/hyperthreading because kernels are usually already optimized for using a full core. Since driving GPUs is a CPU-consuming task, StarPU dedicates one core per GPU. While StarPU tasks are executing, the application is not supposed to do computations in the threads it starts itself, tasks should be used instead.

Due to CUDA limitations, StarPU will have a hard time overlapping its own communications and the codelet computations if the application does not use a dedicated CUDA stream for its computations instead of the default stream, which synchronizes all operations of the GPU.

bd4 commented 3 years ago

Sounds like StarPU already has its own stream model, and if it uses multiple threads, handles stream assignment itself? Perhaps it was developed before the CUDA per thread default stream feature was available.

@carlosenlopez what other accelerators does it support? On the main website it just says CUDA.

I would like to better understand the use case and how StarPU works, but in principle I am fine with adding global streams for now. It could be hard to use correctly, but I don't think the implementation inside gtensor will be much work - the user will have to manage the tricky part of it. We already have a global device, and I am not keen on rearchitecting gtensor to be more SYCL-like until v2 :P.

We could also avoid a global set_stream by making it an argument and exposing a functional version of assign. I would love to figure out a way to pass a name to the kernel that would appear in nvprof/rocprof output, rather than the gtensor template name, which is hard to recognize, so there is something to be said for a functional interface over just using x = y, as nice as that looks.

bd4 commented 3 years ago

Looks like it uses OpenCL to support other accelerators.

germasch commented 3 years ago

I don't know whether AMD supports per-thread default streams. I'd say it should, since they're generally aiming to copy cuda's approach.

Implementing this for sycl (in gtensor) would be pretty straightforward.

It sounds like starpu's approach is one host thread per gpu, and then running multiple tasks on that gpu, so that doesn't work with per-thread default streams as-is. Instead, it would have to create a new host thread for each task (or sequence of tasks) it wants to run, and then launch from that thread (i.e., create threads instead of streams). These default streams would then not synchronize with each other.

To make a point about the subtlety involved: the way things currently are (if I understand them correctly), on a multi-gpu system, starpu would create one host thread for each gpu. If one host thread does gt::set_stream(...); run_kernel(); while another does the same to run something else on another gpu, you already have a race condition. That's why I think any API that relies on a global something (like the stream to launch on) is not great.

bd4 commented 3 years ago

@carlosenlopez do you know if a global (per process) set_stream would work with the StarPU integration? As an alternative, we could define an assign(lhs, rhs, stream=null) function, so you could just pass the stream from starpu, but lose the nice lhs = rhs syntax.
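i.e., something along these lines (hypothetical signature, not implemented at this point):

  // instead of: lhs = rhs;
  gt::assign(lhs, rhs, starpu_cuda_get_local_stream());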

carlosenlopez commented 3 years ago

what other accelerators does it support? On the main website it just says CUDA.

Nvidia GPUs, OpenCL devices and there were some attempts to use FPGAs (https://hal.univ-grenoble-alpes.fr/hal-01858951/file/recosoc.pdf).

I would like to better understand the use case

To use StarPU you must write your code as StarPU tasks. Each task can have an implementation for different devices: a CPU function for the CPU and/or a CUDA kernel for the GPU, for example. At runtime, the application only inserts the tasks, and the StarPU task scheduler will decide where to execute them. If the scheduler chooses the CPU, it will call the CPU implementation, and if it chooses the GPU, it will execute the CUDA kernel (GPU implementation). The task execution is done by calling an additional CPU function (the codelet kernel) which, in the case of the GPU, launches the CUDA kernel, and in the case of the CPU, calls the corresponding CPU function. All data is transferred to the device before the task is executed. In C/C++, the codelet kernel launches CUDA kernels as follows:

  func<<<grid, block, 0, starpu_cuda_get_local_stream()>>>(foo, bar);
  cudaError_t status = cudaGetLastError();
  if (status != cudaSuccess) STARPU_CUDA_REPORT_ERROR(status);
  cudaStreamSynchronize(starpu_cuda_get_local_stream());

The case of GENE is similar. We have a function with a CPU implementation and a GPU implementation. In the codelet kernel, instead of launching the kernel as mentioned above, it calls the CUDA wrapper that uses gtensor. What is missing is how to pass/specify the stream provided by starpu_cuda_get_local_stream() when launching the kernel.

and how StarPU works,

Unfortunately, I do not know the internals of StarPU, but we can ask the StarPU developers via their mailing list: starpu-devel@inria.fr. I asked the developers to explain to me how CUDA streams work in StarPU, and this is their answer:

StarPU creates one input stream and one output stream per GPU device, which are used for data transfers to and from that GPU device. Also, StarPU creates one stream per worker associated with a GPU device. These streams are used for executing tasks. By default, there is one worker per GPU device, but this can be changed by setting the environment variable STARPU_NWORKER_PER_CUDA to an integer larger than 1. Thus input prefetches, output prefetches, and computations can all happen concurrently, as long as the hardware supports it. If STARPU_NWORKER_PER_CUDA > 1, multiple computations can happen concurrently on the same GPU device (again, provided the hardware supports it).

If you have multiple GPU devices on the machine and support for CUDA's GPUDirect is enabled, two additional streams are created for each pair of GPU devices, to support direct inter-GPU transfers. All the streams are created at StarPU's initialization time and freed at finalization time.

If there is no space available on the GPU, StarPU may try to evict old data from the GPU memory when possible, to make room. If this is not sufficient, the prefetch operation will wait until more GPU data can be freed as a result of task completion.

carlosenlopez commented 3 years ago

The way things currently are (if I understand them correctly), on a multi-gpu system, starpu would create one host thread for each gpu.

This is true. StarPU assigns one CPU core to each GPU.

If one host thread goes gt::set_stream(...); run_kernel(); while another does the same do run something else on another gpu, you now already have a race condition.

According to StarPU, you should specify the CUDA stream when launching the CUDA kernels. The stream will correspond to the computing stream already created for the GPU that StarPU chose (see the comment above). If I have two GPU tasks that can run in parallel on two different devices, I think we will have exactly this race condition if gt::set_stream(...) changes the stream for all kernels (I am not sure I understand correctly how gt::set_stream() would work in this case).

do you know if a global (per process) set_stream would work with the StarPU integration?

I understand this as a different interpretation of gt::set_stream(...) from the one mentioned above. Does this mean that I will have one stream per GPU? So, if I have 4 GPUs, will I be able to set a different stream for each GPU? I think this will work. I will be limited to using one StarPU worker per GPU (STARPU_NWORKER_PER_CUDA=1), but I think that is OK. I do not know yet if we need more workers to overlap computations in the same GPU.

bd4 commented 3 years ago

@carlosenlopez can you ask how many host threads are used by the CUDA driver? You can tell them that you want to integrate with an existing GPU library, and ask if using cudaSetStream within this library would be safe, or if multiple threads will be involved and it could break.

Looking at drivers/cuda/driver_cuda.c, it appears that there may be only one thread, tied to worker0, but I am not at all confident there isn't another thread involved somewhere, like for data movement. Absent a strong statement from the StarPU team, my assumption is that cudaSetStream is not safe to use, or at the very least would be very fragile to configuration changes.

If we can't use cudaSetStream, we could provide gt::assign which takes an optional stream argument. We may also need to add a stream argument to gt::copy - do you think all data movement would be handled at the StarPU level, or are you expecting to do copies from gtensor as well? Note that StarPU uses separate streams for data movement (one per worker, two per device in/out, and ndevs*ndevs for device to device). It also has a pthread_mutex and pthread_cond per cuda device. This all makes it seem less likely that cudaSetStream is safe to use.

carlosenlopez commented 3 years ago

Here is the answer from one of the StarPU developers:

By default, 1 host thread per GPU device is used by the CUDA driver. If STARPU_NWORKER_PER_CUDA > 1, this environment variable indicates how many host threads are used by the CUDA driver for one GPU device.

Asynchronous data movements involving cudaMemcpy*() functions always explicitly specify the stream they use for transfers. However, if asynchronous transfers fail, StarPU falls back to synchronous transfers over the default stream.

In StarPU's source file "src/drivers/cuda/starpu_cublas.c", you can have a look at the starpu_cublas_init() function, which basically executes a cublasSetStream() on each worker thread at initialization time. Perhaps the same approach would be fine for the gtensor library?
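For reference, the cuBLAS pattern being suggested amounts to this (handle creation elided):

  // run once on each StarPU worker thread at initialization time; all
  // subsequent cuBLAS calls on this worker then use the worker's stream
  cublasSetStream(handle, starpu_cuda_get_local_stream());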

carlosenlopez commented 3 years ago

You can access the StarPU source code here https://gitlab.inria.fr/starpu/starpu

bd4 commented 3 years ago

So if using one GPU device with STARPU_NWORKER_PER_CUDA=1, then cudaSetStream would be safe to use in gtensor. Otherwise I think it may not be safe, although I'm not sure what happens with cudaSetDevice when switching back and forth. I wonder how cudaSetDevice works when multiple threads are involved?

I think the suggestion is a good one - call cudaSetStream in the StarPU worker threads, and let gtensor use that as the default stream automagically. I need to better understand how CUDA works with this stuff, but I think that could work.

bd4 commented 3 years ago

Ok so there actually is no cudaSetStream, but we could create our own global stream in gtensor and pass it to all gpu calls. This is what we would have to do for SYCL anyway, after a fashion.

However, rather than introducing another global, we can just add an optional stream parameter to launch, assign, and copy, together with a stream wrapper class that deals with platform differences and has an implicit ctor from the platform stream object. This means that to use non-default streams, one must use the gt::assign interface instead of the nicer looking = syntax. This is the solution I prefer.

@carlosenlopez do you think this would work for you, at least initially to get started?

carlosenlopez commented 3 years ago

@bd4 I talked to Tilman about this and we think this would work.

bd4 commented 2 years ago

Initial implementation now in main branch, see test_stream.cxx. Note that for the interoperability use case, you would generally use gt::stream_view directly and not use gt::stream at all.
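A minimal sketch of the interoperability usage, assuming the optional-stream gt::assign interface discussed above (see test_stream.cxx for the authoritative examples):

  // wrap the externally owned StarPU stream; gtensor does not take ownership
  gt::stream_view sv{starpu_cuda_get_local_stream()};
  gt::assign(lhs, rhs, sv);  // assignment kernel launched on the StarPU stream
  cudaStreamSynchronize(starpu_cuda_get_local_stream());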