class CFuncDef:
    [...]
    def __call__(self, arg_datas, arg_types, dev_id):
        if dev_id is None:
            ctx = 'cpu'
        else:
            set_device(dev_id)
            ctx = gpu_ctx_name
        # function loader
        func = self.loader(self, arg_types, ctx, **self.loader_kwargs)
        return func(*arg_datas)
The Python interpreter calls the set_device function, which invokes cudaSetDevice in C, and then calls the C kernel function. However, the deep learning framework may change the current device between these two calls, even though cudaSetDevice itself is thread-safe.
So I merge cudaSetDevice and the kernel call into a single C function by adding two macros, namely KERNEL_RUN_BEGIN and KERNEL_RUN_END. The argument list of the kernel wrapper function becomes MOBULA_DLL void function_name(const int device_id, xxx).
(The snippet at the top is the old code in mobula/func.py.)