ysh329 / OpenCL-101

Learn OpenCL step by step.

[Competitive analysis] A brief look at the TensorFlow Lite GPU OpenCL work-group TuningType strategies #33

Open ysh329 opened 3 years ago

ysh329 commented 3 years ago

I recently noticed that TensorFlow Lite's GPU performance has improved considerably. It originally ran GPU compute through OpenGL, presumably for OpenGL's broader compatibility (across GPU generations and across old and new driver stacks). The later addition of OpenCL support is more likely about raw compute performance: the OpenCL support and performance of TFLite's competitors, such as MACE / Paddle-Mobile / MNN / TNN, are hard to ignore.

Any in-depth discussion of OpenCL eventually turns to GPU kernel tuning techniques and strategies. Reading through TensorFlow Lite's GPU code, I found the file tuning_parameters.h in its GPU/CL sources:

// tensorflow/tensorflow/lite/delegates/gpu/cl/kernels/tuning_parameters.h
// https://github.com/tensorflow/tensorflow/blob/465aeca04268f6e19d5f845610cc7ccaf03f5b8d/tensorflow/lite/delegates/gpu/cl/kernels/tuning_parameters.h
enum class TuningType { EXHAUSTIVE, FAST };

struct TuningParameters {
  ProfilingCommandQueue* queue;
  const DeviceInfo* info;
  TuningType tuning_type = TuningType::EXHAUSTIVE;
};

ProfilingCommandQueue is declared as class ProfilingCommandQueue : public CLCommandQueue; on top of the parent class CLCommandQueue it adds OpenCL kernel timing and best-work-group search (GetBestWorkGroupIndex), among other methods.

In addition, the following directories and files appear to be related to performance tuning (tuning/tune):

  1. device_info.cc: defines per-model GPU information with fairly fine-grained distinctions, used to guide later tuning:
    1. Qualcomm Adreno: models are grouped by generation (4xx, 6xx, and so on), and further distinguished by MaxWaveCounts, per-compute-unit RegisterMemSize, and WaveSize. MaxWaveCounts: fixed at 30 for Adreno 640, 16 for the other 6xx parts; RegisterMemSize per compute unit: 128*144*16 for Adreno 640, 128*96*16 for the other 6xx parts; WaveSize: unsupported below the 400 series, 64 or 32 below the 600 series depending on full_wave, and 128 or 64 on the remaining models, again depending on full_wave;
    2. ARM Mali: models are grouped into the T6xx, T7xx, and T8xx series, and by architecture: Midgard (T6xx, T7xx, T8xx), Valhall (G57, G77), Bifrost Gen1 (G31, G51, G71), Bifrost Gen2 (G52, G72), and Bifrost Gen3 (G76);
    3. Others: PowerVR, NVIDIA, and other GPUs are also covered, but with much less fine-grained distinctions than the above.
  2. work_group_picking.cc / work_group_picking.h: covered in detail below; these mainly implement the strategies behind the two tuning_type values, plus related helper functions;
  3. tuning_parameters.h / tuning_parameters.cc;
  4. inference_context.cc: InitFromGraphWithTransforms defines the pipeline that converts the original Graph into a Graph suitable for GPU execution, in roughly three stages:
    1. Device-related: obtain the context/device/queue/program cache, determine the GPU type (Mali/PowerVR), and decide whether to flush and at what interval (Mali needs a manually configured flush interval, PowerVR does not; note that clFlush dispatches all enqueued commands to the device but does not guarantee their completion);
    2. Model-related: convert the GPU Graph, allocate GPU resources such as memory, upload model weights to the GPU, and release the CPU-side model representation node by node;
    3. Tuning-related: initialize TuningParameters (profiling queue, device info, TuningType), then run GPUOperation::Tune node by node to set each work group.
  5. gpu_operation.h / gpu_operation.cc: implements GPUOperation::Tune, which covers the whole tuning flow. As of now each run settles on a single configuration per operation; there is no large-scale, batched work-group search for a given operation.

Let's start with the Tune method. On entry it first collects the kernel's candidate work groups into std::vector<int3> possible_work_groups; then, in the else branch at the end, it selects the best candidate (best_work_group_index) as the operation's final work group.

// delegates/gpu/cl/kernels/gpu_operation.cc
// https://github.com/tensorflow/tensorflow/blob/b14150088dac1924cf0482f6e456332b3e6211ff/tensorflow/lite/delegates/gpu/cl/kernels/gpu_operation.cc

absl::Status GPUOperation::Tune(const TuningParameters& params) {
  std::vector<int3> possible_work_groups;
  GetPossibleKernelWorkGroups(params.tuning_type, *params.info, kernel_.info_,
                              &possible_work_groups);
  if (possible_work_groups.empty()) {
    return absl::NotFoundError(
        "Can not found work_group size to launch kernel");
  }
  if (possible_work_groups.size() == 1) {
    work_group_size_ = possible_work_groups[0];
    return absl::OkStatus();
  } else {
    RETURN_IF_ERROR(args_.Bind(kernel_.kernel()));
    int best_work_group_index;
    RETURN_IF_ERROR(params.queue->GetBestWorkGroupIndex(
        kernel_, *params.info, grid_size_, possible_work_groups,
        &best_work_group_index));
    work_group_size_ = possible_work_groups[best_work_group_index];
    return absl::OkStatus();
  }
}

void GPUOperation::GetPossibleKernelWorkGroups(
    TuningType tuning_type, const DeviceInfo& device_info,
    const KernelInfo& kernel_info, std::vector<int3>* work_groups) const {
  GetPossibleWorkGroups(tuning_type, device_info, kernel_info, grid_size_,
                        work_groups);
}

The Tune flow is not optional: every GPU operation node goes through Tune, so GetPossibleKernelWorkGroups is always entered and yields one or more candidate work groups in std::vector<int3> possible_work_groups. With multiple candidates, the else branch of the if-else runs GetBestWorkGroupIndex to pick the best one; with exactly one candidate, it is returned directly; with none, the call fails with absl::NotFoundError("Can not found work_group size to launch kernel").

1. Finding candidate work groups

As we know, once an OpenCL kernel is fixed and cannot be modified, performance depends heavily on the work-group configuration strategy. GPUOperation::GetPossibleKernelWorkGroups encodes this carefully: by default it calls GetPossibleWorkGroups, which we will call the generic strategy, inherited by every GPU operation. Besides inheriting it, individual GPU operation subclasses may define their own strategies; one such method is GetPossibleWorkGroupsConv, and the callers of this Conv-suffixed strategy are:

  1. conv_buffer_1x1.cc
  2. convolution_transposed_3x3.cc
  3. convolution_transposed.cc
  4. conv_powervr.cc

So the callers of the Conv-suffixed variant are mainly conv_buffer_1x1 and the transposed convolutions, plus conv_powervr, which was written for PowerVR (although conv_general might be a better name, since other GPUs such as AMD, Intel, and Adreno also use it in some cases). Meanwhile, among the files in the directory whose names contain Conv, several do not call the Conv-suffixed work-group strategy, including depthwise_conv, conv_texture, conv_constant, and conv_3d. In other words, GetPossibleWorkGroupsConv is a targeted addition rather than a blanket rule for convolutions.

Let's now step into GetPossibleWorkGroupsConv and GetPossibleWorkGroups and see how their strategies break down.

// tensorflow/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc
// https://github.com/tensorflow/tensorflow/blob/ee2c2d17814c015477041dcafed0c9c7f1f00162/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc#L272

void GetPossibleWorkGroups(TuningType tuning_type,
                           const DeviceInfo& device_info,
                           const KernelInfo& kernel_info, const int3& grid,
                           std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST:
      work_groups->push_back(
          GetWorkGroup(grid, kernel_info.max_work_group_size));
      return;
    case TuningType::EXHAUSTIVE: {
      GetWorkGroupsAlignedToGrid(device_info, kernel_info, grid, work_groups);
      return;
    }
    default:
      work_groups->push_back({8, 4, 1});
      return;
  }
}

void GetPossibleWorkGroupsConv(TuningType tuning_type,
                               const DeviceInfo& device_info,
                               const KernelInfo& kernel_info, const int3& grid,
                               std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST: {
      int max_z_size = 16;
      if (device_info.IsAdreno()) {
        max_z_size = device_info.IsAdreno3xx() ? 16 : 64;
      }
      max_z_size = std::min(max_z_size, device_info.max_work_group_size_z);
      work_groups->push_back(
          GetWorkGroupConv(grid, kernel_info.max_work_group_size, max_z_size));
      return;
    }
    case TuningType::EXHAUSTIVE: {
      GetWorkGroupsAlignedToGrid(device_info, kernel_info, grid, work_groups);
      return;
    }
    default:
      work_groups->push_back({8, 4, 1});
      return;
  }
}

Looking inside GetPossibleWorkGroupsConv and GetPossibleWorkGroups (code above), the candidate-search cases and strategies split as follows:

  1. Case 1: generic work-group strategy (GetPossibleWorkGroups)
    1. TuningType::FAST -> GetWorkGroup
    2. TuningType::EXHAUSTIVE -> GetWorkGroupsAlignedToGrid
    3. default: <8,4,1>
  2. Case 2: work-group strategy for some Conv ops (GetPossibleWorkGroupsConv)
    1. TuningType::FAST -> GetWorkGroupConv (the only difference between case 1 and case 2)
    2. TuningType::EXHAUSTIVE -> GetWorkGroupsAlignedToGrid
    3. default: <8,4,1>

In summary, setting aside the default case, there are three tuning strategies: generic FAST, generic EXHAUSTIVE, and the non-generic (Conv) FAST. We will analyze each implementation in turn.

ysh329 commented 3 years ago

2.1 Strategy one: TuningType::FAST (generic)

The generic FAST strategy is fairly simple. It fixes the work group dimension by dimension, in z, x, y order:

  1. First, find the local work size in z: the largest divisor of grid.z (the global work size in z) up to max_divisor (default 8). It tries 8 first; if grid.z is divisible by 8, z's local size is 8, otherwise it tries 4, then 2; if none of those divide grid.z, it counts down from max_divisor looking for a divisor;
  2. With z fixed, the upper bound on the product of the x and y local sizes is determined: kernel_info.max_work_group_size divided by the just-computed z local size. Every CLKernel object holds a struct KernelInfo containing private_memory_size and max_work_group_size; when CLKernel::CreateFromProgram runs, these are obtained by querying OpenCL's built-in CL_KERNEL_PRIVATE_MEM_SIZE and CL_KERNEL_WORK_GROUP_SIZE, giving each work item's private memory and the maximum work-group size the device supports for this kernel;
  3. Fix x's local work size: take grid.x divided by 2, rounded up, compare it with wg_xy_size, and keep the smaller;
  4. Fix y's local work size: take wg_xy_size divided by wg_x, compare it with grid.y, and keep the smaller.

int3 GetWorkGroup(const int3& grid, int max_size) {
  int wg_z = GetBiggestDividerWithPriority(grid.z, 8);
  int wg_xy_size = max_size / wg_z;
  int wg_x = std::min(DivideRoundUp(grid.x, 2), wg_xy_size);
  int wg_y = std::min(wg_xy_size / wg_x, grid.y);
  return int3(wg_x, wg_y, wg_z);
}

int GetBiggestDividerWithPriority(int number, int max_divider) {
  if (number % 8 == 0 && 8 <= max_divider) {
    return 8;
  }
  if (number % 4 == 0 && 4 <= max_divider) {
    return 4;
  }
  if (number % 2 == 0 && 2 <= max_divider) {
    return 2;
  }
  for (int i = max_divider; i != 0; i--) {
    if (number % i == 0) {
      return i;
    }
  }
  return 1;
}

// @param n must be non negative
// @param divisor must be greater than zero
template <typename T, typename N>
T DivideRoundUp(T n, N divisor) {
  const T div = static_cast<T>(divisor);
  const T q = n / div;
  return n % div == 0 ? q : q + 1;
}
ysh329 commented 3 years ago

2.2 Strategy two: TuningType::EXHAUSTIVE

This TuningType has no Conv-specific variant. EXHAUSTIVE means just what it says: enumerate every plausible candidate and pick the best, which is considerably more involved than FAST.

void GetWorkGroupsAlignedToGrid(const DeviceInfo& device_info,
                                const KernelInfo& kernel_info, const int3& grid,
                                std::vector<int3>* work_groups) {
  int3 max_wg_size;
  max_wg_size.x = device_info.max_work_group_size_x;
  max_wg_size.y = device_info.max_work_group_size_y;
  max_wg_size.z = device_info.max_work_group_size_z;
  GenerateWorkGroupSizesAlignedToGrid(
      grid, max_wg_size, kernel_info.max_work_group_size, work_groups);
}

First, the device's maximum work-group size in each of the three dimensions is copied from device_info into the components (x, y, z) of max_wg_size, which the subsequent candidate-generating GenerateWorkGroupSizesAlignedToGrid call uses. Note the distinction: the kernel info's max_work_group_size is a single number, the maximum total work-group size for this kernel, while the device info's maximum work-group sizes are per-dimension limits, normally three numbers.

KernelInfo and DeviceInfo

KernelInfo

As the name suggests, KernelInfo describes a logical CL kernel: every class CLKernel instance holds a public struct KernelInfo, defined as:

// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_kernel.h
struct KernelInfo {
  int private_memory_size;
  int max_work_group_size;
};

/*
KernelInfo has two members:
private_memory_size, each work item's private memory,
and max_work_group_size, the kernel's maximum work-group size.
Both are used later during best-work-group selection and other
fine-grained (per-GPU-model) tuning.

They are initialized in CLKernel::CreateFromProgram in cl_kernel.cc,
which creates the CL kernel object from a previously built CL program
object and then fills in private_memory_size and max_work_group_size.

Both come from the clGetKernelWorkGroupInfo OpenCL API; for example,
max_work_group_size is obtained by passing CL_KERNEL_WORK_GROUP_SIZE,
which returns a single size_t: the maximum total work-group size for
this kernel on this device.
*/

// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_kernel.cc
// https://github.com/tensorflow/tensorflow/blob/0a7a6220981cedb1cdaf858a563e73aeae90543b/tensorflow/lite/delegates/gpu/cl/cl_kernel.cc#L104-L124
absl::Status CLKernel::CreateFromProgram(const CLProgram& program,
                                         const std::string& function_name) {
  int error_code;
  function_name_ = function_name;
  kernel_ =
      clCreateKernel(program.program(), function_name.c_str(), &error_code);
  if (!kernel_ || error_code != CL_SUCCESS) {
    kernel_ = nullptr;
    return absl::UnknownError(absl::StrCat("Failed to create ", function_name,
                                           CLErrorCodeToString(error_code)));
  }

  program_ = program.program();
  clRetainProgram(program_);

  RETURN_IF_ERROR(GetKernelPrivateMemorySize(kernel_, program.GetDeviceId(),
                                             &info_.private_memory_size));
  RETURN_IF_ERROR(GetKernelMaxWorkGroupSize(kernel_, program.GetDeviceId(),
                                            &info_.max_work_group_size));
  return absl::OkStatus();
}

absl::Status GetKernelMaxWorkGroupSize(cl_kernel kernel, cl_device_id device_id,
                                       int* result) {
  size_t max_work_group_size;
  cl_int error_code =
      clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(size_t), &max_work_group_size, nullptr);
  if (error_code != CL_SUCCESS) {
    return absl::UnknownError(
        absl::StrCat("Failed to get info CL_KERNEL_WORK_GROUP_SIZE ",
                     CLErrorCodeToString(error_code)));
  }
  *result = static_cast<int>(max_work_group_size);
  return absl::OkStatus();
}

DeviceInfo

cl_device.cc provides a DeviceInfo DeviceInfoFromDeviceID(cl_device_id id) function:

// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_device.cc
// https://github.com/tensorflow/tensorflow/blob/b14150088dac1924cf0482f6e456332b3e6211ff/tensorflow/lite/delegates/gpu/cl/cl_device.cc#L242-L246
DeviceInfo DeviceInfoFromDeviceID(cl_device_id id) {
  DeviceInfo info;
  // ignored ...
  int3 max_work_group_sizes;
  GetDeviceWorkDimsSizes(id, &max_work_group_sizes);
  info.max_work_group_size_x = max_work_group_sizes.x;
  info.max_work_group_size_y = max_work_group_sizes.y;
  info.max_work_group_size_z = max_work_group_sizes.z;
  // ignored ...
  return info;
}

void GetDeviceWorkDimsSizes(cl_device_id id, int3* result) {
  int dims_count =
      GetDeviceInfo<cl_uint>(id, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS);
  if (dims_count < 3) {
    return;
  }
  std::vector<size_t> limits(dims_count);
  cl_int error =
      clGetDeviceInfo(id, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                      sizeof(size_t) * dims_count, limits.data(), nullptr);
  if (error != CL_SUCCESS) {
    return;
  }
  // dims_count must be at least 3 according to spec
  result->x = limits[0];
  result->y = limits[1];
  result->z = limits[2];
}
ysh329 commented 3 years ago

2.2 Back to the main flow (EXHAUSTIVE, continued)


// tensorflow/tensorflow/lite/delegates/gpu/common/workgroup_selection.h
// https://github.com/tensorflow/tensorflow/blob/b5d2374f5e21ff0aa44ac26b039336d7443d08e3/tensorflow/lite/delegates/gpu/common/workgroup_selection.h#L28

// PRECISE assume that WorkGroupSize * k = GridSize;
// NO_ALIGNMENT no restrictions;
// We need PRECISE when we don't have check in kernel for boundaries
// If we have the check, we can use PRECISE or NO_ALIGNMENT as well.
enum class WorkGroupSizeAlignment { PRECISE, NO_ALIGNMENT };
// tensorflow/tensorflow/lite/delegates/gpu/common/workgroup_selection.cc
template <typename T>
void GenerateWorkGroupSizesAlignedToGrid(const T& grid,
                                         const T& max_work_group_size,
                                         const int max_work_group_invocations,
                                         std::vector<T>* work_groups) {
  auto alignment = WorkGroupSizeAlignment::PRECISE;
  *work_groups = GenerateWorkGroupSizes<T>(
      grid, /*min_work_group_total_size = */ 32, max_work_group_invocations,
      max_work_group_size, alignment, alignment, alignment);
  // If the grid parameter too small, method below cannot generate workgroups.
  if (work_groups->empty()) {
    AddCornerCases(grid, max_work_group_invocations, max_work_group_size,
                   alignment, alignment, alignment, work_groups);
  }
}

template <typename T>
std::vector<T> GenerateWorkGroupSizes(
    const T& grid, int min_work_group_total_size, int max_work_group_total_size,
    const T& max_work_group_sizes, WorkGroupSizeAlignment x_alignment,
    WorkGroupSizeAlignment y_alignment, WorkGroupSizeAlignment z_alignment) {
  std::vector<T> work_groups;
  work_groups.reserve(64);

  std::vector<int> sizes_x = GetPossibleSizes(grid.x, x_alignment);
  std::vector<int> sizes_y = GetPossibleSizes(grid.y, y_alignment);
  std::vector<int> sizes_z = GetPossibleSizes(grid.z, z_alignment);

  for (auto x : sizes_x) {
    if (x > max_work_group_sizes.x) continue;
    for (auto y : sizes_y) {
      if (y > max_work_group_sizes.y) continue;
      for (auto z : sizes_z) {
        if (z > max_work_group_sizes.z) continue;
        const int work_group_size = x * y * z;
        if (work_group_size < min_work_group_total_size ||
            work_group_size > max_work_group_total_size)
          continue;
        work_groups.push_back({x, y, z});
      }
    }
  }

  return work_groups;
}

std::vector<int> GetPossibleSizes(int number,
                                  WorkGroupSizeAlignment z_alignment) {
  if (z_alignment == WorkGroupSizeAlignment::PRECISE) {
    // we will use for potential sizes, sizes that cover grid precisely
    // work group size * k (k is integer) == grid_size
    return GetDivisors(number);
  } else {
    // when we chose work group size we can use work group size that
    //   work group size * k (k is integer) != grid_size (slightly bigger)
    // so in this heuristic we trying to find potential size, that satisfies
    //   to this : work group size * k (k is integer) <= grid_size + 5
    //   and this : work group size * k (k is integer) >= grid_size
    return GetDivisorsForRange(number, 5);
  }
}

std::vector<int> GetDivisors(int number) {
  const int max_divisor = static_cast<int>(std::sqrt(number));
  std::vector<int> divisors;
  // we don't know the number of dividers, so it is just heuristic.
  divisors.reserve(max_divisor / 3 + 1);
  for (int i = 1; i <= max_divisor; ++i) {
    const int d = number / i;
    if (i * d == number) {
      divisors.push_back(i);
      if (d != i) {
        divisors.push_back(d);
      }
    }
  }
  return divisors;
}

std::vector<int> GetDivisorsForRange(int number, int range) {
  const int last_number = number + range;
  const int max_divisor = static_cast<int>(std::sqrt(last_number));
  std::set<int> divisors;
  for (int i = 1; i <= max_divisor; ++i) {
    const int reminder = number % i;
    // iterate through numbers that divisible by i in our range;
    const int first_number = number + (i - reminder) % i;
    if (first_number <= last_number) {
      divisors.insert(i);
    }
    for (int j = first_number; j <= last_number; j += i) {
      const int d = j / i;
      if (d != i) {
        divisors.insert(d);
      }
    }
  }
  return std::vector<int>(divisors.begin(), divisors.end());
}
ysh329 commented 3 years ago
// tensorflow/tensorflow/lite/delegates/gpu/common/workgroup_selection.cc
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/common/workgroup_selection.cc#L30-L87
template <typename T>
void AddCornerCases(const T& grid, int max_work_group_total_size,
                    const T& max_work_group_sizes,
                    WorkGroupSizeAlignment x_alignment,
                    WorkGroupSizeAlignment y_alignment,
                    WorkGroupSizeAlignment z_alignment,
                    std::vector<T>* work_groups) {
  for (int x = 1; x <= 4; ++x) {
    for (int y = 1; y <= 4; ++y) {
      for (int z = 1; z <= 4; ++z) {
        int wg_x = DivideRoundUp(grid.x, x);
        int wg_y = DivideRoundUp(grid.y, y);
        int wg_z = DivideRoundUp(grid.z, z);
        if (wg_x > max_work_group_sizes.x || wg_y > max_work_group_sizes.y ||
            wg_z > max_work_group_sizes.z ||
            wg_x * wg_y * wg_z > max_work_group_total_size) {
          continue;
        }
        if (x_alignment == WorkGroupSizeAlignment::PRECISE &&
            grid.x % wg_x != 0) {
          continue;
        }
        if (y_alignment == WorkGroupSizeAlignment::PRECISE &&
            grid.y % wg_y != 0) {
          continue;
        }
        if (z_alignment == WorkGroupSizeAlignment::PRECISE &&
            grid.z % wg_z != 0) {
          continue;
        }
        work_groups->push_back({wg_x, wg_y, wg_z});
      }
    }
  }

  // this will add at least {1, 1, 1} always.
  for (int x = 1; x <= 4; ++x) {
    for (int y = 1; y <= 4; ++y) {
      for (int z = 1; z <= 4; ++z) {
        if (x > max_work_group_sizes.x || y > max_work_group_sizes.y ||
            z > max_work_group_sizes.z ||
            x * y * z > max_work_group_total_size) {
          continue;
        }
        if (x_alignment == WorkGroupSizeAlignment::PRECISE && grid.x % x != 0) {
          continue;
        }
        if (y_alignment == WorkGroupSizeAlignment::PRECISE && grid.y % y != 0) {
          continue;
        }
        if (z_alignment == WorkGroupSizeAlignment::PRECISE && grid.z % z != 0) {
          continue;
        }
        work_groups->push_back({x, y, z});
      }
    }
  }
}
ysh329 commented 3 years ago

2.3 Strategy three: TuningType::FAST -> GetWorkGroupConv

void GetPossibleWorkGroupsConv(TuningType tuning_type,
                               const DeviceInfo& device_info,
                               const KernelInfo& kernel_info, const int3& grid,
                               std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST: {
      int max_z_size = 16;
      if (device_info.IsAdreno()) {
        max_z_size = device_info.IsAdreno3xx() ? 16 : 64;
      }
      max_z_size = std::min(max_z_size, device_info.max_work_group_size_z);
      work_groups->push_back(
          GetWorkGroupConv(grid, kernel_info.max_work_group_size, max_z_size));
      return;
    }
  case TuningType::EXHAUSTIVE: {
    // ignored
    }
  }
}
// tensorflow/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc#L206-L215
int3 GetWorkGroupConv(const int3& grid, int max_size, int max_z_size) {
  int wg_z = GetBiggestDivider(grid.z, max_z_size);
  int wg_xy_size = std::min(256, max_size) / wg_z;
  int wg_x = std::min(grid.x, wg_xy_size);
  int wg_y = std::min(wg_xy_size / wg_x, grid.y);
  if (wg_y == grid.y && grid.y % 2 == 0) {
    wg_y = grid.y / 2;
  }
  return int3(wg_x, wg_y, wg_z);
}

int GetBiggestDivider(int number, int max_divider) {
  for (int i = max_divider; i != 0; i--) {
    if (number % i == 0) {
      return i;
    }
  }
  return 1;
}
ysh329 commented 3 years ago

3. Selecting the best candidate work group

When several candidates survive, ProfilingCommandQueue::GetBestWorkGroupIndex dispatches the kernel once per candidate, times each launch with OpenCL profiling events, and returns the index of the fastest run, with workarounds for Adreno 3xx event bugs and Mali memory behavior:

// delegates/gpu/cl/kernels/gpu_operation.cc
// https://github.com/tensorflow/tensorflow/blob/b14150088dac1924cf0482f6e456332b3e6211ff/tensorflow/lite/delegates/gpu/cl/kernels/gpu_operation.cc

absl::Status GPUOperation::Tune(const TuningParameters& params) {
  std::vector<int3> possible_work_groups;
  GetPossibleKernelWorkGroups(params.tuning_type, *params.info, kernel_.info_,
                              &possible_work_groups);
  if (possible_work_groups.empty()) {
    return absl::NotFoundError(
        "Can not found work_group size to launch kernel");
  }
  if (possible_work_groups.size() == 1) {
    work_group_size_ = possible_work_groups[0];
    return absl::OkStatus();
  } else {
    RETURN_IF_ERROR(args_.Bind(kernel_.kernel()));
    int best_work_group_index;
    RETURN_IF_ERROR(params.queue->GetBestWorkGroupIndex(
        kernel_, *params.info, grid_size_, possible_work_groups,
        &best_work_group_index));
    work_group_size_ = possible_work_groups[best_work_group_index];
    return absl::OkStatus();
  }
}
// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_command_queue.cc
// https://github.com/tensorflow/tensorflow/blob/1da2ac286f24bc04cef9a24889c24019924691af/tensorflow/lite/delegates/gpu/cl/cl_command_queue.cc#L218-L278
absl::Status ProfilingCommandQueue::GetBestWorkGroupIndex(
    const CLKernel& kernel, const DeviceInfo& device_info,
    const std::vector<int3>& work_groups_count,
    const std::vector<int3>& work_group_sizes, int* index) {
  // Some Adreno 3xx can have wrong numbers for some events
  const bool possible_bug_with_events = device_info.IsAdreno3xx();
  events_.resize(work_group_sizes.size());
  for (int i = 0; i < work_group_sizes.size(); ++i) {
    RETURN_IF_ERROR(CLCommandQueue::Dispatch(kernel, work_groups_count[i],
                                             work_group_sizes[i], &events_[i]));

    // reducing the speed of memory leak on Mali for some kernels
    if (device_info.IsMali() && i % 8 == 7) {
      events_[i - 7].Wait();
    }
    if (possible_bug_with_events) {
      // We are trying to increase probability for correct result.
      RETURN_IF_ERROR(WaitForCompletion());
    }
  }

  RETURN_IF_ERROR(WaitForCompletion());

  // To release memory of some kernel pool on Mali.
  if (device_info.IsMali()) {
    RETURN_IF_ERROR(kernel.ReInit());
  }

  int minimum_index = 0;
  double minimum_time = std::numeric_limits<double>::max();
  if (possible_bug_with_events) {  // we will try to cut out suspicious results
    double average_time = 0.0;
    int average_samples_count = 0;
    for (int i = 0; i < work_group_sizes.size(); ++i) {
      if (events_[i].GetEventTimeMs() < 100 * 1000) {  // 100 sec
        average_time += events_[i].GetEventTimeMs();
        average_samples_count++;
      }
    }
    average_time /= average_samples_count;
    for (int i = 0; i < work_group_sizes.size(); ++i) {
      double time = events_[i].GetEventTimeMs();
      if (time < minimum_time && time >= 0.1 * average_time) {
        minimum_index = i;
        minimum_time = time;
      }
    }
  } else {
    for (int i = 0; i < work_group_sizes.size(); ++i) {
      double time = events_[i].GetEventTimeMs();
      if (time < minimum_time) {
        minimum_index = i;
        minimum_time = time;
      }
    }
  }

  *index = minimum_index;

  return absl::OkStatus();
}