It has been noticeable lately that TensorFlow Lite's GPU performance has improved quite a bit. GPU support initially went through OpenGL, presumably for its broader compatibility (different GPU generations, older and newer library versions); the later OpenCL backend is much more about compute performance — and the OpenCL support and performance of TFLite's competitors, such as MACE / Paddle-Mobile / MNN / TNN, is indeed hard to ignore.

Any deeper discussion of OpenCL sooner or later reaches GPU kernel tuning, in both technique and strategy. Reading TFLite's GPU code, its GPU/CL part contains the file tuning_parameters.h, which defines the tuning-related parameters, among them the two TuningType values (FAST and EXHAUSTIVE) discussed below. ProfilingCommandQueue there is class ProfilingCommandQueue : public CLCommandQueue; on top of the parent CLCommandQueue it adds OpenCL kernel timing and best-work-group selection (GetBestWorkGroupIndex). Other directories and files related to performance tuning (tuning/tune):

- device_info.cc: records detailed information per GPU model, with fine-grained distinctions between hardware to guide later tuning. For Adreno, for example, the register memory size per compute unit (128*144*16 for the Adreno 640, 128*96*16 for other 6xx-series models) and the wave size (unsupported below the 4xx series; below the 6xx series it is 64 or 32 depending on full_wave; on other models it is likewise 128 or 64);
- work_group_picking.cc, work_group_picking.h: detailed below; they implement the strategies for the two tuning_types above, plus the related helper functions;
- tuning_parameters.h, tuning_parameters.cc;
- inference_context.cc: InitFromGraphWithTransforms defines the flow that converts the original graph into a graph suitable for GPU execution, in roughly three main steps, among them GPUOperation::Tune setting the work group;
- gpu_operation.h, gpu_operation.cc: implement GPUOperation::Tune, covering the whole tuning flow. From the current code, each run settles on one work-group configuration; large-scale, batched work-group tuning for a given operation is not supported yet.

Let's start with the work-group strategies themselves.
The code for the generic FAST strategy is fairly simple. As it shows, the work group is set dimension by dimension, in z, x, y order:

- z: GetBiggestDividerWithPriority(grid.z, 8) looks for a divisor of grid.z starting from the default of 8. If grid.z is divisible by 8, the local work size in z is 8; otherwise 4 and then 2 are tried; if none of those fits, it counts down from the passed-in max_divider as an upper bound and returns the first divisor found.
- x and y: kernel_info.max_work_group_size is divided by the z size just obtained, yielding the budget wg_xy_size for the x-y plane; from it, wg_x = min(DivideRoundUp(grid.x, 2), wg_xy_size) and wg_y = min(wg_xy_size / wg_x, grid.y). (kernel_info comes from the struct KernelInfo that every CLKernel object holds; both of its fields are filled in by CLKernel::CreateFromProgram via the built-in OpenCL queries CL_KERNEL_PRIVATE_MEM_SIZE and CL_KERNEL_WORK_GROUP_SIZE — each work item's private memory and the maximum work-group size the device supports for this kernel. More on this below.)

int3 GetWorkGroup(const int3& grid, int max_size) {
  int wg_z = GetBiggestDividerWithPriority(grid.z, 8);
  int wg_xy_size = max_size / wg_z;
  int wg_x = std::min(DivideRoundUp(grid.x, 2), wg_xy_size);
  int wg_y = std::min(wg_xy_size / wg_x, grid.y);
  return int3(wg_x, wg_y, wg_z);
}
int GetBiggestDividerWithPriority(int number, int max_divider) {
  if (number % 8 == 0 && 8 <= max_divider) {
    return 8;
  }
  if (number % 4 == 0 && 4 <= max_divider) {
    return 4;
  }
  if (number % 2 == 0 && 2 <= max_divider) {
    return 2;
  }
  for (int i = max_divider; i != 0; i--) {
    if (number % i == 0) {
      return i;
    }
  }
  return 1;
}
// @param n must be non negative
// @param divisor must be greater than zero
template <typename T, typename N>
T DivideRoundUp(T n, N divisor) {
  const T div = static_cast<T>(divisor);
  const T q = n / div;
  return n % div == 0 ? q : q + 1;
}
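To make the three helpers concrete, here is a hand-worked trace with illustrative numbers (not taken from any real model):

// Illustrative trace: grid = (13, 9, 24), kernel max_size = 256
//   wg_z       = GetBiggestDividerWithPriority(24, 8) = 8   (24 % 8 == 0)
//   wg_xy_size = 256 / 8                              = 32
//   wg_x       = min(DivideRoundUp(13, 2), 32)        = min(7, 32) = 7
//   wg_y       = min(32 / 7, 9)                       = min(4, 9)  = 4
// -> work group (7, 4, 8), with 7 * 4 * 8 = 224 <= 256.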
This TuningType — EXHAUSTIVE — has no Conv-specific variant. True to its name, it enumerates all the possibilities and searches for the best one, which is considerably more involved than FAST.
void GetWorkGroupsAlignedToGrid(const DeviceInfo& device_info,
                                const KernelInfo& kernel_info, const int3& grid,
                                std::vector<int3>* work_groups) {
  int3 max_wg_size;
  max_wg_size.x = device_info.max_work_group_size_x;
  max_wg_size.y = device_info.max_work_group_size_y;
  max_wg_size.z = device_info.max_work_group_size_z;
  GenerateWorkGroupSizesAlignedToGrid(
      grid, max_wg_size, kernel_info.max_work_group_size, work_groups);
}
First, the current device's maximum supported work-group size in each of the three dimensions, taken from device_info, is assigned to the three components (x, y, z) of max_wg_size, which is used later in the work-group-generating method GenerateWorkGroupSizesAlignedToGrid. Note the distinction: kernel_info's max_work_group_size is the maximum total size — a single number — while device_info's max work-group sizes are per-dimension limits, normally three numbers.
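For illustration, hypothetical values showing the shape of the two limits:

// Hypothetical values, for illustration only:
//   kernel_info.max_work_group_size        -> 512            (one scalar: total work items per group)
//   device_info.max_work_group_size_x/y/z  -> 1024, 1024, 64 (per-dimension caps)
// A candidate like (8, 8, 8) passes both checks: each component is within its
// per-dimension cap, and 8 * 8 * 8 = 512 <= 512.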
As the name suggests, this KernelInfo relates to the logical CL kernel: every class CLKernel instance holds a public struct KernelInfo, defined as follows:
// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_kernel.h
struct KernelInfo {
  int private_memory_size;
  int max_work_group_size;
};

KernelInfo has two members: private_memory_size, each work item's private memory, and max_work_group_size, the maximum work-group size for this kernel. Both come into play later when picking the best work group during fine-grained (per-GPU-model) tuning. They are initialized in CLKernel::CreateFromProgram in cl_kernel.cc: after the CL kernel object is created from the previously built CL program, both values are fetched. Each is obtained through the clGetKernelWorkGroupInfo OpenCL API; the latter, for example, passes the CL_KERNEL_WORK_GROUP_SIZE query and receives a single size_t — the maximum total work-group size, not three per-dimension values.
// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_kernel.cc
// https://github.com/tensorflow/tensorflow/blob/0a7a6220981cedb1cdaf858a563e73aeae90543b/tensorflow/lite/delegates/gpu/cl/cl_kernel.cc#L104-L124
absl::Status CLKernel::CreateFromProgram(const CLProgram& program,
                                         const std::string& function_name) {
  int error_code;
  function_name_ = function_name;
  kernel_ =
      clCreateKernel(program.program(), function_name.c_str(), &error_code);
  if (!kernel_ || error_code != CL_SUCCESS) {
    kernel_ = nullptr;
    return absl::UnknownError(absl::StrCat("Failed to create ", function_name,
                                           CLErrorCodeToString(error_code)));
  }
  program_ = program.program();
  clRetainProgram(program_);
  RETURN_IF_ERROR(GetKernelPrivateMemorySize(kernel_, program.GetDeviceId(),
                                             &info_.private_memory_size));
  RETURN_IF_ERROR(GetKernelMaxWorkGroupSize(kernel_, program.GetDeviceId(),
                                            &info_.max_work_group_size));
  return absl::OkStatus();
}
absl::Status GetKernelMaxWorkGroupSize(cl_kernel kernel, cl_device_id device_id,
                                       int* result) {
  size_t max_work_group_size;
  cl_int error_code =
      clGetKernelWorkGroupInfo(kernel, device_id, CL_KERNEL_WORK_GROUP_SIZE,
                               sizeof(size_t), &max_work_group_size, nullptr);
  if (error_code != CL_SUCCESS) {
    return absl::UnknownError(
        absl::StrCat("Failed to get info CL_KERNEL_WORK_GROUP_SIZE ",
                     CLErrorCodeToString(error_code)));
  }
  *result = static_cast<int>(max_work_group_size);
  return absl::OkStatus();
}
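The sibling GetKernelPrivateMemorySize follows the same pattern. A minimal sketch (the actual implementation in cl_kernel.cc may differ in details) — note that CL_KERNEL_PRIVATE_MEM_SIZE returns a cl_ulong rather than a size_t:

absl::Status GetKernelPrivateMemorySize(cl_kernel kernel,
                                        cl_device_id device_id, int* result) {
  cl_ulong private_mem_size;  // CL_KERNEL_PRIVATE_MEM_SIZE is a cl_ulong
  cl_int error_code = clGetKernelWorkGroupInfo(
      kernel, device_id, CL_KERNEL_PRIVATE_MEM_SIZE, sizeof(cl_ulong),
      &private_mem_size, nullptr);
  if (error_code != CL_SUCCESS) {
    return absl::UnknownError(
        absl::StrCat("Failed to get info CL_KERNEL_PRIVATE_MEM_SIZE ",
                     CLErrorCodeToString(error_code)));
  }
  *result = static_cast<int>(private_mem_size);
  return absl::OkStatus();
}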
On the device side, cl_device.cc has a DeviceInfo DeviceInfoFromDeviceID(cl_device_id id) method that, among other things, fills in the per-dimension limits:
// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_device.cc
// https://github.com/tensorflow/tensorflow/blob/b14150088dac1924cf0482f6e456332b3e6211ff/tensorflow/lite/delegates/gpu/cl/cl_device.cc#L242-L246
DeviceInfo DeviceInfoFromDeviceID(cl_device_id id) {
  DeviceInfo info;
  // ignored ...
  int3 max_work_group_sizes;
  GetDeviceWorkDimsSizes(id, &max_work_group_sizes);
  info.max_work_group_size_x = max_work_group_sizes.x;
  info.max_work_group_size_y = max_work_group_sizes.y;
  info.max_work_group_size_z = max_work_group_sizes.z;
  // ignored ...
  return info;
}

void GetDeviceWorkDimsSizes(cl_device_id id, int3* result) {
  int dims_count =
      GetDeviceInfo<cl_uint>(id, CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS);
  if (dims_count < 3) {
    return;
  }
  std::vector<size_t> limits(dims_count);
  cl_int error =
      clGetDeviceInfo(id, CL_DEVICE_MAX_WORK_ITEM_SIZES,
                      sizeof(size_t) * dims_count, limits.data(), nullptr);
  if (error != CL_SUCCESS) {
    return;
  }
  // dims_count must be at least 3 according to spec
  result->x = limits[0];
  result->y = limits[1];
  result->z = limits[2];
}
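GetDeviceInfo<T> above is a small helper in cl_device.cc wrapping clGetDeviceInfo; a sketch of the idea (the real helper may differ slightly, e.g. in its failure value):

template <typename T>
T GetDeviceInfo(cl_device_id id, cl_device_info info) {
  T result;
  cl_int error = clGetDeviceInfo(id, info, sizeof(T), &result, nullptr);
  if (error != CL_SUCCESS) {
    return -1;  // sentinel on failure (assumed)
  }
  return result;
}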
Now to where these pieces are driven from: GPUOperation::Tune. On entry, Tune first collects the kernel work groups — note it is a candidate list, std::vector<int3> possible_work_groups — and then, in the else branch of the final if-else, picks the best_work_group_index among them as the current operation's final work_group.
// delegates/gpu/cl/kernels/gpu_operation.cc
// https://github.com/tensorflow/tensorflow/blob/b14150088dac1924cf0482f6e456332b3e6211ff/tensorflow/lite/delegates/gpu/cl/kernels/gpu_operation.cc
absl::Status GPUOperation::Tune(const TuningParameters& params) {
  std::vector<int3> possible_work_groups;
  GetPossibleKernelWorkGroups(params.tuning_type, *params.info, kernel_.info_,
                              &possible_work_groups);
  if (possible_work_groups.empty()) {
    return absl::NotFoundError(
        "Can not found work_group size to launch kernel");
  }
  if (possible_work_groups.size() == 1) {
    work_group_size_ = possible_work_groups[0];
    return absl::OkStatus();
  } else {
    RETURN_IF_ERROR(args_.Bind(kernel_.kernel()));
    int best_work_group_index;
    RETURN_IF_ERROR(params.queue->GetBestWorkGroupIndex(
        kernel_, *params.info, grid_size_, possible_work_groups,
        &best_work_group_index));
    work_group_size_ = possible_work_groups[best_work_group_index];
    return absl::OkStatus();
  }
}

void GPUOperation::GetPossibleKernelWorkGroups(
    TuningType tuning_type, const DeviceInfo& device_info,
    const KernelInfo& kernel_info, std::vector<int3>* work_groups) const {
  GetPossibleWorkGroups(tuning_type, device_info, kernel_info, grid_size_,
                        work_groups);
}
The Tune flow is not optional: every GPU operation node goes through Tune, so GetPossibleKernelWorkGroups is always entered and yields one or more candidate work groups in std::vector<int3> possible_work_groups. With multiple candidates, the best one is chosen — the else branch of the if-else runs GetBestWorkGroupIndex; with exactly one candidate, Tune returns immediately; with none, it fails with absl::NotFoundError("Can not found work_group size to launch kernel").
1. Finding candidate work groups

We know that for a fixed, unmodifiable OpenCL kernel, performance depends heavily on the work-group setting strategy. GPUOperation::GetPossibleKernelWorkGroups encodes this strategy in detail: it calls GetPossibleWorkGroups — call it the generic strategy — which every GPU operation inherits as the default. Beyond inheriting it, individual GPU operation subclasses may also define custom strategies; currently there is a method named GetPossibleWorkGroupsConv. The callers of this Conv-suffixed strategy are mainly conv_buffer_1x1, conv_transpose, and conv_powervr (a name like conv_general might actually fit better, since GPUs of other architectures such as AMD, Intel, and Adreno also use it in some cases). Judging by the files in the directory whose names contain Conv, several convolution kernels do not use the Conv-suffixed work-group strategy, including depthwise_conv, conv_texture, conv_constant, and conv_3d. In other words, GetPossibleWorkGroupsConv is a targeted addition.
Next, let's dig into GetPossibleWorkGroupsConv and GetPossibleWorkGroups to see how their strategies play out and how they can be further categorized.
// tensorflow/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc
// https://github.com/tensorflow/tensorflow/blob/ee2c2d17814c015477041dcafed0c9c7f1f00162/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc#L272
void GetPossibleWorkGroups(TuningType tuning_type,
                           const DeviceInfo& device_info,
                           const KernelInfo& kernel_info, const int3& grid,
                           std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST:
      work_groups->push_back(
          GetWorkGroup(grid, kernel_info.max_work_group_size));
      return;
    case TuningType::EXHAUSTIVE: {
      GetWorkGroupsAlignedToGrid(device_info, kernel_info, grid, work_groups);
      return;
    }
    default:
      work_groups->push_back({8, 4, 1});
      return;
  }
}

void GetPossibleWorkGroupsConv(TuningType tuning_type,
                               const DeviceInfo& device_info,
                               const KernelInfo& kernel_info, const int3& grid,
                               std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST: {
      int max_z_size = 16;
      if (device_info.IsAdreno()) {
        max_z_size = device_info.IsAdreno3xx() ? 16 : 64;
      }
      max_z_size = std::min(max_z_size, device_info.max_work_group_size_z);
      work_groups->push_back(
          GetWorkGroupConv(grid, kernel_info.max_work_group_size, max_z_size));
      return;
    }
    case TuningType::EXHAUSTIVE: {
      GetWorkGroupsAlignedToGrid(device_info, kernel_info, grid, work_groups);
      return;
    }
    default:
      work_groups->push_back({8, 4, 1});
      return;
  }
}
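A quick illustration of how the Conv FAST branch above sizes max_z_size (device names are examples; the logic is from the code):

// Illustration of the FAST-branch max_z_size selection above:
//   Adreno 3xx (e.g. Adreno 330)   -> 16
//   other Adreno (e.g. Adreno 640) -> 64, then capped by
//                                     device_info.max_work_group_size_z
//   any non-Adreno GPU             -> 16 (the initial value), same cap applied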
Looking inside GetPossibleWorkGroupsConv and GetPossibleWorkGroups (see the code above), the cases handled by GetPossibleKernelWorkGroups break down as follows:

- generic FAST: a single candidate from GetWorkGroup(grid, kernel_info.max_work_group_size);
- Conv FAST: a single candidate from GetWorkGroupConv, with a device-dependent (Adreno-aware) max_z_size;
- EXHAUSTIVE (shared by the generic and Conv variants): a full candidate list from GetWorkGroupsAlignedToGrid;
- default: the fixed fallback {8, 4, 1}.

In short, leaving the default case aside, there are three tuning strategies: generic FAST, generic EXHAUSTIVE, and non-generic FAST (i.e. Conv FAST). Generic FAST was analyzed at the top of this post, and the entry point of generic EXHAUSTIVE (GetWorkGroupsAlignedToGrid) was shown earlier; below we finish the EXHAUSTIVE internals and then cover the Conv FAST path.
Continuing with the generic EXHAUSTIVE internals: GetWorkGroupsAlignedToGrid delegates to GenerateWorkGroupSizesAlignedToGrid in common/workgroup_selection.cc, whose behavior is governed by the following alignment setting:
// tensorflow/tensorflow/lite/delegates/gpu/common/workgroup_selection.h
// https://github.com/tensorflow/tensorflow/blob/b5d2374f5e21ff0aa44ac26b039336d7443d08e3/tensorflow/lite/delegates/gpu/common/workgroup_selection.h#L28
// PRECISE assume that WorkGroupSize * k = GridSize;
// NO_ALIGNMENT no restrictions;
// We need PRECISE when we don't have check in kernel for boundaries
// If we have the check, we can use PRECISE or NO_ALIGNMENT as well.
enum class WorkGroupSizeAlignment { PRECISE, NO_ALIGNMENT };
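A concrete reading of PRECISE, with hand-picked numbers (illustrative only):

// For grid.x = 100:
//   wg_x = 10 is PRECISE-aligned: 10 * k == 100 for k = 10;
//   wg_x = 8  is not: the closest covering multiple is 8 * 13 == 104 != 100,
//   so it is usable only under NO_ALIGNMENT, and only if the kernel itself
//   checks work-item indices against the real grid boundaries.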
// tensorflow/tensorflow/lite/delegates/gpu/common/workgroup_selection.cc
template <typename T>
void GenerateWorkGroupSizesAlignedToGrid(const T& grid,
                                         const T& max_work_group_size,
                                         const int max_work_group_invocations,
                                         std::vector<T>* work_groups) {
  auto alignment = WorkGroupSizeAlignment::PRECISE;
  *work_groups = GenerateWorkGroupSizes<T>(
      grid, /*min_work_group_total_size = */ 32, max_work_group_invocations,
      max_work_group_size, alignment, alignment, alignment);
  // If the grid parameter too small, method below cannot generate workgroups.
  if (work_groups->empty()) {
    AddCornerCases(grid, max_work_group_invocations, max_work_group_size,
                   alignment, alignment, alignment, work_groups);
  }
}
template <typename T>
std::vector<T> GenerateWorkGroupSizes(
    const T& grid, int min_work_group_total_size, int max_work_group_total_size,
    const T& max_work_group_sizes, WorkGroupSizeAlignment x_alignment,
    WorkGroupSizeAlignment y_alignment, WorkGroupSizeAlignment z_alignment) {
  std::vector<T> work_groups;
  work_groups.reserve(64);

  std::vector<int> sizes_x = GetPossibleSizes(grid.x, x_alignment);
  std::vector<int> sizes_y = GetPossibleSizes(grid.y, y_alignment);
  std::vector<int> sizes_z = GetPossibleSizes(grid.z, z_alignment);

  for (auto x : sizes_x) {
    if (x > max_work_group_sizes.x) continue;
    for (auto y : sizes_y) {
      if (y > max_work_group_sizes.y) continue;
      for (auto z : sizes_z) {
        if (z > max_work_group_sizes.z) continue;
        const int work_group_size = x * y * z;
        if (work_group_size < min_work_group_total_size ||
            work_group_size > max_work_group_total_size)
          continue;
        work_groups.push_back({x, y, z});
      }
    }
  }
  return work_groups;
}
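To make the triple loop concrete, a hand-worked example (all numbers illustrative):

// grid = (8, 8, 8), generous per-dimension caps, total size in [32, 256].
// PRECISE candidate sizes per axis: divisors of 8 = {1, 2, 4, 8}.
// Kept: every (x, y, z) with 32 <= x*y*z <= 256, e.g.
//   (1, 4, 8) = 32, (2, 2, 8) = 32, (4, 4, 4) = 64, (8, 8, 4) = 256.
// Rejected: (1, 1, 8) = 8 (below the minimum), (8, 8, 8) = 512 (above the max).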
std::vector<int> GetPossibleSizes(int number,
                                  WorkGroupSizeAlignment z_alignment) {
  if (z_alignment == WorkGroupSizeAlignment::PRECISE) {
    // we will use for potential sizes, sizes that cover grid precisely
    // work group size * k (k is integer) == grid_size
    return GetDivisors(number);
  } else {
    // when we chose work group size we can use work group size that
    // work group size * k (k is integer) != grid_size (slightly bigger)
    // so in this heuristic we trying to find potential size, that satisfies
    // to this : work group size * k (k is integer) <= grid_size + 5
    // and this : work group size * k (k is integer) >= grid_size
    return GetDivisorsForRange(number, 5);
  }
}
std::vector<int> GetDivisors(int number) {
  const int max_divisor = static_cast<int>(std::sqrt(number));
  std::vector<int> divisors;
  // we don't know the number of dividers, so it is just heuristic.
  divisors.reserve(max_divisor / 3 + 1);
  for (int i = 1; i <= max_divisor; ++i) {
    const int d = number / i;
    if (i * d == number) {
      divisors.push_back(i);
      if (d != i) {
        divisors.push_back(d);
      }
    }
  }
  return divisors;
}
std::vector<int> GetDivisorsForRange(int number, int range) {
  const int last_number = number + range;
  const int max_divisor = static_cast<int>(std::sqrt(last_number));
  std::set<int> divisors;
  for (int i = 1; i <= max_divisor; ++i) {
    const int reminder = number % i;
    // iterate through numbers that divisible by i in our range;
    const int first_number = number + (i - reminder) % i;
    if (first_number <= last_number) {
      divisors.insert(i);
    }
    for (int j = first_number; j <= last_number; j += i) {
      const int d = j / i;
      if (d != i) {
        divisors.insert(d);
      }
    }
  }
  return std::vector<int>(divisors.begin(), divisors.end());
}
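The difference between the two divisor generators, hand-computed for number = 12:

// GetDivisors(12)            -> {1, 2, 3, 4, 6, 12}   (exact divisors only)
// GetDivisorsForRange(12, 5) -> {1, 2, 3, 4, 5, 6, 7, 8, 12, 13, 14, 15, 16, 17}
//   i.e. every s for which some multiple s * k lands in [12, 17]; e.g. 5 is
//   included because 5 * 3 = 15 <= 17, even though 5 does not divide 12.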
// tensorflow/tensorflow/lite/delegates/gpu/common/workgroup_selection.cc
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/common/workgroup_selection.cc#L30-L87
template <typename T>
void AddCornerCases(const T& grid, int max_work_group_total_size,
                    const T& max_work_group_sizes,
                    WorkGroupSizeAlignment x_alignment,
                    WorkGroupSizeAlignment y_alignment,
                    WorkGroupSizeAlignment z_alignment,
                    std::vector<T>* work_groups) {
  for (int x = 1; x <= 4; ++x) {
    for (int y = 1; y <= 4; ++y) {
      for (int z = 1; z <= 4; ++z) {
        int wg_x = DivideRoundUp(grid.x, x);
        int wg_y = DivideRoundUp(grid.y, y);
        int wg_z = DivideRoundUp(grid.z, z);
        if (wg_x > max_work_group_sizes.x || wg_y > max_work_group_sizes.y ||
            wg_z > max_work_group_sizes.z ||
            wg_x * wg_y * wg_z > max_work_group_total_size) {
          continue;
        }
        if (x_alignment == WorkGroupSizeAlignment::PRECISE &&
            grid.x % wg_x != 0) {
          continue;
        }
        if (y_alignment == WorkGroupSizeAlignment::PRECISE &&
            grid.y % wg_y != 0) {
          continue;
        }
        if (z_alignment == WorkGroupSizeAlignment::PRECISE &&
            grid.z % wg_z != 0) {
          continue;
        }
        work_groups->push_back({wg_x, wg_y, wg_z});
      }
    }
  }

  // this will add at least {1, 1, 1} always.
  for (int x = 1; x <= 4; ++x) {
    for (int y = 1; y <= 4; ++y) {
      for (int z = 1; z <= 4; ++z) {
        if (x > max_work_group_sizes.x || y > max_work_group_sizes.y ||
            z > max_work_group_sizes.z ||
            x * y * z > max_work_group_total_size) {
          continue;
        }
        if (x_alignment == WorkGroupSizeAlignment::PRECISE && grid.x % x != 0) {
          continue;
        }
        if (y_alignment == WorkGroupSizeAlignment::PRECISE && grid.y % y != 0) {
          continue;
        }
        if (z_alignment == WorkGroupSizeAlignment::PRECISE && grid.z % z != 0) {
          continue;
        }
        work_groups->push_back({x, y, z});
      }
    }
  }
}
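A hand-worked trace of AddCornerCases for a tiny grid (illustrative; the size limits are assumed large enough not to matter):

// grid = (5, 1, 1), PRECISE alignment on all axes:
// first loop:  x = 1 -> wg_x = DivideRoundUp(5, 1) = 5, 5 % 5 == 0 -> (5,1,1) kept
//              (repeatedly, since every y, z in 1..4 still yields wg_y = wg_z = 1);
//              x = 2..4 -> wg_x = 3, 2, 2; none divides 5 -> skipped.
// second loop: only (1,1,1) survives the PRECISE checks.
// Net result: candidates {5,1,1} and {1,1,1} (with duplicates of the former) —
// even tiny grids always get at least {1,1,1}.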
Finally, the non-generic FAST strategy, i.e. the Conv-specific FAST path (its EXHAUSTIVE case is identical to the generic one and elided here):

void GetPossibleWorkGroupsConv(TuningType tuning_type,
                               const DeviceInfo& device_info,
                               const KernelInfo& kernel_info, const int3& grid,
                               std::vector<int3>* work_groups) {
  switch (tuning_type) {
    case TuningType::FAST: {
      int max_z_size = 16;
      if (device_info.IsAdreno()) {
        max_z_size = device_info.IsAdreno3xx() ? 16 : 64;
      }
      max_z_size = std::min(max_z_size, device_info.max_work_group_size_z);
      work_groups->push_back(
          GetWorkGroupConv(grid, kernel_info.max_work_group_size, max_z_size));
      return;
    }
    case TuningType::EXHAUSTIVE: {
      // ignored (same as the generic strategy above)
    }
  }
}
// tensorflow/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc
// https://github.com/tensorflow/tensorflow/blob/master/tensorflow/lite/delegates/gpu/cl/kernels/work_group_picking.cc#L206-L215
int3 GetWorkGroupConv(const int3& grid, int max_size, int max_z_size) {
  int wg_z = GetBiggestDivider(grid.z, max_z_size);
  int wg_xy_size = std::min(256, max_size) / wg_z;
  int wg_x = std::min(grid.x, wg_xy_size);
  int wg_y = std::min(wg_xy_size / wg_x, grid.y);
  if (wg_y == grid.y && grid.y % 2 == 0) {
    wg_y = grid.y / 2;
  }
  return int3(wg_x, wg_y, wg_z);
}

int GetBiggestDivider(int number, int max_divider) {
  for (int i = max_divider; i != 0; i--) {
    if (number % i == 0) {
      return i;
    }
  }
  return 1;
}
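A hand-worked trace (numbers illustrative):

// grid = (32, 16, 40), max_size = 256, max_z_size = 64 (e.g. a non-3xx Adreno):
//   wg_z       = GetBiggestDivider(40, 64) = 40   (40 divides 40)
//   wg_xy_size = min(256, 256) / 40        = 6
//   wg_x       = min(32, 6)                = 6
//   wg_y       = min(6 / 6, 16)            = 1    (1 != grid.y, so no halving)
// -> (6, 1, 40), total 240 <= 256. Note that unlike the generic GetWorkGroup,
//    z is allowed to take a much larger share of the work group here.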
2. Picking the best work group

When GetPossibleKernelWorkGroups returns more than one candidate, the else branch of GPUOperation::Tune (shown earlier) calls ProfilingCommandQueue::GetBestWorkGroupIndex, which benchmarks every candidate and returns the index of the fastest one:
// tensorflow/tensorflow/lite/delegates/gpu/cl/cl_command_queue.cc
// https://github.com/tensorflow/tensorflow/blob/1da2ac286f24bc04cef9a24889c24019924691af/tensorflow/lite/delegates/gpu/cl/cl_command_queue.cc#L218-L278
absl::Status ProfilingCommandQueue::GetBestWorkGroupIndex(
    const CLKernel& kernel, const DeviceInfo& device_info,
    const std::vector<int3>& work_groups_count,
    const std::vector<int3>& work_group_sizes, int* index) {
  // Some Adreno 3xx can have wrong numbers for some events
  const bool possible_bug_with_events = device_info.IsAdreno3xx();
  events_.resize(work_group_sizes.size());
  for (int i = 0; i < work_group_sizes.size(); ++i) {
    RETURN_IF_ERROR(CLCommandQueue::Dispatch(kernel, work_groups_count[i],
                                             work_group_sizes[i], &events_[i]));
    // reducing the speed of memory leak on Mali for some kernels
    if (device_info.IsMali() && i % 8 == 7) {
      events_[i - 7].Wait();
    }
    if (possible_bug_with_events) {
      // We are trying to increase probability for correct result.
      RETURN_IF_ERROR(WaitForCompletion());
    }
  }

  RETURN_IF_ERROR(WaitForCompletion());

  // To release memory of some kernel pool on Mali.
  if (device_info.IsMali()) {
    RETURN_IF_ERROR(kernel.ReInit());
  }

  int minimum_index = 0;
  double minimum_time = std::numeric_limits<double>::max();
  if (possible_bug_with_events) {  // we will try to cut out suspicious results
    double average_time = 0.0;
    int average_samples_count = 0;
    for (int i = 0; i < work_group_sizes.size(); ++i) {
      if (events_[i].GetEventTimeMs() < 100 * 1000) {  // 100 sec
        average_time += events_[i].GetEventTimeMs();
        average_samples_count++;
      }
    }
    average_time /= average_samples_count;
    for (int i = 0; i < work_group_sizes.size(); ++i) {
      double time = events_[i].GetEventTimeMs();
      if (time < minimum_time && time >= 0.1 * average_time) {
        minimum_index = i;
        minimum_time = time;
      }
    }
  } else {
    for (int i = 0; i < work_group_sizes.size(); ++i) {
      double time = events_[i].GetEventTimeMs();
      if (time < minimum_time) {
        minimum_index = i;
        minimum_time = time;
      }
    }
  }

  *index = minimum_index;
  return absl::OkStatus();
}
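How the Adreno-3xx outlier filtering above plays out, with hand-picked numbers (illustrative only):

// measured times (ms): {0.42, 0.35, 250000.0, 0.50}
//   averaging step: 250000 ms (250 s) >= 100 s, so it is excluded;
//                   average = (0.42 + 0.35 + 0.50) / 3 ~= 0.423 ms
//   selection step: only times >= 0.1 * 0.423 ~= 0.042 ms are eligible, so a
//                   bogus near-zero event time would also be rejected;
//   winner: index 1 (0.35 ms).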