Pre-release optimization and trimming: OpenCV's compact build advice

ysh329 commented 5 years ago

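The previous two posts only scratched the surface:

Pre-release library optimization and trimming: a first look: mainly introduced the strip command;
Pre-release optimization and trimming: gflags and glog: mainly covered considerations from real-world trimming, such as how other libraries handle gflags and glog, and alternatives to the protobuf library (for example, TensorFlow Lite uses FlatBuffers instead, and tiny-dnn uses cereal, a header-only serialization library).

This post uses Compact build advice · opencv/opencv Wiki as its outline, translating it and adding some material along the way. I came across this OpenCV wiki page while searching for related material, and it is quite specific. Results first: here is a table of trimming experiments on the OpenCV library: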

| Experiment | Hide symbols | Function sections | GC sections | LTO, size | Size (MiB) | Relative (%) |
|---|---|---|---|---|---|---|
| 1 | no | no | no | | 413 | 100 |
| 2 | no | yes | yes | | 405 | 98 |
| 3 | yes | no | no | | 413 | 100 |
| 4 | yes | yes | no | | 413 | 100 |
| 5 | yes | no | yes | | 412 | 100 |
| 6 | yes | yes | yes | | 243 | 59 |
| 7 | no | no | no | LTO | 386 | 93 |
| 8 | yes | yes | yes | LTO | 192 | 46 |
| 9 | no | no | no | size | 272 | 66 |
| 10 | yes | yes | yes | size | 163 | 39 |
| 11 | yes | yes | yes | both | 130 | 31 |

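Note: binary size depends on the library configuration, which in turn depends on the build platform and the libraries installed on it. The comparison here is for an x86_64 build without CUDA, OpenCL, or IPP (Intel Integrated Performance Primitives, Intel's multimedia and math library, optimized with SSE, AVX, and similar instruction sets).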

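The table leads to the following key conclusions:

The basic trimming options (covered below), i.e. the first three columns (Hide symbols, Function sections, GC sections), reduce size substantially only in combination (compare #6 with #1 to #5): enabling one or two of them has almost no effect, while enabling all three brings the size down to 59%;
LTO reduces the size further at the cost of build time (#8);
If execution speed is not critical, using the -Os compiler flag instead of -O3 (#10 and #11) shrinks the binary to roughly a third of the baseline size.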
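
The Relative (%) column is each build's size divided by the 413 MiB baseline of experiment 1. A minimal sketch of that arithmetic follows; the helper name `relative_percent` is made up for illustration and is not part of OpenCV or its wiki.

```cpp
// Hypothetical helper, not part of OpenCV: size of a build relative to a
// baseline, in percent, rounded to the nearest integer.
// Both arguments are sizes in MiB; baseline_mib must be positive.
constexpr int relative_percent(int size_mib, int baseline_mib) {
    return (size_mib * 100 + baseline_mib / 2) / baseline_mib;
}

// Sizes copied from the table; experiment 1 (413 MiB) is the baseline.
static_assert(relative_percent(243, 413) == 59, "experiment 6: basic trimming");
static_assert(relative_percent(192, 413) == 46, "experiment 8: basic trimming + LTO");
static_assert(relative_percent(130, 413) == 31, "experiment 11: basic trimming + LTO + -Os");
```

Compiling the file is the whole check: a `static_assert` fails at compile time if a percentage does not match the table.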

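The rest of this post walks through the trimming techniques in the table. A lot of background material is mixed in; feel free to skip it.

When writing a C program, we often run into pieces that are repeated or commonly needed. Rewriting them every time works, but it badly hurts productivity and readability and makes later maintenance harder. We can instead package them as functions that are simply called when needed and can be upgraded later.

Building lightweight OpenCV-based applications on Linux

OpenCV can be built in two ways: dynamic (shared libraries) and static (archives).

On most platforms the default mode is dynamic. To switch to static, turn off the BUILD_SHARED_LIBS option when configuring OpenCV with CMake. CMake uses add_library to distinguish SHARED/STATIC/MODULE, but underneath it simply invokes the compiler it was configured with, such as gcc. The following section supplements this with how to build and link static and dynamic libraries directly with gcc.

Building and linking static/dynamic libraries with gcc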

Building and linking a static library, then a shared one:

```shell
# 1. Build the object files
# main.c calls functions from add.c and answer.c
$ gcc -c src/main.c -o bin/main.o

# Object files for the static library (without -fPIC)
$ gcc -c src/tq84/add.c -o bin/static/add.o
$ gcc -c src/tq84/answer.c -o bin/static/answer.o

# Object files for the shared library must be position-independent code (-fPIC),
# because a shared library can be mapped to any position in the address space
$ gcc -c -fPIC src/tq84/add.c -o bin/shared/add.o
$ gcc -c -fPIC src/tq84/answer.c -o bin/shared/answer.o

# 2.1 Build the static library
# A static library is a set of object files copied into a single file with the suffix .a
# It is created with the archiver command (ar)
# Here add.o and answer.o are bundled into the static library libtq84.a
$ ar rcs bin/static/libtq84.a bin/static/add.o bin/static/answer.o

# 2.2 Link statically
# Link main.o against the static library
# -L: directory in which to look for the library (given by hand here; not a general-purpose setup)
# -l: name of the library, without its `lib` prefix and its `.a` or `.so` suffix
$ gcc bin/main.o -Lbin/static -ltq84 -o bin/statically-linked

# The resulting bin/statically-linked does not need libtq84.a or any .o file at run time
# and can be distributed without them. Run it from the shell:
$ ./bin/statically-linked

# 3.1 Create the shared (dynamic) library
# This creates a shared library without an SONAME, using GCC's -shared flag
# and the suffix .so instead of .a for the output file
$ gcc -shared bin/shared/add.o bin/shared/answer.o -o bin/shared/libtq84.so

# A shared library requires position-independent code, i.e. the C files must be
# compiled with -fPIC (as was done for bin/shared/*.o in step 1; main.o does not need it)
# Using object files built without -fPIC (such as the static ones) gives an error like:
# /usr/bin/ld: bin/tq84.o: relocation R_X86_64_PC32 against symbol `gSummand' can not be used when making a shared object; recompile with -fPIC

# 3.2 Link dynamically with the shared library
# Note the similarity with 2.2: -Lbin/static there, -Lbin/shared here
# Order matters: -ltq84 must come after main.o
$ gcc bin/main.o -Lbin/shared -ltq84 -o bin/use-shared-library
```

References:
- [Creating a shared and static library with the gnu compiler (gcc)](https://renenyffenegger.ch/notes/development/languages/C-C-plus-plus/GCC/create-libraries/index)
- [Shared libraries with GCC on Linux - Cprogramming.com](https://www.cprogramming.com/tutorial/shared-libraries-linux-gcc.html)

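That said, dynamic and static libraries each have their own use cases, advantages, and drawbacks.

Shared libraries (`-DBUILD_SHARED_LIBS=ON`):

An application linked against the library contains only references to its functions and data, so the application is small
Several applications can share the same library (see the previous point)
The application must be able to find and load the library and all of its dependencies (e.g. libpng)
All dependencies must be compatible with the library

Static libraries (`-DBUILD_SHARED_LIBS=OFF`):

The application binary contains all the functions it uses (and possibly more), so it is larger
Applications cannot share the library; each one carries its own copy
The application has fewer runtime dependencies
Some embedded platforms support only this variant
For more details, see Static_library on Wikipedia

Building static and shared libraries together with CMake

Just use two add_library calls: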
add_library(MyLib SHARED source1.c source2.c)
add_library(MyLibStatic STATIC source1.c source2.c)
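If there are many source files, collect them in a CMake variable and pass that to add_library. On Windows the two libraries need different names, because both the shared and the static build produce a ".lib" file. On Linux and Mac, both can share one name (e.g. libMyLib.a and libMyLib.so):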
set_target_properties(MyLibStatic PROPERTIES OUTPUT_NAME MyLib)
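However, giving the static and shared versions the same name is not recommended. Different names make it easier to choose static or dynamic linking on the command line of tools that link against the library, for example libMyLib.so (shared) and libMyLib_static.a (static) on Linux.

The following sections describe GCC and Clang compiler options and their effect on the binary of an application linked against a statically built OpenCV.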

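Basic trimming

Compiler options: -fvisibility=hidden, -fvisibility-inlines-hidden
Function attribute: __attribute__ ((visibility ("hidden")))

Visibility controls which symbols a dynamic library exposes. A variable or function marked hidden is visible only inside its own DSO: other libraries cannot resolve it. Hiding symbols also reduces the library's size. The GCC Wiki explains the benefits of C++ visibility in detail; the four main points are: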

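It very substantially improves the load time of a DSO, for example a library that uses C++ templates heavily;
It lets the optimizer produce better code. PLT indirections (where a function call or variable access must be looked up through the Global Offset Table, as in PIC code) can be avoided completely, which largely avoids pipeline stalls on modern processors and yields much faster code. Furthermore, when most symbols are bound locally, they can be safely elided throughout the whole DSO. This gives greater latitude especially to the inliner, which no longer needs to keep an entry point around "just in case";
It reduces the size of the DSO by 5-20%. ELF's exported symbol table format is a space hog: it stores the complete mangled symbol name, which with heavy template use can average around 1000 bytes. C++ templates generate a huge number of symbols, and a C++ library can easily surpass 30,000 of them, which is around 5-6 MB;
The chance of symbol collisions is much lower. The old problem of two libraries internally using the same symbol for different things goes away. Hallelujah!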

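For a more detailed treatment of shared libraries, see How To Write Shared Libraries | Ulrich Drepper, Dec 10, 2011, which the GCC Wiki also recommends. So how is symbol visibility actually used?

In the headers for the API or public interface, put __attribute__((visibility("default"))) in front of the declaration of every struct, class, and function that should be exposed; a snippet you can copy and paste follows. Then add -fvisibility=hidden to every GCC compile command. Run nm -C -D on the resulting DSO and compare the output with and without hiding to check that the exported symbols are what you expect.

Note: if exceptions are thrown across shared object boundaries, see the "Problems with C++ exceptions" section of Visibility - GCC Wiki.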

#if defined _WIN32 || defined __CYGWIN__
  #ifdef BUILDING_DLL
    #ifdef __GNUC__
      #define DLL_PUBLIC __attribute__ ((dllexport))
    #else
      #define DLL_PUBLIC __declspec(dllexport) // Note: GCC also seems to support this syntax.
    #endif
  #else
    #ifdef __GNUC__
      #define DLL_PUBLIC __attribute__ ((dllimport))
    #else
      #define DLL_PUBLIC __declspec(dllimport) // Note: GCC also seems to support this syntax.
    #endif
  #endif
  #define DLL_LOCAL
#else
  #if __GNUC__ >= 4
    #define DLL_PUBLIC __attribute__ ((visibility ("default")))
    #define DLL_LOCAL  __attribute__ ((visibility ("hidden")))
  #else
    #define DLL_PUBLIC
    #define DLL_LOCAL
  #endif
#endif

extern "C" DLL_PUBLIC void function(int a);
class DLL_PUBLIC SomeClass
{
   int c;
   DLL_LOCAL void privateMethod();  // Only for use within this DSO
public:
   SomeClass(int _c) : c(_c) { }
   static void foo(int a);
};
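
As a smaller, hedged sketch of the same idea, the snippet below marks one function as exported and one as hidden. The names `MYLIB_API`, `MYLIB_LOCAL`, `clamp_to_byte`, and `mylib_brighten` are hypothetical and exist only for this illustration; the macros fall back to nothing on compilers without the GCC attribute.

```cpp
#include <cassert>

// Hypothetical macro names; they play the role of DLL_PUBLIC / DLL_LOCAL.
#if defined(__GNUC__) && !defined(_WIN32) && !defined(__CYGWIN__)
  #define MYLIB_API   __attribute__((visibility("default")))
  #define MYLIB_LOCAL __attribute__((visibility("hidden")))
#else
  #define MYLIB_API
  #define MYLIB_LOCAL
#endif

// Internal helper: hidden, so it is callable only from inside this DSO.
MYLIB_LOCAL int clamp_to_byte(int v) {
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

// Public entry point: stays exported even when built with -fvisibility=hidden.
extern "C" MYLIB_API int mylib_brighten(int pixel, int delta) {
    assert(pixel >= 0 && pixel <= 255);    // pixel is an 8-bit intensity
    assert(delta >= -255 && delta <= 255); // keeps the sum far from overflow
    return clamp_to_byte(pixel + delta);
}
```

When this is compiled into a shared object, `nm -C -D` should list `mylib_brighten` among the defined symbols but not `clamp_to_byte`.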

References: see the links in the following comments.

ysh329 commented 5 years ago

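Two key points of the size reduction: eliminating the need for RTTI and removing the Protobuf dependency

Details of each optimization step:

step1: visibility hidden + data & function sections with --gc-sections + strip

We depend on static libraries. With the compile and link options above, the application became smaller, but the symbol table showed that some library functions were still present. After some investigation, the fix was to add -fvisibility=hidden.

Library size: 3.4 MB

Reference: https://blog.csdn.net/Swallow_he/article/details/87373345

step2: remove glog and shut down the local log system

Library size: 3.3 MB

step3: rework op register (exclude unrelated code)

Library size: 3.3 MB (very little effect)

step4: drop the previous static-library linking and link the object files into the DSO instead

Library size: 3.1 MB

step5: remove or replace every throw and add -fno-exceptions

Library size: 2.8 MB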
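
Step 5 does not say how each throw was replaced. One common pattern, shown here only as an assumed sketch with hypothetical names (`Status`, `checked_at`), is to report failure through the return value so that the code no longer needs exception support.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical status type standing in for whatever the project uses.
enum class Status { kOk, kOutOfRange };

// Before (needs exception support):
//   int at(const int* data, std::size_t n, std::size_t i) {
//       if (i >= n) throw std::out_of_range("index");
//       return data[i];
//   }

// After: failure is reported through the return value and the result is
// written to *out, so the function builds with -fno-exceptions.
Status checked_at(const int* data, std::size_t n, std::size_t i, int* out) {
    assert(out != nullptr);  // caller must supply storage for the result
    if (i >= n) return Status::kOutOfRange;
    *out = data[i];
    return Status::kOk;
}
```

Callers then check the returned `Status` instead of wrapping the call in try/catch, which is more verbose but lets the compiler drop the unwinding tables.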

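step6: drop the default RTTI system and replace it with a self-implemented RTTI structure (new rtti.h / cc)

Does not compile: the current version builds protobuf from source, and protobuf itself calls typeid. Replacing RTTI inside protobuf would be a lot of work for little gain, because a later PR will remove the protobuf dependency; the PR with the protobuf replacement has not been merged yet.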
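
The contents of rtti.h are not shown in this thread. The following is therefore only an assumed sketch of what a hand-rolled RTTI replacement can look like: each class carries a kind tag and a `classof` check, in the style LLVM uses for its own casts. All class and function names here are hypothetical.

```cpp
#include <cassert>

// Base class stores a kind tag that replaces typeid / dynamic_cast.
class Op {
public:
    enum class Kind { kConv, kPool };
    explicit Op(Kind k) : kind_(k) {}
    virtual ~Op() {}
    Kind kind() const { return kind_; }
private:
    Kind kind_;
};

class ConvOp : public Op {
public:
    explicit ConvOp(int kernel) : Op(Kind::kConv), kernel_(kernel) {}
    static bool classof(const Op* op) { return op->kind() == Kind::kConv; }
    int kernel() const { return kernel_; }
private:
    int kernel_;
};

class PoolOp : public Op {
public:
    explicit PoolOp(int window) : Op(Kind::kPool), window_(window) {}
    static bool classof(const Op* op) { return op->kind() == Kind::kPool; }
    int window() const { return window_; }
private:
    int window_;
};

// Stand-in for dynamic_cast: returns nullptr when the kind does not match.
template <typename To>
const To* op_dyn_cast(const Op* op) {
    return (op != nullptr && To::classof(op)) ? static_cast<const To*>(op) : nullptr;
}

// Checked downcast for callers that already know the kind.
template <typename To>
const To* op_cast(const Op* op) {
    assert(op != nullptr && To::classof(op));
    return static_cast<const To*>(op);
}
```

Because nothing here uses `typeid` or `dynamic_cast`, code written this way can be built with `-fno-rtti`; the cost is that every new subclass must add a `Kind` value and a `classof` function by hand.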

ysh329 commented 5 years ago

C/C++ visibility - zhu4674548's column - CSDN blog https://blog.csdn.net/zhu4674548/article/details/83904604

Visibility - GCC Wiki https://gcc.gnu.org/wiki/Visibility

ysh329 commented 5 years ago

pts.blog: How to make smaller C and C++ binaries http://ptspts.blogspot.com/2013/12/how-to-make-smaller-c-and-c-binaries.html

Android NDK: How to Reduce Binaries Size - The Algolia Blog - Algolia Blog https://blog.algolia.com/android-ndk-how-to-reduce-libs-size/

c++ - How to optimize size of shared library? - Stack Overflow https://stackoverflow.com/questions/8021470/how-to-optimize-size-of-shared-library

Code optimization for size in C++ - CodeProject https://www.codeproject.com/Questions/1231114/Code-optimization-for-size-in-Cplusplus

ysh329 commented 5 years ago

At work we have a custom tool that parses the .map files that Visual Studio generates. This lets us see code and data sizes from the executable.

Like KulSeran says: The best way that I know of to examine sizes is to generate and examine a 'map file' - this is basically a way of determining where functions, static data, and resources live in your executable file, and/or where they will be loaded at runtime.

If you want to further analyze code, you need to examine the assembly instructions (an assembly listing file can be generated by many compilers at build time).

You can also just open the EXE in a hex editor and see if there's any obviously wasted space (usually the only time this can be seen at a glance is for large static arrays which are initialized to zero).

Some things that generally reduce code size for x86 programs are:

Set optimization settings as aggressive as possible, EXCEPT for loop unrolling, if you can adjust that separately.
Experiment with function inlining settings. Counterintuitively, sometimes turning it on will SAVE space.
Make sure dead code stripping is enabled.
Remove all debugging symbols.
Disable exceptions and RTTI.
If you use lots of templated containers, try to use as few unique data types as possible in those containers.
Try the 'omit frame pointer' option - this skips the 'push ebp; mov ebp, esp' at the start of every function, saving you 3 bytes, and between 1 and 3 bytes at the end depending on whether the compiler uses the 'LEAVE' instruction or manually MOVs and pops EBP. It also frees up the EBP register for general-purpose use as a 7th general purpose register, which can decrease the amount of memory or instructions needed when shuffling data around.
Compile in 32-bit rather than 64-bit mode unless you plan to perform LOTS of 64-bit integer math, or use more than 3GB of RAM. 64-bit instructions that use the REX prefix tend to be wasteful.
Avoid large static arrays if there is an alternative.
Change alignment and struct packing to eliminate padding wherever possible (this can result in speed issues or alignment exceptions depending on what you're doing).
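
The last item in the list can be illustrated without any packing pragma. The sketch below, with made-up struct names, only reorders members from largest to smallest alignment, which removes interior padding while keeping every member naturally aligned, so it avoids the alignment problems mentioned above.

```cpp
#include <cstdint>

// Members in careless order: the compiler inserts padding after each small
// member so that the larger member that follows is properly aligned.
struct Padded {
    std::uint8_t  flag;
    std::uint64_t id;
    std::uint8_t  kind;
    std::uint32_t count;
};

// Same members, sorted from largest to smallest alignment: the only padding
// left is at the tail of the struct.
struct Reordered {
    std::uint64_t id;
    std::uint32_t count;
    std::uint8_t  flag;
    std::uint8_t  kind;
};

// Holds on any ABI: sorting by descending alignment never adds padding.
static_assert(sizeof(Reordered) <= sizeof(Padded),
              "reordering must not make the struct larger");
```

On a typical x86-64 ABI the two structs occupy 24 and 16 bytes respectively, but the exact numbers depend on the platform's alignment rules.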

For RISC:

I've found that instruction sets with fixed size instructions (RISC 32-bit) tend to be EXTREMELY wasteful (in space) compared to x86's variable-width instructions. You might be able to shrink your codebase by switching the majority of your code to use interpreted bytecode (WITHOUT JITting it). This is a pretty extreme thing to do though, since it will destroy your performance.
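
To make the bytecode idea concrete, here is a toy stack-machine interpreter. It is a sketch under invented assumptions: the opcode set and the `run` function are hypothetical and do not come from the quoted post. Its point is that a computation like (2 + 3) * 4 fits in 9 bytes of program data, while the interpreter loop is paid for only once.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical one-byte opcodes for a toy stack machine.
enum Opcode : std::uint8_t { kPush, kAdd, kMul, kHalt };

// Interprets the program and returns the top of the stack when it halts
// (or 0 if the stack is empty).
int run(const std::vector<std::uint8_t>& code) {
    std::vector<int> stack;
    std::size_t pc = 0;
    while (pc < code.size()) {
        switch (code[pc++]) {
        case kPush:
            assert(pc < code.size());
            stack.push_back(code[pc++]);  // one-byte immediate operand
            break;
        case kAdd: {
            assert(stack.size() >= 2);
            int rhs = stack.back();
            stack.pop_back();
            stack.back() += rhs;
            break;
        }
        case kMul: {
            assert(stack.size() >= 2);
            int rhs = stack.back();
            stack.pop_back();
            stack.back() *= rhs;
            break;
        }
        case kHalt:
            return stack.empty() ? 0 : stack.back();
        default:
            assert(false && "unknown opcode");
            return 0;
        }
    }
    return stack.empty() ? 0 : stack.back();
}
```

A program is just a byte vector, for example `{kPush, 2, kPush, 3, kAdd, kPush, 4, kMul, kHalt}`. As the quoted post warns, every operation now goes through the dispatch loop, so this trades speed for size.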

Advanced Search - GameDev.net https://www.gamedev.net/search/

ysh329 commented 5 years ago

Compact build advice · opencv/opencv Wiki https://github.com/opencv/opencv/wiki/Compact-build-advice

ysh329 commented 5 years ago

Using the GNU Compiler Collection (GCC): Optimize Options https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

ysh329 commented 5 years ago

Reducing Executable Size - WxWiki https://wiki.wxwidgets.org/Reducing_Executable_Size

ysh329 commented 5 years ago

Tutorial: 4k Intros in Linux https://int21.de/linux4k/

c - How to create 4KB Linux binaries that render a 3D scene? - Stack Overflow https://stackoverflow.com/questions/10551665/how-to-create-4kb-linux-binaries-that-render-a-3d-scene/10552160#10552160
