ACC Saturator: Automatic Kernel Optimization for Directive-Based GPU Code

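9 Related Work

Numerous programming models and state-of-the-art techniques have been introduced for adapting code to parallel architectures, especially GPUs [13, 33, 37, 38, 42]. Directive-based programming models [3, 35] extend sequential languages and let complex scientific applications offload loop iterations to accelerators while keeping their structure. However, compilers for these models often build on sequential code generation, which limits their optimization opportunities to those of general-purpose computation [4, 28].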
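As a minimal sketch of the directive-based style described above, the following C function offloads a loop with a single OpenACC directive while the loop itself stays ordinary sequential C. The function name `saxpy` and the chosen data clauses are illustrative assumptions, not taken from the paper.

```c
#include <stddef.h>

/* Hypothetical example kernel: y[i] = a * x[i] + y[i].
 * The directive asks an OpenACC compiler to run the loop iterations on the
 * accelerator. A compiler without OpenACC support ignores the pragma and
 * runs the loop sequentially on the host, so the code structure is the same
 * in both cases. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

Calling `saxpy(4, 2.0f, x, y)` updates `y` in place whether or not the directive is honored, which is what lets an existing application keep its loop structure.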

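Several projects explore domain-specific or architecture-specific approaches to directive-based code. CLAW DSL [10] provides directives for grid-based algorithms and supports OpenACC/OpenMP code generation while enabling target-specific optimizations such as spatial blocking. JACC [28] is an OpenACC runtime framework that provides just-in-time kernel compilation with dynamic constant expansion. OptACC [30] searches runtime parameters to optimize OpenACC parallelism. CCAMP [27] interchanges OpenACC and OpenMP and optimizes the parallelization for each combination of model and architecture. Barua et al. [4] develop an automatic OpenACC kernel optimizer that maximizes instruction-level parallelism (ILP) through loop unrolling. SAFARA [46] fully exploits register resources to facilitate the reuse of array references in OpenACC kernels.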
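To illustrate how loop unrolling can expose ILP, here is a hand-written sketch of a sum reduction unrolled by a factor of 4. It is not the optimizer of Barua et al. [4]; the function names and the unroll factor are assumptions for illustration, and integers are used so that reordering the additions cannot change the result.

```c
#include <stddef.h>

/* Baseline: every addition depends on the previous one, so the loop forms
 * a single long dependency chain. */
long sum_serial(const int *v, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; ++i) {
        acc += v[i];
    }
    return acc;
}

/* Unrolled by 4 with independent accumulators: the four additions inside
 * one iteration do not depend on each other, so hardware that can issue
 * several instructions at once is free to overlap them. */
long sum_unrolled4(const int *v, size_t n)
{
    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        acc0 += v[i];
        acc1 += v[i + 1];
        acc2 += v[i + 2];
        acc3 += v[i + 3];
    }
    /* Remainder loop for lengths that are not a multiple of 4. */
    for (; i < n; ++i) {
        acc0 += v[i];
    }
    return (acc0 + acc1) + (acc2 + acc3);
}
```

Both functions return the same sum for any input; whether the unrolled form is actually faster depends on the target and on what the compiler already does.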

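Our ACC Saturator differs in three respects:

(1) it automates optimization through optimal code selection based on rewriting rules and a cost model;
(2) it integrates a bulk load optimization technique that significantly improves GPU memory throughput; and
(3) it applies to both OpenACC and OpenMP without requiring domain-specific information, while preserving the original code structure.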
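This section does not spell out how bulk load works. Assuming it means gathering a kernel body's loads ahead of the arithmetic that consumes them, so that the memory requests can be issued together, the idea can be sketched as below. The function names and the kernel are hypothetical, and a compiler may already schedule the loads this way on its own.

```c
#include <stddef.h>

/* Interleaved form: loads and arithmetic alternate as the expression is
 * written. */
void combine_interleaved(size_t n, const float *a, const float *b,
                         const float *c, const float *d, float *out)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n], c[0:n], d[0:n]) copyout(out[0:n])
    for (size_t i = 0; i < n; ++i) {
        float t = a[i] * b[i];     /* load a, load b, multiply */
        out[i] = t + c[i] * d[i];  /* load c, load d, multiply, add */
    }
}

/* Bulk-load form (assumed interpretation): all loads are placed at the top
 * of the loop body, and the arithmetic only uses the local temporaries. */
void combine_bulk(size_t n, const float *a, const float *b,
                  const float *c, const float *d, float *out)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n], c[0:n], d[0:n]) copyout(out[0:n])
    for (size_t i = 0; i < n; ++i) {
        const float ai = a[i];
        const float bi = b[i];
        const float ci = c[i];
        const float di = d[i];
        out[i] = ai * bi + ci * di;
    }
}
```

The two functions compute the same values; only the position of the loads in the loop body differs.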
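Since the introduction of the equality saturation library egg [53], numerous studies have used it to accelerate deep learning applications, particularly in the context of GPU computing [16, 40, 41, 52, 57]. These studies apply rewriting rules over arithmetic expressions, abstract operations, and tensor graphs to optimize convolutions, sparse tensor algebra, and whole tensor operations. Diospyros [49] uses equality saturation to synthesize efficient DSP operations from C code, while Gowda et al. [19] use egg to implement a symbolic algebra system for automatic parallelism assignment.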
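The combination of a rewriting rule and a cost model can be illustrated with a deliberately tiny sketch. This is not equality saturation and does not use egg or an e-graph: it applies one rule, x*y + x*z => x*(y+z), once at the root and keeps whichever form is cheaper. All names and the cost weights (3 per multiplication, 1 per addition) are assumptions for illustration.

```c
#include <stddef.h>

typedef enum { VAR, ADD, MUL } Kind;

typedef struct Expr {
    Kind kind;
    char name;               /* used when kind == VAR ('a'..'z') */
    const struct Expr *lhs;  /* used when kind is ADD or MUL */
    const struct Expr *rhs;
} Expr;

/* Hypothetical cost model: multiplication 3, addition 1, variable 0. */
int cost(const Expr *e)
{
    switch (e->kind) {
    case VAR: return 0;
    case ADD: return 1 + cost(e->lhs) + cost(e->rhs);
    case MUL: return 3 + cost(e->lhs) + cost(e->rhs);
    }
    return 0;
}

/* Evaluate an expression; env[k] holds the value of variable 'a' + k. */
int eval(const Expr *e, const int *env)
{
    switch (e->kind) {
    case VAR: return env[e->name - 'a'];
    case ADD: return eval(e->lhs, env) + eval(e->rhs, env);
    case MUL: return eval(e->lhs, env) * eval(e->rhs, env);
    }
    return 0;
}

static int same_var(const Expr *p, const Expr *q)
{
    return p->kind == VAR && q->kind == VAR && p->name == q->name;
}

/* Rewriting rule x*y + x*z => x*(y+z), matched at the root only.
 * On a match, writes the new nodes into caller-provided storage and
 * returns 1; otherwise returns 0 and leaves the storage untouched. */
int factor(const Expr *e, Expr *sum_out, Expr *prod_out)
{
    if (e->kind != ADD || e->lhs->kind != MUL || e->rhs->kind != MUL) {
        return 0;
    }
    if (!same_var(e->lhs->lhs, e->rhs->lhs)) {
        return 0;
    }
    *sum_out = (Expr){ ADD, 0, e->lhs->rhs, e->rhs->rhs };
    *prod_out = (Expr){ MUL, 0, e->lhs->lhs, sum_out };
    return 1;
}

/* Selection step: return the cheaper of the original and rewritten form. */
const Expr *select_cheapest(const Expr *original, Expr *sum_buf, Expr *prod_buf)
{
    if (factor(original, sum_buf, prod_buf) && cost(prod_buf) < cost(original)) {
        return prod_buf;
    }
    return original;
}
```

For `a*b + a*c` the original form costs 7 under this model and the factored form `a*(b+c)` costs 4, so `select_cheapest` returns the factored form. An equality saturation engine differs in that it keeps many equivalent forms at once and extracts the best one after applying all rules.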
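Many optimization techniques have been developed for GPU computation [22]. Rawat et al. [39] reorder stencil computations with a heuristic algorithm to relieve register pressure and increase SM occupancy. Software systolic arrays [9] assign global data to GPU threads and compute results by propagating and accumulating partial results through overlapped blocks, which improves data locality without using shared memory. Hong et al. [23] use lightweight kernel emulation to give an autotuner feedback on performance bottlenecks. In contrast, ACC Saturator adopts a direct rewriting mechanism, which makes it applicable to complex programs. In addition, our method is complementary to other code optimizations.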

References

[4] Prithayan Barua, Jun Shirako, and Vivek Sarkar. 2018. Cost-driven thread coarsening for GPU kernels. In Proceedings of the 27th International Conference on Parallel Architectures and Compilation Techniques (PACT). Association for Computing Machinery, New York, NY, USA, Article 32, 14 pages. https://doi.org/10.1145/3243176.3243196

[9] Peng Chen, Mohamed Wahib, Shinichiro Takizawa, Ryousei Takano, and Satoshi Matsuoka. 2019. A versatile software systolic execution model for GPU memory-bound kernels. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ’19). Association for Computing Machinery, New York, NY, USA, Article 53, 81 pages. https://doi.org/10.1145/3295500.3356162

[10] Valentin Clement, Sylvaine Ferrachat, Oliver Fuhrer, Xavier Lapillonne, Carlos E. Osuna, Robert Pincus, Jon Rood, and William Sawyer. 2018. The CLAW DSL: Abstractions for performance portable weather and climate models. In Proceedings of the Platform for Advanced Scientific Computing Conference (PASC ’18). Association for Computing Machinery, New York, NY, USA, Article 2, 10 pages. https://doi.org/10.1145/3218176.3218226

[13] H. Carter Edwards and Daniel Sunderland. 2012. Kokkos array performance-portable manycore programming model. In Proceedings of the 2012 International Workshop on Programming Models and Applications for Multicores and Manycores (PMAM ’12). Association for Computing Machinery, New York, NY, USA, 1–10. https://doi.org/10.1145/2141702.2141703

[14] Matthias Felleisen, Robert Bruce Findler, Matthew Flatt, Shriram Krishnamurthi, Eli Barzilay, Jay McCarthy, and Sam Tobin-Hochstadt. 2015. The Racket manifesto. In 1st Summit on Advances in Programming Languages (SNAPL 2015). Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik.

[22] Pieter Hijma, Stijn Heldens, Alessio Sclocco, Ben van Werkhoven, and Henri E. Bal. 2023. Optimization techniques for GPU programming. ACM Comput. Surv. 55, 11, Article 239 (March 2023), 81 pages. https://doi.org/10.1145/3570638

[28] Kazuaki Matsumura, Simon Garcia De Gonzalo, and Antonio J. Peña. 2021. JACC: An OpenACC runtime framework with kernel-level and multi-GPU parallelization. In 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC). IEEE Computer Society, Los Alamitos, CA, USA, 182–191. https://doi.org/10.1109/HiPC53243.2021.00032

[33] NVIDIA Corporation. 2022. Programming Guide :: CUDA Toolkit Documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html

[34] NVIDIA Corporation. 2023. High Performance Computing (HPC) SDK | NVIDIA. https://developer.nvidia.com/hpc-sdk

[35] The OpenACC Organization. 2011. OpenACC. https://www.openacc.org/

[36] Sunghyun Park, Salar Latifi, Yongjun Park, Armand Behroozi, Byungsoo Jeon, and Scott Mahlke. 2022. SRTuner: Effective compiler optimization customization by exposing synergistic relations. In Proceedings of the 20th IEEE/ACM International Symposium on Code Generation and Optimization (CGO ’22). IEEE Press, 118–130. https://doi.org/10.1109/CGO53902.2022.9741263

[37] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An imperative style, high-performance deep learning library. Curran Associates Inc., Red Hook, NY, USA.

[38] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI ’13). Association for Computing Machinery, New York, NY, USA, 519–530. https://doi.org/10.1145/2491956.2462176

[39] Prashant Singh Rawat, Fabrice Rastello, Aravind Sukumaran-Rajam, Louis-Noël Pouchet, Atanas Rountev, and P. Sadayappan. 2018. Register optimizations for stencils on GPUs. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP ’18). Association for Computing Machinery, New York, NY, USA, 168–182. https://doi.org/10.1145/3178487.3178500

[40] Amir Shaikhha, Mathieu Huot, and Shideh Hashemian. 2023. ∇SD: Differentiable programming for sparse tensors. arXiv:cs.PL/2303.07030

[41] Gus Henry Smith, Andrew Liu, Steven Lyubomirsky, Scott Davidson, Joseph McMahan, Michael Taylor, Luis Ceze, and Zachary Tatlock. 2021. Pure tensor program rewriting via access patterns (representation pearl). In Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming (MAPS 2021). Association for Computing Machinery, New York, NY, USA, 21–31. https://doi.org/10.1145/3460945.3464953

[42] Miko M. Stulajter, Ronald M. Caplan, and Jon A. Linker. 2022. Can Fortran’s ‘do concurrent’ replace directives for accelerated computing?. In Accelerator Programming Using Directives, Sridutt Bhalachandra, Christopher Daley, and Verónica Melesse Vergara (Eds.). Springer International Publishing, Cham, 3–21.

[43] Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. 2009. Equality Saturation: A new approach to optimization. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL ’09). Association for Computing Machinery, New York, NY, USA, 264–276. https://doi.org/10.1145/1480881.1480915

[44] The Khronos Group Inc. 2023. OpenCL Overview - The Khronos Group Inc. https://www.khronos.org/api/opencl

[45] The LLVM Project. 2023. Clang C Language Family Frontend for LLVM. https://clang.llvm.org/

[46] X. Tian, D. Khaldi, D. Eachempati, R. Xu, and B. Chapman. 2016. Optimizing GPU register usage: Extensions to OpenACC and compiler optimizations. In 2016 45th International Conference on Parallel Processing (ICPP). 572–581. https://doi.org/10.1109/ICPP.2016.72

[47] TOP500.org. 2022. November 2022 | TOP500. https://www.top500.org/lists/top500/2022/11/

[48] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. 2021. MLGO: A machine learning guided compiler optimizations framework. arXiv:cs.PL/2101.04808

[49] Alexa VanHattum, Rachit Nigam, Vincent T. Lee, James Bornholt, and Adrian Sampson. 2021. Vectorization for Digital Signal Processors via equality saturation. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2021). Association for Computing Machinery, New York, NY, USA, 874–886. https://doi.org/10.1145/3445814.3446707

[50] Markus Velten, Robert Schöne, Thomas Ilsche, and Daniel Hackenberg. 2022. Memory performance of AMD EPYC Rome and Intel Cascade Lake SP server processors. In Proceedings of the 2022 ACM/SPEC on International Conference on Performance Engineering (ICPE ’22). Association for Computing Machinery, New York, NY, USA, 165–175. https://doi.org/10.1145/3489525.3511689

[51] Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, and Onur Mutlu. 2015. A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA ’15). Association for Computing Machinery, New York, NY, USA, 41–53. https://doi.org/10.1145/2749469.2750399

[52] Yisu Remy Wang, Shana Hutchison, Jonathan Leang, Bill Howe, and Dan Suciu. 2020. SPORES: Sum-product optimization via relational equality saturation for large scale linear algebra. Proc. VLDB Endow. 13, 12 (July 2020), 1919–1932. https://doi.org/10.14778/3407790.3407799

[53] Max Willsey, Chandrakana Nandi, Yisu Remy Wang, Oliver Flatt, Zachary Tatlock, and Pavel Panchekha. 2021. Egg: Fast and extensible equality saturation. Proc. ACM Program. Lang. 5, POPL, Article 23 (Jan. 2021), 29 pages. https://doi.org/10.1145/3434304

[54] Michael Joseph Wolfe, Carter Shanklin, and Leda Ortega. 1995. High performance compilers for parallel computing. Addison-Wesley Longman Publishing Co., Inc., USA.

[55] Gene Wu, Joseph L. Greathouse, Alexander Lyashevsky, Nuwan Jayasena, and Derek Chiou. 2015. GPGPU performance and power estimation using machine learning. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 564– https://doi.org/10.1109/HPCA.2015.7056063

[56] Rengan Xu, Xiaonan Tian, Sunita Chandrasekaran, Yonghong Yan, and Barbara Chapman. 2014. NAS parallel benchmarks for GPGPUs using a directive-based programming model. In International Workshop on Languages and Compilers for Parallel Computing. Springer, 67–81. https://doi.org/10.1007/978-3-319-17473-0_5

[57] Yichen Yang, Phitchaya Phothilimthana, Yisu Wang, Max Willsey, Sudip Roy, and Jacques Pienaar. 2021. Equality saturation for tensor graph superoptimization. In Proceedings of Machine Learning and Systems, A. Smola, A. Dimakis, and I. Stoica (Eds.), Vol. 3. 255–268. https://proceedings.mlsys.org/paper_files/paper/2021/file/65ded5353c5ee48d0b7d48c591b8f430-Paper.pdf

[58] Hamid Reza Zohouri and Satoshi Matsuoka. 2019. The memory controller wall: Benchmarking the Intel FPGA SDK for OpenCL memory interface. In 2019 IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC). 11–18. https://doi.org/10.1109/H2RC49586.2019.00007
