
Design and Analysis of Tile Size Selection in MLIR Framework

Advisor: Shih-Wei Liao (廖世偉)

Abstract


For the past few decades, the limited speed of personal computers' central processing units meant that computations requiring massive resources, such as machine learning, were typically offloaded via cloud computing to servers with more powerful processors. With the rapid advances in semiconductor manufacturing in recent years, the CPUs of personal computers and even mobile phone chips have become fast enough to move machine learning algorithms to edge computing. In this era of ubiquitous smartphones, demand for real-time image processing on mobile devices keeps growing; camera filters, for example, must extract image features in real time, so the trend toward edge computing grows by the day. As CPU performance improves substantially, the main bottleneck for edge computing performance is no longer computation speed but the latency caused by the gap between memory access speed and CPU speed. In compiler theory, the loop tiling transformation is regarded as one of the most important ways to optimize data reuse: by changing the execution order of loops, it reduces the amount of data touched at a time, increasing data reuse and thereby reducing the proportion of accesses that go from cache to main memory during program execution. However, previous research has shown that performance is highly sensitive to tile size; slightly different sizes can lead to very different performance. To find appropriate tile sizes, prior methods have had to trade off compilation time against the quality of the chosen solution. This thesis implements a new tile size selection strategy in the MLIR framework. Like the original tile size selection in MLIR's Affine dialect, it finds a better solution without incurring excessive compilation cost. We tile the loops of two operations common in image processing and machine learning, matrix multiplication and convolution, and compile them to x86 and ARM platforms for performance evaluation. The experimental results show that, compared with the original method, these programs run 30% and 18% faster on the respective hardware platforms, with compilation cost no higher than that of the original method.

Parallel Abstract


Because of the limited speed of personal computers' central processing units, operations that require massive computing resources, such as machine learning, have long been performed by cloud servers with more powerful processors. With the rapid development of semiconductor manufacturing processes in recent years, the computing speed of CPUs in personal computers and even mobile phone chips has allowed machine learning algorithms to be moved to edge computing. With the proliferation of smartphones, there is increasing demand for real-time image processing on mobile phones; camera filters, for example, must extract image features in real time, and the trend toward edge computing is growing by the day. As CPU performance improves significantly, the main bottleneck affecting edge computing performance is no longer computing speed but memory access overhead. In compiler theory, the loop tiling transformation is regarded as one of the most important ways to optimize data reuse: by changing the execution order of loops, it reduces the amount of data accessed at a time, thereby increasing data reuse and reducing the ratio of accesses that fall through from cache to memory during program execution. However, previous research has shown that performance is sensitive to tile size, and that slightly different sizes can result in very different performance. To find appropriate tile sizes, previous methods had to trade off compilation time against the quality of the selected solution. This thesis implements a new tile size selection method in the MLIR framework's Affine dialect. Like the original, the new method uses a simple analytic approach to determine tile sizes, keeping transformation overhead low during the compilation stage. We perform loop tiling on matrix multiplication and convolution, two important operations in image processing and machine learning, and compile them to x86 and ARM platforms for performance evaluation. The results show that our method improves these two programs by an average of 30% and 18%, respectively.
