張量本位之平行運算記憶體搬運優化方法論

A Tensor-centric Methodology for Optimizing Data Movement on Parallel Processing Hardware

Advisor: 簡韶逸 (Shao-Yi Chien)
Co-advisor: 陳維超 (Wei-Chao Chen)

Abstract


In recent years, deep learning has achieved great success in fields such as computer vision, natural language processing, and artificial intelligence. These applications make heavy use of parallel processing to improve computational performance. In parallel processing architectures for deep learning, one of the biggest challenges is moving data from off-chip memory to the processing elements, because transistor density grows far faster than memory bandwidth. In this dissertation, we propose a mathematical formulation that effectively transfers optimization techniques for various applications across different parallel processing architectures. We observe that, in parallel processing, data movement can be viewed as a transform of tensors across the memory hierarchy, so many memory optimization techniques can be described mathematically. We call this tensor transform the MERIT transform; it applies not only to deep learning but also to many traditional machine learning and computer vision computations. Moreover, the MERIT transform maps onto existing vector processor architectures, and with it we can convert many common applications into a MERIT representation on GPUs, achieving up to a 20-fold speedup with less code. We also use the principle of this transform to design a dedicated hardware architecture, VectorMesh, in which the processing elements are organized into vector units that exchange data directly, vector to vector, through FIFO queues. Besides common convolutional networks and matrix multiplication, VectorMesh supports various deep learning techniques such as subpixel convolution and the correlation layer, while achieving energy and area efficiency on par with more specialized processors.

Parallel Abstract


Deep learning has achieved great success in fields such as computer vision, natural language processing, and artificial intelligence, and many of these applications rely on parallel processing to achieve high performance. One of the most significant challenges in optimizing deep learning applications on a parallel processing architecture is data movement from off-chip storage to the processing elements (PEs), because the density of logic gates grows much faster than memory bandwidth. In this dissertation, we propose a mathematical formulation that is useful for transferring application optimization knowledge across computing platforms. We discover that, in parallel processing, data movement can be viewed as tensor transforms across memory hierarchies, making it possible to describe many memory optimization techniques mathematically. This transform, which we call the Memory Efficient Ranged Inner-product Tensor (MERIT) transform, can be applied not only to DNN tasks but also to many traditional machine learning and computer vision computations. Moreover, the tensor transform can be readily mapped to existing vector processor architectures. With this transform, we can convert many popular applications into a succinct MERIT notation on CUDA GPUs, speeding up GPU kernels by up to 20 times while using only half as many code tokens. We also use the principle of the proposed transform to design an ASIC architecture called VectorMesh, whose PEs are grouped into vectors, with FIFOs between the vectors to facilitate data exchange. VectorMesh supports various DNN tasks, such as subpixel CNN and the correlation layer, as well as other computer vision tasks, while providing area and power efficiency comparable to dedicated DNN ASICs.
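
To make the phrase "data movement can be viewed as tensor transforms" concrete, below is a minimal NumPy sketch of the general idea; it is not code from the dissertation and not the MERIT transform itself, and the function name and shapes are illustrative assumptions. Gathering the sliding windows of a 2-D convolution is expressed purely as a change of the tensor's index map (shape and strides), after which the arithmetic reduces to inner products over ranges of the transformed tensor.

import numpy as np
from numpy.lib.stride_tricks import as_strided

def conv2d_as_tensor_transform(image, kernels):
    # image: (H, W); kernels: (K, kh, kw); returns (K, H-kh+1, W-kw+1).
    H, W = image.shape
    K, kh, kw = kernels.shape
    oh, ow = H - kh + 1, W - kw + 1
    # Transform step: describe every sliding window as a view of the same
    # buffer; no data is copied, only the index mapping (shape/strides) changes.
    sH, sW = image.strides
    windows = as_strided(image, shape=(oh, ow, kh, kw), strides=(sH, sW, sH, sW))
    # Ranged inner-product step: contract each window with each kernel.
    return np.einsum('xyhw,khw->kxy', windows, kernels)

if __name__ == "__main__":
    img = np.arange(36, dtype=np.float64).reshape(6, 6)
    ker = np.ones((2, 3, 3))
    print(conv2d_as_tensor_transform(img, ker).shape)  # (2, 4, 4)

On a GPU or a dedicated accelerator, the same split between an index-remapping step and an inner-product step determines which data must be staged in shared memory or FIFOs, which is the kind of decision the MERIT formulation is meant to capture.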
