Accelerating Convolutional Neural Networks via Inter-Operator Scheduling

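Convolutional neural networks (CNNs) are essential in many machine learning tasks. Current deep learning frameworks and compilers usually treat a neural network as a directed acyclic graph (DAG) of tensor operations and execute the operations one at a time in a topological order. This general approach has two problems. First, newer CNNs have branch structures that form complex DAGs, which makes it hard to find a good topological order for scheduling operators within a GPU. Second, modern hardware has high computational power, so running operators sequentially under-utilizes its resources. Together, these two problems open the possibility of exploiting inter-operator parallelism, i.e., parallelism among independent operators in the DAG, to use hardware resources more efficiently. In this work, we formally define a DAG scheduling problem that addresses resource contention and propose an early-start-time-first algorithm with two heuristic rules to exploit parallelism between independent operators. Experimental results show that, compared to sequential execution, our method improves performance by up to 3.76x on an RTX 3090.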

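Keywords

Inter-operator scheduling; graphics processing unit; convolutional neural network; resource-constrained project scheduling; deep learning framework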

Parallel Abstract

Convolutional neural networks (CNNs) are essential in many machine learning tasks. Current deep learning frameworks and compilers usually treat a neural network as a DAG (directed acyclic graph) of tensor operations and execute the operations one at a time in a topological order, which respects the dependencies in the DAG. There are two issues with this general approach. First, newer CNNs have branch structures that form complex DAGs. These DAGs make it hard to find a good topological order for scheduling operators within a GPU. Second, modern hardware has high computational power, so running operators sequentially under-utilizes its resources. These two issues open the possibility of exploiting inter-operator parallelism, i.e., parallelism among independent operators in the DAG, to utilize hardware resources more efficiently. In this work, we formally define a DAG scheduling problem that addresses resource contention and propose an early-start-time-first algorithm with two heuristic rules for exploiting parallelism between independent operators. Experimental results show that our method improves performance by up to 3.76x on an RTX 3090 compared to sequential execution.
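To make the idea of inter-operator scheduling concrete, the following is a minimal sketch of a generic list scheduler that starts ready operators in order of earliest ready time, subject to one shared resource limit. It is not the algorithm of this work: the abstract does not state the two heuristic rules, so none are reproduced here. The scalar `capacity`, the fixed per-operator durations and demands, and all operator names are assumptions made for illustration only.

```python
import heapq


def schedule(ops, deps, capacity):
    """Greedy earliest-ready-first list scheduling on a DAG.

    ops:  {name: (duration, demand)}   -- hypothetical cost model
    deps: {name: [predecessor names]}  -- must describe an acyclic graph
    Returns ({name: start time}, makespan).
    """
    preds_left = {op: len(deps.get(op, [])) for op in ops}
    succs = {op: [] for op in ops}
    for op, preds in deps.items():
        for p in preds:
            succs[p].append(op)

    # Heap of (ready time, name) for operators whose inputs are all available.
    ready = [(0.0, op) for op in sorted(ops) if preds_left[op] == 0]
    heapq.heapify(ready)
    running = []  # heap of (finish time, name)
    free = capacity
    now = 0.0
    start = {}

    while ready or running:
        # Start ready operators, earliest ready time first, while they fit.
        deferred = []
        while ready:
            ready_time, op = heapq.heappop(ready)
            duration, demand = ops[op]
            if demand <= free:
                free -= demand
                start[op] = now
                heapq.heappush(running, (now + duration, op))
            else:
                deferred.append((ready_time, op))
        for item in deferred:
            heapq.heappush(ready, item)

        if not running:
            raise ValueError("an operator demands more than the total capacity")

        # Advance time to the next completion and release its resources.
        now, done = heapq.heappop(running)
        free += ops[done][1]
        for s in succs[done]:
            preds_left[s] -= 1
            if preds_left[s] == 0:
                heapq.heappush(ready, (now, s))

    return start, now


# A small branch-structured graph: one input operator, three branches, one merge.
ops = {
    "conv_in": (2.0, 4),
    "branch_a": (3.0, 2),
    "branch_b": (3.0, 2),
    "branch_c": (1.0, 2),
    "concat": (1.0, 4),
}
deps = {
    "branch_a": ["conv_in"],
    "branch_b": ["conv_in"],
    "branch_c": ["conv_in"],
    "concat": ["branch_a", "branch_b", "branch_c"],
}

start, makespan = schedule(ops, deps, capacity=4)
sequential = sum(duration for duration, _ in ops.values())
print(f"parallel makespan: {makespan}, sequential: {sequential}")
```

In this toy graph the two longer branches run side by side, so the schedule finishes earlier than running the five operators one after another. On a real GPU, durations change when operators share the device, which is the resource contention that the abstract says the formal problem definition addresses and which this sketch ignores.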

Parallel Keywords

Inter-operator scheduling; graphics processing unit; convolutional neural network; resource-constrained project scheduling; deep learning framework
