
Analysis and Optimization of the Input Pipeline and the Use of XLA for TensorFlow Deep Learning Systems

Advisor: 洪士灝 (Shih-Hao Hung)

Abstract


Deep learning is a branch of machine learning that can be used for tasks such as image recognition, speech analysis, and text translation. For deep learning applications where performance and efficiency matter, developers should not focus only on the design of the neural network algorithm; they must also consider the data preprocessing (input) pipeline involved in a complete real-world inference. In some cases, data preprocessing severely degrades the performance of a deep learning application, so users need profiling tools to locate the performance bottleneck. However, existing deep learning profilers such as nvprof and tfprof do not provide the detailed data needed to analyze the performance of the TensorFlow input pipeline in depth, so we developed a profiling tool, SOFA, to help solve this problem. To validate the tool, this thesis proposes four ways of implementing the data preprocessing pipeline and uses SOFA to examine how each affects performance. In experimental scenarios built on five neural network models, users can clearly identify from SOFA's analysis results where the performance bottlenecks are and why they occur. The experiments also show that optimizing the input pipeline can greatly improve the performance of a deep learning application; for example, AlexNet gains a 19.8x speedup and GoogLeNet a 12.3x speedup. When the bottleneck lies in inference rather than in the input pipeline, TensorFlow's Accelerated Linear Algebra (XLA) mechanism can provide further acceleration, for example raising VGG11 from 7.8x the performance of the baseline version to 8.4x.
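The abstract refers to optimizing the TensorFlow data preprocessing (input) pipeline without showing how. The sketch below only illustrates the general technique with the tf.data API, assuming a recent TensorFlow 2.x release; the file pattern, image size, and the decode_and_resize helper are hypothetical and are not taken from the thesis.

```python
import tensorflow as tf

# Hypothetical preprocessing step: decode a JPEG file and resize it.
def decode_and_resize(path):
    image = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    return tf.image.resize(image, [224, 224])

# Assumed file layout; the thesis's dataset is not reproduced here.
files = tf.data.Dataset.list_files("images/*.jpg")

# Overlap CPU preprocessing with GPU inference/training:
#  - map(..., num_parallel_calls=...) decodes images on multiple CPU threads
#  - prefetch(...) prepares the next batches while the current one is consumed
dataset = (files
           .map(decode_and_resize, num_parallel_calls=tf.data.AUTOTUNE)
           .batch(32)
           .prefetch(tf.data.AUTOTUNE))

for batch in dataset.take(1):
    print(batch.shape)  # e.g. (32, 224, 224, 3)
```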

Parallel Abstract (English)


Deep learning is a subset of machine learning, and deep learning applications include image detection and voice recognition. For deep learning applications, developers should not focus only on the design and accuracy of the neural network; they must also take into account the input pipeline involved in a real-world inference step. Data preprocessing can become a serious performance issue in some cases, so developers need a profiling tool to analyze their deep learning applications. However, existing profiling tools such as nvprof and tfprof cannot capture the full details of TensorFlow data preprocessing. In this study, a deep learning profiling tool, SOFA (Swarms of Functions Analysis), is developed to solve this problem. To evaluate SOFA, four data preprocessing methods are implemented across five neural network models and analyzed with SOFA. SOFA allows developers to discover the performance bottleneck and its root cause. The case studies show that, once the data preprocessing pipeline is optimized, a great improvement in deep learning application performance is possible: a 19.8x speedup is achieved with AlexNet and a 12.3x speedup with GoogLeNet. When the CPU is no longer the performance bottleneck, an additional speedup is achievable with XLA; for example, VGG11 improves from a 7.8x speedup in the original version to an 8.4x speedup in the XLA version.
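The abstract also credits part of the speedup to XLA. As a minimal sketch of enabling XLA compilation for a single computation, again assuming a recent TensorFlow 2.x release where jit_compile is available, the example below compiles a small dense layer; the thesis evaluated XLA on full models such as VGG11, which is not reproduced here.

```python
import tensorflow as tf

# XLA-compiled dense layer: jit_compile=True asks XLA to fuse the matmul,
# bias add, and ReLU into a single optimized kernel.
@tf.function(jit_compile=True)
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal([32, 512])
w = tf.random.normal([512, 256])
b = tf.zeros([256])
print(dense_relu(x, w, b).shape)  # (32, 256)
```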

