
Computation and Communication Scheduling Optimization for Distributed Deep Learning Systems

Advisor: 劉邦鋒

Abstract


Deep learning is a technique for solving complex problems. Because of the growth in data and in model complexity, large-scale deep learning has become an important problem. Distributed deep learning is an effective way to train a large model, but in a distributed environment the network bandwidth is the performance bottleneck. This thesis focuses on how to schedule network activity to reduce training time. We propose several schedulers and obtain up to a 25% speedup.

Parallel Abstract


Deep learning is a technique that can solve complex problems. Due to the growth of data and model complexity, large-scale deep learning has become an important issue. Distributed deep learning is an efficient way to train a large model. In a distributed environment, network bandwidth is a performance bottleneck. This thesis focuses on how to schedule network events to reduce training time. We propose several schedulers and achieve up to a 25% speedup.
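
The claim that scheduling network events shortens training rests on overlapping gradient transfers with the backward computation. The Python sketch below is a minimal back-of-the-envelope model, assuming entirely hypothetical per-layer times and a single shared network link; it only illustrates this general idea and is not the schedulers proposed in the thesis.

# Minimal sketch with made-up numbers: compare one data-parallel training
# step with and without overlapping gradient transfers and backward compute.

layer_backward_ms = [30, 20, 10]   # conv1, fc2, fc3: backward compute time (hypothetical)
layer_comm_ms     = [50, 20, 10]   # per-layer gradient transfer time (hypothetical)

# No overlap: finish the whole backward pass, then send every gradient.
serial = sum(layer_backward_ms) + sum(layer_comm_ms)

# With overlap: backward runs in reverse layer order, so the last layer's
# gradient is ready first; each transfer starts once its gradient is ready
# and the link is free, and transfers are serialized on the shared link.
ready_at = []
t = 0
for bwd in reversed(layer_backward_ms):          # backward pass: fc3 -> fc2 -> conv1
    t += bwd
    ready_at.append(t)

link_free = 0
for ready, comm in zip(ready_at, reversed(layer_comm_ms)):
    link_free = max(link_free, ready) + comm
overlapped = link_free

print(f"serial step:     {serial} ms")           # 140 ms
print(f"overlapped step: {overlapped} ms")       # 110 ms
print(f"speedup:         {serial / overlapped:.2f}x")

Under these made-up numbers the overlapped step takes 110 ms instead of 140 ms; the actual gain depends on how compute and transfer times line up and on the order in which a scheduler sends the gradients.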
