
Elastic TensorFlow: Design and Implementation of a Scalable Deep Learning Computing Framework

Elastic TensorFlow: A Novel Network Overlay Design and Implementation to Support Elastic Deep Learning Computing

Advisor: 周志遠

Abstract


TensorFlow is a popular deep learning framework. Besides a single-machine version, it also supports a distributed version across multiple machines and multiple GPUs, and it scales well. Because deep learning is computationally intensive, training often takes hours to days. Through distributed computing, users can bring more available computing resources into the job and accelerate training. However, most deep learning frameworks require the machine cluster to be known in advance before training can start, and TensorFlow is no exception. Under this constraint, adding machines to speed up training requires stopping the training job, updating the cluster specification, and restarting the job. This design also makes resource usage inflexible and prevents the full benefits of cloud computing from being realized. We therefore extend TensorFlow into a dynamically scalable framework, ElasticTF, which can elastically add or remove compute nodes without interrupting the training process. With this capability, we can dynamically adjust the resources allocated to training under a limited resource pool and keep overall system utilization high. In the cloud, we can use training performance estimates to add machines to accelerate training or to reduce cost. In our experiments, compared with the traditional restart-based approach, ElasticTF saves nearly 18.6% of the cost while completing the same amount of computation within a limited time, and its cost is close to that of an ideal static training configuration.
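
To make the static-cluster constraint above concrete, the following is a minimal sketch using the TensorFlow 1.x distributed API, which was current when this work was done. The host names and checkpoint directory are hypothetical; this illustrates the baseline behavior that ElasticTF removes, not code from the thesis.

import tensorflow as tf  # TensorFlow 1.x distributed training API

# Hypothetical static cluster: one parameter server and two workers.
# Adding a third worker later requires editing this specification, stopping
# every process, and restarting the job so it restores from checkpoint files.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# After a restart, training resumes from the latest checkpoint, e.g. via
# tf.train.MonitoredTrainingSession(master=server.target, checkpoint_dir="/tmp/ckpt").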

Abstract (English)


TensorFlow is one of the most popular deep learning frameworks. With good scalability, it provides not only single-node training but also distributed training across multiple devices and multiple nodes. Due to the computational cost of deep learning, training often takes hours to days to finish. Through distributed computing, users can exploit more resources to speed up the training process. However, most deep learning frameworks, including TensorFlow, assume a static cluster. Under this constraint, scaling up the number of workers requires shutting down the training job, updating the cluster specification, and restarting the job from checkpoint files. As a result, resource usage is inflexible, which limits the advantages of cloud computing. We therefore introduce ElasticTF, a framework based on TensorFlow that can scale workers dynamically at runtime without suspending execution. With this elasticity, we can maintain high utilization on a resource-limited system. Furthermore, in a public cloud, we can use historical performance logs to predict scaling strategies that speed up training and/or reduce its cost. Compared with checkpoint-restart performing the same computation before a deadline, ElasticTF reduces cost by up to 18.6%, and its cost is comparable to that of an ideal static training setting.
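
As a back-of-the-envelope illustration of the deadline/cost trade-off mentioned above (not taken from the thesis), the sketch below estimates whether scaling out pays off before a deadline; the step count, throughput, scaling efficiency, and hourly price are all hypothetical assumptions.

def remaining_cost(steps_left, steps_per_hour, workers, price_per_worker_hour):
    """Estimated hours and cost to finish the remaining steps at a given scale."""
    hours = steps_left / steps_per_hour
    return hours, hours * workers * price_per_worker_hour

# Hypothetical numbers: 100k steps left, 2k steps/hour on 4 workers,
# 1.8x throughput when doubling to 8 workers, $0.50 per worker-hour.
hours_4, cost_4 = remaining_cost(100_000, 2_000, 4, 0.50)
hours_8, cost_8 = remaining_cost(100_000, 2_000 * 1.8, 8, 0.50)
deadline_hours = 40

if hours_4 > deadline_hours >= hours_8:
    print(f"Scale out: {hours_8:.1f} h / ${cost_8:.2f} vs. {hours_4:.1f} h / ${cost_4:.2f}")

An elastic framework such as ElasticTF can apply such a decision mid-run, whereas a checkpoint-restart baseline pays an additional interruption every time the cluster changes.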

