
Elastic TensorFlow: Design and Implementation of a Scalable Deep Learning Computing Framework

Elastic TensorFlow: A Novel Network Overlay Design and Implementation to Support Elastic Deep Learning Computing

Advisor: 周志遠

Abstract


TensorFlow is a popular deep learning framework. Besides a single-machine version, it also supports a distributed version across multiple machines and multiple GPUs, and it scales well. Because deep learning is computationally intensive, training often takes hours to days. Through distributed computing, users can bring more available computing resources into the job and accelerate training. However, most deep learning frameworks require the machine cluster to be known in advance before training can start, and TensorFlow is no exception. Under this constraint, adding machines to speed up training requires stopping the training job, updating the cluster specification, and restarting the job. This design also makes resource usage inflexible and prevents the full benefits of cloud computing from being realized. We therefore extend TensorFlow into a dynamically scalable framework, ElasticTF, which can elastically add or remove compute nodes without interrupting the training process. With this capability, we can dynamically adjust the resources allocated to training under a limited resource pool and keep overall system utilization high. In the cloud, we can use training performance estimates to add machines to accelerate training or to reduce cost. In our experiments, compared with the traditional restart-based approach, ElasticTF saves nearly 18.6% of the cost while completing the same amount of computation within a limited time, and its cost is close to that of an ideal static training configuration.
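
To make the static-cluster constraint above concrete, the following is a minimal sketch using the TensorFlow 1.x distributed API, which was current when this work was done. The host names and checkpoint directory are hypothetical; this illustrates the baseline behavior that ElasticTF removes, not code from the thesis.

import tensorflow as tf  # TensorFlow 1.x distributed training API

# Hypothetical static cluster: one parameter server and two workers.
# Adding a third worker later requires editing this specification, stopping
# every process, and restarting the job so it restores from checkpoint files.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# After a restart, training resumes from the latest checkpoint, e.g. via
# tf.train.MonitoredTrainingSession(master=server.target, checkpoint_dir="/tmp/ckpt").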

Abstract (English)


TensorFlow is one of the most popular deep learning frameworks. With good scalability, it provides not only single-node training but also distributed training across multiple devices and multiple nodes. Due to the computational cost of deep learning, training often takes hours to days to finish. Through distributed computing, users can exploit more resources to speed up the training process. However, most deep learning frameworks, including TensorFlow, assume a static cluster. Under this constraint, scaling up the number of workers requires shutting down the training job, updating the cluster specification, and restarting the job from checkpoint files. As a result, resource usage is inflexible, which limits the advantages of cloud computing. We therefore introduce ElasticTF, a framework based on TensorFlow that can scale workers dynamically at runtime without suspending execution. With this elasticity, we can maintain high utilization on a resource-limited system. Furthermore, in a public cloud, we can use historical performance logs to predict scaling strategies that speed up training and/or reduce its cost. Compared with checkpoint-restart performing the same computation before a deadline, ElasticTF reduces cost by up to 18.6%, and its cost is comparable to that of an ideal static training setting.
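
As a back-of-the-envelope illustration of the deadline/cost trade-off mentioned above (not taken from the thesis), the sketch below estimates whether scaling out pays off before a deadline; the step count, throughput, scaling efficiency, and hourly price are all hypothetical assumptions.

def remaining_cost(steps_left, steps_per_hour, workers, price_per_worker_hour):
    """Estimated hours and cost to finish the remaining steps at a given scale."""
    hours = steps_left / steps_per_hour
    return hours, hours * workers * price_per_worker_hour

# Hypothetical numbers: 100k steps left, 2k steps/hour on 4 workers,
# 1.8x throughput when doubling to 8 workers, $0.50 per worker-hour.
hours_4, cost_4 = remaining_cost(100_000, 2_000, 4, 0.50)
hours_8, cost_8 = remaining_cost(100_000, 2_000 * 1.8, 8, 0.50)
deadline_hours = 40

if hours_4 > deadline_hours >= hours_8:
    print(f"Scale out: {hours_8:.1f} h / ${cost_8:.2f} vs. {hours_4:.1f} h / ${cost_4:.2f}")

An elastic framework such as ElasticTF can apply such a decision mid-run, whereas a checkpoint-restart baseline pays an additional interruption every time the cluster changes.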

