Distributed machine learning is essential for training deep learning models with large amounts of data and many parameters. Current research on distributed machine learning focuses on using more hardware devices with powerful computing units for fast training. Consequently, model training tends to use larger batch sizes to accelerate training. However, large-batch training often suffers from low accuracy due to poor generalization ability. Researchers have proposed many sophisticated methods to address the accuracy issue caused by large batch sizes, but these methods usually involve complex mechanisms that make training more difficult. In addition, training hardware powerful enough for large batch sizes is expensive, and not all researchers can afford it.

We propose a dual batch size learning scheme to address the batch size issue. We use the maximum batch size our hardware supports to achieve the highest training efficiency we can afford. In addition, we introduce a smaller batch size during training to improve the generalization ability of the model. Using two different batch sizes simultaneously in the same training run reduces the testing loss and yields good generalization ability, with only a slight increase in training time.

We implement our dual batch size learning scheme and conduct experiments. By increasing the training time by 5%, we can reduce the loss from 1.429 to 1.246 in some cases. In addition, by appropriately adjusting the percentages of large and small batches, we can increase the accuracy by 2.8% in some cases. With an additional 10% of training time, we can reduce the loss from 1.429 to 1.193, and after moderately adjusting the numbers of large and small batches, the accuracy can increase by 2.9%.

Using two different batch sizes in the same training run introduces two complications. First, the data processing speeds of the two batch sizes differ, so we must assign the data proportionally to maximize the overall processing speed. Second, since the smaller batches see less data under this assignment, we proportionally adjust their contribution to the global weight update in the parameter server, using the ratio of data processed by the small and large batches. Experimental results indicate that this contribution adjustment increases the final accuracy by another 0.9%.
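To make the two complications above more concrete, the following is a minimal sketch of how data could be split between a large-batch worker and a small-batch worker in proportion to their throughput, and how the small-batch gradient's contribution to the global update could be scaled by the data ratio. The function names, the learning rate, and the exact weighting rule are illustrative assumptions for this sketch, not the thesis' actual implementation.

```python
import numpy as np

# Illustrative sketch only: (1) split data between the large-batch and
# small-batch workers in proportion to their processing speeds, and
# (2) scale the small-batch contribution to the global update by the
# ratio of data each worker processed. All names and constants here are
# assumptions made for illustration.

def split_data(num_samples, speed_large, speed_small):
    """Assign samples to each worker in proportion to its throughput."""
    share_large = speed_large / (speed_large + speed_small)
    n_large = int(round(num_samples * share_large))
    return n_large, num_samples - n_large

def dual_batch_update(weights, grad_large, grad_small, n_large, n_small, lr=0.01):
    """One parameter-server step combining both gradients.

    The small-batch gradient is down-weighted by n_small / n_large so that
    its influence matches the share of data it actually processed
    (an assumed weighting rule for this sketch).
    """
    ratio = n_small / n_large
    return weights - lr * (grad_large + ratio * grad_small)

# Toy usage with random vectors standing in for real gradients.
rng = np.random.default_rng(0)
w = rng.normal(size=10)
n_large, n_small = split_data(10_000, speed_large=8_000, speed_small=2_000)  # samples/sec
g_large, g_small = rng.normal(size=10), rng.normal(size=10)
w = dual_batch_update(w, g_large, g_small, n_large, n_small)
```

In this sketch, the faster large-batch worker receives most of the samples so that neither worker idles, while the ratio factor keeps the small-batch gradient from being over-weighted relative to the amount of data it saw.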