
Enhancing Random Shuffling Efficiency for Machine Learning Systems with Non-Volatile Memory Storage

Advisor: 楊佳玲 (Chia-Lin Yang)
Co-advisor: 鄭湘筠 (Hsiang-Yun Cheng)

Abstract


Machine learning algorithms such as Support Vector Machines (SVMs) and Deep Neural Networks (DNNs) are widely used today. When training a machine learning algorithm, randomly shuffling the entire training dataset improves testing accuracy and speeds up convergence. However, when the training dataset is too large to fit in system memory, shuffling it is not straightforward. Some machine learning frameworks implement out-of-memory random shuffling under the assumption that the storage device is a hard disk drive (HDD); because random accesses on an HDD are slow, these frameworks sacrifice the degree of randomness of the shuffle in order to reduce random storage accesses. A more recent framework, Lightweight Implementation of Random Shuffling (LIRS), exploits solid-state drive (SSD) storage, whose random accesses are hundreds of times faster than an HDD's, to achieve full-range random shuffling without taking up precious CPU memory. However, LIRS has a problem: depending on the instance size, it may read redundant data from the SSD, which hinders its own performance and consumes more storage bandwidth than necessary. We propose a padding mechanism to solve this problem; compared to LIRS, it reduces data loading time by up to 70%. We also integrate LIRS and the padding mechanism into TFRecord, a data format commonly used for DNN training, enabling programmers to achieve full-range random shuffling with the TFRecord format.
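To make the padding idea concrete, the minimal Python sketch below (not code from the thesis; the file name, record size, and helper names are illustrative assumptions) shows how padding every instance to a fixed on-disk record size enables full-range shuffling of a dataset larger than memory: each epoch draws a fresh permutation of instance indices, and each instance is then fetched from SSD-backed storage with a single seek and read, so no neighbouring data is read redundantly.

import os
import random

# Illustrative sketch only (not the thesis implementation): full-range shuffling
# over a dataset that does not fit in memory, assuming every training instance
# has been padded to a fixed record size so that instance i starts exactly at
# byte offset i * record_size and one read returns exactly one instance.

RECORD_SIZE = 4096        # hypothetical padded size of each instance, in bytes
DATA_PATH = "train.bin"   # hypothetical file holding the padded instances

def shuffled_epoch(path=DATA_PATH, record_size=RECORD_SIZE):
    """Yield every instance exactly once, in a fresh random order each call."""
    num_records = os.path.getsize(path) // record_size
    order = list(range(num_records))
    random.shuffle(order)                  # full-range permutation of all instances
    with open(path, "rb") as f:
        for idx in order:
            f.seek(idx * record_size)      # random access; fast on an SSD
            yield f.read(record_size)      # exactly one padded instance, no extra bytes

# Example use: one pass over the generator per training epoch.
# for raw_instance in shuffled_epoch():
#     example = parse(raw_instance)        # parse() is application-specific
#     train_step(example)

Padding trades some storage space for alignment: every random access retrieves exactly one instance, which is the redundant-read problem described above, and the in-memory shuffle only materialises the index permutation rather than the data itself, so CPU memory usage stays small.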

Keywords

SVM, DNN, Random Shuffling, SSD

