Machine learning algorithms such as Support Vector Machine (SVM) and Deep Neural Network (DNN) are widely used today. When training a machine learning algorithm, randomly shuffling the training dataset can improve testing accuracy and convergence speed. However, if the training dataset is larger than system memory, shuffling it is not straightforward. Some training frameworks implement out-of-memory random shuffling under the assumption that the storage device is a hard disk drive (HDD); because random access on an HDD is slow, they must sacrifice some randomness to complete the shuffle. A more recent framework, LIRS, relies on the fast random access of solid-state drives (SSDs) to achieve full-range random shuffling. However, LIRS has a problem: depending on the instance size, it may read useless data from the SSD, which degrades its performance and occupies storage bandwidth it does not actually need. We propose a padding mechanism to solve this problem, reducing read time by up to 70%. We further integrate LIRS and the padding mechanism with TFRecord, a file format commonly used for DNN training, so that TFRecord can use LIRS and the padding mechanism.
Machine learning algorithms, such as Support Vector Machine (SVM) and Deep Neural Network (DNN), are widely used nowadays. When training a machine learning algorithm, randomly shuffling all the training data can improve testing accuracy and speed up convergence. However, if the training dataset is too large to fit into system memory, shuffling the training data is not straightforward. Some machine learning frameworks implement not-in-memory random shuffling under the assumption that a hard disk drive (HDD) is used as storage; because the random access performance of HDDs is slow, they sacrifice the degree of randomness in the shuffle to reduce random storage accesses. Recently, a framework called Lightweight Implementation of Random Shuffling (LIRS) exploits solid-state drive (SSD) based storage, whose random accesses are hundreds of times faster than those of HDDs, to achieve full-range random shuffling without taking up precious CPU memory. However, LIRS has a problem: depending on the instance size, it may induce redundant data reads from the SSD, which hinders its performance and consumes more storage bandwidth than it actually needs. We propose a padding mechanism to solve this problem. Compared to LIRS, our padding method reduces the data loading time by up to 70%. We also incorporated LIRS with the padding mechanism into TFRecord, a popular data format for DNN training. Our implementation enables programmers to achieve full-range random shuffling with the TFRecord data format.
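To illustrate the idea behind padding, the following is a minimal sketch (not the actual LIRS or TFRecord implementation; all names, the fixed `RECORD_SIZE`, and the zero-byte padding scheme are assumptions for illustration). When every instance is padded to a fixed record size, instance `i` lives at byte offset `i * RECORD_SIZE`, so a fully random read order needs exactly one seek-and-read per instance, with no redundant bytes fetched from storage:

```python
import os
import random
import tempfile

# Assumed fixed padded record size in bytes (hypothetical; chosen per dataset).
RECORD_SIZE = 64

def write_padded(path, instances):
    """Pad each serialized instance to RECORD_SIZE and append it to the file."""
    with open(path, "wb") as f:
        for data in instances:
            assert len(data) <= RECORD_SIZE, "instance exceeds padded record size"
            # Zero-pad on the right; assumes instances contain no trailing NUL bytes.
            f.write(data.ljust(RECORD_SIZE, b"\x00"))

def shuffled_reader(path):
    """Yield instances in a full-range random order, one storage read each."""
    n = os.path.getsize(path) // RECORD_SIZE
    order = list(range(n))
    random.shuffle(order)  # full-range permutation over all instance indices
    with open(path, "rb") as f:
        for i in order:
            f.seek(i * RECORD_SIZE)  # direct offset: padding makes this exact
            yield f.read(RECORD_SIZE).rstrip(b"\x00")

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "padded.bin")
    write_padded(path, [b"alpha", b"beta", b"gamma", b"delta"])
    print(sorted(shuffled_reader(path)))
```

The trade-off sketched here is the same one the padding mechanism targets: padding wastes some space per record, but in exchange every random access maps to one precisely sized read, instead of over-reading variable-sized instances.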