
Restricted Boltzmann Machine (RBM) Processor Design for Neural Network and Machine Learning Applications

Advisor: Chen-Yi Lee (李鎮宜)

Abstract


In recent years, machine learning has been widely applied in signal processing systems to provide intelligent processing capabilities, for example AdaBoost, K-NN, mean-shift, and SVM for data classification, and HOG and SIFT for image feature extraction. Over the past decades, neural networks have been regarded as among the best solutions in many applications. Unlike traditional methods, neural networks integrate and cascade feature extraction and data classification within a single architecture, and training on large amounts of data yields a powerful model capable of accurate classification. However, although deeper and more complex network models achieve more accurate recognition performance, the traditional training algorithm based on feedforward computation and error backpropagation cannot efficiently train multi-layer networks. Furthermore, the cost of labeling the large amount of data in a training set, and the question of how to initialize a neural network without relevant domain knowledge, are both major problems in model training.

In this dissertation, a restricted Boltzmann machine (RBM) processor is designed and implemented. The processor integrates 32 of the proposed RBM cores for parallel processing, and supports up to 4096 neurons per layer and up to 128 candidate classes per test sample. In learning mode, it achieves batch-level parallelism across training samples and supports both supervised and unsupervised RBM training. In inference mode, it achieves sample-level parallelism across test samples. In addition, several techniques are proposed to improve computation performance, hardware design complexity, external memory bandwidth, and power consumption.

Two implementations of the proposed RBM processor are presented. The first, on a Xilinx Virtex-7 FPGA operating at 125 MHz, occupies 114.0k lookup tables (LUTs), 107.1k flip-flops, and 80 block RAMs. The second, fabricated in a UMC 65 nm process, contains 2.2M logic gates and 128 kB of on-chip SRAM within an area of 8.8 mm², with the 32 RBM cores organized into 2 computing clusters. At a 1.2 V supply voltage, the chip operates at up to 210 MHz for both model learning and data classification.

According to the measurement results, the FPGA-based prototype platform achieves 4.60G neuron weights per second (NWPS) in learning and 3.87G NWPS in inference. Operating at 210 MHz, the RBM processor chip achieves 4.61G NWPS at 69.50 pJ/NW in learning mode and 3.86G NWPS at 81.20 pJ/NW in inference mode. Compared with general-purpose CPUs and multi-core processors, the proposed RBM processor performs RBM model training and data classification with faster processing and higher energy efficiency. It therefore offers a high-performance, energy-efficient solution for energy-constrained devices such as IoT and handheld devices, equipping them with intelligent processing capabilities for in-time model training and real-time data classification.
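The supervised and unsupervised RBM training that the processor accelerates is, at its core, a contrastive-divergence update applied in parallel across a batch of samples. As an illustrative software sketch only (not the processor's hardware datapath; the CD-1 variant, variable names, and toy sizes here are assumptions), one batch-level update for a Bernoulli RBM might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b, c, v0, lr=0.1):
    """One contrastive-divergence (CD-1) update for a Bernoulli RBM.
    W: (n_visible, n_hidden) weights; b, c: visible/hidden biases.
    v0: (batch, n_visible) training batch, so the batch dimension
    plays the role of the processor's batch-level parallelism."""
    # Positive phase: hidden activations/samples given the data.
    ph0 = sigmoid(v0 @ W + c)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Negative phase: one Gibbs step back to visible, then to hidden.
    pv1 = sigmoid(h0 @ W.T + b)
    ph1 = sigmoid(pv1 @ W + c)
    # Gradient estimates averaged over the batch.
    batch = v0.shape[0]
    W += lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
    b += lr * (v0 - pv1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c

# Toy configuration: 8 visible units, 4 hidden units, batch of 16 samples.
n_v, n_h = 8, 4
W = 0.01 * rng.standard_normal((n_v, n_h))
b = np.zeros(n_v)
c = np.zeros(n_h)
v = (rng.random((16, n_v)) < 0.5).astype(float)
W, b, c = cd1_step(W, b, c, v)
print(W.shape, b.shape, c.shape)
```

Every sample in the batch shares the same weight matrix, which is what makes the batch-level (learning) and sample-level (inference) parallelism described above a natural fit for an array of identical RBM cores.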

Parallel Abstract (English)


Recently, machine learning techniques have been widely applied in signal processing systems to support intelligent capabilities: AdaBoost, K-NN, mean-shift, and SVM for data classification, and HOG and SIFT for feature extraction in multimedia applications. Over the past decades, neural network (NN) algorithms have been regarded as among the state-of-the-art solutions in many applications, since they integrate and cascade both feature extraction and data classification within one network. In the big-data era, huge datasets allow neural network learning algorithms to train powerful and accurate models for machine learning applications. However, as network structures become deeper to achieve higher accuracy, the traditional learning algorithm based on feedforward computation and error backpropagation becomes inefficient for training multi-layer neural networks. Moreover, data labeling is very expensive, especially for big datasets, and how to initialize a neural network without any domain knowledge is also a crucial issue for model training.

In this dissertation, a restricted Boltzmann machine (RBM) processor is designed and implemented. The proposed processor integrates 32 proposed RBM cores for parallel computing and supports network structures of up to 4k neurons per layer and up to 128 candidate classes per sample for inference. In learning mode, batch-level parallelism is achieved for RBM model training with both supervised and unsupervised learning; in inference mode, sample-level parallelism is achieved for data classification. Moreover, several features are proposed and implemented to reduce computation time, hardware cost, external memory bandwidth, and power consumption.

Two implementations of the proposed RBM processor are presented. Implemented on a Xilinx Virtex-7 FPGA, the processor operates at 125 MHz and occupies 114.0k LUTs, 107.1k flip-flops, and 80 block RAMs. Implemented in UMC 65 nm LL RVT CMOS technology, the RBM processor chip costs 2.2M gates and 128 kB of internal SRAM within an 8.8 mm² area, integrating the 32 RBM cores into 2 clusters; at a 1.2 V supply voltage, the chip reaches a maximum operating frequency of 210 MHz in both learning and inference modes.

According to the measurement results, the FPGA-based system prototype platform achieves 4.60G neuron weights per second (NWPS) for RBM model training and 3.87G NWPS for data classification. Operating at 210 MHz, the RBM processor chip achieves 4.61G NWPS at 69.50 pJ/NW in learning mode and 3.86G NWPS at 81.20 pJ/NW in inference mode. Compared with software solutions on CPUs and powerful multi-core processors, the proposed RBM processor delivers faster processing and higher energy efficiency in both RBM model learning and data inference. Since battery life is a crucial issue in IoT and handheld devices, the proposed RBM processor chip offers an energy-efficient way to equip such emerging energy-constrained devices with intelligent learning and inference capabilities for in-time model training and real-time decision making.
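As a quick sanity check on the reported figures (a derived estimate, not a number stated in the measurements), the chip power implied in each mode follows from multiplying throughput by energy per neuron-weight update:

```python
# Implied power [W] = throughput [neuron weights/s] * energy per neuron weight [J/NW].
learn_power_w = 4.61e9 * 69.50e-12   # learning mode: 4.61G NWPS at 69.50 pJ/NW
infer_power_w = 3.86e9 * 81.20e-12   # inference mode: 3.86G NWPS at 81.20 pJ/NW
print(f"learning ~{learn_power_w:.3f} W, inference ~{infer_power_w:.3f} W")
# prints "learning ~0.320 W, inference ~0.313 W"
```

Both modes thus work out to roughly 0.32 W at 210 MHz, which is consistent with the dissertation's framing of the chip as a fit for energy-constrained IoT and handheld devices.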
