Today, an increasing number of hardware accelerators are designed for deep learning (DL) to improve inference throughput and performance per watt. In DL, convolutions account for the majority of the overall computation; a convolution can be regarded as a special form of matrix multiplication and is implemented in hardware with multiply-accumulate (MAC) operations. In a MAC unit, the critical path and some long paths are seldom activated, and the data distribution of DL applications makes this even more pronounced. Exploiting this property together with systolic arrays, we propose an approximate systolic array processing architecture that combines approximate computing and variable-latency design, with a mechanism similar to randomly dropping connections between neurons, enabling voltage under-scaling of the DL accelerator without affecting accuracy. On an architecture based on Google's TPU, our experiments on handwritten digit recognition and image classification show 47%~51% energy savings with a 1% loss in accuracy.
In recent years, an increasing number of hardware accelerators have been customized for deep learning (DL) computation to improve inference throughput and performance per watt. In DL applications, convolutions account for the majority of the computation. A convolution can be regarded as a special matrix multiplication, which is implemented in hardware with multiplier-accumulators. In a multiplier-accumulator, the critical path and other long paths are seldom activated, and the data distribution of deep learning applications makes these paths even less likely to be activated. Based on this characteristic, we propose an Approximate Systolic Array Processor (ASAP), which combines approximate computing with variable-latency design. Using a technique similar to randomly dropping partial connections within a deep neural network, we apply voltage under-scaling to the proposed DL accelerator to reduce the power consumption of systolic arrays with negligible accuracy loss. In our experiments on handwritten digit recognition and image classification, ASAP obtains 47%~51% power savings over a baseline systolic array based on the architecture of Google's TPU with a 1% loss in classification accuracy.
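As a rough illustrative sketch (not part of the thesis itself), the error model described above can be simulated at a high level: each partial product of a matrix multiplication is zeroed out with a small probability, mimicking a timing violation on a rarely activated long path under voltage under-scaling, which behaves like randomly dropping a connection. The function and parameter names below (approx_matmul, drop_rate) are hypothetical and chosen only for this example.

```python
import numpy as np

def approx_matmul(a, b, drop_rate=0.01, rng=None):
    """Illustrative model of an under-scaled MAC array (assumption, not
    the thesis implementation): a timing error is modeled as dropping
    one partial product, analogous to dropping a single connection."""
    rng = np.random.default_rng() if rng is None else rng
    # All partial products of the matrix multiplication, shape (M, K, N)
    partial = a[:, :, None] * b[None, :, :]
    # Keep each partial product unless it "hits" a rare timing violation
    mask = rng.random(partial.shape) >= drop_rate
    return (partial * mask).sum(axis=1)

# Usage: compare the approximate result against the exact product
a = np.random.randn(4, 8)
b = np.random.randn(8, 3)
print(np.abs(approx_matmul(a, b) - a @ b).max())
```

With a small drop_rate the deviation from the exact product stays small, which is the intuition behind why the accuracy loss reported in the abstract remains negligible.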