
Research on the Reliability, Performance, and Stability of Cache Systems

Research on Cache Reliability, Performance, and Stability

Advisor: 黃婷婷

Abstract


Cache memory has become an indispensable component of modern processor chips. Before adopting a cache, designers must consider the reliability, performance, and stability of the caching mechanism in order to evaluate and choose among different cache configurations. This dissertation investigates three important topics of cache systems: reliability, performance, and stability.

First, we propose using 7T/14T cache cells and develop an effective run-time control mechanism that keeps the whole system in a reliable state. Experimental results show that our method reduces the frequency of error occurrences by a factor of one thousand, and the proposed mechanism incurs no performance or energy loss.

Second, for cache architectures whose accesses have non-uniform latencies, we identify access patterns that cause performance degradation; such patterns appear in many important applications, including recently popular artificial-intelligence workloads. For these patterns we propose an effective pattern-detection mechanism and a corresponding cache control mechanism. Experiments show that the detector not only has a very small hardware cost but also identifies the target access patterns effectively, and the proposed cache control mechanism, applied immediately after detection, avoids the performance degradation and achieves high performance.

Finally, we analyze cache control mechanisms from multiple perspectives in order to select the best one. Prior research evaluated caches mainly by average hit rate, but we note that stability is also an important consideration. We propose a stability metric and combine it with other evaluation criteria for a complete evaluation of cache control mechanisms. Thorough and comprehensive experimental analysis shows that the random replacement policy is the most suitable cache mechanism for general-purpose processors.

Keywords

Cache memory

Parallel Abstract


A cache system is essential for high-performance computing in microprocessors. This dissertation targets three indices of the cache system, i.e., cache reliability, cache performance, and cache stability. The scope covers the major components of microprocessor cache systems, including the cache cell, the cache architecture, and the cache control mechanism.

First, a novel cache-utilization-based dynamic voltage-frequency scaling mechanism for reliability enhancement is proposed: a cache architecture using a 7T/14T static random-access memory (SRAM) [1] together with a control mechanism. Our control mechanism differs from conventional dynamic voltage-frequency scaling (DVFS) methods in that it considers not only the cycles-per-instruction (CPI) behavior but also the cache utilization, for which a novel metric is proposed. The experimental results show that the proposed method incurs one thousand times fewer bit-error occurrences than conventional DVFS methods under ultra-low-voltage operation. Moreover, it not only incurs no performance or energy overhead but also achieves, on average, a 2.10% performance improvement and a 6.66% energy reduction compared to conventional DVFS methods.

Second, a dynamic link-latency aware replacement policy (DLRP) is developed. Multiprocessor system-on-chips (MPSoCs) in modern devices have mostly adopted the non-uniform cache architecture (NUCA) [3], which features varied physical distances from cores to data locations and, as a result, varied access latencies. Prior research focused on minimizing the average access latency of the NUCA, but we find that dynamic latency is also a critical performance index: if it is not considered, access patterns with long dynamic latency cause significant cache performance degradation. We also observe that a set of commonly used neural-network application kernels, including fully-connected and convolutional layers, contains a substantial number of such patterns. This dissertation proposes a hardware-friendly dynamic-latency identification mechanism to detect these patterns and a dynamic link-latency aware replacement policy (DLRP) to improve cache performance on the NUCA. The proposed DLRP outperforms the least recently used (LRU) policy by 53% on average with little hardware overhead, and it achieves 45% and 24% more performance improvement than the not recently used (NRU) policy and the static re-reference interval prediction (SRRIP) policy, respectively, normalized to LRU.
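To make the first contribution above more concrete, the following C++ sketch shows how a controller could combine the CPI with a cache-utilization measurement when choosing an operating point. It is an illustration only: the abstract does not disclose the actual utilization metric or decision rule, so the metric used here (fraction of cache lines touched in a sampling window), the thresholds, and names such as UtilAwareDvfs are assumptions rather than the dissertation's mechanism.

    // Hedged illustration only: NOT the dissertation's algorithm.
    // Assumed utilization metric: fraction of cache lines touched in a window.
    #include <bitset>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    constexpr std::size_t kNumLines = 1024;   // hypothetical cache size in lines

    enum class VfMode { HighVoltage, LowVoltage };

    class UtilAwareDvfs {
    public:
        void recordAccess(std::size_t lineIndex) { touched_.set(lineIndex % kNumLines); }
        void recordWindow(uint64_t instructions, uint64_t cycles) {
            instructions_ += instructions;
            cycles_ += cycles;
        }
        // Called at the end of each sampling window.
        VfMode decide() {
            double cpi  = instructions_ ? double(cycles_) / double(instructions_) : 0.0;
            double util = double(touched_.count()) / double(kNumLines);
            touched_.reset();
            instructions_ = 0;
            cycles_ = 0;
            // Assumed rule: few live lines and a modest CPI suggest the cache can
            // tolerate the low-voltage (7T/14T dependable-mode) operating point.
            return (util < 0.25 && cpi < 1.5) ? VfMode::LowVoltage : VfMode::HighVoltage;
        }
    private:
        std::bitset<kNumLines> touched_;
        uint64_t instructions_ = 0;
        uint64_t cycles_ = 0;
    };

    int main() {
        UtilAwareDvfs controller;
        for (std::size_t i = 0; i < 100; ++i) controller.recordAccess(i);  // sparse use
        controller.recordWindow(100000, 120000);                           // CPI = 1.2
        std::cout << (controller.decide() == VfMode::LowVoltage ? "low-voltage\n"
                                                                : "high-voltage\n");
    }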
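Likewise, for the second contribution, the sketch below shows one way a replacement policy could take NUCA link latency into account when picking a victim. DLRP's actual identification and replacement algorithms are not given in the abstract; the per-way linkLat field, the two-candidate LRU window, and the function pickVictim are hypothetical placeholders meant only to convey the idea of keeping blocks that are expensive to refetch across the interconnect.

    // Hedged illustration only: NOT the dissertation's DLRP algorithm.
    // Assumption: each way remembers the link latency seen at fill time, and among
    // the two least-recently-used ways the policy evicts the cheaper one to refetch.
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>

    struct Way {
        bool     valid;
        uint64_t tag;
        uint8_t  lruRank;   // higher = older (least recently used)
        uint8_t  linkLat;   // cycles from this core to the block's home bank
    };

    template <std::size_t Assoc>
    std::size_t pickVictim(const std::array<Way, Assoc>& set) {
        static_assert(Assoc >= 2, "needs at least two ways");
        for (std::size_t i = 0; i < Assoc; ++i)
            if (!set[i].valid) return i;             // use a free way first
        std::size_t oldest = 0;
        for (std::size_t i = 1; i < Assoc; ++i)
            if (set[i].lruRank > set[oldest].lruRank) oldest = i;
        std::size_t second = (oldest == 0) ? 1 : 0;
        for (std::size_t i = 0; i < Assoc; ++i)
            if (i != oldest && set[i].lruRank > set[second].lruRank) second = i;
        // Evict whichever of the two candidates is cheaper to bring back.
        return (set[second].linkLat < set[oldest].linkLat) ? second : oldest;
    }

    int main() {
        std::array<Way, 4> set{{
            {true, 0xA, 1, 6},
            {true, 0xB, 3, 9},   // strict LRU victim, but far from this core
            {true, 0xC, 0, 5},
            {true, 0xD, 2, 2}    // nearly as old and cheap to refetch
        }};
        std::cout << "victim way: " << pickVictim(set) << "\n";  // prints 3, not 1
    }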
Finally, the stability of cache replacement policies for processor designs is analyzed. A cache system with an effective cache replacement policy is essential for high-performance (micro)processor designs. Over the years, many cache replacement algorithms have been proposed based on heuristic observations. These algorithms are usually very effective when targeted at a specific class of applications, e.g., LRU-friendly applications, streaming applications, etc. When developing a (micro)processor, designers first face the decision of selecting a best-fit cache replacement algorithm for the implementation. If the (micro)processor targets a specific application, an application-specific instruction processor (ASIP) [4] can be developed with an application-specific cache replacement algorithm to achieve the best overall area/power/speed performance. If, on the other hand, it is a general-purpose (micro)processor, designers need a cache replacement algorithm that can handle various applications with different characteristics (mixed workloads).

Traditionally, the average hit rate is the main performance index used to evaluate cache replacement policies, and thus most policies focus on improving the hit rate. However, we observe that when handling mixed-workload applications, the average hit rate does not reflect the performance variance among different types of applications. This dissertation proposes a new performance-variance index that estimates the stability of cache replacement policies. We also find that the random policy achieves very competitive and stable results compared to other policies: it achieves a 0.08% cache-performance variation, whereas LRU and SRRIP exhibit up to 0.16% and 0.54% variations on the SPEC CPU2006 [5] and GAP [6] benchmark suites. We further demonstrate that the random policy achieves the most stable overall performance under mixed workloads, and that its hardware cost and power consumption are significantly lower than those of the previous policies. Consequently, the random policy is a good choice for general-purpose (micro)processor designs targeting mixed workloads of various applications, and it is thus widely adopted by many of today's (micro)processor designs.
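The exact definition of the proposed performance-variance index is not given in the abstract. As an illustration only, the C++ sketch below assumes a simple reading: the standard deviation, in percent, of a per-workload performance figure (for example, the hit rate) across a mixed-workload set. The function varianceIndexPercent and the sample numbers are made up.

    // Hedged illustration only: the dissertation's exact variance index is not
    // given in the abstract. Assumed definition: standard deviation (in percent)
    // of per-workload performance across a mixed-workload set.
    #include <cmath>
    #include <iostream>
    #include <vector>

    double varianceIndexPercent(const std::vector<double>& perWorkloadPerf) {
        double mean = 0.0;
        for (double p : perWorkloadPerf) mean += p;
        mean /= static_cast<double>(perWorkloadPerf.size());
        double variance = 0.0;
        for (double p : perWorkloadPerf) variance += (p - mean) * (p - mean);
        variance /= static_cast<double>(perWorkloadPerf.size());
        return std::sqrt(variance) * 100.0;   // lower = more stable across workloads
    }

    int main() {
        // Made-up per-workload hit rates for two policies on the same workload mix.
        std::vector<double> randomPolicy{0.91, 0.90, 0.92, 0.91};
        std::vector<double> lruPolicy{0.95, 0.88, 0.93, 0.86};
        std::cout << "random policy index: " << varianceIndexPercent(randomPolicy) << "%\n";
        std::cout << "LRU policy index:    " << varianceIndexPercent(lruPolicy) << "%\n";
    }

Under such a reading, a policy with a slightly lower average hit rate can still be preferable for a general-purpose design if its variance index is markedly smaller, which is the argument made above for the random policy.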

Parallel Keywords

Cache

References


[1] Hidehiro Fujiwara, Shunsuke Okumura, Yusuke Iguchi, Hiroki Noguchi, Hiroshi Kawaguchi, and Masahiko Yoshimoto. A dependable SRAM with 7T/14T memory cells. IEICE Transactions on Electronics, pages 423–432, 2009.
[2] A. Jain and C. Lin. Back to the future: Leveraging Belady's algorithm for improved cache replacement. In 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pages 78–89, 2016.
[3] Changkyu Kim, Doug Burger, and Stephen W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-X), San Jose, California, USA, pages 211–222. ACM Press, 2002.
[4] D. Liu. ASIP (application specific instruction-set processors) design. In 2009 IEEE 8th International Conference on ASIC, pages 16–16, 2009.
[5] John L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34(4):1–17, September 2006.
[6] Scott Beamer, Krste Asanović, and David Patterson. The GAP benchmark suite. arXiv preprint arXiv:1508.03619, 2015.
