利用適應性的多個快取置換策略來提高一個基址暫存器的快取的命中率

載入-使用延遲是影響微處理器效能的主要因素中的其中一個。因為載入指令通常有很長的執行延遲且在所有指令中也佔有很大的比例。提早執行載入指令是一個縮短執行延遲的方法。為了提早執行載入指令，最初的議題是如何提早決定載入指令的有效位置. 為了解決這個議題，很多機制利用一個特別的儲存空間來保留最近被有效位置當作基底部份或索引部份使用的暫存器的值來做有效位置的計算。在這計算上的機制可為推測式或非推測式的。為了有在這特別的儲存空間有低的硬體需求和非推測式的機制不需要給有效位置的計算錯誤回復的方法的好處，我們提出了一個新的機制來提早執行載入指令，此機制是結合多個由之前縮短執行延遲的機制所成。這個新的機制使用一個小的快取來保留最近被有效位置當作基底部份使用的暫存器的值；此快取叫作Base Register Cache (BRC)。在原先的機制中，BRC是利用LRU (Least Recently Used) 置換策略。在大部份的時間，LRU表現得很好，但是有時候當大部份references的recency大於快取的大小時LRU會很沒效率的使用快取空間(這些references叫作far reuses)；一個方法是保留一些暫存器在快取中夠久讓它們可以貢獻快取命中，LFU (Least Frequently Used) 置換策略已被證明在此reference的特性的表現最佳。此篇論文著重在如何讓LRU策略和LFU策略合作以提高BRC的命中率。我們找到一個有效率的方法，用在記憶體快取上，叫作Combined the LRU and LFU Policies (CRFP)，此方法是適應性的選擇LRU和LFU策略，我們嘗試將此方法應用在BRC上。但，我們觀察到在某些benchmarks，CRFP仍然是與LRU有一樣的命中率即使LFU有較高的命中率。所以我們提出了一個分析的方法 (Mis-matched selection analysis) 來找出CRFP的問題並針對這些問題提出解決方法。我們所提的機制的最後一個演進 (SCRFP-SC-TagAA)在baseline 4-entry fully-associate BRC有最高的平均命中率的改善: 對LRU有1.82%；對CRFP有1.1%。對各別的benchmarks，相較於其他我們所提的機制的演進，SCRFP-SC-TagAA也有最高的命中率的改善。特別是在LRU命中率較低的benchmark group，SCRFP-SC-TagAA改善有較高命中率的策略的命中率(LRU或是LFU) 最高可達11.48%，對這benchmarks group的平均命中率可提高5.45% 我們也有估計我們所提的機制的效率，該效率利用平均命中率和硬體的開支的比例來表示。我們所提的機制的第三個演進 (SCRFP)有比CRFP多5.45%的價效比。

關鍵字

快取置換策略；適應性；命中率；基底暫存器快取

並列摘要

Load-to-use latency is one of the key factors to influence microprocessor performance. Because load instruction usually has long execution latency and take large portion of total instructions at run time. Early execute load instructions is a way to reduce the load-to-use latency. To early execute load instruction, the beginning issue is to how determine the effective address of the load instruction early. To resolve this issue, several mechanisms use a special storage to keep the values of the registers which are used recently as base or index components of effective address for the effective address calculation. The mechanisms on the calculation can be speculative or non-speculative. To have the benefits of low hardware requirement for the special storage and the non-speculative mechanisms do not need recovering method for effective address calculation, we proposed a new mechanism for early executing load instructions composed of the previous mechanisms on reducing the load-to-use latency The new mechanism uses a small cache to keep the values of the registers used recently as base component of effective address; the cache is called Base Register Cache (BRC). In original mechanism, BRC is managed by the LRU (Least Recently Used) replacement policies. In most of time, LRU performs well, but sometimes LRU will inefficiently use cache space if recencies of most references are greater than the cache size (the references are called far reuses); A solution is to retain some registers long enough in the cache to contribute cache hits, the LFU (Least Frequently Used) replacement policy was proved that performing optimally in this reference characteristic. This thesis focuses on how to make the cooperation between the LRU and the LFU policies for increasing the hit rate of BRC. We found an efficient technique used on memory cache, called Combined the LRU and LFU Policies (CRFP) which is to adaptively select LRU and LFU policies, and tried to apply it on the BRC. But we observed that CRFP still has the same hit rate as LRU even LFU has a higher hit rate on some benchmarks. So we proposed an analysis method (Mis-matched selection analysis) to find the reasons why the CRFP is not well adapted to the BRC, and also proposed methods to make CRFP be well adapted to BRC. The final evolution of our proposed mechanisms (SCRFP-SC-TagAA) has the highest average hit rate improvement on the baseline 4-entry fully-associate BRC: 1.82% to LRU and 1.1% to CRFP. On individual benchmarks, SCRFP-SC-TagAA also the highest hit rate improvement comparing to each evolution of our proposed mechanisms. Especially in the benchmark group which LRU has low hit rates, SCRFP-SC-TagAA improves the hit rate of the policy which has higher hit rate (LRU or LFU) up to 11.48% and 5.45% on the average hit rate of this benchmark group. We also estimate the effectiveness of our proposed mechanisms in terms of the ratio of the average hit rate and the hardware overhead (cost-performance ratio). The third evolution of our proposed mechanisms (SCRFP) has 5.45% more than CRFP on the cost-performance ratio.

並列關鍵字

cache replacement policy ； adaptive ； hit rate ； base register cache

參考文獻

[3] T. M. Austin and G. S. Sohi, "Zero-cycle loads: microarchitecture support for reducing load latency," in Microarchitecture, 1995., Proceedings of the 28th Annual International Symposium on, 1995, pp. 82-92.

[4] Y.-S. Hwang and J.-J. Li, "On reducing load/store latencies of cache accesses," Journal of Systems Architecture, vol. 56, pp. 1-15, 2010.

[7] Y. Smaragdakis, S. Kaplan, and P. Wilson, "The EELRU adaptive replacement algorithm," Performance Evaluation, vol. 53, pp. 93-123, 2003.

[8] A. Jaleel, K. B. Theobald, J. Simon C. Steely, and J. Emer, "High performance cache replacement using re-reference interval prediction (RRIP)," SIGARCH Comput. Archit. News, vol. 38, pp. 60-71, 2010.

[13] D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, et al., "LRFU: A Spectrum of Policies that Subsumes the Least Recently Used and Least Frequently Used Policies," IEEE Trans. Comput., vol. 50, pp. 1352-1361, 2001.

國際替代計量

利用適應性的多個快取置換策略來提高一個基址暫存器的快取的命中率

全文下載

主題瀏覽