
應用於多核心平台之可堆疊記憶體存取效率改進與分析

Efficiency Improvement and Analysis of Accessing Stacked Memories on Many-Core Platforms

Advisor: 黃稚存

Abstract


Owing to its simple structure and relatively low cost, DRAM is normally used as the main memory in computer architectures. Historically, however, DRAM performance has improved much more slowly than processor clock speed, a gap that W. Wulf and S. McKee named the "memory wall" as early as 1994. Meanwhile, following Moore's law, the number of cores on a single chip keeps growing, taking systems from a single core to today's many-core designs. Compared with a single core, a many-core system replaces higher clock rates with core-level parallelism, yet its demand for memory throughput only increases. Many researchers have therefore worked on improving memory access efficiency, for example by improving the scheduling efficiency of the memory controller, widening the bus, or raising the memory access rate. Recently, stacked-memory architectures have partly relieved this throughput demand, but in many-core systems built on a network-on-chip the distance from a core to a memory controller grows as the on-chip network grows. In this thesis we therefore add an extra many-to-many switch network to handle core-to-controller accesses. This not only reduces the NoC congestion caused by heavy memory traffic but also lets cores reach the memory controllers faster. Experiments with the SPLASH-2 benchmarks show that this architecture improves core-to-memory access efficiency by a factor of 1.13 to 2.57 and is applicable to today's stacked-memory architectures.
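The core idea above, namely splitting DRAM traffic away from the mesh NoC and tying each group of cores to a fixed DRAM channel through the extra many-to-many switch, can be pictured with a small sketch. The Python snippet below is only an illustration under assumed parameters (64 cores, 4 Wide I/O-style channels, simple block grouping); the names channel_of and route are hypothetical and not taken from the thesis.

# Minimal sketch (assumed parameters: 64 cores, 4 DRAM channels, block grouping).
# Each group of cores is wired to one stacked-DRAM channel through the extra
# many-to-many switch, so DRAM requests bypass the mesh NoC entirely.

NUM_CORES = 64        # hypothetical many-core platform size
NUM_CHANNELS = 4      # hypothetical number of stacked-DRAM channels
GROUP_SIZE = NUM_CORES // NUM_CHANNELS

def channel_of(core_id):
    # Map a core to its assigned DRAM channel (simple block grouping).
    return core_id // GROUP_SIZE

def route(core_id, is_dram_request):
    # DRAM traffic takes the dedicated switch; everything else (coherence,
    # inter-core messages) stays on the mesh NoC.
    if is_dram_request:
        return "switch -> DRAM channel %d" % channel_of(core_id)
    return "mesh NoC"

for core in (0, 17, 42, 63):
    print(core, route(core, True), "|", route(core, False))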

Parallel Abstract (English)


Because of its structural simplicity, high density per unit area, and low cost, DRAM is well suited to the role of main memory in a computer architecture. Historically, however, processor speed has improved much faster than DRAM speed ever since DRAM became widespread, a phenomenon that W. Wulf and S. McKee called the "memory wall." Over the past few decades the number of on-chip cores has grown from one to many, and the emerging NoC-based (usually mesh) many-core architectures no longer blindly raise single-processor performance; instead they exploit parallelism to meet throughput requirements with better cost-effectiveness. Unfortunately, the demand for memory bandwidth and throughput keeps increasing. Many engineers have therefore tried to improve the efficiency between the memory controller and the DRAM devices by proposing better memory scheduling policies, increasing bandwidth, improving access speed, and so on. Recently, the emergence of 3D-stacked DRAM (Wide I/O) has slightly narrowed the speed gap between the processor and the memory system. However, in architectures that use the Network-on-Chip as a bridge between processors and memory controllers, some DRAM requests may travel a very long distance before reaching a memory controller. Motivated by this observation, this thesis presents an architecture that improves the efficiency of accessing stacked memories on many-core platforms. It uses an extra switch network to transport packets from the processors to the DRAM subsystem and assigns small groups of processors to specific DRAM channels. This method alleviates the traffic contention between DRAM requests and inter-processor communication. As a baseline we use the traditional approach in which all DRAM requests are routed over the NoC. Experimental results with the SPLASH-2 applications demonstrate speedups ranging from 1.13 to 2.57 times, using a cost-affordable crossbar switch network that also fits the Wide I/O DRAM interface.
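To make the distance argument concrete, here is a rough, illustrative calculation (not from the thesis) that assumes an 8x8 mesh with four memory controllers placed at the middle of each edge and XY dimension-order routing; it contrasts the average hop count of a DRAM request routed over the NoC with the single traversal of a dedicated crossbar switch.

# Rough hop-count comparison (assumed: 8x8 mesh, four edge memory
# controllers, XY dimension-order routing). Illustrative only.

MESH = 8
# Hypothetical controller placement: the middle of each mesh edge.
CONTROLLERS = [(0, MESH // 2), (MESH - 1, MESH // 2),
               (MESH // 2, 0), (MESH // 2, MESH - 1)]

def noc_hops(x, y):
    # Hops from core (x, y) to its nearest edge controller over the mesh.
    return min(abs(x - cx) + abs(y - cy) for cx, cy in CONTROLLERS)

avg_noc = sum(noc_hops(x, y) for x in range(MESH) for y in range(MESH)) / float(MESH * MESH)
crossbar_hops = 1  # one traversal of the dedicated many-to-many switch

print("average NoC hops to a controller:", round(avg_noc, 2))
print("hops through the extra crossbar :", crossbar_hops)

As the mesh grows, the average NoC distance grows with the mesh dimension while the crossbar path stays constant; the contention relief comes on top of that, since DRAM packets no longer compete with inter-processor traffic for NoC links.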

Parallel Keywords

Many-Core CMP; Stacked Memories; Wide I/O DRAM
