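With advances in System-on-Chip (SoC) technology, multi-core processors are becoming increasingly important. Efficient and reliable data transfer is a key issue today, especially in the design of multi-core processors. This thesis uses register-transfer level (RTL) design to implement a mesh-based multi-core architecture containing 16 processing elements (PEs). Our system provides reliable transmission between PEs, the ability to execute parallel programs on the platform, methods for evaluating system performance bottlenecks, and cross-verification with the Electronic System Level (ESL), and it collects hardware parameters such as power, energy consumption, and operating frequency as feedback to the ESL. Each PE contains a processor and a Transmission Unit (TU). The TU consists of our proposed PE-to-PE Core and a modified DMA. To keep the system stable, data transfers between PEs are handled by the PE-to-PE Core rather than by letting a PE directly access the internal memory of other PEs. The DMA moves data between a PE and external memory. The TU is driven by software, so we designed the Low-Level Communication library (LLC library) to drive the PE-to-PE Core. The LLC library provides control of the TU and supports a software-side protocol that prevents out-of-order data. Based on the LLC software protocol, message-passing libraries such as the iLib library can be implemented on our platform to evaluate system performance when running parallel programs. Evaluating complex, realistic parallel programs such as JPEG and Odd-Even Sort typically takes millions of clock cycles. Program characteristics such as total cycle count, memory behavior, and communication cycle count can also be collected. From these test data we identified performance bottlenecks of the multi-core system, which can be fed back to the ESL, for example a SystemC platform. Exploration of the multi-core architecture can start on the SystemC platform, and the architectural changes obtained there can then be applied to the RTL platform. The RTL platform yields accurate clock-cycle information. Synthesized in TSMC 0.13μm CMOS technology, the proposed multi-core platform reaches an operating frequency of 100MHz. The proposed PE-to-PE Core occupies only 3.28% of the PE area (19.2k logic gates). In addition, the aggregate throughput of the platform reaches 952.64 Mbps. Our future work includes improving the performance of the LLC library, completing an FPGA prototype, improving the platform architecture to cluster-based processing elements, and collecting hardware parameters such as power, energy consumption, and operating frequency as feedback to the ESL. We can try using the DMA to move data between internal memory and the PE-to-PE Core, and redesign the LLC library in assembly language. For cluster-based processing elements, the most challenging part is solving cache coherence among them.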
With advances in System-on-Chip (SoC) technology, multi-core processors are becoming more and more important. Efficient and robust data transfer is one of the most critical and complex issues to consider, especially when designing multi-core systems. This thesis presents an RTL implementation of a mesh-based multi-core architecture containing 16 processing elements (PEs). Our platform provides 1) robust data transmission between PEs, 2) the ability to execute realistic parallel programs, 3) approaches to profiling system bottlenecks, 4) cross-verification with ESL (Electronic System Level) design, and 5) physical characteristics such as power, energy and frequency as feedback to ESL design. Each PE has a single processor and a Transmission Unit (TU). The TU is composed of a proposed PE-to-PE Core and a modified DMA core. For the sake of system stability, data transmission between PEs is handled by the proposed PE-to-PE Core rather than by direct access to remote memory. The DMA core moves data between the local memory in a PE and the external memory controller, which has an OCP interface. In addition, the Transmission Unit is software driven, so a Low-Level Communication (LLC) library is designed and proposed. The LLC library provides control of the Transmission Unit. Furthermore, the LLC library supports a software protocol that spares software developers from unexpected sequence errors. Based on the LLC software protocol, message passing libraries such as the iLib library can be implemented, so that system performance can be evaluated by porting realistic parallel programs. Evaluating system performance usually takes millions of cycles. Complicated and realistic parallel programs such as Odd-Even Sort and JPEG encoding are ported to this platform, and application features such as total cycle count, memory behavior and communication cycle count can be collected. 
Through these test cases, we can identify bottlenecks in multi-core platforms, and this feedback benefits platforms that work at a high abstraction level, the so-called Electronic System Level (ESL), such as SystemC platforms. Architecture exploration can proceed on the SystemC platform first, and the corresponding adjustments are then applied to the RTL platform. Taking advantage of the RTL implementation, our multi-core platform provides the exact cycle count of the system. We adopted TSMC 0.13μm CMOS technology to synthesize the proposed multi-core platform with an operating frequency of 100MHz. The area overhead of the proposed PE-to-PE Core is only 3.28% (19.2k gates) of a PE. Furthermore, the aggregate throughput of this platform is 952.64 Mbps. Our future work includes 1) optimizing the throughput and latency of the LLC library, 2) finishing the FPGA prototype, 3) improving the platform architecture to a cluster-based processing element and 4) extracting more characteristics such as power, energy and memory behavior for ESL design. We can try to use the DMA to move data between the local memory and the PE-to-PE Core. Re-coding the LLC library in assembly language may also be useful. Cache coherence is the most challenging issue that must be solved in a cluster-based processing element.
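Odd-Even Sort, one of the parallel programs ported to the platform, gives a feel for why communication cycle count is worth profiling. The following is a minimal sequential sketch in Python of the textbook odd-even transposition sort, not the thesis's implementation. It assumes one value per PE and a linear chain of 16 PEs, which is only one possible mapping onto the mesh. The function name `odd_even_transposition_sort` and the exchange counter are hypothetical and do not correspond to any LLC or iLib call.

```python
from typing import List, Tuple


def odd_even_transposition_sort(values: List[int]) -> Tuple[List[int], int]:
    """Sort `values` as if each element lived on its own PE in a linear chain.

    Returns the sorted list and the number of neighbour compare-exchange
    steps performed, used here as a rough proxy for communication volume.
    (Hypothetical helper for illustration; not part of any real library.)
    """
    data = list(values)            # one value per simulated PE
    n = len(data)
    exchanges = 0
    for phase in range(n):         # n phases are sufficient to sort n elements
        start = phase % 2          # even phase: pairs (0,1),(2,3),...; odd phase: (1,2),(3,4),...
        for left in range(start, n - 1, 2):
            right = left + 1
            exchanges += 1         # each neighbouring pair communicates once per phase
            if data[left] > data[right]:
                data[left], data[right] = data[right], data[left]
    return data, exchanges


if __name__ == "__main__":
    # 16 values, one per simulated PE (assumed mapping)
    sample = [9, 3, 14, 0, 7, 12, 5, 1, 15, 8, 2, 11, 6, 13, 4, 10]
    result, count = odd_even_transposition_sort(sample)
    print(result)
    print("neighbour exchanges:", count)
```

In this sketch the number of neighbour exchanges depends only on the number of PEs, not on the input data, so communication cost is fixed even for an already sorted input. That is the kind of behaviour a cycle-accurate platform can expose when communication cycles are counted separately from total cycles.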