於具備多佇列網路卡的多核心平台上對高效能封包處理之研究

High Performance Packet Processing on Multi-queue and Multi-core Platforms

Advisor: 黃能富 (Nen-Fu Huang)

Abstract


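With advances in semiconductor technology, multi-core processors have become ubiquitous in modern life, appearing in everything from lightweight handheld devices such as mobile phones to mainframes. Meanwhile, multi-queue network interface cards (NICs) emerged to overcome the performance bottleneck caused by multiple processors contending for a single transmit/receive queue on the NIC. Under the traditional hardware-interrupt-driven packet processing model, once the NIC has transferred a packet from a NIC queue to system memory via direct memory access (DMA), it raises an interrupt to notify a processor to carry out further processing. To make the most of the computing power of multi-core platforms and multi-queue NICs, new interrupt architectures such as PCI-MSI(x) were proposed; they greatly improve the efficiency of interrupt delivery and allow each NIC queue to have its own interrupt vector. By binding different interrupt vectors to individual processors, the highest system utilization and overall performance can be achieved.

Although multi-core processors give software designers greater computing power, designing efficient packet processing programs on multi-core platforms poses many challenges not seen on single-core systems. Foremost among them is how to synchronize data accessed concurrently by multiple processors, along with the problems that follow, such as performance degradation and system deadlock caused by incorrect synchronization. This dissertation first introduces multi-core systems, with particular emphasis on the characteristics of the non-uniform memory access (NUMA) architecture. It then describes software synchronization techniques, from classic locks, semaphores, and lockless operations to hardware-assisted mechanisms such as transactional memory and lock elision, in order to give the reader the necessary background knowledge and terminology.

Session tracking is the first research topic of this dissertation. Its purpose is to associate each packet with the session it belongs to, so that applications requiring session information, such as cross-packet deep inspection and address translation, can be carried out. The difficulty lies in looking up the session tracking table at high speed in order to update an existing session or create a new session record. This dissertation improves on the traditional approach of a single shared tracking table by partitioning the table into smaller tables, which reduces the lock/unlock overhead incurred when many processors access a single table concurrently. Performance measurements match our expectation: the fewer processors contend for the same table (lock), the higher the overall performance. In addition, this work proposes a dynamic resource allocation algorithm to prevent session tracking capacity from degrading under unbalanced load.

Interrupt affinity plays a critical role in packet processing performance on multi-core platforms. Under a non-optimal binding, interrupt handling load is distributed unevenly across processors and performance drops sharply. However, designing a universally optimal interrupt affinitizer has been proven to be NP-hard, so a viable research direction is to find a near-optimal binding efficiently and systematically. This dissertation first proposes a systematic affinitization algorithm that combines information on system software and hardware with the NIC's feature configuration; experimental results show that its performance is close to that of the optimal binding across different network applications. To extend this algorithm to multi-queue network platforms and to take into account their new interface for providing interrupt affinity hints, we propose qcAffin, an affinitizer that binds multiple queues to multiple processor cores. Thanks to its optimizations for multi-queue NICs, qcAffin substantially outperforms the interrupt affinitization built into the Linux kernel on systems with 1G and 10G multi-queue NICs, and it can perform dynamic interrupt affinitization according to system load.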
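The table partitioning idea described above can be illustrated with a minimal Python sketch. It is not the dissertation's implementation: the class name `PartitionedSessionTable`, the `FlowKey` 5-tuple, the hash-based partition choice, and the per-session counters are all assumptions made for illustration. Each partition has its own lock, so two packets contend only when their flows hash to the same partition.

```python
import threading
from collections import namedtuple

# Hypothetical 5-tuple identifying a session (an assumption for this sketch).
FlowKey = namedtuple("FlowKey", "src_ip dst_ip src_port dst_port proto")


class PartitionedSessionTable:
    """Session table split into independently locked partitions."""

    def __init__(self, num_partitions=4):
        self._tables = [dict() for _ in range(num_partitions)]
        self._locks = [threading.Lock() for _ in range(num_partitions)]

    def _index(self, key):
        # Assumed policy: pick the partition by hashing the flow key.
        return hash(key) % len(self._tables)

    def track(self, key, nbytes):
        """Update the session for `key`, creating it if it does not exist."""
        i = self._index(key)
        with self._locks[i]:  # only this partition is locked
            session = self._tables[i].setdefault(key, {"packets": 0, "bytes": 0})
            session["packets"] += 1
            session["bytes"] += nbytes
            return dict(session)  # return a copy taken under the lock

    def sizes(self):
        """Number of sessions per partition, useful for spotting imbalance."""
        result = []
        for table, lock in zip(self._tables, self._locks):
            with lock:
                result.append(len(table))
        return result
```

Calling `track(FlowKey("10.0.0.1", "10.0.0.2", 1234, 80, 6), 100)` creates the session on first use and updates it afterwards. Python's global interpreter lock means this sketch shows only the structure, not the performance benefit; the per-partition sizes returned by `sizes()` are the kind of signal a load-balancing policy could consume.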

Parallel Abstract


Advances in semiconductor technology are making way for multi-core and many-core processors that incorporate tens to hundreds of cores in a single package. Meanwhile, network interface cards (NICs) featuring multiple hardware reception (Rx) and transmission (Tx) queues are the networking community's response to the prevalence of multi-core computing. Multi-queue networking avoids the performance degradation caused by multiple cores contending for a single Rx/Tx queue by distributing packets across multiple queues. At the same time, thanks to evolving interrupt handling techniques, a NIC can now be allocated enough interrupt resources for each of its queues to be associated with a dedicated core. Although the number of cores in CPUs continues to climb, many difficulties remain in building systems capable of keeping up with the packet volume of a modern medium- to large-scale network deployment. This is due to several factors, including the ever-increasing rate of network traffic (e.g., the now prevalent 10 Gbps, the cutting-edge 40 Gbps, and the upcoming 100 Gbps NICs) and some fundamental limitations in both software and hardware architectures. Software-imposed synchronization overheads in multi-core programming, such as atomic operations and locking, play a critical role in packet processing performance. On the hardware side, architectural complications such as cache coherency and NUMA effects bring new challenges that require developers to acquire new skills to unleash the real computing power. Correspondingly, researchers tackle these challenges with a hardware and software co-design approach that starts by investigating the underlying hardware, gathering the knowledge needed to facilitate software development and enable optimization.
In this dissertation, we focus on two problems: 1) reducing lock contention when performing session tracking and 2) affinitizing interrupts from multi-queue NICs to CPU cores with the objective of maximizing packet processing performance. For the first problem, we propose a simple partitioning scheme that aims to strike a balance between excessive locking and lockless manipulation. A resource balancing mechanism is also given to prevent underutilization of session tracking resources under unbalanced traffic loads. Its effectiveness is demonstrated by the performance improvement observed as the number of cores contending for a single lock decreases. To address the second problem, interrupt affinitization, we propose an algorithmic approach based on a numerical cost model to find the best affinitization. Comprehensive experiments covering 1G and 10G NICs with four networking applications ranging from L2 to L7 are conducted to demonstrate its effectiveness.
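To make the queue-to-core affinitization problem concrete, the sketch below assigns queues to cores with a plain greedy heuristic (heaviest queue first, always onto the least loaded core). This is a swapped-in textbook heuristic, not qcAffin or the dissertation's cost model; the function names and the per-queue load estimates are hypothetical. The helper `cpu_mask` formats a single-core hexadecimal bitmask of the kind Linux accepts in `/proc/irq/<irq>/smp_affinity` for low-numbered cores.

```python
import heapq


def greedy_affinitize(queue_loads, num_cores):
    """Map each queue to a core, heaviest queue first (greedy heuristic).

    queue_loads: dict of queue name -> estimated load (hypothetical units).
    Returns a dict of queue name -> core index.
    """
    # Min-heap of (accumulated load, core index).
    heap = [(0.0, core) for core in range(num_cores)]
    heapq.heapify(heap)
    assignment = {}
    # Sort by descending load; break ties by name for determinism.
    for queue, load in sorted(queue_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        core_load, core = heapq.heappop(heap)  # least loaded core so far
        assignment[queue] = core
        heapq.heappush(heap, (core_load + load, core))
    return assignment


def cpu_mask(core):
    """Hex bitmask with only bit `core` set (intended for cores 0 to 31)."""
    return format(1 << core, "x")
```

For example, `greedy_affinitize({"rx0": 5, "rx1": 3, "rx2": 3, "rx3": 1}, 2)` places `rx0` and `rx3` on core 0 and `rx1` and `rx2` on core 1, giving each core a total load of 6. A real affinitizer would also have to weigh factors this sketch ignores, such as NUMA locality and cache sharing between cores.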

