橢圓曲線密碼系統 (Elliptic curve cryptography,簡稱ECC) 是 Koblitz 以及Miller 分別於1985 年提出的公開金鑰密碼技 (Public key cryptography)。和目前被廣泛使用的RSA 加密演算法做比較,在相同的安全強度下,橢圓曲線密碼系統使用的金鑰長度遠低於RSA 加密演算法,因此,橢圓曲線加密系統更適合應用在手持式裝置,例如悠遊卡或是手機等記憶體資源有限的行動裝置。 我們設計出了一套硬體架構,稱為橢圓曲線加密核心 (ECC core), 可以在二元的伽羅瓦體上快速計算出橢圓曲線數學運算中最重要的純 量乘法。一個純量乘法其實是由多個二元體上的運算組合而成。這個加密核心包括多個算數元件 (Arithmetic unit,簡稱AU)、一個平方器 (Squarer)、一個控制器以及儲存元件。多個算術元件可以平行執行純量乘法中包含的運算。將多個運算平行化處理最大的好處是能夠有效地加速運算的速度。關於使用的演算法,我們利用Montgomery ladder 演算法來計算純量乘法。同時,為了提高算數元件的使用率,我們仔細地針對演算法中的所有運算去做排程。此外,我們在不同的運算元之間使用資料前送(data forwarding)技術,它減少了15%的時脈週期數目;我們也將有餘式化簡 (modular reduction) 獨立在一個時脈週期執行而非和乘法在同一個時脈週期執行,並減少了13%的關鍵路徑延遲時間。在論文中,我們也推導出擁有五個項的既約多項式的位元平行有餘式化簡的算式,能夠快速地將次數超過二元體的多項式減低成另一個次數較小的多項式,這兩個多項式在伽羅瓦體是等效的。 實作的合成結果使用台積電0.13μm 的邏輯閘資料庫。在金鑰長度 為163 位元以及233 位元的伽羅瓦體下,我們的硬體架構能分別在7.7μs 以及11.4μs 以內完成一個橢圓曲線上純量乘法的運算。在同樣的技術下,與其他論文相比較,我們的實作結果在時間方面和面積方面加成來看是目前最好的。
In this paper, we present an Elliptic Curve Cryptographic (ECC) core dedicated to performing point scalar multiplications over GF(2m). The core consists of multiple Arithmetic Units (AUs), a squarer, a controller, and storage devices. Multiple AUs can execute in parallel to perform the crucial operations of ECC, i.e., the point scalar multiplication. The parallelism efficiently speeds up the performance of the scalar multiplication. The proposed high-performance architecture is based on the Montgomery ladder method. In order to raise the utilization of AUs and achieve high speed by reducing cycles, we carefully schedule the field operations for a single iteration of the point scalar multiplication to make AUs as busy as possible. Besides, we exploit the data forwarding technique between arithmetic devices to decrease approximately 15% of the number of cycles. To reduce the critical path delay, the multiplication and the modular reduction are implemented separately in different cycles. We reduce about 13% of the critical path delay in our design. We also induce the bit-parallel modular reduction equations for irreducible pentanomials. The developed bit-parallel modular reduction can operate the modular reduction more efficiently than traditional reduction method. Using 0.13μm CMOS technology, our ECC core can implement a point scalar multiplication with coordinate conversion in 7.7μs at 243.9MHz and in 11.4μs at 242.6MHz over GF(2^163) and GF(2^233), respectively, exhibiting that our result reaches the best performance among all the other published literatures with the same technology. We also explore our architecture with different configurations, i.e. the number of AUs, the cycles needed for a binary field multiplication, and the field size, to find the optimized one under given constraints.