This thesis studies the implementation of the Blake3 algorithm on an FPGA. Blake3 is the mining algorithm of Alephium (ALPH), a blockchain currency designed to support transactions. Mining exists to record every transaction: multiple transactions form a block, and blocks are linked to one another by a hash function to form a blockchain. To mine, a miner combines the header of the latest block with a self-generated nonce and runs the mining algorithm on the result. If the output is less than a specific target value, the nonce is accepted by all nodes on the network and a new block is formed.
The ALPH mining algorithm applies Blake3 twice, and the combined width of the block header and nonce is fixed at 2608 bits. This thesis implements a hardware architecture for this algorithm and removes the parts of the hardware that the fixed-width case does not need, in order to maximize performance. Mining performance is measured by hashrate, that is, the number of hashes produced per second. The goal of this thesis is to find the architecture with the highest hashrate under limited hardware resources, using the Nvidia RTX 3070 as the baseline for comparison. The final design reaches a hashrate of up to 2.2 GH/s, roughly 1.67 times the 1.32 GH/s of the RTX 3070.
The Blake3 algorithm is implemented on a Xilinx Virtex UltraScale+ VU33P FPGA. The front-end work consists of writing RTL, simulating with Synopsys VCS, debugging with nWave, and checking for syntax errors with nLint. The back end relies mainly on Xilinx Vivado as the EDA tool for synthesis, automatic place and route (APR), and bitstream programming, covering the complete FPGA design and implementation flow.
The Blake3 algorithm used by ALPH is compute-hard and does not use HBM. Unlike designs that depend on HBM, whose performance bottleneck lies in the memory interface, a compute-hard design is limited by the architecture of the algorithm itself. This thesis first analyzes the Blake3 algorithm, then describes the process of designing the complete hardware architecture and the key considerations for improving performance, and finally compares the performance of different architectures and presents the best result.
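
The mining procedure summarized in this abstract (header plus nonce, two hash passes, comparison against a target) can be illustrated with a minimal software sketch. It is not the hardware design of this thesis. Several things in it are assumptions made only for illustration: Python's standard library has no Blake3, so `hashlib.blake2s` is used as a stand-in with the same 32-byte output; the 24-byte nonce size, the counter-based nonce (the text describes a self-generated random number), the big-endian digest comparison, and the names `stand_in_hash`, `double_hash`, and `mine` are all hypothetical. Only the 2608-bit input width and the double-hash structure come from the text.

```python
import hashlib

INPUT_BITS = 2608                        # fixed header + nonce width stated in the text
INPUT_BYTES = INPUT_BITS // 8            # 326 bytes
NONCE_BYTES = 24                         # assumption: illustrative header/nonce split
HEADER_BYTES = INPUT_BYTES - NONCE_BYTES


def stand_in_hash(data: bytes) -> bytes:
    """Return a 32-byte digest. Stand-in only: this is BLAKE2s, NOT Blake3."""
    return hashlib.blake2s(data, digest_size=32).digest()


def double_hash(header: bytes, nonce: bytes) -> bytes:
    """Hash the fixed-width header + nonce message twice, as the text describes."""
    message = header + nonce
    assert len(message) == INPUT_BYTES, "input width is fixed at 2608 bits"
    return stand_in_hash(stand_in_hash(message))


def mine(header: bytes, target: int, max_tries: int):
    """Try successive nonces until the digest, read as an integer, is below target.

    Assumption: a simple counter replaces the random nonce, and the digest is
    interpreted as a big-endian integer. Returns (nonce, digest) or None.
    """
    for counter in range(max_tries):
        nonce = counter.to_bytes(NONCE_BYTES, "big")
        digest = double_hash(header, nonce)
        if int.from_bytes(digest, "big") < target:
            return nonce, digest
    return None


if __name__ == "__main__":
    header = bytes(HEADER_BYTES)         # all-zero placeholder header
    easy_target = 1 << 248               # about 1 in 256 digests qualifies
    found = mine(header, easy_target, max_tries=100_000)
    if found is not None:
        nonce, digest = found
        print("nonce:", nonce.hex())
        print("digest:", digest.hex())
```

Each loop iteration yields one candidate hash, so the hashrate is simply the number of iterations completed per second. A hardware design raises that number by evaluating many nonces in parallel and by pipelining the hash rounds rather than looping in software.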