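Image denoising is an important technique in the field of image processing; its goal is to remove noise from an image to obtain better image quality. Among denoising algorithms, BM3D is regarded as the one that currently produces the best denoising quality. However, the high computational complexity of BM3D makes it difficult to use in applications that require real-time processing. This work therefore proposes a computation-reduced BM3D algorithm to increase its processing speed. Because the simplified computation may lower the denoising quality, this work also proposes a guidance-image method that improves the denoising quality by introducing a guidance image in the basic-estimation stage of BM3D. In addition, this work proposes a hardware accelerator based on the computation-reduced algorithm, using hardware computation to further speed up the entire denoising algorithm. With the parallel denoising units and pipelined block search in our hardware design, the execution time of the algorithm improves significantly. We implemented the proposed hardware accelerator on an Intel Stratix V FPGA development board and used PCIe for data transfer between the host and the FPGA. Compared with a software implementation of the original BM3D algorithm, the proposed accelerator achieves a 61$\times$ speedup. It is also faster than an accelerator from related work that is implemented in OpenCL and runs on an FPGA. At the same time, the denoising quality of the proposed accelerator is close to that of the original BM3D algorithm in both PSNR and SSIM.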
Image denoising is an important technique in the field of image processing. It aims to remove noise from images to obtain better image quality. The block-matching and 3-D filtering (BM3D) algorithm is considered a state-of-the-art image denoising algorithm. Although BM3D delivers high denoising quality, its high computational complexity makes it hard to employ in real-time applications. Therefore, we propose a computation-reduced BM3D algorithm to increase the processing speed. Since the reduction in computation might degrade the quality, we also introduce a guidance-image method in the basic-estimation stage of BM3D to enhance the denoising quality. Furthermore, we present a hardware design for the proposed algorithm that further speeds up the denoising process by taking advantage of hardware computing. The processing time is significantly improved by the parallel denoising units and the block-matching pipelines in our hardware design. We have implemented our accelerator design on an Intel Stratix V FPGA development board, where data transmission between the host CPU and the FPGA is carried over PCIe. Our accelerator achieves a 61$\times$ speedup over the original BM3D software. It also processes images faster than a prior FPGA accelerator built with OpenCL high-level synthesis. Meanwhile, the denoising quality of our accelerator is comparable to that of the original BM3D algorithm in both PSNR and SSIM.
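The abstract reports quality in PSNR without stating how it is computed. As a minimal sketch, assuming the standard definition $\mathrm{PSNR} = 10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$ and 8-bit grayscale images (peak value 255), the metric could be evaluated as follows. The function name `psnr` and the list-of-rows image representation are illustrative choices, not taken from this work.

```python
import math


def psnr(reference, denoised, peak=255.0):
    """Return the PSNR in dB between two equally sized grayscale images.

    Images are given as lists of rows of pixel values (an assumption made
    for this sketch). `peak` is the maximum possible pixel value, assumed
    to be 255 for 8-bit images.
    """
    ref_flat = [p for row in reference for p in row]
    den_flat = [p for row in denoised for p in row]
    if not ref_flat or len(ref_flat) != len(den_flat):
        raise ValueError("images must be non-empty and of equal size")

    # Mean squared error over all pixels.
    mse = sum((r - d) ** 2 for r, d in zip(ref_flat, den_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(peak ** 2 / mse)
```

A higher PSNR means the denoised image is closer to the noise-free reference, so two denoisers with "comparable" quality would be expected to give similar values on the same test images.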