A Hybrid Multiple-Bitrate Image Compression Algorithm Based on Learned Residual Coding

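Learning-based image compression systems have made substantial research progress in recent years. To date, however, most studies have focused solely on improving image quality without considering computational complexity. In this work, we propose a hybrid two-layer image compression algorithm that uses Versatile Video Coding (VVC) intra coding as the base layer and neural-network-based residual coding as the enhancement layer. The enhancement layer aims to improve the image quality of the base layer by transmitting a transformed residual signal. In addition, a base-layer-guided local attention module is incorporated so that features are extracted more effectively in the critical high-frequency regions. Experimental results show superior subjective visual quality compared with existing image compression standards. We further extend the work to a multiple-bitrate model. By integrating conditional convolution with a variable quantization step size, our system can produce multiple bitrates with a single model. Experimental results show that, under a joint rate-distortion evaluation, the proposed model achieves objective scores similar to those of single-layer VVC and delivers better subjective visual quality, particularly at high bitrates. It also outperforms conventional JPEG, JPEG 2000, and High Efficiency Video Coding (HEVC). The proposed multiple-bitrate system contains 18 million network parameters stored in 16-bit floating-point format. On average, encoding a single image on an Intel Xeon Gold 6154 takes about 13.5 minutes, most of which is spent on VVC encoding. In contrast, decoding time is dominated by the neural-network-based enhancement layer, at about 31 seconds per image on average. Compared with conventional image compression standards, the two proposed architectures achieve better subjective visual quality while attaining similar coding efficiency.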

Keywords

Residual coding; multiple bitrate; image compression

Parallel Abstract

Learning-based image compression has made significant progress in the past few years. To date, most approaches aim to increase performance with little consideration of computational complexity. In this study, we propose a hybrid two-layer image coding system. It features a VVC intra codec as the base layer and a learning-based residual codec as the enhancement layer. The latter refines the base-layer reconstruction by sending a latent residual signal. In particular, a base-layer-guided attention module is employed to identify the critical high-frequency areas. The experimental results demonstrate superior visual quality compared with the existing image coding standards. We then extend our approach to the multiple-rate case. We integrate the conditional convolution operation and a variable quantization step size into the system to achieve multiple-bitrate coding with only one model. The experimental results show that it achieves rate-distortion performance comparable to that of single-layer VVC intra on various common objective metrics, and it often produces better subjective quality, particularly at very low bit rates. It consistently outperforms HEVC intra, JPEG 2000, and JPEG. The proposed multiple-rate system uses 18M network parameters in 16-bit floating-point format. On average, encoding an image on an Intel Xeon Gold 6154 takes about 13.5 minutes, of which the VVC base layer dominates. In contrast, the decoding time is dominated by the residual decoder and the synthesizer, requiring 31 seconds per image. The two proposed systems clearly improve subjective quality and achieve similar inference speed compared with the baseline standard systems.
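The two-layer idea with a variable quantization step can be illustrated with a toy sketch. This is only an illustration under strong assumptions, not the system described above: the hypothetical `base_layer` function stands in for the VVC intra codec with a coarse scalar quantizer, the residual is coded directly in the pixel domain rather than as a learned latent, and the attention module, conditional convolution, and entropy coding are all omitted. All function names and the step values are hypothetical.

```python
from typing import List


def base_layer(pixels: List[int], step: int = 32) -> List[int]:
    """Hypothetical stand-in for the base layer: coarse quantization to bin centers."""
    return [min(255, (p // step) * step + step // 2) for p in pixels]


def encode_residual(pixels: List[int], base: List[int], q_step: float) -> List[int]:
    """Enhancement layer (toy): quantize the residual with step size q_step."""
    return [round((p - b) / q_step) for p, b in zip(pixels, base)]


def decode(base: List[int], symbols: List[int], q_step: float) -> List[int]:
    """Add the dequantized residual back to the base reconstruction and clip to 8 bits."""
    return [min(255, max(0, round(b + s * q_step))) for b, s in zip(base, symbols)]


def mse(a: List[int], b: List[int]) -> float:
    """Mean squared error between two equally long pixel lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    pixels = [0, 17, 64, 100, 128, 200, 255]
    base = base_layer(pixels)
    print("base only  MSE:", mse(pixels, base))
    # The same code path serves several rate points; only q_step changes.
    for q_step in (16.0, 8.0, 1.0):
        symbols = encode_residual(pixels, base, q_step)
        print("q_step", q_step, "MSE:", mse(pixels, decode(base, symbols, q_step)))
```

In this sketch a larger `q_step` yields smaller residual symbols (fewer bits after entropy coding) at the cost of higher distortion, which is the sense in which one codec can cover several bitrates by varying only the quantization step.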

Parallel Keywords

Residual Coding; Variable-rate; Image Compression
