Thesis Abstract: Image super-resolution is a difficult problem with no unique solution, since a given high-resolution image may be generated from many different low-resolution images; consequently, many current algorithms apply deep learning to the single-image super-resolution task. Compared with architectures based on convolutional neural networks, Transformer-based architectures perform well across many domains; however, extracting feature information from images with the self-attention mechanism demands substantial computational resources, which greatly limits the deployment of Transformer-based architectures on edge computing platforms. This thesis designs a new network architecture based on the Swin Transformer and proposes a new interval dense connection algorithm for connecting the Swin Transformer Layers, allowing features within the model to be reused; the resulting lightweight single-image super-resolution network is named SwinOIR. The network is trained on the DIV2K and Flickr2K datasets, and an ablation study demonstrates the positive effect of interval dense connections on model performance. In addition, the model is compared against other state-of-the-art (SOTA) models on various benchmark datasets and achieves better results in each case; for example, SwinOIR attains 26.62 dB for ×4 upscaling on the Urban100 dataset, 0.15 dB higher than the SOTA model SwinIR. Meanwhile, as deep learning techniques for computer vision advance and new network architectures continue to emerge, exploring how these techniques can be applied in daily life has become an important research topic. This thesis therefore combines image super-resolution with two high-level computer vision tasks, object detection and semantic segmentation; by comparing the differences and effects before and after incorporating super-resolution into these tasks, it shows that raising image resolution can assist high-level vision tasks, improving both visual perception and accuracy.
Image super-resolution is a challenging ill-posed problem with no unique solution, as a given high-resolution image can be generated from multiple different low-resolution images. Therefore, many existing algorithms leverage deep learning to address the single-image super-resolution task. Compared to architectures based on Convolutional Neural Networks (CNNs), Transformer-based architectures perform well in various domains. However, extracting feature information from images using the self-attention mechanism requires significant computational resources, which greatly limits the application of Transformer-based architectures on edge computing platforms. This paper introduces a novel lightweight network architecture based on the Swin Transformer and proposes a new interval dense connection algorithm to connect the Swin Transformer layers, enabling feature reuse within the model. The resulting network architecture for single-image super-resolution is named SwinOIR. The proposed architecture is trained on the DIV2K and Flickr2K datasets, and an ablation study demonstrates the positive impact of interval dense connections on model performance. Additionally, the model is compared with other state-of-the-art (SOTA) models on various benchmark datasets, showing superior results. For example, SwinOIR obtains a PSNR of 26.62 dB for ×4 upscaling image super-resolution on the Urban100 dataset, which is 0.15 dB higher than the SOTA model SwinIR.

Meanwhile, with the advancement of deep learning and the emergence of new network architectures in the field of computer vision, exploring the application of these technologies in everyday life has become an important research topic. Therefore, in this paper, image super-resolution techniques are combined with two advanced computer vision tasks, namely object detection and semantic segmentation.
By comparing the differences and impacts before and after incorporating super-resolution into these tasks, we demonstrate that improving image resolution can assist in performing higher-level visual tasks, leading to improved visual perception and increased accuracy.
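The interval dense connection pattern itself is defined in the thesis body; purely as an illustrative sketch (an assumed reading of the term, not the thesis's exact algorithm), one interpretation is that each Swin Transformer Layer receives, in addition to the immediately preceding layer's output, the outputs of every `interval`-th earlier layer, so earlier features are reused instead of recomputed:

```python
def interval_dense_sources(num_layers, interval):
    """For each layer index i, return the indices of earlier layers whose
    outputs feed layer i (-1 denotes the network input).

    This is a hypothetical reading of "interval dense connection",
    provided for illustration only.
    """
    sources = []
    for i in range(num_layers):
        feeds = {i - 1}  # always consume the immediately preceding output
        j = i - 1 - interval
        while j >= -1:  # walk backwards in steps of `interval`
            feeds.add(j)
            j -= interval
        sources.append(sorted(feeds))
    return sources

# With 6 layers and an interval of 2, layer 5 reuses the outputs of
# layers 0 and 2 in addition to layer 4:
print(interval_dense_sources(6, 2))
# → [[-1], [0], [-1, 1], [0, 2], [-1, 1, 3], [0, 2, 4]]
```

Unlike full dense connectivity (as in DenseNet, where every layer feeds every later layer), a sparser interval pattern keeps the number of skip paths, and hence concatenated feature channels, small, which is consistent with the lightweight design goal stated above.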
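The reported gains are in PSNR (peak signal-to-noise ratio), the standard fidelity metric for super-resolution. A minimal sketch of how PSNR in dB is computed for 8-bit images follows (illustrative only; the thesis's evaluation pipeline may differ in details such as computing PSNR on the luminance channel or cropping borders):

```python
import numpy as np

def psnr(reference, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    ref = np.asarray(reference, dtype=np.float64)
    rec = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((ref - rec) ** 2)  # mean squared error
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: an error of exactly 1 at every pixel (MSE = 1) gives ~48.13 dB.
a = np.zeros((64, 64), dtype=np.float64)
b = a + 1.0
print(round(psnr(a, b), 2))  # → 48.13
```

Because the scale is logarithmic, a 0.15 dB gap such as the one between SwinOIR and SwinIR corresponds to a small but consistent reduction in mean squared reconstruction error.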