With minuscule texture primitives and large perceptual variations across diverse contexts, it is hard to capture discriminative representations from texture images, which makes texture recognition a challenging problem. Previous works have focused on applying Convolutional Neural Networks (CNNs) to extract pattern information from texture objects. In our work, we first investigate the efficacy of the Vision Transformer (ViT) on texture recognition, which models the global semantic relevance among texture image patches through a series of self-attention mechanisms. Next, to generate more informative representations, we propose the CNN-ViT Fusion Network (CTF-Net), which fuses the high-level feature maps generated by CNN and ViT backbones, complementing the global semantic relevance learned by the ViT with the pattern characteristics captured by the CNN at each spatial position. Beyond considering only the high-level feature map as in previous works, we propose the Multi-Level CNN-ViT Fusion Network (MLCTF-Net), which fuses feature maps generated by the CNN and ViT at multiple layers to incorporate texture features at different abstraction levels. Finally, in addition to the cross-entropy loss used to handle inter-class variations between texture categories, we propose MLCTF-Net†, which further incorporates a center loss to address intra-class variations within each texture category. Extensive experiments on DTD, KTH-TIPS2-b, FMD, GTOS, and GTOS-mobile show that the proposed fusion networks achieve prominent performance on texture classification.
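To make the fusion idea concrete, the following is a minimal NumPy sketch of combining a CNN feature map with ViT patch tokens at each spatial position. The shapes (a 7×7 CNN map with 512 channels, 768-dimensional ViT tokens) and the choice of channel-wise concatenation as the fusion operator are illustrative assumptions; the abstract does not specify the exact fusion operation or backbone dimensions.

```python
import numpy as np

# Assumed shapes for illustration: CNN map (C, H, W), ViT patch tokens (H*W, D).
H, W = 7, 7
rng = np.random.default_rng(0)
cnn_map = rng.random((512, H, W))        # e.g. a late-stage CNN feature map
vit_tokens = rng.random((H * W, 768))    # ViT patch embeddings (CLS token dropped)

# Reshape the ViT token sequence into a spatial grid so that each token
# aligns with the CNN feature vector at the same spatial position.
vit_map = vit_tokens.reshape(H, W, 768).transpose(2, 0, 1)   # (768, H, W)

# Fuse the two representations by concatenating channels per position,
# pairing the CNN's local pattern features with the ViT's global context.
fused = np.concatenate([cnn_map, vit_map], axis=0)           # (1280, H, W)
```

A multi-level variant in the spirit of MLCTF-Net would repeat this pairing for feature maps taken from several intermediate layers (upsampling or projecting as needed so the spatial grids match) rather than only the final ones.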
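The loss combination described above (cross entropy for inter-class separation plus a center loss for intra-class compactness) can be sketched as follows. This is a generic NumPy illustration of the two terms, not the authors' implementation; the weighting factor `lam` and the way class centers are maintained are assumptions.

```python
import numpy as np

def center_loss(features, labels, centers):
    # Penalizes the squared distance between each feature vector and the
    # running center of its class, pulling same-class features together.
    diffs = features - centers[labels]
    return 0.5 * np.mean(np.sum(diffs ** 2, axis=1))

def cross_entropy(logits, labels):
    # Numerically stable softmax cross entropy over a batch.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(log_probs[np.arange(len(labels)), labels])

def total_loss(logits, features, labels, centers, lam=0.01):
    # lam balances inter-class separation against intra-class compactness
    # (a hypothetical weighting; the actual value is a tuning choice).
    return cross_entropy(logits, labels) + lam * center_loss(features, labels, centers)
```

In training, the class centers are themselves updated from the mini-batch features (as in the original center-loss formulation), so the penalty tracks the current feature distribution of each texture category.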