Remote Sensing Image Object Detection with Improved YOLOv8 Models

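In recent years, researchers in computer vision have proposed many efficient object detectors. Remote sensing images, however, are more complex than ordinary images, and their contents are highly similar to one another. As a result, object detectors applied directly to remote sensing images often perform worse than expected. In addition, although deep-learning-based detection algorithms can identify the presence and category of an object, most of them use only the information inside region proposals and thereby ignore global information. Most recent object detectors also pursue higher accuracy while neglecting the balance between detection accuracy and model size, which prevents them from being deployed well in resource-constrained environments. To address these problems and better detect objects in remote sensing images, we propose three new models based on the fast YOLOv8 object detection model. The first is YOLOv8n with Bi-directional Feature Pyramid Network (YOLOv8n-Bi), which replaces the feature pyramid with a weighted Bi-directional Feature Pyramid Network (BiFPN) and learns the importance of each input. The second model adds Transformer blocks to YOLOv8n-Bi so that the model captures long-range dependencies and global context in the image more effectively. Through self-attention, the model attends more closely to each position in the image, particularly under occlusion or against complex backgrounds. We call this second model YOLOv8n with Transformer and Bi-directional Feature Pyramid Network (YOLOv8n-TFBi). The third model builds on YOLOv8n-TFBi by adding a Coordinate Attention (CA) block, which strengthens the model's attention to specific positions and thereby improves detection accuracy. We call this third model YOLOv8n with Transformer and Bi-directional Feature Pyramid Network and Coordinate Attention (YOLOv8n-TFBiCA). We compare the proposed models with other advanced object detection models on the public RSOD dataset using three metrics: mean Average Precision (mAP), number of parameters, and inference time. The comparison verifies that, with a comparable parameter count and inference speed, the proposed models are more accurate than the other advanced models. Experimental results show that, relative to the YOLOv8n model on the RSOD dataset, YOLOv8n-Bi, YOLOv8n-TFBi, and YOLOv8n-TFBiCA raise mAP from 90.2% to 91.7%, 92.5%, and 94.5%, respectively. They also achieve better mAP than YOLOv5, YOLOv6, and CA-YOLO while retaining a competitive parameter count and inference speed.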
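The abstract describes BiFPN as a weighted feature pyramid that learns the importance of each input, without giving the fusion formula. The sketch below illustrates one common form, the fast normalized fusion from the original BiFPN design, and is only an assumption about how the weighting might look here. The function name `fast_normalized_fusion`, the flat-list feature representation, and the `eps` default are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def fast_normalized_fusion(
    features: Sequence[Sequence[float]],
    raw_weights: Sequence[float],
    eps: float = 1e-4,
) -> List[float]:
    """Fuse same-length feature vectors with non-negative, normalized weights.

    Assumption: this mirrors BiFPN-style fast normalized fusion,
    out = sum_i(w_i * x_i) / (eps + sum_j(w_j)), with w_i = max(0, raw_w_i).
    In a real network the raw weights would be learned parameters.
    """
    if not features:
        raise ValueError("at least one input feature is required")
    if len(features) != len(raw_weights):
        raise ValueError("one weight per input feature is required")

    # ReLU keeps every weight non-negative so the normalization stays stable.
    weights = [max(0.0, w) for w in raw_weights]
    total = sum(weights) + eps

    length = len(features[0])
    fused = [0.0] * length
    for feature, weight in zip(features, weights):
        if len(feature) != length:
            raise ValueError("all input features must have the same length")
        for i, value in enumerate(feature):
            fused[i] += (weight / total) * value
    return fused


if __name__ == "__main__":
    # Two hypothetical feature vectors from different pyramid levels,
    # already resized to the same length.
    top_down = [1.0, 2.0]
    lateral = [3.0, 4.0]
    print(fast_normalized_fusion([top_down, lateral], [1.0, 1.0]))
```

A larger raw weight pulls the fused output toward the corresponding input, which is one way a network can express that some input scales matter more than others.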

Keywords

Deep learning; convolutional neural networks; object detection; YOLOv8; remote sensing images

Parallel Abstract

In recent years, numerous researchers in the field of computer vision have proposed efficient object detectors. However, remote sensing images are more complex than ordinary images, and their contents are much more similar to one another. Because of this, object detectors applied directly to remote sensing images often perform worse than expected. Furthermore, although deep-learning-based object detection algorithms can identify the existence and category of an object, most current algorithms use only the information in the region proposals and thereby ignore the information of the entire image. In addition, many recent state-of-the-art object detectors prioritize higher accuracy but neglect the balance between detection accuracy and model size. This oversight limits their applicability in resource-constrained environments. To address these issues and improve object detection in remote sensing images, we propose three novel models based on the YOLOv8 fast object detection model. The first model is YOLOv8n with Bi-directional Feature Pyramid Network (YOLOv8n-Bi), which replaces the feature pyramid with a weighted Bi-directional Feature Pyramid Network (BiFPN) and learns the importance of the different inputs. Building upon YOLOv8n-Bi, the second model incorporates Transformer blocks so that the model captures long-range dependencies and global contextual information more effectively. This model, called YOLOv8n with Transformer and Bi-directional Feature Pyramid Network (YOLOv8n-TFBi), uses self-attention to enhance its focus on each position in the image, particularly in scenarios with occlusion or complex backgrounds. The third model, called YOLOv8n with Transformer and Bi-directional Feature Pyramid Network and Coordinate Attention (YOLOv8n-TFBiCA), builds upon YOLOv8n-TFBi by introducing a Coordinate Attention (CA) block that strengthens the model's attention to specific positions, thereby improving object detection accuracy. This thesis compares the proposed models with other advanced object detection approaches on the publicly available RSOD dataset. The evaluation metrics are mean Average Precision (mAP), the number of parameters, and inference time. We demonstrate that our models outperform other state-of-the-art approaches in accuracy while maintaining a comparable number of parameters and comparable inference speed. The experimental results show that the proposed YOLOv8n-Bi, YOLOv8n-TFBi, and YOLOv8n-TFBiCA models improve the mAP on the RSOD dataset from 90.2% with the YOLOv8n model to 91.7%, 92.5%, and 94.5%, respectively. Compared with models such as YOLOv5, YOLOv6, and CA-YOLO, the proposed models also achieve better mAP while maintaining competitive parameter counts and inference speeds.
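The mAP figures reported above average a per-class Average Precision (AP) over all classes. The abstract does not state which AP variant or IoU threshold was used, so the sketch below assumes the all-point interpolated AP (area under the precision envelope) and treats each detection as already matched to ground truth. The function names, the `(confidence, is_true_positive)` input format, and the class names in the usage example are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def average_precision(
    detections: Iterable[Tuple[float, bool]],
    num_ground_truth: int,
) -> float:
    """All-point interpolated AP for one class.

    `detections` holds (confidence, is_true_positive) pairs; matching each
    detection to ground truth (for example by an IoU threshold) is assumed
    to have been done beforehand.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")

    # Rank detections from most to least confident.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)

    true_pos = 0
    false_pos = 0
    recalls = []
    precisions = []
    for _, is_true_positive in ranked:
        if is_true_positive:
            true_pos += 1
        else:
            false_pos += 1
        recalls.append(true_pos / num_ground_truth)
        precisions.append(true_pos / (true_pos + false_pos))

    # Precision envelope: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sum the area under the envelope, one recall step at a time.
    ap = 0.0
    previous_recall = 0.0
    for recall, precision in zip(recalls, precisions):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(per_class_ap: Dict[str, float]) -> float:
    """mAP is the unweighted mean of the per-class AP values."""
    if not per_class_ap:
        raise ValueError("at least one class is required")
    return sum(per_class_ap.values()) / len(per_class_ap)


if __name__ == "__main__":
    # Hypothetical detections for two illustrative classes.
    ap_a = average_precision([(0.9, True), (0.8, False), (0.7, True)], 2)
    ap_b = average_precision([(0.95, True), (0.6, True)], 2)
    print(mean_average_precision({"class_a": ap_a, "class_b": ap_b}))
```

Because mAP weights every class equally, a gain on a single hard class can move the overall score noticeably, which is worth keeping in mind when comparing models by this one number.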

Parallel Keywords

Deep Learning; Convolutional Neural Networks; Object Detection; YOLOv8; Remote Sensing

