  • Thesis/Dissertation


The influence of data set design of convolutional neural network on posture recognition of construction site personnel

Advisor: I-Cheng Yeh


Abstract


With the continuous development of infrastructure construction, the concept of safe production in the construction industry has gradually gained acceptance. In recent years, deep learning has driven breakthroughs in image recognition, making it feasible to automatically identify workers' behavior from surveillance video and thereby safeguard their safety. However, few past studies have explored recognizing categories of pedestrian posture. This study therefore used the YOLO V4 deep learning model to recognize three posture classes of personnel in construction site environments: standing, bending over, and squatting. To improve accuracy, diversify the data, and investigate the recognition capability of models built from datasets with different characteristics, this study constructed two image datasets in addition to an existing construction site image dataset: (1) a designed image dataset, in which different people served as models and were photographed in various postures against different backgrounds and at different shooting distances and angles; and (2) a natural image dataset, consisting of images of different groups working naturally at campus sites similar to construction sites, including surveying practice and a materials laboratory. A total of 890 images from these three datasets were manually annotated, yielding pedestrian samples of 2,144 standing, 489 bending over, and 697 squatting. In addition, samples from the designed and natural image datasets were blended into a mixed dataset with an equal number of samples and a double mixed dataset with twice the samples. YOLO V4 recognition models were built from the above five image datasets. To avoid overfitting, each dataset was divided into 80% for the training set and 20% for the validation set.
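The 80/20 train/validation split described above can be sketched in a few lines. This is a minimal illustration, not the thesis's actual pipeline; the fixed seed and the use of bare image indices are assumptions.

```python
import random

def split_dataset(image_ids, train_ratio=0.8, seed=42):
    """Randomly split annotated image IDs into training and validation sets."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    ids = list(image_ids)
    rng.shuffle(ids)
    cut = int(len(ids) * train_ratio)  # index of the 80% boundary
    return ids[:cut], ids[cut:]

# The study annotated 890 images across the three source datasets.
train, val = split_dataset(range(890))
print(len(train), len(val))  # 712 178
```

Holding out the 20% validation set lets each model be selected on images it was not trained on, which is how the "best model per dataset" in the next step is chosen.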
The best model for each dataset was selected based on the validation set, and the construction site image dataset was then used as the test set to evaluate which dataset trains the model with the best recognition of construction site pedestrian postures. Finally, mAP (mean Average Precision) analysis ranked the five image datasets as follows: double mixed (70.03%) > natural (65.77%) > construction site (63.34%) > mixed (60.50%) > designed (29.32%). This shows that: (1) The double mixed dataset, with the largest number of samples, performed best, indicating that sample size is critical. (2) The natural dataset performed slightly better than the construction site dataset. This may be because some pedestrian samples in the construction site dataset are too small, too poorly resolved, or too heavily occluded to be conducive to learning, whereas the campus dataset, similar to but not identical with the construction site environment, was captured at higher resolution and is therefore easier to learn from. (3) Both the natural and construction site datasets far outperformed the designed dataset. Careful analysis reveals that the designed dataset performed extremely well on the training and validation sets but poorly on the test set, a sign of severe overfitting. The likely reason is that its images usually contain only one or two people who appear large and unoccluded, making them easy to learn, so the model performs poorly when it encounters the very different construction site samples with low resolution and heavy occlusion. (4) The mAP of the mixed dataset falls between those of the natural and designed datasets, showing that the two datasets have no complementary synergy. In summary, the natural dataset is the most suitable training material in this study.
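The mAP comparison above can be stated arithmetically: mAP is the mean of the per-class average precisions, and the dataset ranking follows by sorting the reported scores. A minimal sketch, where the function name is mine and the per-dataset mAP values are the ones reported in this abstract:

```python
def mean_average_precision(ap_by_class):
    """mAP = arithmetic mean of the per-class average precisions (AP)."""
    return sum(ap_by_class.values()) / len(ap_by_class)

# mAP (%) on the construction site test set, as reported in the abstract.
map_by_dataset = {
    "double mixed": 70.03,
    "natural": 65.77,
    "construction site": 63.34,
    "mixed": 60.50,
    "designed": 29.32,
}

# Rank the five training datasets from best to worst test-set mAP.
ranking = sorted(map_by_dataset, key=map_by_dataset.get, reverse=True)
print(" > ".join(ranking))
# double mixed > natural > construction site > mixed > designed
```

With only three posture classes, a single poorly learned class (e.g. heavily occluded squatting workers) pulls the mean down sharply, which is consistent with the designed dataset's collapse on the test set.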

