
Sign-Language Detection and Recognition Based on Faster Region-based Convolutional Neural Networks (Faster R-CNN)

Advisor: 涂世雄

Abstract


In this thesis, we propose a method that uses the Faster R-CNN model to detect and recognize sign language. Unlike previous work on gesture recognition, our work focuses specifically on sign language. The thesis proceeds in four steps. First, we collect images from video screenshots and photographs; because only 470 images were obtained, we use image augmentation to generate more, creating new images by changing the scale and quality of existing ones. Second, we annotate all images into 30 classes, such as "I", "you", "he", "fine", and so on. Third, training and deployment: we build a label map for our classes and feed it, together with the images, to the Faster R-CNN model. In the model, VGG16 extracts feature maps, an RPN generates region proposals, and ROI pooling combines the two to produce a classification and a proposal position. After 18 hours and 12,394 training steps, the total loss fell to about 0.2, and we saved the weights. In the last step, we present experimental results and compare them with other well-known deep learning models. This thesis makes the following contributions:
1. Convenience: only images are needed to detect and recognize sign language.
2. Scalability: the classification layer of the model can be changed to perform other image recognition tasks.
3. Localization: because no ready-made dataset exists for Taiwan, we collected images of 30 types of Taiwan Sign Language signs.
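The augmentation step described above (rescaling each photo and re-encoding it at a lower quality to synthesize new samples) could be sketched as follows. This is an illustrative reconstruction, not the authors' actual code; the function name, scale factors, and JPEG quality values are assumptions.

```python
import io
from PIL import Image

def augment(image: Image.Image,
            scales=(0.75, 1.25),
            jpeg_qualities=(40, 70)):
    """Generate new training samples from one image by changing
    its scale (ratio) and its picture quality, as described above."""
    variants = []
    # Rescale: keep the aspect ratio but change the overall size.
    for s in scales:
        w, h = image.size
        variants.append(image.resize((int(w * s), int(h * s))))
    # Re-encode at a lower JPEG quality to degrade picture quality.
    for q in jpeg_qualities:
        buf = io.BytesIO()
        image.convert("RGB").save(buf, format="JPEG", quality=q)
        buf.seek(0)
        variants.append(Image.open(buf).copy())
    return variants

# Example: one source photo yields four extra samples.
src = Image.new("RGB", (640, 480), color=(120, 90, 60))
print(len(augment(src)))  # 4
```

Applied to the original 470 photos, even this small set of transforms multiplies the dataset several times over.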

Keywords

Sign language; object detection; deep learning

Abstract (English)


In this thesis, we present a method to detect and recognize sign language using the Faster R-CNN model. This work differs from previous work on gesture recognition in that it focuses on sign language. There are four steps in our thesis. In the first step, we collect image data from video screenshots and photographs of people. Because our original dataset contains only 470 images, we use image augmentation to generate more: new images are created by changing the scale and quality of existing ones. In the second step, we annotate our images into thirty classes, such as me, you, they, fine, eat, etc. The third step is the training process and implementation. We collect our classes in a label map and feed it, together with the images, to the Faster R-CNN model. In Faster R-CNN, VGG16 extracts a feature map from each image and an RPN produces region proposals; the feature map and proposals are then passed to an ROI pooling layer, which yields two outputs: a classification and a proposal position. After training for about 18 hours in CPU mode over 12,394 steps, the total loss reached about 0.2, and we saved the trained weights, which we then applied to our test data. In the last step, we present simulation results and compare our model against other popular deep learning models. This thesis makes the following contributions:
1. Convenience: some previous studies use gloves as a tool to detect gestures for recognition; our experiments need only images to detect and recognize sign language, and achieve good simulation results.
2. Scalability: for image recognition tasks, the classification layer of our model can be changed to fit other object recognition missions.
3. Localization: every country has its own sign language. Because no Taiwan Sign Language dataset existed, we collected images of thirty types of Taiwan Sign Language signs to create a new dataset for academic use.

Keywords (English)

Sign language; object detection; deep learning

