In recent years, semantic image segmentation has achieved unprecedented accuracy on a variety of datasets through the use of deep convolutional neural networks. Accurate semantic segmentation can be applied in many domains, such as autonomous vehicles, surveillance cameras, and drones. These applications typically require real-time responses, so high frame rates are essential; however, the execution time of deep convolutional neural networks is far too long to meet real-time requirements. The dynamic video segmentation network (DVSNet) is proposed to achieve fast and accurate video semantic segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former generates accurate semantic segmentation results, but has more layers and is slower; the latter is much faster, but requires additional processing to obtain segmentation results and is less accurate. DVSNet uses a decision network to adaptively assign different frame regions to different networks according to an expected confidence score: frame regions with a higher expected confidence score are processed by the flow network, while frame regions with a lower expected confidence score have to pass through the segmentation network. Experimental results show that DVSNet achieves 70.4% mIoU at 19.8 fps on the Cityscapes dataset, and a high-speed version of DVSNet delivers 30.4 fps with 63.2% mIoU on the same dataset. In addition, DVSNet is able to reduce up to 95% of the computational workload.
In this paper, we present a detailed design of the dynamic video segmentation network (DVSNet) for fast and efficient video semantic segmentation. DVSNet consists of two convolutional neural networks: a segmentation network and a flow network. The former generates highly accurate semantic segmentations, but is deeper and slower. The latter is much faster than the former, but its output requires further processing and yields less accurate semantic segmentations. We explore the use of a decision network to adaptively assign different frame regions to different networks based on a metric called the expected confidence score. Frame regions with a higher expected confidence score traverse the flow network, while frame regions with a lower expected confidence score have to pass through the segmentation network. We have performed extensive experiments on various configurations of DVSNet and investigated a number of variants of the proposed decision network. The experimental results show that DVSNet is able to achieve 70.4% mIoU at 19.8 fps on the Cityscapes dataset. A high-speed version of DVSNet delivers 30.4 fps with 63.2% mIoU on the same dataset. DVSNet is also able to reduce up to 95% of the computational workload.
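To make the region-routing idea concrete, the following is a minimal sketch of the decision logic described above. The names segmentation_net, flow_net, decision_net, warp, and the threshold value are illustrative assumptions, not the paper's actual implementation; the stand-in networks simply return placeholder tensors.

```python
import torch
import torch.nn.functional as F

# Hypothetical stand-ins for the three networks described in the abstract.
# In DVSNet these are a deep segmentation CNN, a flow CNN, and a small
# decision network trained to predict the expected confidence score.
segmentation_net = lambda region: torch.randn(1, 19, 256, 256)        # per-pixel class logits
flow_net = lambda key_region, cur_region: torch.zeros(1, 2, 256, 256)  # optical flow field (dx, dy)
decision_net = lambda flow: torch.rand(1).item()                       # predicted confidence in [0, 1]

def warp(prev_seg, flow):
    """Warp the key frame's segmentation output with the estimated flow (bilinear sampling)."""
    _, _, h, w = prev_seg.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).float().unsqueeze(0) + flow.permute(0, 2, 3, 1)
    # normalize pixel coordinates to [-1, 1] as expected by grid_sample
    grid[..., 0] = 2.0 * grid[..., 0] / (w - 1) - 1.0
    grid[..., 1] = 2.0 * grid[..., 1] / (h - 1) - 1.0
    return F.grid_sample(prev_seg, grid, align_corners=True)

def segment_region(key_region, cur_region, key_seg, threshold=0.9):
    """Route one frame region to the fast flow path or the slow segmentation path."""
    flow = flow_net(key_region, cur_region)
    score = decision_net(flow)                # expected confidence score for this region
    if score >= threshold:
        return warp(key_seg, flow)            # fast path: reuse the key frame's segmentation
    return segmentation_net(cur_region)       # slow path: full segmentation network
```

The threshold here is only illustrative: raising or lowering it trades accuracy for speed, which is how DVSNet spans the two reported operating points (70.4% mIoU at 19.8 fps versus 63.2% mIoU at 30.4 fps).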