
Deep Video Prediction Using Local Template Matching and Self-attention PixelCNN

Advisor: Wen-Hsiao Peng (彭文孝)

Abstract


Predicting the current frame from a sequence of frames at preceding time steps is a challenging task. Although most previous methods perform well at predicting simple video data, they rarely succeed on high-resolution videos and complex natural videos: the results are usually blurry and lose the details of objects, most likely because the architecture and capacity of these models are insufficient. To address this problem, we depart from the traditional black-box approach and propose a generative model based on local template matching and a self-attention PixelCNN. Our work consists of two parts: (1) motion search and (2) improving the auto-regressive prediction model. Following the practice of conventional video compression frameworks, we partition each frame of the video into non-overlapping blocks, and then use an attention-based template matching model to search all blocks of the previous frame for a prediction signal. This prediction signal is used as a conditioning input to the PixelCNN, which generates the target block pixel by pixel; the improvement to the PixelCNN is achieved by introducing a self-attention mechanism. The development of this model is still at an early stage, and we present a series of findings made while exploring it.
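
The motion-search stage can be illustrated with a short sketch. Below is a minimal, hypothetical Python/PyTorch example, not the thesis implementation: partition_blocks and soft_template_match are illustrative names, the learned attention-based matching network described above is replaced by plain dot-product attention, and for simplicity the query is assumed to be the target block itself (in practice the matching template would have to be built from already-available context).

```python
# Minimal sketch of block partitioning and attention-style template matching.
# All names and sizes are illustrative, not the thesis implementation.
import torch
import torch.nn.functional as F

def partition_blocks(frame: torch.Tensor, block_size: int) -> torch.Tensor:
    """Split an (H, W) frame into non-overlapping blocks.

    Returns a tensor of shape (num_blocks, block_size * block_size).
    """
    blocks = frame.unfold(0, block_size, block_size) \
                  .unfold(1, block_size, block_size)
    return blocks.reshape(-1, block_size * block_size)

def soft_template_match(target_block: torch.Tensor,
                        candidate_blocks: torch.Tensor,
                        temperature: float = 1.0) -> torch.Tensor:
    """Attention-style template matching: score every candidate block from
    the previous frame against the target and return the softmax-weighted
    combination as the prediction signal (a differentiable stand-in for a
    hard argmax match)."""
    scores = candidate_blocks @ target_block / temperature  # (num_blocks,)
    weights = F.softmax(scores, dim=0)                      # attention weights
    return weights @ candidate_blocks                       # prediction signal

# Usage: predict each block of the current frame from the previous frame.
prev_frame = torch.rand(64, 64)
curr_frame = torch.rand(64, 64)
B = 8
candidates = partition_blocks(prev_frame, B)
for target in partition_blocks(curr_frame, B):
    pred_signal = soft_template_match(target, candidates)
```

Using a softmax over matching scores instead of a hard argmax keeps the search differentiable, so a matching model of this shape can be trained end-to-end with the rest of the pipeline.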

Parallel Abstract (English)


Predicting a future video frame from a few past frames, also known as one-step video prediction/extrapolation, is a challenging task. Although recent deep learning-based models demonstrate good performance on simple datasets, such as Moving MNIST, they fail to generalize to high-resolution, complex natural videos. Often the predicted frames are blurry and lack detail, which may be attributed to limited model capacity and network architecture. To address this problem, we deviate from the pure black-box approach and introduce a generative video prediction model based on local template matching and a self-attention PixelCNN. Our work divides the task into two parts: (1) motion search and (2) auto-regressive prediction refinement. Following the conventional video compression framework, we first divide a video frame into non-overlapping blocks. We then find a prediction signal for each of these blocks from the previous frame using an attention-based template matching model. This prediction signal is further utilized in the PixelCNN as a conditioning signal to synthesize the target block pixel by pixel for prediction refinement. In particular, the generation process is improved by a self-attention mechanism. The development of this model is still at an early stage; we present findings encountered along the way.
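
The refinement stage combines two ingredients: a masked convolution that keeps generation causal in raster-scan order, and self-attention over the already-generated context. The sketch below is a minimal, illustrative stand-in for the thesis architecture, assuming 8x8 grayscale blocks with 256 intensity levels; MaskedConv2d, TinyConditionalPixelCNN, and all layer sizes are hypothetical, and nn.MultiheadAttention is used here as a generic causal self-attention layer.

```python
# Minimal sketch of a conditional PixelCNN with causal self-attention.
# Names and layer sizes are illustrative, not the thesis architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Conv2d):
    """Conv2d whose kernel is masked so each pixel only sees pixels above
    it and to its left (mask type 'A' also hides the current pixel)."""
    def __init__(self, mask_type: str, *args, **kwargs):
        super().__init__(*args, **kwargs)
        _, _, kh, kw = self.weight.shape
        mask = torch.ones(kh, kw)
        mask[kh // 2, kw // 2 + (mask_type == 'B'):] = 0
        mask[kh // 2 + 1:, :] = 0
        self.register_buffer('mask', mask)

    def forward(self, x):
        self.weight.data *= self.mask  # enforce causality at every call
        return super().forward(x)

class TinyConditionalPixelCNN(nn.Module):
    def __init__(self, channels: int = 32):
        super().__init__()
        self.in_conv = MaskedConv2d('A', 1, channels, 7, padding=3)
        self.cond_conv = nn.Conv2d(1, channels, 1)   # injects the prediction signal
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)
        self.out_conv = nn.Conv2d(channels, 256, 1)  # 256-way logits per pixel

    def forward(self, x, cond):
        h = F.relu(self.in_conv(x) + self.cond_conv(cond))
        b, c, hh, ww = h.shape
        seq = h.flatten(2).transpose(1, 2)            # (B, H*W, C)
        # Causal mask: pixel i may only attend to pixels j <= i (raster order).
        causal = torch.triu(torch.ones(hh * ww, hh * ww, dtype=torch.bool), 1)
        seq, _ = self.attn(seq, seq, seq, attn_mask=causal)
        h = h + seq.transpose(1, 2).reshape(b, c, hh, ww)
        return self.out_conv(h)                       # (B, 256, H, W)

@torch.no_grad()
def sample_block(model, cond, size=8):
    """Generate a target block pixel by pixel, conditioned on the matched
    prediction signal from the previous frame."""
    x = torch.zeros(1, 1, size, size)
    for i in range(size):
        for j in range(size):
            logits = model(x, cond)[0, :, i, j]
            x[0, 0, i, j] = torch.multinomial(F.softmax(logits, -1), 1) / 255.0
    return x

model = TinyConditionalPixelCNN()
pred_signal = torch.rand(1, 1, 8, 8)  # output of the motion-search stage
block = sample_block(model, pred_signal)
```

Because both the masked convolution and the causally masked attention look only at pixels that precede the current one in raster order, the network defines a valid auto-regressive factorization, and the matched prediction signal shifts that distribution toward the motion-compensated guess.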
