
GAN-Based Video Generation of Human Action by Using Two Frames

Advisor: 林嘉文 (Chia-Wen Lin)

Abstract


Video generation has long been an important topic in computer vision. The aim is to synthesize a complete, temporally continuous video from a small number of given frames or from noise, with applications to dataset augmentation and to reducing the frame rate (fps) required when transmitting video. Most previous methods add auxiliary information during training, such as semantic segmentations, the video's optical flow, or depth, but this information is hard to collect in real life; without it, the moving objects in the generated video tend to become increasingly blurry as the generated sequence grows longer, eventually dispersing and vanishing. We aim to alleviate this phenomenon. In this thesis, we observe that traditional and recent methods each have their own strengths and weaknesses, and we seek to combine the strengths of each to compensate for the weaknesses of the other. We therefore design an architecture that couples a traditional method, interpolation, with the recently popular Generative Adversarial Network (GAN). The architecture generates a video from only the first and last frames of the clip, without any additional information, and decomposes into four steps: 1. extracting heat maps of the human joints; 2. generating heat maps for some of the joints by interpolation; 3. using an adversarial network together with the result of step 2 to generate a more complete heat-map video; 4. compositing the video's foreground and background onto the generated result. Experimental results show that, for video prediction without any added information, our architecture produces clearer results, with the target moving farther from its starting point, than other works under the same conditions.
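The thesis itself does not publish code; as a rough illustration of step 2, the following Python/NumPy sketch linearly interpolates 2-D joint coordinates between the first and last frames and renders each intermediate pose as per-joint Gaussian heat maps. All function and parameter names here are hypothetical, and the actual method may interpolate only a subset of joints and use a different heat-map formulation.

    import numpy as np

    def gaussian_heatmap(center, size=(64, 64), sigma=2.0):
        """Render one joint as a 2-D Gaussian heat map (hypothetical formulation)."""
        ys, xs = np.mgrid[0:size[0], 0:size[1]]
        cy, cx = center
        return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

    def interpolate_heatmaps(joints_first, joints_last, num_frames, size=(64, 64)):
        """Linearly interpolate joint positions between the first and last frames,
        then render each intermediate pose as per-joint heat maps.

        joints_first, joints_last: (num_joints, 2) arrays of (y, x) coordinates.
        Returns an array of shape (num_frames, num_joints, H, W)."""
        video = np.zeros((num_frames, len(joints_first)) + size)
        for t in range(num_frames):
            alpha = t / (num_frames - 1)          # 0 at the first frame, 1 at the last
            joints_t = (1 - alpha) * joints_first + alpha * joints_last
            for j, center in enumerate(joints_t):
                video[t, j] = gaussian_heatmap(center, size)
        return video

    # Example: 15 joints moving across a 64x64 grid over 16 frames.
    first = np.random.uniform(10, 20, size=(15, 2))
    last = first + 30.0                           # target moved far from its start
    heatmap_video = interpolate_heatmaps(first, last, num_frames=16)
    print(heatmap_video.shape)                    # (16, 15, 64, 64)

Pure linear interpolation like this yields rigid, physically implausible trajectories for many joints, which is precisely why the pipeline's step 3 refines the interpolated heat maps with an adversarial network.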

Keywords

Video generation; Video augmentation

Parallel Abstract


Video generation is an important problem in computer vision. The goal of video generation is to produce a complete video from a few given frames or from noise, so that the technique can be applied to data augmentation or to reducing the frame rate (fps) needed when transmitting video. Past methods required ground-truth annotations of auxiliary information (e.g., semantic segmentations, optical flow, or depth) at training time; if the annotations were omitted, the objects in the generated videos would blur and eventually disperse over time. Since such information is very difficult to obtain in the real world, the purpose of this thesis is to overcome this difficulty. The thesis presents a new architecture that generates the video from only the first and the last frame by combining a traditional method, interpolation, with a deep learning method, the generative adversarial network (GAN). Our method can be separated into four stages. First, we use [1] to extract the joints of the person in the frames. Second, we apply interpolation to some of the joints, not all, to generate a heat-map video. Third, a GAN generates a more complete heat-map video from the result of the second stage. Last but not least, we employ [2] to generate the appearance of the object in the video. The details of each stage are explained in the thesis. Experimental results show that the videos generated by our method, without any additional information, are more realistic, and that the object in the videos moves farther from its original location, compared with other experiments performed under the same conditions.
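For the third stage, the abstract only states that a GAN refines the interpolated heat-map video. The PyTorch sketch below shows one plausible form of that adversarial objective: a generator maps the partially interpolated heat-map video to a refined one, trained against a video discriminator with a standard GAN loss. The 3-D convolutional shapes, the tensor layout, and the loss choice are all assumptions for illustration; the thesis's actual networks are not specified on this page.

    import torch
    import torch.nn as nn

    # Hypothetical shapes: heat-map videos of 15 joints, 16 frames, 64x64 pixels,
    # laid out as (batch, joints, frames, H, W) tensors for 3-D convolutions.
    J, T, H, W = 15, 16, 64, 64

    generator = nn.Sequential(          # refines an interpolated heat-map video
        nn.Conv3d(J, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv3d(32, J, kernel_size=3, padding=1), nn.Sigmoid(),
    )
    discriminator = nn.Sequential(      # scores whole heat-map videos as real/fake
        nn.Conv3d(J, 32, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv3d(32, 1, kernel_size=4, stride=2, padding=1),
        nn.Flatten(), nn.LazyLinear(1),  # LazyLinear infers its input size on first use
    )
    bce = nn.BCEWithLogitsLoss()
    opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

    interp = torch.rand(4, J, T, H, W)   # stage-2 interpolated heat maps (stand-in data)
    real = torch.rand(4, J, T, H, W)     # ground-truth heat maps (stand-in data)

    # One adversarial step: D learns to separate real from refined videos,
    # then G is updated so that its refinements look real to D.
    fake = generator(interp)
    d_loss = bce(discriminator(real), torch.ones(4, 1)) + \
             bce(discriminator(fake.detach()), torch.zeros(4, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    g_loss = bce(discriminator(fake), torch.ones(4, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

Conditioning the generator on the interpolated heat maps, rather than on noise alone, matches the abstract's premise that the coarse motion is already fixed by the two given frames and only needs to be made complete and sharp.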

Parallel Keywords

Video prediction; Video generation

References


[1] M. Mathieu, C. Couprie, and Y. LeCun. Deep multi-scale video prediction beyond mean square error. arXiv preprint, 2015.
[2] C. Vondrick, H. Pirsiavash, and A. Torralba. Generating videos with scene dynamics. In Neural Information Processing Systems (NIPS), 2016.
[3] J. Walker, C. Doersch, A. Gupta, and M. Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision (ECCV), 2016.
[4] X. Liang, L. Lee, W. Dai, and E. P. Xing. Dual motion GAN for future-flow embedded video prediction. In IEEE International Conference on Computer Vision (ICCV), 2017.
[5] K. Ohnishi, S. Yamamoto, Y. Ushiku, and T. Harada. Hierarchical video generation from orthogonal information: Optical flow and texture. In AAAI Conference on Artificial Intelligence (AAAI), 2018.
