

A Customizable Video Composition System Using Heterogeneous Cameras

Advisor: 石勝文

Abstract




This dissertation studies a customizable video composition system that can produce a documentary video of a social activity dedicated to a participant who is interested in obtaining a video of his or her own. To record a social activity, a heterogeneous camera network is developed, consisting of a mobile camera-robot, fixed and pan-tilt-zoom cameras, and human-operated cameras carried by participants or professional cameramen. The cameras are controlled to fulfill aesthetic rules of photography and filming. The main aesthetic factors considered in this work comprise the avoidance of unintentional dissection lines, composition rules, and camera movements. Furthermore, a pedestrian position tracking method is proposed to track a key person wearing a foot-mounted inertial measurement unit (IMU). Based on the IMU signals, the camera-robot is aware of the key person's motion states, which helps it keep recording the key person in a crowded environment. The video composition is accomplished in a post-processing stage. First, the videos taken by the heterogeneous cameras are split into numerous video shots with a shot change detection method. Critical shots of the activity are selected manually, either by a professional cameraman or by the activity organizer. In addition, video shots from certain cameras that record the main events of the activity are designated as mainline shots. Second, a face detection method is applied to each frame of the videos, and the detected faces are clustered with a face recognition method. Third, interested participants are asked to tag their own photos among the classified faces and may correct misclassified results at will. The classified faces are then used to select face-retrieved (FR) video shots. Fourth, a video quality assessment method is used to grade the quality of the mainline shots and the FR shots. Given a desired video length, a customized video is composed from the critical shots, the mainline shots, and the FR shots according to their video quality values. The proposed system has been tested in real social activities. The experimental results show that the proposed customizable video composition system can produce visually appealing videos.
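
The composition step described above can be pictured with a short sketch. The Python fragment below is a minimal illustration only: it assumes a greedy, quality-ranked selection in which critical shots are always kept and the remaining time budget is filled with the highest-scoring mainline and face-retrieved (FR) shots. The Shot fields, the compose() function, and the greedy rule are assumptions made for illustration, not the dissertation's exact algorithm.

# Hypothetical sketch of the final composition step; the field names and the
# greedy selection rule are illustrative assumptions, not the method used in
# the dissertation.
from dataclasses import dataclass

@dataclass
class Shot:
    source: str      # originating camera, e.g. "camera-robot" or "PTZ-1"
    start: float     # start time on the activity timeline, in seconds
    duration: float  # shot length in seconds
    quality: float   # score produced by the video quality assessment step
    kind: str        # "critical", "mainline", or "fr"

def compose(shots, target_length):
    """Pick shots whose total length fits the desired video length."""
    # Critical shots are always included.
    selected = [s for s in shots if s.kind == "critical"]
    budget = target_length - sum(s.duration for s in selected)

    # Fill the remaining budget with the best-scoring mainline/FR shots.
    candidates = sorted((s for s in shots if s.kind in ("mainline", "fr")),
                        key=lambda s: s.quality, reverse=True)
    for s in candidates:
        if s.duration <= budget:
            selected.append(s)
            budget -= s.duration

    # Keep chronological order so the result follows the activity timeline.
    return sorted(selected, key=lambda s: s.start)

A knapsack-style optimization over shot durations and quality scores could replace the greedy fill; the sketch is only meant to show how the quality values and the length budget interact when the customized video is assembled.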
