  • Degree thesis

利用視訊與聲訊雙重處理進行說話者位置偵測

Detection of the location of talkers via video and audio bimodal processing

Advisor: 劉奕汶

Abstract


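In recent years, a growing body of research has combined audio and video for sound source localization, which reduces the errors that arise when audio alone is used to estimate source direction in noisy, reverberant environments. This thesis localizes talkers using two microphones and the webcam of a laptop computer. On the audio side, the angle of the sound source is estimated from the definition of a hyperbola. On the video side, faces are first detected with the face detection algorithm proposed by Viola and Jones, and then recognized with the method proposed by Turk and Pentland, which uses principal component analysis (PCA) to find the eigenfaces that distinguish each person.

The system therefore first uses video to estimate the perpendicular distance from the person to the webcam, based on the size of the detected face in the image. It then combines this distance with the source angle estimated from the hyperbola definition to obtain, when the source is a person, two-dimensional planar coordinates centered on the webcam (that is, the midpoint of the two microphones). Besides detecting the talker's position and recognizing the talker's identity, the system also uses the source angle to help the video stage detect rotated faces. Moreover, under the assumption that the source signals are mutually uncorrelated, the number of faces detected in the video and the cross-correlation function computed from the audio can be used to estimate the number of potential sound sources in an indoor environment.

Experimental measurements show that, when video is combined with audio to localize a person, the error in the two-dimensional planar coordinates does not exceed 5 cm. Furthermore, assuming that the source signals are mutually uncorrelated and that the viewing angle of the laptop webcam is limited to the range -25° to 25°, two microphones suffice to detect two sources sounding simultaneously.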

Parallel Abstract


Much recent research has investigated sound source detection that combines audio and video. Compared with audio alone, the combined approach reduces estimation error in noisy and reverberant environments. In this thesis, we design a talker detection system that uses two microphones and a web camera. For audio, we use the definition of a hyperbola to estimate the direction of sound sources relative to the microphones. For video, we use the Viola-Jones algorithm to detect faces, then use the Turk-Pentland method to find eigenfaces by principal component analysis and recognize faces with them. The location of a talking person is determined in two steps. First, we estimate the normal distance between the talker and the imaging plane of the camera from the size of the talker's face in the image. Then we obtain an estimate of the talker's two-dimensional location by combining this distance with the angle of the talker relative to the camera (that is, the center of the two microphones). Because video and audio information are used jointly, the system can identify the talker, and the audio-derived angle makes face detection more robust to rotation. In addition, when there are multiple talkers in the room, the number of sound sources can be estimated under the assumption that the sources are uncorrelated, either by counting the faces in the video or by computing the cross-correlation function between the signals from the two microphones. Experiments showed that the error in the estimated location of a single talker is less than 5 cm. Experiments with two simultaneous talkers demonstrated that, in principle, two microphones alone suffice to detect two sources as long as the sources are uncorrelated.
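The two-step localization described in the abstract can be illustrated with a minimal sketch. It is not the thesis implementation: the far-field relation sin(theta) = c * tdoa / d (the asymptote direction of the hyperbola whose foci are the two microphones), the pinhole-camera distance formula, the assumed real face width of 0.15 m, the focal length in pixels, and all function names are assumptions introduced here for illustration only.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s; assumed room-temperature value


def estimate_tdoa(left, right, sample_rate, max_lag):
    """Return t_left - t_right in seconds: the lag of `left` behind `right`
    that maximizes their cross-correlation (searched over +/- max_lag samples)."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(
            right[i] * left[i + lag]
            for i in range(len(right))
            if 0 <= i + lag < len(left)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


def angle_from_tdoa(tdoa, mic_spacing):
    """Far-field source angle in radians, measured from the camera axis and
    positive toward the right microphone (assumed convention)."""
    ratio = SPEED_OF_SOUND * tdoa / mic_spacing
    ratio = max(-1.0, min(1.0, ratio))  # guard against noise pushing |ratio| > 1
    return math.asin(ratio)


def distance_from_face_width(face_width_px, focal_length_px, real_face_width_m=0.15):
    """Pinhole-camera estimate of the normal distance to the face, in meters.
    The 0.15 m face width is a hypothetical average, not a value from the thesis."""
    return focal_length_px * real_face_width_m / face_width_px


def locate_talker(tdoa, mic_spacing, face_width_px, focal_length_px):
    """Combine the audio angle and the video distance into (x, y) coordinates
    centered on the camera, i.e. the midpoint of the two microphones."""
    theta = angle_from_tdoa(tdoa, mic_spacing)
    y = distance_from_face_width(face_width_px, focal_length_px)
    x = y * math.tan(theta)
    return x, y


# Synthetic example: an impulse reaches the left microphone 3 samples
# after the right one, so the source lies toward the right microphone.
right_signal = [0.0] * 32
left_signal = [0.0] * 32
right_signal[10] = 1.0
left_signal[13] = 1.0

tdoa = estimate_tdoa(left_signal, right_signal, sample_rate=48000, max_lag=5)
x, y = locate_talker(tdoa, mic_spacing=0.2, face_width_px=90, focal_length_px=600)
print(f"tdoa = {tdoa:.6e} s, position = ({x:.3f} m, {y:.3f} m)")
```

The brute-force correlation loop keeps the sketch free of dependencies; a practical system would more likely use an FFT-based correlation and a calibrated camera.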

Keywords

TDOA, source detection, face detection, face recognition, audio, video

