
Spoken Content Retrieval - Relevance Feedback, Graphs and Semantics

Advisor: Lin-Shan Lee

Abstract


Typical spoken content retrieval proceeds in two stages. A speech recognition engine first transcribes the spoken content in the archive into text and stores it; at retrieval time, text information retrieval methods are then applied directly to these recognition results. If the recognition engine could transcribe speech into text accurately, this architecture would of course work well; when recognition accuracy is poor, however, it inevitably causes a substantial drop in retrieval performance. The core idea of this thesis is to break through the performance limitation caused by relying entirely on recognition output, which will be a very important future direction for spoken content retrieval.

This thesis first proposes a new technique that re-estimates the acoustic model parameters of the recognition system based on user relevance feedback. Unlike conventional acoustic model training, the training objective here is improved retrieval performance, and the fact that a retrieval system is evaluated on ranked results is taken into account during model training. In addition, this thesis proposes using acoustic feature vectors directly as machine learning features, an idea successfully implemented within the pseudo relevance feedback framework. Next, to compensate for the information lost during recognition, this thesis proposes improving spoken content retrieval with acoustic feature similarity; this idea can be applied to both pseudo relevance feedback and graph-based re-ranking. Finally, although research on spoken content retrieval today still concentrates on improving spoken term detection, this thesis further considers semantic retrieval, whose goal is to find semantically relevant spoken documents rather than merely documents containing the query terms. A method is proposed that uses acoustic feature similarity to improve the accuracy of term frequency estimation, which in turn improves the language modeling retrieval approach, document expansion, and query expansion techniques for semantic retrieval.
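The standard two-stage architecture described above (transcribe first, then apply text retrieval to the transcripts) can be sketched as follows. This is a minimal illustration only; the segment identifiers, transcripts, and function names are invented for the example and are not from the thesis.

```python
# Stage 1 is assumed done: `transcripts` stands in for the one-best ASR
# output of each spoken segment. Stage 2 is plain text retrieval over it.

def build_index(transcripts):
    """Map each term to the set of segments whose transcript contains it."""
    index = {}
    for seg_id, text in transcripts.items():
        for term in text.split():
            index.setdefault(term, set()).add(seg_id)
    return index

def retrieve(index, query):
    """Return segments containing every query term (text IR on ASR output)."""
    result = None
    for term in query.split():
        hits = index.get(term, set())
        result = hits if result is None else result & hits
    return sorted(result or set())

transcripts = {
    "seg1": "spoken content retrieval with lattices",
    "seg2": "acoustic model training",
    "seg3": "graph based retrieval of spoken documents",
}
index = build_index(transcripts)
print(retrieve(index, "spoken retrieval"))  # → ['seg1', 'seg3']
```

The weakness the thesis targets is visible here: a segment whose transcript is misrecognized simply never enters the index for the right terms, so retrieval quality is capped by recognition accuracy.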

Abstract (English)


Multimedia content over the Internet is very attractive, and the spoken part of such content very often carries the core information. Spoken content retrieval will therefore be very important in helping users retrieve and browse efficiently across the huge quantities of multimedia content in the future. Typical spoken content retrieval approaches consist of two stages. In the first stage, the audio content is recognized into text symbols by an Automatic Speech Recognition (ASR) system based on a set of acoustic models and language models. In the second stage, after the user enters a query, the retrieval engine searches through the recognition output and returns to the user a list of relevant spoken documents or segments. If the spoken content could be transcribed into text with very high accuracy, the problem would naturally reduce to text information retrieval. However, the inevitably high recognition error rates for spontaneous speech, produced under a wide variety of acoustic conditions and linguistic contexts, make this impossible in practice. In this thesis, the standard two-stage architecture is completely broken: the two stages of recognition and retrieval are considered jointly as a whole, and a set of approaches going beyond retrieving over recognition output is developed. This idea is very helpful for spoken content retrieval and may become one of the main future directions in this area.

To consider recognition and retrieval as a whole, it is proposed to adjust the acoustic model parameters by borrowing techniques from discriminative training, but based on user relevance feedback. Retrieval-oriented acoustic model re-estimation differs from conventional acoustic model training for speech recognition in at least two ways:

1. The training information includes only whether a spoken segment is relevant to a query or not; it does not include the transcription of any utterance.
2. The goal is to improve retrieval performance rather than recognition accuracy.

A set of objective functions for retrieval-oriented acoustic model re-estimation is proposed to take the properties of retrieval into consideration. Some previous work in spoken content retrieval has taken advantage of the discriminative capability of machine learning methods. Unlike previous work that derives features from the recognition output, here acoustic vectors such as MFCCs are taken as the features for discriminating relevant from irrelevant segments, and they are successfully applied in the scenario of Pseudo Relevance Feedback (PRF). The recognition process can be considered a kind of "quantization", in which acoustic vector sequences are quantized into word symbols. Because different vector sequences may be quantized into the same symbol, much of the information in the spoken content may be lost during speech recognition. In this thesis, information taken directly from the acoustic vector space is used to compensate for the recognition output. This is realized either by PRF or by a graph-based re-ranking approach that considers the similarity structure among all the retrieved segments. The approach is successfully applied not only to word-based but also to subword-based retrieval systems, and it improves the results for Out-of-Vocabulary (OOV) queries as well. This thesis mainly considers the task of Spoken Term Detection (STD), in which the goal is simply to return the spoken segments containing the query terms. Although most work in spoken content retrieval continues to focus on STD, this thesis also considers a more general task: retrieving the spoken documents semantically related to the query, whether or not the query terms appear in them.
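The graph-based re-ranking idea can be illustrated with a toy score-propagation sketch. Everything below is an assumption for illustration: the first-pass scores and pairwise acoustic similarities are made-up numbers (in practice they would come from the retrieval engine and from comparing acoustic feature sequences of the segments), and the propagation rule is one simple instance of re-ranking over a similarity graph, not the thesis's exact formulation.

```python
# Segments acoustically similar to high-scoring segments get boosted,
# compensating for information lost in recognition.

def propagate(scores, sim, alpha=0.5, iterations=100):
    """Each segment keeps (1 - alpha) of its first-pass score and receives
    alpha times the score its neighbors spread along similarity-weighted edges."""
    n = len(scores)
    out = [sum(row) for row in sim]  # total outgoing similarity per segment
    current = list(scores)
    for _ in range(iterations):
        incoming = [0.0] * n
        for j in range(n):
            if out[j] == 0:
                continue
            for i in range(n):
                incoming[i] += current[j] * sim[j][i] / out[j]
        current = [(1 - alpha) * scores[i] + alpha * incoming[i]
                   for i in range(n)]
    return current

first_pass = [0.9, 0.3, 0.2]   # segment 1 initially outranks segment 2
acoustic_sim = [               # segments 0 and 2 are acoustically similar
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.1],
    [0.9, 0.1, 0.0],
]
reranked = propagate(first_pass, acoustic_sim)
# segment 2 now outranks segment 1: it sounds like the top-scoring segment 0
```

With alpha = 0 the first-pass ranking is returned unchanged; larger alpha trusts the acoustic similarity structure more.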
Taking the ASR transcriptions as text, techniques developed for text-based information retrieval, such as latent semantic analysis and query expansion, can be applied directly to this task. However, the inevitable recognition errors in the ASR transcriptions degrade the performance of these techniques. For more robust semantic retrieval of spoken documents, the expected term frequencies derived from the lattices are enhanced by acoustic similarity with a graph-based approach. The enhanced term frequencies improve the performance of the language modeling retrieval approach, of document expansion techniques based on latent semantic analysis, and of query expansion methods that consider both words and latent topic information.
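The language modeling retrieval approach mentioned above can be sketched as a query-likelihood scorer over expected term frequencies. The fractional counts and document names below are invented for illustration (in the thesis they would be expected counts derived from recognition lattices, possibly after the graph-based enhancement); the smoothing shown is standard Jelinek-Mercer interpolation with a collection model, one common instance of this family.

```python
import math

def query_log_likelihood(query_terms, doc_tf, collection_tf, lam=0.5):
    """Query likelihood with Jelinek-Mercer smoothing; doc_tf may hold
    fractional expected counts rather than integer term frequencies."""
    doc_len = sum(doc_tf.values())
    coll_len = sum(collection_tf.values())
    score = 0.0
    for term in query_terms:
        p_doc = doc_tf.get(term, 0.0) / doc_len if doc_len else 0.0
        p_bg = collection_tf.get(term, 0.0) / coll_len if coll_len else 0.0
        p = (1 - lam) * p_doc + lam * p_bg
        score += math.log(p) if p > 0.0 else float("-inf")
    return score

# Illustrative expected term frequencies (fractional, as summed over lattice paths).
doc_a = {"spoken": 0.8, "retrieval": 1.5, "model": 0.2}
doc_b = {"acoustic": 1.0, "model": 1.1}
collection = {"spoken": 0.8, "retrieval": 1.5, "model": 1.3, "acoustic": 1.0}

query = ["spoken", "retrieval"]
score_a = query_log_likelihood(query, doc_a, collection)
score_b = query_log_likelihood(query, doc_b, collection)
```

Because the collection model smooths unseen terms, a document lacking a query term still receives a finite score, and the ranking reflects how well each document's (expected) term distribution explains the query; more accurate expected counts therefore translate directly into better rankings.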

