

Improved Language Modeling Approaches for Mandarin Broadcast News Extractive Summarization

Advisor: 顏嗣鈞 (Hsu-Chun Yen)
Co-advisor: 許聞廉 (Wen-Lian Hsu)

Abstract


With the rapid growth of the Internet and the arrival of the big-data era, automatic summarization has become a popular research topic in recent years. Extractive summarization selects, according to a predefined summarization ratio, a set of salient sentences from a text document or spoken document to represent the gist or main themes of the original document. Prior work has shown that the language modeling (LM) framework, coupled with the Kullback-Leibler (KL) divergence measure for selecting important sentences, achieves promising results on both text and speech summarization tasks. Building on this framework, this dissertation proposes several improved language modeling methods. First, we investigate the impact of sentence clarity information on speech summarization, and further leverage clarity cues to reinterpret how important and representative sentences can be appropriately selected. Second, we study adaptation methods for sentence models: drawing on the notion of relevance, we re-estimate each sentence's language model using its own relevance information, so that the model more precisely reflects the sentence's semantic content. Third, we propose using overlapped clustering to additionally capture pairwise sentence relatedness; moreover, the overlapped clusters can serve as sentence priors, enabling more effective selection of salient sentences for the summary. Finally, conventional language models consider only individual words when constructing sentence models, ignoring long-span dependencies among words; we therefore propose using proximity and position information to construct sentence models, with the aim of improving summarization performance. The speech summarization experiments are conducted on the Mandarin broadcast news corpus (MATBN) from Taiwan's Public Television Service, transcribed with a Mandarin large vocabulary continuous speech recognition (LVCSR) system. Experimental results show that, compared with several existing unsupervised summarization methods, our proposed methods provide significant performance improvements.
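The LM-plus-KL-divergence selection criterion underlying this framework can be sketched as follows. This is a minimal illustrative sketch, not the thesis's actual implementation: the token lists, the Jelinek-Mercer smoothing weight `lam`, and the toy sentences are all assumptions made for demonstration.

```python
from collections import Counter
import math

def unigram_lm(tokens, background, lam=0.5):
    """Unigram model over the background vocabulary, Jelinek-Mercer
    smoothed with the background (collection) model."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: lam * counts[w] / total + (1.0 - lam) * background[w]
            for w in background}

def kl_divergence(p, q):
    """KL(P || Q) over a shared vocabulary; assumes q[w] > 0 for all w."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def rank_sentences(doc_sentences, lam=0.5):
    """Rank sentences (token lists) of one document by
    KL(document model || sentence model).  A smaller divergence means the
    sentence's word distribution is closer to the whole document's,
    making it a better extractive-summary candidate."""
    all_tokens = [t for s in doc_sentences for t in s]
    background = {w: c / len(all_tokens)
                  for w, c in Counter(all_tokens).items()}
    doc_model = unigram_lm(all_tokens, background, lam=1.0)  # unsmoothed doc model
    scored = [(kl_divergence(doc_model, unigram_lm(s, background, lam)), i)
              for i, s in enumerate(doc_sentences)]
    return sorted(scored)

# Toy example: the third sentence covers the document's vocabulary
# most evenly, so it attains the smallest divergence and ranks first.
doc = [["apple", "apple", "banana"], ["cherry"], ["apple", "banana", "cherry"]]
ranking = [i for _, i in rank_sentences(doc)]  # → [2, 0, 1]
```

The summary is then formed by taking the top-ranked sentences until the predefined summarization ratio is reached.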

English Abstract


Extractive speech summarization aims to select an indicative set of sentences from a spoken document so as to succinctly cover the most important aspects of the document, and has garnered much research attention over the years. In this dissertation, we cast extractive speech summarization as an ad-hoc information retrieval (IR) problem and investigate various language modeling (LM) methods for important sentence selection. The main contributions of this dissertation are four-fold. First, we propose a novel clarity measure for use in important sentence selection, which helps quantify the thematic specificity of each individual sentence and serves as a crucial indicator orthogonal to the relevance measure provided by LM-based methods. Second, we explore a novel sentence modeling paradigm built on the notion of relevance, where the relationship between a candidate summary sentence and the spoken document to be summarized is unveiled through different granularities of context for relevance modeling. In addition, not only lexical but also topical cues inherent in the spoken document are exploited for sentence modeling. Third, we explore a novel approach that generates overlapped clusters to extract sentence relatedness information from the document to be summarized, which can be used not only to enhance the estimation of various sentence models but also to exploit sentence-level structural relationships for better summarization performance. Fourth, we explore several effective formulations of proximity cues, and propose a position-aware language modeling framework that uses various granularities of position-specific information for sentence modeling.
Extensive experiments are conducted on a Mandarin broadcast news summarization dataset with a Mandarin large vocabulary continuous speech recognition (LVCSR) system, and the empirical results demonstrate the performance merits of our methods when compared to several existing well-developed and/or state-of-the-art methods.
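The clarity measure mentioned in the first contribution can be illustrated as the KL divergence between a sentence's smoothed language model and a background collection model: generic sentences sit close to the background and score near zero, while topically specific ones diverge. The collection counts, smoothing weight, and example sentences below are illustrative assumptions, not the thesis's actual formulation or data.

```python
from collections import Counter
import math

def clarity(sentence, collection_counts, lam=0.8):
    """Clarity of a sentence: KL divergence between its smoothed unigram
    model and the collection (background) model.  A high score means the
    sentence's word distribution is far from the generic background,
    i.e. the sentence is thematically specific."""
    bg_total = sum(collection_counts.values())
    bg = {w: c / bg_total for w, c in collection_counts.items()}
    counts = Counter(sentence)
    n = len(sentence)
    score = 0.0
    for w, bg_p in bg.items():
        # Jelinek-Mercer smoothing keeps every probability positive.
        p = lam * counts[w] / n + (1.0 - lam) * bg_p
        score += p * math.log(p / bg_p)
    return score

# A sentence of rare, topical words scores higher than one of stopwords.
coll = Counter({"the": 50, "of": 30, "news": 10, "acoustic": 5, "model": 5})
specific = clarity(["acoustic", "model", "acoustic"], coll)
generic = clarity(["the", "of", "the"], coll)  # specific > generic
```

Used alongside the relevance score, such a measure lets the summarizer prefer sentences that are both close to the document and thematically focused.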

