A Study of a Non-Corpus-Based Automatic Summarization System for Chinese Documents

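This study applies Probabilistic Latent Semantic Analysis (PLSA) to single-document automatic summarization. PLSA is based on the aspect model, a statistical model that can be used to analyze the co-occurrence of terms and sentences. PLSA has been shown to outperform Latent Semantic Analysis (LSA) in automatic indexing; automatic summarization is a new application proposed in this study. In previous research, most automatic summarization systems were built with corpus-based techniques. Their drawback is that training requires human-written summaries, and when a new topic emerges there are too few documents to learn from, so good summaries cannot be produced. The PLSA summarization system in this study takes a non-corpus-based approach, and the experiments compare it with two other non-corpus-based summarization techniques, LSA and Relevance Measure (RM). Using articles from New Taiwan Magazine (新台灣週刊) as the documents to be summarized, the RM summarizer achieved the best results, the PLSA summarizer came second, and the LSA summarizer performed worst.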
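As an illustration of the aspect model mentioned above, the joint probability of a sentence s and a term w can be written as P(s, w) = Σ_z P(z) P(s|z) P(w|z), with the parameters fitted by EM on the sentence-term count matrix of a single document. The following is a minimal sketch under stated assumptions, not the modified PLSA proposed in this study, whose details the abstract does not give. The function names `plsa` and `summarize`, the random initialization, and the rule of picking the top sentence per latent topic are all hypothetical choices made for this example. The input is assumed to be already word-segmented and counted, with at least one nonzero count.

```python
import random


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(s, w) = sum_z P(z) P(s|z) P(w|z) by EM.

    counts[s][w] is the number of times term w occurs in sentence s.
    Returns (p_z, p_s_given_z, p_w_given_z).
    """
    rng = random.Random(seed)
    n_sent, n_term = len(counts), len(counts[0])

    def normalized(values):
        total = sum(values)
        return [v / total for v in values]

    # Assumption: uniform P(z) and random positive starting distributions.
    p_z = [1.0 / n_topics] * n_topics
    p_s_z = [normalized([rng.random() + 0.1 for _ in range(n_sent)])
             for _ in range(n_topics)]
    p_w_z = [normalized([rng.random() + 0.1 for _ in range(n_term)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        acc_z = [0.0] * n_topics
        acc_s = [[0.0] * n_sent for _ in range(n_topics)]
        acc_w = [[0.0] * n_term for _ in range(n_topics)]
        for s in range(n_sent):
            for w in range(n_term):
                n = counts[s][w]
                if n == 0:
                    continue
                # E-step: posterior P(z | s, w), up to normalization.
                joint = [p_z[z] * p_s_z[z][s] * p_w_z[z][w]
                         for z in range(n_topics)]
                total = sum(joint)
                if total == 0.0:
                    continue
                for z in range(n_topics):
                    r = n * joint[z] / total
                    acc_z[z] += r
                    acc_s[z][s] += r
                    acc_w[z][w] += r
        # M-step: re-estimate the three distributions from expected counts.
        grand = sum(acc_z)
        for z in range(n_topics):
            if acc_z[z] > 0.0:
                p_s_z[z] = [v / acc_z[z] for v in acc_s[z]]
                p_w_z[z] = [v / acc_z[z] for v in acc_w[z]]
            p_z[z] = acc_z[z] / grand
    return p_z, p_s_z, p_w_z


def summarize(counts, n_topics, n_pick):
    """Hypothetical selection rule: for each topic, most probable first,
    take the not-yet-chosen sentence with the highest P(s|z)."""
    p_z, p_s_z, _ = plsa(counts, n_topics)
    chosen = []
    for z in sorted(range(n_topics), key=lambda z: -p_z[z]):
        if len(chosen) == n_pick:
            break
        for s in sorted(range(len(counts)), key=lambda s: -p_s_z[z][s]):
            if s not in chosen:
                chosen.append(s)
                break
    return sorted(chosen)  # restore document order
```

Because each topic contributes at most one sentence in this sketch, `summarize` returns at most `n_topics` sentence indices even when `n_pick` is larger.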

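Keywords

Automatic document summarization; non-corpus-based automatic document summarization; probabilistic latent semantic analysis; latent semantic analysis; relevance measure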

Parallel Abstract

In this research, we apply Probabilistic Latent Semantic Analysis (PLSA) to single-document summarization. PLSA is based on the aspect model, which can be used to analyze the co-occurrence of terms and sentences. PLSA has already been shown to perform better than Latent Semantic Analysis (LSA) in automatic indexing; here we attempt to apply it to the automatic summarization problem. In the literature, most automatic summarizers are corpus-based. However, a corpus-based automatic summarizer requires many documents and human-written summaries for training, and it is hindered by the shortage of training documents on emerging topics. We therefore adopt a non-corpus-based technique and propose a modified PLSA for building the summarizer. The performance of the PLSA summarizer was compared with that of the LSA and Relevance Measure (RM) summarizers. On New Taiwan Magazine data, the RM summarizer performed best, the PLSA summarizer ranked second, and the LSA summarizer performed worst.
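For readers unfamiliar with the RM baseline, the sketch below shows one common formulation of relevance-measure selection: score each sentence by the inner product of its term-frequency vector with the document's, take the best one, remove that sentence's terms from the document vector, and repeat. This is an illustrative assumption, not necessarily the exact configuration evaluated in this study. The function name `rm_summarize` is hypothetical, raw term counts stand in for whatever term weighting the experiments used, and the sentences are assumed to be already word-segmented.

```python
from collections import Counter


def rm_summarize(sentences, n_pick):
    """Greedy relevance-measure selection.

    sentences: list of token lists (already segmented).
    Returns the indices of the chosen sentences in selection order.
    """
    # Document vector: raw term frequencies over all sentences (assumption).
    doc = Counter(t for terms in sentences for t in terms)
    candidates = set(range(len(sentences)))
    chosen = []

    def score(i):
        # Inner product of the sentence vector with the current document vector.
        tf = Counter(sentences[i])
        return sum(c * doc[t] for t, c in tf.items())

    while candidates and len(chosen) < n_pick:
        # sorted() makes ties resolve to the earliest sentence.
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        candidates.remove(best)
        # Drop the covered terms so later picks favor new content.
        for t in set(sentences[best]):
            del doc[t]
    return chosen


example = [
    ["plsa", "summary", "model"],
    ["plsa", "model", "aspect", "model"],
    ["weather", "rain"],
]
print(rm_summarize(example, 2))  # prints [1, 2]
```

In the example, sentence 1 scores highest at first; once its terms are removed from the document vector, sentence 2 outscores sentence 0 because its terms are still uncovered.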

Parallel Keywords

Automatic Summarization; Non-Corpus-Based Automatic Summarization; Probabilistic Latent Semantic Analysis; Latent Semantic Analysis; Relevance Measure

