發展以三元組為基礎的知識圖譜與文章摘要萃取技術

On the Development of Knowledge Graph and Text Summarization Technology Based on Triplet Extraction

Advisor: 藍俊宏
This thesis will be available for download on 2026/09/25.

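Abstract


Advances in information technology have made data collection so easy that people can no longer digest information as fast as it is generated. Extracting useful information quickly and accurately from massive amounts of data is therefore important in every field.

This thesis targets extractive summarization and summary knowledge graphs, and proposes a generalized, improved workflow for extractive summarization. Without relying on a complete dictionary for any language or domain, it builds a temporary dictionary from the article itself and uses N-grams to find keywords. These keywords yield the triplets required by a knowledge graph, that is, the subject-verb-object (SVO) structure found in languages such as Chinese. Finally, the frequencies of keywords and triplets serve as weights for selecting key terms and sentences, which are then assembled into an extractive summary.

To validate the method, the thesis tests it on content-farm articles, 27 academic theses, and 18 journal papers. ROUGE-1, ROUGE-2, and ROUGE-L are computed against each article's original abstract and compared with the results of TextRank. On the 27 theses, which average 60,000 words each and mix Chinese and English text, the mean ROUGE-1, ROUGE-2, and ROUGE-L scores are 0.44, 0.18, and 0.37 whether or not stop words are removed, roughly three times those of TextRank. Each thesis is processed within 29 seconds, about five times faster than TextRank's 142 seconds. Journal papers and content-farm articles show similar results. After summarization, the triplets are drawn as a knowledge graph that visualizes the summary of a single article, so readers can grasp the relationships among its keywords more efficiently.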
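The frequency-weighted selection step described above can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names (`split_sentences`, `tokens`, `build_keywords`, `summarize`), the choice of 2-grams and 3-grams, the `min_count` threshold, and the naive sentence splitter are all assumptions made for illustration, and the triplet (SVO) weighting is omitted because the abstract does not specify how triplets are extracted.

```python
import re
from collections import Counter

def split_sentences(text):
    # Naive splitter (assumption): break on Chinese and English terminators.
    # It will also split on decimal points, which a real pipeline must handle.
    parts = re.findall(r"[^。!?!?.]+[。!?!?.]?", text)
    return [p.strip() for p in parts if p.strip()]

def tokens(text):
    # Dictionary-free tokenization: Latin words and digits stay whole,
    # each CJK character becomes its own token.
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())

def ngrams(toks, n):
    return [tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def build_keywords(sentences, n_values=(2, 3), min_count=2):
    # "Temporary dictionary": n-grams that repeat inside this one article.
    counts = Counter()
    for sentence in sentences:
        toks = tokens(sentence)
        for n in n_values:
            counts.update(ngrams(toks, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}

def summarize(text, top_k=2, n_values=(2, 3)):
    sentences = split_sentences(text)
    keywords = build_keywords(sentences, n_values)

    def score(sentence):
        # Weight of a sentence = summed article-level frequency of its keywords.
        toks = tokens(sentence)
        return sum(keywords.get(g, 0) for n in n_values for g in ngrams(toks, n))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)[:top_k]
    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(ranked)]

if __name__ == "__main__":
    demo = "知識圖譜很重要。今天天氣好。知識圖譜有三元組。"
    print(summarize(demo, top_k=2))
```

Because the keyword list is derived only from repeated n-grams in the input, the same code runs on Chinese, English, or mixed text without a language-specific dictionary, which is the property the abstract emphasizes.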

Parallel Abstract


With the advancement of information technology, data collection has become extremely easy. As a result, readers understand information far more slowly than data is generated. In any domain, extracting useful information quickly and accurately from massive amounts of data is both challenging and crucial for modern readers.

This thesis focuses on extractive summarization and knowledge graph-based summary visualization, and proposes a generalized processing framework for extractive summarization. Without introducing a complete dictionary for a language or domain, a customized dictionary is learned directly from the content of the article. N-grams are then applied to find keywords, which are the input for generating the triplets, i.e., subject-verb-object (SVO) structures, required by the knowledge graph. Finally, the frequencies of keywords and triplets are used as weights to select keywords and key sentences, which are aggregated into the extractive summary.

To verify the proposed method, articles from content farms, 27 academic theses, and 18 journal papers are processed. ROUGE-1, ROUGE-2, and ROUGE-L are calculated against the abstract of the original article, and the results are compared with those of TextRank. For the 27 selected theses, which average 60,000 words each and mix Chinese and English text, the average ROUGE-1, ROUGE-2, and ROUGE-L scores are 0.44, 0.18, and 0.37 whether or not stop words are removed, around three times those of TextRank. Each summary can be extracted within 29 seconds, about five times faster than TextRank. The extracted triplets are then used to visualize the summary as a knowledge graph, which provides a more efficient way to understand the relationships between article keywords.
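For readers unfamiliar with the evaluation metrics, the following is a minimal sketch of how ROUGE-1, ROUGE-2, and ROUGE-L can be computed from token lists. It reports the F1 variant, which is an assumption: the abstract does not say whether recall, precision, or F1 was used, and the helper names (`ngram_counts`, `f1`, `rouge_n`, `lcs_length`, `rouge_l`) are hypothetical rather than taken from the thesis or from any ROUGE package.

```python
from collections import Counter

def ngram_counts(tokens, n):
    # Multiset of n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def f1(overlap, candidate_total, reference_total):
    # Harmonic mean of precision and recall; 0.0 when nothing overlaps.
    if overlap == 0 or candidate_total == 0 or reference_total == 0:
        return 0.0
    precision = overlap / candidate_total
    recall = overlap / reference_total
    return 2 * precision * recall / (precision + recall)

def rouge_n(candidate, reference, n):
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    # Clipped overlap: each n-gram counts at most as often as in both texts.
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))

def lcs_length(a, b):
    # Longest common subsequence via row-by-row dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))

if __name__ == "__main__":
    candidate = "the cat sat on the mat".split()
    reference = "the cat is on the mat".split()
    print(rouge_n(candidate, reference, 1),
          rouge_n(candidate, reference, 2),
          rouge_l(candidate, reference))
```

In the setting described above, `reference` would be the tokenized original abstract and `candidate` the tokenized extractive summary; how the mixed Chinese and English text is tokenized affects the scores and is not specified in the abstract.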

References


[English-language references]
1. Duan, X., Yu, H., Yin, M., Zhang, M., Luo, W., Zhang, Y. (2019). Contrastive attention mechanism for abstractive sentence summarization. arXiv preprint arXiv:1910.13114.
2. Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. Addison-Wesley.
3. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P. S. (2021). A survey on knowledge graphs: Representation, acquisition, and applications. IEEE Transactions on Neural Networks and Learning Systems.
4. Lin, C. Y. (2004, July). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out (pp. 74-81).
