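A Study of Dynamic Clustering and Multi-Document Summarization Based on the Document Warehousing Concept: The Case of Chinese Electronic News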

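As the number of electronic documents grows explosively, organizing them efficiently so that they can later be browsed and queried quickly has become an urgent problem in knowledge management. Traditional full-text retrieval, built on an inverted index file, often returns a large and disorganized set of documents, which must be filtered further before the truly useful ones are found. This mode of use no longer meets users' need for fast browsing and querying. In this paper, we apply the concept of a document warehouse to store documents in structured form and combine it with a multi-dimensional query mechanism to find related documents, on which we study multi-document summarization and dynamic clustering. We validate the overall concept by implementing the DNCSS system (Dynamic News Clustering and Summarization System). We carry over the way a data warehouse handles numerical data to document data: the document warehouse uses the structured information contained in documents for storage, search, and integration, and supports multi-dimensional queries. Dynamic clustering then helps users organize the results returned by queries against the document warehouse. Finally, a multi-document summarization system produces one multi-document summary for each resulting cluster, so that users can browse the essential content of a document set and obtain useful information more efficiently. We use online news documents from the major news sites in Taiwan as a test case. Manual evaluation gave quite positive results, indicating that this work does help users obtain documents that meet their needs quickly and effectively.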

Keywords

Information retrieval; document warehouse; multi-document summarization; document clustering

Parallel Abstract

As electronic documents proliferate, contemporary knowledge management needs a mechanism for integrating and sorting large volumes of documents so that they can be browsed quickly and queried efficiently. Traditional full-text search systems are usually based on an inverted index, and the results they return are often large in volume and unsorted. This makes it hard for users to identify the useful information in the collection. For document searching over the Internet, such systems therefore no longer satisfy users' needs. In this paper, we propose a general framework for document clustering and multi-document summarization based on the concept of document warehousing. Based on this framework, we have implemented a prototype system, named DNCSS (Dynamic News Clustering and Summarization System), as the test bed for our approach. The system adopts the concept of document warehousing, which models text-oriented documents along multiple dimensions. The resulting document warehouse serves as the main repository of the system and organizes document structure information flexibly for searching and querying. The documents retrieved from the document warehouse are then clustered to give the results a more organized structure. Finally, the system generates a multi-document summary for each cluster so that users can find distilled information more efficiently. We collected online news from the major news sites in Taiwan as test data to verify the effectiveness of the system. The evaluation results show that our approach relieves users of reading large amounts of related news and helps them reach the necessary conclusions effectively.

Parallel Keywords

Information Retrieval; Document Warehouse; Multi-Document Summarization; Document Clustering
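
The three stages named in the abstract (multi-dimensional query over a document warehouse, dynamic clustering of the query results, and one summary per cluster) can be illustrated with a minimal sketch. This is not the DNCSS implementation, which the abstract does not specify. The dimensions (`source`, `date`, `topic`), the single-pass cosine clustering, the term-frequency sentence scoring, the threshold value, and the sample data are all assumptions chosen for illustration. The sample text is English so that a simple regular expression can tokenize it; real Chinese news would need a word segmenter in place of `tokens`.

```python
from collections import Counter
from dataclasses import dataclass
import math
import re


@dataclass
class Doc:
    source: str  # hypothetical dimension: news outlet
    date: str    # hypothetical dimension: publication date
    topic: str   # hypothetical dimension: news section
    text: str


def query(warehouse, **dims):
    """Multi-dimensional query: keep documents matching every given dimension value."""
    return [d for d in warehouse if all(getattr(d, k) == v for k, v in dims.items())]


def tokens(text):
    """Toy tokenizer for English sample text (assumption, not a Chinese segmenter)."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each entry: {"centroid": Counter, "docs": [Doc, ...]}
    for d in docs:
        vec = Counter(tokens(d.text))
        best = max(clusters, key=lambda c: cosine(vec, c["centroid"]), default=None)
        if best is not None and cosine(vec, best["centroid"]) >= threshold:
            best["docs"].append(d)
            best["centroid"].update(vec)  # centroid is the summed term counts
        else:
            clusters.append({"centroid": Counter(vec), "docs": [d]})
    return [c["docs"] for c in clusters]


def summarize(docs, n=1):
    """Extractive summary: the n sentences with the highest average term frequency."""
    freq = Counter(t for d in docs for t in tokens(d.text))
    sentences = [s.strip() for d in docs
                 for s in re.split(r"(?<=[.!?])\s+", d.text) if s.strip()]

    def score(sentence):
        ts = tokens(sentence)
        return sum(freq[t] for t in ts) / len(ts) if ts else 0.0

    return sorted(sentences, key=score, reverse=True)[:n]


# Made-up sample data, for illustration only.
warehouse = [
    Doc("OutletA", "2003-05-01", "economy",
        "The central bank cut interest rates. Markets rose after the rate cut."),
    Doc("OutletB", "2003-05-01", "economy",
        "Interest rates were cut by the central bank today."),
    Doc("OutletA", "2003-05-01", "economy",
        "Typhoon warnings closed several ports. Shipping was delayed by the typhoon."),
    Doc("OutletB", "2003-05-01", "sports",
        "The home team won the baseball final."),
]

hits = query(warehouse, date="2003-05-01", topic="economy")
groups = cluster(hits)
for group in groups:
    print(len(group), "document(s):", summarize(group))
```

Clustering runs on the query result rather than on the whole warehouse, which is one way to read "dynamic" clustering: the groups depend on what the user asked for. A stop-word list and a weighting such as TF-IDF would likely improve both the clusters and the summaries in practice.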

