
以模糊理論為基礎之中文文件多重分類方法

Fuzzy-Based Multi-Categorization of Chinese Documents

Abstract


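In recent years, the spread of the Internet has led to documents circulating rapidly and in large volumes. Knowledge workers consequently face information overload: if documents that accumulate over time are not classified and organized, later searches for a needed document will take considerable time. This problem has made information retrieval a widely discussed topic, where the goal is usually to match a user's information need closely against a large body of existing data so that the retrieval results quickly satisfy that need. Classifying documents in advance can therefore speed up retrieval and improve its accuracy. In the real world, however, many human thought processes are quite "fuzzy", so the information sought in practice often carries a degree of fuzziness. Moreover, a single document may touch on several different topics, and the predefined categories may not be fully independent of one another, so assigning each document to exactly one category is not necessarily reasonable. In this paper, we therefore use information retrieval techniques, grounded in fuzzy set theory, to perform a reasonable multi-categorization of each document through Fuzzy Information Retrieval Categorization, taking Chinese documents as the analysis target. Assigning a document to several categories at once not only improves retrieval efficiency but also prepares the ground for building a document warehouse on which text mining can later be performed. We tested multi-categorization on the proceedings of an information management conference and evaluated the classification method's performance using metrics such as precision rate and recall rate. The results show that the proposed method achieves quite good precision and recall.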

Parallel Abstract


With the proliferation of the Internet, documents have been shared rapidly online in recent years. However, this has also left knowledge workers suffering from information overload. When a set of documents is not suitably categorized, searching it tends to be time-consuming. To overcome this problem, effective information retrieval has been studied extensively, usually with the objective of efficiently matching documents to users' requirements. Nevertheless, in real-world applications, people tend to use "fuzzy" terms to express their thinking. In addition, a document may involve several topics, so it is better assigned to multiple categories. In this paper, we propose a multiple categorization approach based on fuzzy set theory. We employ a Fuzzy Information Retrieval Categorization approach to classify a Chinese document into multiple categories. The result is expected to better reflect the nature of the documents. Furthermore, it can serve as a basis for constructing a document warehouse for text mining. Finally, we implemented our approach and tested its precision rate and recall rate on conference papers on information management topics. The results show that our approach is feasible and effective.
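The abstract does not give the membership function or the decision rule, so the following is only a minimal sketch of the general idea, written under stated assumptions: each document gets a membership degree in [0, 1] for every category, is assigned to all categories whose degree reaches a threshold, and the assignment is scored by precision and recall. The function names, the min-overlap membership formula, the threshold `alpha`, and the toy term weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, Set, Tuple


def memberships(doc_terms: Dict[str, float],
                category_terms: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return a membership degree in [0, 1] for each category.

    Assumed formula (not from the paper): the sum of
    min(document weight, category weight) over the document's terms,
    divided by the sum of the document's term weights.
    """
    total = sum(doc_terms.values())
    degrees = {}
    for category, terms in category_terms.items():
        overlap = sum(min(w, terms.get(t, 0.0)) for t, w in doc_terms.items())
        degrees[category] = overlap / total if total > 0 else 0.0
    return degrees


def assign(degrees: Dict[str, float], alpha: float) -> Set[str]:
    """Assign the document to every category whose degree is at least alpha."""
    return {c for c, mu in degrees.items() if mu >= alpha}


def precision_recall(predicted: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision and recall of the predicted category set for one document."""
    hits = len(predicted & relevant)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy data: term weights are invented purely for illustration.
doc = {"fuzzy": 0.5, "retrieval": 0.3, "database": 0.2}
categories = {
    "IR": {"retrieval": 0.6, "fuzzy": 0.4},
    "DB": {"database": 0.7},
    "Networks": {"routing": 0.9},
}

degrees = memberships(doc, categories)
predicted = assign(degrees, alpha=0.15)
print(degrees)
print(predicted)
print(precision_recall(predicted, relevant={"IR", "Networks"}))
```

Because the threshold is applied per category, one document can land in several categories at once, which is the multi-categorization behavior the abstract describes; raising `alpha` trades recall for precision.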

