Fuzzy-Based Multi-Categorization of Chinese Documents


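In recent years, the spread of the Internet has caused documents to circulate rapidly and in large volumes. Knowledge workers consequently face information overload: if documents that accumulate day after day are not classified and organized, later searches for a needed document will take considerable time. This problem has made information retrieval a widely discussed topic, whose goal is usually to match users' information needs against large amounts of existing data so that retrieval results quickly satisfy those needs. If documents can be classified in advance, retrieval becomes both faster and more accurate. In the real world, however, many human thought processes are quite "fuzzy", so the information to be retrieved in practice often carries a degree of fuzziness. Moreover, a document's content may touch on several topics, or the predefined categories may not be fully independent of one another, so assigning each document to a single category is not necessarily reasonable. In this thesis, we therefore use information retrieval techniques, grounded in fuzzy set theory, to perform reasonable multiple categorization of Chinese documents through "Fuzzy Information Retrieval Categorization". Assigning a document to several categories at once not only improves retrieval efficiency but also prepares the ground for building a document warehouse on which text mining can later be performed. We test multiple categorization on the proceedings of an information management conference and evaluate the classification method with indicators such as precision rate and recall rate. The evaluation shows that the proposed method achieves fairly high precision and recall.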


Thanks to the proliferation of the Internet, documents have been shared rapidly and in large volumes over the past few years. As a result, knowledge workers suffer from information overload: without suitable categorization, searching an ever-growing set of documents becomes time-consuming. To address this problem, effective information retrieval has been studied extensively, usually with the objective of efficiently matching documents to users' requirements. In real-world applications, however, people tend to express their thinking in 'fuzzy' terms. Besides, a document may involve several topics, so it is often better assigned to multiple categories than to a single one. In this paper, we propose a multiple-categorization approach based on fuzzy set theory. We employ a Fuzzy Information Retrieval Categorization approach to classify a Chinese document into multiple categories, which yields a result that better reflects the nature of the documents. The result can also serve as a basis for constructing a document warehouse for text mining. Finally, we implemented our approach and tested its precision rate and recall rate on conference papers on information management topics. The results show that our approach is feasible and effective.
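To make the idea of multiple categorization and its evaluation concrete, the following is a minimal Python sketch. It is an illustration, not the method developed in this work. It rests on several assumptions: each document and each category is represented as a fuzzy set of terms with weights in [0, 1], the membership degree is taken as a normalized fuzzy intersection, and a document is assigned to every category whose degree reaches a threshold `alpha`. The function names, the sample categories, the term weights, and the threshold are all hypothetical.

```python
def membership(doc_terms, category_terms):
    """Degree in [0, 1] to which a document belongs to a category.

    Assumption: both arguments map a term to a weight in [0, 1], and the
    degree is the fuzzy intersection (sum of minimum weights) divided by
    the total weight of the category's terms.
    """
    total = sum(category_terms.values())
    if total == 0:
        return 0.0
    overlap = sum(min(w, doc_terms.get(t, 0.0)) for t, w in category_terms.items())
    return overlap / total


def categorize(doc_terms, categories, alpha=0.5):
    """Return every category whose membership degree is at least alpha."""
    return {
        name
        for name, terms in categories.items()
        if membership(doc_terms, terms) >= alpha
    }


def precision_recall(assigned, relevant):
    """Precision and recall of the assigned category set for one document."""
    hits = len(assigned & relevant)
    precision = hits / len(assigned) if assigned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical categories and term weights, for illustration only.
categories = {
    "data mining": {"mining": 1.0, "cluster": 0.8},
    "e-commerce": {"commerce": 1.0, "customer": 0.6},
    "security": {"encryption": 1.0, "attack": 0.9},
}
doc = {"mining": 0.9, "customer": 0.6, "commerce": 0.5, "cluster": 0.4}

assigned = categorize(doc, categories)
precision, recall = precision_recall(assigned, {"data mining"})
```

In this sketch the sample document reaches the threshold for two of the three categories, so it is placed in both rather than forced into one. Raising `alpha` makes assignment stricter, which tends to trade recall for precision.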


