A Study of Automatic Text Classification Techniques and Performance Evaluation

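Corporate knowledge bases must process tens of thousands of documents every day. Whether external (competitors' web pages, industry analyses, customer requirements) or internal (financial statements, technical documents, patent documents), all of this is information essential to running the business. Collecting, filtering, and filing such a large volume of material consumes considerable time and labor, which gives rise to the corporate need for automatic text classification. How to use automated techniques to assist manual classification quickly and effectively, and so cope with large numbers of unclassified documents, has become an important issue in information services and knowledge management. Whether the category structure of the knowledge base fits the company's needs, whether the collected training texts are representative, and whether the classification criteria are applied consistently can all affect classification results. In addition, how to select key terms so that classification runs more efficiently, how much the system needs to understand about the documents to be classified, and how to balance speed against accuracy must all be considered when building an automatic text classification system. This study uses the Sinica Corpus, a balanced corpus of Chinese, as its experimental data and implements an automatic text classification system. It compares the classification performance of machine learning and non-machine learning methods. It also evaluates how different levels of knowledge about the documents to be processed affect classification performance. Finally, statistical methods for measuring corpus similarity are applied to the document classes as a preliminary assessment of whether the predefined classes, or the texts collected in them, are appropriate.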

Keywords

automatic text classification; machine learning; corpus similarity; corpus homogeneity

Parallel Abstract

Corporate knowledge bases have to process tens of thousands of text documents every day. These include competitors' information, industry analysis reports, and customer requirements from outside the corporation, as well as financial statements, technical documents, and patents from inside it, all of which are crucial for business operation. However, collecting, filtering, and filing this material are time-consuming and labor-intensive tasks. Hence, automatic text classification is needed. How to employ automatic techniques to support manual classification and to cope with large quantities of classification tasks has become an important issue in information services and knowledge management. The appropriateness of the knowledge base's category hierarchy, the representativeness of the texts in each class, and the consistency of the classification criteria all affect the performance of text classification. In addition, the method of selecting key terms, the level of understanding of unseen texts, and the balance between speed and accuracy should be taken into consideration when constructing an automatic text classification system. In this research, an automatic text classification system is implemented, with texts gathered from the Sinica Corpus. Machine learning methods and non-machine learning methods are compared. The effect of varying the level of understanding of the texts is also measured. Furthermore, methods for measuring corpus similarity and homogeneity are applied to the classes in order to assess the appropriateness of the predefined classes and of the texts in those classes.

Parallel Keywords

automatic text classification; machine learning; corpus similarity; corpus homogeneity
