  • Thesis

Chinese Document Classification: A Case Study of an IC Equipment Manufacturer

Advisor: 陸承志


The IC equipment manufacturing industry has grown alongside Taiwan's IC manufacturing industry, moving from low-end products with modest precision requirements to today's fast, high-precision products, and has accumulated more than 40 years of experience along the way. The knowledge documents built up over that time are poorly classified and hard to search. In recent years, as enterprises have moved to e-business, these documents have been digitized, but they still have to be classified by hand. Manual classification is inconsistent, because different people make different decisions, and it is slow and costly in labor hours. This study therefore proposes a method based on the vector space model (VSM) to classify large volumes of internal enterprise documents automatically. The VSM is chosen because it allows different weighting schemes to be combined to improve accuracy: the method combines term-frequency weights with a term uniformity (breadth) weight that adjusts each term's class weight, and then adds features specific to the documents. In the experiments, the VSM with term-frequency weights alone reached a classification accuracy of at most 68.93%; adding the uniformity weight raised accuracy to 76.42%, and adding the document-specific features raised it to 86.62%. These results confirm that combining several weighting factors improves classification performance.
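The first two stages described in the abstract can be illustrated with a minimal sketch. It is not the thesis's actual implementation: the abstract gives no formulas, so the breadth (uniformity) weight here is assumed to be the share of a class's documents that contain the term, and classification is assumed to use cosine similarity against per-class term vectors. The function names, the sample tokens, and the class labels are all hypothetical, and the documents are assumed to be already segmented into word tokens. The third stage, document-specific features, is left out because the abstract does not say what those features are.

```python
import math
from collections import Counter, defaultdict


def train(docs):
    """Build one weighted term vector per class from (tokens, label) pairs."""
    tf = defaultdict(Counter)   # label -> term -> total occurrences in the class
    df = defaultdict(Counter)   # label -> term -> number of class docs containing it
    n_docs = Counter()          # label -> number of documents in the class
    for tokens, label in docs:
        n_docs[label] += 1
        tf[label].update(tokens)
        df[label].update(set(tokens))
    model = {}
    for label in tf:
        # Frequency weight scaled by the ASSUMED breadth weight:
        # the fraction of the class's documents that contain the term.
        model[label] = {
            term: count * df[label][term] / n_docs[label]
            for term, count in tf[label].items()
        }
    return model


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(model, tokens):
    """Return the class whose vector is most similar to the document."""
    query = Counter(tokens)
    return max(model, key=lambda label: cosine(query, model[label]))


# Hypothetical, pre-segmented training documents.
train_docs = [
    (["wafer", "etch", "chamber", "pressure"], "process"),
    (["etch", "chamber", "plasma"], "process"),
    (["invoice", "payment", "vendor"], "purchasing"),
    (["vendor", "quote", "payment", "invoice"], "purchasing"),
]
model = train(train_docs)
print(classify(model, ["chamber", "etch", "recipe"]))  # prints "process"
```

In this sketch a term that appears in every document of a class keeps its full frequency weight, while a term concentrated in a single document is discounted, which is one plausible reading of how a breadth weight could adjust a term's class weight.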


