文件探勘是資料探勘加上一些基礎的語言學所構成的。文件探勘運用的技術,幾乎都與詞彙的頻率與出現篇數有關,但這兩項資訊在資料探勘中卻極少用到。目前在文件聚類的研究中,已經發展出許多不同的聚類演算法,不同的聚類方式對於聚類的成效也有所不同,其中較常被使用的是K-means非階層式聚類演算法,但是K-means聚類在K值的選取上卻是隨機的,因此容易受到資料的離群值所影響,導致聚類的成效不佳。 本研究中,吾人提出以階層式聚類的方式,將實驗資料進行聚類,找出合適的群集數與初始值,改善非階層式聚類K-means演算法的缺點,使聚類的成效能夠有所提升,並加速K-means演算法收斂的速度。而本研究也將採用相對比較的方式,過濾不必要的特徵詞彙,及使用階層式聚類法來控制聚類的品質,使得文件聚類的精確度能夠有良好的表現。
Text mining combines data mining with some basic linguistics. Nearly all text mining techniques rely on term frequency and document frequency (the number of documents in which a term appears), two kinds of information that are rarely used in data mining. Many clustering algorithms have been developed in document clustering research, and their effectiveness varies. Among them, the non-hierarchical k-means algorithm is one of the most commonly used, but the value of k is chosen at random, so the algorithm is easily affected by outliers in the data and clustering quality suffers. To address this weakness of k-means, we propose first clustering the experimental data hierarchically in order to find a suitable number of clusters and suitable initial values, which improves clustering effectiveness and speeds up the convergence of k-means. In addition, we use a relative comparison to filter out unnecessary feature terms, and we use hierarchical clustering to control clustering quality, so that document clustering achieves good precision.
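The idea of seeding k-means from a hierarchical pass can be illustrated with a minimal sketch. It is not the implementation used in this study: the centroid linkage, the Euclidean distance, the merge `threshold`, the function names `hierarchical_seeds` and `kmeans`, and the toy term-frequency vectors are all assumptions made for illustration only.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def hierarchical_seeds(vectors, threshold):
    """Agglomerative clustering with centroid linkage (an assumed choice).

    Repeatedly merges the two clusters whose centroids are closest and stops
    once the closest pair is farther apart than `threshold` (a hypothetical
    stopping rule). The surviving clusters give both the number of clusters
    and the initial centroids for k-means.
    """
    clusters = [[v] for v in vectors]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair found so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = math.dist(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [centroid(c) for c in clusters]


def kmeans(vectors, seeds, max_iter=100):
    """Lloyd's k-means started from the given seeds instead of random ones.

    Returns (labels, centroids, iterations), where iterations counts the
    assignment passes performed, including the pass that detected no change.
    """
    centroids = [list(s) for s in seeds]
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = [
            min(range(len(centroids)), key=lambda k: math.dist(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments are stable: converged
            return labels, centroids, iteration
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = centroid(members)
    return labels, centroids, max_iter


# Toy term-frequency vectors for six documents over a four-term vocabulary.
docs = [
    [3, 2, 0, 0], [2, 3, 0, 0], [3, 3, 0, 1],
    [0, 0, 3, 2], [0, 1, 2, 3], [0, 0, 3, 3],
]
seeds = hierarchical_seeds(docs, threshold=2.5)
labels, centers, iterations = kmeans(docs, seeds)
print("clusters:", len(seeds), "labels:", labels, "iterations:", iterations)
```

Because the seeds already sit near the final centroids, k-means in this sketch needs very few passes. The threshold plays the role of the quality control on the hierarchical step, and in practice it would have to be tuned to the data.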