
Clustering Flow-Type Data Using Agglomerative Hierarchical Clustering

Agglomerative Hierarchical clustering with the string data

Advisor: 曾富祥

Abstract


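Advances in technology have caused the volume of data to grow rapidly. Data mining is an effective way to organize very large collections of data, allowing managers to extract relevant information from the data and make appropriate decisions. Cluster analysis is one of the methods most commonly used in data mining, and it groups data according to their features. The data types most often handled in cluster analysis are qualitative (categorical) data and quantitative (numerical) data, whereas flow-type data, or string data, have rarely been discussed. In this study we therefore propose feasible clustering methods for flow-type data (string data).

To measure similarity we adopt two measures, Jaro similarity and edit distance, where a larger distance indicates a smaller similarity. From the similarity or distance so defined we build a similarity matrix and use it to cluster the data. We use agglomerative hierarchical clustering, including the single-linkage (minimum distance), complete-linkage (maximum distance), and average-linkage methods. In agglomerative hierarchical clustering, every record starts in its own cluster; the most similar clusters are merged one pair at a time until all records belong to a single cluster. The advantages of hierarchical clustering are that the analyst can choose the number of clusters and that the dendrogram makes the sequence of merges easy to follow.

All case data examined in this study are flow-type data (string data). We use three examples: two are benchmark datasets widely used by other researchers, and the third consists of the parts awaiting repair that are generated when engines undergo overhaul. Because different parts pass through different repair stations, each part has its own repair flow. The main aim of this study is to solve the problem of measuring similarity between flow-type (string) records, so that the data can be clustered by similarity and managers can schedule repair work or make other decisions based on the clustering results.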
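The two measures named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions: "edit distance" is taken to mean the standard Levenshtein distance with unit costs, Jaro similarity follows its usual textbook definition, and the function names `edit_distance` and `jaro_similarity` are hypothetical. The thesis may define or normalize either measure differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed variant): the minimum number of
    single-character insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # delete ca
                            curr[j - 1] + 1,     # insert cb
                            prev[j - 1] + cost)) # substitute or keep
        prev = curr
    return prev[-1]


def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters count as matching only if they are equal and lie
    # within this window of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo, hi = max(0, i - window), min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order,
    # counted in pairs.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


print(edit_distance("kitten", "sitting"))              # 3
print(round(jaro_similarity("MARTHA", "MARHTA"), 3))   # 0.944
```

Edit distance is already a distance, while Jaro is a similarity. Turning one into the other (for example, using 1 minus the Jaro similarity as a distance) is a common convention but an assumption here, since the abstract does not say which conversion is used.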

Parallel Abstract


With advances in science and technology, the volume of data is growing rapidly. Data mining helps us organize large amounts of data efficiently, so that managers can discover information they did not previously know and make appropriate decisions. Cluster analysis, which groups data according to their features, is one of the most widely used methods in data mining. Most data used in cluster analysis are qualitative or quantitative; string data (flow data) are seldom discussed. In this research we therefore propose clustering methods for string data. For the similarity measure we adopt two measures: Jaro similarity and edit distance. The larger the distance, the smaller the similarity. From the similarity or distance so defined we obtain a similarity matrix, and the data are clustered on the basis of this matrix. We consider agglomerative hierarchical clustering with single linkage, complete linkage, and average linkage to group string data. At the start of agglomerative clustering each string is in its own cluster, so every cluster contains exactly one string. The most similar clusters are then merged, and after a series of merge operations all strings end up in the same cluster. The advantages of hierarchical clustering are that we can choose the number of clusters ourselves and that the hierarchical tree shows the clustering steps clearly. We present our methodology with three examples, all consisting of string data: two benchmark examples and an engine parts dataset. Because different parts pass through different repair workstations, every part has its own repair procedure. Our study focuses on the problem of computing the similarity between strings. We want to cluster the string data so that the clustering result can help the workstations work efficiently.
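The merge procedure described in the abstract can be sketched as follows, assuming a precomputed pairwise distance matrix as input. The function name `agglomerate`, the four route strings, and the matrix `D` are hypothetical illustrations, not data or code from the thesis. The matrix holds the unit-cost edit distances between those four strings. This naive version recomputes every cluster-to-cluster distance at each step, which is fine for small examples but slow for large datasets.

```python
from itertools import combinations


def agglomerate(dist, linkage="single", n_clusters=1):
    """Bottom-up clustering on a symmetric distance matrix.

    Returns (clusters, merges): the clusters left when n_clusters remain,
    and the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    if linkage not in ("single", "complete", "average"):
        raise ValueError(f"unknown linkage: {linkage}")

    def between(c1, c2):
        # All pairwise distances between members of the two clusters.
        pairs = [dist[i][j] for i in c1 for j in c2]
        if linkage == "single":        # shortest distance between members
            return min(pairs)
        if linkage == "complete":      # longest distance between members
            return max(pairs)
        return sum(pairs) / len(pairs) # average distance between members

    clusters = [(i,) for i in range(len(dist))]  # every item starts alone
    merges = []
    while len(clusters) > max(n_clusters, 1):
        # Pick the closest pair of clusters (first pair wins on ties).
        c1, c2 = min(combinations(clusters, 2), key=lambda p: between(*p))
        merges.append((c1, c2, between(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return clusters, merges


# Hypothetical repair routes; each letter stands for one workstation.
routes = ["ABCD", "ABCE", "XYZ", "XYW"]
# Edit distances between the routes above, in the same order.
D = [
    [0, 1, 4, 4],
    [1, 0, 4, 4],
    [4, 4, 0, 1],
    [4, 4, 1, 0],
]

clusters, merges = agglomerate(D, linkage="single", n_clusters=2)
print([[routes[i] for i in c] for c in sorted(clusters)])
# [['ABCD', 'ABCE'], ['XYZ', 'XYW']]
```

The `merges` list records the same information a dendrogram displays: which clusters were joined, in what order, and at what distance. Stopping at `n_clusters` corresponds to cutting the tree at the chosen number of groups.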

