類別型資料庫之叢集化技術

近年來，由於使用者需要從大量的資料中擷取有用的資訊，資料庫知識發掘(KDD)已受到廣大的矚目。而其中，資料叢集分析(Clustering)是KDD中常用來分析資料群聚現象的方法。給定一群資料點，資料叢集分析根據定義好的相似測定方式，嘗試將相似的資料群聚成一個叢集而把不相似的資料點分在不同的叢集。在本論文中，我們針對類別型的資料庫進行叢集分析時所面臨的一些問題進行研究。在現實資料中，我們常常會遇到類別型的屬性，例如：消費記錄、網頁瀏覽記錄、或網頁文件，都是類別型的資料。過去研究類別型資料庫叢集分析主要皆是針對一固定的資料庫進行資料叢集分析演算法之開發，而比較少有研究是針對大型類別型資料庫之叢集分析執行效率、資料庫動態改變、以及流動概念進行討論與研究。因此，類別型資料庫叢集分析仍然有相當多值得挑戰的研究議題。針對執行效率的問題上，取樣技術是大家所公認可以降低資料量因而有效增進叢集分析效率的方法。然而，當我們採用取樣技術時，未被取樣到的資料點在正常的流程下是無法得知他們應該屬於哪一個叢集之中。因此，我們在本論文提出一個架構，稱為最大相似資料標記法(Maximal Resemblance Data Labeling)，將之前未被取樣的資料點分到適合的叢集之中。此架構主要是根據我們所新定義的類別型叢集表示法：N節點集合重要度表示法(N-Nodeset Importance Representative)，來衡量資料點與群集之間的相似程度，並找出最相似的叢集將資料點分入該叢集中。除了當資料量龐大時進行叢集分析所要面臨的效率問題之外，另一個設計叢集分析演算法會面臨的問題是資料庫中的資料會不停的新增及刪除。當每次有新的資料新增進資料庫時，對整個資料庫重新進行叢集分析是相當沒有效率的方法。因此，叢集分析演算法需要能夠遞增地處理新增進來的資料。在本論文中，我們設計了另一個架構針對類別型資料提供遞增叢集分析方法。本方法利用之前所提出的節點重要度表示法來快速處理新資料新增進資料庫時該分入哪個叢集中。另外，跟傳統的遞增叢集分析不同的是，我們也採用了漸進式資料探勘常用的感興趣區間觀念(period of interest)來追蹤最近的叢集分析結果。最後，我們針對流動概念對叢集分析所造成的問題進行研究。資料背後的概念通常會隨著時間的演進而改變，因此，整個資料叢集分析的結果也應該隨著時間演進不同而跟著改變。我們稱這種概念隨時間演進的現象稱為流動概念(drifting concept)。如果進行叢集分析時不考慮流動概念，不僅叢集分析的品質不好，也不能滿足使用者期望得到最新資料的叢集分析結果。因此，我們提出了一個架構可以偵測流動概念的現象，根據目前的流動概念產生叢集分析的結果，並且嘗試分析不同叢集分析結果之間的關係藉此觀察流動概念的現象。根據這個架構，我們可以根據不同的概念產生較佳的叢集分析結果，並且藉由觀察叢集分析結果變化的情形分析資料變化的趨勢。

關鍵字

資料探勘；類別型資料叢集分析；叢集表示法；資料標記；遞增叢集分析；流動概念

並列摘要

In recent years, Knowledge Discovery in Databases (KDD) has attracted a large amount of attention because of the need for mining the useful information and knowledge from a great deal of data. Data clustering is one of the frequently used data mining techniques in KDD for exploratory data analysis. Given a set of data objects, the problem of clustering is to partition data objects into groups in such a way that objects in the same group are similar to each other while objects in different groups are dissimilar from each other according to the predefined similarity measurement. In this dissertation, we focus on the problem of performing clustering on categorical databases. Categorical attributes prevalently exist in real data. For example, buying records, web logs, and web documents, are all categorical data. Previous works on clustering categorical data focused on doing clustering on the entire data set, and did not fully explore the issues of the execution efficiency, the incremental updates, and the drifting-concepts. Therefore, the problem of clustering categorical databases remains as a challenging issue. On the problem of execution efficiency, sampling has been recognized as an important technique to improve the efficiency of clustering. However, with sampling applied, those points which are not sampled will not have their labels after the normal process. Therefore, a framework named Maximal Resemblance Data Labeling is proposed to allocate each unlabeled data point into the corresponding appropriate cluster based on the novel categorical clustering representative, namely, N-Nodeset Importance Representative, which represents clusters by the importance of the combinations of attribute values. In addition to the difficulty of processing the tremendous data volume, another challenge in the design of modern clustering algorithms is that various dynamic updates of deletions and insertions are applied to the huge database. It is impractical to re-apply data clustering on the entire database whenever there are new data points added into the database. Therefore, clustering algorithms should operate incrementally. We devise another framework which performs incremental clustering on categorical data. In this work, the practical clustering representative Node Importance Representative which is the simplest version of N-Nodeset Importance Representative is utilized for capturing the characteristics of clusters. Moreover, the concept of period of interest utilized in the progressive mining is adopted to continuously trace the latest clustering result. Finally, we focus on the problem of drifting-concepts. The concepts which we try to learn from the data typically drift with time, and the underlying clusters may also change considerably with time. Performing clustering on the entire time-evolving data not only decreases the quality of clusters but also fails to meet the expectations of users which usually require recent clustering results. Therefore, we propose a framework which detects the drifting-concepts at different windows, and generating the clustering result based on the current concept, and also, shows the relationship between clustering results by the visualization. The framework is composed of two algorithms: Drifting Concept Detection algorithm detecting the changes of cluster distributions between the current window and the last clustering result, and Cluster Relationship Analysis algorithm analyzing the relationship between clustering results at different time. Based on this framework, we can obtain the clustering results with better quality and capture the time-evolving trend in the data set by analyzing the evolving clustering results.

並列關鍵字

Data Mining ； Categorical Data Clustering ； Data Labeling ； Cluster Representative ； Incremental Clustering ； Drifting Concepts

參考文獻

[1] Data generator: Perfect data for an imperfect world. http://www.datasetgenerator.com.

[4] C. C. Aggarwal and P. S. Yu. A framework for clustering massive text and categorical data streams. In Proceedings of the SIAM Conference on Data Mining (SDM), 2006.

[5] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 1993.

[8] P. Berkhin. Survey of clustering data mining techniques. Technical report, Technical report, Accrue Software, 2002.

[11] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In Proceedings of the SIAM Conference on Data Mining (SDM), 2006.

國際替代計量

類別型資料庫之叢集化技術

全文下載

主題瀏覽