
基於資料格式及語義之使用者引導式資料勘測

User Guided Data Mining Based on Data Syntax and Semantics

Advisor: 陳銘憲

Abstract


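Advances in computing technology have made many data processing applications possible and have led to ever-growing volumes of collected data. User guidance on data syntax and data semantics is important for obtaining meaningful and useful mining results. Data syntax and semantics take many forms. In data stream applications, temporal syntax, such as a sliding window size based on the length of the query history, affects the synopsis quality attainable under a fixed storage budget. In addition, attribute semantics expressed as user preference ranks affect the usability of cluster analysis. Furthermore, associations between semantic units can be explored to enhance the performance of classification tasks. Developing efficient user-guided data mining techniques based on such data syntax and semantics remains a challenging task.

The temporal syntax embedded in time-evolving data streams can be modeled as sliding windows. Because of the dynamic nature of data streams, a sliding window is used to generate synopses that approximate the most recent data within the retrospective horizon, in order to answer user queries or discover data patterns. We propose a new method, the Sliding Dual Tree (SDT), to generate dynamic synopses that adapt to insertions and deletions within the retrospective horizon. By exploiting the properties of the Haar wavelet transform, we develop several operations that incrementally maintain the SDT over consecutive time windows in a time- and space-efficient manner. These operations work directly in the transformed time-frequency domain, without storing or reconstructing the original data. As our detailed analysis shows, SDT greatly reduces the resources needed to generate synopses and, in terms of retrospective horizon length and quality measures, maximizes the storage utilization of wavelet synopses.

To account for attribute semantics in cluster analysis, conventional clustering algorithms partition the input data set by combining all attributes of the data items to produce a (dis)similarity matrix between data items. How to explicitly guide clustering based on user perception in a flexible way remains a challenging task. We therefore propose a new clustering framework called Progressive Clustering, which lets users express their clustering expectations by assigning ranks to the data attributes. At each rank, the set of attributes with an earlier or equal rank forms the base space, while the set of attributes at the next rank forms the enhancement space. Clustering is then realized by progressively integrating the information in each enhancement space into the clusters in the base space. The goal of Progressive Clustering is to produce clusters that are compact in the base space and whose corresponding dissimilarity is minimized in the enhancement space. As a result, the clustering results match user perception and are easy for users to interpret.

Concept detection in multimedia data has recently been proposed to address the semantic gap in video indexing. Associations between semantic units can be viewed as hidden data semantics in the concept annotations of a video database. We propose a general post-filtering framework that uses concept association and temporal analysis. We propose an entropy-function-based method that combines related concept classifiers from the discovered inter-concept and temporal association rules. Experimental results show that our framework effectively improves the accuracy of concept detection in video data.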

Parallel Abstract


Advances in computing power have enabled many data processing applications, and an ever-increasing amount of data is being collected. User guidance on data syntax and semantics is essential for obtaining meaningful and useful mining results. Data syntax and semantics come in many forms. In data stream applications, temporal syntax, such as a sliding window size chosen according to the user's preferred length of query history, affects the attainable quality of the synopses under a fixed storage budget. In addition, attribute semantics expressed as user preference ranks affect the usefulness of clustering analysis. Furthermore, associations between semantic units can be exploited to boost the performance of classification tasks. It remains challenging to develop effective user-guided data mining techniques based on such data syntax and semantics.

Temporal syntax embedded in time-evolving data streams can be modeled as sliding windows. Because data streams are dynamic, a sliding window is used to generate synopses that approximate the most recent data within the retrospective horizon, in order to answer queries or discover patterns. We propose a novel approach, the Sliding Dual Tree (SDT), to generate dynamic synopses that adapt to insertions and deletions within the retrospective horizon. By exploiting the properties of the Haar wavelet transform, we develop several operations that incrementally maintain the SDT over consecutive time windows in a time- and space-efficient manner. These operations work directly in the transformed time-frequency domain, without storing or reconstructing the original data. As our analysis shows, SDT greatly reduces the resources required for synopsis generation and maximizes the storage utilization of wavelet synopses in terms of retrospective horizon length and quality measures.
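To make the notion of a wavelet synopsis under a storage budget concrete, the following is a minimal sketch, not the SDT itself. It applies an orthonormal Haar transform to a single fixed window and keeps only the largest-magnitude coefficients. The function names (`haar_transform`, `build_synopsis`, `reconstruct`) and the power-of-two window length are assumptions made for illustration. The incremental SDT maintenance operations are not specified in the abstract and are not reproduced here.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_transform(values):
    """Orthonormal Haar transform of a list whose length is a power of two."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("window length must be a power of two")
    coeffs = [float(v) for v in values]
    length = n
    while length > 1:
        half = length // 2
        step = [0.0] * length
        for i in range(half):
            a, b = coeffs[2 * i], coeffs[2 * i + 1]
            step[i] = (a + b) / SQRT2          # scaled average
            step[half + i] = (a - b) / SQRT2   # scaled detail
        coeffs[:length] = step
        length = half
    return coeffs


def inverse_haar(coeffs):
    """Invert haar_transform."""
    values = list(coeffs)
    n = len(values)
    length = 2
    while length <= n:
        half = length // 2
        step = [0.0] * length
        for i in range(half):
            s, d = values[i], values[half + i]
            step[2 * i] = (s + d) / SQRT2
            step[2 * i + 1] = (s - d) / SQRT2
        values[:length] = step
        length *= 2
    return values


def build_synopsis(window, budget):
    """Keep the `budget` largest-magnitude coefficients as {index: value}."""
    coeffs = haar_transform(window)
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    return {i: coeffs[i] for i in order[:budget]}


def reconstruct(synopsis, n):
    """Approximate the original window from a synopsis of length-n data."""
    coeffs = [0.0] * n
    for index, value in synopsis.items():
        coeffs[index] = value
    return inverse_haar(coeffs)
```

Because the transform is orthonormal, keeping the largest-magnitude coefficients minimizes the squared reconstruction error for a given budget. This static version recomputes the whole transform for every window, which is the cost that an incremental scheme such as SDT aims to avoid.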
To account for attribute semantics in clustering analysis, conventional clustering algorithms partition the input data set into several clusters by combining all the attributes of each data tuple to produce a (dis)similarity matrix on a tuple-by-tuple basis. How to explicitly guide clustering based on user perceptions in a flexible way remains a challenging task. We therefore propose a new clustering framework named Progressive Clustering, which allows users to express their clustering expectations by assigning ranks to the data attributes. At each rank, the set of attributes with a higher or equal rank forms the base space, while the set of attributes at the next highest rank forms the enhancement space. Clustering is then carried out progressively by integrating the information in each enhancement space with the clustering in the base space. The goal of Progressive Clustering is to generate clusters that are compact in the base space and whose corresponding dissimilarities are minimized in the enhancement space. The clustering results therefore conform to user perceptions and are readily accessible for user interpretation.

Concept detection in multimedia data has recently been proposed to deal with the semantic gap in video indexing. Associations between semantic units can be viewed as hidden data semantics in the concept annotations of a video archive. We propose a general post-filtering framework that uses concept association and temporal analysis. We propose an entropy-function-based scheme that combines related concept classifiers from the discovered inter-conceptual and temporal association rules. Our empirical studies show that the framework is effective in improving the accuracy of visual concept detection.
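Returning to Progressive Clustering, the rank-based split into base and enhancement spaces described above can be sketched as follows. This is only an illustration of how the spaces are formed at each rank. It assumes that rank 1 is the highest rank, and the function name `progressive_spaces` and the attribute names in the usage example are hypothetical. The clustering objective itself (compactness in the base space, minimized dissimilarity in the enhancement space) is not implemented, since the abstract does not specify it.

```python
from collections import defaultdict


def progressive_spaces(attribute_ranks):
    """Return a list of (base_space, enhancement_space) pairs, one per step.

    attribute_ranks maps attribute name -> rank, where rank 1 is assumed
    to be the highest (most important) rank.
    """
    by_rank = defaultdict(list)
    for name, rank in attribute_ranks.items():
        by_rank[rank].append(name)

    ordered_ranks = sorted(by_rank)
    steps = []
    base = []
    # Pair each rank with the next one: attributes seen so far form the
    # base space, attributes at the next rank form the enhancement space.
    for current, following in zip(ordered_ranks, ordered_ranks[1:]):
        base = base + sorted(by_rank[current])
        steps.append((list(base), sorted(by_rank[following])))
    return steps


if __name__ == "__main__":
    # Hypothetical attributes ranked by a user.
    ranks = {"price": 1, "location": 1, "size": 2, "age": 3}
    for base, enhancement in progressive_spaces(ranks):
        print("base:", base, "enhancement:", enhancement)
```

With a single rank there is no enhancement space, so the function returns an empty list and clustering would reduce to a conventional run over all attributes.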

