多條資料串流之相關程度分群及相似性查詢

在現今許多應用之中，資料是以串流的方式快速的產生與累積；因此，如何有效率地處理資料串流便成為一項重要的議題。處理資料串流最大的挑戰在於其龐大的量以及快速的產生與變動速度；甚者，很多應用必須同時觀測多於一條的資料串流。若能找出資料串流與串流之間的相似性關係，必能成為挖掘多條資料串流知識的一項重要資訊。因此，本篇論文主旨在討論如何尋找資料串流之間的關係，其中包括利用相關係數將資料串流分群，以及資料串流間之相似性查詢。首先我們考慮當不同的資料串流之間可能存在關聯性的情形。我們設計一個以相關係數為相似性，根據串流事件，即時地驅動且反應串流群集變化的系統。隨著時間前進各個資料串流不斷的變動，某些原本相似的資料串流，在經過一段時間可能又不再相似；因此，若能掌握資料串流的群集變化，對於線上決策分析會有相當大的幫助。相對於每固定時刻就做一次串流分群，我們設計的方法利用串流的變化事件來驅動對應的群集分割或合併。每一條串流都以線性片段來近似時，當有新的片段折點產生，代表串流有一定程度的變化，而這樣的變化可能就是串流群集改變的提示。在許多實際應用中，資料串流的取得其實是獨立且分佈在不同的位置。因此，我們探討如何直接在分散式的環境底下，根據使用者給定的參考資料串流，有效率地節省頻寬使用，找出分佈在不同地方但與參考值最相似的前k名資料串流。不同於先前將參考資料串流全部送給所有其他資料串流所在的位置，我們設計一個能有效率節省頻寬的傳送方法來做相似性查詢。基於每條資料串流使用Haar小波轉換係數來做摘要的情形下，根據此種摘要方式多重解析度的特性，將解析度由低到高，送出查詢的起始者每次只傳送同一解析度的係數給候選串流可能存在的位置。在傳送係數的過程中，查詢起始者會不斷的篩選並縮小候選人可能存在的位置範圍。為了保證篩選的過程不會有任何正確答案的遺漏，對於每一個候選資料串流都會維護一個與參考串流距離的最低與最高保證範圍。隨著此保證範圍的縮小，查詢起始者便能快速且正確的找出最後的答案。最後，我們考慮當資料串流的值含有不確定性的情形。與先前所討論的資料不同的是，資料值不再是一個定值而是一個隨機變數。於是，不確定性資料串流可視為一連串的隨機變數。那麼具不確定性資料串流之間的距離也變成了一個隨機變數。我們採用一個比較通用隨機變數模型，不需要知道變數的機率分佈情形，而只有期望值與變異數是已知的。在此變數模型下，利用機率與統計的理論，我們提出以機率方法來解決相似性查詢，並設計能有效率篩選答案的方式。此外，我們還討論如何將先前設計的方法應用在只有部份摘要的情況底下。利用控制一個可調整的機率門檻值，我們設計的方法可以針對不同的應用需求，在誤報與遺漏答案之間提供交換平衡。

關鍵字

資料串流；資料分群；分散式k 最近鄰查詢；相似性查詢；不確定性資料

並列摘要

Processing data streams has become increasingly important as more and more emerging applications are required to handle a large amount of data in the form of rapidly arriving streams. The huge amount and evolving properties make them more challenging to be processed. Moreover, in many cases, more than one data stream needs to be analyzed simultaneously. To discover knowledge from multiple data streams, it is useful if we know the cross relationship among them first. Therefore, in this dissertation, we focus on how to find the relationship between streams, including clustering by correlations and similarity searches over many streams. First, we devise a framework for Clustering Over Multiple Evolving sTreams by CORrelations and Events, which, abbreviated as COMET-CORE, monitors the distribution of clusters over multiple data streams based on their correlations. In the multiple data stream environment, where streams are evolving as time advances, some might act similarly at this moment but dissimilarly at the next moment. The information of evolving clusters is valuable to support corresponding online decisions. Instead of directly clustering the multiple data streams periodically, COMET-CORE applies efficient cluster split and merge processes only when significant cluster evolution happens. Accordingly, we devise an event detection mechanism to signal the cluster adjustments. The coming streams are smoothed as sequences of end points by employing piecewise linear approximation. At the time when end points are generated, weighted correlations between streams are updated. End points are good indicators of significant change in streams, and this is a main cause of cluster evolution event. When an event occurs, through split and merge operations we can report the latest clustering results. In many real cases, streams are collected independently in a decentralized manner. Given a reference stream, searching its most similar streams, which might exist in more than one distributed databases, is helpful for many applications. Therefore, we present LEEWAVE − a bandwidth-efficient approach to searching range-specified k-nearest neighbors among distributed streams by LEvEl-wise distribution of WAVElet coefficients. This work focuses on that when all streams are summarized using wavelet-based synopses. To find the k most similar streams to a range-specified reference one, the relevant wavelet coefficients of the reference stream can be sent to the peer sites to compute the similarities. However, bandwidth can be unnecessarily wasted if the entire relevant coefficients are sent simultaneously. Instead, we present a level-wise approach by leveraging the multi-resolution property of the wavelet coefficients. Starting from the top and moving down one level at a time, the query initiator sends only the single-level coefficients to a progressively shrinking set of candidates. In addition, we derive and maintain a similarity range for each candidate and gradually tighten the bounds of this range as we move from one level to the next. The increasingly tightened similarity ranges enable the query initiator to effectively prune the candidates without causing any false dismissal. Finally, the case when each stream is composed of uncertain values is discussed. We present PROUD - A PRObabilistic approach to processing similarity queries over Uncertain Data streams. In contrast to streams with certainty, an uncertain stream is an ordered sequence of random variables. The distance between two uncertain streams is also a random variable. We use a general uncertain data model, where only the means and the deviations of the random variables in an uncertain stream are available. Under this model, we first derive mathematical conditions for progressively prune them to reduce the computation cost. We then apply PROUD to a streaming environment where only sketches of streams, like wavelet synopses, are vailable. PROUD offers a flexible trade-off between false positives and false negatives by controlling a threshold, while maintaining a similar computation cost. This trade-off is important as in some applications false negatives are more costly, while in others, it is more critical to keep the false positives low.

並列關鍵字

Data Streams ； Clustering ； Distributed kNN Query ； Similarity Search ； Uncertain Data

參考文獻

[2] C. C. Aggarwal. On unifying privacy and uncertain data models. In IEEE ICDE, 2008.

[3] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A framework for clustering evolving data streams. In VLDB, 2003.

[4] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu. On high dimensional projected clustering of data streams. Data Mining and Knowledge Discovery, 10(3):251–273, 2005.

[5] C. C. Aggarwal and P. S. Yu. On high dimensional indexing of uncertain data. In IEEE ICDE, 2008.

[6] M. G. Albanesi, M. Ferretti, and A. Giancane. A compact wavelet index for retrieval in image database. In Proc. of IEEE Int. Conf. on Image Analysis and Processing, 1999.

國際替代計量

多條資料串流之相關程度分群及相似性查詢

全文下載

主題瀏覽