
On the Processing and Measurement of High Dimensional Data

Advisor: 陳銘憲 (Ming-Syan Chen)

Abstract


With the widespread use of computers and the Internet, and with data being stored digitally at an ever-increasing rate, techniques for building and exploiting databases that are both high dimensional and very large are in increasingly urgent demand. How to process such databases effectively remains an important and highly challenging research issue in data mining. In this dissertation we propose several processing techniques and measurement methods for high dimensional data.

First, we study the theoretical properties of the instability of distance functions in high dimensional data space and derive several sufficient and necessary conditions for such instability. Specifically, we prove that, as the dimensionality increases, each of the following is a sufficient and necessary condition for instability: (1) the coefficient of variation of the distance distribution degenerates to zero; (2) the ratio of the distances from any two data points to the query point converges in probability to 1; and (3) the second moment coefficient of the distance distribution converges to 1 (a formal statement of these conditions follows this abstract). In other words, the main cause of instability is that the variation of the distance distribution is too small. Based on these conditions, we further propose a re-design mechanism for distance functions that suffer from instability; experiments confirm that enlarging the variation of the distance distribution does suppress the instability.

In addition, the similarity measures commonly used for time series usually consider only how similar the shapes of two series are; the statistical characteristics and ordering structure of the series themselves are not integrated into the measure. We therefore use the wavelet power spectrum of a time series to capture its statistical characteristics and design a new similarity measure for time series, called WPSM. Experiments confirm that WPSM is more accurate than other commonly used measures for clustering time series, has lower time complexity, is invariant under scale and shift transformations, and can be applied to series of different lengths.

For the privacy preservation of multivariate numerical data, we apply independent component analysis to propose a simulation-based framework called PRIMP. Simulation-based privacy preservation protects individual privacy by using a simulation method to generate synthetic data whose statistical characteristics resemble those of the original data. The synthetic data produced by several well-known techniques preserve only particular statistics of the original data, for example by applying Cholesky decomposition to preserve the covariance matrix or Latin hypercube sampling (LHS) to preserve the rank covariance matrix, which limits the utility of the released data. Compared with previous simulation-based studies, PRIMP preserves more of the data's statistical characteristics in empirical experiments, and in theory it also has lower computational complexity and a very low risk of data disclosure. Furthermore, we integrate PRIMP with the Cholesky decomposition technique to correct the covariance matrix of the synthetic data generated by PRIMP, so that the covariance matrix of the original data is preserved more precisely.

PageRank is a very important page-ranking technique in today's search engines. With billions of pages on the Web, simplifying the computation of PageRank has become a pressing need. Although there are a great many pages, the link structure among them is actually quite sparse: many pages are not linked to by any other page, and many pages do not link to any other page. Exploiting this link structure, we design a two-way reordered PageRank algorithm. Compared with the traditional way of computing PageRank (the power method) and the one-way reordered PageRank algorithm proposed by Langville et al., empirical results show that the two-way reordered algorithm significantly reduces the computation time of PageRank.

Finally, we propose a subspace clustering method for spatial data, called SCI. SCI integrates grid techniques with the density characteristics of the data distribution, remedying the tendency of earlier grid-based approaches to improperly split meaningful clusters and to force cluster boundaries to be parallel to the coordinate axes. Because SCI scans the data only once, its computational complexity is linear in the number of data points; it is also invariant under scale and shift transformations and independent of the input order.

In summary, we propose processing techniques and measurement methods for several important data mining problems in high dimensional data space. On the measurement side, we analyze the theoretical properties of the instability of general distance functions and propose a re-design mechanism that helps distance functions avoid instability; for time series data, we design a similarity measure that integrates and compares the statistical characteristics of the series themselves. On the processing side, for the privacy preservation of multivariate numerical data we propose two simulation-based frameworks that preserve more of the original data's statistical properties while keeping the risk of privacy disclosure very low; we also design a highly efficient two-way reordered PageRank algorithm; and we propose an effective subspace clustering method for spatial data.
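The three equivalent conditions enumerated above can be written out formally. The notation below is one common formalization, not taken verbatim from the dissertation: $D_d$ denotes the distance from the query point $q_d$ to a random data point in $d$ dimensions, and $X_d^{(1)}, X_d^{(2)}$ are two independent data points.

\begin{align*}
\text{(1)}\quad & \lim_{d\to\infty} \frac{\operatorname{Var}(D_d)}{\mathbb{E}[D_d]^2} = 0
  && \text{(the Pearson variation, i.e.\ the squared coefficient of variation, vanishes)}\\
\text{(2)}\quad & \frac{\mathrm{dist}_d\bigl(X_d^{(1)}, q_d\bigr)}{\mathrm{dist}_d\bigl(X_d^{(2)}, q_d\bigr)} \;\xrightarrow{\,P\,}\; 1
  && \text{(the distance ratio of any two points to the query converges in probability to 1)}\\
\text{(3)}\quad & \lim_{d\to\infty} \frac{\mathbb{E}[D_d^2]}{\mathbb{E}[D_d]^2} = 1
  && \text{(the second moment coefficient converges to 1)}
\end{align*}

Note that (1) and (3) are linked by the identity $\mathbb{E}[D_d^2]/\mathbb{E}[D_d]^2 = 1 + \operatorname{Var}(D_d)/\mathbb{E}[D_d]^2$, which is why the two limits express the same requirement.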

Abstract (English)


The importance of discovering knowledge from huge, high dimensional databases has grown rapidly in recent years, which is why data mining has attracted a significant amount of research attention. Recently, the curse of dimensionality has been studied extensively in several data mining problems, such as clustering, nearest neighbor search, indexing, privacy preservation, and similarity measurement of time series, because it is critical to both the performance and the quality of data mining applications. In this dissertation, we explore several techniques for breaking the curse of dimensionality in high dimensional space. Specifically, we first derive the sufficient and necessary conditions for a distance function to be stable (or meaningful) in high dimensional data space. Further, we propose a new, efficient, and effective similarity measure for time series. We also propose two frameworks for simulation-based privacy preservation of multivariate numerical data. In addition, we propose a new reordered PageRank algorithm to speed up the computation of the PageRank vector. Finally, we develop an algorithm called SCI (Subspace Clustering based on Information), which integrates Shannon information with grid-based and density-based clustering to provide a good solution for subspace clustering of high dimensional spatial data with noise.

Recent research results have shown that, in high dimensional space, the concept of distance (proximity) may not even be meaningful. Explicitly, Beyer et al. showed that if the Pearson variation of the distance distribution converges to zero with increasing dimensionality, the distance function becomes unstable (or meaningless) in high dimensional space, even with the commonly used $L_p$ metrics on Euclidean space. With regard to quality, the design of effective distance functions has therefore been deemed a very important and challenging issue. However, the necessary condition for instability (i.e., the negation of the sufficient condition for stability) of a distance function, which is required for function design, has remained unknown. In this dissertation, we prove that the following conditions are equivalent to instability: (1) the Pearson variation of the distance distribution for any given query converges to 0 with increasing dimensionality; (2) the ratio of the distances of any two points to the query converges in probability to 1 with increasing dimensionality; and (3) the second moment coefficient converges to 1 with increasing dimensionality. With these results, we obtain the necessary and the sufficient conditions for instability, whose negations give the sufficient and necessary conditions for stability.
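As a quick numerical illustration of conditions (1) and (2), the following short Python sketch (ours, not code from the dissertation; it uses i.i.d. uniform data and the Euclidean distance) estimates the Pearson variation of the distance distribution and the farthest-to-nearest distance ratio for a random query as the dimensionality grows.

import numpy as np

rng = np.random.default_rng(0)
n = 1000  # number of data points

for d in [2, 10, 100, 1000, 10000]:
    X = rng.uniform(size=(n, d))          # i.i.d. uniform data points
    q = rng.uniform(size=d)               # a random query point
    dist = np.linalg.norm(X - q, axis=1)  # Euclidean (L2) distances to the query

    pearson_variation = dist.var() / dist.mean() ** 2  # Var(D) / E[D]^2
    ratio = dist.max() / dist.min()                     # farthest / nearest distance

    print(f"d={d:6d}  Pearson variation={pearson_variation:.5f}  "
          f"max/min distance ratio={ratio:.3f}")

# As d grows, the Pearson variation shrinks toward 0 and the max/min ratio
# approaches 1: the nearest and farthest neighbors become indistinguishable.

Typical output shows the Pearson variation dropping toward zero and the distance ratio approaching 1 as d increases, which is exactly the instability that the re-design mechanism is meant to counteract.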
Most well-known approaches adopt the Euclidean distance or its extensions to measure the similarity between time series. As mentioned above, under a broad set of conditions in high dimensional space, the Euclidean distance function becomes unstable (or meaningless). In addition, the intra-structures (statistical characteristics) of time series, such as seasonal components, nonstationary behavior, long-term memory, and heterogeneous variations, are not integrated into the Euclidean distance. Consequently, any similarity measure of time series that focuses only on inter-series shapes suffers from several disadvantages: false alarms, the need for normalization in advance, the curse of dimensionality, and the requirement that the series be of the same length.

The wavelet power spectrum can capture the statistical characteristics of a time series. In addition, it is easy to implement and very effective because its run time is linear in the length of the series. Therefore, we propose a new similarity measure for time series, called the Wavelet Power Spectrum Measure (WPSM), based on the wavelet power spectrum. This approach has several advantages, including invariance under scale and shift transformations, no need for normalization during pre-processing, and applicability to series of different lengths. Furthermore, we can integrate WPSM with the wavelet power spectra of the series to build an effective low dimensional index.

Privacy preservation is a fundamental issue in data mining and knowledge discovery. Because the internal relationships among attributes are usually complex and polymorphous, it is difficult to generate a synthetic data set that preserves the statistical characteristics of a given data set without information about the data's distribution. For the problem of simulation-based privacy preservation of multivariate numerical data, we propose two frameworks for generating a synthetic data set under the constraint that its probability distribution is as close as possible to that of the original set. The first framework, called PRIMP (PRivacy preserving by Independent coMPonents), is based on independent component analysis (ICA). We prove that the synthetic data generated by PRIMP is sufficiently different from the original data, so confidential information in the original data is protected. PRIMP is also easy to implement and very effective because, by using FastICA, its run time is linear in the size of the input. The second approach is a hybrid method that combines PRIMP with the Cholesky decomposition technique. In theory, the hybrid method preserves the covariance matrix of the original data exactly, and it also resolves the problem of generating good seeds for the Cholesky-based approach.

PageRank is one of the most important ranking techniques used in today's search engines. Since billions of pages already exist on the Web, computing the PageRank vector is a very time-consuming procedure, and accelerating its computation has become an important issue. In view of this, we propose a new reordered PageRank algorithm, called the two-way reordered PageRank algorithm, which exploits both reverse-dangling nodes and dangling nodes to speed up the computation of the PageRank vector.
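The reordering itself is not spelled out in this abstract, so the sketch below (ours) shows only the baseline that the two-way reordered algorithm accelerates: the standard power method with the usual uniform redistribution of dangling-node mass. The tiny link graph and all names are made up for illustration.

import numpy as np

def pagerank_power_method(out_links, n, alpha=0.85, tol=1e-10, max_iter=200):
    """Baseline power method for PageRank.

    out_links: dict mapping node -> list of nodes it links to (no duplicates).
    Dangling nodes (no out-links) spread their rank uniformly over all nodes.
    """
    x = np.full(n, 1.0 / n)                      # initial uniform PageRank vector
    dangling = [i for i in range(n) if not out_links.get(i)]
    for _ in range(max_iter):
        new = np.zeros(n)
        for i, targets in out_links.items():
            if targets:
                new[targets] += alpha * x[i] / len(targets)
        # dangling-node mass and teleportation, both spread uniformly
        new += (alpha * x[dangling].sum() + (1.0 - alpha)) / n
        if np.abs(new - x).sum() < tol:
            break
        x = new
    return x

# Tiny made-up link graph: node 3 is dangling (no out-links),
# node 0 has no in-links (a "reverse-dangling" node in the dissertation's terms).
links = {0: [1, 2], 1: [2], 2: [1, 3], 3: []}
print(pagerank_power_method(links, n=4).round(4))

The two-way reordered algorithm, as described above, exploits both kinds of special nodes to shrink the part of the matrix on which iterations are actually needed; the baseline sketch only shows what is being accelerated.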
Clustering large amounts of high dimensional spatial data with noise is an important challenge in data mining. In this dissertation, we present a new subspace clustering method, called SCI (Subspace Clustering based on Information), to address this problem. SCI combines Shannon information with grid-based and density-based clustering techniques. Designing a clustering algorithm amounts to constructing an equivalence relationship among data points; accordingly, we propose an equivalence relationship, named density-connected, to identify the main bodies of clusters. For noise detection and cluster-boundary discovery, we also use the grid approach to devise a new cohesive mechanism that merges border points into clusters and filters out the noise. However, the curse of dimensionality is a well-known and serious problem when a grid approach is applied to high dimensional data sets, because the number of grid cells grows exponentially with the number of dimensions.

To strike a compromise between randomness and structure, we propose an automatic method for attribute selection based on Shannon information (a simple entropy-based screening is sketched after this abstract). With the merit of requiring only one data scan, SCI is very efficient, with a run time that is linear in the size of the input data set. In addition, SCI has several advantages: it is scalable in the number of attributes, robust to noise, invariant under location-scale transformations, and independent of the input order.
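The selection criterion is not specified in this abstract, so the following is only a generic sketch of entropy-based attribute screening (the bin count, threshold, and toy data are ours): an attribute whose grid-cell histogram has near-maximal Shannon entropy looks uniform (random) along that dimension, while lower entropy signals structure worth keeping for subspace clustering.

import numpy as np

def attribute_entropies(X, n_bins=10):
    """Shannon entropy (in bits) of each attribute's equal-width histogram."""
    entropies = []
    for j in range(X.shape[1]):
        counts, _ = np.histogram(X[:, j], bins=n_bins)
        p = counts / counts.sum()
        p = p[p > 0]                              # ignore empty grid cells
        entropies.append(-(p * np.log2(p)).sum())
    return np.array(entropies)

# Made-up data: attribute 0 is clustered (structured), attribute 1 is uniform noise.
rng = np.random.default_rng(1)
X = np.column_stack([
    np.concatenate([rng.normal(0.2, 0.02, 500), rng.normal(0.8, 0.02, 500)]),
    rng.uniform(size=1000),
])

H = attribute_entropies(X)
max_H = np.log2(10)                               # entropy of a perfectly uniform histogram
selected = np.where(H < 0.9 * max_H)[0]           # keep attributes showing structure
print("entropies:", H.round(2), "selected attributes:", selected)

Here the structured attribute has markedly lower entropy than the uniform one and is the only attribute retained, which is the flavor of compromise between randomness and structure described above.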
