使用線段估計改進長條圖近似

長條圖是靠著的一些小量的函式來儲存，而且通常使用來估計資料的分佈情形。在資料庫中，記錄每一個單一屬性的長條圖可以幫助我們來評佔資料庫運算所需的代價。而為了執行查詢的最佳化，通常需要估量這種運算的代價，來決定一個有效率的查詢計畫。另外，長條圖也廣泛的使用在近似查詢系統以及資料探詢上面。這種在資料庫中事先儲存長條條圖的技術，需要用到大量的記憶體空間。所以，如何能夠把長條圖壓縮在一個固定的空間中，而使的壓縮造成的誤差為最小，已經被科學家們視為重要的問題，並且研究了多年。一般最常用在壓縮長條圖的演算法，是將長條圖切成好幾個格子，然後把每個格子裡的頻率視為相同的數目。於是，這個問題就轉變成，我們要如何去決定最佳格子的邊界而使得壓縮的誤差是最小。對於這個問題，之前的研究已經提供了些不錯的解答。然而，在現實生活中，我們所知道的許多資料分布是非常偏斜的。所以，之前的方法對於這種資料分佈並不是有太好的估計效果，因為它們沒有考慮到資料分布的傾向。在這篇論文中，我們提出一個用線段來估計每一個格子的演算法。在資料分布是非常偏斜或是在現實生活的資料中，它可以更為準確的來壓縮長條圖。我們做了一連串的實驗，並且從實驗的結果中顯示，我們的方法在偏斜的資料分布下是更為精準的。

關鍵字

長條圖；線段；近似

並列摘要

Histograms are popularly used to approximate data distribution by a small number of step functions. Maintaining histograms for every single attribute in databases helps us estimate the cost of database operations. The query optimizers usually require such estimation of cost to decide an efficient access query plan. Histograms are also widely used in approximate query answering systems and data mining. The techniques that store precomputed histograms in the database require some overhead of memory consumption. Therefore, the problem of compressing the histogram in a fixed amount of space with the least error is considered as an important issue and has been investigated by researchers for many years. The most common algorithm to compress the histogram is to divide the histogram into buckets and estimate every bucket by uniform representation. The problem becomes how to choose the bucket boundaries to minimize the estimation error for a given number of buckets. The pervious approach has provided a desirable solution to this problem of bucket boundaries selection. However, many data distributions in real-life are well known to be extremely skewed. The pervious algorithms do not perform well for the real data because they do not consider the tendency of data distribution. In this paper, we propose an algorithm that utilizes a line segment to estimate each bucket in replace of uniform representation. The algorithm can construct the histogram more precisely when the data distribution is skewed and is more suitable for the real-world data. We performed a series of experiments, and the results show that our method has better accuracy when the data distribution is skewed.

並列關鍵字

Histogram ； Line ； Approximate

參考文獻

[4] M. N. Garofalakis and P. B. Gibbons. Wavelet Synopses with Error Guarantees. Proceedings of ACM SIGMOD, 2002.

[5] M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. Proceedings of ACM SIGMOD, Santa Barbara, May 2001

[6] S. Guha and N. Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and performance evaluation. Proceeding of ICDE, 2002

[7] S. Guha, N. Koudas, and K. Shim. Data-Streams and Histograms. In Proc. of STOC, 2001

[8] Y. Ioannidis. and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of ACM SIGMOD, pages 233-244, May 1995

被引用紀錄

郭山本（2014）。我國青少年橄欖球選手運動傷害調查研究〔碩士論文，長榮大學〕。華藝線上圖書館。https://doi.org/10.6833/CJCU.2014.00118

國際替代計量

使用線段估計改進長條圖近似

全文下載

主題瀏覽