多維度立方體之資料處理技術

多維度資料立方體(Data Cubes)在現今資料倉儲系統中，已成為重要的元件。而以此發展的線上分析處理技術(OLAP, On-line Analytical Processing)也成為實作決策支援系統(DSS, Decision Support Systems)的重要方式。在此論文中，主要的研究議題有三:一、透過選擇適當的實體化檢視表進行合併，使線上分析處理系統在查詢處理時間及儲存空間上取得平衡；二、提出多維度資料立方體的壓縮、查詢處理及誤差估計技術；三、研發多維度資料串流上的近似處理技術。今日的資料倉儲系統通常面臨了需處理巨量資料及極複雜的使用者查詢，所以一個理想的線上查詢處理系統需要具要具有可接受的查詢處理時間、可控制的維護成本及以最小儲存空間容納大量資料的特性。為了確保系統具有快速的查詢處理時間，將檢視表(views)預先計算並儲存為資料表(tables)的實體化(materialization)技術也隨之發展。從過往的研究中已證實，實體化檢視表(MVs, Materialized Views)對於加快查詢處理速度有極佳的助益，且此項技術已廣泛地應用在許多商用資料庫系統。另一方面，使用者也經常提出需對數十億位元組(gigabytes)資料進行複雜運算的查詢，導致系統計算正確答案的時間代價十分可觀。再者，在線上分析處理系統處理的通常是需讀取大量資料的範圍查詢(range-sum queries)，這也使查詢處理耗時的問題更加嚴重。因此，快速地對於查詢提供近似的解答以供使用者參考，這種似近處理技術就成為實用且可行的技術。同時，對於近似處理所產生的誤差，使用者也期待系統可以進行估計，以評估近似值的可用度。除了線上分析處理技術，近年來有些應用環境，需對巨量的多維度且源源不斷累積的串流資料進行快速地處理。如行動電話公司需能快速搜集並處理各基地台傳回各種形態的系統錯誤訊號。這種類型的資料，或稱為多維度資料串流。此類型態資料通常資料量十分巨大以至於無法全數儲存於永久儲存系統(permanent devices)；並且資料串流累積十分迅速，導至系統僅可對資料進行單次掃描。因此，近似處理技術便成為解決上述問題的合適方法。然而，一般的串流資料處理器不論在工作記憶體或運算能力等系統資源，其限制要比傳統多維度立方體的處理系統要更為嚴苛。故多維度串流資料的近似處理技術需具備「更小的記憶體需求、更有效率的處理演算法」等特性，以應付快處累積且源源不斷進入的多維度資料串流。在此論文中，我們提出了上述問題的解決方法。首先，我們發展出高效率的MAVIS演算法，可進行實體化檢視表的選擇及合併，使線上分析處理系統可在查詢等待時間及系統儲存空間兩方面取得良好的平衡。再者，我們提出了一個架構健全的多維度資料立方體近似處理架構- the DAWN framework。此架構不僅可以對多維度資料立方體進行高效率地壓縮，並且直接從壓縮後的係數直接計算範圍查詢的結果而無需花費額外時間進行解壓縮。並且，此架構亦包含了精確的誤差估計以提高系統可靠度。最後，我們整合了離散餘弦轉換(DCT, Discrete Cosine Transform)及離散小波轉換(DWT, Discrete Wavelet Transform)，發展了DAWA演算法。此演算法可在小量的工作記憶體環境下，快速地進行多維度資料串流的切割及轉換程序，將資料量甚巨的資料串流壓縮成小量係數，並且可從此係數直接計算範圍查詢的答案。以此三項高效率的多維度立方體資料處理技術，不僅對線上分析處理系統的查詢處理及資料儲存助益甚大，同時對於新型態的多維度串流資料，亦提供了可行的處理方法及極佳的處理效率。

關鍵字

多維度資料立方體；資料倉儲；線上分析處理；資訊勘測；串流資料；多維度資料串流

並列摘要

Data cube has become an important component in most data warehouse systems and in decision support systems. Modern data warehouses have a huge amount of data, and OLAP queries submitted by users are becoming more complex. In this dissertation, we first devise the mechanism for striking the balance between response time and the cost of storage space. Then, we propose a framework of approximating query processing for data cubes. Finally, we extend the framework to deal with cube streams. An ideal OLAP system is expected to provide acceptable response time, controllable updating cost, and least storage space. To guarantee a satisfactory query response time, the pre-computed techniques, also known as view materialization, are developed. Materialized Views (MVs) have been found to be very effective in speeding up query as well as update processing, and the methods are being widely supported by commercial database systems. In addition, users usually pose very complex queries to the OLAP system in recent data warehousing systems, which requires complex operations over gigabytes of data and takes a very long time to produce exact answers. Consequently, the issue of approximating OLAP queries becomes critical. Answering range queries is one of the primary tasks of OLAP applications. However, datasets tend to be very large in real data warehousing systems. Thus, answering aggregate queries can be computationally expensive. To address this issue, providing approximate answers to online queries is a viable solution. Also, error bound estimation of the answers to queries is an important functionality for users. That is, both an efficient approximate query processing algorithm and estimation of error bounds are required for DSS applications. For multidimensional data streams, or cube streams, the volume of data is usually too huge to be stored in permanent devices or be scanned thoroughly more than once. Such applications have to process cube streams with limited resources and keep the approximated information in a synopsis memory for further analysis. In addition, the resources for both the processing time and the memory are much more constrained than in off-line cube construction so that cube streams must be processed efficiently with a small working buffer. As a result, an efficient algorithm that can compress cube streams within a small working buffer in one data scan is required to address such a problem. In this dissertation, we proposed the solutions for those issues of modern OLAP systems. First, we devise an efficient mechanism to solve the problem of how to arrange materialization tasks, and propose the algorithm MAVIS, striking the balance between response time and the cost of storage space. Second, the framework DAWN, focusing on answering range-sum queries with error estimation from compressed OLAP cubes, is proposed. Third, we propose the DAWA algorithm, an integrated algorithm of Dct for Data and discrete WAvelet transform, for approximating the cube streams with very restricted resources. With the the framework DAWN, algorithms MAVIS and DAWA, data warehousing systems are able to answer OLAP queries efficiently with limited storage space. In addition, OLAP systems are able to answer range-sum queries from compressed data cubes instead of dealing with huge volume of data cells. Moreover, the multi-dimensional data streams, or cube streams, could be processed and stored very efficiently.

並列關鍵字

Data Cubes ； Data Warehousing ； OLAP ； Data Mining ； Cube Streams ； Multidimensional Cubes ； Data Streams

參考文獻

of PODS, pages 38—48. ACM Press, 2003.

[52] W. G. Teng, M. S. Chen, and P. S. Yu. Wsing wavelet-based resource-aware mining

[1] Transactional Processing Performance Council. TPC Benchmark. http://www.tpc.org.

[2] F. Afrati and R. Chirkova. Selecting and using views to compute aggregate queries. In

performance of aggregate queries. In Proc. of DASFAA, 2005.

國際替代計量

多維度立方體之資料處理技術

全文下載

主題瀏覽