透過您的圖書館登入
IP:3.15.143.181
  • 學位論文

動態資料庫系統之查詢處理技術

Query Processing Techniques in Dynamic Database Systems

指導教授 : 陳銘憲

摘要


科技的日新月異,讓電腦的運算功能不斷的提昇,對一處理大量資料的資料庫系統而言,執行效率以及執行效能也獲得顯著的改善。然而,對動態的環境,如行動資料庫以及資料流而言,在處理使用者查詢時,許多傳統的最佳化技術,都無法直接套用。由於在動態的資料庫系統中,使用者查詢的統計特性,或是儲存的資料內容,都會隨著時間不斷的作改變。因此,如何縮短處理查詢的時間,以及增加查詢結果的準確度,成為許多鑽研資料庫系統的學者,所研究的重要議題。 在行動計算 (Mobile Computing) 的環境中,以廣播為基礎的傳送機制,是一種具可擴充性 (Scalability) 的傳送資料的方式。在資訊廣播系統中,廣播伺服器會根據使用者查詢的機率分佈,將資料分配到不同的廣播頻道中。而分配的原則,則是儘量將使用者的平均等候時間最小化。有鑒於不同的資料,會具有不同的檔案大小,在第二章裡,我們討論的主題是,在具多樣化資料的廣播環境中,資料排程演算法的設計。我們首先推導出在該環境下,使用者的等候時間之數學模型,並定義資料排程的問題。接下來,我們進一步提出了所謂的「二階段排程」 (Two-Phase Allocation) 的演算法架構。在此解法中,資料的排程,分為以下兩個主要階段:粗調及微調。在「粗調」的過程中,資料會以最快的方式,分配到不同的廣播頻段中;在微調時,粗調的結果,則是會被調整成區域最佳解 (Local Optimum)。除此之外,為了要克服區域最佳解的限制,我們又提出「混合式基因演算法」 (Hybrid Genetic Algorithm) 的系統架構,使排程結果達到整體最佳化 (Global Optimization)。 在許多情況下,我們發現到無線網路的使用者,會同時想查詢不同的資料。假設使用者接收不同資料時,他或她會希望這些資料能依據某一特定順序做接收。在伺服器中,為了要讓使用者的等候時間可以最佳化,排程演算法的設計,則必須要根據這些資料在接收時的順序性。 在第三章裡,我們探討的問題,是在所謂的「順序性廣播環境」 (Sequential Data Broadcasting Environments) 中,設計排程演算法。我們所提出的,是一個名為MULS的系統架構。 這個系統架構裡,包含兩個元件:「線上排程演算法」(On-Line Scheduling Algorithm) 以及「最佳化程序」 (Optimization Procedure)。在「線上排程演算法」中,廣播的資料,會根據使用者查詢中的順序性,做初步的排程;而在第二階段裡,當最佳化程序在進行時,我們允許在伺服器端,利用不同的參數,來控制執行時間和執行效能。 在資料流 (Data Stream) 的環境裡,隨著時間,資料會被不斷的接收,由於接收資料流的伺服器,只具備有限的磁碟空間,在儲存資料時,資料流會以「概要」 (Synopsis) 的方式,存在資料流管理系統 (Data Stream Management System) 中,而當我們在處理使用者查詢時,也只能根據「概要」中的內容,來計算這些查詢的答案。在第四章裡,我們探討的是在資料流的環境中,如何處理使用者所送出的「前K名等級查詢」 (Top-K Queries)。在這個章節中,我們針對此一特殊查詢,提出TOMS資料流管理系統的系統架構。在這個架構中,不論是儲存資料流,或是處理使用者查詢,我們皆設計出精巧的演算法來解決這些方面的議題。 而在資料流管理系統中,資料流的概要中的值,通常會被透過「小波轉換」轉 (Wavelet Transform) 的方式,轉變成頻域 (Frequency Domain) 係數而加以儲存,當查詢送出後,使用者所希望得到的結果,卻往往是資料流在時域 (Time Domain) 中的表現。因此,要如何觀察這些頻域的係數,找出兩段資料流彼此在頻域中的相似度,則是我們在第五章中,所要討論的主題。這個章節主要分成兩個部份,第一個部份是資料流的相似度搜尋,而第二部份中,延續第一部份的研究成果,我們進一步探討如何找出「前K條最接近的資料流」 (K Nearest Neighboring Streams)。由於我們設計出來的方法,避免了將頻域係數轉換成時域係數所需消耗的運算負擔,因此,和許多現有的演算法相比,我們的演算法,能以更有效率的方式,計算出相關查詢的答案。 關鍵詞:資料庫系統、使用者查詢、行動計算、資料流、資訊廣播、小波轉換

並列摘要


The technology advances in computing powers have enhanced the capabilities for a database management system to process the user queries effectively and efficiently. However, in many emerging applications such as mobile computing environments and streaming environments, the conventional techniques for query optimization may suffer from the dynamics in either user queries or evolving data entities. For these dynamic database systems, how to process query within real time and provide accurate answers becomes a challenging issue. In the mobile computing environments, broadcast-based dissemination is a scalable way to deliver data items to interested users. According to the probability distribution of user queries, the broadcast server will allocate the items into broadcast channels in such a way the average waiting time of mobile users can be minimized. In view of the fact that various items with different sizes are disseminated in modern information service, we explore in Chapter 2 the issue of scheduling heterogeneous items in the data broadcasting environments. Given the broadcast database and the number of channels, we first derive the analytical model of the heterogeneous data broadcasting to obtain the average waiting time of mobile users, and formulate the problem as a grouping problem. In order to solve such problem, we propose a two-phase architecture to perform channel allocation. In addition to the two-phase architecture, we also propose algorithm GA-CDMS according to the concept of hybrid genetic algorithm for comparison purposes. Moreover, under many circumstances, the mobile users will tend to access multiple items within a specific session. Consider a special case in which the mobile users wish to download these items within a sequential order. To minimize the average access time, the information server should schedule these queries according to the sequential relationship among the items. In Chapter 3, we study the scheduling approach in such a sequential data broadcasting environment. Explicitly, we propose a general framework referred to as MULS for an information system. There are two primary stages in MULS: on-line scheduling and optimization procedure. By cooperating algorithm OLS with procedure SCI, the proposed MULS framework is able to generate broadcast programs with flexibility of providing different service qualities under different requirements of effectiveness and efficiency. As for the streaming environments, the rapid evolving speed and unlimited number of items make these data be scanned at most only once. Due to the resource limitation in the data stream environment, it has been reported that answering user queries according to the wavelet synopsis of a stream is an essential ability of a Data Stream Management System (DSMS). We first study in Chapter 4 the problem of maintaining the wavelet coefficients of multiple streams within collective memory so that the predetermined global error metric is minimized. Moreover, we also examine a promising application in the multistream environment, i.e., the top-k queries. We resolve the problem of efficient top-k query processing with minimized global error by developing a general framework. For the purposes of maintaining the wavelet coefficients and processing top-k queries, several well-designed algorithms are utilized to optimize the performance of each primary component of this general framework. In this dissertation, motivated by the fact that the data cells in streaming environments are usually transformed to coefficients in the frequency domain, we also attempt to address an essential problem "How to obtain the time-domain similarity between two streams from wavelet coefficients in the frequency domain?". In Chapter 5, we investigate two important types of range-constrained queries in time series streaming environments: the distance queries (which aim at obtaining the Euclidean distance between two streams) and the kNN queries (which aim at discovering k nearest neighbors to a reference stream). To achieve high efficiency in processing these two types of queries, we propose procedure RED and algorithm EKS. Compared to the existing methods in the prior research, the advantageous features of our approaches are in two folds. First, our approaches are capable of processing the queries directly from the wavelet synopses retained in the main memory without using IDWT to reconstruct the data cells. Moreover, our approaches enable the users to query the DSMS within their range of interest.

參考文獻


[30] Y. C. Chung, C. C. Chen, and C. Lee. Time Constrained Service on Air. In Proceedings of
[93] W.-C. Peng and M.-S. Chen. Dynamic generation of data broadcasting programs for a
[96] C. S. Perng, H. Wang, S. R. Zhang, and D. S. Parker. Landmarks: A New Model for
[55] C. H. Hsu, G. Lee, and A. L. P. Chen. An Efficient Algorithm for Near Optimal Data
[28] R. Chengy, B. Kaox, S. Prabhakarz, A. Kwanx, and Y. Tuz. Adaptive Stream Filters for

延伸閱讀