尋找資料串流之代表元素

隨著電子化時代的來臨，資訊量呈現爆炸性的增加，資料串流 (data stream) 所扮演的角色愈來愈重要，許多應用程式的資訊都以資料串流的型式傳輸與處理，諸如：無線感知網路 (wireless sensor networks) 的監控、即時熱門電視節目調查、阻斷服務攻擊 (deny of service attack) 的偵測等。資料串流的特性是：(1) 總資料量龐大，遠超過傳統資料庫或記憶裝置 (如：硬碟) 所能紀錄的量；(2) 資料更新頻率高；(3) 使用者對於資料的要求不再是單次的快照檢索 (one-time query)，而是長期的連續檢索 (continuous query)。這些與傳統資料處理的相異點，使資料串流中，資訊統計量的計算變得很困難。　　這篇論文中，我們將提出一種在資料串流中尋找熱門元素的方式，其中，熱門元素指的是在資料串流中，出現頻率超過特定比例的元素。為了能符合資料串流的特性並提高實用性，我們的演算法將達到以下的目標：(1) 給定有限範圍內的任何一段時間，我們的系統將能有效率的回傳這段時間內的熱門元素；(2) 可採用分散式處理。我們的演算法必須能在局部刪整資料，且這個刪整資料的動作不可影響最後熱門元素統計的正確性；(3) 有效率的儲存方式。由於資料串流的資料量龐大，遠超過儲存裝置所能記載的數量，我們必須設計出一個有效率的資料結構來儲存熱門元素；(4) 低計算時間複雜度。資料串流的更新頻率頻繁，因此，此演算法必須能快速的計算出熱門元素的統計資訊。　　我們已經在數學上證明：當熱門元素存在時，此演算法保證正確。實驗也顯示：即使在熱門元素較稀少的環境中，我們的演算法回傳的結果也有很高的比例是出現次數較高的元素。

關鍵字

資料串流；熱門元素；即時收視率調查；阻斷服務攻擊偵測

並列摘要

In many real world applications, such as monitoring in sensor networks, online TV ratings, and DoS (deny of service) attack detection, information delivering is by the form of data streams. Differs from the traditional database systems, data streams have the following characteristics: (1) total amount of data is enormous, much more than traditional database systems or memory device (e.g. hard disk) can afford; (2) highly updating rate; (3) rather than one-time query, users tend to issue sort of continuous query instead. These differences make it difficult to calculate the statistics of data streams. 　　In this thesis, a method to disclose frequent items in masses of data streams is proposed. Frequent items refer to those whose occurrence rates exceed a given frequency threshold, which can be specified by inquirers. In order to meet the characteristics of data stream processing and without losing the practicability, the proposed approach has the following objectives: (1) any query interval: Given a time period within the system-affordable range, the frequent items of this period will be responded efficiently; (2) distributed processing: the approach can separately prune away many occurrences from bottom to top without losing any fidelity; (3) storage saving: due to the enormous amount of occurrences, we have to design a mechanism to conserve the potential frequent items using only limited local counters; (4) computation saving: because of the highly updating rate, the proposed approach should be able to respond the statistical results of frequent items without delay. 　　We proposed a distributed algorithm which fulfills the above requirements in this thesis as well as proved its correctness when frequent items do exist. To demonstrate the practicability in locating the proper frequent items we conduct some simulations, the experimental results show that most of the items generated are of eminent occurrence rates even though the frequent items are eccentric.

並列關鍵字

data stream ； frequent item ； on-line TV rating ； DoS attack detection

參考文獻

[5] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and Issues in Data Stream Systems,” in Proceedings of the ACM Symposium on Principles of

[7] Robert S. Boyer and J Strother Moore, “MJRTY - A Fast Majority Vote Algorithm,” in Technical Report ICSCA-CMP-32, 1982.

Discrete Algorithms, 2002.

[8] G. S. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams,” in Proceedings of Very Large Data Bases (VLDB), 2002.

[9] M. Charikar, K. Chen, and M. Farach-Colton, “Finding Frequent Items in Data Streams,” in Proceedings of International Colloquium on Automata, Language, and Programming (ICALP), 2002

國際替代計量

尋找資料串流之代表元素

主題瀏覽