Mining Maximal Patterns over XML Data Streams

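The evolution of information technology has brought revolutionary challenges to enterprises. With the rapid growth of the Internet, more and more users rely on this platform to exchange information or conduct transactions, and many enterprises try to use this business model to meet customer needs or find potential customers, creating more business opportunities. Data mining techniques for this setting have therefore become a major research focus. Frequent patterns are often used to represent the most likely behavior or preferences of users; they help narrow the gap between system design and user needs, and they can inform marketing decisions tailored to customers. Previous research in this area has concentrated on static databases. In many emerging applications, however, data arrive continuously, rapidly, and in large volumes, so a system with limited resources needs more efficient storage and update mechanisms to approximately estimate the important information that has passed through it. This thesis designs a mining method for XML data streams that consists of three phases: each XML document is first converted into a sequence; a tree structure is then built to store, in compressed form, the large number of subsequence combinations together with their occurrence frequencies; finally, an efficient mining procedure finds the representative maximal patterns. Experimental results show that the proposed method is reasonably efficient on both dense and sparse data. Mining is more efficient on sparse data, whereas in the error-tolerant mode precision and recall are relatively higher on dense data.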

Keywords

Data Mining; Data Streams; Browsing Behavior; Maximal Patterns

Parallel Abstract

Advances in information technology bring revolutionary challenges to enterprises. With the rapid growth of the Internet, more and more users exchange messages or do business on this new platform. Many enterprises also try to meet the needs of their customers through the new business model, or to find potential customers and so create more business opportunities. As a result, data mining techniques have become a major research topic. Frequent patterns are often used to represent the likely interests or behavior of users, and they help adapt system design to user needs. Frequent patterns can also be used to tailor marketing policies. Previous work in this field focuses on static databases, whereas in recent applications data often arrive continuously and rapidly. In a resource-limited environment, a more efficient mechanism for storing and updating data is needed so that the important information that has passed through the system can be estimated. This thesis proposes a method for mining XML data streams that consists of three phases. Each XML document is first transformed into a sequence. A compact tree structure is then constructed to compress the huge number of subsequences and their counts. Finally, an efficient algorithm for mining maximal patterns is designed. The experimental results show that the proposed method performs well on both dense and sparse datasets. Its efficiency is better on sparse data, while its accuracy is better on dense data when a few errors are tolerable.
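The abstract gives only the outline of the three phases, so the following Python fragment is a minimal sketch under stated assumptions, not the thesis's algorithm. It assumes that a document's sequence is the pre-order list of its element tags, that the tree is a plain trie over contiguous subsequences with exact per-document counts, and that a pattern is maximal when no longer frequent pattern contains it contiguously. The names `xml_to_sequence`, `SequenceTrie`, and `maximal_patterns` are hypothetical, and the approximate, resource-bounded counting that a real stream setting needs is left out.

```python
import xml.etree.ElementTree as ET


def xml_to_sequence(xml_text):
    """Phase 1 (assumed encoding): pre-order list of element tag names."""
    root = ET.fromstring(xml_text)
    return [elem.tag for elem in root.iter()]


class Node:
    def __init__(self):
        self.children = {}
        self.count = 0       # number of documents containing this pattern
        self.last_doc = -1   # id of the last document that incremented count


class SequenceTrie:
    """Phase 2 (assumed structure): trie of contiguous subsequences."""

    def __init__(self):
        self.root = Node()
        self.num_docs = 0

    def add(self, sequence):
        doc_id = self.num_docs
        self.num_docs += 1
        # Insert every suffix; each trie path is then a contiguous subsequence.
        for start in range(len(sequence)):
            node = self.root
            for tag in sequence[start:]:
                node = node.children.setdefault(tag, Node())
                if node.last_doc != doc_id:   # count each document once
                    node.count += 1
                    node.last_doc = doc_id

    def frequent(self, min_count):
        """Return {pattern: count} for patterns seen in >= min_count documents."""
        result = {}
        stack = [((), self.root)]
        while stack:
            prefix, node = stack.pop()
            for tag, child in node.children.items():
                # A child's count never exceeds its parent's, so pruning is safe.
                if child.count >= min_count:
                    pattern = prefix + (tag,)
                    result[pattern] = child.count
                    stack.append((pattern, child))
        return result


def contains(long, short):
    """True if `short` occurs as a contiguous run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def maximal_patterns(frequent):
    """Phase 3: keep frequent patterns not contained in a longer frequent one."""
    return {
        p: c
        for p, c in frequent.items()
        if not any(len(q) > len(p) and contains(q, p) for q in frequent)
    }


if __name__ == "__main__":
    trie = SequenceTrie()
    for doc in ("<a><b/><c/></a>", "<a><b/><d/></a>", "<a><b/><c/></a>"):
        trie.add(xml_to_sequence(doc))
    print(maximal_patterns(trie.frequent(min_count=2)))
```

With these three toy documents and a minimum count of 2, the only maximal pattern under the assumptions above is `('a', 'b', 'c')`, supported by two documents. Storing shared prefixes once is what keeps the tree compact, although inserting every suffix costs time quadratic in the sequence length.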

Parallel Keywords

Browsing Behavior; Data Mining; Data Streams; Maximal Patterns

