透過您的圖書館登入
IP:18.119.133.96
  • 學位論文

不同條件限制之連續性關係勘測

Mining Sequential Relationship for Different Constraints

指導教授 : 陳銘憲

摘要


近年來,由於資料收集以及儲存裝置技術的快速進步,使得各機構累積了相當大量的資料。然而,由於資料量太大的緣故,傳統的資料分析技術並不能派上用場。資料探勘是一門同時結合了傳統資料分析方法以及精心設計的演算法以便可以快速處理大量資料的熱門學術研究方向。就目前為止已經有很多相關的研究提出高效率的演算法。在此之中,連續性關係勘測的技術已被廣泛的在選擇性市場行銷,決策分析以及企業管理中所應用。然而大部分連續性關係勘測的相關研究都是利用單一信心水準,也就是說完全依靠單一信心水準來決定該模式是有趣與否。在此篇論文中,我們研究了非單一性信心水準之連續性關係勘測的相關技術。首先我們提出了在動態信心水準條件及漸進式網路資料庫環境下的高效率演算法,之後我們研究在連續性法則之中如何找出關鍵的催化性事件。最後,我們解決了特定條件限制下的連續性關係勘測問題。 由於網路上各式各樣的活動越來越頻繁,網路資料勘測不管在學術界或是業界都是相當熱門的研究主題,然而傳統的使用者瀏覽模式勘測都是利用單一信心水準。也就是說並沒有考慮如模式的長度,網頁的位置及重要性等。有鑑於此,我們提出了動態信心水準條件及漸進式網路資料庫環境模型。尤其我們利用了馬可夫模型來輔助決定各網頁文件的信心水準條件,除此之外,我們提出的演算法DTM 亦可以在動態信心水準條件下有效率的找出使用者瀏覽模式。 此外,連續性事件勘測也是相當重要的學術研究方向,然而傳統的連續性事件勘測也面臨到關於信心水準大小決定的兩難問題。簡而言之,太高的信心水準會使得結果產生出來的連續性法則是平凡無奇的,太小的信心水準則會產生大量的連續性法則,使得從中要發現有價值的連續性法則困難重重。在本論文中我們研究在連續性法則之中如何找出關鍵的催化性事件。傳統的信心水準模型下的過濾技術在本課題中已無法利用了,因此我們發現及利用了一些特性來加速發現關鍵的催化性事件的過程。 最後,由於電信系統每天產生出大量的警報紀錄,警報記錄中可能隱藏著特定的系統運作模式可以用以解決系統問題和預測嚴重錯誤的發生,因此我們設計了一套特定條件限制下的連續性警報記錄關係勘測的系統。首先我們針對警報記錄資料的特性,設計了一連串的資料清理的步驟,之後我們亦針對警報記錄的特性,在連續性警報記錄關係勘測的演算法中提出了新的計數方式以及加入了事件發生時間區隔的限制。我們所提出的演算法已實際在電信業者所提供的警報記錄中驗證效能及實用性。

並列摘要


Remarkable advancement in data collection and storage technology have enabled institutions to store up vast amount of data. However, traditional data analysis techniques cannot be used here because of the huge size of data. Data mining is a technology that combine traditional data analysis procedure and sophisticated algorithms for dealing with large amount of data. A significant amount of research effort has been elaborated upon the development of efficient algorithms for data mining. In particular, the discovery of sequential relationship among the data in a huge database has been known to be useful in selective marketing, decision analysis, and business management. However, most work of sequential relationship are based on the model of a uniform support threshold only, where a single support threshold is used to determine whether the patterns are interesting or not. In this dissertation, we explore several techniques to mining sequential relationship for different constraints. First, we propose efficient algorithms for incremental Web log mining with dynamic thresholds. Then, we investigate the model of mining catalytic events of episode rules in event sequences. Finally, we study the problem of constrained based sequential pattern mining. With the fast increase in Web activities, Web data mining has recently become an important research topic and is receiving a significant amount of interest from both academic and industrial environments. While existing methods are efficient for the mining of frequent path traversal patterns from the access information contained in a log file, these approaches are likely to over evaluate associations. Explicitly, most previous studies of mining path traversal patterns are based on the model of a uniform support threshold, where a single support threshold is used to determine frequent traversal patterns without taking into consideration such important factors as the length of a pattern, the positions of Web pages, and the importance of a particular pattern, etc. As a result, a low support threshold will lead to lots of uninteresting patterns derived whereas a high support threshold may cause some interesting patterns with lower supports to be ignored. In view of this, we broadens the horizon of frequent path traversal pattern mining by introducing a flexible model of mining Web traversal patterns with dynamic thresholds. Specifically, we study and apply the Markov chain model to provide the determination of support threshold of Web documents; and further, by properly employing some effective techniques devised for joining reference sequences, the proposed algorithm DTM (standing for Dynamic Threshold Miner) not only possesses the capability of mining with dynamic thresholds, but also improves the execution efficiency as well as contributes to the incremental mining of Web traversal patterns. Performance of algorithm DTM and the extension of existing methods is comparatively analyzed with synthetic and real Web logs. The data mining techniques include association rules mining, classification, clustering, mining time series, and sequential pattern mining, to name a few. Among others, finding frequent episodes has attracted a significant amount of research attention. However, the traditional episodes mining also suffers from a dilemma of frequency threshold determination. To put it briefly, with a high frequency threshold, the discovered rules are often well-known and valueless, and rules of low frequency threshold but high usefulness are usually not discovered. And with a low frequency threshold, the amount of the discovered rules explodes and it is very difficult to find the valuable rules among them. In view of this, we consider the following problem. Given a set of episodes and a sequence database of events, find all catalytic events of episode rules in the event sequence. An event E is a catalytic event of a given episode rule α=>β if event E and the triggering part of episode rule α together trigger the consequential part of episode rule β more easily than the triggering part of episode rule α does. In other words, event E is a catalyst which makes the probability of β happens after α become larger, that is to say, the event E makes β happens after α more often. Obviously, the classic support-based pruning strategy does not work here. Hence, we find some properties to make the process of mining catalytic events of episodes more efficient. A telecommunication system produces daily a large amount of alarm data which contains hidden valuable information about the system behavior. The knowledge discovered from alarm data can be used in finding problems in networks and possibly in predicting severe faults. Hence, we devise a solution procedure for mining sequential alarm patterns from the alarm data of a GSM system. First, by observing the features of the alarm data, we develop operations for data cleaning without compromising the quality of sequential alarm patterns obtained. After the data cleaning procedure, we transform the alarm data into a set of alarm sequences. Note that the consecutive alarm events exist in the alarm sequences, and it is complicated to count the occurrence counts of events and extract patterns. Therefore, we devise a new counting method to determine the occurrence count of the sequential alarm patterns in accordance with the nature of alarms. More importantly, by utilizing time constraints to restrict the time difference between two alarm events, we devise a mining algorithm to discover useful sequential alarm patterns. The proposed mining algorithm is implemented and applied to test against a set of real alarm data provided by a cellular phone company. The quality of knowledge discovered is evaluated. The experimental results show that the proposed operations of data cleaning are able to improve the execution of our mining algorithm and the knowledge obtained from the alarm data is very important from the perspective of network operators for alarm prediction and alarm control.

參考文獻


[1] R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A Tree Projection Algorithm for Generation
of Frequent Itemsets. Jornal of Parallel and Distributed Computing (Special Issue on High
Performance Data Mining), 2000.
[3] C. C. Aggarwal, J. L. Wolf, and P. S. Yu. A new method for similarity indexing of market
high dimensional data for data mining applications. pages 94–105, 1998.

延伸閱讀