A Three-Dimensional Model for Processing Sequential Data

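Advances in information technology and the Internet have made it far easier to record and store data, which raises a new question: how can useful information be found in these data? Data mining offers an answer. Data mining research has produced many useful theories and techniques for this problem, but practical use still poses difficulties. For example, processing data from different domains requires domain-specific knowledge, and data with different characteristics call for different analysis methods. It may even be necessary to apply several data mining methods together in order to uncover the hidden information in the data more precisely. Some of these data are sequential data, which contain a time factor, such as web browsing paths and customers' purchase records in a store. This thesis discusses and analyzes this type of data and, based on its characteristics, examines it along three dimensions: 1. Time-ignored item-matrix: ignoring the time factor, each item in a sequence is treated as an attribute, and sequences are compared by the attributes they do or do not share. 2. State-transition: each item in a sequence is treated as a state, and the change between two consecutive states is examined. 3. Sequence similarity: each sequence is treated as a string, and dynamic programming is used to compute the similarity between strings in order to find similar sequences. The thesis also designs a three-dimensional data processing workflow. The workflow can apply the three analysis methods interchangeably, analyzing sequential data from different dimensions and finding groups of similar sequences within a set of sequential data. Finally, a system was implemented based on this workflow and used to process learners' study records from public-sector websites, and it produced good clustering results.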
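The first two dimensions can be illustrated with a minimal Python sketch. The abstract does not state which measures the thesis uses, so everything here is an assumption made for illustration: the function names are hypothetical, the item comparison uses Jaccard distance over item sets, and the state-transition comparison uses a normalized difference of consecutive-pair counts.

```python
from collections import Counter


def item_dissimilarity(a, b):
    """Time-ignored comparison: Jaccard distance between the item sets.

    Assumption: the thesis may use a different attribute-based measure.
    """
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty sequences are treated as identical
    return 1.0 - len(set_a & set_b) / len(union)


def transition_counts(seq):
    """Count each (state, next_state) pair of consecutive items."""
    return Counter(zip(seq, seq[1:]))


def transition_dissimilarity(a, b):
    """Normalized L1 difference between two transition-count tables.

    Assumption: an illustrative measure, not the thesis's definition.
    """
    counts_a, counts_b = transition_counts(a), transition_counts(b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0  # neither sequence contains a transition
    keys = set(counts_a) | set(counts_b)
    # Counter returns 0 for pairs that one sequence never contains.
    return sum(abs(counts_a[k] - counts_b[k]) for k in keys) / total


# Hypothetical browsing paths of two learners.
path_a = ["home", "news", "course", "quiz"]
path_b = ["home", "course", "quiz", "forum"]
print(item_dissimilarity(path_a, path_b))
print(transition_dissimilarity(path_a, path_b))
```

Both functions return a value between 0 (identical) and 1 (nothing in common), so the two dimensions could be compared or combined on the same scale.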

Keywords

data mining; cluster analysis; time series; dynamic programming; sequence similarity

Parallel Abstract

The development of information technology and the World Wide Web has made it more convenient to record and store all kinds of data. The question that follows is how to discover useful information, and even knowledge, in these large data sets. Data mining provides a solution. Data mining research has developed many helpful theories and techniques for this problem, yet many difficulties remain in real-world cases. For example, analyzing data from different applications requires specific domain knowledge, and different types of data should be processed with different techniques according to their characteristics. Sometimes, to extract hidden information more precisely, several data mining techniques may be applied at the same time. One kind of data we want to analyze is sequential data, which contains a time factor, such as “traversal pattern in a Web site” and “consumer purchasing record.” This thesis discusses this kind of data. Based on its characteristics, the thesis analyzes sequential data in three ways: 1. Time-ignored item-matrix: ignoring the time factor, we treat the items of a sequence as attributes and then compute the dissimilarity of two sequences. 2. State-transition matrix: we treat the items of a sequence as states and examine the transition between states at two consecutive times. 3. Sequence dissimilarity: calculating sequence dissimilarity is similar to the string alignment problem; using dynamic programming, we take the minimum alignment score as the dissimilarity of two sequences. An experimental system was implemented to demonstrate this idea. The demonstration system uses data taken from a real Web site to evaluate and compare the calculation results. Examples show that the proposed methods can be integrated and work successfully.
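The third dimension, sequence dissimilarity as a minimum alignment score, can be sketched with the classic edit-distance dynamic program. This is a minimal illustration and not the thesis's scoring scheme: the function name and the unit costs for substitutions and gaps are assumptions.

```python
def edit_distance(a, b, sub_cost=1, gap_cost=1):
    """Minimum cost of aligning sequence a with sequence b.

    Assumption: unit substitution and gap costs; the thesis may score
    alignments differently.
    """
    # prev[j] is the cost of aligning an empty prefix of a with b[:j].
    prev = [j * gap_cost for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap_cost]  # aligning a[:i] with an empty prefix of b
        for j, y in enumerate(b, start=1):
            match = prev[j - 1] + (0 if x == y else sub_cost)
            delete = prev[j] + gap_cost      # drop x from a
            insert = curr[j - 1] + gap_cost  # add y to a
            curr.append(min(match, delete, insert))
        prev = curr
    return prev[-1]


# Hypothetical browsing paths of two learners.
path_a = ["home", "news", "course", "quiz"]
path_b = ["home", "course", "quiz", "forum"]
print(edit_distance(path_a, path_b))
```

The function works on any pair of sequences whose items can be compared for equality, so lists of page names and plain strings are handled the same way. Only two rows of the table are kept at a time, which keeps memory use proportional to the length of the second sequence.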

Parallel Keywords

sequence similarity; time series; cluster analysis; sequential data; data mining; dynamic programming

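Cited By

汪倢伃 (2013). A Study on Quantifying Rehabilitation Movements Based on Image-Sensing Information [Master's thesis, Chang Jung Christian University]. Airiti Library. https://doi.org/10.6833/CJCU.2013.00126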
