Applying Relative Patterns Discovery to Big Data: A New Perspective on Association Analysis

ABSTRACT

Enterprises and governments have recently invested heavily in big data analytics, because big data drawn from millions of people is taken to be representative of popular opinion. Alongside new opportunities, big data brings computational challenges: an extremely large number of observations (e.g., millions of transactions), high dimensionality (e.g., thousands of items), and the need for immediate response (e.g., analyzing massive data and reporting the results within several minutes). Association analysis, a fundamental technique of data mining, has achieved notable success in many applications. With big data, however, conventional association analysis struggles to extract pattern information. Specifically, the computational complexity of frequent itemset mining grows exponentially with the number of items, and the problem has been proven NP-complete. Although many studies prune patterns to reduce this complexity, pruning may distort the shape of the data and lead to inaccurate results. In this thesis, we offer a new perspective on association analysis. Beyond finding high-frequency itemsets, association analysis can potentially be applied to exploring the behavior of observations and the relationships between them. We therefore devise relative patterns discovery (RPD), which identifies the patterns shared by each pair of observations. It is sensible to characterize the behavior of an observation by comparing it with that of other observations. Because it does not prune patterns, RPD preserves a natural panorama of the data, which suits controlled experiments aimed at discovering more decisive factors. We also propose parallel, decomposable, and maintainable components to enhance RPD.
For practical use, we build on the knowledge of relative patterns to propose two scoring metrics for evaluating anomalies, and we further design a scalable outlier detection method (SOD) for big data analytics. Empirical investigations on various real-world datasets from the UCI Machine Learning Repository show that the proposed scoring metrics generally outperform those of previous studies in both accuracy and efficiency. In particular, on a large-scale dataset (494,021 observations and 41 dimensions), SOD runs in about 6 seconds and achieves good accuracy, with an area under the curve (AUC) of 0.741. These results provide evidence that the concept of RPD is practicable in big data analytics.
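
The idea of comparing each pair of observations for the patterns they share can be illustrated with a minimal Python sketch. This is not the thesis's RPD or SOD implementation, and the abstract does not define its two scoring metrics. The functions `shared_pattern`, `pairwise_patterns`, and `mean_overlap`, the overlap-based score, and the toy records are all assumptions made for illustration only.

```python
from itertools import combinations

def shared_pattern(a, b):
    """Return the (position, value) pairs on which two observations agree."""
    return {(i, x) for i, (x, y) in enumerate(zip(a, b)) if x == y}

def pairwise_patterns(observations):
    """Map each pair of observation indices to the pattern that pair shares."""
    return {
        (i, j): shared_pattern(observations[i], observations[j])
        for i, j in combinations(range(len(observations)), 2)
    }

def mean_overlap(observations):
    """Hypothetical score (an assumption, not a metric from the thesis):
    the average number of attribute values an observation shares with
    every other observation. Lower values suggest more unusual records."""
    n = len(observations)
    totals = [0] * n
    for (i, j), pattern in pairwise_patterns(observations).items():
        totals[i] += len(pattern)
        totals[j] += len(pattern)
    return [t / (n - 1) for t in totals]

# Made-up categorical records, for illustration only.
data = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("tcp", "smtp", "SF"),
    ("udp", "dns", "REJ"),
]
scores = mean_overlap(data)
# The last record shares no attribute value with any other record,
# so it receives the lowest score in this toy example.
```

Comparing every pair requires a number of comparisons that grows quadratically with the number of observations. Each pair can be processed independently of the others, which is one reason a pairwise formulation lends itself to parallel execution.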

Keywords

pattern discovery; outlier detection; data mining; big data analytics; parallel computing

