
相關特徵發現方法實現巨量資料的應用:關聯分析的新思維

Relative Patterns Discovery for Permitting Applications of Big Data: Novel Perspectives on Association Analysis

Advisor: 吳帆

Abstract


In recent years, enterprises and governments have invested aggressively in big data analytics, because big data, drawn from millions of people, is truly representative of popular opinion. Alongside these new opportunities, big data poses computational challenges: an extremely large number of observations (e.g., millions of transactions), high dimensionality (e.g., thousands of items), and the need for an immediate response (e.g., analyzing massive data and reporting the results within several minutes). Association analysis, a fundamental technique of data mining, has achieved notable success in many applications. With big data, however, conventional association analysis is hampered by the cost of extracting pattern information. Specifically, the computational complexity of frequent itemset mining grows exponentially with the number of items, and the problem has been proven to be NP-complete. Although many studies prune patterns to reduce this complexity, pruning may distort the shape of the data and lead to inaccurate results. In this thesis, we provide novel perspectives on association analysis. Beyond finding itemsets of high frequency, association analysis has a potential application in exploring the behavior of observations and the relationships between them. We therefore devise relative patterns discovery (RPD), which explores the patterns shared by each pair of observations. It is sensible to examine the behavioral characteristics of an observation by comparing them with those of other observations. Rather than pruning patterns, RPD presents a natural panorama of the data, which is appropriate for controlled experiments aimed at discovering more decisive factors. We also propose parallel, decomposable, and maintainable components to enhance RPD.
For practical purposes, we build on the knowledge of relative patterns to propose two scoring metrics for evaluating anomalies, and we further design a scalable outlier detection method (SOD) for big data analytics. Empirical investigations on various real-world datasets from the UCI Machine Learning Repository demonstrate that the proposed scoring metrics generally outperform those of previous studies in both accuracy and efficiency. In particular, on a large-scale dataset (494,021 observations and 41 dimensions), SOD executes in around 6 seconds while achieving good accuracy (an area under the curve (AUC) of 0.741). These investigations provide evidence that the concept of RPD is practicable for big data analytics.
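
The abstract does not spell out how relative patterns or the two scoring metrics are computed, so the following is only a minimal sketch of the general idea of comparing each pair of observations, not the thesis's RPD or SOD. It assumes each observation is a set of items and takes the "relative pattern" of a pair to be the items both contain. The function names `relative_patterns` and `mean_shared_size`, the toy transactions, and the score itself (average number of items shared with every other observation, where a low value hints at an outlier) are all hypothetical. Note the contrast in search space: n items yield 2^n - 1 candidate itemsets, whereas m observations yield m(m - 1)/2 pairs.

```python
from itertools import combinations


def relative_patterns(observations):
    """Map each pair of observation indices (i, j), i < j, to the items both share.

    Assumption: the shared pattern of a pair is the plain set intersection.
    """
    sets = [frozenset(obs) for obs in observations]
    return {
        (i, j): sets[i] & sets[j]
        for i, j in combinations(range(len(sets)), 2)
    }


def mean_shared_size(observations):
    """Hypothetical anomaly score: average number of items an observation
    shares with each of the other observations (lower = more unusual)."""
    n = len(observations)
    if n < 2:
        return [0.0] * n
    totals = [0] * n
    for (i, j), shared in relative_patterns(observations).items():
        # Each shared pattern counts toward both members of the pair.
        totals[i] += len(shared)
        totals[j] += len(shared)
    return [total / (n - 1) for total in totals]


# Toy data (illustrative only): the last transaction shares nothing with the rest.
transactions = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"wine", "cigars"},
]
scores = mean_shared_size(transactions)
most_unusual = min(range(len(scores)), key=scores.__getitem__)
```

This naive version visits every pair, so its cost is quadratic in the number of observations. It does not reflect the parallel, decomposable, and maintainable components that the thesis proposes for scaling RPD.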


