ABSTRACT Enterprises and governments have recently invested aggressively in big data analytics, because such data can reflect the opinions and behavior of millions of people. Despite the new opportunities it brings, big data poses computational challenges: an extremely large number of observations (e.g., millions of transactions), high dimensionality (e.g., thousands of items), and the need for immediate response (e.g., analyzing massive data and reporting the results within several minutes). Association analysis, a fundamental technique of data mining, has achieved notable success in many applications. With big data, however, conventional association analysis is hampered by the cost of extracting pattern information. Specifically, the computational complexity of frequent itemset mining grows exponentially with the number of items, and the problem has been proven to be NP-complete. Although many studies prune patterns to reduce this complexity, pruning may distort the shape of the data and lead to inaccurate results. In this thesis, we provide novel perspectives on association analysis. Beyond finding itemsets of high frequency, association analysis has potential application in exploring the behavior of, and relationships between, observations. We therefore devise relative patterns discovery (RPD), which explores the patterns shared by each pair of observations. It is sensible to examine the behavioral characteristics of an observation by comparing them with those of other observations. Instead of pruning patterns, RPD represents a natural panorama of the data, which makes it appropriate for controlled experiments that aim to discover more decisive factors. We also propose parallel, decomposable, and maintainable components to enhance RPD.
For practical purposes, building on the knowledge of relative patterns, we propose two scoring metrics for evaluating anomalies and design a scalable outlier detection method (SOD) for big data analytics. Empirical investigations on various real-world datasets from the UCI Machine Learning Repository demonstrate that our proposed scoring metrics generally outperform those of previous studies in both accuracy and efficiency. In particular, on a large-scale dataset (494,021 observations and 41 dimensions), SOD executes in around 6 seconds while achieving good accuracy (an area under the curve (AUC) of 0.741). These investigations provide evidence that the concept of RPD is practicable in big data analytics.
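To make the idea of patterns shared by each pair of observations concrete, the following is a minimal sketch and not the thesis's actual RPD implementation. It assumes that each observation is a fixed-length record of categorical values and that a shared pattern is the set of (attribute position, value) items two records have in common. The function names `to_items` and `relative_patterns` and the toy records are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def to_items(observation):
    """Turn a record into a set of (attribute_index, value) items (assumed encoding)."""
    return frozenset(enumerate(observation))


def relative_patterns(observations):
    """Map each index pair (i, j), i < j, to the items the two observations share."""
    itemsets = [to_items(obs) for obs in observations]
    return {
        (i, j): itemsets[i] & itemsets[j]
        for i, j in combinations(range(len(itemsets)), 2)
    }


# Made-up toy records with three categorical attributes each.
data = [
    ("tcp", "http", "SF"),
    ("tcp", "http", "SF"),
    ("udp", "dns", "SF"),
]

shared = relative_patterns(data)
for pair, items in sorted(shared.items()):
    print(pair, sorted(items))
```

In this sketch, n observations yield n(n-1)/2 pairs, and each pair is computed independently of the others, so the work could in principle be split across workers. How the thesis actually decomposes and parallelizes RPD is not specified here.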