Handling Missing Values in Quantitative Data with Principal Component Analysis

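Missing values are a persistent problem in data mining for both data analysts and data maintainers. Handling them well depends largely on how information technology is applied and which method is chosen. Existing approaches to missing values often bias the analysis results or leave too little data to analyze. Conventional decision trees fill a missing value with the most common value; the MVC (Missing Values Completion) method uses robust association rules to discover associations in the data and then uses those rules to infer multiple missing values. However, MVC is not suited to finding quantitative association rules. This study uses the eigenvector concept of Principal Component Analysis (PCA) to identify, among the quantitative variables, those that are correlated with one another, in order to handle missing quantitative data in the data cleaning step that precedes data mining and to determine the value with which each missing entry should be filled. The experimental results show that PCA needs only a sufficient number of representative data records to predict missing values, and that it performs well in comparison with regression. In application, the approach not only improves the effectiveness of the data cleaning step in knowledge discovery but can also predict missing values in newly arriving data from the variable structure of historical data.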
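The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to use principal-component eigenvectors for imputation: fill the gaps with column means, then repeatedly reconstruct the data from its leading components and overwrite only the missing cells. It is not necessarily the authors' procedure. The function name `pca_impute`, its parameters, and the toy data are illustrative assumptions.

```python
import numpy as np

def pca_impute(X, n_components=1, n_iter=100, tol=1e-6):
    """Fill NaN entries of X from a low-rank PCA reconstruction (illustrative)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means computed over the observed entries only.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        centered = filled - mean
        # Eigenvectors of the covariance matrix are the principal axes.
        eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
        top = eigvecs[:, np.argsort(eigvals)[::-1][:n_components]]
        # Project onto the leading components and map back to variable space.
        recon = centered @ top @ top.T + mean
        # Observed cells are never changed; only missing cells are updated.
        updated = np.where(missing, recon, X)
        converged = np.max(np.abs(updated - filled)) < tol
        filled = updated
        if converged:
            break
    return filled

# Toy data (assumed): three perfectly correlated quantitative variables.
x1 = np.arange(20, dtype=float)
truth = np.column_stack([x1, 2 * x1 + 1, -x1 + 3])
data = truth.copy()
data[3, 1] = np.nan
data[10, 2] = np.nan
filled = pca_impute(data, n_components=1)
```

Because the toy variables are exactly linearly related, a single component is enough here; on real data the number of retained components would have to be chosen, for example from the explained variance.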

Keywords

Knowledge discovery; principal component analysis; missing data; data cleaning; data mining

Parallel Abstract

Missing values are an important issue for analysts during the data mining process. The problem is which methods are suitable for completing those missing values. A current approach to filling missing values in decision trees is to use the most common value, which can be chosen either from the whole data set or from the data sets constructed for the classification task. An alternative method, MVC (Missing Values Completion), addresses the problem with association rules discovered by RAR (Robust Association Rules), which mines databases containing multiple missing values. However, the MVC method is not suitable for mining quantitative association rules. Our approach uses the concept of PCA (Principal Component Analysis) and eigenvectors to derive the principal components (PCs) from quantitative attributes and uses those PCs to handle the missing values problem. The study demonstrates that, given enough representative data records, the PCA model yields more reliable and valid values than regression. In application, this approach not only enhances the validity of the data cleaning process in KDD but also correctly predicts missing values in new data records on the basis of historical data records.
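For comparison, the most-common-value baseline mentioned above can be sketched in a few lines. The example shows both variants named in the abstract: taking the value from the whole data set, or from the records that share the same class. The function name `most_common_fill`, the dictionary-based record layout, and the sample rows are assumptions made for illustration.

```python
from collections import Counter

def most_common_fill(rows, target, label=None):
    """Fill None in column `target` with the most common observed value.

    With `label` set, the most common value within the record's own class
    is used; classes with no observed value fall back to the overall mode.
    """
    def mode(values):
        return Counter(values).most_common(1)[0][0]

    overall = mode(r[target] for r in rows if r[target] is not None)
    per_class = {}
    if label is not None:
        for cls in {r[label] for r in rows}:
            seen = [r[target] for r in rows
                    if r[label] == cls and r[target] is not None]
            per_class[cls] = mode(seen) if seen else overall

    filled = []
    for r in rows:
        r = dict(r)  # copy so the input records are left untouched
        if r[target] is None:
            r[target] = overall if label is None else per_class[r[label]]
        filled.append(r)
    return filled

# Sample records (assumed): one missing value in class "B".
rows = [
    {"color": "red", "cls": "A"},
    {"color": "red", "cls": "A"},
    {"color": "red", "cls": "A"},
    {"color": "blue", "cls": "B"},
    {"color": "blue", "cls": "B"},
    {"color": None, "cls": "B"},
]
global_fill = most_common_fill(rows, "color")
class_fill = most_common_fill(rows, "color", label="cls")
```

The two variants can disagree, as they do on these sample rows, which is one way a most-common-value fill can bias later analysis.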

Parallel Keywords

KDD; Data cleaning; Data mining; Principal Component Analysis; Missing values
