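Missing values trouble both data analysts and data maintainers throughout the data mining process. Handling them properly depends largely on how information technology is applied and which method is chosen. Existing approaches to missing values usually bias the analysis results and leave too little data to work with. Traditional decision trees fill a missing value with the most common value; the MVC (Missing Values Completion) method uses robust association rules to discover rules relating the data, and then uses those rules to infer multiple missing values. However, the MVC method is not suited to finding quantitative association rules. This study uses the eigenvector concept of Principal Component Analysis to identify, among the quantitative variables, the situations in which variables are correlated with one another, in order to handle missing quantitative data in the data cleaning step that precedes data mining and to determine the value with which each missing value should be filled. Experimental results show that Principal Component Analysis needs only sufficient and representative data records to predict missing values, and that it performs quite well in comparison with regression. In application, the approach not only improves the effectiveness of the data cleaning step in knowledge discovery, but can also predict the missing values of newly arriving data from the variable structure of the historical data.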
Missing values are an important issue for analysts during the data mining process. The problem is which methods are suitable for completing those missing values. A current approach in decision trees is to fill a missing value with the most common value, chosen either from the whole data set or from the data sets constructed for the classification task. An alternative method, MVC (Missing Values Completion), completes the missing values with association rules discovered by RAR (Robust Association Rules), a technique for mining databases that contain multiple missing values. However, the MVC method is not suitable for mining quantitative association rules. Our approach uses the concept of PCA (Principal Component Analysis) and its eigenvectors to derive the principal components (PCs) of the quantitative attributes, and then uses the PCs to handle the missing values problem. The study demonstrates that, given enough representative data records, the PCA model yields more reliable and valid values than regression. In application, this approach not only enhances the validity of the data cleaning process in KDD, but can also predict the missing values in new data records according to the historical data records.
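
The idea of filling a quantitative missing value from the principal components of complete historical records can be illustrated with a minimal sketch. It is not the procedure used in the study: it assumes a single principal component, unstandardized variables, and a least-squares projection of the observed attributes onto that component. The function names and the toy data are hypothetical.

```python
import math


def column_means(rows):
    """Mean of each attribute over the complete records."""
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]


def covariance_matrix(rows, means):
    """Sample covariance matrix of the complete records."""
    n, d = len(rows), len(means)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        centered = [r[j] - means[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += centered[a] * centered[b]
    return [[value / (n - 1) for value in row] for row in cov]


def leading_eigenvector(matrix, iterations=200):
    """First principal direction by power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector,
    which holds for the positively correlated toy data below.
    """
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v


def impute_with_first_pc(record, means, pc):
    """Fill None entries using the first principal component.

    The component score is estimated by least squares from the observed
    attributes only, then used to reconstruct the missing ones.
    """
    observed = [j for j, x in enumerate(record) if x is not None]
    denom = sum(pc[j] ** 2 for j in observed)
    score = 0.0
    if denom > 0.0:
        score = sum(pc[j] * (record[j] - means[j]) for j in observed) / denom
    return [
        x if x is not None else means[j] + score * pc[j]
        for j, x in enumerate(record)
    ]


# Hypothetical historical records whose attributes follow a rough 1:2:3 pattern.
complete = [
    [1.0, 2.1, 3.0],
    [2.0, 3.9, 6.1],
    [3.0, 6.0, 9.0],
    [4.0, 8.1, 11.9],
    [5.0, 10.0, 15.0],
]
means = column_means(complete)
pc = leading_eigenvector(covariance_matrix(complete, means))

# A new record with the second attribute missing.
filled = impute_with_first_pc([2.5, None, 7.5], means, pc)
print(filled)
```

In this toy setting the imputed second attribute should land close to 5, in line with the pattern in the historical records. With more attributes or weaker correlations, several components and standardized variables would likely be needed, and the choice of how many components to keep is left open here.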