A Study on Predicting Missing Values Using Multiple Models

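In big data analytics, the completeness and consistency of the data are often key factors in the correctness of the results. Before analysis begins, the collected data must therefore be cleaned so that anomalies in the data do not lead to erroneous results, and preserving data completeness is an important part of that cleaning. One cause of incomplete data is missing values, which arise from human error, instrument failure, and similar problems during data collection. Common ways of handling missing values include discarding the tuples that contain them, or filling them in with a measure of central tendency of the affected attribute, such as the mean or median. These methods can erase the original characteristics of the tuple, which affects subsequent analysis and applications and can lead to incorrect results. To address this problem, this study uses machine learning to impute missing values in a single attribute. We take the records that contain no missing values as training data and partition them into clusters with K-Means in order to capture relationships in the data that are not readily apparent; for each cluster we then build prediction models using multiple regression and an artificial neural network. For a record whose value is to be predicted, we first use the KNN algorithm to determine which cluster the record belongs to, and then apply that cluster's model to compute the predicted value. In our experiments, with accuracy measured by root-mean-square error, the proposed multi-model imputation approach outperformed existing imputation algorithms in every case.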
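The pipeline described above (cluster the complete records, fit one model per cluster, route each incomplete record to a cluster with KNN, then predict) can be sketched roughly as follows. This is a minimal illustration and not the study's implementation. It assumes that clustering runs on the complete records including the attribute to be imputed, that the KNN vote uses only the observed attributes, and that each cluster's model is a least-squares multiple regression (the neural-network variant is omitted). The function names, the farthest-point initialisation, and the noise-free synthetic data are all hypothetical choices made for the example.

```python
import numpy as np


def assign(points, centers):
    """Index of the nearest center for every row of `points`."""
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, n_iter=50):
    """Lloyd's k-means with farthest-point initialisation (an assumption,
    chosen only so that the sketch is deterministic)."""
    centers = [points[0]]
    for _ in range(1, k):
        dist_to_nearest = np.min(
            [np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[dist_to_nearest.argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(points, centers)
        for j in range(k):
            if np.any(labels == j):  # leave an empty cluster's center alone
                centers[j] = points[labels == j].mean(axis=0)
    return assign(points, centers)


def fit_cluster_models(features, target, labels):
    """One least-squares multiple regression (with intercept) per cluster."""
    models = {}
    for j in np.unique(labels):
        mask = labels == j
        design = np.column_stack([np.ones(mask.sum()), features[mask]])
        coef, *_ = np.linalg.lstsq(design, target[mask], rcond=None)
        models[int(j)] = coef
    return models


def knn_cluster(x, features, labels, n_neighbors=5):
    """Majority cluster label among the nearest complete records."""
    nearest = np.argsort(np.linalg.norm(features - x, axis=1))[:n_neighbors]
    return int(np.bincount(labels[nearest]).argmax())


def impute(x, features, labels, models, n_neighbors=5):
    """Predict the missing attribute with the model of the record's cluster."""
    coef = models[knn_cluster(x, features, labels, n_neighbors)]
    return float(coef[0] + x @ coef[1:])


def rmse(predicted, actual):
    diff = np.asarray(predicted) - np.asarray(actual)
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic, noise-free data with two regimes (illustrative only).
rng = np.random.default_rng(42)
feat_a = rng.uniform(0.0, 1.0, size=(50, 2))
feat_b = rng.uniform(10.0, 11.0, size=(50, 2))
features = np.vstack([feat_a, feat_b])
target = np.concatenate([
    2.0 * feat_a[:, 0] + feat_a[:, 1] + 1.0,
    -3.0 * feat_b[:, 0] + feat_b[:, 1] + 50.0,
])

# Cluster complete records, then fit one regression per cluster.
labels = kmeans(np.column_stack([features, target]), k=2)
models = fit_cluster_models(features, target, labels)

# Two records whose target attribute is "missing", with the known truth.
queries = np.array([[0.5, 0.5], [10.5, 10.5]])
truth = np.array([2.5, 29.0])
model_pred = [impute(q, features, labels, models) for q in queries]
mean_pred = [target.mean()] * len(queries)  # mean-imputation baseline

print("per-cluster model RMSE:", rmse(model_pred, truth))
print("mean imputation RMSE:  ", rmse(mean_pred, truth))
```

On this toy data the two regimes follow different linear relationships, so a single global mean cannot represent either of them, while the per-cluster regressions recover each relationship. Real data would be noisy and would call for a held-out evaluation rather than two hand-picked queries.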

Keywords

Missing Value; Multiple Regression Analysis; Artificial Neural Network; K-Means Clustering Algorithm

Parallel Abstract

The integrity and consistency of data substantially influence the results of big data analytics. Data cleansing is therefore performed before analysis begins, so that the results are not distorted by data anomalies, and a key goal of data cleansing is to preserve data integrity. Missing values in the collected data are one of the main factors undermining such integrity and often result from human negligence or machine malfunction during data collection. Common methods for addressing this problem include ignoring records that contain missing values or substituting the missing values with measures of central tendency, such as means or medians. These methods may fill in missing values incorrectly because they cannot detect relationships among the input data, and as a result the outcomes of subsequent analyses may also be incorrect. In this study, we used machine learning techniques to impute missing values for a single attribute. We used the data without missing values as the training data and clustered it using the k-means algorithm. For each cluster, prediction models were built using multiple regression and artificial neural networks. The k-nearest neighbor algorithm was used to determine which cluster a record with a missing value belongs to, and that cluster's model was used to compute the missing value. In our experiments, we compared the root-mean-square error of our models with that of commonly used imputation methods, and the results showed that our models were more accurate.

Parallel Keywords

Missing Value; Multiple Regression; Artificial Neural Network; K-Means Clustering
