Variable Selection and Estimation for High-Dimensional Data with Misclassification and Measurement Error

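Binary classification has long been a topic of interest in statistical analysis and supervised learning. Logistic and probit models are the usual choices for relating a binary outcome to covariates. However, as the dimension of the data grows rapidly and non-negligible measurement error is present in both the outcome and the covariates, conventional methods no longer apply, which poses a major challenge for data analysis. To address these problems, we propose a valid inferential method that handles measurement error and performs variable selection at the same time. Specifically, we first consider a logistic or probit model and place the error-corrected response and covariates into our estimating function. We then use a boosting method to select variables and compute the parameter estimates. In numerical studies, the proposed method accurately retains the important variables and estimates the parameters precisely. Moreover, the error-corrected results perform markedly better overall than the uncorrected ones.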

Keywords

binary classification data; boosting; error correction; measurement error; regression calibration

Parallel Abstract

Binary classification has long been an active topic in statistical analysis and supervised learning. Logistic regression and probit models are among the most commonly used approaches for modeling a binary response with predictors. However, because the dimension of the data is growing rapidly and measurement error in the responses and/or predictors cannot be ignored, data analysis becomes challenging and conventional methods are invalid. To address these concerns, we propose a valid inferential method that deals with measurement error and performs variable selection simultaneously. Specifically, we consider logistic regression or probit models and propose corrected estimating functions that incorporate error-eliminated responses and predictors. We then develop a boosting procedure built on the corrected estimating functions to carry out variable selection and estimation. Numerical studies show that the proposed method accurately retains informative predictors and gives precise estimators, and that it generally performs better than the same analysis without measurement error correction.
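The abstract gives no formulas, so the following is only a minimal sketch of how such a procedure could look, not the author's implementation. It assumes a logistic model without intercept, known misclassification probabilities `pi01` and `pi10`, a known and common measurement error variance `sigma_u2`, and independent covariates so that regression calibration can be applied coordinate by coordinate. All function names, parameter names, and simulation settings are hypothetical. The boosting loop is a generic thresholded update driven by the estimating function, used here as a stand-in for the thesis's boosting step.

```python
import numpy as np


def correct_response(y_star, pi01, pi10):
    """Unbiased surrogate for the true response Y.

    Assumes pi01 = P(Y* = 1 | Y = 0) and pi10 = P(Y* = 0 | Y = 1) are known,
    so that E[(Y* - pi01) / (1 - pi01 - pi10) | Y] = Y.
    """
    return (y_star - pi01) / (1.0 - pi01 - pi10)


def calibrate_covariates(w, sigma_u2):
    """Coordinate-wise regression calibration (assumed independent covariates).

    Replaces each observed W_j = X_j + U_j by an estimate of E[X_j | W_j].
    """
    mu = w.mean(axis=0)
    var_w = w.var(axis=0, ddof=1)
    reliability = np.clip((var_w - sigma_u2) / var_w, 0.0, 1.0)
    return mu + reliability * (w - mu)


def boost(x, y, n_steps=40, step=0.05, tau=1.0):
    """Boosting driven by the logistic estimating function.

    Each iteration moves every coordinate whose estimating-function value is
    within a factor tau of the largest one; tau = 1 updates the top coordinate.
    """
    n, p = x.shape
    beta = np.zeros(p)
    for _ in range(n_steps):
        mu = 1.0 / (1.0 + np.exp(-(x @ beta)))
        g = x.T @ (y - mu) / n  # estimating function evaluated at beta
        active = np.abs(g) >= tau * np.max(np.abs(g))
        beta[active] += step * np.sign(g[active])
    return beta


# Hypothetical simulation: 2 informative predictors out of 50.
rng = np.random.default_rng(0)
n, p = 1000, 50
beta_true = np.zeros(p)
beta_true[:2] = [2.0, -2.0]
x = rng.normal(size=(n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x @ beta_true))))

w = x + rng.normal(scale=0.5, size=(n, p))  # error-prone predictors
flip = rng.random(n) < 0.1                  # 10% misclassification
y_star = np.where(flip, 1 - y, y)

x_hat = calibrate_covariates(w, sigma_u2=0.25)
y_hat = correct_response(y_star, pi01=0.1, pi10=0.1)
beta_hat = boost(x_hat, y_hat)

print("selected predictors:", np.flatnonzero(beta_hat))
```

Because the logistic estimating function is linear in the response, substituting the corrected response keeps it unbiased when misclassification does not depend on the covariates. Regression calibration, by contrast, is only an approximate correction in a logistic model. Stopping after a small number of steps is what yields a sparse estimate, so `n_steps` acts as the tuning parameter in this sketch.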

Parallel Keywords

binary data; boosting; error elimination; measurement error; regression calibration
