A Robust Test for Batch-to-Batch Variable Selection

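When variable selection is performed in regression, the reliability of the result depends on the ratio of the number of independent variables to the sample size. In practice, however, only a limited number of samples can usually be collected even though the number of candidate independent variables is very large. In forward selection, when the sample size is much smaller than the number of variables, the conventional F-test criterion mistakenly selects non-significant variables, and the number of variables that can be selected is limited by the sample size. We propose MaxF as a new test statistic to increase the reliability of forward selection. Because the distribution of MaxF under the null hypothesis is known, the p-value of the test can be computed numerically. Under this new criterion, we use batch selection to select all significant independent variables, and we then carry out dependency analysis on the variables selected in different batches to uncover the potential relationships among them. In the thesis, we simulate different settings of sample size and number of independent variables to test the robustness of the MaxF statistic, and we apply the dependency analysis methods to more complex simulated cases. Finally, the approach is verified on two real cases: semiconductor yield and gene expression.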

Keywords

Batch variable selection

Parallel Abstract

When variable selection is used in regression, its reliability is greatly affected by the number of candidate variables relative to the sample size. Very often, however, only limited data can be collected for analysis, while the number of possible independent variables is large. In the forward selection procedure, problems arise when the sample size n is much smaller than the number of variables p. Under the conventional F-test selection criterion, noise variables are often mistakenly selected if the sample size is relatively small or the number of candidate variables is relatively large. The number of selected variables is also limited by the sample size. A new test statistic, named MaxF, with a known null distribution is proposed in this study. The test statistic improves the reliability of the forward selection procedure, and its p-value can be calculated numerically. Based on the new criterion, an extended selection procedure is developed to overcome the sample-size limitation and to continue selecting significant variables into different batches. After batch-to-batch selection, we propose dependency analysis methodologies to identify the inter-relationships among the batches of selected variables. The proposed test statistic is examined on simulated data under various scenarios with different sample sizes and numbers of candidate variables. The dependency analysis methodologies are applied to more complex simulation cases. The approach is also demonstrated and tested on semiconductor yield data and gene expression data.
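The abstract does not define MaxF, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the statistic at the first forward-selection step is the largest single-variable F statistic over all candidates, and it approximates the null distribution by Monte Carlo simulation with a pure-noise normal response, which is a stand-in for the numerical method the thesis refers to and not the author's procedure. The function names `max_f_statistic` and `max_f_pvalue` are hypothetical.

```python
import numpy as np


def max_f_statistic(X, y):
    """Return the largest single-variable F statistic and its column index.

    Assumption: the first forward-selection step, where each candidate is
    tested alone against an intercept-only model.
    """
    n, _ = X.shape
    yc = y - y.mean()
    Xc = X - X.mean(axis=0)
    # Squared sample correlation between each candidate column and y.
    r2 = (Xc.T @ yc) ** 2 / ((Xc ** 2).sum(axis=0) * (yc @ yc))
    # F statistic of a simple linear regression with 1 and n - 2 d.f.
    f = (n - 2) * r2 / (1.0 - r2)
    j = int(np.argmax(f))
    return float(f[j]), j


def max_f_pvalue(X, y, n_sim=500, seed=0):
    """Monte Carlo p-value of the observed maximum F statistic.

    Assumption: under the null, y is normal noise unrelated to X. The F
    statistic does not depend on the location or scale of y, so standard
    normal draws are sufficient.
    """
    rng = np.random.default_rng(seed)
    observed, j = max_f_statistic(X, y)
    n = len(y)
    exceed = 0
    for _ in range(n_sim):
        simulated, _ = max_f_statistic(X, rng.standard_normal(n))
        if simulated >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, j, (exceed + 1) / (n_sim + 1)
```

Comparing the best candidate against the distribution of the maximum, rather than against a single F(1, n - 2) reference, accounts for the search over p candidates, which is the source of the false selections described above when n is much smaller than p.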

Parallel Keywords

MaxF test; Batch-to-batch selection


Cited By

黃哲賢 (2012). 應用線性迴歸模型於射頻IC特性預測 [master's thesis, National Chiao Tung University]. Airiti Library. https://doi.org/10.6842/NCTU.2012.00207

Lu, Y. P. (2008). 整合統計分析與知識推論系統的貝氏架構設計 -以半導體良率分析為例 [master's thesis, Yuan Ze University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0009-2307200816125100
