When variable selection is used in regression, the reliability of the selection depends heavily on the number of candidate variables relative to the sample size. In practice, however, only a limited number of samples can usually be collected, while the number of possible independent variables is very large. In the forward selection procedure, problems arise when the sample size n is much smaller than the number of variables p. Under the conventional F-test selection criterion, noise variables are often mistakenly selected when the sample size is relatively small or the number of candidate variables is relatively large, and the number of variables that can be selected is limited by the sample size. This study proposes a new test statistic, named MaxF, to improve the reliability of forward selection. Because the distribution of MaxF under the null hypothesis is known, the p-value of the test can be computed numerically. Based on the new criterion, an extended selection procedure is developed to overcome the sample-size limitation by continuing to select significant variables in successive batches. After batch-by-batch selection, we propose dependency analysis methods to uncover the interrelationships among the variables selected in different batches. The robustness of the proposed test statistic is examined on simulated data under various settings of sample size and number of candidate variables, and the dependency analysis methods are applied to more complex simulated cases. Finally, the approach is demonstrated and tested on two real cases: semiconductor yield data and gene expression data.
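To make the selection problem concrete, the following is a minimal Python sketch of a first forward-selection step judged by the largest F statistic among all candidates. It rests on assumptions that the abstract does not confirm: MaxF is taken here to be the maximum single-variable F statistic over the candidate columns, and its null distribution is approximated by permuting the response, which stands in for the known null distribution and numerical calculation mentioned above. The function names `max_f` and `max_f_pvalue` and the simulated data are hypothetical.

```python
import numpy as np

def max_f(X, y):
    """Return (index, value) of the largest single-variable F statistic.

    Each candidate column is regressed on its own (with an intercept), so
    F = r^2 * (n - 2) / (1 - r^2), where r is the sample correlation with y.
    ASSUMPTION: this is only an illustrative stand-in for the MaxF statistic.
    """
    n = X.shape[0]
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r2 = (Xc.T @ yc) ** 2 / ((Xc ** 2).sum(axis=0) * (yc @ yc))
    f = r2 * (n - 2) / (1.0 - r2)
    j = int(np.argmax(f))
    return j, float(f[j])

def max_f_pvalue(X, y, n_perm=500, seed=0):
    """Monte Carlo p-value for the largest F statistic.

    ASSUMPTION: the null distribution is approximated by permuting y,
    not by the closed-form distribution the study refers to.
    """
    rng = np.random.default_rng(seed)
    j, observed = max_f(X, y)
    null = np.array([max_f(X, rng.permutation(y))[1] for _ in range(n_perm)])
    # Add-one correction keeps the estimated p-value strictly positive.
    p = (1 + np.sum(null >= observed)) / (n_perm + 1)
    return j, observed, float(p)

if __name__ == "__main__":
    # Simulated case with n much smaller than p and one true signal (column 5).
    rng = np.random.default_rng(1)
    X = rng.standard_normal((30, 200))
    y = 3.0 * X[:, 5] + rng.standard_normal(30)
    j, f_obs, p = max_f_pvalue(X, y, n_perm=200)
    print(f"selected column {j}, F = {f_obs:.1f}, p = {p:.3f}")
```

Comparing the observed maximum with the distribution of the maximum, instead of with an ordinary F distribution, accounts for the fact that the best of many noise variables can look significant by chance when n is much smaller than p.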