
樣本高效迴歸樹及多對多相對重要性分析用於因果分析方法之研究

Causal Analysis Methods by Sample-Efficient Regression Tree and Many-to-many Relative Importance Analysis

Advisor: 陳正剛

Abstract


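Stepwise regression and regression trees are commonly used to build causal analysis models that relate a single response variable to multiple influencing factors. Stepwise regression cannot automatically partition samples to build a piecewise linear regression model. A regression tree sequentially selects attributes to partition the data and finally links each partition to a specific linear regression model, so it can be used to build piecewise linear regression models. However, existing regression trees select an attribute and split the data at every node, so the sample size shrinks rapidly after a few levels. This depletion of samples makes later attribute selection overly dependent on the small samples produced by earlier splits, which leads to unreliable attribute selection. This research first combines the strengths of regression trees and stepwise regression and proposes the sample-efficient regression tree method to build piecewise regression models effectively.
On the other hand, when the complex relationships between multiple correlated response variables and multiple correlated potential influencing factors are considered, examining each response variable separately is no longer an effective way to discover the important factors. Methods that directly and simultaneously analyze the correlation between multiple responses and multiple factors already exist in the literature, but under multicollinearity among the variables they cannot reasonably explain each variable's contribution to the many-to-many correlation. A general framework has been proposed in the literature for estimating variable importance indices in one-to-many regression analysis. This research next extends that framework to estimate variable contribution indices in many-to-many correlation analysis.
This thesis uses hypothetical cases and actual semiconductor yield-analysis cases to illustrate the sample-efficient regression tree and many-to-many relative importance analysis and to verify their effectiveness in causal analysis applications. The results show that the sample-efficient regression tree can still effectively uncover the underlying causal analysis model under a limited sample size. The case results also show that many-to-many relative importance analysis discovers the causal relationships between two sets of variables more effectively than existing methods.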

Parallel Abstract


Forward stepwise regression analysis and regression trees are used for one-to-many causal analysis. Forward stepwise regression selects critical attributes using the same full set of data at every step. Regression analysis is, however, not capable of splitting data to construct piecewise regression models. Regression trees are known to be an effective data mining tool for constructing piecewise models by iteratively splitting the data set and selecting attributes into a hierarchical tree model. However, the sample size shrinks sharply after a few levels of data splitting, causing unreliable attribute selection. In this research, we propose the sample-efficient regression tree (SERT) approach, which combines forward selection in regression analysis with regression tree methodology to effectively construct a piecewise linear causal model.
When multiple correlated responses are mingled with potential causal factors, one-response-at-a-time correlation analysis is no longer sufficient to discover the critical factors that cause changes in the correlated responses. Although methodologies for many-to-many correlation analysis have been proposed in the literature, it is difficult, especially when multicollinearity exists among the variables, to measure the relative importance of a variable's contribution to the association between a set of responses and a set of factors. Johnson's dominance analysis [1] offers a general framework for determining the relative importance of independent variables in linear multiple regression models. In this research, we also extend Johnson's dominance index to many-to-many correlation analysis as a measure that summarizes the association between two sets of variables.
Hypothetical and actual semiconductor yield-analysis cases are used to illustrate both SERT and many-to-many relative importance analysis. The case studies show that SERT is effective in discovering a dataset's underlying model when the sample size available for analysis is relatively small. They also show that the many-to-many relative importance method is more effective than conventional methods in analyzing two sets of variables.
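
To make the one-to-many starting point concrete, the relative-weight heuristic of [1] can be sketched roughly as follows. This is a minimal illustration of the procedure as it is commonly described (orthogonalize the correlated predictors, regress the response on the orthogonal variables, then map the squared coefficients back to the original predictors). It is not the many-to-many extension proposed in this thesis, and the function name `relative_weights` is hypothetical.

```python
import numpy as np

def relative_weights(X, y):
    """Sketch of one-to-many relative weights (hypothetical helper name).

    X: (n, p) array of predictors; y: (n,) response.
    Returns p non-negative weights that sum to the R^2 of the
    ordinary least-squares regression of y on X.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    p = X.shape[1]

    # Correlations among predictors and between predictors and response.
    corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)
    r_xx = corr[:p, :p]
    r_xy = corr[:p, p]

    # Symmetric square root of R_xx: correlations between the original
    # predictors and their closest orthogonal counterparts.
    # Assumes R_xx is positive definite (no exactly collinear predictors).
    eigvals, eigvecs = np.linalg.eigh(r_xx)
    lam = eigvecs @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

    # Standardized coefficients of y on the orthogonal variables.
    beta = np.linalg.solve(lam, r_xy)

    # Distribute each squared coefficient back onto the predictors.
    return (lam ** 2) @ (beta ** 2)
```

Because the weights sum to the model's R², they can be read as shares of explained variance even when the predictors are correlated, which is the property the abstract seeks to carry over to two sets of variables.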

References


[1] J. W. Johnson, "A heuristic method for estimating the relative weight of predictor variables in multiple regression," Multivariate Behavioral Research, vol. 35, pp. 1-19, 2000.
[2] A. Sen and M. Srivastava, Regression Analysis: Theory, Methods, and Applications. Berlin: Springer, 1990.
[3] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. Monterey, CA: Wadsworth, 1984.
[4] H. Hotelling, "Relations between two sets of variates," Biometrika, vol. 28, pp. 321-377, 1936.
[5] H. Wold, "Path models with latent variables: the NIPALS approach," in Quantitative Sociology: International Perspectives on Mathematical and Statistical Modeling, H. M. Blalock et al., Eds. New York: Academic Press, 1975, pp. 307-357.
