Given a candidate set of individuals that have been genotyped but not yet phenotyped, we develop an efficient algorithm for choosing an optimal subset of the candidates. The chosen subset serves as the training set: its individuals are phenotyped, and a genomic selection (GS) model is then built from the resulting phenotype and genotype data. In this study, we consider the whole-genome regression model and estimate the marker effects in the GS model by ridge regression. In breeding, the fitted GS model is then used to compute genomic estimated breeding values (GEBVs) for a test set of individuals that have only been genotyped. We propose a new optimality criterion for determining the training set, derived directly from Pearson's correlation between the GEBVs and the phenotypic values of the test set, which is the standard measure of prediction accuracy for a GS model. We implement our training set determination algorithm in R and illustrate it with a rice genome data set. The results show that a training set generated by our algorithm usually achieves significantly higher prediction accuracy than a randomly selected training set.
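The evaluation pipeline described above (ridge estimation of marker effects, GEBV prediction for a test set, and Pearson's correlation as the accuracy measure) can be sketched in base R as follows. This is a minimal illustration on simulated data, not the authors' implementation: the population sizes, the marker coding, the fixed penalty `lambda`, and the helper names `sim_geno`, `ridge_fit`, and `gebv` are all assumptions made for the example. It covers only the randomly selected baseline, since the abstract does not give the form of the proposed optimality criterion or the selection algorithm.

```r
# Minimal sketch on simulated data; every name and value here is an
# illustrative assumption, not taken from the study.
set.seed(1)

n_cand  <- 200  # candidate individuals (genotyped, not yet phenotyped)
n_test  <- 50   # test individuals (genotyped only)
p       <- 500  # number of markers
n_train <- 40   # training set size
lambda  <- 1    # ridge penalty, assumed fixed rather than estimated

# Simulated marker matrices with genotypes coded -1 / 0 / 1.
sim_geno <- function(n, p) {
  matrix(sample(c(-1, 0, 1), n * p, replace = TRUE), nrow = n, ncol = p)
}
X_cand <- sim_geno(n_cand, p)
X_test <- sim_geno(n_test, p)

# Simulated true marker effects and phenotypes (effects plus noise).
beta_true <- rnorm(p, sd = 0.1)
y_cand <- drop(X_cand %*% beta_true) + rnorm(n_cand)
y_test <- drop(X_test %*% beta_true) + rnorm(n_test)

# Ridge estimate of marker effects:
# solve (X'X + lambda * I) b = X'(y - mean(y)) for b.
ridge_fit <- function(X, y, lambda) {
  mu <- mean(y)
  b <- solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y - mu))
  list(mu = mu, beta = drop(b))
}

# GEBVs for individuals that have genotype data only.
gebv <- function(fit, X_new) fit$mu + drop(X_new %*% fit$beta)

# Baseline: a randomly selected training set drawn from the candidates.
train_idx <- sample(n_cand, n_train)
X_train <- X_cand[train_idx, , drop = FALSE]
y_train <- y_cand[train_idx]

fit   <- ridge_fit(X_train, y_train, lambda)
g_hat <- gebv(fit, X_test)

# Prediction accuracy: Pearson's correlation between GEBVs and the
# test-set phenotypic values.
accuracy <- cor(g_hat, y_test)
print(accuracy)
```

A training set chosen by a selection algorithm would replace `train_idx`; the rest of the pipeline stays the same, so the two approaches can be compared on the resulting `accuracy`. In practice the test-set phenotypes are unknown when the training set is chosen, which is why a criterion computable from genotype data alone is needed.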