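Using microarray gene expression data as a tool for disease classification has been recognized in the literature as a useful approach, and many classification methods have been proposed and compared. Among them, Prediction Analysis of Microarray (PAM) is a commonly used method, while Bayesian Binary Regression (BBR) is a new method proposed in the field of text classification. In the first part of this study, BBR and PAM are used to analyze gene expression datasets, and the two methods are compared on their error rates in the training set and the test set. In the second part, PAM and BBR are again used as analysis tools, with repeated sampling, to investigate how the composition of the training set affects the test-set error rate. On leukemia and lung cancer gene expression datasets, both PAM and BBR achieve good classification performance, but PAM uses more genes for prediction than BBR. The composition of the training set is examined in two respects: the training sample size and the ratio of the classes within the training set. The resampling results show that, for a fixed test set, a larger training set gives better predictions. The composition of the training set also matters: when the class ratios of the training set and the test set differ, the prediction error rates estimated from the two may diverge.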
Microarray gene expression data have been recognized as a useful tool for disease classification, and many methods have been proposed for analyzing such data. Among them, PAM (Prediction Analysis of Microarray) has been a popular method in recent years. A similar problem arose in the area of text classification, for which BBR (Bayesian Binary Regression) was recently proposed. In the first part of this study, we used BBR to analyze gene expression datasets and compared its performance with that of PAM, where performance is measured by the error rates on both the training set and the testing set. The results showed that PAM and BBR perform similarly in classification; however, PAM usually used more genes than BBR. In the second part, we investigated the effect of the sample size and composition of the training set on the error rate of the testing set. To examine this, we drew training sets in two ways: fixing the composition while changing the sample size, or fixing the sample size while changing the composition. The results showed that, for the same testing set, a larger training set gives a lower error rate. Furthermore, it is important to be aware that the composition of the training set relative to that of the testing set also affects prediction performance.
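The resampling design in the second part can be illustrated with a minimal sketch. It is not the study's implementation: it assumes a plain nearest-centroid classifier as a simplified stand-in for PAM (which additionally shrinks the centroids), it omits BBR, and it uses synthetic Gaussian data instead of the leukemia and lung cancer datasets. All function names and parameter values (`make_samples`, `resample_experiment`, the class shift, the numbers of genes and repeats) are hypothetical choices made for illustration.

```python
import random
import statistics


def make_samples(n, label, n_genes, shift, rng):
    """Draw n synthetic profiles; class 1 is shifted in the first 5 genes (assumption)."""
    samples = []
    for _ in range(n):
        profile = [
            rng.gauss(shift if (label == 1 and g < 5) else 0.0, 1.0)
            for g in range(n_genes)
        ]
        samples.append((profile, label))
    return samples


def fit_centroids(train):
    """Per-class mean profile: plain nearest centroid, without PAM's shrinkage."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def error_rate(centroids, data):
    """Fraction of samples in data that are misclassified."""
    wrong = sum(predict(centroids, x) != y for x, y in data)
    return wrong / len(data)


def resample_experiment(pool, test, n_train, frac_class1, n_repeats, rng):
    """Mean test error over repeated training draws of a given size and class ratio."""
    class0 = [s for s in pool if s[1] == 0]
    class1 = [s for s in pool if s[1] == 1]
    # Keep at least one sample per class so both centroids are defined.
    n1 = max(1, min(n_train - 1, round(n_train * frac_class1)))
    n0 = n_train - n1
    errors = []
    for _ in range(n_repeats):
        train = rng.sample(class0, n0) + rng.sample(class1, n1)
        errors.append(error_rate(fit_centroids(train), test))
    return statistics.mean(errors)


rng = random.Random(0)
pool = make_samples(100, 0, 50, 1.0, rng) + make_samples(100, 1, 50, 1.0, rng)
# The test set is drawn once and held fixed across all training draws.
test = make_samples(30, 0, 50, 1.0, rng) + make_samples(30, 1, 50, 1.0, rng)

results = {}
for n_train in (6, 20, 40):          # vary sample size
    for frac in (0.2, 0.5, 0.8):     # vary class composition
        results[(n_train, frac)] = resample_experiment(
            pool, test, n_train, frac, 50, rng
        )

for (n_train, frac), err in sorted(results.items()):
    print(f"n_train={n_train:2d} frac_class1={frac:.1f} mean_test_error={err:.3f}")
```

Holding the test set fixed while redrawing the training set mirrors the design described above, so any change in the mean test error can be attributed to the training set's size or class ratio. The numbers this toy setup prints are not comparable to the results reported in the study.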