  • Thesis


Assessment of Feature Selection Methods on Cardiovascular Diseases Factors

Advisor: 許巍嚴


Cardiovascular disease is one of today's major diseases, and its incidence is gradually rising. Research has identified many factors that may affect it, including low-density lipoprotein cholesterol and C-reactive protein. This study examines whether other relevant or more important factors exist. It also aims to help researchers judge the importance of cardiovascular disease attributes so that classification accuracy improves, and to help physicians confirm attribute importance so that they can assess influencing factors more efficiently. In preprocessing, the data were categorized according to the literature. The mRMR feature selection algorithm was then used to rank all factors by importance, and classifiers in the Weka software were used to compare the results.

Classification with all factors gave an average accuracy of 55.5%, whereas classification based on the mRMR ranking reached a highest average accuracy of 57%. The original data set has 21 attributes. With mRMR, the highest accuracy was reached with an average of 7.5 factors, fewer than with the CfsSubsetEval + BestFirst attribute reduction method (average of 9) and the FilteredAttributeEval attribute ranking method (average of 11.5).

These results show that a small number of specific factors is enough to reach the highest classification accuracy; that is, weight, waist, bmi, glucose_ac, and others may be important factors affecting hsCRP. Because many factors may influence cardiovascular disease, we hope this method can assist researchers and help physicians make their judgments more efficiently.
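The ranking step described in the abstract can be illustrated with a minimal sketch of greedy mRMR on discrete data. This is not the thesis implementation: the abstract does not state which mRMR variant was used, so the difference form (relevance minus mean redundancy) is an assumption, and the function names `mutual_information` and `mrmr_rank`, the feature names, and the toy values are all hypothetical and unrelated to the study's data.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, target):
    """Greedy mRMR ranking (assumed difference form).

    features: dict mapping a feature name to its list of discrete values.
    target:   list of discrete class labels.
    Returns the feature names ordered from most to least important.
    """
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    ranked, remaining = [], set(features)
    while remaining:
        def score(name):
            # Relevance to the class minus mean redundancy with already-ranked features.
            if not ranked:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in ranked
            ) / len(ranked)
            return relevance[name] - redundancy

        # Sorting the candidates makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


# Hypothetical toy data: "a_copy" duplicates "a", and "noise" is independent of the class.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a":      [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b":      [0, 0, 1, 0, 1, 1, 0, 1],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1],
}
ranking = mrmr_rank(features, target)
print(ranking)
```

In this toy case the duplicated feature is pushed behind a weaker but non-redundant one, which is the behaviour that lets a short prefix of the ranking carry most of the class information. Choosing how many top-ranked features to keep would then be done by evaluating a classifier on successive prefixes, as the abstract describes doing in Weka.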


