
整合機器學習與重複採樣技術於卵巢癌分期預測模型之建立

Implementation of Ovarian Cancer Prediction Model based on Machine Learning and Over-sampling Techniques

Advisor: 陳牧言
Co-advisor: 蔡孟勳 (Meng-Hsiun Tsai)

Abstract


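Cancer has been the country's leading cause of death since 1982 (ROC year 71) and, as of 2014 (ROC year 103), had topped the list of the ten leading causes of death for 32 consecutive years. Among the ten deadliest cancers in women, the top three are breast cancer, cervical cancer, and ovarian cancer, and ovarian cancer has consistently ranked near the top of cancer deaths in women. Early-stage ovarian cancer is very difficult to diagnose: even annual routine women's health checkups detect only a small fraction of early-stage cases, so by the time a woman is diagnosed with ovarian cancer, the optimal treatment window has often passed.

Many researchers have applied machine learning algorithms, such as artificial neural networks and genetic algorithms, to ovarian cancer, aiming to identify key classification attributes and build predictive classification models; others have used traditional statistical methods to identify differences and study the disease. A predictive classification model supports the goal of early detection and early treatment: the model is trained on sample data from ovarian cancer patients, and when a sample from a new patient arrives, the trained model classifies it so that medical care can be provided in time.

To identify the important feature genes for each stage of ovarian cancer, this thesis combines feature selection with machine learning algorithms to analyze ovarian cancer microarray data sets, and uses SMOTE (Synthetic Minority Over-sampling Technique) to balance the number of samples across stages. The ultimate goal is a predictive model that accurately classifies the stages of ovarian cancer from these feature genes, in support of early prevention. The experiments construct five data sets according to the stages of ovarian cancer, and the best results on the five data sets reach an accuracy of 100%. The thesis therefore confirms that feature selection can effectively identify discriminative features and that, combined with SMOTE and machine learning algorithms, it improves both the accuracy and the overall runtime performance of the predictive classification model.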

Parallel Abstract


Cancer has been the leading cause of death since 1982, holding that rank for 32 consecutive years through 2014. Among the ten deadliest cancers in women, the top three are breast cancer, cervical cancer, and ovarian cancer, and ovarian cancer has consistently ranked among the leading causes of cancer death in women. Early-stage ovarian cancer is very difficult to diagnose; even annual routine health checks detect only a small number of early-stage cases. By the time a woman receives a definite diagnosis of ovarian cancer, the optimal treatment window has usually been missed. Many researchers have applied machine learning algorithms, such as neural networks and genetic algorithms, to ovarian cancer, hoping to find the key classification attributes and build classification models, or have used traditional statistical methods to study ovarian cancer through the differences they identify. A predictive classification model helps achieve early detection and early treatment. Sample data from ovarian cancer patients are used to train the model; when a sample from a new patient arrives, the trained model classifies it so that medical treatment and services can be provided in a timely manner. To find important indicator genes and use them to build a predictive classification model for ovarian cancer, this study combines feature selection with machine learning algorithms to analyze ovarian cancer microarray data sets, and adopts SMOTE (Synthetic Minority Over-sampling Technique) to balance the number of samples at each stage. The ultimate goal is a predictive model that accurately classifies the stages of ovarian cancer based on important feature genes, so that early prevention of ovarian cancer can be achieved.
The experiments in this study establish five data sets according to the stages of ovarian cancer. The best experimental results on the five data sets reach an accuracy of up to 100%. This study therefore confirms that feature selection can effectively find the most discriminative features, and that combining it with SMOTE and machine learning algorithms effectively improves the accuracy and the overall performance of the predictive classification model.
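
The abstract names SMOTE but does not describe its mechanics. As a rough illustration, assuming the standard formulation (each synthetic sample is a random interpolation between a minority-class sample and one of its k nearest minority-class neighbours), the idea can be sketched in plain Python. The function name `smote`, its parameters, and the toy data are hypothetical and are not taken from the thesis, which may have used a different implementation or tool.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples from a list of minority-class vectors.

    Hypothetical helper for illustration only: it interpolates between a
    randomly chosen minority sample and one of its k nearest minority
    neighbours (Euclidean distance).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the other minority samples, nearest first.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data (assumed): three samples of an under-represented class,
# each described by two feature values.
minority_stage = [[1.0, 2.0], [1.5, 1.8], [1.2, 2.4]]
new_samples = smote(minority_stage, n_new=4, k=2)
```

In the thesis setting, each row would presumably be the expression profile of one patient sample from an under-represented stage. Whether over-sampling was applied before or after feature selection is not stated in the abstract.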

