Machine Learning with Automatic Feature Selection for Multi-class Protein Fold Classification

In machine learning, both the choice of network and the choice of features must be made carefully, because each can improve or degrade the result. In bioinformatics, the number of features may be too large for machine learning to be practical. In this study we introduce feature selection to bioinformatics problems. We use neural networks in which each input node is associated with a gate. At the beginning of training, all gates are almost closed, so practically no features are allowed to enter the network. During training, gates are opened or closed as required. After the selection training phase is complete, gates corresponding to helpful features are fully open, while gates corresponding to useless features are closed more tightly. Some gates may remain partially open, depending on the importance of the corresponding features. The network can therefore not only select features online during learning but also perform some feature extraction. We combine feature selection with our novel hierarchical machine learning architecture and apply it to multi-class protein fold classification. At the first level, the network classifies the data into four major folds: all alpha, all beta, alpha+beta, and alpha/beta. At the next level, another set of networks further classifies the data into twenty-seven folds. This approach has several benefits. The gating network drastically reduces the number of features: at the first level, using just 50 features selected by the gating network, we obtain a test accuracy comparable to that obtained using 125 features in neural classifiers. The process also gives better insight into the folding process; for example, by tracking the evolution of the different gates, we can find which characteristics (features) of the data are more important for folding. Finally, it reduces computation time. The hierarchical architecture also helps improve performance.
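The gating idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' network: it assumes a sigmoid gate per input feature, a single softmax layer in place of a full neural classifier, and plain gradient descent. The names `gate`, `make_toy_data`, `train_gated_classifier`, and the initial gate parameter `m_init` are hypothetical and chosen only for illustration.

```python
import numpy as np


def gate(m):
    """Sigmoid gate in (0, 1): values near 0 are closed, near 1 are open."""
    return 1.0 / (1.0 + np.exp(-m))


def make_toy_data(n=200, d=5, seed=0):
    """Toy data (assumption): only feature 0 determines the class label."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    y = (X[:, 0] > 0).astype(int)
    return X, y


def train_gated_classifier(X, y, n_classes, epochs=500, lr=0.5,
                           m_init=-3.0, seed=0):
    """Train a softmax classifier whose inputs pass through learnable gates.

    Each feature x_j is multiplied by gate(m_j). All m_j start at m_init,
    so every gate starts almost closed; gradient descent then adjusts the
    gate parameters together with the classifier weights.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    m = np.full(d, m_init)                        # gate parameters
    W = rng.normal(0.0, 0.1, size=(d, n_classes))  # classifier weights
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                      # one-hot targets

    for _ in range(epochs):
        g = gate(m)
        Z = X * g                                 # gated inputs
        logits = Z @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)

        # Gradients of the mean cross-entropy loss.
        dlogits = (P - Y) / n
        dW = Z.T @ dlogits
        db = dlogits.sum(axis=0)
        dZ = dlogits @ W.T
        dm = (dZ * X).sum(axis=0) * g * (1.0 - g)  # chain rule through gate

        W -= lr * dW
        b -= lr * db
        m -= lr * dm

    return gate(m), W, b


if __name__ == "__main__":
    X, y = make_toy_data()
    gates, W, b = train_gated_classifier(X, y, n_classes=2)
    print("final gate openings:", np.round(gates, 3))
```

In this toy setting the gate on the informative feature should open further than the gates on the noise features, which mirrors the behaviour the abstract describes. Ranking features by their final gate values and keeping the most open ones is one plausible way to turn the gates into a feature subset; the abstract does not specify the selection rule, so that step is an assumption.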

Keywords

machine learning; hierarchical architecture; feature selection; gate; neural network; protein fold; bioinformatics
