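Biomedical terms in the literature suffer from problems such as compound words, synonyms, idiomatic usages, and even newly introduced naming conventions. As a result, the terms used in different papers are not necessarily consistent, which makes automated integration of biomedical data very difficult. The most basic step, and the one with the greatest effect on system performance, is correctly identifying biomedical terms in the literature, that is, Biomedical Named Entity Recognition (Biomedical NER). In this paper we use a Hidden Markov Model to analyze the abstracts of papers, with the goal of finding the biomedical terms they contain. Our method has four steps. First, tokens are clustered according to five features of biomedical terms. Second, a Hidden Markov Model is built from the clustered training data. Third, the article supplied by the user is read in and its tokens are clustered according to the same five features. Finally, a machine learning algorithm marks the tokens in the input article that the system judges to be biomedical terms.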
As biomedical science advances, text mining in the biomedical domain is becoming increasingly important. Biomedical literature contains many irregularities and ambiguous contexts, such as compound words, synonyms, and acronyms, and even its naming conventions are not applied consistently, so correctly identifying biological terms in text is a fundamental requirement for information extraction. In this paper we propose a biological term extractor based on Hidden Markov Models. Our approach has four steps. First, the tokens in the training data are clustered according to five features. Second, a Hidden Markov Model is trained on the clustered tokens. Third, the user's input is normalized and its tokens are clustered in the same way. Finally, the biological terms are annotated according to a machine learning algorithm.
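
The four steps can be illustrated with a minimal Python sketch. It is not the implementation described in this paper: the abstract does not list the five features, the tag set, or the annotation algorithm, so the token classes in `token_class`, the B/I/O tags, the add-one smoothing, and the use of Viterbi decoding for the final step are all assumptions made for illustration.

```python
import math
import re

STATES = ("B", "I", "O")  # assumed tags: begin / inside / outside a term
CLASSES = ("HAS_DIGIT", "HAS_HYPHEN", "ALL_CAPS", "INIT_CAP", "BIO_SUFFIX", "OTHER")


def token_class(token):
    """Steps 1 and 3: map a token to a coarse class.

    The five tests are hypothetical stand-ins for the paper's features.
    """
    if re.search(r"\d", token):
        return "HAS_DIGIT"
    if "-" in token:
        return "HAS_HYPHEN"
    if token.isupper() and len(token) > 1:
        return "ALL_CAPS"
    if token[:1].isupper():
        return "INIT_CAP"
    if token.lower().endswith(("ase", "gen", "in")):
        return "BIO_SUFFIX"
    return "OTHER"


def _log_normalize(counts):
    total = sum(counts.values())
    return {key: math.log(value / total) for key, value in counts.items()}


def train(tagged_sentences, alpha=1.0):
    """Step 2: estimate HMM log-probabilities by counting, with add-alpha smoothing."""
    start = {s: alpha for s in STATES}
    trans = {s: {t: alpha for t in STATES} for s in STATES}
    emit = {s: {c: alpha for c in CLASSES} for s in STATES}
    for sentence in tagged_sentences:
        prev = None
        for token, tag in sentence:
            if prev is None:
                start[tag] += 1
            else:
                trans[prev][tag] += 1
            emit[tag][token_class(token)] += 1
            prev = tag
    return (
        _log_normalize(start),
        {s: _log_normalize(trans[s]) for s in STATES},
        {s: _log_normalize(emit[s]) for s in STATES},
    )


def viterbi(tokens, model):
    """Step 4 (assumed): return the most likely tag sequence for the tokens."""
    start, trans, emit = model
    obs = [token_class(t) for t in tokens]
    if not obs:
        return []
    score = {s: start[s] + emit[s][obs[0]] for s in STATES}
    backpointers = []
    for o in obs[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: score[p] + trans[p][s])
            new_score[s] = score[best] + trans[best][s] + emit[s][o]
            pointer[s] = best
        score = new_score
        backpointers.append(pointer)
    path = [max(STATES, key=lambda s: score[s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    # Toy training data, invented for this example only.
    training = [
        [("the", "O"), ("p53", "B"), ("binds", "O")],
        [("we", "O"), ("study", "O"), ("p21", "B")],
    ]
    model = train(training)
    tokens = ["a", "p16", "b"]
    print(list(zip(tokens, viterbi(tokens, model))))
```

Because the model emits token classes instead of raw words, a token never seen in training (such as `p16` here) can still be tagged by its surface shape, which is one plausible reason for clustering tokens before training.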