  • Thesis


The Query of Bioinformatics Literatures By Document Similarity

Advisor: 王經篤


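When searching the literature by keyword, newcomers who have no clear concept of what they are looking for often find themselves unable to supply suitable keywords at the outset. In this study, we propose a "query by document" method to help users who lack the prior knowledge needed to choose keywords for a document search. The study consists of two main steps: "computation of document similarity" and "document vectorization". For document vectorization, we compute an appropriate weight for each pattern from the distribution of its occurrences and convert every document into a vector. For the computation of document similarity, the document submitted by the user as the query is converted into a vector, its similarity to the vector of each document in the literature is computed pair by pair, and the documents are presented to the user ordered by similarity as an indication of their relevance to the query document. The pattern set is built in two ways, from a "dictionary" and from the document "content", and the similarity between vectors is measured by cosine similarity. Experimental results show that queries using the "dictionary" achieve higher precision.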


Using keywords to search the literature for related documents is an obstacle for beginners, especially those who are not familiar with the concept they are looking for. This research consists of two processes, "the computation of document similarity" and "document vectorization". For document vectorization, we transform each document into a vector by weighting patterns appropriately according to their distribution. For the computation of document similarity, we transform the query document into a vector, compute the similarity between that vector and the vector of each document in the literature, and sort the documents by these similarity values for the user's reference. We obtain the set of patterns by two approaches, "dictionary" and "content", and use cosine similarity to evaluate the similarity of two vectors. Experimental results showed that the precision achieved by the "dictionary" approach was higher than that achieved by the "content" approach.
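
The two processes described above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the abstract does not give the weighting formula, so a smoothed inverse document frequency is assumed here as one plausible weight derived from the pattern distribution. Whitespace tokenization stands in for real pattern matching, and the function names, sample documents, and pattern list are all hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (assumed stand-in for pattern extraction)."""
    return text.lower().split()


def build_weights(docs, patterns):
    """Assumed weighting: smoothed IDF, so patterns found in fewer documents weigh more."""
    n = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    weights = {}
    for p in patterns:
        df = sum(1 for s in token_sets if p in s)  # number of documents containing p
        weights[p] = math.log((1 + n) / (1 + df)) + 1.0
    return weights


def vectorize(text, patterns, weights):
    """Turn a document into a vector with one weighted count per pattern."""
    counts = Counter(tokenize(text))
    return [counts[p] * weights[p] for p in patterns]


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; defined as 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_documents(query, docs, patterns):
    """Return (similarity, document index) pairs, most similar first."""
    weights = build_weights(docs, patterns)
    q = vectorize(query, patterns, weights)
    scored = [
        (cosine_similarity(q, vectorize(d, patterns, weights)), i)
        for i, d in enumerate(docs)
    ]
    return sorted(scored, key=lambda s: (-s[0], s[1]))


# Hypothetical collection and "dictionary" pattern set, for illustration only.
docs = [
    "gene expression in yeast cells",
    "protein structure prediction methods",
    "stock market price trends",
]
patterns = ["gene", "protein", "expression", "structure", "yeast"]

for score, index in rank_documents("yeast gene expression analysis", docs, patterns):
    print(f"{score:.3f}  {docs[index]}")
```

In this sketch the query is itself a document, so the user never has to choose keywords. The "dictionary" and "content" approaches from the abstract would differ only in how `patterns` is obtained.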




