With the rapid progress of biomedical science and the growing number of analysis methods, using text mining tools to find protein-protein interactions has become increasingly important. Researchers today obtain knowledge about protein-protein interactions by reading biomedical literature such as PubMed abstracts, but the volume of this literature is enormous and continues to grow at an exponential rate. Extracting the information manually would cost a great deal of labor and time, so the demand for automatic extraction of information from text has been increasing. We develop a system that automatically extracts protein-protein interactions from free text using a shallow parser and sentence structure analysis. Unlike conventional approaches, our system matches sentences against syntactic patterns that typically describe protein-protein interactions. We design an efficient algorithm and, taking sentence semantics into account, develop a set of rules that extract protein-protein interactions from the syntactic roles of the proteins. Each extracted interaction distinguishes the ACTOR (the donor of the action) from the OBJECT (the receiver of the action). The system consists of the following steps: preprocessing of the biomedical literature, sentence splitting, tokenization, part-of-speech tagging, protein name recognition, tagging of interaction keywords, prepositions, and conjunctions, and extraction of protein-protein interactions. Finally, we evaluate the system on two test sets, one derived from the LLL05 challenge and the other from BioCreAtIvE-PPI.
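To make the pipeline concrete, the following is a minimal Python sketch of pattern-based extraction with ACTOR and OBJECT roles. It is not the system described here: the lexicons (`PROTEINS`, `ACTIVE_VERBS`, `PASSIVE_VERBS`), the placeholder protein names, and the two patterns are all hypothetical stand-ins for a real protein name recognizer, a part-of-speech tagger, and the full rule set.

```python
import re
from typing import List, NamedTuple, Set, Tuple

# Hypothetical toy lexicons (assumptions for illustration only). A real system
# would use a protein name recognizer and a curated interaction keyword list.
PROTEINS: Set[str] = {"ProtA", "ProtB", "ProtC", "ProtD"}
ACTIVE_VERBS: Set[str] = {"activates", "represses", "binds", "inhibits"}
PASSIVE_VERBS: Set[str] = {"activated", "repressed", "bound", "inhibited"}


class Interaction(NamedTuple):
    actor: str    # protein performing the action (ACTOR)
    keyword: str  # interaction keyword found in the sentence
    obj: str      # protein receiving the action (OBJECT)


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence: str) -> List[str]:
    """Keep runs of letters, digits, and hyphens; drop other punctuation."""
    return re.findall(r"[A-Za-z0-9\-]+", sentence)


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Label each token as PROTEIN, ACTIVE, PASSIVE, BY, or O (other)."""
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if tok in PROTEINS:
            label = "PROTEIN"
        elif low in ACTIVE_VERBS:
            label = "ACTIVE"
        elif low in PASSIVE_VERBS:
            label = "PASSIVE"
        elif low == "by":
            label = "BY"
        else:
            label = "O"
        tagged.append((tok, label))
    return tagged


def extract(sentence: str) -> List[Interaction]:
    """Apply two toy patterns to the tagged tokens of one sentence."""
    # Dropping untagged tokens lets the patterns skip filler words, but it
    # also ignores negation and clause boundaries (a known simplification).
    tagged = [(t, l) for t, l in tag(tokenize(sentence)) if l != "O"]
    results: List[Interaction] = []
    # Pattern 1: PROTEIN ACTIVE PROTEIN -> the first protein is the ACTOR.
    for i in range(len(tagged) - 2):
        (a, la), (k, lk), (b, lb) = tagged[i:i + 3]
        if (la, lk, lb) == ("PROTEIN", "ACTIVE", "PROTEIN"):
            results.append(Interaction(a, k, b))
    # Pattern 2: PROTEIN PASSIVE BY PROTEIN -> the second protein is the ACTOR.
    for i in range(len(tagged) - 3):
        (a, la), (k, lk), (_, lby), (b, lb) = tagged[i:i + 4]
        if (la, lk, lby, lb) == ("PROTEIN", "PASSIVE", "BY", "PROTEIN"):
            results.append(Interaction(b, k, a))
    return results


def extract_all(text: str) -> List[Interaction]:
    """Run sentence splitting and extraction over a whole text."""
    found: List[Interaction] = []
    for sentence in split_sentences(text):
        found.extend(extract(sentence))
    return found
```

For example, `extract_all("ProtA activates ProtB. ProtC is repressed by ProtD.")` yields one interaction per sentence, with the passive sentence assigning `ProtD` as the actor and `ProtC` as the object. The point of the sketch is only that role assignment follows from the syntactic pattern rather than from word order alone.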