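A Study on Applying Term Quantification and Latent Semantic Analysis to Medical Document Retrieval from Colloquial Descriptions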

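The preventive-medicine principle that prevention is better than cure has drawn growing attention as the population ages and chronic diseases become more common. People are increasingly aware of self-care, yet many disease and drug names are hard for the general public to recognize clearly or use correctly. Meanwhile, with the rapid growth of the Internet and mobile devices, much professional knowledge is readily available online. Medical content, however, still differs considerably from how laypeople perceive and understand their own illnesses, so building a cross-domain knowledge search mechanism that lets the general public easily find medical knowledge and content has become an important and unavoidable challenge. This study therefore applies natural language processing and computational linguistics methods to build a medical knowledge retrieval system that is easy for the general public to use, and examines the relationship between professional medical knowledge and colloquial medical queries. The specific aims are: (1) to develop an automatic web crawler for the medical encyclopedia of the KingNet national online hospital (國家網路醫院), and to build a medical content corpus by applying Chinese word segmentation and parsing to the short texts in its unstructured fields; (2) to apply an improved term quantification technique and a trigger pair model to select and expand a set of meaningful, highly associated keywords and convert it into a feature vector representation; (3) to apply latent semantic analysis to reduce the vector dimensionality substantially so that query sentences are represented effectively; and (4) to apply the vector space model and cosine similarity to compare vectors and retrieve the medical content the user needs. The study delivers an assistive system that retrieves healthcare knowledge from colloquial descriptions. With test data drawn at random from the 8694 crawled database entries, retrieval accuracy reached 100% under the Top-15 accuracy criterion, which demonstrates the feasibility and practicality of the proposed method. In the future, the approach can offer a simpler and more advanced similarity retrieval mechanism based on vectorized key medical features, as a concrete new way of delivering self-care health education content.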

Keywords

medical documents; information retrieval; term quantification; latent semantic analysis; vector space model

Parallel Abstract

Preventive medicine and health promotion are important for improving the quality of daily life. However, medical content is hard for laypeople to understand. This study therefore aimed to establish a back-end database of disease names and a cost-effective search platform through which people can easily look up professional medical knowledge. (1) Natural language processing and computational linguistics methods were applied to develop an assisted query system that retrieves medical information from general descriptions. (2) The CKIP word segmentation system was used to parse the medical content, and a statistical term quantification method based on Term Frequency–Inverse Document Frequency was adopted to select a set of keywords, which was reorganized as a vector. (3) Latent Semantic Analysis was performed to reduce the dimensionality of the keyword vectors for subsequent matching. (4) Finally, the vector space model and cosine similarity matching were used to retrieve the medical content the user needs. A corpus of 8694 medical terms and their explanations was collected from the KingNet website. An automatic database access mechanism for both local and remote sites was also developed for updating the corpus. With test data randomly selected from the 8694 documents in the database, retrieval accuracy reached 100% under the Top-15 accuracy criterion. The experimental results demonstrate the feasibility and practicality of the proposed method. In the future, it can provide a simpler and more advanced similarity retrieval mechanism based on quantified key medical features, as a concrete new means of delivering self-care health education content.
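The retrieval pipeline summarized in the abstract (term weighting, dimensionality reduction by latent semantic analysis, then cosine matching) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the smoothed IDF formula, the toy English tokens, and the choice of k are all assumptions made for illustration, and the input is assumed to be already segmented (the study itself uses CKIP for Chinese segmentation and an improved weighting scheme that the abstract does not spell out).

```python
import numpy as np

def tfidf_matrix(docs):
    """Build a document-by-term TF-IDF matrix from pre-segmented documents."""
    vocab = sorted({t for d in docs for t in d})
    index = {t: j for j, t in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, d in enumerate(docs):
        for t in d:
            tf[i, index[t]] += 1.0
    df = (tf > 0).sum(axis=0)
    # Smoothed IDF: an assumed variant, not necessarily the paper's weighting.
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0
    return tf * idf, index, idf

def lsa_projection(X, k):
    """Return a terms-by-k matrix spanning the top-k right singular vectors."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:k].T

def cosine(a, B):
    """Cosine similarity between vector a and every row of matrix B."""
    norms = np.linalg.norm(a) * np.linalg.norm(B, axis=1)
    norms = np.where(norms == 0, 1.0, norms)  # avoid division by zero
    return (B @ a) / norms

def search(query_tokens, docs, k=2, top_n=3):
    """Rank documents against a tokenized query in the reduced LSA space."""
    X, index, idf = tfidf_matrix(docs)
    P = lsa_projection(X, k)  # k must not exceed min(len(docs), vocabulary size)
    q = np.zeros(len(index))
    for t in query_tokens:
        if t in index:  # out-of-vocabulary query terms are ignored
            q[index[t]] += 1.0
    q *= idf
    scores = cosine(q @ P, X @ P)
    order = np.argsort(-scores)[:top_n]
    return [(int(i), float(scores[i])) for i in order]

if __name__ == "__main__":
    # Hypothetical toy corpus: two respiratory entries, two digestive entries.
    toy_docs = [
        ["fever", "cough", "throat"],
        ["stomach", "pain", "nausea"],
        ["cough", "throat", "sore"],
        ["pain", "stomach", "cramp"],
    ]
    for doc_id, score in search(["sore", "throat", "cough"], toy_docs, top_n=2):
        print(doc_id, round(score, 3))
```

A Top-N evaluation like the one reported would count a query as correct when its target entry appears among the first N results returned by a function such as `search`; how the study selected its test queries is not detailed in the abstract.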