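This study proposes a context-based statistical search model whose parameters are estimated with an unsupervised EM algorithm. By comparing the similarity of left and right contexts, the model addresses search results that, because of semantic ambiguity, fail to match the user's intent, are incomplete, or are unrelated to the need altogether. We also propose an EM algorithm that learns the model parameters automatically. This removes the usual restriction that left and right contexts must match as identical strings (Exact Match) and lets context comparison achieve the effect of a “synonymous match” (Synonymous Match). We formulate the search model as a machine translation problem: the similarities between all Context Windows of the Source Query (each containing left and right contexts) and all Context Windows of a candidate Target Object are computed and accumulated into the expected co-occurrence of Source and Target, which is then used to estimate the probability that the Target object is equivalent to the Source object. The more highly similar Context Windows a Source Query and a Target Object share, the higher the probability that they are equivalent objects. We apply the search model to the problem of aligning synonymous terms between Simplified and Traditional Chinese and compare it with previous work. The results show that the EM algorithm effectively improves the accuracy of the context-based statistical search model, which reaches as high as 48%. Given the characteristics of the context-based search model, future applications beyond Simplified-Traditional synonym alignment include music search, image search, and many others.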
This study proposes a context-based statistical search model that uses the EM algorithm to resolve ambiguity by considering the contexts around the object being sought. We treat this search problem as a machine translation problem, because the two share the same goal: given a source query, find the most probable answer among the target candidates. In our model, evidence from individual context windows is accumulated to strengthen the translation probability that a search result is the translation of the source query. We apply the model to term alignment between Simplified and Traditional Chinese synonyms (such as “激光” vs. “雷射” for “laser”). With the EM algorithm, the accuracy of our context-based search model improves to as much as 48%, compared with previous work on the same task. In the future, the context-based search model could be used not only to align synonyms in a monolingual corpus with regional variation, such as that of China and Taiwan, which use the same language but different vocabularies, but also to search for target objects such as images, mp3 files, and video files whose contexts are similar to those of the source query.
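The idea of accumulating soft context-window matches can be illustrated with a minimal sketch. It is not the formulation used in this study: it assumes an IBM Model 1 style EM over paired context windows to learn a word-level table, then scores each candidate by summing window similarities and normalizing. The function names (`train_context_table`, `window_similarity`, `rank_candidates`), the way window pairs are obtained, and the toy data are all hypothetical.

```python
from collections import defaultdict


def train_context_table(window_pairs, iterations=10):
    """Model-1-style EM (an assumption, not the study's exact model).

    window_pairs: list of (source_window, target_window), each a non-empty
    list of context words. Returns table[s][w], an estimate of
    P(target context word w | source context word s).
    """
    src_vocab = {s for src, _ in window_pairs for s in src}
    tgt_vocab = {w for _, tgt in window_pairs for w in tgt}
    # Start from a uniform table: no string identity is required.
    table = {s: {w: 1.0 / len(tgt_vocab) for w in tgt_vocab} for s in src_vocab}
    for _ in range(iterations):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        # E-step: split each target word's mass over the source words
        # in the paired window, in proportion to the current table.
        for src, tgt in window_pairs:
            for w in tgt:
                norm = sum(table[s][w] for s in src)
                for s in src:
                    frac = table[s][w] / norm
                    count[s][w] += frac
                    total[s] += frac
        # M-step: renormalize the expected counts per source word.
        table = {s: {w: count[s][w] / total[s] for w in tgt_vocab}
                 for s in src_vocab}
    return table


def window_similarity(src, tgt, table):
    """Soft ("synonymous") match between two context windows."""
    per_word = [sum(table.get(s, {}).get(w, 0.0) for s in src) / len(src)
                for w in tgt]
    return sum(per_word) / len(per_word)


def rank_candidates(source_windows, candidates, table):
    """Accumulate window similarities per candidate and normalize."""
    scores = {name: sum(window_similarity(sw, tw, table)
                        for sw in source_windows for tw in windows)
              for name, windows in candidates.items()}
    z = sum(scores.values())
    return {name: s / z for name, s in scores.items()} if z else scores


# Toy data, invented for illustration only.
pairs = [(["打印机", "技术"], ["印表機", "技術"]),
         (["打印机", "软件"], ["印表機", "軟體"]),
         (["技术", "软件"], ["技術", "軟體"])]
table = train_context_table(pairs)
probs = rank_candidates([["打印机", "技术"]],           # windows around "激光"
                        {"雷射": [["印表機", "技術"]],
                         "鍵盤": [["軟體"]]},
                        table)
best = max(probs, key=probs.get)
```

In this toy setup no Simplified context word is string-identical to a Traditional one, so an exact-match comparison would score every candidate zero; the learned table is what lets the windows around “激光” and “雷射” be recognized as similar.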