A Knowledge Component Extraction Technology Based on the Figures and Tables

Advisor: 侯建良


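As knowledge documents diversify and knowledge accumulates rapidly across domains, a single knowledge document may cover several topical passages and concepts from different domains, which makes it difficult to extract knowledge on a specific topic. Traditional knowledge extraction schemes, however, can only return knowledge units based on whole documents, so knowledge receivers must read a great deal of irrelevant topical information before they obtain the knowledge they need. If documents are instead divided, following the concept of "knowledge componentization," into knowledge units based on topical passages, knowledge receivers can search for and extract specific domain knowledge quickly and accurately. Because figures and tables usually carry the essence of a knowledge document, and the key content of each topic tends to appear in the paragraphs surrounding them, this research proposes a methodology for automatically extracting figure and table topic knowledge from free-form knowledge documents.

The methodology first extracts figure and table keywords on the basis of a domain vocabulary. It then segments the content of the target document into sentences, which serve as the basis for extracting the paragraphs that describe each figure or table. These descriptive paragraphs are extracted with two methods, "keyword matching" and "start/end sentence matching." Keyword matching computes how frequently the figure and table keywords occur in each sentence of the document and extracts the descriptive paragraphs on the basis of those frequencies. Start/end sentence matching first derives, from a collection of knowledge documents, the semantic structure characteristic of the opening and closing sentences of such paragraphs, and then matches these characteristics against the document content to extract the passages whose opening and closing sentences fit them. The extracted descriptive paragraphs, combined with the figure or table itself, constitute the topic knowledge corresponding to that figure or table.

Based on this methodology, the research builds a figure and table topic knowledge extraction system and validates it with the Taiwan Logistics Yearbook (台灣物流年鑑) as a case study, in order to confirm the accuracy and feasibility of the methodology. The validation results show that the system's inference capability can be strengthened by importing training data, which brings its inference performance to a good level. Overall, the knowledge unit extraction technology proposed in this research improves the efficiency with which knowledge receivers search for and extract specific topic knowledge from knowledge documents, so that the knowledge contained in those documents can be more readily found and applied.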
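The sentence segmentation and keyword matching steps described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the system built in this research: the function names `split_sentences`, `keyword_hits`, and `extract_description` are hypothetical, the keyword list stands in for the output of the domain vocabulary step, and the selection rule (keep the contiguous run of sentences with the highest total keyword count) is an assumed way of turning frequencies into a paragraph.

```python
import re


def split_sentences(text):
    """Split text into sentences on common English and CJK terminators."""
    # ASCII terminators need trailing whitespace; CJK terminators do not.
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。!?])", text)
    return [p.strip() for p in parts if p.strip()]


def keyword_hits(sentence, keywords):
    """Count case-insensitive occurrences of all keywords in one sentence."""
    lowered = sentence.lower()
    return sum(lowered.count(k.lower()) for k in keywords if k)


def extract_description(sentences, keywords, min_hits=1):
    """Return the contiguous run of sentences with the most keyword hits.

    Assumption: a sentence belongs to a candidate run when it contains at
    least `min_hits` keyword occurrences; the run with the largest total
    is taken as the figure or table description.
    """
    hits = [keyword_hits(s, keywords) for s in sentences]
    best, best_score = None, 0
    start = None
    for i in range(len(hits) + 1):
        inside = i < len(hits) and hits[i] >= min_hits
        if inside and start is None:
            start = i                      # a new run begins
        elif not inside and start is not None:
            score = sum(hits[start:i])     # the run [start, i) just ended
            if score > best_score:
                best, best_score = (start, i), score
            start = None
    return [] if best is None else sentences[best[0]:best[1]]
```

Given a short invented passage and the keywords `["warehouse cost", "trend"]`, the function returns only the sentences that mention those keywords, which would then be paired with the figure itself to form one knowledge component.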


With the growing complexity of document contents and the rapid growth of domain knowledge, it is difficult for knowledge receivers to locate and understand knowledge on a specific domain within a document. Traditional knowledge extraction schemes usually return complete documents, so knowledge receivers need considerable time to acquire the domain knowledge they seek. Component-based knowledge divides documents into knowledge components, each corresponding to a more specific domain, and thereby reduces the time knowledge receivers need to search for specific domain knowledge. Moreover, since the figures and tables in a document usually carry the important implicit knowledge expressed within it, the aim of this research is to extract knowledge components from documents (e.g., industry yearbooks) on the basis of their figures and tables. To this end, a Knowledge Component Extraction (KCE) model with two algorithms, the Keyword Mapping Algorithm (KMA) and the Sentence Mapping Algorithm (SMA), is developed. To demonstrate the applicability of the proposed methodology, a web-based knowledge component extraction system is established on the basis of the proposed model, and the Taiwan Logistics Yearbooks are used as examples to evaluate it. The verification results show that the developed system performs well at knowledge component extraction. As a whole, this research provides an approach for knowledge receivers to acquire domain knowledge efficiently and accurately.
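The idea behind the Sentence Mapping Algorithm, which matches the typical opening and closing sentences of a figure or table description against the document, can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration: the regular expressions in `START_PATTERNS` and `END_PATTERNS` are hypothetical stand-ins for the sentence characteristics that the research derives from knowledge documents, and `extract_by_boundaries` is not the implementation used in the system.

```python
import re

# Hypothetical opening-sentence cues: a sentence that introduces a figure or table.
START_PATTERNS = [
    re.compile(r"\b(figure|table)\s+\d+\s+(shows|presents|lists|illustrates)\b", re.I),
    re.compile(r"\bas\s+(shown|listed)\s+in\s+(figure|table)\s+\d+\b", re.I),
]

# Hypothetical closing-sentence cues: a sentence that wraps up the discussion.
END_PATTERNS = [
    re.compile(r"^(in summary|overall|therefore|thus)\b", re.I),
]


def matches_any(sentence, patterns):
    """Return True when the sentence matches at least one pattern."""
    return any(p.search(sentence) for p in patterns)


def extract_by_boundaries(sentences):
    """Return the sentences from the first opening cue through the next closing cue."""
    for i, sentence in enumerate(sentences):
        if matches_any(sentence, START_PATTERNS):
            for j in range(i + 1, len(sentences)):
                if matches_any(sentences[j], END_PATTERNS):
                    return sentences[i:j + 1]
            # Assumption: with no closing cue, keep everything after the opening.
            return sentences[i:]
    return []
```

In a trainable system the two pattern lists would be learned or extended from labeled examples rather than written by hand, which is one plausible reading of how imported training data could improve inference.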


