This paper presents the design of a document information extraction system built around sentences in normal form. The system performs extraction by combining normalized definitions and the characteristics of domain attributes with the relational data model. Its aim is to overcome the inherent weaknesses of semi-structured documents in querying and management, and its distinguishing feature is that it can carry out different extractions on demand across heterogeneous domains. Drawing on the characteristics of semi-structured documents and of Chinese grammar, we propose a methodology for converting the content of semi-structured documents into database tables and validate it experimentally. We also compare how query paths change after normalization and assess the resulting improvement in query capability. The research approaches document processing from the perspective of information applications: it analyzes how well traditional table structures can cover the information in semi-structured documents, and on that basis develops a new standard for writing electronic documents that makes them easier to query and reduces the difficulty of value-added processing.