A Relational Concept Information Extraction System for Documents

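This thesis designs a document information extraction system centered on normal-form sentences. The design uses normal-form definitions and the characteristics of domain attributes, together with the relational data model, to carry out information extraction. The aim is to remedy the inherent weaknesses of semi-structured documents in querying and management; a distinguishing feature of the model is that it can perform different extractions on demand across heterogeneous domains. Based on the characteristics of semi-structured documents and of Chinese grammar, the thesis presents a methodology for converting the content of semi-structured documents into database tables, supports it with experiments, and compares how query paths change and how query capability improves once normalization is complete. The research approaches document processing from the perspective of information applications: it analyzes how well traditional table forms can cover the information in semi-structured documents, and from there develops a new standard for writing electronic documents that makes documents easier to query and lowers the difficulty of value-added processing.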
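The general idea of mapping normal-form sentences to relational rows can be sketched as follows. This is only an illustration under stated assumptions, not the system described in the thesis: the sentence pattern, the `request` table, its column names, and the sample sentences are all hypothetical, and a regular expression stands in for whatever normal-form definitions the thesis actually uses.

```python
import re
import sqlite3

# Hypothetical normal-form definition for a "purchase request" domain.
# Each named group plays the role of one domain attribute (one table column).
PATTERN = re.compile(
    r"^(?P<applicant>\w+) requests (?P<quantity>\d+) (?P<item>\w+) "
    r"for the (?P<department>\w+) department\.$"
)


def extract(sentence):
    """Return the domain attributes of a normal-form sentence, or None."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None  # free text: not covered by this normal form
    row = match.groupdict()
    row["quantity"] = int(row["quantity"])
    return row


def load(sentences):
    """Store every sentence that matches the normal form in an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE request "
        "(applicant TEXT, quantity INTEGER, item TEXT, department TEXT)"
    )
    rows = [r for r in map(extract, sentences) if r is not None]
    conn.executemany(
        "INSERT INTO request VALUES (:applicant, :quantity, :item, :department)",
        rows,
    )
    return conn


# Made-up sample document: two normal-form sentences and one free-text line.
document = [
    "Lin requests 3 monitors for the sales department.",
    "This line is free text and does not follow the pattern.",
    "Chen requests 10 keyboards for the research department.",
]

conn = load(document)
# Once the content is in a table, a query becomes an ordinary SQL statement.
total = conn.execute("SELECT SUM(quantity) FROM request").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM request").fetchone()[0]
```

In this sketch, switching to a different domain would mean supplying a different pattern and table definition, which loosely mirrors the abstract's point about extracting on demand across heterogeneous domains.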

Keywords

Information extraction; document normalization; document management; natural language; query path; semi-structured documents

Parallel Abstract

The objective of this thesis is to design an information extraction system that focuses on sentences in normal form. The system performs information extraction by using normal-form definitions, the properties of domain attributes, and the relational data model. Its purpose is to improve the querying of semi-structured documents and to reduce the weaknesses in their management. A distinguishing feature of the model is that it can extract information from documents in different domains according to the user's needs. The extraction system is based on the characteristics of semi-structured documents and of Chinese grammar. We present a methodology for converting semi-structured documents into database tables and support the methodology with experiments. In addition, we compare how query paths change after normalization and assess the resulting improvement in query capability. This research considers document processing from the viewpoint of information applications. It analyzes how well traditional tables can cover the information in semi-structured documents and, building on that, develops a standard for document writing that makes documents easier to query and reduces the difficulty of value-added processing.

Parallel Keywords

information extraction; document normalization; document management; natural language; query path; semi-structured document

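Cited By

林義楠 (2000). Intranet上以RDB為核心的文件編輯器之研究 [A study of an RDB-centered document editor on an intranet] [Master's thesis, 元智大學 (Yuan Ze University)]. 華藝線上圖書館 (Airiti Library). https://www.airitilibrary.com/Article/Detail?DocID=U0009-0112200611295805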
