A Study on Document Structuring and Automatic Storage Back into a Database

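Today, most companies exchange business information in some standard format, and XML (eXtensible Markup Language) documents are the most commonly used carrier for data transfer. Because each company designs its database differently, however, 100 companies may use 100 different sets of database field names, and data conversion often runs into difficulty because the field names in two companies' databases do not match exactly. Although the upstream and downstream companies in a supply chain can still agree on unified database field names or build reliable conversion mechanisms to exchange information with one another, some unavoidable problems remain. We aim to let a company import the information in related external electronic documents into its database using only the information already in that database and the help of a term dictionary. This study proposes a method that, with the help of certain terms, automatically extracts data from various kinds of electronic documents in three stages, converts it into structured information, and stores it back into the original database. The three steps of the system are: 1. Find the tables in the original database that are related to the document. 2. Divide the related tables from step 1 into destination tables and reference tables; the former correspond to the diamonds in the E-R model, the latter to the rectangles. 3. Extract the table-related information from the document and store it back into the database. This thesis has six parts: introduction, analysis of related techniques, research design model, implementation steps and methods, experimental results and discussion, and conclusion.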

Keywords

Document structuring; information extraction

Parallel Abstract

At present, most companies exchange commercial information in a standard format, and XML documents are the most commonly used carrier for transferring it. Because each company has its own database design, or schema, exchanging information between two different companies can be difficult. Although companies in a supply chain may align their databases to one common schema or build a common exchange mechanism, some unavoidable problems remain. We expect that a company can use the information contained in its original database, with the assistance of a term dictionary, to import external electronic documents into that database. In this paper, we propose an automatic conversion method that uses the assistance of specific terms and applies three steps to retrieve the information contained in a document. The retrieved information is then converted into structured information and stored back into the original database. The three steps of the system are as follows: 1. Find the tables in the database that are related to the document. 2. Separate the tables from step 1 into destination tables and reference tables; the former are represented by diamonds in the E-R model, the latter by rectangles. 3. Retrieve the information related to those tables from the document and save it into the original database. This paper consists of six sections: introduction, analysis of related techniques, model design, implementation method, experimental results and discussion, and conclusion.
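The three steps can be illustrated with a minimal sketch. It is not the system described in this study, only one possible reading of the steps under stated assumptions: the schema, foreign-key map, term dictionary, and sample document below are all hypothetical, a table is treated as "related" only when every one of its columns appears in the document, and a table is treated as a destination (relationship) table when it holds foreign keys to other tables.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical schema (table -> columns); "orders" links customer and product.
SCHEMA = {
    "customer": ["customer_id", "customer_name"],
    "product": ["product_id", "product_name"],
    "orders": ["customer_id", "product_id", "quantity"],
    "warehouse": ["warehouse_id", "location"],
}
# Hypothetical foreign keys: table -> tables it references.
FOREIGN_KEYS = {"orders": ["customer", "product"]}
# Hypothetical term dictionary: external tag name -> local column name.
TERM_DICT = {
    "ClientNo": "customer_id",
    "ClientName": "customer_name",
    "ItemNo": "product_id",
    "ItemName": "product_name",
    "Qty": "quantity",
}

def extract_fields(xml_text):
    """Read leaf elements and rename their tags through the term dictionary."""
    fields = {}
    for elem in ET.fromstring(xml_text).iter():
        if len(elem) == 0 and elem.text and elem.text.strip():
            fields[TERM_DICT.get(elem.tag, elem.tag)] = elem.text.strip()
    return fields

def find_related_tables(fields):
    """Step 1: keep tables whose columns all appear in the document."""
    return [t for t, cols in SCHEMA.items() if all(c in fields for c in cols)]

def classify(tables):
    """Step 2: tables with foreign keys are destination tables, the rest reference tables."""
    destination = [t for t in tables if t in FOREIGN_KEYS]
    reference = [t for t in tables if t not in FOREIGN_KEYS]
    return destination, reference

def store(conn, fields, destination, reference):
    """Step 3: insert reference rows first, then destination rows."""
    for table in reference + destination:
        cols = SCHEMA[table]
        placeholders = ", ".join("?" for _ in cols)
        conn.execute(
            f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})",
            [fields[c] for c in cols],
        )
    conn.commit()

def make_database():
    """Create an in-memory database with the hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    for table, cols in SCHEMA.items():
        conn.execute(f"CREATE TABLE {table} ({', '.join(cols)})")
    return conn

DOCUMENT = """<Order>
  <ClientNo>C01</ClientNo><ClientName>Acme</ClientName>
  <ItemNo>P07</ItemNo><ItemName>Bolt</ItemName>
  <Qty>12</Qty>
</Order>"""

def run(xml_text):
    """Run the three steps on one document and return the populated database."""
    conn = make_database()
    fields = extract_fields(xml_text)
    destination, reference = classify(find_related_tables(fields))
    store(conn, fields, destination, reference)
    return conn, destination, reference
```

Calling `run(DOCUMENT)` maps the external tag names to local column names, selects the tables the document can fill, and writes one row into each. A real system would need a looser notion of "related" than exact column coverage, for example matching on partial overlap or on dictionary synonyms with scores.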