
Applying Text Classification Techniques in Multidimensional Document Warehouse System (應用文件分類技術於多維度文件倉儲系統)

Advisor: 曹承礎

Abstract


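Advances in information technology have left us in an environment of information overload. Traditional keyword search no longer meets our needs, and we have begun to look for tools that support queries along multiple dimensions. Data warehouse systems can store and analyze numerical data but cannot handle document-type files. This study therefore explores building a new document warehouse system that addresses both problems.

In this study, we present a method for automatically extracting metadata and describe how to build a complete document warehouse system. We predefine fifteen kinds of metadata as fifteen classes and use the support vector machine (SVM) classification algorithm. Using the classification rules obtained by training the SVM, we determine the class of each sentence in an incoming document and then convert the document into a tagged XML format. Next, we apply a multidimensional star schema to build the document warehouse, using the metadata to support the process of loading documents into it. Finally, online analytical processing (OLAP), combined with programs we wrote ourselves, provides the tools needed to analyze the multidimensional document warehouse.

The experimental results show that the SVM algorithm yields highly accurate classification rules. Applying these rules helps us analyze document content and identify the metadata in a document completely. The prototype system we built also demonstrates the workflow and components of a document warehouse system, providing a foundation and reference for system construction. The back-end OLAP and multidimensional query tools let us search for and analyze documents from multiple angles and uncover information hidden in them.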
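
The sentence-tagging step described above can be pictured with a minimal sketch. It only illustrates wrapping each sentence in an XML element named after its predicted class. The class names (`affiliation`, `abstract`, `other`), the element layout, and the function names are assumptions made for illustration; the abstract does not list the fifteen metadata classes or the XML schema. The trained SVM classifiers are replaced here by a trivial keyword rule so that the sketch runs with the standard library alone.

```python
import xml.etree.ElementTree as ET


def classify_sentence(sentence: str) -> str:
    """Stand-in for the trained SVM classifiers (hypothetical keyword rule).

    The labels returned here are illustrative, not the thesis's 15 classes.
    """
    lowered = sentence.lower()
    if "university" in lowered:
        return "affiliation"
    if "abstract" in lowered:
        return "abstract"
    return "other"


def document_to_xml(doc_id: str, sentences: list[str]) -> ET.Element:
    """Wrap each sentence in an element named after its predicted class."""
    root = ET.Element("document", attrib={"id": doc_id})
    for sentence in sentences:
        child = ET.SubElement(root, classify_sentence(sentence))
        child.text = sentence
    return root


# Made-up input sentences, used only to exercise the sketch.
sentences = [
    "Example University",
    "Abstract: we describe a document warehouse.",
    "Thanks to everyone.",
]
root = document_to_xml("doc-001", sentences)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

In a real pipeline, `classify_sentence` would call the trained classifiers; the tagging logic around it would stay the same.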

Parallel Abstract


The development and growth of information technologies have caused a situation called “information overload”. We have therefore begun to look for new tools that allow us to query from multidimensional perspectives rather than rely on traditional keyword-based search engines. Data warehouse systems can store and analyze numerical data but lack the ability to deal with document collections. To solve the problems above, we build a whole new system. In this paper, we describe an automatic metadata extraction algorithm and build a document warehouse system. We define 15 kinds of metadata as 15 classes. Using support vector machines (SVMs), we create 15 classifiers to extract metadata from a new document. Sentences in the document are saved with their corresponding metadata in XML format. Next, we use a star schema to build a multidimensional document warehouse system. Metadata is used to support the process of loading documents into the document warehouse. We also provide client-side tools such as OLAP, a cube browser, and an MDX query interface. Our experiments show that support vector machines can achieve high classification performance. We can extract most metadata from a document with the SVM classifiers. The prototype system built in this paper also shows the fundamental components and processes of a document warehouse system. The OLAP tools and multidimensional query tools provide ways to search and analyze documents from multiple user perspectives.
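
The star-schema idea mentioned above can be sketched in a few lines: one fact table of documents surrounded by dimension tables built from extracted metadata, and a roll-up query of the kind an OLAP tool would issue. The table names, columns, dimensions (author, topic, year), and sample rows below are all assumptions made for illustration. The abstract does not specify the schema, and SQLite stands in for whatever database and OLAP server the prototype actually used.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical star schema: one fact table referencing three dimension tables.
conn.executescript("""
CREATE TABLE dim_author (author_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_topic  (topic_id  INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE dim_time   (time_id   INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_document (
    doc_id    INTEGER PRIMARY KEY,
    author_id INTEGER REFERENCES dim_author(author_id),
    topic_id  INTEGER REFERENCES dim_topic(topic_id),
    time_id   INTEGER REFERENCES dim_time(time_id),
    path      TEXT
);
""")

# Made-up sample rows standing in for metadata extracted from documents.
conn.executemany("INSERT INTO dim_author VALUES (?, ?)",
                 [(1, "Author A"), (2, "Author B")])
conn.executemany("INSERT INTO dim_topic VALUES (?, ?)",
                 [(1, "text classification"), (2, "data warehousing")])
conn.executemany("INSERT INTO dim_time VALUES (?, ?)",
                 [(1, 2003), (2, 2004)])
conn.executemany("INSERT INTO fact_document VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, "docs/a.xml"),
    (2, 1, 2, 2, "docs/b.xml"),
    (3, 2, 2, 2, "docs/c.xml"),
])

# Roll-up: count documents per topic and year (two dimensions at once).
rollup = conn.execute("""
    SELECT t.label, y.year, COUNT(*) AS n_docs
    FROM fact_document AS f
    JOIN dim_topic AS t ON t.topic_id = f.topic_id
    JOIN dim_time  AS y ON y.time_id  = f.time_id
    GROUP BY t.label, y.year
    ORDER BY t.label, y.year
""").fetchall()

for label, year, n_docs in rollup:
    print(label, year, n_docs)
```

Grouping by different combinations of dimension columns gives the "multiple perspectives" view of the same document collection; a cube browser or MDX interface exposes the same operation without hand-written SQL.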


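Cited By


盧惠文 (2015). 太陽光電躉購費率對減碳效益之評估研究-結合資料倉儲及系統動態學之運用 [Evaluating the carbon-reduction benefits of solar photovoltaic feed-in tariffs: combining data warehousing and system dynamics] [Master's thesis, National Tsing Hua University]. Airiti Library. https://www.airitilibrary.com/Article/Detail?DocID=U0016-0312201510263716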
