Constructing Knowledge Networks from Domain-Related Web Data

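With the spread of the Internet and the rapid growth of web content, a vast amount of information is scattered across the Web, and users mostly rely on search engines to retrieve large numbers of relevant pages. This information is hard to absorb quickly, however, and integrating information from multiple sources takes even more time. This thesis therefore designs and implements a knowledge network construction system that extracts the large volume of disordered information on the Web and analyzes it into a knowledge network, so that users can efficiently obtain the information they want by browsing the network. The system is built on the I3S (Intelligent Internet Information System) platform and reaches this goal in three phases. First, I3DDC (Domain Data Collector) collects domain-related web pages quickly and effectively. I3DME (Domain Metadata Extractor) then extracts the important domain-related metadata. Next, we develop O2RES (Object-Object Relation Extraction System) to recognize named entities and find the relations among them; with this information, a Domain Knowledge Network (DKN) can be built quickly and effectively. The named entity recognition module is trained with CRFs (Conditional Random Fields) and targets three entity types: persons, organizations, and locations. Its overall recognition performance reaches 90.13%. Then, combining the text analysis and indexing techniques of search engines with association mining, the system analyzes the text of a large number of domain-related pages, identifies the relations among entities, and automatically builds the domain knowledge network. Following the SVG (Scalable Vector Graphics) standard defined by the W3C, the system also implements a Web-based knowledge network browser. Three case studies in different domains (news, BRIDGE, and books) indicate that the generalized DKN representation defined by the system, combined with the system itself, can be applied across domains, and that browsing a DKN lets users quickly absorb a large amount of domain-related information.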

Keywords

Knowledge Network; Data Mining; Information Extraction; Named Entity Recognition

Parallel Abstract

With the popularity of the Internet and the exponential growth of web pages and content, a vast amount of information is available on the Web. Users usually submit keywords to search engines to retrieve numerous relevant pages, but the relevant information is still too much to absorb, and users who want to organize various types of information must spend a long time seeking what is useful. We therefore propose the Object-Object Relation Extraction System (O2RES) to mine relations between concepts and build Domain Knowledge Networks (DKNs), which effectively represent the disordered information on the Web. By visualizing a DKN as a Knowledge Map, users can efficiently extend their understanding of a specific domain and obtain the desired information by browsing the map. O2RES is built on our web-based Intelligent Internet Information System (I3S) platform and is seamlessly integrated with the three-phase system architecture of I3S. In the first phase, the Domain Data Collector (DDC) efficiently and effectively gathers domain-related data. In the second phase, the Domain Metadata Extractor (DME) extracts metadata from those domain-related websites and pages. Then, O2RES is applied to recognize named entities and extract object-object relations in order to build DKNs efficiently and effectively. The Named Entity Recognition (NER) module uses CRFs (Conditional Random Fields) and is tuned to recognize the names of persons, organizations, and locations. The average performance (F-measure) of NER is 90.13% on these three kinds of named entities. Based on mining association rules, the system automatically explores relations between names to build the DKN. Using the W3C SVG (Scalable Vector Graphics) standard, we also present DKNs as Knowledge Maps to provide browsing services. Applying O2RES to several case studies (news, telecommunication, and books), we show that the DKN-based browsing mechanism helps users explore domain-related information.
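
The abstract does not spell out how relations between names are mined, so the following is only a minimal sketch of one common reading of association mining: two entities are linked when they co-occur on enough pages (support) and one reliably appears with the other (confidence). The function name `build_relation_edges`, the thresholds, and the input format (one list of already-recognized entity names per page) are assumptions made for illustration, not the thesis's actual implementation.

```python
from collections import Counter
from itertools import combinations


def build_relation_edges(docs, min_support=0.5, min_confidence=0.6):
    """Return directed edges (src, dst, support, confidence) between entities.

    docs: list of entity-name collections, one per page. This input is
    hypothetical and assumed to come from an upstream NER step.
    """
    n_docs = len(docs)
    entity_count = Counter()  # number of pages mentioning each entity
    pair_count = Counter()    # number of pages mentioning both entities

    for entities in docs:
        unique = sorted(set(entities))  # count each entity once per page
        entity_count.update(unique)
        pair_count.update(combinations(unique, 2))

    edges = []
    for (a, b), together in pair_count.items():
        support = together / n_docs
        if support < min_support:
            continue
        # Confidence is directional: P(dst appears | src appears).
        for src, dst in ((a, b), (b, a)):
            confidence = together / entity_count[src]
            if confidence >= min_confidence:
                edges.append((src, dst, support, confidence))
    return sorted(edges)


if __name__ == "__main__":
    # Toy pages with made-up entity names.
    pages = [
        ["Alice", "Acme"],
        ["Alice", "Acme", "Taipei"],
        ["Alice", "Taipei"],
        ["Bob"],
    ]
    for edge in build_relation_edges(pages):
        print(edge)
```

The resulting edge list can serve as the node-link data behind a network view; raising `min_confidence` keeps only the stronger, more one-sided associations.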

Parallel Keywords

Knowledge Network; Data Mining; Information Extraction; Named Entity Recognition
