A Semi-Automatic Domain Metadata Extraction System: A Case Study on Electronic News

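In today's society, information travels ever faster and Internet usage keeps rising, so the Internet has become a platform for obtaining information quickly and in real time. For the same reason, however, a flood of information has accumulated on the Web, and most users cannot quickly find what they want to know. Although many search engines let users obtain information by entering domain keywords, the sheer volume of search results leaves users even more at a loss, and the information they actually want remains hard to find. If the search service offered by search engines could instead be scoped to an information and knowledge value-added service built on specific domain knowledge, users could obtain more precise information more quickly. Motivated by this, this thesis takes "electronic news" as its application domain and designs and implements metadata extraction techniques and an integration mechanism for electronic news. First, using a search-analysis approach, the system crawls all pages of the important websites in the domain and, based on the tags commonly used in HTML page design, converts each page's source code into a symbolic sequence. Taking each website as a unit, it analyzes the similarity among the site's pages and clusters them, thereby inferring the likely design of the site's content management system and the kinds of page templates it uses. Next, taking the pages of each site that share a template as a unit, it analyzes the main content blocks (the HTML DOM structures that contain the news article). Blocks are selected at random and labeled through a quick manual procedure; once the metadata of only a few blocks has been labeled by hand, the system has accumulated enough knowledge to extract and integrate metadata from the websites of that domain. Experiments on two news websites show that a user needs only a few minutes of system operation to extract news metadata accurately. Finally, the thesis applies the extracted metadata to the development of a multimedia news system suited to reading on mobile phones.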

Keywords

Domain Knowledge; Metadata Extraction; Sequence Alignment; Clustering

Parallel Abstract

The rapid growth of information on the Internet has made the Web the most important information platform in the world. However, the sheer volume of data leaves users suffering from information overload, so finding the desired information becomes hard work. Restricting information sources to a specific domain is a feasible way to provide better information-finding and knowledge value-added services. Based on these motivations, we propose a knowledge engineering process for building domain knowledge efficiently and effectively in order to provide these services. To show that the process is workable, the Semi-Automatic Domain Metadata Extraction System is designed and implemented as a case study on the "news" domain. In the first phase, the system automatically collects and parses a large number of pages from domain-related websites, using the keyword "news (新聞)" submitted by the user. By translating each HTML page into a symbolic sequence, the system analyzes intra-site and inter-site similarities among pages based on the edit distance method. Then, similar sequences (pages within the same site) are grouped into a cluster that represents a page template of the site. By analyzing similarities among sequence segments from the same template, representative blocks are extracted to denote the content block of that kind of template. Finally, the user interacts with the system to label metadata for the template of the site. Experiments show that the proposed methods can classify templates, identify content blocks, and extract metadata.
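
The pipeline summarized above (HTML page to symbolic sequence, edit distance between sequences, grouping of similar pages into templates) can be illustrated with a minimal Python sketch. It is not the system's actual implementation. Several choices are assumptions made only for illustration: every tag is kept as a symbol (the abstract says only that commonly used tags are considered), similarity is the edit distance normalized by the longer sequence, and the greedy clustering with a threshold of 0.8 stands in for whatever clustering procedure the thesis uses. The names `TagSequencer`, `to_sequence`, `similarity`, and `cluster_pages` are hypothetical.

```python
from html.parser import HTMLParser


class TagSequencer(HTMLParser):
    """Collects opening and closing tag names as a flat symbol sequence."""

    def __init__(self):
        super().__init__()
        self.symbols = []

    def handle_starttag(self, tag, attrs):
        self.symbols.append(tag)

    def handle_endtag(self, tag):
        self.symbols.append("/" + tag)


def to_sequence(html):
    """Convert HTML source into a list of tag symbols (text and attributes are ignored)."""
    parser = TagSequencer()
    parser.feed(html)
    parser.close()
    return parser.symbols


def edit_distance(a, b):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Assumed measure: 1 minus the edit distance normalized by the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cluster_pages(sequences, threshold=0.8):
    """Greedy grouping (an assumption): a page joins the first cluster whose
    representative (its first member) is at least `threshold` similar."""
    clusters = []  # each cluster is a list of indices into `sequences`
    for idx, seq in enumerate(sequences):
        for members in clusters:
            if similarity(sequences[members[0]], seq) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters
```

Calling `cluster_pages([to_sequence(p) for p in pages])` on the pages of one site returns groups of page indices, and each group can be read as a candidate page template. Comparing only against a cluster's first member keeps the sketch short, but it makes the result depend on page order; a production system would likely need a more robust clustering step.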

Parallel Keywords

Domain Knowledge; Metadata Extraction; Sequence Alignment; Clustering
