  • Thesis


Web Taxonomy Construction using a Cross-lingual Hierarchical Thesaurus

Advisor: 楊正仁


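In past observations, we found that some Web directories show extreme inequality in the amount of Web page information across languages. For example, in the ODP directory, the number of Web pages in some languages, such as Chinese and Korean, is very sparse compared with the number of English pages. However, directories with abundant Web pages already exist in these languages, although their taxonomy structures differ somewhat from that of ODP. We therefore aim to merge pages from these non-English directories, which have different taxonomy structures, into the non-English ODP directories to enrich their content. Because the non-English directories contain too few pages, we use the rich hierarchical thesaurus information in the English directory to assist their construction. To this end, this thesis constructs Web directory content with a hierarchical integration approach, implemented with support vector machines (SVM), which perform well in document classification. We also apply the hierarchical thesaurus information of the source and destination directories to the construction, and then add cross-lingual hierarchical thesaurus information to further improve SVM performance in directory construction. Experiments on real Web directories show that the proposed cross-lingual hierarchical thesaurus approach effectively improves the performance of Web directory construction.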


We observe that the number of Web pages available in Web directories is highly unequal across languages. For example, the ODP directory contains a large number of English Web pages but only a relatively small number of Chinese and Korean Web pages. However, some other Web taxonomies contain many more Chinese and Korean Web pages than ODP does. We therefore use these abundant Web resources to enrich the content of the non-English ODP taxonomies. Because the non-English ODP directories contain few Web pages, we use the English ODP directory as an external hierarchical thesaurus to help construct them. This external cross-lingual hierarchical thesaurus is employed in a hierarchical catalog integration scheme to construct non-English Web taxonomies. Our experiments show that the cross-lingual hierarchical thesaurus improves construction performance.


