With the rapid growth of the Internet, electronic literature spreads quickly, and both publishing and obtaining it have become very convenient. As a result, the volume of literature has grown enormously, but most of it is scattered across the Internet, which makes finding related literature time-consuming and laborious. A system that organizes interrelated literature on the network would let users look up relevant references with ease, which would be a great benefit to them.

This work focuses on the Header and Reference sections of documents, because these two parts supply much of the basic information about a document, such as title, author, publisher, and publication date. This information is well suited to organizing literature: it lets us view, classify, cluster, and search documents along many different dimensions.

To organize literature, the first task is to extract its Metadata. This research converts unstructured document text into structured data and assigns meaning to it. The work is divided into three stages. In the first stage, we analyze the features of the tokens and cluster the tokens according to those features. In the second stage, Machine Learning algorithms segment the clustered tokens appropriately and assign each segment the Metadata that fits its meaning. Finally, the resulting structured data is stored in a database for future use.
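As an illustration only, the three stages can be sketched in a few lines of Python. The abstract does not name the token features, the clustering method, or the Machine Learning algorithm, so everything below is an assumption: `token_shape` uses a made-up set of surface features, `cluster_tokens` simply groups tokens by that shape, `label_segments` is a hand-written positional rule standing in for the trained model, and the reference string, the `metadata` table, and its columns are hypothetical.

```python
import re
import sqlite3
from collections import defaultdict


def token_shape(token):
    """Stage 1a: map a token to a coarse surface feature (illustrative choice)."""
    core = token.strip(".,")
    if re.fullmatch(r"(19|20)\d{2}", core):
        return "YEAR"
    if core[:1].isupper():
        return "CAP"
    if core.islower():
        return "LOWER"
    return "OTHER"


def cluster_tokens(tokens):
    """Stage 1b: group tokens that share the same feature value."""
    clusters = defaultdict(list)
    for tok in tokens:
        clusters[token_shape(tok)].append(tok)
    return dict(clusters)


def segment(tokens):
    """Stage 2a: cut the token stream after every token that ends with a period."""
    segments, current = [], []
    for tok in tokens:
        current.append(tok)
        if tok.endswith("."):
            segments.append(current)
            current = []
    if current:
        segments.append(current)
    return segments


def label_segments(segments):
    """Stage 2b: rule-based stand-in for the learned labeler (an assumption)."""
    record = {}
    positional = ["author", "title", "publisher"]
    for seg in segments:
        text = " ".join(seg).rstrip(".")
        if len(seg) == 1 and token_shape(seg[0]) == "YEAR":
            record["date"] = text
        elif positional:
            record[positional.pop(0)] = text
    return record


def store(record, conn):
    """Stage 3: persist one metadata record in a hypothetical table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata "
        "(author TEXT, title TEXT, publisher TEXT, date TEXT)"
    )
    fields = ("author", "title", "publisher", "date")
    conn.execute(
        "INSERT INTO metadata VALUES (:author, :title, :publisher, :date)",
        {name: record.get(name) for name in fields},
    )
    conn.commit()


# Hypothetical reference string, not taken from the study's data.
reference = "Smith J, Lee A. Parsing citation strings. Example Press. 2004."
tokens = reference.split()
clusters = cluster_tokens(tokens)
record = label_segments(segment(tokens))

conn = sqlite3.connect(":memory:")
store(record, conn)
row = conn.execute(
    "SELECT author, title, publisher, date FROM metadata"
).fetchone()
print(row)
```

In a real system the positional rule would presumably be replaced by a model trained on labeled headers and references, and the feature set would be far richer; the sketch only shows how the stages hand data to one another.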