使用者利用關鍵字來查詢的中文文件探勘方式,必須對所要尋找的內容,有具體概念,才能給定適當的關鍵字。另一方面,在某一些可能有相關的文件中,如果沒有共同的關鍵字,便很難察覺其間具有相關性。在本論文中,我們利用事件的歷史資料來進行相關性事件的中文文件探勘,我們定義『事件』為足以表達某一個概念的連續中文文字,而事件的『歷史資料』則代表該事件於過去某段時間中,分別在連續單位時間內所出現次數的序列。由於Haar小波轉換具有可保留序列波形的特性,我們將事件的歷史資料序列,依照事先給定的時間區間大小,逐一分割成固定長度的小片段,然後將這些片段轉換成利用平均值(mean)和差值(difference)的Haar小波方式來表示,如此我們便可以利用小波波形的相似性來找出可能具有相關性的事件。在本論文中,我們提出了三種事件探勘方式:熱門事件探勘、因果事件探勘、特定區間事件探勘,並且由實驗中,探勘出不同的中文新聞相關性事件。
Keyword-based search for Chinese document mining requires users to have a concrete idea of what they are looking for before they can supply appropriate keywords. Moreover, when possibly related documents share no common keyword, the relation between them is difficult to detect. In this thesis, we use the historical data of events to mine correlated events from Chinese documents. We define an “event” as a sequence of consecutive Chinese characters that suffices to express one concept, and the “historical data” of an event as the sequence of its occurrence counts in consecutive units of time over some past period. “Related events” are events whose historical data follow similar trends, such as “opening up Japanese car imports” and “car sales”. Because the Haar wavelet transform preserves the shape of a sequence, we divide each event's historical data sequence into fixed-length segments according to a given time-interval size, and then represent each segment in Haar wavelet form, using means and differences. We can then use the similarity of the wavelet waveforms to find events that may be related. We propose three event mining methods: popular event mining, cause-and-effect event mining, and seasonal event mining. In our experiments, these methods uncover various correlated events in Chinese news.
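
The mean-and-difference representation described above can be illustrated with a minimal Python sketch. It is not the implementation used in this thesis. The function names (`split_segments`, `haar_transform`, `shape_distance`), the unnormalized pairwise averaging, the power-of-two segment length, and the use of Euclidean distance over the difference coefficients are all assumptions made for illustration, and the occurrence counts are made up.

```python
from math import sqrt


def split_segments(series, width):
    """Cut a count sequence into consecutive fixed-length segments.

    A trailing remainder shorter than `width` is dropped (an assumption
    of this sketch).
    """
    return [series[i:i + width] for i in range(0, len(series) - width + 1, width)]


def haar_transform(segment):
    """Unnormalized Haar decomposition by repeated pairwise mean and difference.

    Returns [overall mean, coarsest difference, ..., finest differences].
    The segment length must be a power of two.
    """
    n = len(segment)
    if n == 0 or n & (n - 1):
        raise ValueError("segment length must be a power of two")
    current = [float(x) for x in segment]
    details = []
    while len(current) > 1:
        pairs = list(zip(current[0::2], current[1::2]))
        means = [(a + b) / 2 for a, b in pairs]
        diffs = [(a - b) / 2 for a, b in pairs]
        details = diffs + details  # keep coarser levels in front
        current = means
    return current + details


def shape_distance(seg_a, seg_b):
    """Euclidean distance between the difference coefficients of two segments.

    The overall mean is ignored so that two events with different volumes
    but the same trend compare as similar (an assumption of this sketch).
    """
    ca = haar_transform(seg_a)[1:]
    cb = haar_transform(seg_b)[1:]
    return sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


# Hypothetical daily occurrence counts of two events over the same four days.
event_a = [4, 6, 10, 12]
event_b = [14, 16, 20, 22]

print(haar_transform(event_a))           # [8.0, -3.0, -1.0, -1.0]
print(shape_distance(event_a, event_b))  # 0.0, the two trends are identical
```

In this sketch, a small distance between segments covering the same time interval would mark two events as candidates for being related. Any threshold on that distance would have to be chosen separately.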