Enhancing Word Segmentation Systems with Unsupervised Learning: A Case Study of PTT Data

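With the rapid growth of the internet, many people now express their emotions and opinions online, which makes analyzing online data especially important. Such analysis often relies on word segmentation, a text-mining technique. In practice, however, there is often no clearly suitable segmentation system or lexicon to use. This study therefore proposes a two-stage segmentation approach that combines unsupervised and supervised word segmentation systems. The two stages are intended to build a lexicon dedicated to the texts under study and, at the same time, to select from among the supervised segmentation systems the one best suited to those texts, saving the time otherwise spent choosing a "more suitable" lexicon and a segmentation system. The results show that the approach does produce a lexicon of 97,000 words, mitigates the tendency of general-purpose segmentation systems to produce two-character words, and identifies longer, meaningful terms. For evaluation, recall, precision, and F-measure are computed against manual segmentation. Using the lexicon and segmentation method recommended in this study raises the F-measure, which represents overall performance, by about 11%.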

Keywords

Supervised learning; unsupervised learning; lexicon construction; text evaluation metrics; manual word segmentation

Parallel Abstract

With the rapid development of the internet, many people now use it to express their emotions and ideas. Analyzing online data has therefore become especially important, and such analysis of Chinese text commonly relies on word segmentation. However, there is often no clearly suitable word segmentation system or lexicon to use, so this study proposes a two-stage word segmentation technique that combines unsupervised and supervised word segmentation systems. The two-stage approach is intended to build a lexicon dedicated to the texts under study and, at the same time, to select from among the supervised word segmentation systems the one best suited to those texts. This saves time both in choosing a “more suitable” lexicon and in choosing a word segmentation system. Our results show that the approach does produce a lexicon of 97,000 words, mitigates the tendency of general-purpose word segmentation systems to produce two-character words, and finds longer, meaningful terms. For evaluation, recall, precision, and F-measure are computed against manual word segmentation. Using the lexicon and word segmentation method recommended by this study raises the F-measure, which represents overall performance, by about 11%.
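The evaluation against manual word segmentation can be illustrated with a minimal sketch. It assumes a word-level comparison in which a predicted word counts as correct only when its character span exactly matches a span in the manual segmentation; the abstract does not state the exact matching rule, so this convention, the function names `to_spans` and `evaluate`, and the sample sentence are all assumptions made for illustration.

```python
def to_spans(words):
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def evaluate(gold_words, pred_words):
    """Return (precision, recall, F-measure) of pred_words against gold_words.

    Assumption: both segmentations cover the same underlying text, and a
    predicted word is correct only if its span also appears in the gold set.
    """
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("segmentations must cover the same text")
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical example: manual segmentation vs. a system's output.
gold = ["非監督式", "學習", "強化", "斷詞", "系統"]
pred = ["非監督式", "學習", "強化", "斷詞系統"]
precision, recall, f_measure = evaluate(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

In this example three of the four predicted words match the five manually segmented words, so precision is 3/4 and recall is 3/5. The F-measure is the harmonic mean of the two, which is why it is used as the single figure for overall performance.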