  • Thesis

運用非監督式學習強化斷詞系統-以PTT資料為例

Application of Unsupervised Learning to Reinforce Chinese Text Segment System - The Case Study of PTT Data

Advisor: 陳景祥
Co-advisor: 李百靈 (Pai-Ling Li)

Abstract


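With the rapid growth of the internet, many people now express their emotions and ideas online, which makes analyzing online data especially important. Such analysis often relies on word segmentation, a text mining technique. However, there is often no clearly suitable segmentation system or lexicon to use, so this study proposes a two-stage segmentation method that combines unsupervised and supervised word segmentation systems. The two-stage method is intended to build a lexicon specific to the articles under study and, at the same time, to select the supervised segmentation system best suited to those articles, saving the time otherwise spent choosing a "more suitable" lexicon and a segmentation system. The results show that the method does produce a lexicon of 97,000 words, mitigates the tendency of general-purpose segmentation systems to output two-character words, and finds longer, meaningful terms. For evaluation, recall, precision, and F-measure are computed against manual segmentation; using the lexicon and segmentation method recommended by this study raises the F-measure, which represents overall performance, by about 11%.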

Parallel Abstract


With the rapid development of the internet, many people now use it to express their emotions and ideas. Analyzing internet data has therefore become very important, and word segmentation is a commonly used technique for Chinese text. However, there is often no clearly suitable word segmentation system or lexicon available, so this study proposes a two-stage word segmentation technique that combines unsupervised and supervised word segmentation systems. The two-stage technique is intended to build a dedicated lexicon for the articles under study and, at the same time, to select from the supervised word segmentation systems the one best suited to those articles. This saves time both in selecting a "more suitable" lexicon and in selecting a word segmentation system. Our results show that the technique does produce a lexicon of 97,000 words, mitigates the tendency of general-purpose word segmentation systems to produce two-character words, and finds meaningful, longer terms. For evaluation, recall, precision, and F-measure are calculated against manual word segmentation. Using the lexicon and the word segmentation system recommended by this study increases the F-measure, which represents overall performance, by about 11%.
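The evaluation described above can be illustrated with a minimal sketch. The abstract does not state how a predicted word is matched against the manual segmentation, so this example assumes a common convention: a predicted word counts as correct only when its character span (start and end offsets) also appears in the manual segmentation. The function names `to_spans` and `segmentation_scores` and the sample sentence are hypothetical and are not taken from the thesis.

```python
def to_spans(words):
    """Convert a sequence of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_scores(gold_words, pred_words):
    """Return (precision, recall, f_measure) of pred_words against gold_words.

    Assumption: a predicted word is correct only if its exact character span
    also occurs in the manual (gold) segmentation.
    """
    if "".join(gold_words) != "".join(pred_words):
        raise ValueError("Both segmentations must cover the same text.")

    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)

    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical example: the manual segmentation keeps "批踢踢" as one word,
# while the system output splits it into two pieces.
gold = ["我", "喜歡", "批踢踢", "實業坊"]
pred = ["我", "喜歡", "批踢", "踢", "實業坊"]
precision, recall, f_measure = segmentation_scores(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_measure:.3f}")
```

In this example three of the five predicted words match the four manually segmented words, so splitting a longer term lowers both precision and recall. That is the kind of error a lexicon containing longer terms is meant to reduce.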

