
基於對照表以及語言模型之簡繁字體轉換

Chinese Characters Conversion System Based on Lookup Table and Language Model

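Abstract


The writing systems of mainland China and Taiwan are both Chinese, but they are divided into simplified and traditional characters. In recent years, mainland China and Taiwan have exchanged large amounts of information through Chinese-language books and the Internet. Because of differing reading habits, text must be converted between simplified and traditional characters before readers on either side can read it comfortably. Conventional conversion suffers from two problems: the ambiguity that arises when one simplified character maps to several traditional characters, and differences in terminology between the two sides of the strait. This study therefore designs an extensible conversion system. It addresses the terminology differences by adding lookup-table entries manually extracted from Wikipedia, and it uses a language model to resolve the one-to-many ambiguity from simplified to traditional characters. The system can reduce the cost of manual proofreading after Chinese e-books of all kinds are converted. Its flexible architecture allows the system to be extended and improved continually.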

Parallel Abstract


The character sets used in China and Taiwan are both Chinese, but they are divided into simplified and traditional characters. A large amount of information is exchanged between China and Taiwan through books and the Internet. To give readers a convenient reading environment, conversion between simplified and traditional Chinese is necessary. This conversion has two problems: one-to-many ambiguity and differences in term usage. Because several traditional characters can correspond to the same simplified character, a system that converts simplified Chinese into traditional Chinese faces one-to-many ambiguity. In addition, many terms are used differently in the two Chinese-speaking societies. This paper focuses on designing an extensible conversion system that draws on community knowledge by accumulating lookup tables from Wikipedia to tackle the term usage problem, and that integrates a language model to resolve the one-to-many ambiguity. The system can reduce the cost of proofreading converted books, e-books, and online publications. The extensible architecture makes it easy to improve the system with new training data.
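
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the table contents, the bigram counts, and the names `TERM_TABLE`, `CHAR_TABLE`, `BIGRAM_COUNTS`, `candidates`, and `convert` are all hypothetical, and a toy add-one-smoothed bigram score stands in for whatever language model the paper actually uses. Term differences are handled by longest-match lookup, and one-to-many characters are resolved by picking the highest-scoring candidate sequence.

```python
import math

# Hypothetical toy tables; a real system would load far larger ones.
TERM_TABLE = {"软件": "軟體"}  # mainland term -> Taiwan term
CHAR_TABLE = {"头": ["頭"], "发": ["發", "髮"]}  # simplified -> candidates

# Toy bigram counts standing in for a trained language model (assumption).
BIGRAM_COUNTS = {("頭", "髮"): 50, ("理", "髮"): 30, ("發", "展"): 80}


def bigram_score(prev_char, cur_char):
    """Log of the add-one-smoothed bigram count (unnormalized toy score)."""
    return math.log(BIGRAM_COUNTS.get((prev_char, cur_char), 0) + 1)


def candidates(text):
    """Split text into slots, each a list of candidate traditional strings."""
    slots, i = [], 0
    max_len = max(map(len, TERM_TABLE))
    while i < len(text):
        # Try the longest term match first (term usage problem).
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in TERM_TABLE:
                slots.append([TERM_TABLE[piece]])
                i += n
                break
        else:
            # Fall back to the character table; unknown characters pass through.
            slots.append(CHAR_TABLE.get(text[i], [text[i]]))
            i += 1
    return slots


def convert(text):
    """Pick the candidate sequence with the best total bigram score."""
    slots = candidates(text)
    if not slots:
        return ""
    # best[c] = (score, output so far) for the best path ending in candidate c
    best = {c: (0.0, c) for c in slots[0]}
    for slot in slots[1:]:
        best = {
            cur: max(
                (score + bigram_score(prev[-1], cur[0]), out + cur)
                for prev, (score, out) in best.items()
            )
            for cur in slot
        }
    return max(best.values())[1]


print(convert("头发"), convert("发展"), convert("理发软件"))
```

With these toy counts, the ambiguous character 发 resolves to 髮 after 頭 or 理 and to 發 before 展, while 软件 is replaced as a whole term by the lookup table. Extending the system in this sketch amounts to adding table entries or retraining the counts, which mirrors the extensibility the abstract emphasizes.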


