由後綴陣列與序列排比探索有意義的中文文句樣式

宋政隆

doi:10.6342/NTU.2010.03352

Dissertation


Exploring Chinese Text Meaningful Patterns with Suffix Array and Sequence Alignment

宋政隆(Cheng-Lung Sung)

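Advisor: 顏嗣鈞

National Taiwan University / College of Electrical Engineering and Computer Science / Graduate Institute of Electrical Engineering / Ph.D. (2010)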

https://doi.org/10.6342/NTU.2010.03352


Abstract (Chinese)

Not available.

Keywords (Chinese)

後綴陣列 (suffix array); 序列排比 (sequence alignment); 中文文句樣式 (Chinese text patterns)

Abstract (English)

Chinese text is character-based rather than word-based, and Chinese sentences contain no boundary marks between words. Each Chinese character stands for one phonological syllable and, in most cases, represents a morpheme. This raises a problem because fewer than 10% of Chinese word types (and fewer than 50% of the tokens in a text) consist of a single character. In most Chinese information retrieval (IR) tasks, identifying keywords is difficult because of segmentation ambiguities and the occurrence of unknown words. As a result, a great deal of research has focused on extracting words from raw Chinese text (i.e., sentences without text segmentation). In this dissertation, we propose two approaches to Chinese natural language processing problems:

(1) Term Contributed Frequency for Chinese Word Extraction. We introduce a statistical, suffix array-based approach to Chinese term extraction that calculates the term contributed frequency (TCF) without a dictionary. We use an external data structure called the TCF-Node to store two kinds of term frequency, which can be used to solve the N-gram frequency distortion problem. The proposed TCF-based approach is a novel attempt to extract Chinese terms automatically and effectively. In addition to handling text corpora dynamically, our approach does not impose any strict requirements on the size and quality of the training corpora.

(2) Alignment-Based Surface Patterns for Chinese Factoid Question Answering Systems. Traditional IR uses keywords or implicit rules, such as latent semantic indexing, to index a text. However, humans recognize a text through semantic information. Therefore, we propose an alignment-based surface pattern approach, called ABSP, which integrates semantic information into syntactic patterns. ABSP employs a new strategy to extract surface patterns from non-segmented passages. It uses the surface patterns to extract important terms from questions, and then constructs the terms' relations from sentences in the corpus. Finally, the relations are used to rank answer candidates. We incorporate the approach into Chinese question answering (QA) to verify the feasibility of ABSP in Chinese. Our experiments show that ABSP improves the answer accuracy of an existing cross-lingual QA system that has high coverage. We believe the approach is robust and portable to other domains.
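To make the N-gram frequency distortion problem in (1) concrete, the following is a minimal sketch of counting substring frequencies over unsegmented text with a suffix array. It is not the dissertation's TCF computation or its TCF-Node structure, neither of which is specified in this abstract; the function names `build_suffix_array` and `count_occurrences` and the sample sentence are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text: str) -> list[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction (sorting full suffixes); adequate for a short
    illustration, not for a large corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_occurrences(text: str, sa: list[int], term: str) -> int:
    """Count how often `term` occurs in `text` using the suffix array.

    All suffixes that start with `term` form one contiguous run in the
    suffix array, so two binary searches delimit that run.
    """
    # Truncating sorted suffixes to len(term) keeps them in sorted order.
    prefixes = [text[i:i + len(term)] for i in sa]
    return bisect_right(prefixes, term) - bisect_left(prefixes, term)


# Hypothetical sample text (not from the dissertation), unsegmented.
text = "台灣大學在台北，台灣大學很大，大學生很多"
sa = build_suffix_array(text)

long_term = count_occurrences(text, sa, "台灣大學")   # 2
short_term = count_occurrences(text, sa, "大學")      # 3

# Two of the three "大學" hits lie inside "台灣大學", so the raw count
# overstates how often "大學" is used as a term on its own.
print(long_term, short_term, short_term - long_term)
```

Subtracting the occurrences covered by the longer candidate is only one naive correction, shown here to illustrate why raw N-gram counts can mislead.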
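The idea in (2) of deriving surface patterns by aligning non-segmented sentences can be sketched as follows. This is an assumption-laden illustration, not ABSP itself: it aligns two sentences character by character with a longest-common-subsequence alignment (match score 1, no gap penalty) and replaces unaligned stretches with a wildcard slot. It carries no semantic information, which ABSP does integrate; the function name `align_pattern`, the slot symbol, and the sample sentences are hypothetical.

```python
def align_pattern(a: str, b: str, slot: str = "*") -> str:
    """Align two unsegmented sentences and return their shared pattern.

    Characters aligned in both sentences are kept; each maximal unaligned
    stretch collapses into a single `slot` symbol.
    """
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[i:] and b[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    pattern: list[str] = []
    i = j = 0
    while i < n and j < m:
        if a[i] == b[j]:
            # Aligned character: part of the fixed surface pattern.
            pattern.append(a[i])
            i += 1
            j += 1
        else:
            # Unaligned region: emit one slot, then skip along an optimal path.
            if not pattern or pattern[-1] != slot:
                pattern.append(slot)
            if lcs[i + 1][j] >= lcs[i][j + 1]:
                i += 1
            else:
                j += 1
    # Leftover characters in either sentence also become a slot.
    if (i < n or j < m) and (not pattern or pattern[-1] != slot):
        pattern.append(slot)
    return "".join(pattern)


# Hypothetical sample sentences (not from the dissertation).
print(align_pattern("台北是台灣的首都", "東京是日本的首都"))  # *是*的首都
```

In a question answering setting, the slots of such a pattern mark where candidate terms could be extracted from a question or passage; how ABSP scores and filters patterns is not described in this abstract.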

Keywords (English)

surface pattern; suffix array; sequence alignment

