由後綴陣列與序列排比探索有意義的中文文句樣式

Exploring Chinese Text Meaningful Patterns with Suffix Array and Sequence Alignment

Advisor: 顏嗣鈞

Abstract


Chinese texts are character-based rather than word-based, and Chinese sentences contain no boundary marks between words. Each Chinese character stands for one phonological syllable and, in most cases, represents a morpheme. This raises a problem because fewer than 10% of Chinese word types (and fewer than 50% of the tokens in a text) consist of a single character. In most Chinese information retrieval (IR) tasks, identifying keywords is difficult because of segmentation ambiguities and the occurrence of unknown words. As a result, a great deal of research has focused on extracting words from raw Chinese texts (i.e., sentences without text segmentation). In this dissertation, we propose two approaches to Chinese natural language processing problems:

(1) Term Contributed Frequency for Chinese Word Extraction. We introduce a statistical, suffix array-based Chinese term extraction approach that calculates the term contributed frequency (TCF) without a dictionary. We use an external data structure called the TCF-Node to store two kinds of term frequency, which can be used to solve the N-gram frequency distortion problem. The proposed TCF-based approach is a novel attempt to extract Chinese terms automatically and effectively. In addition to handling text corpora dynamically, our approach does not impose any strict requirements on the size and quality of the training corpora.

(2) Alignment-Based Surface Patterns for Chinese Factoid Question Answering Systems. Traditional IR uses keywords or implicit rules, such as latent semantic indexing, to index a text. However, humans recognize a text through semantic information. Therefore, we propose an alignment-based surface pattern approach, called ABSP, which integrates semantic information into syntactic patterns. ABSP employs a new strategy to extract surface patterns from non-segmented passages. It uses the surface patterns to extract important terms from questions, and then constructs the terms' relations from sentences in the corpus. Finally, the relations are used to rank answer candidates. We incorporate the approach into a Chinese question answering (QA) system to verify the feasibility of ABSP for Chinese. Our experiments show that ABSP improves the answer accuracy of an existing cross-lingual QA system that has high coverage. We believe the approach is robust and portable to other domains.
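The suffix-array counting that the first approach builds on can be illustrated with a minimal sketch. This is not the dissertation's TCF or TCF-Node implementation, which the abstract does not specify. The function names `build_suffix_array` and `repeated_ngrams` and the sample sentence are assumptions made for illustration, and the naive sort-based construction is chosen for brevity, not efficiency.

```python
from itertools import groupby


def build_suffix_array(text: str) -> list[int]:
    """Return all suffix start positions, ordered by the suffix they begin.

    Naive construction (sorts whole suffixes); fine for short examples only.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def repeated_ngrams(text: str, n: int, min_count: int = 2) -> dict[str, int]:
    """Count n-character substrings of unsegmented text via a suffix array.

    Suffixes that share the same first n characters are adjacent in the
    sorted order, so each distinct n-gram forms one contiguous run.
    """
    suffix_array = build_suffix_array(text)
    # Keep only suffixes long enough to supply a full n-gram.
    prefixes = (text[i:i + n] for i in suffix_array if i + n <= len(text))
    counts: dict[str, int] = {}
    for gram, run in groupby(prefixes):
        run_length = sum(1 for _ in run)
        if run_length >= min_count:
            counts[gram] = run_length
    return counts


if __name__ == "__main__":
    # Hypothetical sample: no spaces or other word boundaries are given.
    sample = "自然語言處理很有趣。我喜歡自然語言處理。"
    for n in (2, 6):
        print(n, repeated_ngrams(sample, n))
```

In this sample every substring of the repeated string 自然語言處理 inherits its count, so a non-word such as 然語 is counted as often as the word 語言. Inflated counts of this kind are presumably what the abstract means by the N-gram frequency distortion problem, which a raw frequency table alone cannot resolve.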

