A Data Augmentation Method based on Chinese Character Vector

Text data augmentation generates new samples from limited data by modifying the original text content. It can effectively increase the size of the training set, improve the generalization ability and robustness of the model, and mitigate the overfitting caused by insufficient training data or uneven sample distribution. This paper proposes a data augmentation method based on Chinese character vectors. First, we train a Word2vec model on the Chinese Wikipedia corpus to obtain a set of Chinese character vectors. Second, we choose a character to be replaced in the text that needs to be augmented. Finally, we search the character vector set for the one or more characters most similar to the character selected in the second step and substitute them for it to generate new samples. We mix the text in the original training set with the augmented text to form a new training set, which serves as the input to a CNN (Convolutional Neural Network) model for classification. The experimental results show that the performance of the CNN model improves by 1.57% when our proposed data augmentation approach is used. Compared to the word-level replacement data augmentation technique, our approach reduces the algorithm running time by four-fifths.
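The three steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the character vectors below are toy values invented for illustration (the paper obtains real ones from Word2vec trained on Chinese Wikipedia), and the names `CHAR_VECTORS`, `nearest_chars`, and `augment` are hypothetical. Cosine similarity is assumed as the similarity measure, and the replacement position is chosen at random, since the abstract does not specify either.

```python
import math
import random

# Toy character vectors, invented for illustration only. In the paper these
# would come from a Word2vec model trained on the Chinese Wikipedia corpus.
CHAR_VECTORS = {
    "好": [0.9, 0.1, 0.0],
    "佳": [0.8, 0.2, 0.1],
    "棒": [0.7, 0.3, 0.0],
    "坏": [-0.9, 0.1, 0.0],
    "差": [-0.8, 0.2, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_chars(char, vectors, top_k=1):
    """Return the top_k characters most similar to `char` (excluding itself)."""
    if char not in vectors:
        return []
    target = vectors[char]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != char]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_k]]


def augment(text, vectors, n_replace=1, rng=None):
    """Replace up to n_replace in-vocabulary characters with their nearest neighbour."""
    rng = rng or random.Random(0)
    # Only characters that have a vector can be replaced.
    positions = [i for i, ch in enumerate(text) if ch in vectors]
    chosen = rng.sample(positions, min(n_replace, len(positions)))
    chars = list(text)
    for i in chosen:
        neighbours = nearest_chars(chars[i], vectors, top_k=1)
        if neighbours:
            chars[i] = neighbours[0]
    return "".join(chars)


# Mix original and augmented samples to form the new training set.
originals = ["味道很好", "服务很差"]
training_set = originals + [augment(t, CHAR_VECTORS) for t in originals]
```

Because replacement works on single characters, no word segmentation step is needed in this sketch, which is one plausible reason a character-level approach can run faster than word-level replacement.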

Keywords

Character Vector; Data Augmentation; Text Classification; Convolutional Neural Network
