Automatic text classification (ATC) is the task of automatically assigning one or more appropriate categories to a document according to its content or topic. Traditionally, text classification has been carried out by human experts, as it requires a certain level of vocabulary recognition and knowledge processing. With the rapid explosion of texts in digital form and the growth of online information, text classification has become an important research area owing to the need to automatically handle and organize text collections. The applications of this technology are manifold, including automatic indexing for information retrieval systems, document organization, text filtering, spam filtering, and even hierarchical categorization of web pages. Many standard machine learning techniques have been applied to automated text classification problems, and the k-Nearest Neighbor algorithm (kNN) and the Support Vector Machine (SVM) have been reported as the top-performing methods for English text classification.

However, studies on Chinese text classification are less sufficient than those on English, and Chinese text has its own characteristics. Since there is no natural delimiter between Chinese words, Chinese segmentation is necessary before any other preprocessing. Conventional feature selection and extraction methods for English text classification may not be applicable to Chinese because of the unique linguistic characteristics and complex ambiguities of the Chinese language. As a consequence, Chinese segmentation is a major issue in Chinese document processing and has been extensively discussed in the literature. Numerous segmentation approaches have been proposed for Chinese text classification; they can basically be divided into character-based and word-based approaches. Since there are no standard publicly available Chinese corpora, it is difficult to tell which method performs better.

Therefore, this research focuses on methods of Chinese segmentation, feature selection, and feature combination. We first perform experiments using the character-based approach and the word-based approach, respectively, then compare them to identify the advantages and problems of each. Based on the experimental results and analysis, we propose a method that combines the two approaches. Furthermore, we evaluate the effectiveness of feature extraction, feature transformation, and dimension reduction techniques to further improve the accuracy of Chinese text classification.

We first performed Chinese text classification using the character-based (N-gram) approach. Experimental results show that the combination of uni-grams and bi-grams (1+2-gram) is the most efficient way to represent a Chinese document. We experimentally evaluated the effectiveness of feature transformation techniques, including normalizing absolute frequencies to relative frequencies and power transformation; the results show a significant improvement in performance. N-gram and word segmentation extraction on a large corpus yields a large number of possible features, and in ATC the high dimensionality of the feature space can be problematic in terms of computational time and storage resources. Experiments show that Principal Component Analysis is an efficient and effective way to reduce the dimensionality.
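As a rough illustration of the character-based pipeline described above (1+2-gram features, relative-frequency normalization, a power transformation, and PCA), the following Python sketch uses scikit-learn; the exponent 0.5, the 300 retained components, and the linear SVM classifier are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Character uni-grams and bi-grams ("1+2-gram"): analyzer="char" splits the
# raw string into single characters and character pairs, so no word
# segmentation is needed at this stage.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(1, 2))

def transform_counts(counts, power=0.5):
    """Convert absolute counts to relative frequencies, then apply a power
    transformation. The exponent 0.5 is an illustrative choice."""
    X = counts.toarray().astype(float)
    totals = X.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0          # avoid division by zero for empty documents
    return np.power(X / totals, power)

# Hypothetical usage with training documents `docs` and labels `y`:
# X = transform_counts(vectorizer.fit_transform(docs))
# X_reduced = PCA(n_components=300).fit_transform(X)   # dimensionality reduction
# clf = LinearSVC().fit(X_reduced, y)                  # SVM classifier
```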
We then presented several experiments based on the word-based approach and proposed a novel feature selection method based on part-of-speech analysis. According to the components of Chinese texts, we utilize the words' part-of-speech attributes to filter out many meaningless terms. The results show that nouns are the most important features for Chinese texts and that a suitable combination of parts of speech can lead to better classification performance. Several sets of experiments were carried out to study the impact of automatic word segmentation errors on Chinese text classification. A comparison of four word-based approaches shows that performance is significantly reduced when automatic word segmentation is used instead of manual word segmentation, which means that the errors caused by automatic word segmentation have an obvious impact on classification performance.

Furthermore, we proposed an effective way of combining the character-based (N-gram) and word-based approaches for Chinese text classification. Uni-gram and bi-gram features are taken as the baseline model and then combined with word features of length greater than or equal to three. We further introduce a weight coefficient that can be used to give higher weights to the word features. Experimental results show that our proposed approach achieved the highest performance.

In future work, since a growing number of categories increases the difficulty of text classification, extensive experimental evaluation using more texts in more categories will be studied. Another task is to improve classification performance on Optical Character Reader (OCR) texts. The digitization of printed documents involves generating texts with an OCR system; however, OCR texts usually contain errors due to misrecognized characters. It is therefore necessary to investigate how to deal with these texts effectively.
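The combined character- and word-based representation outlined above could be sketched roughly as follows; jieba is used here only as a stand-in segmenter and part-of-speech tagger, and the noun filter, the length-three threshold, and the value of the weight coefficient are illustrative assumptions rather than the exact configuration of this work.

```python
import jieba.posseg as pseg                 # stand-in segmenter / POS tagger
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Baseline: character uni-grams and bi-grams (1+2-gram).
char_vec = CountVectorizer(analyzer="char", ngram_range=(1, 2))

def long_noun_tokens(text):
    """Return segmented words of length >= 3 characters, keeping only nouns
    (POS tags starting with 'n') as an example of part-of-speech filtering."""
    return [word for word, flag in pseg.cut(text)
            if len(word) >= 3 and flag.startswith("n")]

word_vec = CountVectorizer(analyzer=long_noun_tokens)

ALPHA = 2.0   # illustrative weight coefficient for up-weighting word features

# Hypothetical usage with training documents `docs`:
# X_char = char_vec.fit_transform(docs)
# X_word = word_vec.fit_transform(docs) * ALPHA   # give word features higher weight
# X = hstack([X_char, X_word])                    # combined feature matrix
```

The idea of this design is that longer segmented words carry topical information that character bi-grams cannot capture, so they are appended to the baseline character features and up-weighted by the coefficient.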
Division of Systems Engineering, Graduate School of Engineering, Mie University
66 p.