Author/Editor: Eronen Juuso Kalevi Kristian
Parallel title, etc.: Improving Multilingual Automatic Cyberbullying Detection With Feature Density And Cross-lingual Zero-shot Transfer
General note: In this thesis, I study two different methods for improving multilingual automatic cyberbullying detection. First, I study the effectiveness of Feature Density (FD) using different linguistically-backed feature preprocessing methods in order to estimate dataset complexity, which in turn is used to comparatively estimate the potential performance of machine learning (ML) classifiers prior to any training. I hypothesize that estimating dataset complexity allows for the reduction of the number of required experiment iterations, making it possible to optimize the resource-intensive training of ML models, which is becoming a serious issue due to the increases in available dataset sizes and the ever-rising popularity of models based on Deep Neural Networks (DNN). The problem of constantly increasing needs for more powerful computational resources is also affecting the environment due to the alarmingly growing amount of CO2 emissions caused by the training of large-scale ML models. I use cyberbullying datasets collected for multiple languages, namely English, Japanese and Polish. The difference in linguistic complexity of the datasets allows me to additionally discuss the efficacy of linguistically-backed word preprocessing. Second, I study the selection of transfer languages for automatic abusive language detection. I demonstrate the effectiveness of cross-lingual transfer learning for zero-shot abusive language detection. This way it is possible to use existing data from higher-resource languages to build better detection systems for languages lacking data. The datasets are from eight different languages from three language families. I measure the distance between the languages using several language similarity measures, especially by quantifying the World Atlas of Language Structures.
I show that there is a correlation between linguistic similarity and classifier performance, making it possible to choose an optimal transfer language for zero-shot abusive language detection. Next, I demonstrate that this method is also generally applicable to multiple Natural Language Processing tasks, specifically sentiment analysis, named entity recognition and dependency parsing. I show that there is also a correlation between linguistic similarity and zero-shot cross-lingual transfer performance for these tasks, allowing me to select an ideal transfer language in order to aid with the problem of dealing with languages that do not currently have a sufficient amount of data. Lastly, I show that the World Atlas of Language Structures can be quantified into an effective linguistic similarity method.
Collection (individual): National Diet Library Digital Collections > Digitized Materials > Doctoral Dissertations
Date of acceptance (W3CDTF): 2022-11-07T16:56:35+09:00
Partner institution/database: National Diet Library: National Diet Library Digital Collections