Doctoral dissertation
A study on Mongolian text-to-speech system based on deep neural network
Note on use at the National Diet Library
The full text of this material may be freely available from the degree-granting institution's website or from CiNii Dissertations, linked from the host item (URI) or similar fields.
Bibliographic information
Details of this material and its authority data (keywords and author names that point to materials on the same subject) can be checked here.
- Material type
- Doctoral dissertation
- Author/editor
- Byambadorj, Zolzaya
- Date of publication etc.
- 2022-03-23
- Publication date (W3CDTF)
- 2022-03-23
- Parallel title etc.
- ディープニューラルネットワークに基づくモンゴル語のテキスト音声合成システムに関する研究
- Degree-granting institution
- Tokushima University
- Date of conferral
- 2022-03-23
- Date of conferral (W3CDTF)
- 2022-03-23
- Report number
- 甲第3634号
- Degree
- Doctor of Engineering
- Dissertation conferral number
- 甲第3634号
- Language code of text
- eng
- Subject headings
- Intended users
- General
- General note
- There are about 7,000 languages spoken in the world today. However, most natural language processing and speech processing studies have been conducted on high-resource languages such as English, Japanese and Mandarin. Preparing large amounts of training data is expensive and time-consuming, which creates a significant hurdle when developing systems for the world's many less widely spoken languages. Mongolian is one of these low-resource languages. We proposed to build a text-to-speech (TTS, also called speech synthesis) system for the low-resource Mongolian language. We present two studies within this TTS system, "text normalization" and "speech synthesis," for the Mongolian language with limited training data. A TTS system converts written text into machine-generated synthetic speech. One of the biggest challenges in developing a TTS system for a new language is converting transcripts into a real "spoken" form, the exact words that the speaker said. This is an important preprocessing step for TTS systems, known as text normalization. In other words, text normalization transforms text into a standard form and is an essential part of a speech synthesis system. It later also became important for processing social media text because of the rapid expansion of user-generated content on social media sites. As the use of social media grows rapidly, there is no doubt that TTS systems will need to generate speech from social media text. Therefore, we were particularly interested in social media text normalization. Thus, this thesis consists of two main parts, text normalization and speech synthesis. We experimentally demonstrated how to improve the output of the model used for each part using a small amount of training data. The following are brief descriptions of each part.

Text normalization: The huge increase in social media use in recent years has resulted in new forms of social interaction, changing our daily lives. Social media websites are a rich source of text data, but processing and analyzing social media text is a challenging task because written social media messages are usually informal and 'noisy'. Due to increasing contact between people from different cultures as a result of globalization, there has also been an increase in the use of the Latin alphabet, and as a result a large amount of transliterated text is being used on social media. Although there is a standard for the use of Latin letters in the language, the public does not generally observe it when writing on social media. Therefore, social media text also contains many noisy, transliterated words. For example, many people who speak Mongolian use the Latin alphabet to write Mongolian words on social media instead of the Cyrillic alphabet. These messages are informal and 'noisy', however, because everyone uses their own judgement as to which Latin letters should be substituted for particular Cyrillic letters, since there are 35 letters in the Mongolian Cyrillic alphabet versus 26 letters in the modern Latin alphabet (not counting letters with diacritical marks such as accents, umlauts, etc.). In most research on noisy text normalization, both the source text and the target text are in the same language; in other words, the alphabets used in the source and target texts are the same. Text normalization is difficult to perform on noisy text even when it is not transliterated. In this thesis, our first goal is to convert noisy, transliterated text into formal writing in a different alphabet. This poses additional challenges in the text normalization task. We propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. When there is a limited amount of training data, and the rules for writing noisy, transliterated text are not limited, normalizing out-of-vocabulary (OOV) words becomes a difficult challenge. Therefore, we applied performance enhancement methods, which included various beam search strategies, N-gram-based context adoption, edit distance-based correction and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed seq2seq models improved the robustness of the basic seq2seq models for normalizing OOV words, and most of our models achieved higher normalization performance than the conventional method.
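As an illustration of the character-level seq2seq approach described above, the following is a minimal sketch, not the thesis implementation: the CharSeq2Seq class, the toy Latin/Cyrillic pair, and the vocabulary and hidden sizes are illustrative assumptions, and the enhancements named above (beam search, N-gram context, edit-distance correction, dictionary checking) are not included.

```python
# Minimal character-level seq2seq sketch for Latin-to-Cyrillic normalization
# (illustrative only; hyperparameters and the toy word pair are assumptions).
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2  # special character ids

class CharSeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden, padding_idx=PAD)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden, padding_idx=PAD)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Encode the noisy Latin-script character sequence.
        _, state = self.encoder(self.src_emb(src_ids))
        # Teacher-forced decoding of the Cyrillic character sequence.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), state)
        return self.out(dec_out)

# Toy pair: Latin "sain" -> Cyrillic "сайн"; a real system would be trained
# on many such word pairs with character vocabularies built from the data.
src_chars = {c: i + 3 for i, c in enumerate("sain")}
tgt_chars = {c: i + 3 for i, c in enumerate("сайн")}
src = torch.tensor([[src_chars[c] for c in "sain"] + [EOS]])
tgt_in = torch.tensor([[SOS] + [tgt_chars[c] for c in "сайн"]])
tgt_out = torch.tensor([[tgt_chars[c] for c in "сайн"] + [EOS]])

model = CharSeq2Seq(src_vocab=len(src_chars) + 3, tgt_vocab=len(tgt_chars) + 3)
logits = model(src, tgt_in)                       # (batch, time, tgt_vocab)
loss = nn.CrossEntropyLoss(ignore_index=PAD)(logits.transpose(1, 2), tgt_out)
loss.backward()                                   # gradients for one training step
```

In a full system, decoding would be done with beam search over the character distribution, and the candidate outputs would then be re-scored or corrected with the N-gram, edit-distance and dictionary-based checks described above.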
Speech synthesis: Deep learning techniques are currently being applied in automated TTS systems, resulting in significant improvements in performance. These methods require large amounts of text-speech pair data for model training, however, and collecting this data is costly. Tacotron 2, the state-of-the-art end-to-end speech synthesis system we used, requires more than 10 hours of training data to produce good synthesized speech. Therefore, our second goal is to build a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target Mongolian language, using only 30 minutes of Mongolian text-speech paired data for training. We evaluate three methods for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 hours) and Japanese (10 hours). We also used 30 minutes of target language data for training in all three methods, and for generating the augmented data used for training in methods (2) and (3). We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech. We also compared single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 hours of our augmented data together with 30 minutes of target language data, and one using the entire 12 hours of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. We found that our proposed TTS system, consisting of a spectrogram prediction network and a PWG neural vocoder, was able to achieve reasonable performance using only 30 minutes of target language training data. We also found that by using 3 hours of target language data for training the model and for generating augmented data, our proposed TTS model achieved performance very similar to that of the baseline model, which was trained with 12 hours of target language data.
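To make the cross-lingual transfer-learning recipe concrete, here is a minimal sketch under stated assumptions: the SpectrogramPredictor class, the checkpoint filename and the toy batch are hypothetical placeholders and not the thesis code, and the stand-in network is far simpler than Tacotron 2 (it emits one mel frame per input phoneme rather than using attention-based frame expansion).

```python
# Sketch of cross-lingual transfer learning for a spectrogram prediction model:
# initialize from weights pre-trained on a high-resource language, then
# fine-tune on the small target-language (Mongolian) set. Illustrative only.
import torch
import torch.nn as nn

class SpectrogramPredictor(nn.Module):
    """Stand-in for a Tacotron 2-style phoneme-to-mel-spectrogram network."""
    def __init__(self, n_phonemes=100, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, phoneme_ids):
        x, _ = self.encoder(self.embed(phoneme_ids))
        return self.proj(x)  # (batch, frames, n_mels)

model = SpectrogramPredictor()

# Step 1 (assumed checkpoint path): initialize from a model pre-trained on a
# high-resource language such as English or Japanese.
# model.load_state_dict(torch.load("pretrained_high_resource.pt"))

# Step 2: fine-tune on the small target-language set (e.g. 30 minutes of
# Mongolian text-speech pairs, possibly mixed with augmented utterances).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
phonemes = torch.randint(0, 100, (2, 12))      # toy batch of phoneme ids
target_mels = torch.randn(2, 12, 80)           # toy target mel-spectrograms

for step in range(3):                          # real training runs far longer
    pred = model(phonemes)
    loss = nn.functional.mse_loss(pred, target_mels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The pre-trained weights supply the generic text-to-acoustic mapping that 30 minutes of target data alone cannot provide; the augmented and multi-speaker variants described above differ only in what data is fed to the fine-tuning step.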
- NDL persistent identifier
- info:ndljp/pid/12304401
- Collection (common)
- Collection (materials for persons with disabilities: level 1)
- Collection (individual)
- National Diet Library Digital Collections > Digitized materials > Doctoral dissertations
- Basis for collection
- Doctoral dissertations (automatically collected)
- Date accepted (W3CDTF)
- 2022-07-05T02:30:21+09:00
- Format (IMT)
- PDF
- Online access availability
- Available only within the National Diet Library
- Digitized material transmission
- Not available for library/individual transmission
- Remote copying (NDL)
- Available
- Partner institution / database
- National Diet Library: National Diet Library Digital Collections