Doctoral dissertation
Available in National Diet Library
National Diet Library Digital Collections
Digital data available
DOI: 10.15002/00025228
Research on Real-time Voice Adaptation with Speech Features as Interpreted Objectives
- Persistent ID (NDL)
- info:ndljp/pid/12302983
- Material type
- Doctoral dissertation
- Author
- MIDTLYNG, Mads Alexander
- Publisher
- -
- Publication date
- 2022-03-24
- Material Format
- Digital
- Capacity, size, etc.
- -
- Name of awarding university/degree
- Hosei University (法政大学), Doctor of Science
Notes on use at the National Diet Library
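The full text of this material may be freely available from the website of the degree-granting institution or from CiNii Dissertations, which are linked under Periodical Title (URI) and similar fields.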
Bibliographic Record
Digital
- Material Type
- Doctoral dissertation
- Author/Editor
- MIDTLYNG, Mads Alexander
- Author Heading
- Publication Date
- 2022-03-24
- Publication Date (W3CDTF)
- 2022-03-24
- Degree grantor/type
- Hosei University (法政大学)
- Date Granted
- 2022-03-24
- Date Granted (W3CDTF)
- 2022-03-24
- Dissertation Number
- Kō No. 545 (甲第545号)
- Degree Type
- Doctor of Science
- Conferring No. (Dissertation)
- Kō No. 545 (甲第545号)
- Text Language Code
- eng
- Subject Heading
- Target Audience
- General
- Note (General)
- type: Thesis
- In the field of speech processing, Voice Adaptation (VA), a technique for translating a spoken message from a source voice into a target voice while retaining the prosodic information, has seen serious attempts since the 1980s. To swap the spoken message between voices successfully, the human voice, which is a very complex and variable set of information, must be comparable between two speakers uttering the same phrase. To do this, early research used probabilistic methods such as Gaussian Mixture Models, pitch shifting, and complex mathematical models to represent a human’s physical traits such as gender, weight, vocal tract, tongue, and more. To train the model, many speech samples, often from multiple sources, were needed just to utter a few letters or words. In the 2010s, Deep Neural Networks (DNNs) emerged and saw many uses, one being voice adaptation. While results were better than with classical methods, a DNN requires a lot of data and processing time to create a model. The more it trains, the better the model becomes; thus, it is not ideal for dynamic environments where information must be changed or added quickly. Although adaptation quality and amount have improved historically, VA is still rated near the average mark quality-wise when objective evaluations are performed, and the technology is not mature enough to see serious real-world use. One especially interesting area of use is online, in video games. In a setting where it is commonplace to mask oneself with a new name and visual avatar, but not the voice, this is a perfect area in which to introduce a lightweight, working VA method.
We propose a new design for training and performing VA that takes advantage of evolutionary computing to find the ideal match for a comparing function. Rather than considering language, grammar, and physique, we consider the voice as a collection of sounds a person can make. This is determined by phonemes, which vary by language but are in essence a fixed set of sounds used to pronounce any word in the current language. Thus, if we can collect these phonemes in conjunction with varying parameters, it should be possible to piece together a desired output based on the structure of the input sound.
First, a short recording is made from a manuscript reading designed to make the voice subject utter as many phonemes and word combinations as possible. This audio is split into tiny fragments referred to as frames. Each frame is put through a stylized form of quantization, which serves as a normalization stage. The processed frame represents an abstraction of the original acoustic information, which can make it easier to compare similar utterances. At the same time, other temporal and spectral features are extracted, and a selection of the most crucial features is encoded into two separate color objects with RGBA channels that group the temporal and spectral features. This information, along with where each frame links to in the original audio, is stored as a voice profile. During the actual adaptation, similar splitting and quantization are performed on the source speaker’s voice. The color-encoded frames are used in a multi-objective optimization problem which, while solving an objective function, evolves over generations to present an increasingly ideal set of target solutions. The objective function is a hybrid of a native benchmark problem and a custom function that measures the distance, or likeness, between two colors, namely the source color and a target color obtained from the voice profile. Technically this is a 2-objective evaluation, but because of how the various features are encoded into color channels and how the color likeness is measured, 6 objectives are evaluated at once. When the evolution is stopped, we can confidently select one of the solutions as an ideal target frame and put it through final adjustments before it is output.
We evaluate this research using standard objective evaluation methods such as Mean Opinion Score (MOS), ABX, and MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) to evaluate the sound quality and consistency, and the Single Ease Question (SEQ) to survey the interest in VA for games, as well as by benchmarking the performance. We see that the latest proposed iteration, which has many vocal features encoded into colors for the multi-objective optimization problem, achieves better quality, and its training time is remarkable compared to that of a traditional DNN.
To conclude, the proposed VA system shows that a smaller amount of speech data is sufficient to create a working VA, as long as it is obtained through predetermined parameters such as speech intensity and phoneme content, and that multi-objective optimization can contribute to finding a more ideal target frame than arbitrary rules and subjective assumptions. This makes the system suitable for interactive, real-time software such as games, where anyone can attempt to use it.
- DOI
- 10.15002/00025228
- Persistent ID (NDL)
- info:ndljp/pid/12302983
- Collection
- Collection (Materials For Handicapped People:1)
- Collection (particular)
- National Diet Library Digital Collections > Digitized Materials > Doctoral Dissertations
- Acquisition Basis
- Doctoral dissertation (automatically collected)
- Date Accepted (W3CDTF)
- 2022-07-05T02:30:21+09:00
- Date Created (W3CDTF)
- 2022-06-16
- Format (IMT)
- PDF (application/pdf)
- Access Restrictions
- Available only within the National Diet Library
- Service for the Digitized Contents Transmission Service
- Not eligible for transmission to libraries or individuals
- Availability of remote photoduplication service
- Available
- Periodical Title (URI)
- Data Provider (Database)
- National Diet Library : National Diet Library Digital Collections
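
Illustrative sketch for the Note (General) abstract: the abstract describes splitting audio into frames, encoding selected features into RGBA colors, and measuring the likeness between a source color and target colors from a voice profile. The Python below is a minimal sketch of that idea only, not the thesis implementation. All function names, the four features chosen, the assumed sample range of [-1, 1], and the Euclidean distance are assumptions made for illustration. The record does not name the features or the distance measure, and the thesis selects targets with evolutionary multi-objective optimization, which is replaced here by a plain exhaustive nearest-color search.

```python
import math


def split_into_frames(samples, frame_size):
    """Split samples into consecutive fixed-size frames, dropping any short tail."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def to_channel(value, lo, hi):
    """Clamp value to [lo, hi] and scale it to an integer color channel 0..255."""
    clamped = min(max(value, lo), hi)
    return round(255 * (clamped - lo) / (hi - lo))


def temporal_color(frame):
    """Encode four hypothetical temporal features of one frame as an RGBA tuple.

    Assumes samples lie in [-1, 1]; the feature choice is illustrative only.
    """
    n = len(frame)
    rms = math.sqrt(sum(x * x for x in frame) / n)
    peak = max(abs(x) for x in frame)
    sign_changes = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zero_crossing_rate = sign_changes / max(n - 1, 1)
    mean_abs = sum(abs(x) for x in frame) / n
    return (to_channel(rms, 0.0, 1.0),
            to_channel(peak, 0.0, 1.0),
            to_channel(zero_crossing_rate, 0.0, 1.0),
            to_channel(mean_abs, 0.0, 1.0))


def color_distance(c1, c2):
    """Euclidean distance between two RGBA tuples (smaller means more alike)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))


def closest_frame(source_color, profile_colors):
    """Index of the profile color nearest to source_color (exhaustive search)."""
    return min(range(len(profile_colors)),
               key=lambda i: color_distance(source_color, profile_colors[i]))
```

In this sketch a voice profile would be a list of such colors, one per target frame, and `closest_frame(temporal_color(source_frame), profile_colors)` would return the index of the best-matching target frame. A second color built from spectral features could be compared in the same way to give the second objective.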