The full text of this material may be freely available from the degree-awarding institution's website or from CiNii Dissertations, via the links under Periodical Title (URI) and similar fields.
Doctoral dissertation
Paralinguistic and Nonverbal Information Extraction from Speech Signal towards Empathetic Dialogue Systems
- Persistent ID (NDL)
- info:ndljp/pid/12302982
- Material type
- Doctoral dissertation
- Author
- FUJIMURA, Hiroshi
- Publisher
- -
- Publication date
- 2022-03-24
- Material Format
- Digital
- Capacity, size, etc.
- -
- Name of awarding university/degree
- 法政大学 (Hosei University), 博士(理学) (Doctor of Science)
Notes on use at the National Diet Library
Table of Contents
2024-02-02 (re-collected)
Bibliographic Record
Here you can check the details of this material and its authority data (standardized keywords, author names, and other headings that link materials on the same subject).
- Material Format
- Digital
- Material Type
- Doctoral dissertation
- Author/Editor
- FUJIMURA, Hiroshi
- Author Heading
- Publication Date
- 2022-03-24
- Publication Date (W3CDTF)
- 2022-03-24
- Degree grantor/type
- 法政大学 (Hosei University)
- Date Granted
- 2022-03-24
- Date Granted (W3CDTF)
- 2022-03-24
- Dissertation Number
- 甲第546号
- Degree Type
- 博士(理学) (Doctor of Science)
- Conferring No. (Dissertation)
- 甲第546号
- Text Language Code
- eng
- Target Audience
- General (一般)
- Note (General)
- Type: Thesis. In this research, we aim to extract paralinguistic and nonverbal information such as emotions, speaking style, and speaker attributes towards a human-like empathetic dialogue system. Empathy, the ability to map another person's feelings and thoughts onto one's own knowledge, plays an important role in human communication. In particular, personalization and understanding emotion are essential for an advanced dialogue system. This research focuses on methods for estimating speaker attributes, personal speaking style, and emotion category, which relate to personalization and emotion, in real time from a small amount of speech information, as a human agent does. By integrating the methods proposed in this thesis, more human-like recognition of paralinguistic and nonverbal information becomes possible for speech-based automatic dialogue systems. This doctoral dissertation consists of five chapters. Chapter 1 is the introduction. In Chapter 2, we propose a method for identifying speaker attributes, which are nonverbal information in speech; this chapter focuses on distinguishing male and female speech. To extract speaker attributes, it is first necessary to detect speech segments in a sound signal that mixes speech and non-speech, and then to identify the attributes within those segments. Conventional speaker attribute identification detects the endpoint of a sufficiently long stretch of continuous speech, extracts identification features, and then classifies the segment; a delay therefore occurs because identification can only start after the end of speech is detected. In our method, a single neural network simultaneously computes, for each time frame, both the speaker attribute probabilities and the speech/non-speech probabilities.
The framework can then identify speaker attributes sequentially from their accumulated probabilities. This method classifies male and female speech with high accuracy while maintaining the accuracy of speech segment detection. In Chapter 3, we propose a phoneme identification method that enables detection of low-intelligibility speech. When low-intelligibility speech occurs, the phonemes in the affected part of an utterance are unclear and differ significantly from phonemes in ordinary speech. Since phoneme features depend on the relative phoneme position, it is necessary to cluster phonemes by position and to train a discriminative model for each cluster that determines whether a phoneme is clearly uttered. We therefore propose a discriminator that contains phoneme environment-dependent clusters internally, which makes it possible to discriminate phonemes without pre-clustering and to compute an intelligibility score. In Chapter 4, we propose a method for extracting paralinguistic and nonverbal information such as fillers and word fragments. Fillers and word fragments have many variations, and it is not easy to store all their patterns in a language dictionary in advance. Existing methods therefore use two-pass decoding, detecting fillers and word fragments from a confusion network output by first-pass recognition together with a sub-word language model. However, this approach is unsuitable for real-time applications because processing can only start after the end of the utterance has been decoded.
To solve this problem, we propose learning filler and word-fragment acoustic patterns as filler symbols and word-fragment symbols, respectively, and incorporating their detection into a WFST decoder for speech recognition, so that everything is handled in a single decoding pass. Since the proposed method treats each filler and word fragment as a single acoustic symbol, there is no need to register all their speech patterns in a language dictionary. With this method, fillers and word fragments can be detected in real time while the speech is simultaneously recognized in one pass without degrading accuracy. For fillers, the number of detections can be controlled with a confidence score based on the number of occurrences of filler symbols. In Chapter 5, we propose a method for recognizing emotions, which are paralinguistic and nonverbal information. At present, the accuracy of classifying 7 or 8 emotions is only 70 to 80%, even when the emotions are uttered intentionally, so performance improvement is desired. Emotional cues in speech appear in both short and long spans of the signal, and many efforts have therefore been made to improve emotion classification by incorporating features at various temporal resolutions. Conventional methods tried to improve performance with a single neural network spanning multiple temporal resolutions, but could not improve it significantly because emotional speech databases are small. We consider that emotion classification using high-level statistical functions (HSFs), which already show high accuracy, can be improved further by extracting and combining HSFs from windows of multiple temporal resolutions instead of a single fixed window length.
In this thesis, we aim to improve accuracy by extending the HSFs extracted from a single fixed window in existing methods to HSFs generated from more than 30 windows with different temporal resolutions. In addition, to reduce the number of parameters learned simultaneously from a small amount of data, stacking with Gradient Boosting Decision Trees (GBDT) is applied when combining features of multiple temporal resolutions. As a result, we obtained the highest emotion classification performance on the American emotional speech database. Moreover, although the method initially uses more than 30 temporal resolutions, analysis with GBDT shows that the same classification performance can be obtained with only 15 temporal-resolution features.
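The Chapter 2 idea described in the note above (a single network emitting per-frame speech/non-speech and attribute probabilities, with attributes decided from accumulated scores) can be sketched as follows. This is a minimal illustration, not the dissertation's actual model: the per-frame posteriors, threshold, and function names are all assumed for the example.

```python
import math

def identify_attribute(frames, speech_threshold=0.5):
    """Accumulate per-frame log-probabilities and decide male/female.

    frames: sequence of (p_speech, p_male, p_female) tuples, standing in
    for the per-frame outputs of a single neural network. Frames judged
    non-speech are skipped, so attribute identification proceeds
    sequentially without waiting for an endpoint detector.
    """
    log_male = 0.0
    log_female = 0.0
    speech_frames = 0
    for p_speech, p_male, p_female in frames:
        if p_speech >= speech_threshold:  # treat this frame as speech
            speech_frames += 1
            log_male += math.log(p_male)
            log_female += math.log(p_female)
    if speech_frames == 0:
        return "non-speech"
    return "male" if log_male > log_female else "female"

frames = [(0.9, 0.8, 0.2), (0.2, 0.5, 0.5), (0.95, 0.7, 0.3)]
print(identify_attribute(frames))  # "male"
```

Because the decision uses running sums, it can be re-evaluated after every frame, which is what allows the framework to answer before the utterance ends.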
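For Chapter 3, the note describes clustering phonemes by relative position and scoring each cluster for clarity. A toy sketch of that position-dependent idea, with made-up cluster names and thresholds standing in for the trained per-cluster discriminators:

```python
# Illustrative only: in the dissertation the per-cluster discriminators are
# learned models folded into a single discriminator; here each one is just
# a threshold on an assumed acoustic clarity score.
CLUSTER_THRESHOLDS = {"initial": 0.6, "medial": 0.5, "final": 0.4}

def position_cluster(index, length):
    """Assign a phoneme to a cluster from its relative position in a word."""
    if index == 0:
        return "initial"
    if index == length - 1:
        return "final"
    return "medial"

def intelligibility_scores(phoneme_scores):
    """For one word's phoneme clarity scores, return (cluster, is_clear)
    per phoneme, judging clarity against that cluster's threshold."""
    n = len(phoneme_scores)
    results = []
    for i, score in enumerate(phoneme_scores):
        cluster = position_cluster(i, n)
        results.append((cluster, score >= CLUSTER_THRESHOLDS[cluster]))
    return results

print(intelligibility_scores([0.7, 0.45, 0.3]))
# [('initial', True), ('medial', False), ('final', False)]
```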
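Chapter 4 treats fillers and word fragments as dedicated symbols inside the decoder. A hypothetical post-processing step over a one-pass decoder's token output illustrates the consequence: the symbols can be counted (for a count-based confidence, as the note mentions) and stripped in real time. The symbol names `<filler>` and `<frag>` are invented for this sketch.

```python
from collections import Counter

def strip_fillers(tokens, max_fillers=3):
    """Remove filler/fragment symbols from a recognized token sequence.

    Returns the cleaned word sequence, the filler count, and a simple
    count-based confidence flag (True when the count looks plausible).
    """
    counts = Counter(tokens)
    confidence_ok = counts["<filler>"] <= max_fillers
    words = [t for t in tokens if t not in ("<filler>", "<frag>")]
    return words, counts["<filler>"], confidence_ok

tokens = ["<filler>", "i", "<frag>", "went", "<filler>", "home"]
print(strip_fillers(tokens))  # (['i', 'went', 'home'], 2, True)
```

Because the symbols are part of the decoder's single pass, no second decoding pass over a confusion network is needed before this step can run.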
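Chapter 5 combines high-level statistical functions (HSFs) computed at many temporal resolutions, stacked with GBDT. The sketch below shows only the multi-resolution feature-extraction half, with illustrative window lengths and a small set of statistics; the dissertation uses 30+ resolutions and a learned GBDT combiner on top.

```python
import statistics

def hsf(values):
    """A few typical HSFs over one window of frame-level features."""
    return [statistics.fmean(values), statistics.pstdev(values),
            min(values), max(values)]

def multi_resolution_hsfs(frames, window_lengths=(2, 4, 8)):
    """Concatenate HSFs computed over non-overlapping windows of several
    lengths, averaging per-window HSFs so each resolution contributes a
    fixed-length vector regardless of utterance length."""
    features = []
    for w in window_lengths:
        windows = [frames[i:i + w] for i in range(0, len(frames) - w + 1, w)]
        per_window = [hsf(win) for win in windows]
        features.extend(statistics.fmean(col) for col in zip(*per_window))
    return features

frames = [0.1, 0.4, 0.2, 0.8, 0.5, 0.3, 0.9, 0.6]
vec = multi_resolution_hsfs(frames)
print(len(vec))  # 12 features: 4 HSFs x 3 window lengths
```

A GBDT stacker would then be trained on such concatenated vectors, which also yields the feature-importance analysis the note cites for pruning down to 15 resolutions.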
- DOI
- 10.15002/00025229
- Persistent ID (NDL)
- info:ndljp/pid/12302982
- Collection
- Collection (Materials For Handicapped People:1)
- Collection (particular)
- 国立国会図書館デジタルコレクション (NDL Digital Collections) > Digitized materials > Doctoral dissertations
- Acquisition Basis
- Doctoral dissertations (automatic collection)
- Date Accepted (W3CDTF)
- 2022-07-05T02:30:21+09:00
- Date Created (W3CDTF)
- 2022-06-16
- Format (IMT)
- PDF (application/pdf)
- Access Restrictions
- Available only within the National Diet Library (国立国会図書館内限定公開)
- Service for the Digitized Contents Transmission Service
- Not eligible for transmission to libraries or individuals (図書館・個人送信対象外)
- Availability of remote photoduplication service
- Available (可)
- Periodical Title (URI)
- Data Provider (Database)
- National Diet Library : 国立国会図書館デジタルコレクション (NDL Digital Collections)