GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 119-120, 2024
Obese and overweight individuals are at high risk for chronic diseases such as sleep apnea and diabetes. Tracking eating behavior is therefore necessary to determine the causes of obesity; however, following specific individuals and observing their eating behavior directly is time- and labor-intensive, so a method that automatically monitors eating behavior should be considered. As one such approach, we propose a method for conveniently recognizing food categories from food-intake sounds recorded by microphones (a below-the-ear microphone, a throat microphone, and an acoustic microphone), which is less burdensome to the body and preferable from the viewpoint of privacy protection. Furthermore, a comparison of MFB features and large-scale pre-trained speech models (wav2vec2.0, WavLM, and HuBERT) showed the effectiveness of large-scale pre-trained speech models for the food recognition task.
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 808-810, 2024
To enhance speaker verification for short utterances, we have developed a Same Speaker Identification Deep Neural Network (SSI-DNN), which identifies with greater accuracy whether two utterances of the same text were spoken by the same speaker. In this paper, we extend the detection target of the SSI-DNN from monosyllabic utterances to word utterances to improve speaker recognition performance. Experimental results showed that the SSI-DNN trained on word utterances achieved an EER of 0.1% to 2.8%, outperforming the x-vector-based method, a representative approach to speaker verification.
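The EER reported above can be estimated from genuine (same-speaker) and impostor score lists by sweeping decision thresholds. The following is a minimal sketch, not the paper's evaluation code; the threshold sweep and score values are illustrative assumptions.

```python
def eer(genuine, impostor):
    """Approximate equal error rate: find the threshold where the
    false rejection rate (FRR) and false acceptance rate (FAR) meet,
    and return their average there."""
    best = (1.0, 1.0)  # (|FAR - FRR|, candidate EER)
    for t in sorted(set(genuine + impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)
        far = sum(s >= t for s in impostor) / len(impostor)
        best = min(best, (abs(far - frr), (far + frr) / 2.0))
    return best[1]
```

With fully separated score distributions the EER is 0; overlapping distributions raise it accordingly.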
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 141-143, 2024
Hands-free control of shower settings such as temperature is highly desirable, enhancing user convenience when both hands are occupied or the eyes are closed. In this paper, we propose a speaker-dependent, template-based isolated word recognition system using pre-trained large speech models (LSMs) to realize voice-activated shower control with a single microphone. Specifically, we examine the performance of three LSMs (wav2vec2.0, HuBERT, and WavLM) as well as conventional MFCC features. Additionally, we investigate speech enhancement with a Convolutional Recurrent Neural Network (CRN) to improve robustness against shower noise. Our experiments on recognizing 30 words at SNRs ranging from -5 dB to 20 dB demonstrate that HuBERT achieves the highest recognition accuracy (77.8 to 95.6%). The CRN, on the other hand, improved recognition accuracy only under the -5 dB condition, and even then accuracy reached only 80.8%.
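Template-based isolated word recognition is commonly realized by dynamic time warping (DTW) over frame-level features. The paper does not specify its matching algorithm, so the following is a hedged sketch of one standard realization: each enrolled word has a feature-sequence template, and the input is assigned the word with the lowest DTW alignment cost.

```python
def dtw_cost(a, b):
    """DTW alignment cost between two sequences of feature vectors,
    using Euclidean frame distance and the standard three-way recursion."""
    INF = float("inf")
    n, m = len(a), len(b)
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dist = sum((x - y) ** 2 for x, y in zip(a[i - 1], b[j - 1])) ** 0.5
            d[i][j] = dist + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def recognize(utterance, templates):
    """templates: dict mapping word -> template feature sequence.
    Returns the word whose template aligns with the lowest cost."""
    return min(templates, key=lambda w: dtw_cost(utterance, templates[w]))
```

In practice the feature vectors would be MFCC frames or LSM embeddings rather than the toy scalars used in a quick check.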
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 805-807, 2024
Recent advances in AI technology have brought not only many benefits but also considerable risks arising from malicious use, a key example being spoofing attacks on speaker verification systems through speech synthesis and voice conversion. To tackle this challenge, we previously proposed a two-step matching method for robust speaker verification, in which a user specifies an emotion to the system in advance and is accepted only when speaking with the specified emotion. This method reduced the false acceptance rate but increased the false rejection rate. To overcome this problem, in this work we propose a novel method that integrates speaker and emotion verification scores. Experiments revealed that, by assigning optimal weights to the speaker and emotional information contained in the speech, the proposed method reduces the equal error rate compared with the conventional method.
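The integration step can be sketched as a weighted linear fusion of the two scores. This is a minimal illustration, not the paper's exact design: the weight alpha and the decision threshold are hypothetical tuning parameters.

```python
def fused_score(speaker_score, emotion_score, alpha=0.7):
    """Linear interpolation of speaker and emotion verification scores.
    alpha weights the speaker score; (1 - alpha) weights the emotion score."""
    return alpha * speaker_score + (1.0 - alpha) * emotion_score

def accept(speaker_score, emotion_score, threshold=0.5, alpha=0.7):
    """Accept the claimer only if the fused score clears the threshold."""
    return fused_score(speaker_score, emotion_score, alpha) >= threshold
```

The optimal alpha would be chosen on development data to minimize the equal error rate.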
IECON 2011: 37th Annual Conference of the IEEE Industrial Electronics Society, pp. 1300-1305, 2011
This paper presents a novel class-E-M power amplifier with low harmonic content and high output power, obtained by applying a symmetric configuration to the class-E-M topology. With the symmetric configuration, the proposed amplifier achieves extremely low total harmonic distortion and four times the output power of the conventional single class-E-M power amplifier. To achieve the class-E(M) ZVS/ZDVS/ZCS/ZDCS conditions, the MOSFET drain-to-source nonlinear parasitic capacitances, finite dc-feed inductance, equivalent series resistances of the inductors, and switch-on resistances are considered in the circuit design. A design example is presented along with PSpice-simulation and experimental waveforms at a 3.5 MHz operating frequency. The waveforms from the PSpice simulations and circuit experiments agreed quantitatively with the numerical predictions, validating the effectiveness of the proposed class-E-M power amplifier.
This paper introduces CENSREC-1-AV, a common evaluation framework for audio-visual multimodal speech recognition. CENSREC-1-AV provides an audio-visual speech database and a baseline recognition system. Speech signals were recorded in a clean condition for training, and in-car driving noise was overlapped for testing. Color and near-infrared video were captured, and image corruption simulating in-car driving conditions was applied to the test data using gamma correction. In the baseline system, acoustic MFCCs and either eigenface or optical-flow information are adopted as the audio and visual features, respectively, and multi-stream HMMs are used as the recognition model.
The Transactions of the Institute of Electrical Engineers of Japan C (Electronics, Information and Systems Society), 130(5), pp. 863-872, 2010
In this paper, a new framework is presented for removing mixed noise composed of impulse and Gaussian noise from images, combining the Fuzzy Impulse Noise Detection and Reduction Method (FINDRM) with a directional difference and the Bivariate Shrinkage Function (BSF) in the Dual-Tree Complex Wavelet Transform (DT-CWT) domain. First, the noise detection phase of the FINDRM determines whether each pixel is corrupted by impulse noise; detected impulse pixels are restored by the FINDRM with the directional difference. Second, Gaussian noise is removed using the BSF, which exploits the relationships between wavelet coefficients in the DT-CWT domain. Applying the proposed framework to an image corrupted by mixed noise yields a clean image.
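The Gaussian-noise stage shrinks each wavelet coefficient jointly with its parent coefficient. The following sketches the bivariate shrinkage rule of Sendur and Selesnick as commonly stated, applied per coefficient; the parameter names (sigma_n for the noise standard deviation, sigma for the local signal estimate) are illustrative, and this is not the paper's implementation.

```python
import math

def bivariate_shrink(y1, y2, sigma_n, sigma):
    """Shrink coefficient y1 using its parent y2:
    w1 = max(sqrt(y1^2 + y2^2) - sqrt(3)*sigma_n^2/sigma, 0)
         / sqrt(y1^2 + y2^2) * y1."""
    mag = math.sqrt(y1 * y1 + y2 * y2)
    if mag == 0.0:
        return 0.0  # nothing to shrink
    gain = max(mag - math.sqrt(3.0) * sigma_n * sigma_n / sigma, 0.0) / mag
    return gain * y1
```

Small coefficients (likely noise) are driven to zero, while large coefficients pass through nearly unchanged.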
Since natural language processing (NLP) operates on semantically coherent units such as sentences, sentence boundaries must be indicated in automatic speech recognition (ASR) output. In this paper, we first propose a method of constructing a feature space suited to sentence boundary detection with a support vector machine (SVM) by considering how likely a word is to appear immediately before a sentence boundary. Second, we examine the use of the word confidence measures output by the recognizer as features for sentence boundary detection. We evaluated our methods on the Corpus of Spontaneous Japanese (CSJ).
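The feature construction can be sketched as follows. This is a hedged illustration, not the paper's exact feature design: the feature names, the boundary-frequency table, and the confidence list are all assumptions introduced for the example.

```python
def boundary_features(words, i, boundary_freq, confidences):
    """Features for the candidate sentence boundary after words[i]:
    the surrounding words, the relative frequency with which the
    preceding word appears immediately before a boundary in training
    data, and the ASR confidence of the preceding word."""
    prev_word = words[i]
    next_word = words[i + 1] if i + 1 < len(words) else "</s>"
    return {
        "prev=" + prev_word: 1.0,
        "next=" + next_word: 1.0,
        # how often prev_word occurs right before a boundary (training estimate)
        "boundary_freq": boundary_freq.get(prev_word, 0.0),
        # ASR word confidence of the word before the candidate boundary
        "confidence": confidences[i],
    }
```

Such dictionaries would then be vectorized and fed to an SVM classifier that labels each candidate position as boundary or non-boundary.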
In this paper, we introduce a speaker verification method that accepts or rejects a claimer based on the rank of the claimed speaker's likelihood among a large number of registered speaker models (GMMs), instead of score normalization such as T-norm or Z-norm. The method is more accurate than the standard T-norm but, like T-norm, requires computing likelihoods against many cohort models. We therefore also propose a cohort selection method that lowers the verification cost: cohort speakers are selected for each target speaker in the training stage based on likelihood ranks. This data-driven approach significantly reduces computation time, yielding faster verification decisions. We conducted text-independent speaker verification experiments using the air-conducted speech of 283 Japanese males from the large-scale speaker recognition evaluation corpus constructed by the National Research Institute of Police Science. The proposed method achieved a minDCF of 0.0098 with an average of 57 cohort speakers and 0.0094 with 101, comparable to the 0.0092 obtained with rank statistics over all 282 cohort speakers, whereas T-norm obtained a minDCF of 0.0154.
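The rank-based decision rule can be sketched as follows. This is a minimal illustration under assumed names: the claimer is accepted only if the claimed model's likelihood ranks within the top max_rank among the cohort models; the paper's actual acceptance criterion on the rank statistic may differ.

```python
def rank_of_claimed(claimed_loglik, cohort_logliks):
    """1-based rank of the claimed speaker model's log-likelihood
    among the cohort model log-likelihoods for the same utterance."""
    return 1 + sum(1 for l in cohort_logliks if l > claimed_loglik)

def verify(claimed_loglik, cohort_logliks, max_rank=1):
    """Accept if the claimed model ranks within the top max_rank."""
    return rank_of_claimed(claimed_loglik, cohort_logliks) <= max_rank
```

The proposed cohort selection shrinks cohort_logliks from hundreds of models to a few dozen per target speaker, which is where the computation saving comes from.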
We have been developing an accompaniment system that uses the soloist's breath as a musical cue for controlling the accompaniment. In our previous study, we introduced a method that uses breath cues at the beginning of a musical piece. In this study, we propose a method that uses breath cues not only at the beginning but also during a piece, implement it, and evaluate it with human soloists. The results suggest that the new system achieves better synchronization between the soloist and the system than the previous one, and that the performers who used the system preferred the new system in subjective evaluation.
In this research, we analyzed overlap phenomena at turn-taking points in Japanese Sign Language dialogue. Spontaneous dialogue data were recorded in an environment where the participants could see each other via prompters, and three dialogues by six native signers were used for the analysis. First, it was shown that overlaps at turn-taking points occurred with very high frequency (75%). Second, we analyzed these phenomena based on the turn-taking system for conversation of H. Sacks, E. A. Schegloff, and G. Jefferson, and found situations where the speaker (signer) continued his or her utterance after the transition-relevance place (TRP) while the next speaker started a turn by recognizing or projecting the TRP, producing the overlap. We consider these types of overlap to be normal turn-taking. Finally, there were a few cases (18%) in which the turn-taking rule was broken, while the other cases followed the rule.
In recent years, a great deal of research has been done on systems that automatically transcribe speech. Such research has generally focused on transcribing utterances correctly; what is also required, however, is a transcription style that makes the overall content of a discussion easier for readers to understand. Linguistic information alone is insufficient for conveying the content of a discussion accurately; information such as the setting, the speaker's intentions, and emotions is also needed. In this study, we analyzed impressions of utterances from both text and speech, with the aim of annotating transcriptions of meetings and discussions with utterance intentions. We first investigated, through subjective evaluation, how strongly impressions such as "doubt" and "surprise" are conveyed by text decorations such as changes in character weight and size and by symbols such as "!" and "?". We then conducted a similar subjective evaluation on speech and analyzed the relation between prosodic parameters such as F0 and power and utterance impressions using multiple linear regression. As a result, the relationships between each text variation, each prosodic parameter, and the corresponding utterance impressions were clarified. A combined analysis further revealed that some impressions are perceived differently from text and speech, while others show the same tendency.
The Transactions of the Institute of Electrical Engineers of Japan C (Electronics, Information and Systems Society), 129(10), pp. 1902-1907, 2009
To establish a universal communication environment, computer systems should recognize various modal communication languages. Conventional sign language recognition is performed at the word level using gesture information on hand shape and movement, with each feature given the same weight when calculating the recognition probability. We consider hand position to be very important for sign language recognition, since the meaning of a word differs according to hand position. In this study, we propose a sign language recognition method using a multi-stream HMM technique to show the importance of position and movement information. We conducted recognition experiments using 28,200 sign language word data. As a result, 82.1% recognition accuracy was obtained with the appropriate weights (position:movement = 0.2:0.8), while 77.8% was obtained with equal weights. This demonstrates that movement must be weighted more heavily than position in sign language recognition.
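The stream weighting at the core of the method can be sketched as a weighted combination of per-stream log-likelihoods. This is a minimal illustration of the multi-stream HMM observation score, with illustrative log-likelihood values; the paper's HMM topology and training details are not reproduced here.

```python
def combined_loglik(loglik_pos, loglik_mov, w_pos=0.2, w_mov=0.8):
    """Multi-stream observation score:
    log p(o|state) = w_pos * log p_pos(o) + w_mov * log p_mov(o),
    with the reported best weights position:movement = 0.2:0.8."""
    return w_pos * loglik_pos + w_mov * loglik_mov
```

With equal weights (0.5:0.5) both streams contribute equally, which corresponds to the 77.8% baseline; shifting weight toward the movement stream yields the 82.1% result.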