GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 119-120, 2024
Obese and overweight individuals are at high risk for chronic diseases such as sleep apnea and diabetes. Tracking eating behavior is therefore necessary to determine the causes of obesity; however, following specific individuals and observing how they eat is time- and labor-intensive, so eating behavior should be monitored automatically. As one such approach, we propose a method that recognizes the food category from food intake sounds recorded by microphones (a below-the-ear microphone, a throat microphone, and an acoustic microphone), which places little burden on the body and is preferable from the viewpoint of privacy protection. Furthermore, a comparison of MFB with large-scale pre-trained speech models (wav2vec2.0, WavLM, and HuBERT) showed the effectiveness of the pre-trained models in the food recognition task.
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 808-810, 2024
To enhance speaker verification for short utterances, we have developed a Same Speaker Identification Deep Neural Network (SSI-DNN). By focusing on utterances of the same text, this network determines more accurately whether two utterances were spoken by the same speaker. In this paper, we extend the detection target of the SSI-DNN from monosyllabic utterances to word utterances to improve speaker recognition performance. Experimental results showed that the SSI-DNN trained on word utterances achieved equal error rates (EERs) of 0.1% to 2.8%. These results indicate that the SSI-DNN outperformed the x-vector-based method, a representative speaker verification method.
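The EER quoted above is the operating point at which the false acceptance rate equals the false rejection rate. As a minimal sketch of how that figure can be computed from trial scores, the example below sweeps a threshold over toy scores. The score values, the function name, and the convention that higher scores mean "same speaker" are assumptions for illustration and are not taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Return (eer, threshold) for similarity scores (higher = more likely same speaker)."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        # False rejection: same-speaker trials scoring below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance: different-speaker trials scoring at or above it.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # Keep the threshold where the two error rates are closest.
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical): same-speaker and different-speaker trials.
genuine_scores = [0.9, 0.8, 0.75, 0.6, 0.4]
impostor_scores = [0.7, 0.5, 0.3, 0.2, 0.1]

eer, threshold = equal_error_rate(genuine_scores, impostor_scores)
print(f"EER = {eer:.1%} at threshold {threshold}")
```

With real evaluation data the two error curves rarely cross exactly at an observed score, so the sketch reports the midpoint of the two rates at the closest threshold.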
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 141-143, 2024
Hands-free control of shower settings, such as temperature, is highly desirable because it improves convenience when the user's hands are occupied or eyes are closed. In this paper, we propose a speaker-dependent, template-based isolated word recognition system that uses pre-trained large speech models (LSMs) to realize voice-activated shower control with a single microphone. Specifically, we examine three LSMs (wav2vec2.0, HuBERT, and WavLM) as well as conventional MFCCs as features. Additionally, we investigate speech enhancement using a Convolutional Recurrent Neural Network (CRN) to improve robustness against shower noise. Our experiments on recognizing 30 words at SNRs ranging from -5 dB to 20 dB demonstrate that HuBERT achieves the highest recognition accuracy (77.8% to 95.6%). The CRN, on the other hand, improved recognition accuracy only under the -5 dB condition, and even there the accuracy reached only 80.8%.
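Template-based isolated word recognition compares an input utterance against stored per-word templates and returns the closest one. The abstract does not say how the templates are compared, so the sketch below assumes dynamic time warping (DTW) with a Euclidean frame distance. The two-dimensional frames stand in for LSM or MFCC features, and the word labels are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Length-normalized DTW cost between two sequences of feature frames."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean distance between frames
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def recognize(utterance, templates):
    """Return the word whose template has the smallest DTW distance to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# One enrolled template per word for a single speaker (toy values).
templates = {
    "hotter": [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    "colder": [(0.0, 0.0), (0.0, 1.0), (0.0, 2.0)],
}
# A slightly longer, slightly perturbed rendition of the first word.
query = [(0.0, 0.1), (0.9, 0.0), (1.1, 0.1), (2.0, 0.1)]

result = recognize(query, templates)
print(result)
```

DTW is a common choice here because the same word is rarely spoken at the same speed twice; any other sequence distance could be substituted in `recognize`.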
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics, pp. 805-807, 2024
Recent advances in AI technology have brought not only many benefits but also considerable risks through malicious use. One key example is the spoofing of speaker verification systems through speech synthesis and voice conversion. To tackle this challenge, we previously proposed a two-step matching method for robust speaker verification, in which a user specifies an emotion to the system in advance and is accepted only when speaking with the specified emotion. This previous method reduced the false acceptance rate but increased the false rejection rate. To overcome this problem, we propose a novel method that integrates speaker and emotion verification scores. Experiments revealed that, by assigning the optimal weight to the speaker and emotional information contained in the speech, the proposed method reduces the equal error rate compared with the conventional method.
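The abstract does not give the exact integration rule, so the following is only a sketch under the assumption that the two scores are combined by a linear weighted sum and compared with a single threshold. The function names, the weight of 0.7, and the threshold of 0.5 are hypothetical; in practice the weight would be tuned on development data.

```python
def fuse_scores(speaker_score, emotion_score, weight):
    """Weighted sum of the two verification scores (assumed fusion rule)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return weight * speaker_score + (1.0 - weight) * emotion_score


def accept(speaker_score, emotion_score, weight=0.7, threshold=0.5):
    """Accept the claim when the fused score reaches the threshold."""
    return fuse_scores(speaker_score, emotion_score, weight) >= threshold


# Genuine user whose emotion score is mediocre: still accepted after fusion,
# whereas a hard second-stage emotion check could have rejected this trial.
genuine_ok = accept(speaker_score=0.9, emotion_score=0.4)

# Spoofed voice with a passable speaker score but the wrong emotion: rejected.
spoof_ok = accept(speaker_score=0.6, emotion_score=0.1)

print(genuine_ok, spoof_ok)
```

A soft combination of this kind lets a strong speaker score compensate for a borderline emotion score, which is one way a fused system could avoid the extra false rejections of a strict two-step decision.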
This paper reports the current status of the working group on the evaluation of noisy speech recognition, established in October 2001 within the special interest group on spoken language processing (SLP). The working group aims to develop evaluation standards and a common corpus for noisy speech recognition, and to develop noisy speech recognition algorithms in conjunction with the European ETSI AURORA project for developing and evaluating noisy speech recognition algorithms.
With the spread of mobile phones, wireless mobile environments based on handheld terminals are rapidly becoming widespread. Because handheld terminals are generally very small, operating them through their built-in input devices is difficult. Voice operation is one way to solve this problem. However, the memory and CPU of a handheld terminal are not yet sufficient to carry out all of the processing required for medium- and large-vocabulary speech recognition. Distributed Speech Recognition (DSR) was therefore proposed: the terminal performs acoustic analysis and compresses the feature parameters, transmits them to a server, and the server reconstructs the feature parameters and performs speech recognition. DSR requires a common data format between the terminal and the server, and standardization is currently under way at the European Telecommunications Standards Institute (ETSI). This paper reports an evaluation of the ETSI standard DSR front-end through continuous speech recognition experiments on a Japanese speech corpus. Because the front-end compresses feature parameters by vector quantization (VQ), differences in the frequency characteristics of the input system are likely to increase VQ distortion and lower recognition accuracy. We therefore propose the Bias Removal Method (BRM), which reduces the distortion between the feature vectors and the VQ codebook. Experimental results show that using non-quantized features in acoustic model training improves the recognition performance of DSR front-end features, and that the proposed method reduces the degradation in recognition accuracy caused by differences in frequency characteristics.
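The abstract names the Bias Removal Method but does not describe its steps, so the sketch below is not the paper's algorithm. It only illustrates the general idea under stated assumptions: a mismatch in input characteristics is modeled as a constant offset on every feature vector, the offset is estimated as the mean quantization residual, and subtracting it lowers the VQ distortion. The codebook, the frames, and all function names are hypothetical.

```python
def squared_error(x, c):
    """Squared Euclidean distance between a frame and a codeword."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def nearest(codebook, x):
    """Codeword closest to frame x."""
    return min(codebook, key=lambda c: squared_error(x, c))


def mean_distortion(codebook, frames):
    """Average squared quantization error over all frames."""
    return sum(squared_error(x, nearest(codebook, x)) for x in frames) / len(frames)


def estimate_bias(codebook, frames):
    """Mean residual between each frame and its nearest codeword."""
    dim = len(frames[0])
    sums = [0.0] * dim
    for x in frames:
        c = nearest(codebook, x)
        for k in range(dim):
            sums[k] += x[k] - c[k]
    return [s / len(frames) for s in sums]


def remove_bias(frames, bias):
    """Subtract the estimated bias from every frame."""
    return [tuple(xi - bi for xi, bi in zip(x, bias)) for x in frames]


# Toy 2-D codebook and frames that match it exactly before the mismatch.
codebook = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]
clean = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0), (4.0, 4.0)]

# Assumed mismatch: the same constant offset added to every frame.
offset = (1.0, -1.0)
shifted = [(x + offset[0], y + offset[1]) for x, y in clean]

before = mean_distortion(codebook, shifted)
bias = estimate_bias(codebook, shifted)
after = mean_distortion(codebook, remove_bias(shifted, bias))
print(before, bias, after)
```

In this toy case the offset is small enough that every frame still maps to its original codeword, so one estimation pass recovers it exactly; with a larger offset the estimate would be less reliable.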
The Vector Space Model (VSM) is a popular information retrieval model that represents a document collection by a term-by-document matrix, in which each document is a vector of term frequencies. Because such matrices are usually high-dimensional and sparse, storing and processing them requires enormous computing resources and lengthens retrieval time; meaningless words in the documents also act as noise that lowers retrieval accuracy and makes it difficult to capture the underlying semantic structure. Dimensionality reduction is a way to overcome these problems. Latent Semantic Indexing (LSI), which computes similarity in a space whose dimensionality has been reduced by Singular Value Decomposition (SVD), has been proposed and reported to be effective. Principal Component Analysis (PCA) and SVD are popular matrix-decomposition techniques for dimensionality reduction, but they consume a large amount of computation. In the work described here, we apply Simple Principal Component Analysis (SPCA), a fast data-oriented method that approximates PCA with less computation than SVD, to dimensionality reduction of the vector space model. Retrieval experiments on the MEDLINE collection showed that SPCA achieved retrieval performance equal to or better than that of SVD and a significant improvement over the conventional vector space model.
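The abstract does not spell out the SPCA update rule, so the sketch below uses plain power iteration as a stand-in. It shows how a leading principal direction of document vectors can be found by repeated passes over the data, without forming or decomposing a full matrix. The toy term-frequency vectors and function names are hypothetical, and the code should not be read as the SPCA algorithm used in the paper.

```python
import math


def leading_component(vectors, iterations=100):
    """Approximate the first principal direction by power iteration."""
    n, dim = len(vectors), len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    centered = [[v[k] - mean[k] for k in range(dim)] for v in vectors]
    a = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit-length starting direction
    for _ in range(iterations):
        # new_a = sum_i (a . x_i) x_i: one multiplication by the scatter matrix,
        # computed directly from the data vectors.
        new_a = [0.0] * dim
        for x in centered:
            proj = sum(ai * xi for ai, xi in zip(a, x))
            for k in range(dim):
                new_a[k] += proj * x[k]
        norm = math.sqrt(sum(c * c for c in new_a))
        if norm == 0.0:
            break  # the data has no variance along the current direction
        a = [c / norm for c in new_a]
    return mean, a


def project(vector, mean, component):
    """Coordinate of a document vector along the principal direction."""
    return sum((v - m) * c for v, m, c in zip(vector, mean, component))


# Toy term-frequency vectors: the first two terms always co-occur,
# and the third term never appears.
docs = [(1.0, 1.0, 0.0), (2.0, 2.0, 0.0), (3.0, 3.0, 0.0), (4.0, 4.0, 0.0)]

mean, component = leading_component(docs)
coordinate = project(docs[3], mean, component)
print(component, coordinate)
```

Further components could be obtained by removing the found direction from the centered data and repeating, after which documents and queries are compared by their low-dimensional coordinates instead of their full term vectors.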