GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 119-120 2024
Obese and overweight individuals are at high risk for chronic diseases such as sleep apnea and diabetes. Tracking eating behavior is therefore necessary to determine the causes of obesity; however, following specific individuals and observing their eating behavior is time- and labor-intensive, so a method for monitoring eating behavior automatically is needed. As one such approach, we propose a method for recognizing the food category from food intake sounds recorded by microphones (a below-the-ear microphone, a throat microphone, and an acoustic microphone), which places little burden on the body and is preferable from the viewpoint of privacy protection. Furthermore, a comparison of MFB features with large-scale pre-trained speech models (wav2vec 2.0, WavLM, and HuBERT) showed the effectiveness of large-scale pre-trained speech models in the food recognition task.
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 808-810 2024
To enhance speaker verification for short utterances, we have developed a Same Speaker Identification Deep Neural Network (SSI-DNN). This network determines whether two utterances were spoken by the same speaker, and achieves greater accuracy by focusing on utterances of the same text. In this paper, we extend the detection target of the SSI-DNN from monosyllabic utterances to word utterances to improve speaker recognition performance. Experimental results showed that the SSI-DNN trained on word utterances achieved an equal error rate (EER) of 0.1% to 2.8%. These results indicate that the SSI-DNN outperformed the x-vector-based method, a representative speaker verification method.
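The EER reported above is the operating point at which the false acceptance rate and the false rejection rate are equal. The following is a minimal sketch of how an EER can be approximated from trial scores; the function name, the score lists, and the convention that a higher score means "more likely the same speaker" are illustrative assumptions and are not taken from the paper.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from same-speaker and different-speaker scores.

    Assumption: a higher score means "more likely the same speaker".
    """
    best_gap, eer = float("inf"), None
    # Sweep every observed score as a candidate decision threshold.
    for t in sorted(set(genuine) | set(impostor)):
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejections
        far = sum(s >= t for s in impostor) / len(impostor)  # false acceptances
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores, for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With a finite set of trials the two error rates rarely cross exactly, so the sketch reports their average at the threshold where they are closest.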
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 141-143 2024
Hands-free control of shower settings, such as temperature, is highly desirable because it improves convenience when the user's hands are occupied or eyes are closed. In this paper, we propose a speaker-dependent, template-based isolated word recognition system using pre-trained large speech models (LSMs) to realize voice-activated shower control with a single microphone. Specifically, we examine the performance of three LSMs (wav2vec 2.0, HuBERT, and WavLM) as well as conventional MFCCs as features. Additionally, we investigate speech enhancement using a Convolutional Recurrent Neural Network (CRN) to improve robustness against shower noise. Our experiments on recognizing 30 words at SNRs ranging from -5 dB to 20 dB demonstrate that HuBERT achieves the highest recognition accuracy (77.8% to 95.6%). The CRN, on the other hand, improved recognition accuracy only in the -5 dB condition, and even then the accuracy was only 80.8%.
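Template-based isolated word recognition compares the feature sequence of an input utterance with stored reference sequences and returns the label of the closest one. The sketch below uses dynamic time warping (DTW) with a Euclidean frame distance as the matching rule; the abstract does not state which matching method the paper uses, so DTW, the function names, and the toy one-dimensional "features" are all assumptions made for illustration.

```python
import math


def dtw_distance(a, b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            # Extend the cheapest of the three allowed warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the template closest to `features` under DTW."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))


# Hypothetical templates: one feature sequence per word for one speaker.
templates = {
    "hot": [[0.0], [1.0], [2.0]],
    "cold": [[5.0], [5.0], [4.0]],
}
query = [[0.0], [0.0], [1.0], [2.0]]  # a slower rendition of "hot"
print(recognize(query, templates))
```

In practice each frame would be a feature vector such as an MFCC or an LSM embedding rather than a single number, and several templates per word could be stored.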
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 805-807 2024
Recent advances in AI technology have brought not only many benefits but also considerable risks from malicious use of the technology. One key example is spoofing attacks on speaker verification systems through speech synthesis and voice conversion. To tackle this challenge, we previously proposed a two-step matching method for robust speaker verification, in which a user specifies an emotion to the system in advance and is accepted only when speaking with the specified emotion. That method reduced the false acceptance rate but increased the false rejection rate. To overcome this problem, in this work we propose a novel method that integrates the speaker verification and emotion verification scores. Experiments revealed that, by assigning optimal weights to the speaker and emotion information contained in the speech, the proposed method reduces the equal error rate compared with the conventional method.
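One common way to integrate two verification scores is a weighted sum followed by a single accept/reject threshold. The sketch below illustrates that idea only; the abstract does not give the fusion formula, so the linear form, the function names, and the example weight and threshold are assumptions.

```python
def fuse_scores(speaker_score, emotion_score, weight):
    """Weighted sum of the two scores (assumed linear fusion).

    `weight` is the share given to the speaker score, in [0, 1].
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * speaker_score + (1.0 - weight) * emotion_score


def accept(speaker_score, emotion_score, weight, threshold):
    """Accept the claimed identity when the fused score reaches the threshold."""
    return fuse_scores(speaker_score, emotion_score, weight) >= threshold


# Hypothetical scores: a good speaker match with a weak emotion match.
print(accept(0.8, 0.4, weight=0.75, threshold=0.5))
```

Unlike a two-step decision, in which failing either check rejects the user, a fused score lets a strong speaker match compensate for a weaker emotion match, which is one way the false rejection rate can fall.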
In this paper, we propose a continuous sign language recognition method for sign sentences containing words that share the same arm motion but differ in hand shape. We previously proposed a method that recognizes signs from arm motion alone by tracking the wrist coordinates in three dimensions and applying a hidden Markov model (HMM). However, that method could not distinguish words with the same arm motion and different hand shapes. In this study, hand shape images are captured during wrist tracking at the moments when the arm comes to rest and when the direction of arm movement changes significantly. These images are classified with a support vector machine (SVM) to identify sign words that share an arm motion. For evaluation, we extracted groups of words with the same arm motion and different hand shapes from a dictionary and conducted recognition experiments. The recognition accuracy for words with different hand shapes was about 80%.
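The frame selection described above (capture the hand shape when the arm stops or when its movement direction changes sharply) can be sketched from a tracked wrist trajectory as follows. The function name, the per-frame displacement used as a speed measure, and both threshold values are illustrative assumptions; the abstract does not specify how the two events are detected.

```python
import math


def key_frames(wrist, stop_speed=0.01, turn_deg=60.0):
    """Indices of frames where the wrist stops or turns sharply.

    wrist: list of (x, y, z) positions, one per frame.
    stop_speed, turn_deg: hypothetical thresholds, not values from the paper.
    """
    frames = []
    for i in range(1, len(wrist) - 1):
        # Displacement into and out of frame i.
        v_in = [b - a for a, b in zip(wrist[i - 1], wrist[i])]
        v_out = [b - a for a, b in zip(wrist[i], wrist[i + 1])]
        s_in, s_out = math.hypot(*v_in), math.hypot(*v_out)
        if s_in >= stop_speed and s_out < stop_speed:
            frames.append(i)  # the arm was moving and has just come to rest
        elif s_in >= stop_speed and s_out >= stop_speed:
            cos = sum(p * q for p, q in zip(v_in, v_out)) / (s_in * s_out)
            angle = math.degrees(math.acos(max(-1.0, min(1.0, cos))))
            if angle > turn_deg:
                frames.append(i)  # the movement direction changed sharply
    return frames


# Hypothetical trajectory: move along x, turn 90 degrees to y, then stop.
trajectory = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (2, 1, 0), (2, 2, 0), (2, 2, 0), (2, 2, 0)]
print(key_frames(trajectory))
```

The hand shape images at the returned frame indices would then be passed to the hand shape classifier.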
For an automatic accompaniment system to accompany in a human-like way, the way human accompanists control their accompaniment must be clarified. In our previous study, by analyzing ensembles between human performers, we proposed a method that predicts the duration of the next beat from the history of two parameters: "the time difference between the soloist and the accompanist" and "the tempo modification of the accompanist". However, the score analyzed in that study was a simple etude, so the effectiveness of the method for real-world ensemble music was unclear. In this study, we therefore analyze ensemble recordings by virtuosi to investigate the effectiveness of these two parameters. The analysis showed that the parameters are effective when there is little expressive deviation and the tempo is stable, but that prediction with them is difficult when the virtuosi perform expressively with tempo rubato.
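To make the role of the two parameters concrete, the sketch below predicts the next beat duration from the most recent soloist-accompanist time difference and the most recent change in the accompanist's beat duration. The abstract does not give the prediction model, so the linear form, the use of only the latest history values, the sign convention, and the coefficients `a` and `b` are all assumptions for illustration.

```python
def predict_next_beat(durations, lags, a=0.5, b=0.5):
    """Predict the accompanist's next beat duration in seconds.

    durations: the accompanist's past beat durations (at least two values).
    lags: soloist onset minus accompanist onset per beat; a positive value
          means the soloist played later than the accompanist.
    a, b: hypothetical coefficients, not values from the study.
    """
    change = durations[-1] - durations[-2]  # recent tempo modification
    # Lengthen the beat when the soloist is behind or the tempo is slowing.
    return durations[-1] + a * lags[-1] + b * change


# Hypothetical history: the accompanist slowed slightly, the soloist is 40 ms late.
print(predict_next_beat([0.50, 0.52], [0.0, 0.04]))
```

A predictor of this kind tracks gradual tempo changes, which fits the finding that prediction works when the tempo is stable and becomes difficult under tempo rubato.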