GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 119-120 2024
Obese and overweight individuals are at high risk for chronic diseases such as sleep apnea and diabetes. Tracking eating behavior is therefore necessary to determine the causes of obesity; however, following the lives of specific individuals and observing their eating behavior is time- and labor-intensive, so a method to monitor eating behavior automatically should be considered. As one such approach, we propose a method for conveniently recognizing food categories from food intake sounds recorded by microphones (a below-the-ear microphone, a throat microphone, and an acoustic microphone), which is less burdensome to the body and preferable from the viewpoint of privacy protection. Furthermore, a comparison of MFB features and large-scale pre-trained speech models (wav2vec2.0, WavLM, and HuBERT) showed the effectiveness of the large-scale pre-trained speech models in the food recognition task.
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 808-810 2024
To enhance speaker verification for short utterances, we have developed a Same Speaker Identification Deep Neural Network (SSI-DNN). This network identifies whether two utterances were spoken by the same speaker, achieving higher accuracy by focusing on utterances of identical texts. In this paper, we extend the detection target of the SSI-DNN from monosyllabic utterances to word utterances to improve speaker recognition performance. Experimental results showed that the SSI-DNN trained on word utterances achieved an EER of 0.1% to 2.8%. These results indicate that the SSI-DNN outperforms the x-vector-based method, a representative approach to speaker verification.
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 141-143 2024
Hands-free control of shower settings, such as temperature, is highly desirable, enhancing user convenience when both hands are occupied or the eyes are closed. In this paper, we propose a speaker-dependent, template-based isolated word recognition system using pre-trained large speech models (LSMs) to realize voice-activated shower control with a single microphone. Specifically, we examine the performance of three LSMs (wav2vec2.0, HuBERT, and WavLM) as well as conventional MFCCs as features. Additionally, we investigate speech enhancement using a Convolutional Recurrent Neural Network (CRN) to improve robustness against shower noise. Our experiments on recognizing 30 words at SNRs ranging from -5 dB to 20 dB demonstrate that HuBERT achieves the highest recognition accuracy (77.8% to 95.6%). The CRN, on the other hand, improved recognition accuracy only under the -5 dB condition, and even then the accuracy was only 80.8%.
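Template-based isolated word recognition of this kind can be illustrated with a classic dynamic-time-warping (DTW) sketch. This is illustrative only: the frame vectors in the paper would be MFCC or LSM embedding frames, and the function names here are our own.

```python
import math

# Illustrative DTW-based template matching for isolated word recognition.
# Each utterance is a sequence of frame vectors (plain lists of floats here;
# MFCC or LSM-embedding frames in a real system -- an assumption).

def dtw_distance(a, b):
    """Cumulative DTW alignment cost between frame sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])     # local frame distance
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def recognize(utterance, templates):
    """Return the word whose stored template has the lowest DTW cost."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
```

For example, with one enrolled template per word, an incoming utterance is labeled with the word of the closest template under the DTW alignment.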
GCCE 2024 - 2024 IEEE 13th Global Conference on Consumer Electronics 805-807 2024
Recent advances in AI technology have brought not only many benefits but also considerable risks from malicious use of the technology. One key example is spoofing of speaker verification systems through speech synthesis and voice conversion. To tackle this challenge, we previously proposed a two-step matching method for robust speaker verification, in which a user specifies an emotion to the system in advance and is accepted only when speaking with the specified emotion. This method reduced the false acceptance rate but increased the false rejection rate. To overcome this problem, in this work we propose a novel method that integrates speaker and emotion verification scores. Experiments revealed that, by assigning optimal weights to the speaker and emotional information contained in the speech, the proposed method reduces the equal error rate compared with the conventional method.
Smart Innovation, Systems and Technologies 98 103-109 2019
Time session variability between the enrollment data and the recognition data degrades speaker recognition performance, making it one of the most important issues in speaker recognition technology. In this paper, we propose a speaker recognition method that is robust to time session variability. The proposed method estimates a time-session-variability subspace and then carries out speaker recognition in the orthogonal complement of that subspace. In addition, we incorporate linear discriminant analysis into the proposed method. To evaluate the proposed method, we conducted a speaker identification experiment. Experimental results show that the proposed method improves speaker identification performance over the baseline.
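The projection into the orthogonal complement can be sketched as follows. Here the variability subspace is estimated by PCA over same-speaker cross-session difference vectors; that estimation procedure and all data shapes are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

# Sketch of time-session-variability compensation. Rows of D are difference
# vectors between features of the same speaker recorded in different
# sessions (hypothetical data); their principal directions span the
# estimated session-variability subspace.

def session_subspace(D, k):
    """Top-k principal directions of the centered difference matrix D."""
    D = D - D.mean(axis=0)
    # PCA via SVD: rows of vt are orthonormal principal directions
    _, _, vt = np.linalg.svd(D, full_matrices=False)
    return vt[:k].T                     # shape (dim, k), orthonormal columns

def remove_session_variability(x, V):
    """Project x onto the orthogonal complement of the subspace span(V)."""
    return x - V @ (V.T @ x)
```

Recognition then scores the projected features, so components lying in the estimated session-variability directions no longer influence the decision.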
COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2016 6898031:1-6898031:9 2016 Refereed
Eye motion-based human-machine interfaces are used to provide a means of communication for those who can move nothing but their eyes because of injury or disease. To detect eye motions, electrooculography (EOG) is used. For efficient communication, the input speed is critical. However, it is difficult for conventional EOG recognition methods to accurately recognize fast, sequentially input eye motions because adjacent eye motions influence each other. In this paper, we propose a context-dependent hidden Markov model (HMM)-based EOG modeling approach that uses separate models for identical eye motions with different contexts. Because the influence of adjacent eye motions is explicitly modeled, higher recognition accuracy is achieved. Additionally, we propose a method of user adaptation based on a user-independent EOG model to investigate the trade-off between recognition accuracy and the amount of user-dependent data required for HMM training. Experimental results show that when the proposed context-dependent HMMs are used, the character error rate (CER) is significantly reduced compared with the conventional baseline under user-dependent conditions, from 36.0% to 1.3%. Although the CER increases again to 17.3% when the context-dependent but user-independent HMMs are used, it can be reduced to 7.3% by applying the proposed user adaptation method.
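The context-dependent units can be illustrated with a triphone-style expansion, familiar from speech recognition: each eye motion gets a separate model per left/right neighbor. The unit notation and the boundary symbol below are our own assumptions.

```python
# Illustrative expansion of an eye-motion sequence into context-dependent
# units, analogous to triphones: "left-center+right" means motion "center"
# preceded by "left" and followed by "right", so the influence of adjacent
# eye motions can be captured by separate HMMs.

def to_context_dependent(motions, boundary="sil"):
    """Map a motion sequence to context-dependent unit labels."""
    padded = [boundary] + list(motions) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]
```

Training then estimates one HMM per such unit rather than one per raw motion, at the cost of needing more training data per user, which is what motivates the adaptation method above.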
International Journal of Biometrics 7(2) 83-96 2015 Refereed
GMM-UBM supervectors can lead to poor speaker models for verification due to inter-session variability, especially when only a small amount of training utterances is available. In this study, we propose a phoneme-dependent method to suppress the inter-session variability. A speaker's model is represented by several phoneme-specific Gaussian mixture models, each covering an individual phoneme whose inter-session variability is constrained in a session-independent subspace constructed by principal component analysis (PCA), using a corpus uttered by a single speaker and recorded over a long period. SVM-based experiments were performed using a large corpus constructed by the National Research Institute of Police Science (NRIPS) for evaluating Japanese speaker recognition, and they demonstrate the improvements gained from the proposed method.
Music Technology meets Philosophy - From Digital Echos to Virtual Ethos: Joint Proceedings of the 40th International Computer Music Conference, ICMC 2014, and the 11th Sound and Music Computing Conference, SMC 2014, Athens, Greece, September 14-20, 2014 Refereed
14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5 3479-3483 2013 Refereed
A denoising autoencoder is applied to reverberant speech recognition as a noise-robust front-end that reconstructs the clean speech spectrum from noisy input. To capture the context effects of speech sounds, a window of multiple short-windowed spectral frames is concatenated to form a single input vector. Additionally, a combination of short- and long-term spectra is investigated to properly handle the long impulse response of reverberation while keeping the time resolution necessary for speech recognition. Experiments are performed using the CENSREC-4 dataset, which is designed as an evaluation framework for distant-talking speech recognition. Experimental results show that the proposed denoising-autoencoder-based front-end using the short-windowed spectra gives better results than conventional methods, and combining the long-term spectra yields further improvement. The recognition accuracy of the proposed method using the short- and long-term spectra is 97.0% on the open-condition test set of the dataset, compared with 87.8% for a multi-condition-training baseline. As a supplemental experiment, large-vocabulary speech recognition was also performed, confirming the effectiveness of the proposed method.
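A minimal NumPy sketch of such a front-end is shown below, assuming toy dimensions and synthetic data: the paper's actual spectra, window sizes, training data, and network topology all differ, so treat every number here as an illustration only.

```python
import numpy as np

# Minimal single-hidden-layer denoising autoencoder (toy sketch).
# Each input vector is a window of consecutive spectral frames concatenated
# together, as in the front-end described above.

rng = np.random.default_rng(0)

def make_windows(spectra, context=2):
    """Concatenate each frame with its +/-context neighbors into one vector."""
    T, _ = spectra.shape
    return np.stack([spectra[t - context:t + context + 1].ravel()
                     for t in range(context, T - context)])

def train_dae(clean, noisy, hidden=16, lr=0.05, epochs=1500):
    """Gradient descent so tanh-encoded noisy windows reconstruct clean ones."""
    n, d = noisy.shape
    W1 = rng.normal(scale=0.1, size=(d, hidden))
    W2 = rng.normal(scale=0.1, size=(hidden, d))
    for _ in range(epochs):
        h = np.tanh(noisy @ W1)                  # encoder
        err = h @ W2 - clean                     # linear decoder vs. target
        W2 -= lr * h.T @ err / n                 # backprop through decoder
        W1 -= lr * noisy.T @ ((err @ W2.T) * (1 - h ** 2)) / n
    return W1, W2

# Synthetic "spectra": a slowly varying signal plus additive noise.
T, d = 100, 8
clean_spec = (np.sin(np.linspace(0, 6, T))[:, None]
              + rng.normal(scale=0.01, size=(T, d)))
noisy_spec = clean_spec + rng.normal(scale=0.3, size=(T, d))
X_clean, X_noisy = make_windows(clean_spec), make_windows(noisy_spec)
W1, W2 = train_dae(X_clean, X_noisy)
denoised = np.tanh(X_noisy @ W1) @ W2            # reconstructed clean windows
```

The concatenated context window is what lets the network exploit neighboring frames when reconstructing each clean spectrum, mirroring the paper's motivation for multi-frame input vectors.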
Xiuqin Wei, Shingo Kuroiwa, Tomoharu Nagashima, Marian K. Kazimierczuk, Hiroo Sekiya
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS 59(9) 2137-2146 September 2012 Refereed
This paper introduces a push-pull class-E-M power amplifier for achieving low harmonic content and high output power. By applying the push-pull configuration to the class-E-M power amplifier, the proposed amplifier achieves a much lower total harmonic distortion (THD) and about four times higher output power than the conventional single class-E-M power amplifier. Design curves of the proposed amplifier for satisfying the class-E-M ZVS/ZDVS/ZCS/ZDCS conditions are given. A design example is shown along with the PSpice simulation and experimental waveforms for a 1-MHz amplifier, considering the MOSFET drain-to-source nonlinear parasitic capacitances, the MOSFET switch-on resistances, and the equivalent series resistance of the inductors. The waveforms from the PSpice simulation and circuit experiment satisfied all the switching conditions, which confirms the accuracy of the design curves given in this paper and validates the effectiveness of the push-pull class-E-M power amplifier.
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3 734-737 2012 Refereed
To provide an efficient means of communication for those who cannot move any muscles of the body except the eyes due to amyotrophic lateral sclerosis (ALS), we are developing a speech synthesis interface based on electrooculogram (EOG) input. The EOG is an electrical signal that is observed through electrodes attached to the skin around the eyes and reflects eye position. A key component of the system is a continuous recognizer for the EOG signal. In this paper, we propose and investigate a hidden Markov model (HMM) based EOG recognizer applying continuous speech recognition techniques. In the experiments, we evaluate the recognition system under both user-dependent and user-independent conditions. We show that 96.1% recognition accuracy is obtained for five classes of eye actions by a user-dependent system using six channels. While it is difficult to obtain good performance with a user-independent system, maximum likelihood linear regression (MLLR) adaptation is shown to help EOG recognition.
2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP) 5029-5032 2012 Refereed
Direct likelihood maximization selection (DLMS) selects a subset of language model training data so that the likelihood of in-domain development data is maximized. By using a recognition hypothesis instead of the in-domain development data, it can also be used for unsupervised adaptation. We apply DLMS to iterative unsupervised adaptation for presentation speech recognition. A problem with iterative unsupervised adaptation is that the adapted models are estimated from data that include recognition errors, which limits the adaptation performance. To solve this problem, we introduce the framework of unsupervised cross-validation (CV) adaptation, originally proposed for acoustic model adaptation. Large-vocabulary speech recognition experiments show that the CV approach is effective for DLMS-based adaptation, reducing the error rate of 19.3% obtained with an initial model to 18.0%.
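A toy stand-in for likelihood-based data selection is sketched below, assuming an add-one-smoothed unigram model and greedy search; the paper's actual model and selection criterion differ in detail.

```python
import math
from collections import Counter

# Greedy sketch of likelihood-based training-data selection: starting from
# an empty pool, repeatedly add whichever candidate sentence most increases
# the smoothed unigram log-likelihood of the development data.

def dev_loglik(counts, total, vocab, dev_tokens):
    """Add-one-smoothed unigram log-likelihood of the development tokens."""
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in dev_tokens)

def select_sentences(candidates, dev_tokens, k):
    """Greedily pick k sentences that maximize dev-data likelihood."""
    pool = [list(s) for s in candidates]
    vocab = len(set(dev_tokens) | {w for s in pool for w in s})
    counts, total, chosen = Counter(), 0, []
    for _ in range(k):
        best = max(pool, key=lambda s: dev_loglik(counts + Counter(s),
                                                  total + len(s), vocab,
                                                  dev_tokens))
        pool.remove(best)
        counts += Counter(best)
        total += len(best)
        chosen.append(best)
    return chosen
```

Replacing `dev_tokens` with tokens from a recognition hypothesis turns the same routine into the unsupervised-adaptation variant described above.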
2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 1-4 2012 Refereed
For large vocabulary continuous speech recognition, speech decoders process time sequences with context information using large probabilistic models. The software of such decoders tends to be large and complex, since it has to handle both the relationships among its component functions and the timing of computation at the same time. In traditional signal processing areas such as measurement and system control, block-diagram-based implementations are common: systems are designed by connecting blocks of components, the connections describe the flow of signals, and this framework greatly helps in understanding and designing complex systems. In this research, we show that speech decoders can be effectively decomposed into such diagrams, or pipelines. Once decomposed into pipelines, they can be implemented in a highly abstracted manner using a pure functional programming language with delayed evaluation. Based on this perspective, we have re-designed our purely functional decoder Husky, proposing a new design paradigm for speech recognition systems. Evaluation experiments show that it works efficiently on a large vocabulary continuous speech recognition task.
2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC) 1-4 2012 Refereed
We are developing the S-CAT computer test system, which will be the first automated adaptive speaking test for Japanese. The speaking ability of examinees is scored using speech processing techniques without human raters. Using computers for the scoring makes it possible to largely reduce the scoring cost and provides a convenient means for language learners to evaluate their learning status. While the S-CAT test has several categories of question items, the open answer question is technically the most challenging, since examinees freely talk about a given topic or argue a point about given material. For this problem, we proposed using support vector regression (SVR) with various features, some of which rely on a speech recognition hypothesis and some of which do not. SVR is more robust than multiple regression, and the best result was obtained when 390-dimensional features combining everything were used. The correlation coefficients between human-rated and SVR-estimated scores were 0.878, 0.847, 0.853, and 0.872 for the fluency, accuracy, content, and richness measures, respectively.
INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING 11(1) 33-53 January 2012 Refereed
Collaborative filtering (CF) is one of the most prevalent recommendation techniques, providing personalized recommendations to users based on their previously expressed preferences and those of other similar users. Although CF has been widely applied in various applications, its applicability is restricted by data sparsity, the inadequate data of new users and new items (the cold start problem), and the growth of both the number of users and items in the database (the scalability problem). In this paper, we propose an efficient iterative clustered prediction technique to transform the sparse user-item matrix into a dense one and overcome the scalability problem. In this technique, a spectral clustering algorithm is utilized to optimize the neighborhood selection and group the data into user and item clusters. Then, both the clustered user-based and clustered item-based approaches are aggregated to efficiently predict the unknown ratings. Our experiments on the MovieLens and Book-Crossing data sets indicate substantial and consistent improvements in recommendation accuracy compared with the hybrid user-based and item-based approach without clustering, a hybrid approach with k-means, and singular value decomposition (SVD)-based CF. Furthermore, we demonstrated the effectiveness of the proposed iterative technique across varying numbers of iterations.
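The cluster-restricted prediction step can be sketched as follows. The spectral clustering itself is omitted, cluster labels are assumed to be given, and a 0 entry in the rating matrix is taken to mean "unrated"; all of this is an illustration, not the paper's exact aggregation scheme.

```python
import numpy as np

# Toy sketch of cluster-restricted user-based CF: a missing rating is
# predicted as the similarity-weighted average of ratings from users in
# the same cluster.

def cosine(u, v):
    """Cosine similarity over co-rated items only (0 = unrated)."""
    mask = (u > 0) & (v > 0)
    if not mask.any():
        return 0.0
    a, b = u[mask], v[mask]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def predict(R, labels, user, item):
    """Predict R[user, item] from same-cluster users who rated the item."""
    peers = [j for j in range(len(R))
             if j != user and labels[j] == labels[user] and R[j, item] > 0]
    if not peers:
        return 0.0
    sims = np.array([cosine(R[user], R[j]) for j in peers])
    ratings = np.array([R[j, item] for j in peers])
    if sims.sum() == 0:
        return float(ratings.mean())
    return float(sims @ ratings / sims.sum())
```

In the iterative variant described above, such predictions would fill the sparse matrix, and the densified matrix would then feed the next clustering/prediction round.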
Xiuqin Wei, Hiroo Sekiya, Shingo Kuroiwa, Tadashi Suetsugu, Marian K. Kazimierczuk
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS 58(10) 2556-2565 October 2011 Refereed
This paper presents expressions for the waveforms and design equations to satisfy the ZVS/ZDS conditions in the class-E power amplifier, taking into account the MOSFET gate-to-drain linear parasitic capacitance and the drain-to-source nonlinear parasitic capacitance. Expressions are given for the power output capability and power conversion efficiency. Design examples are presented along with the PSpice simulation and experimental waveforms at 2.3 W output power and 4 MHz operating frequency. It is shown from the expressions that the slope of the voltage across the MOSFET gate-to-drain parasitic capacitance during the switch-off state affects the switch-voltage waveform. Therefore, it is necessary to consider the MOSFET gate-to-drain capacitance for achieving the class-E ZVS/ZDS conditions. As a result, the power output capability and the power conversion efficiency are also affected by the MOSFET gate-to-drain capacitance. The waveforms obtained from PSpice simulations and circuit experiments showed quantitative agreement with the theoretical predictions, which verifies the expressions given in this paper.
ELECTRONICS AND COMMUNICATIONS IN JAPAN 94(4) 44-54 April 2011 Refereed
Super-function based machine translation (SFBMT), a type of example-based machine translation, has a feature that makes it possible to expand the coverage of examples by changing nouns into variables. However, there have been problems extracting entire date/time expressions containing parts of speech other than nouns, because only nouns and numbers were changed into variables. We describe a method of extracting date/time expressions for SFBMT. SFBMT uses noun determination rules to extract nouns and a bilingual dictionary to obtain the correspondence of the extracted nouns between the source and target languages. In this method, we add a rule to extract date/time expressions and then extract them from a Japanese-English bilingual corpus. The evaluation results show that the precision of this method for Japanese sentences is 96.7%, with a recall of 98.2%, and the precision for English sentences is 94.7%, with a recall of 92.7%. DOI: 10.1002/ecj.10262
The purpose of our study is to develop a spoken dialogue system for in-vehicle appliances. Such a multi-domain dialogue system should be capable of reacting to topic changes and of recognizing isolated words as well as whole sentences quickly and accurately. We propose a novel recognition method that integrates sentence-, partial-word-, and phoneme-level recognition; the degree of confidence is determined by how well the recognition results match across these three levels. We conducted speech recognition experiments for in-vehicle appliances. For sentence units, the recognition accuracy was 96.2% with the proposed method and 92.9% with a conventional word bigram. For word units, the recognition accuracy of the proposed method was 86.2%, while that of whole-word recognition was 75.1%. We therefore conclude that our method can be effectively applied in spoken dialogue systems for in-vehicle appliances.
Xiuqin Wei, Hiroo Sekiya, Shingo Kuroiwa, Tadashi Suetsugu, Marian K. Kazimierczuk
2010 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS 3200-3203 2010 Refereed
In this paper, we present analytical expressions for the waveforms and design equations for achieving the ZVS/ZDS conditions in the class-E power amplifier, taking into account the gate-to-drain parasitic capacitance of the MOSFET. We also give a design example along with PSpice simulation and experimental results. The voltage waveforms obtained from both the PSpice simulation and the circuit experiment achieved the class-E ZVS/ZDS conditions completely, verifying the analytical expressions. The results in this paper indicate that it is important to consider the effect of the MOSFET gate-to-drain capacitance for achieving the class-E ZVS/ZDS conditions. The experimental power conversion efficiency reached 92.8% at output power P-o = 4.06 W and operating frequency f = 7 MHz.
IEEE NLP-KE 2009: PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING 220-+ 2009 Refereed
The amount of accessible information on the Internet increases every day, and it is becoming very difficult to deal with such a huge source of information. Recommender Systems (RS), which are considered powerful tools for Information Retrieval (IR), can access this available information efficiently. Unfortunately, recommendation accuracy is seriously affected by the problems of data sparsity and scalability, and recommendation time is also essential in recommender systems. Therefore, we propose an efficient dimensionality-reduction-based Collaborative Filtering (CF) recommender system. In this technique, Singular Value Decomposition-free (SVD-free) Latent Semantic Indexing (LSI) is utilized to obtain a reduced data representation, addressing the sparsity and scalability limitations. The SVD-free approach also greatly reduces the time and memory required for dimensionality reduction by employing the partial symmetric eigenproblem. Moreover, the Particle Swarm Optimization (PSO) algorithm is utilized to automatically estimate the optimal number of reduced dimensions, which greatly influences system accuracy. As a result, the proposed technique substantially increases prediction quality and speed while decreasing memory requirements. To show its efficiency, we applied it to the MovieLens dataset, and the results were very promising.
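The SVD-free idea can be sketched as follows: the top-k eigenvectors of the symmetric Gram matrix A^T A are the right singular vectors of A, so the k-dimensional latent representation can be obtained from the smaller symmetric eigenproblem without a full SVD. In this sketch k is fixed by hand, whereas the paper tunes it with PSO; a full eigendecomposition stands in for the partial one.

```python
import numpy as np

# SVD-free LSI sketch: eigendecompose the n x n Gram matrix A^T A instead
# of computing the SVD of the (typically much taller) m x n matrix A.

def lsi_svd_free(A, k):
    """Project rows of A onto the top-k latent dimensions."""
    gram = A.T @ A                      # symmetric n x n matrix
    w, v = np.linalg.eigh(gram)         # eigenvalues in ascending order
    Vk = v[:, ::-1][:, :k]              # top-k eigenvectors = right
                                        # singular vectors of A
    return A @ Vk                       # rows: users in k-dim latent space
```

Since the i-th column of the result is A times the i-th right singular vector, its norm equals the i-th singular value, matching what a truncated SVD would give up to sign.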
ACM International Conference Proceeding Series 331-334 2009 Refereed
To achieve greater accessibility for deaf people, sign language recognition systems and sign language animation systems must be developed. For Japanese Sign Language (JSL), previous studies have suggested that emphasis and emotion cause changes in hand movements; however, the relationship between emphasis, emotion, and signing speed has not been researched enough. In this study, we analyzed hand movement variation in relation to signing speed. First, we recorded 20 signed sentences at three speeds (fast, normal, and slow) using a digital video recorder and a 3D position sensor. Second, we segmented the sentences into three types of components (sign words, transitions, and pauses). In our previous study, we analyzed hand movement variations of sign words in relation to signing speed; here, we analyzed the transitions between adjacent sign words by a similar method. As a result, sign words and transitions showed a similar tendency, and we found that variation in signing speed mainly caused changes in the distance the hands moved. Furthermore, comparing transitions with sign words, we found that transitions were slower than sign words.