Smart Innovation, Systems and Technologies 98, 103-109, 2019
Time-session variability between enrollment data and recognition data degrades speaker recognition performance; it is therefore one of the most important issues in speaker recognition technology. In this paper, we propose a speaker recognition method that is robust to time-session variability. The proposed method estimates a time-session variability subspace and then carries out speaker recognition in the orthogonal complement of that subspace. In addition, we incorporate linear discriminant analysis into the proposed method. To evaluate the proposed method, we conducted a speaker identification experiment. Experimental results show that the proposed method improves the speaker identification performance of the baseline.
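The orthogonal-complement projection at the core of this approach can be sketched as follows. This is a minimal illustration assuming the time-session variability subspace is spanned by the top principal directions of within-speaker session deviations; the function names and the use of plain PCA are illustrative, not the paper's exact formulation.

```python
import numpy as np

def session_subspace(features_by_speaker, k):
    """Estimate a time-session variability subspace.

    features_by_speaker: list of (n_sessions, dim) arrays, one per speaker.
    Session variability is captured by the deviation of each session vector
    from its speaker's mean; the top-k principal directions of those
    deviations span the subspace.
    """
    deviations = np.vstack([s - s.mean(axis=0) for s in features_by_speaker])
    # PCA of the deviations via SVD; rows of vt are principal directions
    _, _, vt = np.linalg.svd(deviations, full_matrices=False)
    return vt[:k].T  # (dim, k) basis of the session-variability subspace

def remove_session_variability(x, basis):
    """Project x onto the orthogonal complement of the session subspace."""
    return x - basis @ (basis.T @ x)
```

Recognition would then score speakers on the projected vectors, so that components that vary across sessions no longer influence the decision.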
COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2016, 6898031:1-6898031:9, 2016, Peer-reviewed
Eye motion-based human-machine interfaces provide a means of communication for those who can move nothing but their eyes because of injury or disease. To detect eye motions, electrooculography (EOG) is used. For efficient communication, the input speed is critical. However, it is difficult for conventional EOG recognition methods to accurately recognize fast, sequentially input eye motions because adjacent eye motions influence each other. In this paper, we propose a context-dependent hidden Markov model (HMM) based EOG modeling approach that uses separate models for identical eye motions with different contexts. Because the influence of adjacent eye motions is explicitly modeled, higher recognition accuracy is achieved. Additionally, we propose a method of user adaptation based on a user-independent EOG model to investigate the trade-off between recognition accuracy and the amount of user-dependent data required for HMM training. Experimental results show that when the proposed context-dependent HMMs are used, the character error rate (CER) is significantly reduced compared with the conventional baseline under user-dependent conditions, from 36.0% to 1.3%. Although the CER increases again to 17.3% when context-dependent but user-independent HMMs are used, it can be reduced to 7.3% by applying the proposed user adaptation method.
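The central idea, giving the same eye motion different models depending on its neighbours, can be illustrated with a triphone-style labeling scheme borrowed from speech recognition. This is a hypothetical sketch: the label format and the `sil` boundary symbol are assumptions, not the paper's notation.

```python
def context_dependent_label(prev, current, nxt):
    """Label an eye motion by its left and right context, triphone-style,
    so that identical motions in different contexts get separate HMMs."""
    return f"{prev}-{current}+{nxt}"

def to_cd_sequence(motions):
    """Expand a sequence of eye-motion names into context-dependent
    model labels, padding the ends with a silence-like boundary symbol."""
    padded = ["sil"] + motions + ["sil"]
    return [context_dependent_label(padded[i - 1], padded[i], padded[i + 1])
            for i in range(1, len(padded) - 1)]
```

For example, `to_cd_sequence(["up", "left"])` yields one model for "up followed by left" and a different one for "left at the end of input", which is how the influence of adjacent motions is modeled explicitly.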
International Journal of Biometrics 7(2), 83-96, 2015, Peer-reviewed
GMM-UBM supervectors can lead to poor speaker-verification models because of inter-session variability, especially when only a small amount of training utterances is available. In this study, we propose a phoneme-dependent method to suppress the inter-session variability. A speaker's model is represented by several phoneme Gaussian mixture models. Each of them covers an individual phoneme whose inter-session variability is constrained to an inter-session-independent subspace constructed by principal component analysis (PCA), using a corpus uttered by a single speaker and recorded over a long period. SVM-based experiments were performed using a large corpus constructed by the National Research Institute of Police Science (NRIPS) for evaluating Japanese speaker recognition, and they demonstrate the improvements gained from the proposed method.
Music Technology meets Philosophy - From Digital Echos to Virtual Ethos: Joint Proceedings of the 40th International Computer Music Conference, ICMC 2014, and the 11th Sound and Music Computing Conference, SMC 2014, Athens, Greece, September 14-20, 2014, Peer-reviewed
14TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION (INTERSPEECH 2013), VOLS 1-5, 3479-3483, 2013, Peer-reviewed
A denoising autoencoder is applied to reverberant speech recognition as a noise-robust front-end that reconstructs the clean speech spectrum from noisy input. In order to capture context effects of speech sounds, a window of multiple short-windowed spectral frames is concatenated to form a single input vector. Additionally, a combination of short- and long-term spectra is investigated to properly handle the long impulse response of reverberation while keeping the time resolution necessary for speech recognition. Experiments are performed using the CENSREC-4 dataset, which is designed as an evaluation framework for distant-talking speech recognition. Experimental results show that the proposed denoising-autoencoder-based front-end using the short-windowed spectra gives better results than conventional methods. By combining the long-term spectra, further improvement is obtained. The recognition accuracy of the proposed method using the short- and long-term spectra is 97.0% on the open-condition test set of the dataset, whereas it is 87.8% when a multi-condition-training-based baseline is used. As a supplemental experiment, large-vocabulary speech recognition was also performed, and the effectiveness of the proposed method was confirmed.
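The step of concatenating neighbouring frames into a single autoencoder input can be sketched as below. The padding strategy (repeating edge frames) and the context size are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

def splice_frames(spectra, context=3):
    """Concatenate each spectral frame with its +/-context neighbours
    into one input vector for the denoising autoencoder. Edge frames
    are padded by repeating the first/last frame."""
    n, d = spectra.shape
    padded = np.vstack([np.repeat(spectra[:1], context, axis=0),
                        spectra,
                        np.repeat(spectra[-1:], context, axis=0)])
    # For each offset in the window, take an n-frame slice and stack
    # them side by side: output shape is (n, (2*context + 1) * d).
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])
```

The autoencoder itself would then be trained to map each spliced noisy vector to the corresponding clean centre frame.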
Xiuqin Wei, Shingo Kuroiwa, Tomoharu Nagashima, Marian K. Kazimierczuk, Hiroo Sekiya
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS 59(9), 2137-2146, September 2012, Peer-reviewed
This paper introduces a push-pull class-E-M power amplifier for achieving low harmonic content and high output power. By applying the push-pull configuration to the class-E-M power amplifier, the proposed amplifier achieves a much lower total harmonic distortion (THD) and about four times higher output power than the conventional single class-E-M power amplifier. Design curves of the proposed amplifier for satisfying the class-E-M ZVS/ZDVS/ZCS/ZDCS conditions are given. A design example is shown along with the PSpice-simulation and experimental waveforms for a 1-MHz amplifier, considering the MOSFET drain-to-source nonlinear parasitic capacitances, MOSFET switch-on resistances, and the equivalent series resistance of the inductors. The waveforms from the PSpice simulation and circuit experiment satisfied all the switching conditions, which shows the accuracy of the design curves given in this paper and validates the effectiveness of the push-pull class-E-M power amplifier.
13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, 734-737, 2012, Peer-reviewed
To provide an efficient means of communication for those who cannot move any muscles except those of the eyes due to amyotrophic lateral sclerosis (ALS), we are developing a speech synthesis interface based on electrooculogram (EOG) input. EOG is an electrical signal that is observed through electrodes attached to the skin around the eyes and reflects eye position. A key component of the system is a continuous recognizer for the EOG signal. In this paper, we propose and investigate a hidden Markov model (HMM) based EOG recognizer that applies continuous speech recognition techniques. In the experiments, we evaluate the recognition system under both user-dependent and user-independent conditions. It is shown that 96.1% recognition accuracy is obtained for five classes of eye actions by a user-dependent system using six channels. While it is difficult to obtain good performance with a user-independent system, it is shown that maximum likelihood linear regression (MLLR) adaptation helps for EOG recognition.
2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 5029-5032, 2012, Peer-reviewed
Direct likelihood maximization selection (DLMS) selects a subset of language model training data so that the likelihood of in-domain development data is maximized. By using a recognition hypothesis instead of the in-domain development data, it can be used for unsupervised adaptation. We apply DLMS to iterative unsupervised adaptation for presentation speech recognition. A problem of iterative unsupervised adaptation is that adapted models are estimated from data that include recognition errors, which limits adaptation performance. To solve this problem, we introduce the framework of unsupervised cross-validation (CV) adaptation, which was originally proposed for acoustic model adaptation. Large-vocabulary speech recognition experiments show that the CV approach is effective for DLMS-based adaptation, reducing the error rate of an initial model from 19.3% to 18.0%.
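The selection principle, picking training segments so that the development (or hypothesis) data becomes most likely, can be sketched with a heavily simplified model. This sketch uses an add-one-smoothed unigram language model and greedy segment selection; the actual DLMS model and search are more elaborate, and all names here are illustrative.

```python
import math
from collections import Counter

def dev_loglik(counts, total, dev_tokens, vocab):
    """Add-one-smoothed unigram log-likelihood of the dev data under a
    model estimated from the currently selected segments."""
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in dev_tokens)

def select_segments(candidates, dev_tokens, n_select):
    """Greedy sketch of direct likelihood maximization selection:
    repeatedly add the candidate segment whose inclusion most increases
    the dev-data likelihood."""
    vocab = len(set(dev_tokens) | {w for seg in candidates for w in seg})
    counts, total, selected = Counter(), 0, []
    remaining = list(candidates)
    for _ in range(min(n_select, len(remaining))):
        best, best_ll = None, -float("inf")
        for seg in remaining:
            trial = counts + Counter(seg)
            ll = dev_loglik(trial, total + len(seg), dev_tokens, vocab)
            if ll > best_ll:
                best, best_ll = seg, ll
        selected.append(best)
        counts += Counter(best)
        total += len(best)
        remaining.remove(best)
    return selected
```

In the unsupervised-adaptation setting described above, `dev_tokens` would be a recognition hypothesis rather than true in-domain text, which is why CV-style adaptation is needed to limit the effect of recognition errors.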
2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 1-4, 2012, Peer-reviewed
For large-vocabulary continuous speech recognition, speech decoders process time sequences with context information using large probabilistic models. The software of such speech decoders tends to be large and complex, since it has to handle both the relationships among its component functions and the timing of computation at the same time. In traditional signal processing areas such as measurement and system control, block-diagram-based implementations are common: systems are designed by connecting blocks of components. The connections describe the flow of signals, and this framework greatly helps in understanding and designing complex systems. In this research, we show that speech decoders can be effectively decomposed into diagrams, or pipelines. Once they are decomposed into pipelines, they can be implemented in a highly abstracted manner using a pure functional programming language with delayed evaluation. Based on this perspective, we have re-designed our purely functional decoder Husky, proposing a new design paradigm for speech recognition systems. Evaluation experiments show that it works efficiently on a large-vocabulary continuous speech recognition task.
2012 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 1-4, 2012, Peer-reviewed
We are developing the S-CAT computer test system, which will be the first automated adaptive speaking test for Japanese. The speaking ability of examinees is scored using speech processing techniques without human raters. By using computers for the scoring, it is possible to largely reduce the scoring cost and provide a convenient means for language learners to evaluate their learning status. While the S-CAT test has several categories of question items, the open-answer question is technically the most challenging one, since examinees freely talk about a given topic or argue a position on a given material. For this problem, we proposed using support vector regression (SVR) with various features, some of which rely on a speech recognition hypothesis and some of which do not. SVR is more robust than multiple regression, and the best result was obtained when 390-dimensional features combining everything were used. The correlation coefficients between human-rated and SVR-estimated scores were 0.878, 0.847, 0.853, and 0.872 for the fluency, accuracy, content, and richness measures, respectively.
INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & DECISION MAKING 11(1), 33-53, January 2012, Peer-reviewed
Collaborative filtering (CF) is one of the most prevalent recommendation techniques, providing personalized recommendations to users based on their previously expressed preferences and those of other, similar users. Although CF has been widely applied in various applications, its applicability is restricted by data sparsity, the data inadequacy of new users and new items (the cold-start problem), and the growth of both the number of users and the number of items in the database (the scalability problem). In this paper, we propose an efficient iterative clustered prediction technique to transform the sparse user-item matrix into a dense one and overcome the scalability problem. In this technique, a spectral clustering algorithm is utilized to optimize neighborhood selection and group the data into user and item clusters. Then, both the clustered user-based and the clustered item-based approaches are aggregated to efficiently predict the unknown ratings. Our experiments on the MovieLens and Book-Crossing data sets indicate substantial and consistent improvements in recommendation accuracy compared with the hybrid user-based and item-based approach without clustering, a hybrid approach with k-means, and singular value decomposition (SVD)-based CF. Furthermore, we demonstrate the effectiveness of the proposed iterative technique through a varying number of iterations.
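The clustered prediction step can be sketched as below. For brevity this sketch substitutes a tiny k-means for the paper's spectral clustering, treats 0 as a missing rating, and shows a single user-side densification pass; the iterative scheme, the item-side counterpart, and all names are illustrative.

```python
import numpy as np

def kmeans_labels(X, k, iters=20):
    """Tiny Lloyd's k-means with deterministic initialisation
    (the first k rows), sufficient for this sketch."""
    centers = X[:k].astype(float).copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def densify(ratings, k_users):
    """One clustered-prediction pass: fill each missing (user, item)
    entry (0 = missing) with the mean observed rating of the user's
    cluster for that item."""
    labels = kmeans_labels(ratings, k_users)
    filled = ratings.astype(float).copy()
    for u in range(ratings.shape[0]):
        peers = ratings[labels == labels[u]]
        for i in range(ratings.shape[1]):
            if filled[u, i] == 0:
                observed = peers[peers[:, i] > 0, i]
                if observed.size:
                    filled[u, i] = observed.mean()
    return filled
```

Iterating this pass on the progressively denser matrix, and aggregating it with the analogous item-clustered pass, corresponds to the iterative scheme the abstract describes.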
Xiuqin Wei, Hiroo Sekiya, Shingo Kuroiwa, Tadashi Suetsugu, Marian K. Kazimierczuk
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I-REGULAR PAPERS 58(10), 2556-2565, October 2011, Peer-reviewed
This paper presents expressions for the waveforms and design equations that satisfy the ZVS/ZDS conditions in the class-E power amplifier, taking into account the MOSFET gate-to-drain linear parasitic capacitance and the drain-to-source nonlinear parasitic capacitance. Expressions are given for the power output capability and the power conversion efficiency. Design examples are presented along with PSpice-simulation and experimental waveforms at 2.3 W output power and 4 MHz operating frequency. It is shown from the expressions that the slope of the voltage across the MOSFET gate-to-drain parasitic capacitance during the switch-off state affects the switch-voltage waveform. Therefore, it is necessary to consider the MOSFET gate-to-drain capacitance when realizing the class-E ZVS/ZDS conditions. As a result, the power output capability and the power conversion efficiency are also affected by the MOSFET gate-to-drain capacitance. The waveforms obtained from PSpice simulations and circuit experiments showed quantitative agreement with the theoretical predictions, which verifies the expressions given in this paper.
ELECTRONICS AND COMMUNICATIONS IN JAPAN 94(4), 44-54, April 2011, Peer-reviewed
Super-function-based machine translation (SFBMT), a type of example-based machine translation, has a feature that makes it possible to expand the coverage of examples by changing nouns into variables. However, there have been problems extracting entire date/time expressions containing parts of speech other than nouns, because only nouns and numbers were changed into variables. We describe a method of extracting date/time expressions for SFBMT. SFBMT uses noun determination rules to extract nouns and a bilingual dictionary to obtain the correspondence of the extracted nouns between the source and target languages. In this method, we add a rule to extract date/time expressions and then extract such expressions from a Japanese-English bilingual corpus. The evaluation results show that the precision of this method for Japanese sentences is 96.7% with a recall of 98.2%, and the precision for English sentences is 94.7% with a recall of 92.7%. (C) 2011 Wiley Periodicals, Inc. Electron Comm Jpn, 94(4): 44-54, 2011; Published online in Wiley Online Library (wileyonlinelibrary.com). DOI 10.1002/ecj.10262
The purpose of our study is to develop a spoken dialogue system for in-vehicle appliances. Such a multi-domain dialogue system should be capable of reacting to a change of topic and of recognizing both isolated words and whole sentences quickly and accurately. We propose a novel recognition method that integrates a sentence, partial words, and phonemes. The degree of confidence is determined by the degree to which recognition results match on these three levels. We conducted speech recognition experiments for in-vehicle appliances. For sentence units, the recognition accuracy was 96.2% with the proposed method and 92.9% with a conventional word bigram. For word units, the recognition accuracy of the proposed method was 86.2%, while that of whole-word recognition was 75.1%. Therefore, we conclude that our method can be effectively applied in spoken dialogue systems for in-vehicle appliances.
Xiuqin Wei, Hiroo Sekiya, Shingo Kuroiwa, Tadashi Suetsugu, Marian K. Kazimierczuk
2010 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, 3200-3203, 2010, Peer-reviewed
In this paper, we present analytical expressions for the waveforms and design equations for achieving the ZVS/ZDS conditions in the class-E power amplifier, taking into account the gate-to-drain parasitic capacitance of the MOSFET. We also give a design example along with PSpice-simulation and experimental results. The voltage waveforms obtained from both the PSpice simulation and the circuit experiment achieved the class-E ZVS/ZDS conditions completely, verifying the analytical expressions. The results in this paper indicate that it is important to consider the effect of the MOSFET gate-to-drain capacitance when realizing the class-E ZVS/ZDS conditions. The experimental power conversion efficiency reached 92.8% at output power P-o = 4.06 W and operating frequency f = 7 MHz.
IEEE NLP-KE 2009: PROCEEDINGS OF INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING, 220-+, 2009, Peer-reviewed
The amount of accessible information on the Internet increases every day, and it is becoming increasingly difficult to deal with such a huge source of information. Recommender Systems (RS), which are considered powerful tools for Information Retrieval (IR), can access this available information efficiently. Unfortunately, recommendation accuracy is seriously affected by the problems of data sparsity and scalability. Additionally, the time needed to produce recommendations is essential in Recommender Systems. We therefore propose an efficient dimensionality-reduction-based Collaborative Filtering (CF) Recommender System. In this technique, Singular Value Decomposition-free (SVD-free) Latent Semantic Indexing (LSI) is utilized to obtain a reduced data representation, addressing the sparsity and scalability limitations. The SVD-free approach also greatly reduces the time and memory required for dimensionality reduction by employing the partial symmetric eigenproblem. Moreover, the Particle Swarm Optimization (PSO) algorithm is utilized to automatically estimate the optimal number of reduced dimensions, which greatly influences the system's accuracy. As a result, the proposed technique substantially increases the prediction quality and speed of the recommendations and decreases the memory requirements. To show the efficiency of the proposed technique, we applied it to the MovieLens dataset, and the results were very promising.
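The SVD-free idea rests on a standard identity: the leading eigenvectors of the symmetric Gram matrix A^T A are the right singular vectors of A, so a rank-k latent representation can be obtained by solving a symmetric eigenproblem on the smaller matrix instead of a full SVD. A minimal sketch, using a full `eigh` where the paper uses a partial eigensolver, with illustrative names:

```python
import numpy as np

def svd_free_lsi(A, k):
    """SVD-free LSI sketch: eigendecompose the symmetric Gram matrix
    A^T A and keep the top-k eigenvectors as the latent item basis."""
    gram = A.T @ A                     # (items x items), symmetric
    _, vecs = np.linalg.eigh(gram)     # eigenvalues in ascending order
    V = vecs[:, ::-1][:, :k]           # top-k eigenvectors = right singular vectors
    return A @ V, V                    # reduced user vectors, item basis
```

Similarity and prediction are then computed in the k-dimensional space; the abstract's PSO step would tune k automatically rather than fixing it by hand.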
ACM International Conference Proceeding Series, 331-334, 2009, Peer-reviewed
To achieve greater accessibility for deaf people, sign language recognition systems and sign language animation systems must be developed. In Japanese Sign Language (JSL), previous studies have suggested that emphasis and emotion cause changes in hand movements. However, the relationship between emphasis and emotion and the signing speed has not been researched sufficiently. In this study, we analyzed hand movement variation in relation to the signing speed. First, we recorded 20 signed sentences at three speeds (fast, normal, and slow) using a digital video recorder and a 3D position sensor. Second, we segmented the sentences into three types of components (sign words, transitions, and pauses). In our previous study, we analyzed hand movement variations of sign words in relation to the signing speed. In this study, we analyzed transitions between adjacent sign words by a method similar to that of the previous study. As a result, sign words and transitions showed a similar tendency, and we found that variation in signing speed mainly caused changes in the distance the hands moved. Furthermore, we compared transitions with sign words and found that transitions were slower than sign words. Copyright 2009 ACM.
INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, 2319-+, 2009, Peer-reviewed
In this paper, we propose a novel speaker verification method that decides whether a claimer is accepted or rejected by the rank of the claimer among a large number of speaker models, instead of score normalization such as T-norm and Z-norm. The method has an advantage over standard T-norm in speaker verification accuracy. However, it requires as much computation time as T-norm, which needs to calculate likelihoods for many cohort models. Hence, we also discuss a speed-up method that selects a cohort subset for each target speaker in the training stage. This data-driven approach can significantly reduce computation, resulting in a faster speaker verification decision. We conducted text-independent speaker verification experiments using the large-scale Japanese speaker recognition evaluation corpus constructed by the National Research Institute of Police Science. As a result, the proposed method achieved an equal error rate of 2.2%, while T-norm obtained 2.7%.
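The rank-based decision rule can be sketched in a few lines. The score convention (higher = more similar) and the rank threshold are assumptions for illustration; the paper's method defines the cohort and scoring in more detail.

```python
def verify_by_rank(claimed_score, cohort_scores, max_rank):
    """Accept the claimer if the claimed speaker's model score ranks
    within the top max_rank among a large set of cohort speaker models,
    replacing T-/Z-norm style score normalisation with a rank criterion."""
    rank = 1 + sum(s > claimed_score for s in cohort_scores)
    return rank <= max_rank
```

The speed-up discussed above would shrink `cohort_scores` to a per-target cohort subset chosen at training time, so fewer model likelihoods are computed per decision.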
International Journal of Biomedical Soft Computing and Human Sciences: the official journal of the Biomedical Fuzzy Systems Association 14(1), 3-10, 2009
Recently, Distributed Speech Recognition (DSR) systems have been widely deployed in Japanese cellular telephone networks. In these systems, personal authentication by voice is strongly desired. In this paper, we present several speaker recognition techniques developed at the University of Tokushima for Distributed Speaker Identification/Verification (DSI/DSV) systems. In particular, we present recent progress on a non-parametric speaker recognition system that is more robust to quantization in distributed systems than conventional speaker recognition systems based on the Gaussian Mixture Model (GMM). Evaluation results using the Japanese de facto standard speaker recognition corpus and the CCC Speaker Recognition Evaluation 2006 data developed by the Chinese Corpus Consortium (CCC) show higher performance for the proposed method than for GMM and VQ-distortion in the European Telecommunications Standards Institute (ETSI) DSR standard environment.
2009 INTERNATIONAL SYMPOSIUM ON INTELLIGENT SIGNAL PROCESSING AND COMMUNICATION SYSTEMS (ISPACS 2009), 449-452, 2009, Peer-reviewed
Recently, new sensors such as bone-conductive microphones, throat microphones, and non-audible murmur (NAM) microphones have been developed for collecting speech data, in addition to conventional condenser microphones. Accordingly, researchers have begun to study speaker and speech recognition using speech data collected by these new sensors. We focus on bone-conduction speech data collected by a bone-conductive microphone. In this paper, we first investigate the speaker verification performance of bone-conduction speech. In addition, we propose a method that uses bone-conduction and air-conduction speech together for speaker verification. The proposed method integrates the similarity calculated with the air-conduction speech model and the similarity calculated with the bone-conduction speech model. We conducted speaker verification experiments using speech data from 99 female speakers. Experimental results show that the speaker verification performance of bone-conduction speech is lower than that of air-conduction speech. However, the proposed method improves on the performance of both bone- and air-conduction speech alone: it reduces the equal error rate of air-conduction speech by 16.0% and that of bone-conduction speech by 71.7%.
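The score-level integration of the two models can be sketched as a weighted combination. The linear form, the default equal weight, and the threshold decision are assumptions for illustration; the abstract only states that the two similarities are integrated.

```python
def fused_score(air_score, bone_score, w=0.5):
    """Combine the similarity from the air-conduction speech model with
    the similarity from the bone-conduction speech model; w is an
    assumed interpolation weight."""
    return w * air_score + (1.0 - w) * bone_score

def accept(air_score, bone_score, threshold, w=0.5):
    """Verification decision on the fused similarity."""
    return fused_score(air_score, bone_score, w) >= threshold
```

In practice the weight and threshold would be tuned on development data, e.g. to minimise the equal error rate reported above.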