S Nakamura, K Takeda, K Yamamoto, T Yamada, S Kuroiwa, N Kitaoka, T Nishiura, A Sasou, M Mizumachi, C Miyajima, M Fujimoto, T Endo
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E88D(3) 535-544, March 2005, Peer-reviewed
This paper introduces an evaluation framework for Japanese noisy speech recognition named AURORA-2J. Speech recognition systems must still be improved to be robust in noisy environments, but this improvement requires the development of standard evaluation corpora and assessment technologies. Recently, the Aurora 2, 3, and 4 corpora and their evaluation scenarios have had a significant impact on noisy speech recognition research. AURORA-2J is a Japanese connected-digit corpus whose evaluation scripts are designed in the same way as Aurora 2, with the help of the European Telecommunications Standards Institute (ETSI) AURORA group. This paper describes the data collection, the baseline scripts, and the baseline performance. We also propose a new performance analysis method that considers differences in recognition performance among speakers. This method is based on the word accuracy per speaker, revealing the degree of individual variation in recognition performance. We also propose a categorization of the modifications applied to the original HTK baseline system, which helps in comparing systems and in identifying the technologies that best improve performance within the same category.
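As a rough illustration of the per-speaker analysis, the sketch below computes HTK-style word accuracy for each speaker and summarizes the spread across speakers; the input format (speaker-indexed error counts) is a hypothetical stand-in for what the actual baseline scripts would parse from HTK result files.

```python
# Minimal sketch of per-speaker word-accuracy analysis (hypothetical input
# format; the real baseline scripts parse HTK recognition results instead).
from statistics import mean, stdev

def word_accuracy(subs, dels, ins, n_ref):
    """HTK-style word accuracy: (N - S - D - I) / N * 100."""
    return (n_ref - subs - dels - ins) / n_ref * 100.0

def per_speaker_accuracy(results):
    """results maps speaker id -> accumulated (S, D, I, N) for that speaker."""
    return {spk: word_accuracy(*counts) for spk, counts in results.items()}

accs = per_speaker_accuracy({"spk01": (12, 4, 3, 300), "spk02": (30, 9, 8, 310)})
print(accs)
print(f"mean={mean(accs.values()):.1f}, stdev={stdev(accs.values()):.1f}")
```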
INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005, 3085-3088, Peer-reviewed
The study of human-computer interaction is now one of the most popular research domains across computer science and psychology. Recently, most of the essential issues focus not only on information about physical computing but also on affective computing. The emotional states of human beings can dramatically affect their actions, so it is important for a computer to understand what people feel at a given time. In this paper, we propose a novel method to predict a person's future emotional state from the current emotional state and affective factors using an advanced mental state transition network [1]. A psychological experiment with about 100 participants was conducted to obtain the structure and the coefficients of the model. A test experiment was also conducted to verify the predictive validity of this model.
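A minimal sketch of how such a prediction could look, assuming the network reduces to a weighted state-transition table biased by affective factors; the states, weights, and factor values below are purely illustrative, not the coefficients estimated in the paper's experiment.

```python
# Hypothetical mental-state-transition sketch: the next emotion state is
# the one with the largest transition weight from the current state,
# optionally shifted by external affective factors. All values are
# illustrative placeholders.
TRANSITIONS = {  # TRANSITIONS[s][t] = base tendency of moving from s to t
    "joy":     {"joy": 0.6, "anger": 0.1, "sadness": 0.1, "calm": 0.2},
    "anger":   {"joy": 0.1, "anger": 0.5, "sadness": 0.2, "calm": 0.2},
    "sadness": {"joy": 0.1, "anger": 0.1, "sadness": 0.6, "calm": 0.2},
    "calm":    {"joy": 0.2, "anger": 0.1, "sadness": 0.1, "calm": 0.6},
}

def predict_next(current, factor_bias=None):
    """Return the most likely next state given the current one."""
    scores = dict(TRANSITIONS[current])
    for state, bias in (factor_bias or {}).items():
        scores[state] += bias  # affective factors shift the tendencies
    return max(scores, key=scores.get)

print(predict_next("calm", factor_bias={"joy": 0.5}))  # -> "joy"
```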
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E87D(5) 1119-1126, May 2004, Peer-reviewed
This paper addresses problems involved in performing speech recognition over mobile and IP networks. The main problem is speech data loss caused by packet loss in the network. We present two missing-feature-based approaches that recover lost regions of speech data. These approaches are based on the reconstruction of missing frames or on marginal distributions. For comparison, we also use a packing method, which skips lost data. We evaluate these approaches with packet loss models, i.e., random loss and Gilbert loss models. The results show that the marginal-distribution-based technique is the most effective for a packet loss environment; the degradation of word accuracy is only 5% when the packet loss rate is 30% and only 3% when the mean burst loss length is 24 frames in the case of the DSR front-end. The simple data imputation method is also effective in the case of clean speech.
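The two recovery ideas can be sketched as follows, assuming diagonal-covariance Gaussian models and frame-level loss flags; the names and array shapes are illustrative.

```python
# Sketch of the two recovery ideas (illustrative shapes and names).
import numpy as np

def log_gauss_marginal(x, mean, var, present):
    """Log-likelihood of a diagonal Gaussian computed only over the
    feature dimensions flagged as present (lost ones are marginalised)."""
    x, mean, var = x[present], mean[present], var[present]
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def impute_lost_frames(frames, lost):
    """Reconstruct lost frames by linear interpolation between the
    nearest received frames (simple data imputation)."""
    frames = frames.copy()
    received = np.where(~lost)[0]
    for d in range(frames.shape[1]):
        frames[lost, d] = np.interp(np.where(lost)[0], received,
                                    frames[received, d])
    return frames
```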
ITCC 2004: INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: CODING AND COMPUTING, VOL 2, PROCEEDINGS, 298-302, 2004, Peer-reviewed
A parallel corpus is a very important tool for constructing a good machine translation system or for conducting natural language processing research for cross-language information retrieval. The Internet archive is a good source of parallel documents in different languages. In order to construct a good parallel corpus from the Internet archive, a bilingual dictionary that contains word pairs which may not exist in commercial dictionaries is a must. Extracting a bilingual dictionary from the Internet parallel documents is important for adding words that are absent from traditional dictionaries. This paper describes two algorithms to automatically extract an English/Arabic bilingual dictionary from parallel texts that exist in the Internet archive. The system should preferably be useful for many different language pairs. As in most existing systems, the accuracy of our system is directly proportional to the number of sentence pairs used. By controlling the system parameters, we could achieve 100% precision for the output bilingual dictionary, but the size of the dictionary will be smaller.
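One plausible reading of such co-occurrence-based extraction is sketched below using a Dice-coefficient score over aligned sentence pairs; the threshold makes the precision-versus-size trade-off mentioned above explicit. The scoring choice is an assumption, not necessarily the paper's exact algorithm.

```python
# Hypothetical dictionary-extraction sketch: score word pairs by their
# co-occurrence across aligned sentence pairs (Dice coefficient) and keep
# pairs above a threshold. Raising the threshold trades size for precision.
from collections import Counter
from itertools import product

def extract_pairs(sentence_pairs, threshold=0.5):
    en_count, ar_count, co_count = Counter(), Counter(), Counter()
    for en_sent, ar_sent in sentence_pairs:
        en_words, ar_words = set(en_sent.split()), set(ar_sent.split())
        en_count.update(en_words)
        ar_count.update(ar_words)
        co_count.update(product(en_words, ar_words))
    dictionary = {}
    for (en, ar), co in co_count.items():
        dice = 2 * co / (en_count[en] + ar_count[ar])
        if dice >= threshold:
            dictionary[(en, ar)] = dice
    return dictionary
```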
ITCC 2004: INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY: CODING AND COMPUTING, VOL 1, PROCEEDINGS, 241-245, 2004, Peer-reviewed
Time-series forecasting is an important research area in several domains. Recently, neural networks have been applied very successfully to time series to improve multivariate prediction ability. Several neural network models have already been developed for market prediction. Some are applied to predicting changes in future interest rates and exchange rates; some are applied to recognizing certain price patterns that are characteristic of future price changes. This paper presents a neural network model for technical analysis of the stock market and its application to a buying and selling timing prediction system for a Japanese stock index. This paper also describes a natural language generation system that expresses prediction information about TOPIX in natural language for non-expert users. This system has evolved to include one of the most comprehensive grammars of English for prediction expressions.
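As a rough illustration of the prediction side, the sketch below trains a tiny feed-forward network to map a window of past index values to an up/down signal; the architecture, window size, and synthetic data are assumptions for illustration only, not the paper's model.

```python
# Toy feed-forward net for buy/sell timing: predict whether the next index
# value moves up from a window of past values. All settings are illustrative.
import numpy as np

rng = np.random.default_rng(0)
series = np.cumsum(rng.normal(size=500))          # random-walk stand-in for TOPIX
win = 10
X = np.stack([series[i:i + win] for i in range(len(series) - win)])
y = (series[win:] > series[win - 1:-1]).astype(float)  # 1 = next move is up

# one hidden layer, sigmoid output, plain gradient descent on cross-entropy
W1 = rng.normal(scale=0.1, size=(win, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=8);        b2 = 0.0
sig = lambda z: 1 / (1 + np.exp(-z))
for _ in range(200):
    h = np.tanh(X @ W1 + b1)
    p = sig(h @ W2 + b2)
    g = (p - y) / len(y)                  # d(cross-entropy)/d(logit)
    W2 -= 0.5 * h.T @ g; b2 -= 0.5 * g.sum()
    gh = np.outer(g, W2) * (1 - h ** 2)   # backprop through tanh layer
    W1 -= 0.5 * X.T @ gh; b1 -= 0.5 * gh.sum(axis=0)

p = sig(np.tanh(X @ W1 + b1) @ W2 + b2)
print("train accuracy:", ((p > 0.5).astype(float) == y).mean())
```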
S Kuroiwa, M Naito, M Nakamura, S Sakayori, T Mukasa
ELECTRONICS AND COMMUNICATIONS IN JAPAN PART II-ELECTRONICS 87(4) 44-52, 2004, Peer-reviewed
A system that automatically rejects prank calls coming through home country direct from abroad, one of the international telephone services, is presented in this paper. Home country direct is a service whereby a user can use international telephone services in his or her native language by directly accessing the home country's international station operators. Since this service does not require fees for calling the operators, prank calls made by children from abroad pose a problem. Thus, an "automatic prank call rejection system" has been developed that instructs the caller in Japanese to say a specific word, determines the caller to be a legitimate user if he or she repeats this word correctly, and determines the call to be a prank call otherwise. When this system was applied to commercial services, it rejected 94.7% of prank calls, while erroneously rejected legitimate users constituted 0.8%. It has been confirmed that erroneously rejected legitimate users ultimately ended up being connected by hanging up the phone, redialling a number of times, and repeating the word correctly. This system has been found to reject about 10,000 prank calls a day since being applied to the operations of the KDD International Telephone Center in March 1996. (C) 2004 Wiley Periodicals, Inc.
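The accept/reject decision can be sketched as follows, assuming a recognizer interface that returns a word hypothesis plus scores for the prompted word and a garbage alternative; the interface and threshold are hypothetical.

```python
# Hypothetical sketch of the accept/reject decision. recognise() stands in
# for the actual recogniser and is assumed to return the best word
# hypothesis plus log-scores for the prompted word and a garbage model.
def is_legitimate(audio, prompted_word, recognise, threshold=0.0):
    hyp, word_score, garbage_score = recognise(audio, prompted_word)
    llr = word_score - garbage_score   # log-likelihood ratio
    return hyp == prompted_word and llr > threshold
```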
F Ren, K Matsumoto, S Mitsuyoshi, S Kuroiwa, G Lin
2003 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN AND CYBERNETICS, VOLS 1-5, CONFERENCE PROCEEDINGS, 1666-1672, 2003, Peer-reviewed
In the near future, it will be necessary for senior citizens to nurse other senior citizens because of the declining population of children and the increase of new types of families. We have been developing welfare robots which can support the lives of senior citizens and have the sensibility to lighten the burden imposed by nursing. The measurement of emotions from everyday conversation is considered to be one of the basic research topics for such robots. In this paper, we propose an algorithm for emotion measurement and a prototype system based on this algorithm, and we also discuss its validity.
8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003, 1769-1772, Peer-reviewed
8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003, Peer-reviewed
8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003, 4, 3081-3084, Peer-reviewed
2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, 392-395, 2003, Peer-reviewed
In this study we present blind equalization techniques for the ETSI standard Distributed Speech Recognition (DSR) front-end which compensate for acoustic mismatch caused by input devices. The DSR front-end employs vector quantization (VQ) for feature parameter compression, so the mismatch not only causes a shift of the parameters but also increases VQ distortion. Although CMS is one of the most effective methods to compensate for the shift, it cannot decrease VQ distortion in DSR. To compensate for the shift and decrease VQ distortion simultaneously, the proposed methods estimate the shift in the input data necessary to match the VQ codebook distribution. The methods do not need the acoustic likelihood that is calculated in a decoder on the server side. Therefore, they are applicable to the DSR front-end. The Japanese Newspaper Article Sentences database (JNAS) was used for the equalization experiments. While the word error rate (WER) for the ETSI standard DSR front-end was 18.6% under the acoustically mismatched condition, our proposed method yielded a rate of 12.3%.
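One way to realize such codebook-matching bias estimation is an alternating scheme, sketched below, that repeats a nearest-codeword assignment and a closed-form bias update; the shapes and iteration count are illustrative, and this is not necessarily the paper's exact estimator.

```python
# Sketch of the bias-estimation idea: find a constant cepstral shift that
# minimises VQ distortion between the input features and the codebook.
import numpy as np

def estimate_bias(features, codebook, n_iter=10):
    """features: (T, D) cepstra; codebook: (K, D) VQ centroids."""
    bias = np.zeros(features.shape[1])
    for _ in range(n_iter):
        shifted = features + bias
        dist = ((shifted[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        nearest = codebook[dist.argmin(axis=1)]        # (T, D) assignments
        bias = (nearest - features).mean(axis=0)       # optimal shift given them
    return bias  # apply as features + bias before quantisation
```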
Proceedings of the 46th IEEE International Midwest Symposium on Circuits & Systems, Vols 1-3, 978-981, 2003, Peer-reviewed
In order to construct a good machine translation system or conduct natural language processing research for cross-language information retrieval, you must have a good parallel corpus. The Internet archive contains a lot of parallel documents. To construct a good parallel corpus from the Internet archive, you must have a good bilingual dictionary. This paper describes an algorithm to automatically extract an English/Arabic bilingual dictionary from parallel texts that exist in the Internet archive. The system should preferably be useful for many different language pairs. Unlike most existing systems, our system can extract translation pairs from a very small parallel corpus. This new system can extract translations from as little as two sentences in one language and two sentences in the other, provided the requirements of the system are fulfilled. Moreover, this system is able to extract word pairs that are translations of each other, as well as the explanation of an Arabic or English word in the other language. The accuracy of the system is 59.1% in the case of one English word translated to one Arabic word, 23.9% in the case of one English word translated to more than one Arabic word (an Arabic phrase), and 14.6% in the case of one Arabic word translated to more than one English word (an English phrase).
2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5, 1699-1704, 2002, Peer-reviewed
In this paper, we present a new machine translation (MT) approach using multiple MT engines and sentence partitioning. A multiple-engine MT system consists of several MT engines running in parallel, coordinated by a controller. Each engine is implemented using an existing MT technique and has its own characteristics. When translating a sentence, each engine translates it independently. If more than one engine translates the sentence successfully, the controller chooses the best translation according to a combining algorithm implemented using translation statistics. If no engine succeeds in translating the sentence, the controller partitions the sentence, coordinates the engines to translate its constituent simple sentences, and combines the partial translation results into a translation result for the whole input sentence. A complex sentence is partitioned based on conjunctives and punctuation marks such as commas and semicolons. We have developed a multiple-engine MT system based on the above approach. The system consists of four independent MT engines. The experiments show that the proposed approach is effective for implementing practical MT systems.
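A minimal sketch of the controller logic, assuming each engine is a callable returning a (translation, score) pair or None on failure; the splitting pattern and scoring scale are illustrative assumptions.

```python
# Hypothetical multiple-engine controller: pick the best-scoring engine
# output, or partition at conjunctions/punctuation and recurse.
import re

SPLIT = re.compile(r"\s*(?:,|;|\bbut\b|\band\b)\s*")  # illustrative partition points

def translate(sentence, engines):
    results = [r for e in engines if (r := e(sentence)) is not None]
    if results:
        return max(results, key=lambda r: r[1])[0]   # best-scoring translation
    parts = SPLIT.split(sentence)
    if len(parts) > 1:                               # partition and recurse
        return " ".join(translate(p, engines) for p in parts if p)
    return sentence  # fall back to the untranslated input
```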
2001 IEEE INTERNATIONAL CONFERENCE ON SYSTEMS, MAN, AND CYBERNETICS, VOLS 1-5, 960-965, 2002, Peer-reviewed
The Vector Space Model (VSM) is a conventional information retrieval model which represents a document collection by a term-by-document matrix. Since term-by-document matrices are usually high-dimensional and sparse, they are susceptible to noise, and it is difficult to capture the underlying semantic structure. Additionally, the storage and processing of such matrices place great demands on computing resources. Dimensionality reduction is a way to overcome these problems. Principal Component Analysis (PCA) and Singular Value Decomposition (SVD) are popular techniques for dimensionality reduction based on matrix decomposition; however, they contain both positive and negative values in the decomposed matrices. In the work described here, we use Non-negative Matrix Factorization (NMF) for dimensionality reduction of the vector space model. Since matrices decomposed by NMF only contain non-negative values, the original data are represented by only additive, not subtractive, combinations of the basis vectors. This characteristic of the parts-based representation is appealing because it reflects the intuitive notion of combining parts to form a whole. Since NMF computation is based on a simple iterative algorithm, it is also advantageous for applications involving large matrices. Using the MEDLINE collection, we experimentally showed that NMF offers great improvement over the vector space model.
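For concreteness, the sketch below implements the standard multiplicative update rules of Lee and Seung, one common NMF algorithm consistent with the simple iterative scheme described; the rank and toy matrix are illustrative.

```python
# NMF with multiplicative updates: both factors stay non-negative
# throughout. V is a term-by-document matrix, k the reduced dimensionality.
import numpy as np

def nmf(V, k, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], k))
    H = rng.random((k, V.shape[1]))
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H  # V is approximated by W @ H, all entries non-negative

V = np.random.default_rng(1).random((100, 30))  # toy term-by-document matrix
W, H = nmf(V, k=5)
print("reconstruction error:", np.linalg.norm(V - W @ H))
```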
EUROSPEECH 2001 Scandinavia, 7th European Conference on Speech Communication and Technology, 2nd INTERSPEECH Event, Aalborg, Denmark, September 3-7, 2001, 1099-1102, Peer-reviewed
2001 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOLS I-VI, PROCEEDINGS, 493-496, 2001, Peer-reviewed
We propose an efficient mixture Gaussian synthesis method for decision-tree-based state tying that produces better context-dependent models with a short training time. This method makes it possible to handle mixture Gaussian HMMs in the decision-tree-based state-tying algorithm, and it provides higher recognition performance than the conventional HMM training procedure using decision-tree-based state tying on single Gaussian HMMs. This method also reduces the number of steps in the HMM training procedure because a mixture-incrementing process is not necessary. We applied this method to the training of telephone speech triphones and evaluated its effect on Japanese phonetically balanced sentence tasks. Our method achieved a 1 to 2 point improvement in phoneme accuracy and a 67% reduction in training time.
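One plausible reading of the synthesis step is sketched below: pool the member states' Gaussian components, weight each by its state's occupancy, and keep the heaviest components. This pooling rule is an assumption for illustration, not the paper's exact algorithm.

```python
# Hypothetical mixture-synthesis sketch for a tied state: merge the member
# states' components, weighted by state occupancy, then truncate and
# renormalise the mixture weights.
def synthesize_tied_mixture(states, max_components):
    """states: list of (occupancy, [(weight, mean, var), ...]) per member state."""
    total = sum(occ for occ, _ in states)
    pooled = [
        (occ / total * w, mean, var)
        for occ, comps in states
        for (w, mean, var) in comps
    ]
    pooled.sort(key=lambda c: c[0], reverse=True)   # heaviest components first
    kept = pooled[:max_components]
    norm = sum(w for w, _, _ in kept)
    return [(w / norm, mean, var) for w, mean, var in kept]
```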
This paper describes speech endpoint detection methods for continuous speech recognition systems used over telephone networks. Speech input to these systems may be contaminated not only by various ambient noises but also by various irrelevant sounds generated by users, such as coughs, tongue clicking, lip noises, and certain out-of-task utterances. Under these adverse conditions, robust speech endpoint detection remains an unsolved problem. In fact, we found that speech endpoint detection errors occurred in over 10% of the inputs in field trials of a voice-activated telephone extension system. These errors were caused by (1) low SNR, (2) long pauses between phrases, and (3) irrelevant sounds prior to task sentences. To solve the first two problems, we propose a real-time speech ending-point detection algorithm based on the implicit approach, which finds a sentence end by comparing the likelihood of a complete-sentence hypothesis with those of other hypotheses. For the third problem, we propose a speech beginning-point detection algorithm which rejects irrelevant sounds by using likelihood ratio and duration conditions. The effectiveness of these methods was evaluated under various conditions. As a result, we found that the ending-point detection algorithm was not affected by long pauses and that the beginning-point detection algorithm successfully rejected irrelevant sounds by using phone HMMs that fit the task. Furthermore, a garbage model of irrelevant sounds was also evaluated, and we found that the garbage modeling technique and the proposed method compensated for each other's weak points and that the best recognition accuracy was achieved by integrating these methods. (C) 1999 Elsevier Science B.V. All rights reserved.
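The implicit ending-point rule can be sketched as follows, assuming per-frame scores for the best complete-sentence hypothesis and the best partial hypothesis are available from the decoder; the margin and hold time are illustrative assumptions.

```python
# Sketch of implicit ending-point detection: declare the utterance end once
# the best complete-sentence hypothesis has beaten every incomplete
# hypothesis by a margin sustained over a number of frames.
def ending_point(frame_scores, margin=50.0, hold_frames=30):
    """frame_scores: iterable of (best_complete_logprob, best_partial_logprob)
    per frame; returns the frame index where the end is declared, or None."""
    held = 0
    for t, (complete, partial) in enumerate(frame_scores):
        held = held + 1 if complete - partial > margin else 0
        if held >= hold_frames:
            return t
    return None
```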
IEICE TRANSACTIONS ON INFORMATION AND SYSTEMS E78D(6) 636-641, June 1995, Peer-reviewed
We carried out a one-year field trial of a voice-activated automatic telephone exchange service at KDD Laboratories, which has about 200 branch phones. This system has DSP-based continuous speech recognition hardware which can process incoming calls in real time using a vocabulary of 300 words. The recognition accuracy was found to be 92.5% for speech read from a written text under laboratory conditions, independent of the speaker. In this paper, we describe the performance of the system obtained as a result of the field trial. Apart from recognition accuracy, there was about 20% error due to out-of-vocabulary input and incorrect detection of speech endpoints, which had not been allowed for in the laboratory experiments. Also, we found that the recognition accuracy for actual speech was about 18% lower than for speech read from text, even when there were no out-of-vocabulary words. In this paper, we examine error variations for individual data in order to pinpoint the causes of incorrect recognition. It was found from experiments on the collected data that the pause model used, the filled-pause grammar, and differences in channel frequency response seriously affected recognition accuracy. With the help of simple techniques to overcome these problems, we finally obtained a recognition accuracy of 88.7% for real data.