IEEJ Transactions on Electronics, Information and Systems (The Transactions of the Institute of Electrical Engineers of Japan, C), 129(10), 1902-1907, 2009.
To establish a universal communication environment, computer systems should recognize communication in various modalities. In conventional sign language recognition, recognition is performed at the word level using gesture information such as hand shape and movement, and each feature is given the same weight when calculating the probability used for recognition. We consider hand position to be very important for sign language recognition, since the meaning of a word can differ according to hand position. In this study, we propose a sign language recognition method based on a multi-stream HMM technique to examine the relative importance of position and movement information. We conducted recognition experiments using 28,200 sign language word samples. With the best stream weights (position:movement = 0.2:0.8), 82.1% recognition accuracy was obtained, compared with 77.8% with equal weights. This result demonstrates that movement should be weighted more heavily than position in sign language recognition.
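As a minimal sketch of the multi-stream scoring described above: the per-stream log emission probabilities (one stream for hand position, one for hand movement) are combined linearly with stream weights summing to one. Only the 0.2:0.8 weighting comes from the paper; the function and variable names below are illustrative assumptions, not the authors' implementation.

def combined_log_emission(logp_position, logp_movement):
    # Stream-weighted log emission probability for one HMM state and frame:
    # log b_j(o_t) = w_pos * log b_j(o_t_pos) + w_mov * log b_j(o_t_mov).
    # Only the 0.2:0.8 (position:movement) weighting is from the paper.
    W_POSITION = 0.2  # stream weight for hand-position features
    W_MOVEMENT = 0.8  # stream weight for hand-movement features
    return W_POSITION * logp_position + W_MOVEMENT * logp_movement

During Viterbi decoding, this combined score replaces the single-stream emission term at every state and frame; setting both weights to 0.5 reproduces the equal-weight baseline.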
Our working group on the evaluation of noisy speech recognition was organized in October 2001 under the Special Interest Group of Spoken Language Processing of the Information Processing Society of Japan. The group has developed and distributed the CENSREC series, standard evaluation frameworks with which the many noise-robust speech recognition methods under study can easily be evaluated and compared. In this report, we give an overview of the CENSREC series and situate it within the field based on a survey of the numbers of related presentations at the Acoustical Society of Japan national meetings and IEEE ICASSP, the main venues for speech recognition research. Finally, we discuss future directions.
Super-Function Based Machine Translation (SFBMT), a type of Example-Based Machine Translation, can expand the coverage of examples by changing nouns into variables; however, because only nouns and numbers were changed into variables, entire date/time expressions containing parts of speech other than nouns could not be extracted. We describe a method for extracting date/time expressions for SFBMT. SFBMT uses noun determination rules to extract nouns and a bilingual dictionary to obtain the correspondence of the extracted nouns between the source and target languages. In this method, we add a rule to extract date/time expressions and then extract them from a Japanese-English bilingual corpus. The evaluation results show that the precision of this method is 96.7% with a recall of 98.2% for Japanese sentences, and 94.7% with a recall of 92.7% for English sentences.
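A minimal sketch of the variabilization step, assuming regular-expression rules over English text; the actual method uses extraction rules over a Japanese-English bilingual corpus, and the patterns and names below are hypothetical.

import re

# Hypothetical date/time patterns; the paper's rule set is broader and
# bilingual. These match forms like "10:30" or "March 5, 2009".
DATE_TIME = re.compile(
    r"\b(?:\d{1,2}:\d{2}"
    r"|(?:January|February|March|April|May|June|July|August"
    r"|September|October|November|December)\s+\d{1,2}(?:,\s*\d{4})?)\b"
)

def variabilize(sentence):
    # Replace each date/time expression with a numbered variable so the
    # remaining sentence can serve as a Super-Function template.
    found = DATE_TIME.findall(sentence)
    template = sentence
    for i, expr in enumerate(found):
        template = template.replace(expr, "<DT%d>" % i, 1)
    return template, found

# variabilize("The meeting starts at 10:30 on March 5, 2009.")
# -> ("The meeting starts at <DT0> on <DT1>.", ["10:30", "March 5, 2009"])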
Previous studies suggested that emphasis or emotion causes changes in the hand movements of people using Japanese Sign Language. However, there has not been enough research on changes in signing speed, and only a little research on the duration of sign components (words, transitions, and pauses). In this study, we analyzed how arm movements vary in relation to signing speed. The arm movements used to sign 20 sentences were recorded at three speeds (high, normal, and low) using a motion tracking system, and we analyzed the relationship between signing speed and the size of the gestures or the speed of the arms. We found that a change in signing speed mainly changed the size of the gestures, and that when a gesture was constrained by the location of the arms, the arm speed changed instead.
Improving the performance of noisy speech recognition is an urgent issue for the practical use of speech recognition, and the many methods being studied for this purpose should be compared on a common evaluation framework. Since October 2001 we have been active as a working group on the evaluation of noisy speech recognition under the Special Interest Group of Spoken Language Processing of the Information Processing Society of Japan, constructing and distributing the CENSREC series of standard evaluation frameworks. In this paper, we review the CENSREC series to date and introduce CENSREC-4, the newest framework, for evaluating reverberant speech recognition, which is newly distributed this year. Finally, we describe our policy for designing, constructing, and distributing evaluation frameworks toward the final year of the working group.
We have developed a tool that modifies and re-synthesizes speech based on F0 model parameters. Previous work showed that dependency structure and speaker change/continuation (turn-taking) can be predicted with some accuracy from prosodic information such as average mora duration, power, and F0 model parameters. To apply these results to actual systems, psychological listening experiments are required. In this study, we developed a tool that can freely modify F0 model parameters estimated automatically by Analysis-by-Synthesis (A-b-S) with a genetic algorithm and re-synthesize the speech with the modified parameters using STRAIGHT, providing a GUI environment in which speech samples for listening experiments can be prepared comfortably.
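For reference, a minimal sketch of the F0 model assumed here (Fujisaki's command-response model, the usual "F0 model" in this literature): log F0 is the sum of a base level, phrase components, and accent components. All parameter names and default constants below are illustrative assumptions; a modified contour would then be handed to a re-synthesis engine such as STRAIGHT.

import numpy as np

def phrase_component(t, T0, Ap, alpha=3.0):
    # Response Gp(t) = alpha^2 * t * exp(-alpha * t) to a phrase command
    # (an impulse of magnitude Ap at time T0); zero before the command.
    x = t - T0
    return np.where(x > 0, Ap * alpha**2 * x * np.exp(-alpha * np.abs(x)), 0.0)

def accent_component(t, T1, T2, Aa, beta=20.0, gamma=0.9):
    # Response Ga(t) = min[1 - (1 + beta*t) * exp(-beta*t), gamma] to an
    # accent command of amplitude Aa spanning the interval [T1, T2].
    def ga(x):
        y = 1.0 - (1.0 + beta * x) * np.exp(-beta * np.abs(x))
        return np.where(x > 0, np.minimum(y, gamma), 0.0)
    return Aa * (ga(t - T1) - ga(t - T2))

def f0_contour(t, Fb, phrases, accents):
    # ln F0(t) = ln Fb + sum of phrase responses + sum of accent responses.
    lnf0 = np.full_like(t, np.log(Fb), dtype=float)
    for T0, Ap in phrases:
        lnf0 += phrase_component(t, T0, Ap)
    for T1, T2, Aa in accents:
        lnf0 += accent_component(t, T1, T2, Aa)
    return np.exp(lnf0)

# Example: enlarging an accent amplitude Aa and re-synthesizing changes
# perceived emphasis while leaving segmental content intact.
# t = np.linspace(0.0, 2.0, 200)
# f0 = f0_contour(t, Fb=120.0, phrases=[(0.0, 0.5)], accents=[(0.2, 0.6, 0.4)])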
In this paper, we report speaker verification experiments using a large-scale bone-conducted speech database constructed by the National Research Institute of Police Science, Japan. The experiments used speech from 664 speakers (336 male, 328 female) recorded with a condenser microphone (air-conducted speech) and with a bone-conduction microphone (bone-conducted speech). We evaluated our previously proposed verification method, which uses rank information obtained from multiple speaker models. We also compared GMMs and vector quantization (VQ) centroids as speaker models, and compared verification accuracy across recording sessions. The experimental results show that the proposed method achieves higher verification accuracy than a conventional method using T-norm scores. The comparison of speaker models shows that VQ centroids give higher accuracy for air-conducted speech, whereas GMMs give higher accuracy for bone-conducted speech. Furthermore, verification accuracy for bone-conducted speech is lower than for air-conducted speech, and degrades markedly when the registration and test recordings come from different sessions.
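The two scoring schemes compared above can be summarized in a short sketch. Here, a score stands for some match score of a test utterance against one speaker model (e.g., a GMM log-likelihood or a negative VQ distortion); the function names are illustrative, not from the paper.

import numpy as np

def tnorm_score(target_score, cohort_scores):
    # Conventional T-norm: normalize the claimed speaker's score by the
    # mean and standard deviation of scores from a cohort of other
    # speaker models, then threshold the normalized score.
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return (target_score - cohort_scores.mean()) / cohort_scores.std()

def target_rank(target_score, cohort_scores):
    # Rank information: the claimed speaker's position among all speaker
    # models (1 = best). A genuine speaker should rank near the top, so
    # the accept/reject decision is made on this rank rather than on the
    # raw score.
    cohort_scores = np.asarray(cohort_scores, dtype=float)
    return 1 + int(np.sum(cohort_scores > target_score))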
Aiming to annotate transcriptions of debates and meetings with utterance impressions, we have been studying methods for estimating utterance impression from prosodic information. In this study, we newly analyzed utterance impression using the accent and phrase components extracted with the F0 model as prosodic features. We also analyzed how to attach the estimated utterance impression to a transcription; since the target is transcriptions of speech such as debates and meetings, we focused on text decoration such as font weight and size and on the addition of symbols such as question and exclamation marks. We applied these textual representations to transcriptions of dialogue speech and conducted a subjective evaluation of utterance impression, which revealed differences between the impressions conveyed by speech and by text.