IECON 2011: 37th Annual Conference of the IEEE Industrial Electronics Society, 1300-1305, 2011
This paper presents a novel class-E-M power amplifier with low harmonic content and high output power. In the proposed amplifier, a symmetric configuration is applied to the class-E-M topology. With this symmetric configuration, the proposed amplifier achieves extremely low total harmonic distortion and four times the output power of a conventional single class-E-M power amplifier. To achieve the class-E-M ZVS/ZDVS/ZCS/ZDCS conditions, the MOSFET drain-to-source nonlinear parasitic capacitances, the finite dc-feed inductance, the equivalent series resistances of the inductors, and the switch-on resistances are taken into account in the circuit design. A design example is presented along with PSpice-simulated and experimental waveforms at a 3.5 MHz operating frequency. The waveforms from the PSpice simulations and circuit experiments agree quantitatively with the numerical predictions, which validates the effectiveness of the proposed class-E-M power amplifier.
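Total harmonic distortion, the figure of merit the symmetric configuration is claimed to improve, is computed from the harmonic amplitudes of the output waveform. A minimal sketch of the standard definition (the amplitude values below are illustrative, not from the paper):

```python
import math

def total_harmonic_distortion(amplitudes):
    """THD = sqrt(V2^2 + V3^2 + ...) / V1, where V1 is the fundamental
    amplitude and V2, V3, ... are the harmonic amplitudes."""
    fundamental, harmonics = amplitudes[0], amplitudes[1:]
    return math.sqrt(sum(v * v for v in harmonics)) / fundamental

# Example: 1.0 V fundamental with 2nd/3rd harmonics of 0.10 V and 0.05 V
thd = total_harmonic_distortion([1.0, 0.10, 0.05])
print(round(thd, 4))  # 0.1118, i.e. about 11.2 % THD
```

Suppressing the even harmonics, as a symmetric (push-pull-like) configuration does, shrinks the terms under the square root and hence the THD.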
This paper introduces CENSREC-1-AV, a common evaluation framework for audio-visual multimodal speech recognition. CENSREC-1-AV provides an audio-visual speech database and a baseline recognition system. Speech signals were recorded under clean conditions for training, and in-car driving noise was superimposed on the test data. Color and near-infrared images were captured, and test images simulating in-car driving conditions were generated using gamma correction. In the baseline system, acoustic MFCCs and either eigenface or optical-flow information are adopted as the audio and visual features, respectively, and multi-stream HMMs are used as the recognition model.
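The multi-stream HMM in the baseline combines the audio and visual streams with exponent weights on the per-stream state log-likelihoods. A minimal sketch of that combination rule; the weights and log-likelihood values are illustrative assumptions, not taken from CENSREC-1-AV:

```python
def multistream_log_likelihood(logp_audio, logp_visual, w_audio=0.7):
    """Combine per-stream state log-likelihoods with exponent weights
    (weights sum to 1; a larger audio weight trusts the acoustic stream more)."""
    w_visual = 1.0 - w_audio
    return w_audio * logp_audio + w_visual * logp_visual

# When the audio stream is degraded by car noise, its weight can be lowered
# so the visual stream dominates the state score.
clean = multistream_log_likelihood(-4.0, -6.0, w_audio=0.7)  # -4.6
noisy = multistream_log_likelihood(-4.0, -6.0, w_audio=0.3)  # -5.4
print(clean, noisy)
```

In practice the stream weights are tuned per noise condition on held-out data.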
The Transactions of the Institute of Electrical Engineers of Japan. C, A Publication of the Electronics, Information and Systems Society, 130(5), 863-872, 2010
In this paper, a new framework is presented for removing mixed impulse and Gaussian noise from images; it combines the Fuzzy Impulse Noise Detection and Reduction Method (FINDRM) with a directional difference and the Bivariate Shrinkage Function (BSF) in the Dual-Tree Complex Wavelet Transform (DT-CWT) domain. First, the noise detection phase of the FINDRM determines whether each pixel is an impulse. When a pixel is detected as impulse noise, the FINDRM with the directional difference restores it. Second, Gaussian noise is removed using the BSF, which exploits the relationships between wavelet coefficients in the DT-CWT domain. Applying the proposed framework to an image corrupted by mixed noise yields a clean image.
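The BSF models each wavelet coefficient jointly with its parent coefficient at the coarser scale. A hedged sketch of the standard bivariate shrinkage rule of Şendur and Selesnick, which the abstract builds on; the coefficient and deviation values here are illustrative:

```python
import math

def bivariate_shrink(w, w_parent, sigma_n, sigma):
    """Shrink coefficient w jointly with its parent w_parent:
    w_hat = w * max(0, r - sqrt(3)*sigma_n^2/sigma) / r,
    with r = sqrt(w^2 + w_parent^2), sigma_n the noise deviation
    and sigma the local signal deviation."""
    r = math.sqrt(w * w + w_parent * w_parent)
    gain = max(0.0, r - math.sqrt(3.0) * sigma_n ** 2 / sigma) / max(r, 1e-12)
    return w * gain

# A large coefficient survives (slightly shrunk); a small one is zeroed.
print(bivariate_shrink(5.0, 0.0, sigma_n=1.0, sigma=1.0))  # 5 - sqrt(3)
print(bivariate_shrink(1.0, 0.0, sigma_n=1.0, sigma=1.0))  # 0.0
```

A strong parent coefficient raises r and thus protects a weak child, which is the inter-scale dependency the abstract refers to.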
In natural language processing (NLP), the units of processing are semantically coherent units such as sentences, so sentence boundaries must be marked in automatic speech recognition (ASR) output. In this study, we first propose a way of constructing a feature space suited to sentence boundary detection with a Support Vector Machine (SVM), by taking into account how likely each word is to appear immediately before a sentence boundary. We then examine using the word confidence measures output by the recognizer along with the recognition results as additional features for sentence boundary detection. We evaluated these methods with an SVM on the Corpus of Spontaneous Japanese (CSJ).
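The first proposed feature rests on how often each word is followed by a sentence boundary, which can be estimated from a boundary-labeled corpus. A minimal sketch of that estimation; the words and labels below are made-up examples, not CSJ data:

```python
from collections import Counter

def boundary_probabilities(labeled_words):
    """For each word, estimate the probability that a sentence boundary
    immediately follows it. labeled_words: (word, boundary_follows) pairs."""
    total = Counter()
    at_boundary = Counter()
    for word, boundary_follows in labeled_words:
        total[word] += 1
        if boundary_follows:
            at_boundary[word] += 1
    return {w: at_boundary[w] / total[w] for w in total}

corpus = [("desu", True), ("no", False), ("desu", True),
          ("masu", True), ("ga", False), ("desu", False)]
probs = boundary_probabilities(corpus)
print(probs["desu"])  # 2/3: "desu" often ends a sentence
```

These per-word statistics, together with word confidence scores, would then populate the SVM feature vectors.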
In this paper, we introduce a speaker verification method that accepts or rejects a claimant based on the rank of the claimant's target model among a large number of enrolled speaker models (GMMs), instead of score normalization such as T-norm or Z-norm. The method is more accurate than standard T-norm, but, like T-norm, it requires computing likelihoods against many cohort models. We therefore also propose a speed-up that selects cohort speakers for each target speaker in the training stage based on likelihood ranks; this data-driven selection significantly reduces computation time and yields faster verification decisions. We conducted text-independent speaker verification experiments using air-conducted speech of 283 Japanese male speakers from the large-scale speaker recognition corpus constructed by the National Research Institute of Police Science. Rank statistics over the full cohort of 282 speakers achieved a minDCF of 0.0092, while the proposed method achieved comparable performance with far fewer cohort speakers: 0.0098 with an average of 57 and 0.0094 with 101. T-norm scoring gave a minDCF of 0.0154.
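The core decision rule can be stated in rank terms: the claimant is accepted when the target model's likelihood ranks high enough among the cohort models. A simplified sketch; the scores and threshold are illustrative, not the paper's operating point:

```python
def rank_of_target(target_score, cohort_scores):
    """Rank (1 = best) of the target model's log-likelihood among cohorts."""
    return 1 + sum(s > target_score for s in cohort_scores)

def verify(target_score, cohort_scores, rank_threshold=3):
    """Accept the claimant when the target model ranks within the threshold,
    instead of normalizing the raw score as T-norm/Z-norm would."""
    return rank_of_target(target_score, cohort_scores) <= rank_threshold

cohort = [-12.0, -10.5, -9.8, -15.2, -11.1]  # cohort log-likelihoods
print(verify(-9.5, cohort))   # target outranks every cohort model: accept
print(verify(-13.0, cohort))  # target ranks 5th of 6: reject
```

Shrinking `cohort` from hundreds of speakers to a pre-selected subset is exactly what cuts the verification cost, at the price of a slightly noisier rank.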
We have been developing an accompaniment system that uses the soloist's breath as a cue for controlling the accompaniment; in a previous study, we introduced a method that uses breath cues at the beginning of a musical piece. In this study, we propose a method that uses breath cues not only at the beginning but also during a piece. We implemented the system and conducted an evaluation experiment with human performers. The results suggest that the breath cues reduced the timing gap between the soloist and the system, and the performers also rated the new system higher in subjective evaluation than the previous one.
In this research, we analyzed overlap phenomena at turn-taking points in Japanese Sign Language dialogue. Spontaneous dialogues were recorded in an environment where the signers could see each other via prompters, and three dialogues by six native signers were used for the analysis. First, we found that overlaps at turn-taking points occurred with very high frequency (75%). Second, we analyzed these phenomena based on the turn-taking system for conversation proposed by H. Sacks, E. A. Schegloff, and G. Jefferson, and identified situations where the current signer continued the utterance past a transition-relevance place (TRP) while the next signer started a turn by recognizing or projecting that TRP, so that an overlap occurred. We regard these overlaps as normal turn-taking. Finally, the turn-taking rule was broken in only a few cases (18%); the remaining cases followed the rule.
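The overlap frequency reported above can be measured from time-aligned turn annotations: a transition overlaps when the next signer starts before the current turn ends. A minimal sketch with made-up timing data (the annotation format is an assumption, not the paper's):

```python
def overlap_rate(turns):
    """Fraction of turn transitions in which the next turn starts before the
    previous turn ends. turns: time-ordered (signer, start, end) tuples."""
    transitions = overlaps = 0
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(turns, turns[1:]):
        if spk_a != spk_b:           # a turn-taking point
            transitions += 1
            if start_b < end_a:      # next signer began before the turn ended
                overlaps += 1
    return overlaps / transitions if transitions else 0.0

turns = [("A", 0.0, 2.0), ("B", 1.8, 4.0), ("A", 4.2, 6.0), ("B", 5.9, 7.0)]
print(overlap_rate(turns))  # 2 of 3 transitions overlap
```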
In recent years, a great deal of research has been done on systems that automatically transcribe speech. This research has generally focused on transcribing utterances correctly; what is also needed, however, is a transcription style that makes the content of a discussion easier for readers to understand. Linguistic information alone is insufficient for conveying a discussion accurately; information such as the scene, the speaker's intentions, and emotions is also needed. In this study, aiming to annotate transcriptions of meetings and discussions with utterance intentions, we analyzed readers' and listeners' impressions of utterances from both text and speech. First, focusing on text decorations such as changes in character weight and size and on symbols such as "!" and "?", we conducted subjective evaluation experiments in which such variations were added to transcriptions, and examined how strongly impressions such as "doubt" and "surprise" were perceived. We conducted similar subjective evaluations for speech, and analyzed the relation between prosodic parameters such as F0 and power and utterance impressions using multiple linear regression. As a result, the relationships between each text variation, the prosodic parameters, and the corresponding utterance impressions were clarified. A comprehensive analysis further showed that some impressions are perceived differently from text and from speech, while others show the same tendency.
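The regression step relates prosodic parameters to subjective impression scores. A hedged sketch of a multiple linear regression in that spirit; the F0, power, and rating values are illustrative, not data from the study:

```python
import numpy as np

# One row per utterance: mean F0 (Hz) and mean power (dB).
X = np.array([[180.0, 62.0], [220.0, 70.0], [150.0, 58.0], [250.0, 74.0]])
ratings = np.array([2.1, 3.8, 1.5, 4.6])  # e.g. "surprise" on a 1-5 scale

# Add an intercept column and fit the coefficients by least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, ratings, rcond=None)
predicted = A @ coef
print(np.round(coef, 3))  # [intercept, F0 weight, power weight]
```

The fitted weights indicate how much each prosodic parameter contributes to the impression, which is the kind of relationship the analysis clarifies.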