JP2000047683A

JP2000047683A - Segmentation aids and media

Info

Publication number: JP2000047683A
Application number: JP10216261A
Authority: JP
Inventors: Ikuyo Katsuse; 郁代勝瀬; Hidetsugu Maekawa; 英嗣前川
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 1998-07-30
Filing date: 1998-07-30
Publication date: 2000-02-18

Abstract

(57) [Abstract] [Problem] Speech synthesis technology requires high-precision speech segmentation as preprocessing, but fully automatic, error-free segmentation of continuous, spontaneous speech is extremely difficult. At the same time, neither immediate processing nor complete automation is essential, so human work is involved. An object of the present invention is to construct an interface that enables a person without advanced expertise to segment speech and similar signals with high precision. [Solution] The device comprises an automatic segmentation unit 12 that calculates segmentation candidates, and a GUI-controlled correction unit 13 that displays those candidates on a screen so that the operator can carry out the segmentation work by selecting or correcting them while checking them by listening or by reading the display. With this configuration, even an operator without advanced expertise can easily achieve high-precision segmentation.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a signal segmentation method and, more particularly, to a segmentation assisting device with which an operator performing speech segmentation work carries out the phoneme segmentation required for synthesizing artificial speech.

[0002]

[Prior Art] In technologies for generating synthesized speech and for automatic speech recognition, segmenting and labeling speech is an extremely important preprocessing step. Many techniques for automatic segmentation have been reported, for example the paper published in IEICE Transactions D-II, Vol. 73, No. 1, pp. 1-9, and Japanese Patent Publication Nos. 6-337692 and 7-13587. The IEICE paper is described here in detail as an example. It reports an expert system that performs phoneme segmentation of continuous speech using the methods and knowledge that experts use. The proposed system detects the phonemes in the input speech and determines their boundaries according to expert knowledge and strategies codified as rules. To represent ambiguous knowledge that depends on the phonemic environment (the types of the preceding and following phonemes) and on typical phonemic variation, it performs hypothesis-based inference using certainty factors, and extracts acoustic features under a hypothesis about the phonemic environment. FIG. 15 shows the phoneme determination strategy used by the system. The input speech is spectrally analyzed, and the certainty that it is a given phoneme is calculated from the characteristics of its power spectrum. This is the detection of phoneme candidates. Next, a phonemic environment is hypothesized for each hypothesized phoneme candidate, and the certainty of the phoneme boundary is obtained under that hypothesis. The certainty of the phoneme candidate hypothesis and the certainty of the phoneme boundary are then combined to determine the likelihood of the phoneme boundary. When several phoneme boundaries are candidates, the one with the highest certainty is chosen. With speech segmentation performed in this way, 90% or more of the phoneme boundaries were detected with a boundary deviation of 10 ms or less, although this result covers consonants only. Insertion errors, however, amounted to 30% to 40%. An insertion error is an error in which a label is assigned for a phoneme that is not actually present. For example, if the correct label is "kaNri" and the label "kaeNri" is assigned, the "e" has been added by mistake and is an insertion error. Because insertion errors cannot be expressed as a positional deviation from the correct answer, they are evaluated separately.
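
The two measures mentioned above (boundaries detected within 10 ms, and insertion errors) can be illustrated with a short sketch. The function names, the tolerance parameter, the sample boundary times, and the use of Python's `difflib` alignment are all assumptions made for illustration; the scoring procedure actually used in the cited paper is not given in this document.

```python
from difflib import SequenceMatcher

def inserted_labels(reference, hypothesis):
    """Return hypothesis labels that have no counterpart in the reference."""
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    inserted = []
    for tag, _, _, j1, j2 in matcher.get_opcodes():
        if tag == "insert":  # labels present only in the hypothesis
            inserted.extend(hypothesis[j1:j2])
    return inserted

def boundaries_within(reference_ms, detected_ms, tolerance_ms=10.0):
    """Fraction of reference boundaries with a detected boundary within the tolerance."""
    hits = sum(
        any(abs(ref - det) <= tolerance_ms for det in detected_ms)
        for ref in reference_ms
    )
    return hits / len(reference_ms)

# Example from the text: the correct label is "kaNri", the assigned label is "kaeNri".
print(inserted_labels(list("kaNri"), list("kaeNri")))  # ['e']

# Hypothetical boundary times in milliseconds: 2 of the 3 reference boundaries are hit.
print(boundaries_within([100, 250, 400], [104, 262, 395, 330]))
```

The extra detected boundary at 330 ms does not lower the second score, which is why insertion errors have to be counted separately.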

[0003]

[Problems to Be Solved by the Invention] With the prior art, it is extremely difficult to segment and label continuous, naturally spoken speech fully automatically and without error. When synthesized speech is generated in particular, people listen to the speech synthesized from the segmentation results, so high segmentation accuracy is required. On the other hand, unlike an automatic speech recognition device, real-time processing is not required: the segmentation results can be prepared in advance and stored as a database. Where complete automation is thus unnecessary and high-precision segmentation is required, human work can be interposed. In such cases, segmentation has conventionally been carried out on the basis of the speech waveform and power spectrum, but such high-precision segmentation work demands advanced expertise.

[0004] In view of the problem that conventional high-precision segmentation work requires advanced expertise, an object of the present invention is to provide a segmentation assisting device that enables even a person without advanced expertise to perform high-precision speech segmentation.

[0005]

[Means for Solving the Problems] To solve the above problem, a first aspect of the present invention (corresponding to claim 1) is a segmentation assisting device comprising: signal input means for inputting an acoustic signal or an image signal; automatic segmentation means for automatically segmenting the acoustic signal or image signal input by the signal input means and calculating segment boundary candidates; and correction means for displaying on a screen the segment boundary candidates calculated by the automatic segmentation means and performing segmentation by selecting or correcting the candidates while they are checked under GUI control.

[0006] A second aspect of the present invention (corresponding to claim 2) is the segmentation assisting device according to the first aspect, further comprising text input means, provided separately from the signal input means, for inputting text that is used when the automatic segmentation means performs automatic segmentation and/or when the correction means selects or corrects the segment boundary candidates under GUI control.

[0007] A third aspect of the present invention (corresponding to claim 3) is the segmentation assisting device according to the second aspect, further comprising phoneme symbol notation conversion means for converting the text input by the text input means into phoneme symbol notation, wherein the phoneme symbol notation is used, in place of the text, by the automatic correction means and/or by the correction means.

[0008] A fourth aspect of the present invention (corresponding to claim 4) is the segmentation assisting device according to any one of the first to third aspects, wherein the automatic segmentation means calculates phoneme boundary candidates by acoustic signal processing that uses the zero-crossing count, pitch, power, speech formants, cepstrum, or phase of the speech signal.

[0009] A fifth aspect of the present invention (corresponding to claim 5) is the segmentation assisting device according to the second or third aspect, wherein the automatic segmentation means calculates phoneme boundary candidates by syllable or phoneme matching using the text or the phoneme symbol notation.

[0010] A sixth aspect of the present invention (corresponding to claim 6) is the segmentation assisting device according to the second or third aspect, wherein the automatic segmentation means calculates phoneme boundary candidates by the acoustic signal processing and also calculates phoneme boundary candidates by syllable or phoneme matching.

[0011] A seventh aspect of the present invention (corresponding to claim 7) is the segmentation assisting device according to the sixth aspect, wherein the automatic segmentation means compares the number of phoneme boundary candidates given by the acoustic signal processing using the zero-crossing count, pitch, power, speech formants, cepstrum, or phase of the speech signal with the number of phoneme boundary candidates obtained from the phoneme symbol notation into which the text input by the text input means has been converted, and reduces the number of phoneme boundary candidates given by the acoustic signal processing on the basis of a predetermined criterion.

[0012] An eighth aspect of the present invention (corresponding to claim 8) is the segmentation assisting device according to the seventh aspect, wherein the automatic segmentation means reduces the number of phoneme boundary candidates given by the acoustic signal processing by using the phoneme boundary candidates calculated by the syllable or phoneme matching.

[0013] A ninth aspect of the present invention (corresponding to claim 9) is a segmentation assisting device comprising: input means for inputting an acoustic signal or an image signal together with segmentation candidates obtained by segmenting that signal on some basis; and correction means with which an operator performs segmentation correction work by selecting, moving, deleting, or adding the segmentation candidates via a GUI.

[0014] A tenth aspect of the present invention (corresponding to claim 10) is the segmentation assisting device according to any one of the first to third aspects or the ninth aspect, wherein the correction means performs the segmentation work by selecting, moving, deleting, or adding any of the phoneme boundary candidates.

[0015] An eleventh aspect of the present invention (corresponding to claim 11) is the segmentation assisting device according to any one of the first to third aspects or the ninth aspect, wherein the correction means corrects, by listening, the speech segment between any of the phoneme boundary candidates.

[0016] A twelfth aspect of the present invention (corresponding to claim 12) is the segmentation assisting device according to any one of the first to third aspects or the ninth aspect, wherein the correction means performs speech synthesis based on the segmentation result selected by the operator via the GUI so that a further correction can be made by listening.

[0017] A thirteenth aspect of the present invention (corresponding to claim 13) is a medium storing a program for realizing all or part of the functions of the means in any one of the first to twelfth aspects.

[0018]

[Embodiments of the Invention] Embodiments of the present invention are described below with reference to the drawings.

[0019] (Embodiment 1) In FIG. 1, 11 is a signal input unit that inputs a speech/acoustic signal; 12 is an automatic segmentation unit that automatically performs phoneme segmentation and calculates segment boundary candidates; 13 is a correction unit that, under GUI control, corrects the segment boundary candidates calculated by the automatic segmentation unit 12; and 14 is a segmentation result output unit that outputs the phoneme segmentation result corrected under GUI control.

[0020] FIG. 10 is described here. 102 is a display area that shows the progress and results of the segmentation work superimposed on the waveform of the speech/acoustic signal; 101 is a Y-axis scale that sets the Y-axis display magnification of the display area; 105 is a work area in which segmentation candidates are selected, corrected, moved, and deleted with the mouse; 103 is an X-axis scale that sets the display magnification of the work area 105 in the X-axis (time axis) direction; and 104 is a Y-axis scale that sets the Y-axis display magnification. As described later, 106 is the delimiter text, which displays the input phoneme symbol notation. In FIG. 10, the delimiter text 106 shows the phoneme symbol notation "Ka-soruido-, Ka-dogameN" (カーソル移動、カード画面, "cursor movement, card screen"). 110 is an initialization button that initializes the segmentation work; 111 is a recognition result clear button that erases the results of the automatic phoneme segmentation performed by the automatic segmentation unit 12 and the results of candidate corrections made in the work area 105; 112 is a label candidate confirmation button that finalizes and outputs the corrected result of the phoneme segmentation performed by the automatic segmentation unit 12; and 107 is a play button that plays back the input speech/acoustic signal. As described later, 108 is a confirmation playback button that actually synthesizes speech from the phoneme segmentation result corrected in the work area 105 so that it can be checked by listening. 109 is a recognition execution button that automatically performs phoneme segmentation and calculates segment candidates; 113 to 116 assign labels to segmentation candidates; 117 is a candidate correction button that moves a segment boundary candidate along the X axis (time axis); 118 is a candidate deletion button that deletes a segment boundary candidate; and 119 is a candidate selection button that selects, from the segment boundary candidates, those that actually become segment boundaries. As described later, 120 is a candidate confirmation button that plays back the interval between segment boundary candidates so that it can be checked by listening. 121 is a candidate addition button that adds a new segment boundary candidate.

[0021] Next, the operation of this embodiment is described. In FIG. 1, the automatic segmentation unit 12 automatically performs phoneme segmentation on the speech/acoustic signal input through the signal input unit 11. The GUI-controlled correction unit 13 displays the automatic segmentation result on the screen, for example as in FIG. 10. The operator can correct the segmentation result by operating a mouse or similar device. The segmentation result corrected by the operator is sent to the segmentation result output unit 14 and output.

[0022] The above operation is now described by relating FIG. 10 to FIG. 1. The operator clicks the initialization button 110 with the mouse to clear the previous phoneme segmentation work, then specifies a file name to load a speech/acoustic signal. This corresponds to the signal input unit 11 reading the speech/acoustic signal. Next, the operator clicks the recognition execution button 109 to run automatic segmentation and calculate segment boundary candidates. This corresponds to the automatic segmentation unit 12 automatically performing phoneme segmentation. In FIG. 10, many vertical lines are displayed over the speech/acoustic signal in the work area 105 and the display area 102; these are the segment boundary candidates calculated by the automatic segmentation unit 12. Clicking the candidate correction button 117 or pressing function key F5 puts the screen into correction mode, in which a segment boundary candidate displayed in the work area 105 can be selected with the mouse. The selected candidate can then be moved along the X axis by moving the mouse. FIG. 10 shows the state in which the candidate correction button is selected. Clicking the candidate deletion button 118 or pressing function key F6 puts the screen into deletion mode, in which a candidate displayed in the work area 105 is deleted by selecting it with the mouse. Clicking the candidate selection button 119 or pressing function key F7 puts the screen into selection mode, in which a candidate displayed in the work area 105 can be selected with the mouse. The selected candidate becomes a segment boundary and is displayed in a different color in the work area 105 and the display area 102. Clicking the candidate confirmation button 120 or pressing function key F8 puts the screen into confirmation mode, in which the speech between the selected boundary candidates among those displayed in the work area 105 is played back and can be checked by listening. Clicking the candidate addition button 121 or pressing function key F9 puts the screen into addition mode, in which clicking in the work area 105 adds a segment boundary candidate at the corresponding X-axis position. The segment boundary candidates are corrected using functions 117 to 121 in this way. This corresponds to the GUI-controlled correction unit 13 correcting the segment boundary candidates.
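
The editing operations behind buttons 117 to 121 can be pictured as operations on a list of candidate times. The following is only a minimal sketch of such a data model: the class name `BoundaryEditor`, its methods, and the representation of candidates as times in seconds are hypothetical and are not taken from the patent, which does not describe an implementation.

```python
from dataclasses import dataclass, field

@dataclass
class BoundaryEditor:
    """Candidate times (seconds) and the subset confirmed as segment boundaries."""
    candidates: list = field(default_factory=list)
    selected: set = field(default_factory=set)

    def move(self, old_time, new_time):
        # Correction mode (button 117 / F5): shift a candidate along the time axis.
        index = self.candidates.index(old_time)
        self.candidates[index] = new_time
        if old_time in self.selected:
            self.selected.remove(old_time)
            self.selected.add(new_time)
        self.candidates.sort()

    def delete(self, time):
        # Deletion mode (button 118 / F6).
        self.candidates.remove(time)
        self.selected.discard(time)

    def select(self, time):
        # Selection mode (button 119 / F7): promote a candidate to a segment boundary.
        if time not in self.candidates:
            raise ValueError("not a candidate")
        self.selected.add(time)

    def add(self, time):
        # Addition mode (button 121 / F9).
        if time not in self.candidates:
            self.candidates.append(time)
            self.candidates.sort()

    def segments(self):
        # Intervals between consecutive selected boundaries,
        # e.g. the spans to play back in confirmation mode (button 120 / F8).
        bounds = sorted(self.selected)
        return list(zip(bounds, bounds[1:]))

editor = BoundaryEditor(candidates=[0.10, 0.18, 0.25, 0.40])
editor.delete(0.18)
editor.move(0.25, 0.27)
editor.add(0.33)
for t in (0.10, 0.27, 0.40):
    editor.select(t)
print(editor.candidates)  # [0.1, 0.27, 0.33, 0.4]
print(editor.segments())  # [(0.1, 0.27), (0.27, 0.4)]
```

Keeping the selected boundaries as a subset of the candidates mirrors the description above, where a selected candidate stays on screen and is only shown in a different color.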

[0023] As described above, according to this embodiment, the phoneme boundary candidates calculated by the automatic segmentation unit 12 are presented on the screen and are selected and corrected through the GUI-controlled correction unit 13, so that even an operator who is not highly skilled can perform high-precision speech segmentation.

[0024] (Embodiment 2) This embodiment adds a text input unit to the first embodiment.

[0025] In FIG. 2, 11 is a signal input unit, 21 is a text input unit, 12 is an automatic segmentation unit, 13 is a GUI-controlled correction unit, and 14 is a segmentation result output unit. In this embodiment, text corresponding to the speech input to the signal input unit 11 is input to the text input unit 21. The automatic segmentation unit 12 can perform phoneme segmentation using not only the speech/acoustic information obtained from the signal input unit 11 but also the text information obtained from the text input unit 21. Likewise, the GUI-controlled correction unit 13 can correct the segment boundary candidates with reference to the text input from the text input unit 21.

[0026] (Embodiment 3) This embodiment adds a phoneme symbol notation conversion unit 31 to the second embodiment. In FIG. 3, 11 is a signal input unit, 21 is a text input unit, 12 is an automatic segmentation unit, 31 is a phoneme symbol notation conversion unit that automatically converts the input text into phoneme symbol notation, 13 is a GUI-controlled correction unit, and 14 is a segmentation result output unit. The phoneme symbol notation conversion unit 31 converts the text input to the text input unit 21 into phoneme symbol notation, using its own dictionary in which words are paired with grammar. For example, if "カーソル移動、カード画面" ("cursor movement, card screen") is input to the text input unit 21, the phoneme symbol notation conversion unit 31 automatically converts it into "Ka-soruido-, Ka-dogameN", as shown in the delimiter text 106 of FIG. 10.

[0027] As described above, this embodiment makes it possible to input to the text input unit 21 text that is not in phoneme notation, such as sentences written in mixed kana and kanji.

[0028] (Embodiment 4) This embodiment describes an example in which the automatic segmentation in the automatic segmentation unit 12 of Embodiment 2 or 3 is performed by acoustic signal processing. The automatic segmentation unit 12 in FIG. 3 extracts acoustic parameters that often change characteristically at phoneme boundaries. The extracted parameters are, for example, the zero-crossing count, pitch, power, speech formants, and cepstrum of the speech signal. The temporal rate of change is calculated for each extracted parameter, and the times or sample points at which these rates of change reach a local maximum are calculated. These times and sample points are output from the automatic segmentation unit 12 as phoneme boundary candidates.

[0029] The parameters extracted by the acoustic signal processing of the present invention are not limited to the zero-crossing count, power, speech formants, and cepstrum of the above embodiment; any parameter that changes characteristically at phoneme boundaries, such as phase, may be used.

[0030] Furthermore, the acoustic signal processing of the present invention is not limited to calculating the maximum points of the rate of change as in the above embodiment; any calculation method that can detect phoneme boundaries, such as using the values of the acoustic parameters themselves, may be used.
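
The procedure of Embodiment 4 (extract a parameter per frame, compute its change over time, and take the local maxima of that change as candidates) can be sketched as follows. This is a minimal illustration for two of the named parameters, power and zero-crossing count. The non-overlapping framing, the function names, and the toy signal are assumptions; the patent does not specify frame lengths, smoothing, or thresholds.

```python
def frame_features(samples, frame_len):
    """Per-frame (power, zero-crossing count) for non-overlapping frames."""
    features = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
        features.append((power, crossings))
    return features

def boundary_candidates(values, frame_len):
    """Sample indices where the frame-to-frame change of a parameter peaks locally."""
    change = [abs(b - a) for a, b in zip(values, values[1:])]
    candidates = []
    for i, c in enumerate(change):
        left = change[i - 1] if i > 0 else 0.0
        right = change[i + 1] if i + 1 < len(change) else 0.0
        if c > left and c >= right and c > 0:
            candidates.append((i + 1) * frame_len)  # start of the frame after the change
    return candidates

# Toy signal: silence, then a rapidly alternating segment, then a steady segment.
silence = [0.0] * 40
noise = [1.0 if i % 2 == 0 else -1.0 for i in range(40)]
steady = [0.5] * 40
features = frame_features(silence + noise + steady, frame_len=10)
print(boundary_candidates([p for p, _ in features], frame_len=10))  # [40, 80]
print(boundary_candidates([z for _, z in features], frame_len=10))  # [40, 80]
```

On real speech each parameter would yield its own, partly different set of candidates, which is why the device presents them all to the operator rather than committing to one.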

[0031] (Embodiment 5) This embodiment describes an example in which the automatic segmentation in the automatic segmentation unit of Embodiment 2 or 3 is performed by syllable or phoneme matching. The automatic segmentation unit 12 in FIG. 3 performs syllable or phoneme matching. Matching using a hidden Markov model (HMM) is performed between training data for speech recognition prepared in advance and the speech data input through the signal input unit 11. The sample points or times of the speech/acoustic signal to which each syllable or phoneme has been assigned are then output from the automatic segmentation unit 12 as phoneme boundary candidates.

[0032] The syllable or phoneme matching of the present invention is not limited to the hidden Markov model of the above embodiment; any matching method that allows time warping (that is, one that does not depend on utterance duration), such as DP matching, may be used.
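
As a rough illustration of matching with time warping, the sketch below uses a simplified form of DP matching: each frame is assigned, in order, to one of a known sequence of phoneme prototypes so that the total distance is minimal, and the frames where the assignment advances are the boundary candidates. Scalar features, a single prototype value per phoneme, and the function name are simplifying assumptions; this is not the HMM matching of the embodiment, and real systems use multidimensional features.

```python
def align_boundaries(frames, prototypes):
    """Assign frames to prototypes in order, minimizing total distance (DP matching).

    Returns the frame indices at which the assignment moves to the next prototype.
    """
    T, K = len(frames), len(prototypes)
    if T < K:
        raise ValueError("need at least one frame per prototype")
    INF = float("inf")
    cost = [[INF] * K for _ in range(T)]
    advanced = [[False] * K for _ in range(T)]  # True if frame t starts prototype k
    cost[0][0] = abs(frames[0] - prototypes[0])
    for t in range(1, T):
        for k in range(K):
            stay = cost[t - 1][k]
            advance = cost[t - 1][k - 1] if k > 0 else INF
            best = min(stay, advance)
            if best == INF:
                continue  # prototype k cannot be reached by frame t
            cost[t][k] = best + abs(frames[t] - prototypes[k])
            advanced[t][k] = advance < stay
    # Trace back from the last frame, which must belong to the last prototype.
    boundaries = []
    k = K - 1
    for t in range(T - 1, 0, -1):
        if advanced[t][k]:
            boundaries.append(t)
            k -= 1
    return boundaries[::-1]

frames = [0.1, 0.0, 0.2, 0.9, 1.1, 1.0, 0.5, 0.4]
print(align_boundaries(frames, [0.0, 1.0, 0.5]))  # [3, 6]
```

Because each prototype may absorb any number of consecutive frames, the result does not depend on how long each sound is held, which is the property the paragraph above requires of the matching method.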

[0033] (Embodiment 6) A sixth embodiment of the present invention is described below with reference to the drawings.

[0034] In FIG. 4, 11 is a signal input unit, 21 is a text input unit, 31 is a phoneme symbol notation conversion unit, 41 is an acoustic signal processing unit, 42 is a syllable or phoneme matching unit, 13 is a GUI-controlled correction unit, and 14 is a segmentation result output unit.

[0035] In this embodiment, the automatic segmentation unit 12 of the second embodiment consists of the acoustic signal processing unit 41 and the syllable or phoneme matching unit 42. Performing segmentation by several different methods in the automatic segmentation unit 12 in this way prevents phoneme boundary candidates from being missed.

[0036] (Embodiment 7) A seventh embodiment of the present invention is described below with reference to the drawings.

【００３７】図５において、１１は信号入力部、２１は
テキスト入力部、４１は音響信号処理部、３１は音素記
号表記変換部、５１は候補削減部、４２は音節または音
素マッチング部、１３はＧＵＩ制御による修正部、１４
はセグメンテーション結果出力部である。In FIG. 5, 11 is a signal input unit, 21 is a text input unit, 41 is an acoustic signal processing unit, 31 is a phoneme symbol notation conversion unit, 51 is a candidate reduction unit, 42 is a syllable or phoneme matching unit, 13 is a correction unit under GUI control, and 14 is a segmentation result output unit.

【００３８】本実施の形態では、候補削減部５１を導入
することにより、作業者がＧＵＩ上で扱う音素境界候補
の数をあらかじめ絞り込んでおくことにより操作性を向
上させることができる。候補削減部５１の具体的な働き
について、有声部と無声部の境界を中心に候補を削減す
る方法を例として図９を用いて説明する。「歓喜」とい
うテキストおよび音声音響信号がそれぞれテキスト入力
部２１および信号入力部１１に入力されたとし、音素記
号表記変換部３１の出力が「ＫａＮｋｉ」となったとす
る。音響信号処理部４１の出力である音素境界の候補は
それぞれ図９の（１）から（６）のいずれかに相当す
る。音響信号処理部４１の出力のうち、ゼロクロス数の
変化の極大点を与える場所が４箇所あったとすると、そ
の４箇所は図９の（１）（２）（４）（５）に一意に同
定される。さらに、このようにして（５）の境界を与え
るサンプル点が決まるとその後には有声音のiしかない
ことから、有声部の終りを与える候補以外の候補は削除
できる。また、（２）と（４）の境界を与えるサンプル
点が決まると、ａとＮの境界を与える候補を残せばよい
ことになるが、この場合はロバストにこの境界を与える
候補が同定できないので、逆にこの境界を与えることが
非常に少ない候補を削除することにより候補の数を減ら
す。またこの例のように必ずしも比較的ロバストな候補
の数と対応する音素記号の数が一致するとは限らない。
一致しないような場合は音節または音素マッチング部４
２のセグメンテーション結果を利用して荒くセグメンテ
ーションした結果に近い候補を割り当てることにより同
様な候補削減を行なう。In the present embodiment, introducing the candidate reduction unit 51 narrows down in advance the number of phoneme boundary candidates that the operator handles on the GUI, which improves operability. The specific operation of the candidate reduction unit 51 is described with reference to FIG. 9, taking as an example a method that reduces candidates around the boundaries between voiced and unvoiced parts. Suppose that the text 「歓喜」 ("kanki", meaning joy) and the corresponding speech acoustic signal are input to the text input unit 21 and the signal input unit 11, respectively, and that the output of the phoneme symbol notation conversion unit 31 is "KaNki". Each phoneme boundary candidate output by the acoustic signal processing unit 41 corresponds to one of (1) to (6) in FIG. 9. If the output of the acoustic signal processing unit 41 contains four locations that give local maxima of the change in the zero-crossing count, those four locations are uniquely identified as (1), (2), (4), and (5) in FIG. 9. Once the sample point giving boundary (5) has been determined in this way, only the voiced sound i follows it, so every later candidate other than the one giving the end of the voiced part can be deleted. Likewise, once the sample points giving boundaries (2) and (4) have been determined, only the candidate giving the boundary between a and N needs to be kept. In this case, however, no candidate can be robustly identified as giving this boundary, so the number of candidates is reduced the other way round, by deleting candidates that very rarely give this boundary. Moreover, the number of relatively robust candidates and the number of corresponding phoneme symbols do not always match as they do in this example. When they do not match, a similar candidate reduction is performed by using the segmentation result of the syllable or phoneme matching unit 42 and assigning the candidates closest to that rough segmentation.
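The use of local maxima in the change of the zero-crossing count can be sketched as follows. This is an illustrative sketch under assumptions, not the processing of the acoustic signal processing unit 41 itself: the fixed non-overlapping frames, the `frame_len` parameter, and the function names are hypothetical choices made for the example.

```python
from typing import List, Sequence


def zero_crossings(frame: Sequence[float]) -> int:
    """Count sign changes between consecutive samples of one frame."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))


def zero_cross_boundary_candidates(samples: Sequence[float], frame_len: int) -> List[int]:
    """Return sample positions where the frame-to-frame change in the
    zero-crossing count has a local maximum."""
    counts = [
        zero_crossings(samples[start:start + frame_len])
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    # Absolute change between neighbouring frames; change[k] sits between frame k and k + 1.
    change = [abs(b - a) for a, b in zip(counts, counts[1:])]

    candidates: List[int] = []
    for k, c in enumerate(change):
        left = change[k - 1] if k > 0 else 0
        right = change[k + 1] if k + 1 < len(change) else 0
        if c > 0 and c >= left and c > right:
            candidates.append((k + 1) * frame_len)  # first sample of frame k + 1
    return candidates


if __name__ == "__main__":
    # Toy signal: 40 samples with no sign changes followed by 40 rapidly alternating samples.
    signal = [1.0] * 40 + [1.0, -1.0] * 20
    print(zero_cross_boundary_candidates(signal, frame_len=10))
```

In the toy signal the count jumps from 0 to 9 per frame at sample 40, so that position is the single candidate returned. Real speech would need overlapping frames and some smoothing, which are omitted here.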

【００３９】ＧＵＩ制御による修正部１３は候補削減部
５１から受けとった音素境界候補と音節または音素マッ
チング部４２と音素記号表記変換部３１の出力を受けと
りそれらを画面に表示する。表示の例を図１１に示す。The correction unit 13 under GUI control receives the phoneme boundary candidates from the candidate reduction unit 51 together with the outputs of the syllable or phoneme matching unit 42 and the phoneme symbol notation conversion unit 31, and displays them on the screen. FIG. 11 shows a display example.

【００４０】図１１において、作業領域１０５と表示領
域１０２には、音声音響信号の波形と、候補削減部５１
から受け取った音素境界候補である短い多くの縦線と、
音節または音素マッチング部４２から受け取った音素境
界候補である長い縦線と、音節または音素マッチングの
ラベリングの結果が音素記号表記として表示されてい
る。作業者は、これらの音素境界の候補を修正して実際
の音素境界を決定すればよい。In FIG. 11, the work area 105 and the display area 102 show the waveform of the speech acoustic signal, many short vertical lines that are the phoneme boundary candidates received from the candidate reduction unit 51, long vertical lines that are the phoneme boundary candidates received from the syllable or phoneme matching unit 42, and the labeling result of the syllable or phoneme matching as phoneme symbol notation. The operator corrects these phoneme boundary candidates to determine the actual phoneme boundaries.

【００４１】以上のように本実施の形態では、ＧＵＩ画
面上に表示される音素境界の候補の数をあらかじめ自動
的に絞っておくことによって作業者の操作をより簡便に
することができる。As described above, in the present embodiment, the operator's operation can be made simpler by automatically narrowing down the number of phoneme boundary candidates displayed on the GUI screen in advance.
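The narrowing-down step for the case where a rough segmentation from the matching unit is available can be sketched as below. The rule used here (keep only the acoustic candidate nearest to each rough boundary) is one assumed reading of "assigning the candidates closest to the rough segmentation"; the function name `reduce_candidates` and the sample values are hypothetical.

```python
from typing import List, Sequence


def reduce_candidates(candidates: Sequence[int], rough_boundaries: Sequence[int]) -> List[int]:
    """Keep, for each rough boundary, only the nearest acoustic candidate.

    Positions are sample indices. Candidates that are not the nearest one
    to any rough boundary are dropped.
    """
    kept = set()
    for boundary in rough_boundaries:
        if candidates:
            kept.add(min(candidates, key=lambda c: abs(c - boundary)))
    return sorted(kept)


if __name__ == "__main__":
    acoustic = [90, 120, 300, 310, 700]  # many fine-grained candidates
    rough = [100, 308]                   # coarse boundaries from matching
    print(reduce_candidates(acoustic, rough))
```

With these values the five acoustic candidates are reduced to two, so the operator sees fewer short lines on the GUI while the positions still come from the finer acoustic analysis.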

【００４２】(実施の形態８)以下本発明の第８の実施の
形態について図面を参照しながら説明する。(Embodiment 8) Hereinafter, an eighth embodiment of the present invention will be described with reference to the drawings.

【００４３】図６において、１３はＧＵＩ制御による修
正部、６０は入力部、６１は表示部、６２は記憶部、６
３は指示の検知部、６４は指示の解釈部、１４はセグメ
ンテーション結果出力部である。入力部６０には、音声
信号や何らかの手段を用いて行われた音素セグメンテー
ション結果や音素記号などが入力される。記憶部６２は
入力部６１に入力されたデータを記憶する。表示部６１
は記憶部６２に格納されているデータをＧＵＩの画面上
に表示する。音声セグメンテーションを行なう作業者は
ＧＵＩ上でマウスを用いて音素境界を表していると思わ
れる候補を選択する。指示の検知部６３は作業者のマウ
ス動作を検知する。指示の解釈部６４は選択された候補
を音素境界として解釈をし、記憶部６２に確定されたセ
グメンテーション位置として格納する。表示部６１は新
しく格納されたデータに基づいて画面にデータを再表示
する。In FIG. 6, 13 is a correction unit under GUI control, 60 is an input unit, 61 is a display unit, 62 is a storage unit, 63 is an instruction detecting unit, 64 is an instruction interpreting unit, and 14 is a segmentation result output unit. The input unit 60 receives a speech signal, a phoneme segmentation result obtained by some means, phoneme symbols, and the like. The storage unit 62 stores the data input to the input unit 61. The display unit 61 displays the data stored in the storage unit 62 on the GUI screen. The operator performing the speech segmentation uses a mouse on the GUI to select the candidates that appear to represent phoneme boundaries. The instruction detecting unit 63 detects the operator's mouse actions. The instruction interpreting unit 64 interprets each selected candidate as a phoneme boundary and stores it in the storage unit 62 as a confirmed segmentation position. The display unit 61 then redraws the screen based on the newly stored data.
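The detect, interpret, and store cycle described above can be sketched without any GUI toolkit. This is a simplified model under assumptions: the class `SegmentationStore`, the function `interpret_click`, and the `snap_radius` parameter are hypothetical names, and a click is represented only by the sample position it maps to on the waveform.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SegmentationStore:
    """Stands in for the storage unit: displayed candidates plus confirmed boundaries."""
    candidates: List[int]
    confirmed: List[int] = field(default_factory=list)


def interpret_click(store: SegmentationStore, clicked_sample: int, snap_radius: int) -> Optional[int]:
    """Interpret a detected click as the selection of the nearest candidate.

    Returns the confirmed boundary, or None when no candidate lies within
    snap_radius samples of the click.
    """
    if not store.candidates:
        return None
    nearest = min(store.candidates, key=lambda c: abs(c - clicked_sample))
    if abs(nearest - clicked_sample) > snap_radius:
        return None
    if nearest not in store.confirmed:
        store.confirmed.append(nearest)
        store.confirmed.sort()
    return nearest


if __name__ == "__main__":
    store = SegmentationStore(candidates=[100, 250, 400])
    interpret_click(store, clicked_sample=243, snap_radius=20)  # selects 250
    interpret_click(store, clicked_sample=320, snap_radius=20)  # too far from any candidate
    print(store.confirmed)
```

A display layer would redraw from `store.confirmed` after each call, which corresponds to the redisplay step performed by the display unit.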

【００４４】図１２はこの手順に従って作業者によって
なされた音声セグメンテーションの結果の例である。図
１２において、作業領域１０５と表示領域１０２に表示
されている長い縦線が音素セグメンテーションの結果で
あり、アルファベットがラベリングされた音素記号表記
である。このようにして作業者によって最終的に確定さ
れた音声セグメンテーション結果はセグメンテーション
結果出力部１４に送られ出力される。FIG. 12 shows an example of the result of speech segmentation performed by an operator according to this procedure. In FIG. 12, the long vertical lines displayed in the work area 105 and the display area 102 are the phoneme segmentation result, and the alphabetic characters are the labeled phoneme symbol notation. The speech segmentation result finally confirmed by the operator in this way is sent to the segmentation result output unit 14 and output.

【００４５】なお、図１４のように実施の形態７で説明
したような処理をひとつにまとめて入力部６０の代わり
におくことができる。つまり、実施の形態７では、音声
信号や何らかの手段を用いて行われた音素セグメンテー
ション結果や音素記号などが入力部６０で入力されてい
たが、その代わりに図１４のような処理によりＧＵＩに
よる修正部１３に音声信号や音素セグメンテーション結
果や音素記号を入力することもできる。As shown in FIG. 14, processing such as that described in Embodiment 7 can be combined into one block and placed where the input unit 60 would otherwise be. That is, in Embodiment 7 the speech signal, a phoneme segmentation result obtained by some means, phoneme symbols, and the like were input through the input unit 60; instead, the speech signal, the phoneme segmentation result, and the phoneme symbols can be supplied to the correction unit 13 under GUI control by processing such as that shown in FIG. 14.

【００４６】また、本実施の形態では指示の検知部６３
はマウス動作を検知するとしたがマウス動作に限ったも
のではない。タッチパネルに触れるなど他の動作の検出
を行なってもよいし、作業者の音声を認識することによ
り作業者の指示を理解してもよい。In the present embodiment, the instruction detecting unit 63 detects mouse actions, but it is not limited to mouse actions. It may detect other actions, such as touching a touch panel, or it may understand the operator's instructions by recognizing the operator's voice.

【００４７】(実施の形態９)本実施の形態は実施の形態
８の機能に加えて、作業者が音の聴取による確認を行な
いながらセグメンテーション作業を行なえることを特徴
とする。(Embodiment 9) This embodiment is characterized in that, in addition to the functions of Embodiment 8, the operator can perform the segmentation work while checking the result by listening to the sound.

【００４８】図７は実施の形態７におけるＧＵＩ制御に
よる修正部１３を詳述したもので、６１は表示部、６２
は記憶部、６３は指示の検知部、６４は指示の解釈部、
７１は音声切り出し部、７２は音声再生部である。FIG. 7 shows in detail the correction unit 13 under GUI control according to the seventh embodiment, where 61 is a display unit, 62 is a storage unit, 63 is an instruction detecting unit, 64 is an instruction interpreting unit, 71 is a voice cutout unit, and 72 is a voice reproducing unit.

【００４９】図７で示された構成による動作の例を以下
に詳述する。An example of the operation according to the configuration shown in FIG. 7 will be described in detail below.

【００５０】作業者は例えば図１１に示されたようなＧ
ＵＩ画面上の候補確認ボタン１２０をマウスをもちいて
選択することにより、音素境界候補に挟まれた音声セグ
メントや選択された音素境界に挟まれた音声セグメント
を聴取することができる。このとき指示の解釈部７４は
再生すべき音声区間を解釈する。音声切り出し部７５は
記憶部７２から呼び出された音声音響信号のうち再生す
べき音声区間を切り出す。この音声区間は音声再生部７
６において音響信号として再生される。By using the mouse to select the candidate confirmation button 120 on a GUI screen such as the one shown in FIG. 11, the operator can listen to the speech segment between phoneme boundary candidates or the speech segment between selected phoneme boundaries. At this time, the instruction interpreting unit 74 determines the speech section to be played back. The voice cutout unit 75 cuts that speech section out of the speech acoustic signal read from the storage unit 72. The voice reproducing unit 76 then plays this speech section back as an acoustic signal.
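Cutting out the section between two boundaries for playback can be sketched as below. The sketch assumes 16-bit mono PCM samples held in a Python list and uses only the standard `wave`, `struct`, and `io` modules; the function names `cut_segment` and `to_wav_bytes` are hypothetical, and actual playback through an audio device is left out.

```python
import io
import struct
import wave
from typing import List, Sequence


def cut_segment(samples: Sequence[int], boundaries: Sequence[int], index: int) -> List[int]:
    """Return the index-th segment of samples, where segments are delimited
    by the sorted boundary positions plus the start and end of the signal."""
    edges = [0] + sorted(boundaries) + [len(samples)]
    return list(samples[edges[index]:edges[index + 1]])


def to_wav_bytes(segment: Sequence[int], sample_rate: int) -> bytes:
    """Encode a segment of 16-bit mono samples as an in-memory WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(struct.pack("<%dh" % len(segment), *segment))
    return buf.getvalue()


if __name__ == "__main__":
    signal = list(range(10))                # stand-in for real audio samples
    segment = cut_segment(signal, [3, 7], 1)  # samples between boundaries 3 and 7
    wav = to_wav_bytes(segment, sample_rate=16000)
    print(segment, len(wav))
```

The resulting bytes could be handed to whatever playback facility the platform provides, which corresponds to the role of the voice reproducing unit.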

【００５１】このように本実施の形態では、作業者が再
生された音声を聴取することにより正しいセグメンテー
ションが行なわれたかどうかを確認することができる。As described above, in the present embodiment, it is possible for the operator to check whether the correct segmentation has been performed by listening to the reproduced sound.

【００５２】(実施の形態１０)本実施の形態は、実施の
形態８の機能に加えて作業者によるセグメンテーション
結果を実際に音声合成に適用することにより音声合成に
必要な精度でセグメンテーションができているかどうか
を聴取により確認することができる。(Embodiment 10) In this embodiment, in addition to the functions of Embodiment 8, the operator's segmentation result is actually applied to speech synthesis, so that the operator can check by listening whether the segmentation has the accuracy required for speech synthesis.

【００５３】図８は実施の形態７におけるＧＵＩ制御に
よる修正部１３を詳述したもので、６１は表示部、６２
は記憶部、６３は指示の検知部、６４は指示の解釈部、
８１は音声合成部、８２は音声再生部である。FIG. 8 shows in detail the correction unit 13 under GUI control according to the seventh embodiment, where 61 is a display unit, 62 is a storage unit, 63 is an instruction detecting unit, 64 is an instruction interpreting unit, 81 is a voice synthesizing unit, and 82 is a voice reproducing unit.

【００５４】図８で示された構成による動作の例を以下
に詳述する。An example of the operation according to the configuration shown in FIG. 8 will be described in detail below.

【００５５】作業者は例えば図１１に示されたようなＧ
ＵＩ画面上の確認再生ボタン１０８をマウスをもちいて
選択することにより、事前に選択した音素境界の情報を
用いて実際に合成された音声を聴取することにより、事
前に音素境界の精度が実用レベルに十分であるかどうか
の確認ができる。このとき作業者による指示の解釈部６
４は記憶部６２のセグメンテーション位置の情報を音声
合成部８１へ送り音声合成部８１で合成された合成音声
は音声再生部８２から再生される。By using the mouse to select the confirmation playback button 108 on a GUI screen such as the one shown in FIG. 11, the operator can listen to speech actually synthesized using the previously selected phoneme boundary information, and can thus check in advance whether the accuracy of the phoneme boundaries is sufficient for practical use. At this time, the instruction interpreting unit 64 sends the segmentation position information in the storage unit 62 to the voice synthesizing unit 81, and the speech synthesized by the voice synthesizing unit 81 is played back by the voice reproducing unit 82.

【００５６】なお、本実施の形態で示したＧＵＩ制御に
よる修正部１３に上述した実施の形態９で述べた音声切
り出し部７１を加えることにより、実施の形態９で述べ
た機能と本実施の形態で述べた機能をあわせもつことが
できる。この時のＧＵＩ制御による修正部１３の構成を
図１３に示す。図１３は実施の形態７におけるＧＵＩ制
御による修正部１３を詳述したもので、６１は表示部、
６２は記憶部、６３は指示の検知部、６４は指示の解釈
部、７１は音声切り出し部、８１は音声合成部、１３１
は音声再生部である。このように構成すれば、セグメン
テーション候補間を聴取により確認することができ、ま
た実際に合成された音声を聴取することにより音素境界
の精度が実用レベルで十分であるかどうかの確認がで
き、その結果に基づいて再度の修正を行うことができ
る。Note that adding the voice cutout unit 71 described in Embodiment 9 to the correction unit 13 under GUI control shown in this embodiment yields a unit that has both the function described in Embodiment 9 and the function described in this embodiment. FIG. 13 shows the configuration of the correction unit 13 under GUI control in that case. FIG. 13 shows in detail the correction unit 13 under GUI control according to the seventh embodiment, where 61 is a display unit, 62 is a storage unit, 63 is an instruction detecting unit, 64 is an instruction interpreting unit, 71 is a voice cutout unit, 81 is a voice synthesizing unit, and 131 is a voice reproducing unit. With this configuration, the operator can check the speech between segmentation candidates by listening, can check by listening to the actually synthesized speech whether the accuracy of the phoneme boundaries is sufficient for practical use, and can make further corrections based on the result.

【００５７】なお信号入力部が本発明の信号入力手段の
例であり、自動セグメンテーション部が本発明の自動セ
グメンテーション手段の例であり、ＧＵＩ制御による修
正部が本発明の修正手段の例であり、テキスト入力部が
本発明のテキスト入力手段の例であり、音素記号表記変
換部が本発明の音素記号表記変換手段の例である。The signal input unit is an example of the signal input means of the present invention, the automatic segmentation unit is an example of the automatic segmentation means of the present invention, the correction unit under GUI control is an example of the correction means of the present invention, the text input unit is an example of the text input means of the present invention, and the phoneme symbol notation conversion unit is an example of the phoneme symbol notation conversion means of the present invention.

【００５８】さらに、本発明の信号入力手段、は、上述
した実施の形態における音声音響信号を入力するものに
限らず、画像信号も入力することが可能である。信号入
力手段が画像信号を入力した際には、自動セグメンテー
ション手段は画像に対するセグメンテーションを行うこ
とができる。すなわち、画像の背景と対象物を分離する
などの処理を行うことができる。また修正手段は、画像
に対するセグメンテーション結果をＧＵＩ制御によって
修正することができる。Further, the signal input means of the present invention is not limited to one that inputs the speech acoustic signal of the above embodiments; it can also input an image signal. When the signal input means inputs an image signal, the automatic segmentation means can perform segmentation on the image, that is, processing such as separating the background of the image from an object. The correction means can then correct the segmentation result for the image under GUI control.

【００５９】さらに、本実施の形態では、言語種として
日本語を例にして説明したが日本語に限らず、英語、ド
イツ語、フランス語、韓国語、中国語など、要するに音
声認識と音声合成の対象となりうる言語種でありさえす
ればよい。Further, although the present embodiment has been described using Japanese as the example language, the language is not limited to Japanese. It may be English, German, French, Korean, Chinese, or any other language that can be a target of speech recognition and speech synthesis.

【００６０】さらに、本発明は、その機能を実現する各
手段の全部または一部の機能を実現するためのプログラ
ムを格納していることを特徴とする媒体でもある。Further, the present invention is also a medium characterized by storing a program for realizing all or a part of each means for realizing the function.

【００６１】[0061]

【発明の効果】以上説明したところから明らかなよう
に、高度な専門知識がない人でも高精度のセグメンテー
ションができるセグメンテーション補助装置を提供する
ことができる。As is apparent from the above description, it is possible to provide a segmentation assisting device that can perform a high-accuracy segmentation even by a person without advanced technical knowledge.

[Brief description of the drawings]

【図１】本発明のセグメンテーション補助装置の第１の
実施の形態におけるブロック図。FIG. 1 is a block diagram of a first embodiment of a segmentation assisting device according to the present invention.

【図２】本発明のセグメンテーション補助装置の第２の
実施の形態におけるブロック図。FIG. 2 is a block diagram of a segmentation assisting device according to a second embodiment of the present invention.

【図３】本発明のセグメンテーション補助装置の第３な
らびに第４ならびに第５の実施の形態におけるブロック
図。FIG. 3 is a block diagram of a segmentation assisting device according to third, fourth, and fifth embodiments of the present invention.

【図４】本発明のセグメンテーション補助装置の第６の
実施の形態におけるブロック図。FIG. 4 is a block diagram of a segmentation assisting device according to a sixth embodiment of the present invention.

【図５】本発明のセグメンテーション補助装置の第７の
実施の形態におけるブロック図。FIG. 5 is a block diagram of a segmentation assisting device according to a seventh embodiment of the present invention.

【図６】本発明のセグメンテーション補助装置の第８の
実施の形態におけるブロック図。FIG. 6 is a block diagram of a segmentation assisting device according to an eighth embodiment of the present invention.

【図７】本発明のセグメンテーション補助装置の第９の
実施の形態におけるブロック図。FIG. 7 is a block diagram of a ninth embodiment of a segmentation assisting device according to the present invention.

【図８】本発明のセグメンテーション補助装置の第１０
の実施の形態におけるブロック図。FIG. 8 is a block diagram of a segmentation assisting device according to a tenth embodiment of the present invention.

【図９】本発明のセグメンテーション補助装置の第７の
実施の形態における音素境界の存在箇所。FIG. 9 shows locations of phoneme boundaries in a seventh embodiment of the segmentation assisting apparatus according to the present invention.

【図１０】本発明のセグメンテーション補助装置の第１
の実施の形態におけるＧＵＩ画面の例。FIG. 10 is an example of a GUI screen of the segmentation assisting device according to the first embodiment of the present invention.

【図１１】本発明のセグメンテーション補助装置の第７
の実施の形態におけるＧＵＩ画面の例。FIG. 11 is an example of a GUI screen of the segmentation assisting device according to the seventh embodiment of the present invention.

【図１２】本発明のセグメンテーション補助装置の第８
の実施の形態におけるＧＵＩ画面の例。FIG. 12 is an example of a GUI screen of the segmentation assisting device according to the eighth embodiment of the present invention.

【図１３】本発明のセグメンテーション補助装置の第１
０の実施の形態におけるブロック図。FIG. 13 is a block diagram of the segmentation assisting device according to the tenth embodiment of the present invention.

【図１４】本発明のセグメンテーション補助装置の第８
の実施の形態における入力部のひとつの置換例。FIG. 14 shows one example of a replacement for the input unit in the eighth embodiment of the segmentation assisting device according to the present invention.

【図１５】従来の連続音声中の音韻セグメンテーション
を行うエキスパートシステムの音韻決定のための戦略を
示す図。FIG. 15 is a diagram showing a conventional strategy for determining phonemes of an expert system for performing phoneme segmentation in continuous speech.

[Explanation of symbols]

11 信号入力部 (signal input unit)
12 自動セグメンテーション部 (automatic segmentation unit)
13 ＧＵＩ制御による修正部 (correction unit under GUI control)
14 セグメンテーション結果出力部 (segmentation result output unit)
21 テキスト入力部 (text input unit)
31 音素記号表記変換部 (phoneme symbol notation conversion unit)
41 音響信号処理部 (acoustic signal processing unit)
42 音節または音素マッチング部 (syllable or phoneme matching unit)
51 候補削減部 (candidate reduction unit)
61 表示部 (display unit)
62 記憶部 (storage unit)
63 指示の検知部 (instruction detecting unit)
64 指示の解釈部 (instruction interpreting unit)
71 音声切り出し部 (voice cutout unit)
72 音声再生部 (voice reproducing unit)
81 音声合成部 (voice synthesizing unit)
82 音声再生部 (voice reproducing unit)

Claims

[Claims]

1. A segmentation assisting device comprising: signal input means for inputting an audio signal or an image signal; automatic segmentation means for automatically segmenting the audio signal or the image signal input by the signal input means and calculating segment boundary candidates; and correction means for displaying the segment boundary candidates calculated by the automatic segmentation means on a screen and performing segmentation by selecting or correcting the candidates while checking them under GUI control.

2. The segmentation assisting device according to claim 1, further comprising text input means, provided separately from the signal input means, for inputting text that is used when the automatic segmentation means performs automatic segmentation or when the segment boundary candidates are selected or corrected under GUI control.

3. The segmentation assisting device according to claim 2, further comprising phoneme symbol notation conversion means for converting the text input by the text input means into phoneme symbol notation, wherein the automatic segmentation means or the correction means uses the phoneme symbol notation instead of the text.

4. The segmentation assisting device according to any one of claims 1 to 3, wherein the automatic segmentation means calculates phoneme boundary candidates by acoustic signal processing using the zero-crossing count, pitch, power, formants, cepstrum, or phase of the audio signal.

5. The segmentation assisting device according to claim 2, wherein the automatic segmentation means calculates phoneme boundary candidates by syllable or phoneme matching using the text or the phoneme symbol notation.

6. The segmentation assisting device according to claim 2, wherein the automatic segmentation means calculates phoneme boundary candidates by the acoustic signal processing and also calculates phoneme boundary candidates by syllable or phoneme matching.

7. The segmentation assisting device according to claim 6, wherein the automatic segmentation means compares the number of phoneme boundary candidates given by the acoustic signal processing using the zero-crossing count, pitch, power, formants, cepstrum, or phase of the audio signal with the number of phoneme boundary candidates obtained from the phoneme symbol notation converted from the text input by the text input means, and reduces the number of the phoneme boundary candidates given by the acoustic signal processing based on a predetermined criterion.

8. The segmentation assisting device according to claim 7, wherein the automatic segmentation means reduces the number of phoneme boundary candidates given by the acoustic signal processing by using the phoneme boundary candidates calculated by the syllable or phoneme matching.

9. A segmentation assisting device comprising: input means for inputting an audio signal or an image signal and segmentation candidates obtained by segmenting the audio signal or the image signal; and correction means with which an operator performs segmentation correction work by selecting, moving, deleting, or adding the segmentation candidates via a GUI.

10. The segmentation assisting device according to any one of claims 1 to 3 or 9, wherein the correction means performs the segmentation work by selecting, moving, deleting, or adding any of the phoneme boundary candidates.

11. The segmentation assisting device according to claim 1, wherein the correction means performs correction by listening to the speech segment between any of the phoneme boundary candidates.

12. The segmentation assisting device according to claim 1, wherein the correction means performs speech synthesis based on the segmentation result selected by the operator via the GUI and performs further correction by listening to the synthesized speech.

13. A medium storing a program for realizing all or part of the functions of each of the means according to claim 1.