JP2001117580A

JP2001117580A - Audio signal processing device and audio signal processing method

Info

Publication number: JP2001117580A
Application number: JP30027299A
Authority: JP
Inventors: Tatsuji Nakagawa; 竜児中川; Keino Pedro; ケイノペドロ; Rosukosu Alex; ロスコスアレックス
Original assignee: Universitat Pompeu Fabra UPF; Yamaha Corp
Current assignee: Universitat Pompeu Fabra UPF; Yamaha Corp
Priority date: 1999-10-21
Filing date: 1999-10-21
Publication date: 2001-04-27
Anticipated expiration: 2019-10-21
Also published as: JP4302837B2

Abstract

(57) [Abstract]

[Problem] To make it possible to detect more accurately which position on a musical score an input sound, such as a singing voice or an instrument sound, corresponds to.

[Solution] When an audio signal is input, it is divided into units of a predetermined time (frames), and feature parameters are obtained from the audio signal of each frame. The obtained feature parameters are symbol-quantized by referring to a codebook, and symbol observation probabilities are obtained by referring to a note dictionary. A hidden Markov model is then formed according to a note sequence stored in advance, and the note corresponding to each frame of the input sound is identified using the Viterbi algorithm. This makes it possible to identify the position of the input sound within the note sequence.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field] The present invention relates to an audio signal processing device and an audio signal processing method that associate a note sequence stored in advance with an input sound in time series.

[0002]

[Prior Art] Methods have previously been proposed for following the current position in a musical score by detecting the pitch and sounding duration of an instrument sound, for example the methods described in "An On-Line Algorithm for Real-Time Accompaniment" (R. Dannenberg, Proceedings of the ICMC 1984) and "The Synthetic Performer in the Context of Live Musical Performance" (B. Vercoe, Proceedings of the ICMC 1984).

[0003]

[Problem to Be Solved by the Invention] Actual performances and singing, however, do not always proceed exactly as written in the score. Uncertain factors such as subtle deviations and fluctuations in tempo, timing, and pitch have an adverse effect, and the conventional methods described above may fail to detect accurately where in the score the instrument sound or singing voice is.

[0004] The present invention has been made in view of the above circumstances, and its object is to provide an audio signal processing device and an audio signal processing method that can detect more accurately which position on a musical score an input sound, such as a singing voice or an instrument sound, corresponds to.

[0005]

[Means for Solving the Problem] To solve the above problem, the audio signal processing device according to claim 1 of the present invention is an audio signal processing device that associates an input sound with one of the notes in a note sequence, and comprises: note sequence storage means for storing note sequence information described in time series; parameter acquisition means for acquiring feature parameters from an audio signal input in frame units; recognition note information storage means storing a codebook, in which representative feature parameters of audio signals are clustered into symbols as feature vectors, together with the number of states, the state transition probabilities, and the observation probabilities of the symbols for each note; observation probability acquisition means for obtaining, by referring to the recognition note information storage means, the observation symbol of the input sound from the feature parameters acquired by the parameter acquisition means, and for obtaining the observation probability of that observation symbol; state forming means for forming, based on the number of states and the state transition probabilities stored in the recognition note information storage means, each state of the note sequence stored in the note sequence storage means as a hidden Markov model on a finite state network; state transition determination means for determining state transitions according to the observation probabilities obtained by the observation probability acquisition means and the hidden Markov model formed by the state forming means; and association means for associating each frame of the input audio signal with the note sequence information based on the state transitions determined by the state transition determination means.

[0006] The audio signal processing device according to claim 2 is the audio signal processing device according to claim 1, further comprising display means for displaying, based on the result of the association by the association means, which part of the note sequence information the current input sound corresponds to.

[0007] The audio signal processing device according to claim 3 is the audio signal processing device according to claim 1 or 2, wherein the parameter acquisition means acquires at least energy, delta energy, zero crossings, pitch, delta pitch, and pitch error from the input audio signal as feature parameters.

[0008] The audio signal processing device according to claim 4 is the audio signal processing device according to claim 3, wherein the five observation probabilities stored in the recognition note information storage means for energy, delta energy, zero crossings, delta pitch, and pitch error are calculated using an observation function based on a Gaussian distribution, and the observation probability of pitch stored in the recognition note information storage means is calculated using an observation function based on a Gaussian distribution and a step observation probability function, the Gaussian observation function and the step observation function being used selectively, depending on whether a pitch is present, when the observation probability of pitch is calculated.

[0009] The audio signal processing device according to claim 5 is the audio signal processing device according to claim 3 or 4, wherein the state forming means forms three types of left-to-right hidden Markov models corresponding to pitched sounds, pitchless sounds, and silence, forming the pitched sounds and the pitchless sounds as three-state models and the silence as a one-state model.

[0010] The audio signal processing device according to claim 6 is the audio signal processing device according to claim 5, wherein, when forming the hidden Markov model of a pitched sound, the state forming means forms a note connected to the preceding note by a slur and a single note as separate models.

[0011] The audio signal processing device according to claim 7 is the audio signal processing device according to any one of claims 1 to 6, further comprising: input means for inputting training musical tone waveform data and training note sequence data obtained by transcribing that waveform into notes; training model forming means for forming a hidden Markov model on a finite network for each note of the training note sequence data input from the input means; and parameter estimation means for estimating, during training, the parameters that maximize the likelihood of the model formed by the training model forming means, using the k-means algorithm; wherein the recognition note information storage means stores the state transition probabilities and observation probabilities of the feature vectors for each note, obtained from the parameters estimated by the parameter estimation means.

[0012] The audio signal processing device according to claim 8 is the audio signal processing device according to any one of claims 1 to 7, wherein the state transition determination means determines the state transitions using the Viterbi algorithm.

[0013] The audio signal processing device according to claim 9 is the audio signal processing device according to claim 8, wherein the note sequence storage means stores duration data corresponding to the note sequence, and the state transition determination means incorporates the duration data stored in the note sequence storage means into the Viterbi algorithm.

[0014] The audio signal processing method according to claim 10 is an audio signal processing method that associates an input sound with one of the notes in a note sequence stored in advance, and comprises: a parameter acquisition step of acquiring feature parameters from an audio signal input in frame units; an observation probability acquisition step of obtaining the observation symbol of the input sound from the feature parameters acquired in the parameter acquisition step, and obtaining the observation probability of that observation symbol, by referring to a prestored codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors and to the number of states, the state transition probabilities, and the observation probabilities of the symbols for each note; a state forming step of forming, based on the prestored number of states and state transition probabilities, each state of the prestored note sequence as a hidden Markov model on a finite state network; a state transition determination step of determining state transitions according to the observation probabilities obtained in the observation probability acquisition step and the hidden Markov model formed in the state forming step; and an association step of associating each frame of the input audio signal with the note sequence information based on the state transitions determined in the state transition determination step.

[0015] The audio signal processing method according to claim 11 is the audio signal processing method according to claim 10, further comprising a display step of displaying, based on the result of the association in the association step, which part of the note sequence information the current input sound corresponds to.

[0016] The audio signal processing method according to claim 12 is the audio signal processing method according to claim 10 or 11, wherein, in the parameter acquisition step, at least energy, delta energy, zero crossings, pitch, delta pitch, and pitch error are acquired from the input audio signal as feature parameters.

[0017] The audio signal processing method according to claim 13 is the audio signal processing method according to claim 12, further comprising: a first observation probability calculation step of calculating and storing the five observation probabilities for energy, delta energy, zero crossings, delta pitch, and pitch error using an observation function based on a Gaussian distribution; and a second observation probability calculation step of calculating and storing the observation probability of pitch using an observation function based on a Gaussian distribution and a step observation probability function, the Gaussian observation function and the step observation function being used selectively depending on whether a pitch is present; wherein, in the observation probability acquisition step, the observation probabilities are obtained by referring to the observation probabilities stored in the first and second observation probability calculation steps.

[0018] The audio signal processing method according to claim 14 is the audio signal processing method according to claim 12 or 13, wherein, in the state forming step, three types of left-to-right hidden Markov models are formed corresponding to pitched sounds, pitchless sounds, and silence, the pitched sounds and the pitchless sounds being formed as three-state models and the silence as a one-state model.

[0019] The audio signal processing method according to claim 15 is the audio signal processing method according to claim 14, wherein, in the state forming step, when the hidden Markov model of a pitched sound is formed, a note connected to the preceding note by a slur and a single note are formed as separate models.

[0020] The audio signal processing method according to claim 16 is the audio signal processing method according to any one of claims 10 to 15, comprising: an input step of inputting training musical tone waveform data and training note sequence data obtained by transcribing that waveform into notes; a training model forming step of forming a hidden Markov model on a finite network for each note of the training note sequence data input in the input step; a parameter estimation step of estimating, during training, the parameters that maximize the likelihood of the model formed in the training model forming step, using the k-means algorithm; and a probability storing step of storing the state transition probabilities and observation probabilities of the feature vectors for each note, obtained from the parameters estimated in the parameter estimation step; wherein, in the observation probability acquisition step, the observation probabilities are obtained by referring to the observation probabilities stored in the probability storing step, and, in the state forming step, each state of the prestored note sequence is formed as a hidden Markov model on a finite state network based on the state transition probabilities stored in the probability storing step.

[0021] The audio signal processing method according to claim 17 is the audio signal processing method according to any one of claims 10 to 16, wherein, in the state transition determination step, the state transitions are determined using the Viterbi algorithm.

[0022] The audio signal processing method according to claim 18 is the audio signal processing method according to claim 17, wherein, in the state transition determination step, duration data corresponding to the prestored note sequence is incorporated into the Viterbi algorithm.

[0023]

[Embodiments of the Invention] An embodiment of the present invention is described below with reference to the drawings.

A. Configuration of the embodiment

A-1. Overall configuration

FIG. 1 shows the configuration of an audio signal processing device according to one embodiment of the present invention. In the figure, reference numeral 1 denotes a microphone, which picks up sounds such as a singer's voice or an instrument sound and outputs them to the input audio signal extraction unit 3 as the input audio signal Sv. Reference numeral 2 denotes an analysis window generation unit, which generates an analysis window AW (for example, a Hamming window) whose period is a fixed multiple of the pitch period detected in the previous frame, and outputs it to the input audio signal extraction unit 3. In the initial state, or when the previous frame was silent, an analysis window with a preset fixed period is output to the input audio signal extraction unit 3 as the analysis window AW.
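The window sizing rule in paragraph [0023] can be sketched as follows. This is a minimal illustration rather than the patented implementation: the sample rate, the multiple `periods_per_window`, and the fallback length `default_length` are hypothetical values that the text does not specify, and the Hamming coefficients are the standard 0.54/0.46 form.

```python
import math
from typing import List, Optional


def hamming_window(length: int) -> List[float]:
    """Return a standard Hamming window (0.54 - 0.46 cos) of the given length."""
    if length < 1:
        raise ValueError("length must be positive")
    if length == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (length - 1))
            for n in range(length)]


def analysis_window(prev_pitch_hz: Optional[float],
                    sample_rate: int = 44100,        # assumed, not from the text
                    periods_per_window: int = 4,     # assumed fixed multiple
                    default_length: int = 1024) -> List[float]:  # assumed fallback
    """Size the window as a fixed multiple of the previous frame's pitch period.

    Falls back to a preset length in the initial state or after a silent
    frame, represented here by prev_pitch_hz being None or non-positive.
    """
    if prev_pitch_hz is None or prev_pitch_hz <= 0.0:
        length = default_length
    else:
        period_samples = sample_rate / prev_pitch_hz
        length = max(1, round(periods_per_window * period_samples))
    return hamming_window(length)
```

For example, with the assumed defaults a previous pitch of 441 Hz has a period of 100 samples, so the window spans 400 samples.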

[0024] The input audio signal extraction unit 3 multiplies the input audio signal Sv by the analysis window AW it receives, cuts the input audio signal Sv into frames, and outputs each frame to the fast Fourier transform unit 4 as the frame audio signal FSv. The fast Fourier transform unit 4 computes the frequency spectrum of the frame audio signal FSv and outputs it to the feature parameter analysis unit 5.
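The framing and spectrum step of paragraph [0024] might look like the sketch below. The function names are hypothetical, and a naive O(N^2) discrete Fourier transform stands in for the fast Fourier transform unit so that the example needs only the standard library; a real implementation would use an FFT.

```python
import cmath
import math
from typing import List, Sequence


def cut_frame(signal: Sequence[float], start: int,
              window: Sequence[float]) -> List[float]:
    """Multiply the signal by the analysis window, starting at sample `start`.

    Samples outside the signal are treated as zero.
    """
    frame = []
    for n, w in enumerate(window):
        idx = start + n
        sample = signal[idx] if 0 <= idx < len(signal) else 0.0
        frame.append(sample * w)
    return frame


def magnitude_spectrum(frame: Sequence[float]) -> List[float]:
    """Magnitudes of DFT bins 0..N/2 (naive DFT used in place of an FFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags
```

Bin `k` of the result corresponds to the frequency `k * sample_rate / N`, which is where the later peak and pitch analysis would start from.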

[0025] The feature parameter analysis unit 5 extracts feature parameters that characterize the spectral properties of the input sound and outputs them to the symbol quantization unit 7. This embodiment uses the six types of feature vectors described later (energy, delta energy, zero-cross rate, pitch frequency, delta pitch, and total error) as the feature parameters. As explained in detail later, the recognition note information storage unit 6 contains a codebook 6a and a note dictionary 6b, which stores probability data giving, for each note, the number of states of the feature vectors, the state transition probabilities, and the symbol observation probabilities.

[0026] The symbol quantization unit 7 refers to the codebook 6a stored in the recognition note information storage unit 6, selects the feature symbol with the maximum likelihood for the frame, and outputs the observation probability of the selected feature symbol to the note sequence state forming unit 8.
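A rough sketch of the symbol quantization in paragraph [0026] follows. It makes two assumptions that the text does not confirm: the "maximum likelihood" symbol is approximated by the nearest codeword in squared Euclidean distance, and the note dictionary is modelled as a plain mapping from a state key to a list of per-symbol observation probabilities. All names and numbers are hypothetical.

```python
from typing import Dict, Hashable, List, Sequence


def quantize(feature: Sequence[float],
             codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codeword closest to the feature vector.

    Nearest-neighbour search is an assumption standing in for the
    maximum-likelihood symbol selection described in the text.
    """
    best_index = 0
    best_dist = float("inf")
    for index, codeword in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(feature, codeword))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def observation_probabilities(
        symbol: int,
        note_dictionary: Dict[Hashable, List[float]]) -> Dict[Hashable, float]:
    """Look up the probability of observing `symbol` in each HMM state."""
    return {state: probs[symbol] for state, probs in note_dictionary.items()}
```

The per-state probabilities returned by the second function are what a decoder would consume for one frame.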

[0027] The note sequence state forming unit 8 refers to the recognition note information storage unit 6 and to the note sequence information storage unit 11 described later, and forms the states of the note sequence described in the note sequence information storage unit 11 as a hidden Markov model (HMM). The state transition determination unit 9 determines the state transitions according to the Viterbi algorithm described later, using the observation probabilities of the per-frame feature symbols that the symbol quantization unit 7 obtains from the input sound. This identifies the note of the input sound at each frame time.
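The decoding step in paragraph [0027] can be illustrated with a textbook Viterbi search over a left-to-right chain of states. This is a generic sketch, not the patent's specific procedure: it assumes decoding starts in the first state, that each state may only stay or advance by one, and that probabilities are supplied as logarithms. The argument names are hypothetical.

```python
from typing import List, Sequence


def viterbi_left_to_right(log_obs: Sequence[Sequence[float]],
                          log_stay: Sequence[float],
                          log_next: Sequence[float]) -> List[int]:
    """Most likely state path through a left-to-right HMM.

    log_obs[t][s]: log observation probability of frame t in state s.
    log_stay[s]:   log probability of remaining in state s.
    log_next[s]:   log probability of moving from state s to state s + 1.
    """
    n_frames, n_states = len(log_obs), len(log_stay)
    neg_inf = float("-inf")
    delta = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    delta[0][0] = log_obs[0][0]  # assumed: decoding starts in state 0

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = delta[t - 1][s] + log_stay[s]
            move = delta[t - 1][s - 1] + log_next[s - 1] if s > 0 else neg_inf
            if move > stay:
                delta[t][s], back[t][s] = move + log_obs[t][s], s - 1
            else:
                delta[t][s], back[t][s] = stay + log_obs[t][s], s

    # Backtrack from the best-scoring final state.
    state = max(range(n_states), key=lambda i: delta[n_frames - 1][i])
    path = [state]
    for t in range(n_frames - 1, 0, -1):
        state = back[t][state]
        path.append(state)
    path.reverse()
    return path
```

Working in the log domain avoids numerical underflow when many frames are multiplied together, which is why the sketch takes log probabilities rather than raw ones.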

[0028] Using the notes identified for each frame of the input sound, the matching unit 10 determines which position in the note sequence stored in the note sequence information storage unit 11 each frame time of the input sound corresponds to, and thereby associates the input sound with the note sequence. The display device 12 displays the result of the association between the input sound and the note sequence produced by the matching unit 10.

[0029] A-2. Recognition note information storage unit

Next, the codebook 6a and the note dictionary 6b stored in the recognition note information storage unit 6 are described. In the codebook 6a, representative feature parameters of audio signals are clustered, as feature vectors, into a predetermined number of symbols.

[0030] A-2-1. Feature vectors

Before the codebook 6a is described, the six types of feature vectors used in this embodiment are explained.

[0031] Energy

Energy is a coefficient representing the intensity of the sound and is parameterized into a feature vector by the following equation.

[Equation 1]

Delta energy

Delta energy is a coefficient that expresses the intensity of the sound as a difference and is parameterized into a feature vector by the following equation.

[Equation 2]

Zero-cross rate

The zero-cross rate has the property that it becomes lower the more voiced the sound is, and is parameterized into a feature vector by the following equation.

[Equation 3]

Pitch frequency

The pitch frequency can be obtained by the Two-Way Mismatch method described in "Fundamental Frequency Estimation in the SMS Analysis" (P. Cano, DAFX Proceedings 1998).
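Equations 1 to 3 for energy, delta energy, and zero-cross rate are not reproduced in this text, so the sketch below uses common textbook definitions purely as an illustration: mean squared amplitude for energy, the frame-to-frame difference for delta energy, and the fraction of adjacent sample pairs that change sign for the zero-cross rate. These definitions are assumptions and may differ from the patent's own formulas.

```python
from typing import Sequence


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of the frame (assumed definition)."""
    if not frame:
        raise ValueError("frame must not be empty")
    return sum(x * x for x in frame) / len(frame)


def delta_energy(current_energy: float, previous_energy: float) -> float:
    """Energy difference between consecutive frames (assumed definition)."""
    return current_energy - previous_energy


def zero_cross_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ (assumed definition)."""
    if len(frame) < 2:
        raise ValueError("frame needs at least two samples")
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0.0) != (b >= 0.0))
    return crossings / (len(frame) - 1)
```

Under these definitions a noisy, unvoiced frame tends to give a high zero-cross rate and a sustained voiced frame a low one, which matches the behaviour the text describes.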

[0032] Delta pitch

Delta pitch is parameterized into a feature vector by the following equation.

[Equation 4]

Total error

The total error indicates how voiced the sound is by evaluating the mismatch in two directions: the error of the predicted pitch with respect to the measured pitch, and the error of the measured pitch with respect to the predicted pitch. First, the pitch error from the predicted pitch (p) to the measured pitch (m) is expressed by the following equation.

[Equation 5]

In the above equation, f_k is the k-th predicted peak frequency, Δf_k is the frequency difference between the k-th predicted peak frequency and the measured pitch, a_k is the k-th predicted amplitude, and A_max is the maximum amplitude. Conversely, the pitch error from the measured pitch (m) to the predicted pitch (p) is expressed by the following equation.

[Equation 6]

In the above equation, f_k is the k-th predicted peak frequency, Δf_k is the frequency difference between the k-th predicted peak frequency and the measured pitch, a_k is the k-th predicted amplitude, and A_max is the maximum amplitude.
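Because Equations 5 and 6 are not reproduced in this text, the following sketch falls back on the commonly published form of the Two-Way Mismatch procedure (Maher and Beauchamp), in which each direction sums a frequency-weighted mismatch plus an amplitude-weighted correction, and the two directions are then averaged and combined. The weights `p`, `q`, `r`, and `rho`, the number of harmonics, and the function name are all assumptions taken from that general formulation, not from the patent.

```python
from typing import Sequence, Tuple


def twm_total_error(f0: float,
                    peaks: Sequence[Tuple[float, float]],
                    n_harmonics: int = 10,
                    p: float = 0.5, q: float = 1.4,
                    r: float = 0.5, rho: float = 0.33) -> float:
    """Two-way mismatch error of candidate pitch f0 against measured peaks.

    peaks: (frequency_hz, amplitude) pairs of measured spectral peaks.
    Lower values mean the candidate explains the peaks better.
    """
    if f0 <= 0.0 or not peaks:
        raise ValueError("need a positive f0 and at least one peak")
    a_max = max(amp for _, amp in peaks)
    harmonics = [f0 * n for n in range(1, n_harmonics + 1)]

    # Predicted -> measured: each predicted harmonic looks for its nearest peak.
    err_pm = 0.0
    for fh in harmonics:
        fk, ak = min(peaks, key=lambda peak: abs(peak[0] - fh))
        weighted = abs(fk - fh) * fh ** (-p)
        err_pm += weighted + (ak / a_max) * (q * weighted - r)

    # Measured -> predicted: each measured peak looks for its nearest harmonic.
    err_mp = 0.0
    for fk, ak in peaks:
        fh = min(harmonics, key=lambda h: abs(h - fk))
        weighted = abs(fk - fh) * fk ** (-p)
        err_mp += weighted + (ak / a_max) * (q * weighted - r)

    return err_pm / len(harmonics) + rho * err_mp / len(peaks)
```

A pitch estimator built on this would evaluate several candidate values of `f0` and keep the one with the smallest error; the error at that minimum is the kind of quantity the text uses as a voicing indicator.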

[0033] The total error is therefore given by the following equation.

[Equation 7]

A-2-2. Note dictionary

The note dictionary 6b stores a left-to-right hidden Markov model for each note. Three types of model are provided in the note dictionary 6b according to the physical characteristics of the input sound: one each for pitched sounds, pitchless sounds, and silence. For pitched sounds, a model, that is, the number of states and the state transition probabilities, is stored for each pitch (C0, C#0, D0, D#0, E0, F0, F#0, G0, G#0, ...) (see FIGS. 2 and 3). The dictionary also stores the observation probability values for the symbols of each feature vector, calculated in advance according to a Gaussian distribution or the like, as described later. For a pitched sound the number of states is three; the states approximate the onset of the sound (attack), the steady state (steady), and the release.
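One way to picture the note dictionary and the state chain built from it is sketched below. The data layout, the class name `NoteModel`, and the self-transition probabilities are hypothetical; only the state counts (three for pitched sounds, one for silence) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class NoteModel:
    """Left-to-right model for one dictionary entry (hypothetical layout)."""
    name: str
    stay_prob: List[float]  # one self-transition probability per state

    @property
    def n_states(self) -> int:
        return len(self.stay_prob)


def build_chain(note_sequence: Sequence[str],
                dictionary: Dict[str, NoteModel]) -> List[Tuple[int, int, float]]:
    """Concatenate per-note models into one left-to-right chain.

    Each entry is (note_position, state_within_note, stay_probability).
    """
    chain = []
    for position, name in enumerate(note_sequence):
        model = dictionary[name]
        for k in range(model.n_states):
            chain.append((position, k, model.stay_prob[k]))
    return chain


# Illustrative dictionary: attack/steady/release for pitched notes, one state
# for silence. The probabilities are made-up placeholders.
EXAMPLE_DICTIONARY = {
    "C3": NoteModel("C3", [0.6, 0.9, 0.7]),
    "D3": NoteModel("D3", [0.6, 0.9, 0.7]),
    "SILENCE": NoteModel("SILENCE", [0.9]),
}
```

Concatenating the models in score order gives a single chain whose state index increases monotonically with score position, which is what lets a decoded state be read back as a position in the note sequence.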

[0034] In this embodiment, one model is provided as a pitchless sound for each plosive (/p/, /b/, ...) and each fricative (/s/, /sh/, ...). The pitchless sound models also have three states (attack, steady, and release), which makes it possible to reproduce fine nuances of breath sounds, plosives, fricatives, and the like. For silence (SILENCE), the number of states is set to one. For pitched sounds, preparing a model for notes connected to the preceding note by a slur that is separate from the model for single notes allows the input sound to be associated with the notes more accurately.

A-2-3. Codebook

The codebook 6a is generated from five of the six types of feature vectors described above (energy, delta energy, zero-cross rate, delta pitch, and total error) and is divided into clusters. Each cluster contains a set of typical vectors representing a symbol (see FIG. 4). In this embodiment, the well-known LBG algorithm is used to create the codebook 6a. The codebook 6a is created according to the continuous density function of a Gaussian distribution given by the following equation.

[Equation 8]

Here, C_jm denotes the mixture weight coefficient of component m in state j. N(j; μ, Σ_j) denotes a multidimensional Gaussian distribution with mean vector μ and covariance matrix Σ. Used as is, its multidimensional nature would make the number of parameters to be learned enormous, so in this embodiment N(j; μ, Σ_j) is treated as a fixed function.

[0035] To create a codebook 6a that follows such a Gaussian distribution, the observation sequence y_t~l of the quantization vectors N_l in a frame is taken to form the subset M_l. Here, M_l is the number of mixture components of the codebook being created. When the codebook 6a is created, the parameters are estimated by the following calculations.

[Equation 9]

[Equation 10]

The mixture weight coefficient C_jm is used when the quantization vector N_lm is assigned to the m-th mixture of the codebook, and is expressed by the following equation.

(Equation 11)

Next, in the present embodiment, the function used to calculate the observation probability of the pitch frequency, the one feature vector outside the five types above, is divided into two cases: pitched sounds, and pitchless sounds or silence. FIG. 5(a) shows the step function used to calculate the observation probability b(y) for a pitched sound, and FIG. 5(b) shows the step function used to calculate b(y) for a pitchless sound or silence. As shown in FIG. 5(a), for a pitched sound the calculated observation probability is a constant when the pitch F₀ = 0, that is, when no pitch is detected. When a pitch is detected, that is, when F₀ ≠ 0, the step function is replaced by a continuous density function following a Gaussian distribution, which calculates the observation probability b'(y) (see the graph at the upper right of FIG. 5(a)). The pitch F₀Dev used in this calculation is given by the following equation.

(Equation 12)

In the above equation, F_Perfect0 denotes the pitch frequency that should ideally be played. For example, in equal temperament with A4 = 440 Hz, the values of F_Perfect0 used to calculate the observation probability for each pitch are as follows.

C3: 261.626 Hz
D3: 293.66 Hz
E3: 329.628 Hz

The observation probabilities for the six types of feature vectors are calculated in this way, according to the Gaussian continuous density function and the other functions described above, and are stored in the note dictionary 6b. The codebook 6a is also created in a form that follows this Gaussian distribution.
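
The equal-tempered reference frequencies listed for F_Perfect0 can be reproduced with a short calculation. The following Python sketch is illustrative only: the function name and the semitone offsets (C, D, and E taken as 9, 7, and 5 semitones below the 440 Hz reference A) are assumptions introduced here and are not part of the specification.

```python
def equal_tempered_frequency(semitones_from_reference, reference_hz=440.0):
    """Frequency of the pitch lying the given number of semitones above
    (positive) or below (negative) the reference pitch in equal temperament."""
    return reference_hz * 2.0 ** (semitones_from_reference / 12.0)


# Assumed offsets of C, D and E relative to the 440 Hz reference A.
for name, offset in (("C", -9), ("D", -7), ("E", -5)):
    print(name, round(equal_tempered_frequency(offset), 3), "Hz")
```

Expressing the pitches as semitone offsets from the reference avoids depending on any particular octave numbering convention.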

[0036] A-3. Note string information storage unit

Next, the note string information storage unit 11 will be described. As shown in FIG. 6, the note string information storage unit 11 stores the note sequence of a piece of music or the like, described in time series. In the present embodiment, duration information is also stored for each note of the note sequence. Thus, to store the note sequence of a piece written as the score in FIG. 6(a), the note string information and duration information shown in FIG. 6(b) are stored. The duration information is expressed as follows.

[0037] For a pitched or pitchless sound, the duration is expressed, for example, as follows.

S1: whole note
S2: half note
S3: quarter note
S4: eighth note

[0038] For silence (a rest), the duration is expressed, for example, as follows.

U1: whole rest
U2: half rest
U3: quarter rest
U4: eighth rest

[0039] The actual length of each duration is therefore determined by the tempo marking on the score or by the set tempo.
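
The conversion from a duration symbol to real time can be sketched as follows. This is a minimal Python illustration, assuming that the tempo is given in quarter-note beats per minute; the table name `BEATS` and the function `duration_seconds` are hypothetical and do not appear in the specification.

```python
# Length of each duration symbol in beats, assuming one beat is a quarter note.
BEATS = {
    "S1": 4.0, "S2": 2.0, "S3": 1.0, "S4": 0.5,   # notes
    "U1": 4.0, "U2": 2.0, "U3": 1.0, "U4": 0.5,   # rests
}


def duration_seconds(symbol, tempo_bpm):
    """Real-time length of a duration symbol at the given tempo."""
    return BEATS[symbol] * 60.0 / tempo_bpm


# A quarter note (S3) at a tempo of 120 lasts half a second.
print(duration_seconds("S3", 120))
```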

[0040] B. Operation of the embodiment

Next, the operation of the audio signal processing device configured as described above will be described.

B-1. Outline of operation

First, the outline of the operation of the audio signal processing device will be described with reference to the flowchart shown in FIG. 7. When the microphone 1 generates an input audio signal, the signal is subjected to a fast Fourier transform frame by frame to obtain a frequency spectrum. Feature parameter analysis is then performed on the obtained frequency spectrum to acquire the six types of feature parameters described above (step S1).
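
Frame-by-frame feature extraction can be illustrated with a simplified sketch. The Python code below is not the analysis of the embodiment: it works directly on the time-domain samples instead of an FFT spectrum and computes only energy, delta energy, and zero-cross rate. All function names, the frame length, and the hop size are assumptions chosen for illustration.

```python
import math


def split_frames(samples, frame_len, hop):
    """Cut the signal into fixed-length frames, advancing by `hop` samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_cross_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def deltas(values):
    """Frame-to-frame differences; the first frame has no predecessor, so 0.0."""
    return [0.0] + [b - a for a, b in zip(values, values[1:])]


# Example: a 100 Hz sine sampled at 8 kHz, 256-sample frames with 50% overlap.
signal = [math.sin(2 * math.pi * 100 * n / 8000) for n in range(2048)]
frames = split_frames(signal, 256, 128)
energies = [frame_energy(f) for f in frames]
delta_energies = deltas(energies)
zcrs = [zero_cross_rate(f) for f in frames]
```

The pitch-related parameters (pitch, delta pitch, and the error term) would require a pitch estimator and are left out of this sketch.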

[0041] Next, the symbol quantization unit 7 symbol-quantizes the six types of acquired feature parameters by referring to the recognition note information storage unit 6 (step S2). The symbol quantization unit 7 then obtains the observation probability of each quantized symbol by referring to the note dictionary 6b.

[0042] After that, the note sequence state forming unit 8 refers to the note string information stored in the note string information storage unit 11 and to the note dictionary 6b, and forms the states of the note sequence as an HMM (step S3). Based on the symbol observation probabilities obtained by the symbol quantization unit 7 as described above and on the HMM formed by the note sequence state forming unit 8, the state transition determining unit 9 determines the state transitions using the Viterbi algorithm (step S4). The HMM and the Viterbi algorithm are described later. Based on the state transitions determined by the state transition determining unit 9, the matching unit 10 associates the input voice with the note sequence in time (step S5), and the result of this association is displayed on the display device 12 (step S6).

[0043] B-2. Details of operation

Next, each of the processes mentioned in the outline of operation will be described in detail.

[0044] B-2-1. Feature parameter analysis and symbol quantization

FIG. 8 is a diagram for explaining the process of acquiring feature parameters from the input audio signal generated by the microphone 1 and symbol-quantizing them. As shown in the figure, the input audio signal is converted into a frequency spectrum by a fast Fourier transform frame by frame, and feature parameter analysis is performed on this frequency spectrum. Then, for each feature vector, the maximum-likelihood symbol is found in the codebook 6a of the recognition note information storage unit 6, and the observation probability of that symbol is obtained by referring to the note dictionary 6b.
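
The two lookups described above, codebook first and note dictionary second, can be sketched as follows. This Python example is a simplification under stated assumptions: the nearest codeword by Euclidean distance stands in for the maximum-likelihood symbol search, and the data layouts of `codebook` and `note_dictionary` are hypothetical, not those of the codebook 6a or the note dictionary 6b.

```python
import math


def quantize(feature, codebook):
    """Return the index (symbol) of the codeword closest to `feature`.

    Euclidean distance is used here as a stand-in for the likelihood
    computation of the embodiment.
    """
    return min(range(len(codebook)),
               key=lambda k: math.dist(feature, codebook[k]))


def observation_probability(symbol, state, note_dictionary):
    """Look up the stored probability of observing `symbol` in `state`."""
    return note_dictionary[state][symbol]


# Toy data: three codewords and two states with per-symbol probabilities.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
note_dictionary = {"attack": [0.1, 0.7, 0.2], "steady": [0.6, 0.3, 0.1]}

symbol = quantize((0.9, 1.2), codebook)
prob = observation_probability(symbol, "attack", note_dictionary)
```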

[0045] B-2-2. Hidden Markov model

Next, the hidden Markov model (HMM) will be described with reference to FIG. 9. Because the state of a sound transitions in one direction only, the present embodiment uses a left-to-right model, as described above.

[0046] The probability of a transition from state i to state j at time t (the state transition probability) is denoted a_ij. In the example shown in FIG. 9, the probability of remaining in state 1 is denoted a₁₁, and the probability of a transition from state 1 to state 2 is denoted a₁₂. As described above, these state transition probabilities are stored for each note in the note dictionary 6b (see FIG. 3). In the present embodiment, for pitched and pitchless sounds, state 1 represents the attack, state 2 the steady part, and state 3 the release.

[0047] Each state has its own feature vectors, and each has different observation symbols. These are denoted Y = {y₁, y₂, ..., y_t}. The probability of generating the feature-vector symbol y_t when the state is j at time t (the symbol observation probability) is denoted b_j(y_t). Let M denote the illustrated model and let X denote the state sequence through which the model passes (states 1, 2, and 3 in order) while producing the observation symbol sequence Y; the probability that X and Y occur together is then given by the following equation.

(Equation 13)

In the present embodiment, an FSN (finite state network) such as that shown in FIG. 9 is formed note by note, based on the note string information stored in the note string information storage unit 11. For example, when the information shown in FIG. 6(b) is stored, the hidden Markov model is formed from the numbers of states and the state transition probabilities stored in the note dictionary 6b for the pitches "E3", "G3", "silence", and so on.
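
Concatenating per-note models into one left-to-right chain can be sketched as follows. This Python example assumes three states for pitched and pitchless sounds and one state for silence, as in the embodiment; the tuple layout and the names `STATES_PER_KIND` and `build_state_chain` are hypothetical.

```python
# Number of HMM states per kind of note model (three-state and one-state models).
STATES_PER_KIND = {"pitched": 3, "pitchless": 3, "silence": 1}


def build_state_chain(notes):
    """Expand a note sequence into a flat left-to-right list of states.

    `notes` is a list of (label, kind) pairs; each returned entry is
    (note_index, label, state_index_within_note).
    """
    chain = []
    for note_index, (label, kind) in enumerate(notes):
        for state_index in range(STATES_PER_KIND[kind]):
            chain.append((note_index, label, state_index))
    return chain


chain = build_state_chain([("E3", "pitched"), ("G3", "pitched"), ("silence", "silence")])
for entry in chain:
    print(entry)
```

Keeping the note index in every state makes it straightforward to map a decoded state back to its position in the note sequence.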

[0048] B-2-3. State transition determination

In the present embodiment, the state transitions of the input voice are determined according to the Viterbi algorithm from the hidden Markov model formed as described above on the basis of the note string information stored in the note string information storage unit 11, the feature symbols obtained by the symbol quantization unit 7, and the observation probabilities of those symbols. The outline is as follows. The Viterbi algorithm derives the most likely state sequence when the model M outputs the observation symbol sequence Y. Let Φ_j(t) be the highest probability of having observed the vectors y₁ to y_t and being in state j at time t; the partial likelihood is then given by the following recursion.

(Equation 14)

In the above equation, a_ij is the state transition probability from state i to state j (that is, it is determined by the note sequence stored in the note string information storage unit 11), and b_j(y_t) is the symbol observation probability of each feature vector at time t, determined from the feature vectors of the input voice and the recognition note information storage unit 6 and related components described above.

[0049] If Φ₁(1) = 1 and Φ_j(1) = a_1j b_j(y₁), then for 1 < j < N the maximum likelihood probability P'(Y|M) is given by the following equation.

(Equation 15)

By recursively obtaining the maximum probability of reaching state j in this way, the optimal path, that is, the state sequence with the highest probability, is derived and the state transitions are determined.
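
A Viterbi search over a left-to-right chain can be sketched as follows. The equations of the specification are not reproduced in this text, so this Python example assumes the standard recursion, in which the best score of a state is the maximum over its predecessors of the previous score times the transition probability, multiplied by the observation probability. It works in the log domain, allows only "stay" and "advance by one" transitions, and forces the path to start in the first state and end in the last. The function name and argument layout are hypothetical.

```python
import math

NEG_INF = float("-inf")


def viterbi_left_to_right(log_obs, log_stay, log_next):
    """Best state index per frame for a left-to-right chain.

    log_obs[t][j]: log observation probability of frame t in state j.
    log_stay[j]:   log probability of remaining in state j.
    log_next[j]:   log probability of moving from state j to state j + 1.
    """
    T, N = len(log_obs), len(log_stay)
    phi = [[NEG_INF] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    phi[0][0] = log_obs[0][0]
    for t in range(1, T):
        for j in range(N):
            stay = phi[t - 1][j] + log_stay[j]
            move = phi[t - 1][j - 1] + log_next[j - 1] if j > 0 else NEG_INF
            if move > stay:
                phi[t][j], back[t][j] = move, j - 1
            else:
                phi[t][j], back[t][j] = stay, j
            phi[t][j] += log_obs[t][j]
    # Trace back from the final state to recover the best path.
    path = [N - 1]
    for t in range(T - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Two states, four frames: the first two frames fit state 0, the last two state 1.
log_obs = [[math.log(0.9), math.log(0.1)],
           [math.log(0.9), math.log(0.1)],
           [math.log(0.1), math.log(0.9)],
           [math.log(0.1), math.log(0.9)]]
best_path = viterbi_left_to_right(log_obs,
                                  log_stay=[math.log(0.5), math.log(1.0)],
                                  log_next=[math.log(0.5)])
```

Working with logarithms turns the products into sums and avoids numerical underflow on long inputs.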

[0050] Furthermore, in the present embodiment, the note string information storage unit 11 stores duration information for each note of the note sequence, and the algorithm described below is used, in which this duration information is incorporated into the Viterbi algorithm used for the optimal path calculation above.

[0051] This algorithm keeps track of the elapsed duration of each note n at time t, and introduces a penalty function P that is applied to the transition from the state at time t to state j at time t + 1.

(Equation 16)

Details on introducing such a penalty function P are given in "Robust Parametric Modeling of Durations in HMMs" (D. Burshtein, ICASSP Proceedings 1995).

[0052] In the penalty function P, ΔD(t) is the difference between the actual duration of the sound and the duration for which the sound should ideally be played. That is, ΔD(t) = D(t) - Dn, where D(t) is the actual duration of the sound and Dn is the duration for which the sound should ideally be played, represented in the note sequence by a duration symbol (S1, S2, and so on; see FIG. 6(b)). As described above, the actual length of a duration is determined by the tempo marking on the score or by the set tempo. For example, in the score shown in FIG. 6, when the duration information is "S3", that is, a quarter note, Dn = 60/120 s = 500 msec. In the penalty function, l(u) = log p(u), where p(u) is the probability density of ΔD, modeled as a Gaussian mixture density.
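
The term l(u) = log p(u) can be illustrated on its own. The Python sketch below evaluates the logarithm of a one-dimensional Gaussian mixture density at ΔD = D(t) - Dn. How this term enters the penalty function P is defined by Equation 16, which is not reproduced here, so the sketch stops at l(ΔD). The function names and the mixture parameters are assumptions chosen for illustration.

```python
import math


def log_gaussian_mixture(u, weights, means, variances):
    """log p(u) for a one-dimensional Gaussian mixture density."""
    density = sum(
        w * math.exp(-((u - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances))
    return math.log(density)


def duration_log_score(actual_sec, nominal_sec, weights, means, variances):
    """l(delta) for delta = actual duration minus nominal duration."""
    delta = actual_sec - nominal_sec
    return log_gaussian_mixture(delta, weights, means, variances)


# A quarter note at tempo 120 nominally lasts 0.5 s; compare two actual lengths.
params = dict(weights=[1.0], means=[0.0], variances=[0.01])
on_time = duration_log_score(0.5, 0.5, **params)
too_long = duration_log_score(0.8, 0.5, **params)
```

A note held close to its nominal length scores higher than one held much longer, which is the behavior a duration penalty needs. For very large deviations the density can underflow to zero, so a practical implementation would clamp it before taking the logarithm.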

[0053] To include the penalty function P in the Viterbi algorithm described above, the logarithm of the model parameters is taken, and the recursion is then expressed as follows.

(Equation 17)

With this equation, an optimal path that takes the duration of each sound into account can be determined.

[0054] In the example shown in FIG. 10, the probabilities calculated by the above equation are indicated by ○ or △ (○ > △). For example, given the observations from time tm1 to time tm3 (tm1 and so on denote times in frame units), the probability that a path is formed from the state "Silence" to the state "C1" is higher than the probability that a path is formed from the state "Silence" to the state "Silence". It is therefore the best probability at time tm3, and the state transition is determined as shown by the bold line in the figure. By performing this calculation at the time corresponding to each frame of the input voice (tm1, tm2, tm3, ...), the transitions in the example of FIG. 10 are determined as shown by the bold line. This makes it possible to identify the note of the input voice at each frame time. In the case shown in FIG. 10, tm1 and tm2 are identified as "Silence", tm3 to tm10 as "C1", tm11 onward as "D0", and so on.

[0055] In this way, the notes of the input voice identified at each frame time can be associated with the notes of the note sequence stored in the note string information storage unit 11, and the result of associating the input voice with each note of the note sequence can be displayed on the display device 12. The display method of the display device 12 is arbitrary. For example, as shown in FIG. 10, the score may be displayed with an arrow or the like pointing to the position on the score of the current input voice, or the note corresponding to the current input voice may be displayed in a different color from the other notes to show where in the note sequence the current input voice is.

[0056] C. Modifications

In the embodiment described above, transition probabilities, symbol observation probabilities, and the like calculated in advance are written in the note dictionary 6b of the recognition note information storage unit 6. Alternatively, learning data (sets of waveform data of musical sounds or singing voices and data representing the corresponding note sequences) may be input as needed, and these parameters may be estimated and rewritten. In this case, the data representing the note sequence of the learning data is used to generate, for each note, a hidden Markov model extended to an FSN. The parameters of each note model are then estimated so as to maximize the likelihood of the input learning data. A method of estimating the parameters using the known K-means method is outlined below.

[0057] Initialization

First, the waveform data of the learning data is divided note by note according to the note sequence notation.

[0058] Estimation

Next, the transition probability is calculated by counting the time taken by each transition and dividing it by the transition time of the state. That is, the transition probability is calculated by the following equation.

(Equation 18)

In this process, counters must be maintained to keep track of each state transition and output symbol during learning.
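
Counter-based estimation of transition probabilities can be sketched as follows. Equation 18 is not reproduced in this text, so this Python example assumes the common relative-frequency estimate: the number of observed transitions from state i to state j divided by the number of all transitions leaving state i. The function name and the input layout (one list of state indices per segmented training utterance) are hypothetical.

```python
from collections import Counter


def estimate_transitions(state_sequences):
    """Relative-frequency estimate of transition probabilities.

    `state_sequences` holds one frame-by-frame state sequence per utterance.
    Returns a dict mapping (i, j) to the estimated probability of i -> j.
    """
    pair_counts = Counter()   # how often each transition i -> j was seen
    from_counts = Counter()   # how often state i was left or kept
    for sequence in state_sequences:
        for i, j in zip(sequence, sequence[1:]):
            pair_counts[(i, j)] += 1
            from_counts[i] += 1
    return {(i, j): count / from_counts[i]
            for (i, j), count in pair_counts.items()}


# One segmented utterance: three frames in state 0, two in state 1, one in state 2.
transitions = estimate_transitions([[0, 0, 0, 1, 1, 2]])
```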

[0059] The mixture weight coefficients of the Gaussian continuous density function used for the observation probabilities of the five types of feature vectors other than the pitch frequency are estimated for each state i by the following equation.

(Equation 19)

For the remaining feature vector, the pitch frequency, the following probability function is used.

(Equation 20)

As described above, when a pitch is present, a Gaussian continuous density function is used for the pitch frequency just as for the other five types of feature vectors, so in that case the parameters are estimated in the same way as for those five types.

[0060] Segmentation

Using the parameters estimated in the estimation step above, the data is segmented again.

[0061] Iteration

The estimation and segmentation steps above are repeated until convergence.

[0062] Learning in this way makes it possible to estimate more accurate parameters and write them in the note dictionary 6b. This improves the accuracy of the state transition determination performed with reference to the note dictionary 6b, so that the position of the input voice in the note sequence can be detected more accurately.

【００６３】[0063]

[Effect of the Invention] As described above, the present invention makes it possible to detect more accurately where on the score an input sound, such as a singing voice or an instrument sound, is located.

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an audio signal processing device according to an embodiment of the present invention.

FIG. 2 is a diagram for explaining the note dictionary, which is a component of the audio signal processing device.

FIG. 3 is a diagram for explaining the note dictionary, which is a component of the audio signal processing device.

FIG. 4 is a diagram for explaining the codebook, which is a component of the audio signal processing device.

FIG. 5 is a diagram for explaining the functions used to calculate the observation probabilities of the symbols described in the note dictionary.

FIG. 6 is a diagram for explaining the note string information storage unit, which is a component of the audio signal processing device.

FIG. 7 is a flowchart for explaining the operation of the audio signal processing device.

FIG. 8 is a diagram for explaining the process of acquiring feature vectors from the input voice.

FIG. 9 is a diagram for explaining the hidden Markov model used in the audio signal processing device.

FIG. 10 is a diagram for explaining the association between the input voice and the note sequence.

FIG. 11 is a diagram for explaining a display example of the result of associating the input voice with the note sequence.

[Explanation of symbols]

1: microphone; 2: analysis window generation unit; 4: fast Fourier transform unit; 5: feature parameter analysis unit; 6: recognition note information storage unit; 6a: codebook; 6b: note dictionary; 8: note sequence state forming unit; 9: state transition determining unit; 11: note string information storage unit; 12: display device

─────────────────────────────────────────────────────

[Procedure amendment]

[Submission date] December 15, 1999

[Procedure amendment 1]

[Document to be amended] Drawings

[Item to be amended] FIG. 8

[Method of amendment] Change

[Content of amendment]

[FIG. 8]

Continued from the front page

(51) Int. Cl.7: G10L 15/00; FI: G10L 3/00 551G
(72) Inventor: Pedro Keino, Merce 12, 08002 Barcelona, Spain
(72) Inventor: Alex Roscos, Merce 12, 08002 Barcelona, Spain
F-terms (reference): 5D015 AA06 CC03 CC06 CC13 CC14 CC18 HH23 KK02; 5D082 BB01 BB14 BB15; 5D378 KK02 KK03 KK05 MM02 MM14

Claims


1. An audio signal processing device for associating an input voice with one of the notes in a note sequence, comprising:
note sequence storage means for storing note string information described in time series;
parameter acquisition means for acquiring feature parameters from an audio signal input in frame units;
recognition note information storage means for storing a codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors, and, for each note, a number of states, state transition probabilities, and observation probabilities of the symbols;
observation probability acquisition means for acquiring, by referring to the recognition note information storage means, an observation symbol of the input voice from the feature parameters acquired by the parameter acquisition means, and for acquiring the observation probability of that observation symbol;
state forming means for forming, on a finite state network and by a hidden Markov model, each state of the note sequence stored in the note sequence storage means, based on the numbers of states and the state transition probabilities stored in the recognition note information storage means;
state transition determining means for determining state transitions according to the observation probabilities acquired by the observation probability acquisition means and the hidden Markov model formed by the state forming means; and
association means for associating each frame of the input audio signal with the note string information, based on the state transitions determined by the state transition determining means.

2. The audio signal processing device according to claim 1, further comprising display means for displaying, based on the result of the association by the association means, which part of the note string information the current input voice corresponds to.

3. The audio signal processing device according to claim 1, wherein the parameter acquisition means acquires at least energy, delta energy, zero cross, pitch, delta pitch, and pitch error from the input audio signal as the feature parameters.

4. The audio signal processing device according to claim 3, wherein the five observation probabilities of energy, delta energy, zero cross, delta pitch, and pitch error stored in the recognition note information storage means are calculated using an observation function based on a Gaussian distribution; the observation probability of the pitch stored in the recognition note information storage means is calculated using the observation function based on a Gaussian distribution and a step observation probability function; and, when the observation probability of the pitch is calculated, the observation function based on the Gaussian distribution and the step observation function are used selectively depending on whether a pitch is present.

5. The state forming means forms three types of left-to-right hidden Markov models, according to whether the sound is a pitched sound, a pitchless sound, or silence; the pitched sound and the pitchless sound are formed as three-state models, and the silence is formed as a one-state model. The audio signal processing device according to any one of 4.

6. The audio signal processing device according to claim 5, wherein, when forming the hidden Markov model of the pitched sound, the state forming means forms a note connected to the preceding note by a slur and a single note as different models.

7. The audio signal processing device according to any one of the above claims, further comprising: input means for inputting learning musical sound waveform data and learning note sequence data obtained by transcribing the musical sound waveform into notes; learning model forming means for forming a hidden Markov model on a finite network for each note of the learning note sequence data input from the input means; and parameter estimating means for estimating, by a k-means algorithm, the parameters that maximize the likelihood, during learning, of the model formed by the learning model forming means; wherein the recognition note information storage means stores the state transition probabilities and the observation probabilities of the feature vectors for each note obtained from the parameters estimated by the parameter estimating means.

8. The audio signal processing device according to claim 1, wherein said state transition determining means determines a state transition by a Viterbi algorithm.

9. The audio signal processing device according to claim 8, wherein the note sequence storage means stores duration data corresponding to the note sequence, and the state transition determining means incorporates the duration data stored in the note sequence storage means into the Viterbi algorithm.

10. An audio signal processing method for associating an input voice with one of the notes in a pre-stored note sequence, comprising:
a parameter acquisition step of acquiring feature parameters from an audio signal input in frame units;
an observation probability acquisition step of acquiring an observation symbol of the input voice from the feature parameters acquired in the parameter acquisition step, and acquiring the observation probability of that observation symbol, by referring to a pre-stored codebook in which representative feature parameters of audio signals are clustered into symbols as feature vectors and to the number of states, the state transition probabilities, and the observation probabilities of the symbols for each note;
a state forming step of forming, on a finite state network and by a hidden Markov model, each state of the pre-stored note sequence, based on the pre-stored numbers of states and state transition probabilities;
a state transition determining step of determining state transitions according to the observation probabilities acquired in the observation probability acquisition step and the hidden Markov model formed in the state forming step; and
an association step of associating each frame of the input audio signal with the note string information, based on the state transitions determined in the state transition determining step.

11. The audio signal processing method according to claim 10, further comprising a display step of displaying, based on the result of the association in the association step, which part of the note string information the current input voice corresponds to.

12. The audio signal processing method according to claim 10, wherein, in the parameter acquisition step, at least energy, delta energy, zero cross, pitch, delta pitch, and pitch error are acquired from the input audio signal as the feature parameters.

13. The audio signal processing method according to claim 12, further comprising: a first observation probability calculation step of calculating and storing the five observation probabilities of energy, delta energy, zero cross, delta pitch, and pitch error using an observation function based on a Gaussian distribution; and a second observation probability calculation step of calculating and storing the observation probability of the pitch using the observation function based on the Gaussian distribution and a step observation probability function, the two functions being used selectively depending on whether a pitch is present; wherein, in the observation probability acquisition step, the observation probability is acquired by referring to the observation probabilities stored in the first and second observation probability calculation steps.

14. The audio signal processing method according to claim 12 or 13, wherein, in the state forming step, three types of left-to-right hidden Markov models are formed according to a pitched sound, a pitchless sound, and a silence, the pitched sound and the pitchless sound being formed as three-state models and the silence being formed as a one-state model.
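The left-to-right topology described in claim 14 can be sketched in Python as follows, reading the one-state model as the model for silence. The helper name `left_to_right`, the single `stay_prob` parameter and its value of 0.5 are illustrative assumptions; the claims do not state any transition probabilities.

```python
def left_to_right(n_states, stay_prob):
    """Build the transition matrix of a left-to-right model in which each
    state either repeats or advances to the next state, never backwards.

    The last state's remaining probability (1 - stay_prob) is the chance
    of leaving the model, so it does not appear inside the matrix."""
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        matrix[i][i] = stay_prob
        if i + 1 < n_states:
            matrix[i][i + 1] = 1.0 - stay_prob
    return matrix


# Three model types: three states for pitched and pitchless sounds,
# one state for silence. The 0.5 value is a placeholder, not a trained value.
MODELS = {
    "pitched": left_to_right(3, 0.5),
    "pitchless": left_to_right(3, 0.5),
    "silence": left_to_right(1, 0.5),
}
```

All entries below the diagonal are zero, which is what makes the model left-to-right.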

15. The audio signal processing method according to claim 14, wherein, in the state forming step, when forming the hidden Markov model of the pitched sound, a note connected with a previous note, a note connected by a slur, and a single note are formed as different models.

16. The audio signal processing method according to any one of claims 10 to 15, further comprising: an inputting step of inputting learning musical tone waveform data and learning note sequence data obtained by converting the learning musical tone into notes; a learning model forming step of forming a hidden Markov model on a finite state network for each note of the learning note sequence data input in the inputting step; a parameter estimating step of estimating, by a k-means algorithm, parameters that maximize the likelihood of the model formed in the learning model forming step during learning; and a probability storage step of storing the state transition probabilities and the observation probabilities of the feature vector in each note obtained from the parameters estimated in the parameter estimating step; wherein the observation probability acquiring step acquires the observation probability by referring to the observation probabilities stored in the probability storage step, and the state forming step forms each state of the note string stored in advance by a hidden Markov model on a finite state network, based on the state transition probabilities stored in the probability storage step.

17. The audio signal processing method according to claim 10, wherein said state transition determining step determines a state transition by a Viterbi algorithm.
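For reference, a generic textbook form of the Viterbi algorithm is sketched below in Python, applied to a toy two-state left-to-right chain. It is not claimed to be the implementation of this method; the function name, the log-probability inputs and all numeric values are assumptions chosen for illustration.

```python
import math


def viterbi(log_obs, log_trans, log_init):
    """Return the most likely state sequence.

    log_obs[t][s]   : log observation probability of frame t in state s
    log_trans[i][j] : log probability of moving from state i to state j
    log_init[s]     : log probability of starting in state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    back = []  # back[t-1][j] = best predecessor of state j at frame t
    for t in range(1, len(log_obs)):
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            new_score.append(score[best_i] + log_trans[best_i][j] + log_obs[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


NEG_INF = float("-inf")
# Two states standing for two consecutive notes; no backward transition.
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_init = [0.0, NEG_INF]
# The first two frames resemble note 0, the last two resemble note 1.
log_obs = [[math.log(0.9), math.log(0.1)]] * 2 + [[math.log(0.1), math.log(0.9)]] * 2
path = viterbi(log_obs, log_trans, log_init)
```

The returned `path` assigns one state index to each frame, which corresponds to the association between frames and notes described in claim 10.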

18. The audio signal processing method according to claim 17, wherein, in the state transition determining step, duration data corresponding to the note string stored in advance is incorporated into the Viterbi algorithm.