JP2017090848A

JP2017090848A - Music analysis device and music analysis method

Info

Publication number: JP2017090848A
Application number: JP2015224797A
Authority: JP
Inventors: 陽前澤; Akira Maezawa
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2015-11-17
Filing date: 2015-11-17
Publication date: 2017-05-25

Abstract

PROBLEM TO BE SOLVED: To specify the structural sections of a piece of music with high accuracy. SOLUTION: A music analysis device 100 analyzes structural sections, which are divisions of the musical structure within a piece of music, and comprises: a feature extraction unit 22 that extracts, for each unit section of an acoustic signal X representing the sound of the music, a timbre feature quantity FT representing a feature of the timbre of the sound and a chord feature quantity FC representing a feature of the chords of the sound; an estimation processing unit 24 that estimates a posterior distribution, by estimation processing to which the timbre feature quantity FT and the chord feature quantity FC extracted by the feature extraction unit 22 are applied, for each of a timbre observation model that probabilistically represents the generation process of the timbre feature quantity FT for each structural section including a plurality of unit sections, and a chord observation model that probabilistically represents the generation process of the chord feature quantity FC for each unit section within each structural section; and a structure analysis unit 26 that specifies a plurality of structural sections of the music from the result of the estimation processing. SELECTED DRAWING: Figure 1

Description

The present invention relates to a technique for analyzing music.

Selectively playing back a specific section of a piece of music, such as the chorus ("sabi") or the verse ("A melody"), or concatenating and playing back sections extracted from multiple pieces (for example, a remix for DJ use), requires analysis of the musical structure of the music. Patent Document 1 discloses a technique that detects and integrates repeated sections by analyzing the similarity of feature quantities sequentially extracted from an acoustic signal, and selects the chorus section of the music from the integrated repeated sections.

Patent Document 1: JP 2004-233965 A

In practice, however, it is difficult to analyze the structure of music with high accuracy using existing analysis techniques, including the technique of Patent Document 1. In view of these circumstances, an object of the present invention is to specify the structural sections of a piece of music with high accuracy.

To solve the above problem, a music analysis device according to a preferred aspect of the present invention comprises: a feature extraction unit that extracts, for each unit section of an acoustic signal representing the sound of a piece of music, a timbre feature quantity representing a feature of the timbre of the sound and a chord feature quantity representing a feature of the chords of the sound; an estimation processing unit that estimates a posterior distribution, by estimation processing to which the timbre feature quantity and the chord feature quantity extracted by the feature extraction unit are applied, for each of a timbre observation model that probabilistically represents the generation process of the timbre feature quantity for each structural section (a structural section being a division of the musical structure within the music that includes at least one unit section), and a chord observation model that probabilistically represents the generation process of the chord feature quantity for each unit section within each structural section; and a structure analysis unit that specifies a plurality of structural sections of the music from the result of the estimation processing. In this aspect, the estimation processing estimates the posterior distributions of both the timbre observation model for each structural section and the chord observation model for each unit section within each structural section. That is, the estimation processing uses a timbre observation model reflecting the tendency for timbre to be roughly uniform within one structural section of a piece of music, and a chord observation model reflecting the tendency for chords to change successively from one unit section to the next within a structural section. The structural sections of the music can therefore be specified with high accuracy.

In a preferred aspect of the present invention, the unit section is a bar (measure) of the music. In this aspect, the timbre feature quantity and the chord feature quantity are extracted for each unit section, with each bar of the music treated as a unit section. This has the advantage that the structural sections of the music can be estimated with high accuracy on the basis of the tendency for structural sections to change at the boundaries between successive bars.

In a preferred aspect of the present invention, the feature extraction unit extracts from the acoustic signal, for each unit section, a basic timbre feature quantity including a plurality of elements that depend on the timbre of the sound, and generates the timbre feature quantity by executing, for each of the unit sections, a first process of selecting the element-wise minimum of the basic timbre feature quantity between that unit section and each of a plurality of peripheral unit sections before and after it, and a second process of selecting, for each element, the maximum of those minima across the plurality of peripheral unit sections. That is, the first process extracts the components common to the unit section and each peripheral unit section, and the timbre feature quantity is generated by integrating the elements across the plurality of peripheral unit sections. Non-stationary noise components that vary over time can therefore be reduced simply, and an appropriate timbre feature quantity can be generated.

In a preferred aspect of the present invention, the estimation processing unit executes the estimation processing using, as the timbre observation model, an infinite mixture distribution, that is, a mixture with an infinite number of component probability distributions. Because an infinite mixture distribution is used as the timbre observation model, the number of mixture components of the probability distribution varies according to the timbre characteristics of the acoustic signal. This has the advantage that the posterior distribution of the timbre observation model can be estimated appropriately according to the characteristics of the acoustic signal.

In a preferred aspect of the present invention, the estimation processing unit uses for the estimation processing a state transition model that contains, for each of a plurality of structural sections, a sequence of states corresponding to the unit sections within that structural section, and that permits a transition from the last state of each structural section to the first state of another structural section. The state transition model thus appropriately expresses the tendency for structural sections, each composed of a plurality of unit sections, to follow one another within a piece of music. The structural sections of the music can therefore be estimated with high accuracy.

In a preferred aspect of the present invention, the estimation processing unit repeats the estimation processing a plurality of times with different initial values, and thereby specifies, for each run of the estimation processing, a structure estimation sequence in which identification codes indicating structural sections are arranged for each unit section of the music; the structure analysis unit then specifies the structural sections of the music from the plurality of structure estimation sequences specified over the plurality of runs. In this aspect, the structural sections of the music are specified from a plurality of structure estimation sequences generated by estimation processing with different initial values. This has the advantage that the structural sections of the music can be estimated with high accuracy, robustly against variation in the initial values applied to each run (that is, against differences in the structure estimation sequence from run to run).

In a music analysis method according to a preferred aspect of the present invention, a computer system: extracts, for each unit section of an acoustic signal representing the sound of a piece of music, a timbre feature quantity representing a feature of the timbre of the sound and a chord feature quantity representing a feature of the chords of the sound; estimates a posterior distribution, by estimation processing to which the extracted timbre feature quantity and chord feature quantity are applied, for each of a timbre observation model that probabilistically represents the generation process of the timbre feature quantity for each structural section (a structural section being a division of the musical structure within the music that includes at least one unit section), and a chord observation model that probabilistically represents the generation process of the chord feature quantity for each unit section within each structural section; and specifies a plurality of structural sections of the music from the result of the estimation processing.

A configuration diagram of a music analysis device according to a first embodiment of the present invention.
A flowchart of the feature extraction process.
An explanatory diagram of the unit sections (bars) of the target music.
A flowchart of the noise suppression process.
An explanatory diagram of the noise suppression process.
An explanatory diagram of the structure estimation sequence.
An explanatory diagram of the music structure model.
An explanatory diagram of the processing performed by the structure analysis unit.
A schematic diagram of the estimation result image.
A flowchart of the operation of the music analysis device.
An explanatory diagram of the operation of the structure analysis unit in a second embodiment.

<First Embodiment>
FIG. 1 is a configuration diagram of a music analysis device 100 according to the first embodiment of the present invention. The music analysis device 100 of the first embodiment is an information processing device that analyzes the musical structure of any one piece of music (hereinafter "target music"). Specifically, the music analysis device 100 analyzes a plurality of sections (hereinafter "structural sections") into which the target music is divided on the time axis according to its musical structure (musical meaning or auditory impression). Each structural section is a section (an element making up the target music) such as an "intro", "A melody", "B melody", "C melody", or "chorus" (sabi).

As illustrated in FIG. 1, the music analysis device 100 of the first embodiment is realized by a computer system comprising an arithmetic processing device 12, a storage device 14, and a display device 16. For example, a portable information processing device such as a mobile phone or smartphone, or a portable or stationary information processing device such as a personal computer, can be used as the music analysis device 100. The display device 16 (for example, a liquid crystal display panel) displays images as instructed by the arithmetic processing device 12. For example, the analysis result for the target music is displayed on the display device 16.

The storage device 14 stores the program executed by the arithmetic processing device 12 and the various data used by the arithmetic processing device 12. A known recording medium such as a semiconductor recording medium or a magnetic recording medium, or a combination of multiple types of recording media, may be freely adopted as the storage device 14. The storage device 14 of the first embodiment stores an acoustic signal X representing the sound of the target music (for example, performance sounds or singing sounds). The acoustic signal X may also be supplied to the music analysis device 100 from an external device, such as a playback device that reproduces an acoustic signal X recorded on a recording medium (for example, an optical disc).

By executing the program stored in the storage device 14, the arithmetic processing device 12 functions as a plurality of elements for analyzing the structure of the target music (a feature extraction unit 22, an estimation processing unit 24, and a structure analysis unit 26). A configuration in which the functions of the arithmetic processing device 12 are realized by a set of multiple devices (that is, a system), or in which a dedicated electronic circuit handles some of those functions, may also be adopted. Each function realized by the arithmetic processing device 12 is described in detail below.

<Feature extraction unit 22>
The feature extraction unit 22 extracts feature quantities of the acoustic signal X. The feature extraction unit 22 of the first embodiment extracts a timbre feature quantity FT (T: timbre) and a chord feature quantity FC (C: chord) for each unit section obtained by dividing the acoustic signal X on the time axis. The timbre feature quantity FT represents a feature of the timbre of the sound represented by the acoustic signal X. The first embodiment uses, as an example, a timbre feature quantity FT based on the MFCC (Mel-Frequency Cepstrum Coefficient) of the acoustic signal X. The chord feature quantity FC, on the other hand, represents a feature of the chords of the sound represented by the acoustic signal X. The first embodiment uses, as an example, a chord feature quantity FC based on a chroma vector. A chroma vector is a 12-dimensional vector whose elements correspond to the different scale tones (for example, the 12 semitones of equal temperament); each element is the sum, over multiple octaves, of the intensities of the frequency components of the acoustic signal X that correspond to that scale tone.
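The chroma vector described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `chroma_vector`, the 440 Hz reference tuning, and the nearest-semitone rounding are hypothetical choices, not details given in the publication.

```python
import math

def chroma_vector(freqs_hz, magnitudes, ref_hz=440.0):
    """Fold spectral magnitudes into a 12-element pitch-class (chroma) vector.

    ref_hz (A4 = 440 Hz) and nearest-semitone rounding are assumptions
    of this sketch, not details taken from the text.
    """
    chroma = [0.0] * 12
    for f, m in zip(freqs_hz, magnitudes):
        if f <= 0.0:
            continue  # skip DC or invalid bins
        # Semitone distance from A4, rounded to the nearest equal-tempered note.
        semitones = round(12.0 * math.log2(f / ref_hz))
        # Pitch-class index with C = 0 (A is 9 semitones above C).
        pitch_class = (semitones + 9) % 12
        # Summing here is what adds the same scale tone across octaves.
        chroma[pitch_class] += m
    return chroma
```

For instance, components at 220 Hz, 440 Hz, and 880 Hz are all the pitch class A, so their magnitudes accumulate in the same element of the vector.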

FIG. 2 is a flowchart of the process SA1 by which the feature extraction unit 22 of the first embodiment extracts the timbre feature quantity FT and the chord feature quantity FC (hereinafter "feature extraction process"). The feature extraction process SA1 of FIG. 2 starts when the user instructs analysis of the target music.

On starting the feature extraction process SA1, the feature extraction unit 22 divides the acoustic signal X into a plurality of unit sections on the time axis, as illustrated in FIG. 3 (SA11). Specifically, the feature extraction unit 22 detects the beat points of the acoustic signal X and defines, as a unit section, a section of one bar corresponding to a predetermined number of beat points. The technique described in JP 2015-114361 A, for example, is suitable for detecting beat points and bars (unit sections). However, any method may be used to divide the acoustic signal X into unit sections, and the method is not limited to this example. As described above, in the first embodiment the timbre feature quantity FT and the chord feature quantity FC are extracted for each unit section (that is, for each bar), with each bar of the target music treated as a unit section. This has the advantage that the structural sections of the target music can be estimated with high accuracy under the musical premise that structural sections change at the boundaries between successive bars.
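As a rough illustration of step SA11, the sketch below groups already-detected beat times into bar-length unit sections. It does not reproduce the beat and bar detection technique cited in the text; the function name `bars_from_beats`, the fixed four beats per bar, and the assumption that the first detected beat is a downbeat are all hypothetical.

```python
def bars_from_beats(beat_times, beats_per_bar=4):
    """Group detected beat times into bar-length unit sections.

    Returns a list of (start, end) pairs in seconds. beats_per_bar=4 and
    the assumption that the first beat is a downbeat are hypothetical.
    """
    bars = []
    # A bar needs its closing beat, so stop before an incomplete final group.
    for k in range(0, len(beat_times) - beats_per_bar, beats_per_bar):
        bars.append((beat_times[k], beat_times[k + beats_per_bar]))
    return bars
```

With beats every 0.5 s, each unit section spans 2 s; trailing beats that do not complete a bar are left out of the result.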

Having divided the acoustic signal X into unit sections, the feature extraction unit 22 extracts a basic timbre feature quantity FT0 and a basic chord feature quantity FC0 from the acoustic signal X for each unit section (SA12). The basic timbre feature quantity FT0 is the timbre feature from which the timbre feature quantity FT is derived, and the basic chord feature quantity FC0 is the chord feature from which the chord feature quantity FC is derived.

Specifically, as illustrated in FIG. 3, the feature extraction unit 22 calculates the MFCC of the acoustic signal X for each of a plurality of sections σ obtained by dividing one unit section on the time axis, and generates, as the basic timbre feature quantity FT0 of that unit section, a vector in which the MFCCs of the sections σ within the unit section are arranged in sequence. A section σ corresponds, for example, to a sixteenth note of the target music. The basic timbre feature quantity FT0 of the first embodiment is therefore an I-dimensional vector of I elements (I is a natural number of 2 or more), where I is the product of the number of dimensions of one MFCC (for example, 12) and the total number of sections σ in the unit section (for example, 16). That is, the basic timbre feature quantity FT0 expresses the temporal transition of the timbre within a unit section (within one bar). The feature extraction unit 22 also calculates a chroma vector of the acoustic signal X for each section σ in the unit section, and generates, as the basic chord feature quantity FC0 of that unit section, a vector in which the chroma vectors of the sections σ within the unit section are arranged in sequence. That is, the basic chord feature quantity FC0 expresses the temporal transition of the chords (the chord progression) within the unit section.
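The arrangement of per-section vectors into one vector per bar can be sketched as below. The helper name `stack_unit_feature` and the zero-valued placeholder MFCCs are hypothetical; only the example sizes (12 dimensions, 16 sections σ) come from the text.

```python
def stack_unit_feature(subsection_vectors):
    """Concatenate the per-section vectors of one bar into a single vector."""
    stacked = []
    for vec in subsection_vectors:
        stacked.extend(vec)
    return stacked

# 16 sixteenth-note sections, each with a 12-dimensional MFCC vector
# (placeholder zeros here), give I = 12 * 16 = 192 elements.
dummy_mfccs = [[0.0] * 12 for _ in range(16)]
ft0 = stack_unit_feature(dummy_mfccs)
```

The same stacking applied to sixteen 12-dimensional chroma vectors would give a basic chord feature quantity FC0 of the same length.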

The basic timbre feature quantity FT0 may contain non-stationary additive noise components that vary over time. The feature extraction unit 22 of the first embodiment suppresses these noise components by extracting the components (common components) shared by multiple unit sections that are close to one another on the time axis (SA13). FIG. 4 is a flowchart of the process SA13 by which the feature extraction unit 22 suppresses noise components in the basic timbre feature quantity FT0 (hereinafter "noise suppression process"), and FIG. 5 is an explanatory diagram of the noise suppression process SA13.

On starting the noise suppression process SA13, the feature extraction unit 22 selects one unit section (hereinafter "target unit section") from the plurality of unit sections of the acoustic signal X (SA130). Specifically, the feature extraction unit 22 selects each of the unit sections in turn as the target unit section, from the beginning to the end of the acoustic signal X.

As illustrated in FIG. 5, the J unit sections (J is a natural number of 2 or more) located around one target unit section are referred to below as "peripheral unit sections". The number J of peripheral unit sections is arbitrary. In FIG. 5 the peripheral unit sections are unit sections located both before and after the target unit section, but unit sections located only before, or only after, the target unit section may also be used as the peripheral unit sections.

Having selected a target unit section, the feature extraction unit 22 executes a first process SA131 and a second process SA132 in sequence for that target unit section. The first process SA131 and the second process SA132 are described in detail below. The following description assumes, as illustrated in FIG. 5, that the basic timbre feature quantity FT0 of the target unit section is an I-dimensional vector containing I elements fA(1) to fA(I), and that the basic timbre feature quantity FT0 of the j-th peripheral unit section (j = 1 to J) is an I-dimensional vector containing I elements fB(1,j) to fB(I,j).

The first process SA131 selects the element-wise minimum of the basic timbre feature quantity FT0 between the target unit section and each of the J peripheral unit sections. Specifically, as illustrated in FIG. 5, the first process SA131 generates a series of I elements fC(1,j) to fC(I,j) for each of the J peripheral unit sections around the target unit section. The i-th element fC(i,j) generated for the j-th peripheral unit section is the smaller of the i-th element fA(i) of the basic timbre feature quantity FT0 of the target unit section and the i-th element fB(i,j) of the basic timbre feature quantity FT0 of that peripheral unit section (fC(i,j) = min{fA(i), fB(i,j)}). A noise component that appears in only one of the target unit section and the peripheral unit section is therefore removed by the first process SA131. That is, the first process SA131 corresponds to extracting the components of the basic timbre feature quantity FT0 that are common to the target unit section and the peripheral unit section.

The second process SA132 selects, for each of the I elements, the maximum of the elements fC(i,1) to fC(i,J) (that is, of the minima of element fA(i) and element fB(i,j)) across the J peripheral unit sections. Specifically, as illustrated in FIG. 5, the second process SA132 generates an I-dimensional vector containing I elements fD(1) to fD(I) as the basic timbre feature quantity FT1. The i-th element fD(i) of the basic timbre feature quantity FT1 is the maximum of the i-th elements fC(i,1) to fC(i,J) across the J peripheral unit sections (fD(i) = max{fC(i,1), ..., fC(i,J)}). In other words, the second process SA132 generates the basic timbre feature quantity FT1 by integrating the elements fC(i,j) across the J peripheral unit sections.

Having executed the first process SA131 and the second process SA132 for one target unit section, the feature extraction unit 22 determines whether generation of the basic timbre feature quantity FT1 (SA130 to SA132) has been completed for all unit sections of the acoustic signal X (SA133). If the result is negative (SA133: NO), the feature extraction unit 22 returns to step SA130, selects as the target unit section a unit section for which the basic timbre feature quantity FT1 has not yet been generated (for example, the unit section immediately after the current target unit section) (SA130), and then executes the first process SA131 and the second process SA132 in sequence. When the basic timbre feature quantity FT1 has been generated for all unit sections of the acoustic signal X (SA133: YES), the noise suppression process SA13 of FIG. 4 ends.
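The loop SA130 to SA133 can be sketched in a few lines. This is a minimal sketch rather than the publication's implementation: the function name `suppress_noise`, the `radius` parameter (the text only states that J is arbitrary), and the handling of bars near the start and end of the signal, where fewer neighbours exist, are assumptions.

```python
def suppress_noise(features, radius=2):
    """Apply the min-then-max filter to a list of per-bar feature vectors.

    features[n] is the basic timbre feature FT0 of the n-th unit section.
    radius (neighbours taken on each side) is a hypothetical parameter;
    the text only says the number J of peripheral sections is arbitrary.
    """
    n_units = len(features)
    result = []
    for t in range(n_units):  # SA130: each bar in turn is the target
        target = features[t]
        # Peripheral unit sections: up to `radius` bars before and after t.
        lo, hi = max(0, t - radius), min(n_units, t + radius + 1)
        neighbours = [features[n] for n in range(lo, hi) if n != t]
        if not neighbours:
            result.append(list(target))  # single-bar input: nothing to compare
            continue
        # SA131, first process: fC(i, j) = min(fA(i), fB(i, j)) per neighbour j.
        common = [[min(a, b) for a, b in zip(target, nb)] for nb in neighbours]
        # SA132, second process: fD(i) = max over j of fC(i, j).
        result.append([max(column) for column in zip(*common)])
    return result
```

A value that spikes in a single bar is not shared with any neighbour, so every minimum discards it and the subsequent maximum cannot bring it back.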

Having generated the basic timbre feature quantity FT1 and the basic chord feature quantity FC0 for each unit section of the acoustic signal X through the processing described above (SA11 to SA13), the feature extraction unit 22 performs dimensionality reduction on the basic timbre feature quantity FT1 and the basic chord feature quantity FC0 (SA14). Specifically, the feature extraction unit 22 generates a low-dimensional (for example, 5-dimensional) timbre feature quantity FT by reducing the dimensionality of the basic timbre feature quantity FT1, and a low-dimensional (for example, 10-dimensional) chord feature quantity FC by reducing the dimensionality of the basic chord feature quantity FC0. Any known technique, such as principal component analysis, may be used for this dimensionality reduction. Through the feature extraction process SA1 (SA11 to SA14) described above, the timbre feature quantity FT and the chord feature quantity FC are generated for each unit section of the acoustic signal X.

In the first embodiment, the basic timbre feature quantity FT1 is generated by executing, for each of the unit sections of the acoustic signal X (the target unit section), the first process SA131, which selects the element-wise minimum of the basic timbre feature quantity FT0 between the target unit section and each peripheral unit section, and the second process SA132, which selects for each element the maximum of those minima across the J peripheral unit sections. With this configuration, the first process SA131 extracts the components common to the target unit section and each peripheral unit section, so non-stationary noise components that vary over time can be reduced simply and an appropriate timbre feature quantity FT can be generated.

From the standpoint of suppressing noise components of the basic timbre feature quantity FT0 alone, one could also consider a configuration (hereinafter the "comparative example") that extracts, as the timbre feature quantity FT, the result of executing the first process SA131 between one target unit section and the peripheral unit section immediately before or after it (for example, elements fC(1,1) to fC(I,1) in FIG. 5). However, the timbre of the acoustic signal X (the basic timbre feature quantity FT0) can change greatly across the boundary of a structural section. In the comparative example, which executes only the first process SA131, the timbre change at a structural-section boundary is therefore suppressed as if it were a noise component, and an appropriate timbre feature quantity FT may not be obtained for the unit sections around the boundary. In the first embodiment, the basic timbre feature quantity FT1 is generated by the second process SA132, which integrates the elements fC(i,j) over the J peripheral unit sections, so the timbre feature quantity FT can be generated appropriately even for unit sections around a structural-section boundary. That is, by executing both the first process SA131 and the second process SA132, the first embodiment has the advantage of reducing noise components while still generating an appropriate timbre feature quantity FT around structural-section boundaries.

<Estimation processing unit 24>
The estimation processing unit 24 in FIG. 1 estimates the music structure (the structural sections) of the target music using the timbre feature quantity FT and the chord feature quantity FC that the feature extraction unit 22 extracted for each unit section. Specifically, for a probability model that describes the probability of observing a time series of feature quantities under a particular music structure, the estimation processing unit 24 of the first embodiment estimates the probability distribution of the posterior probability (hereinafter the "posterior distribution") given that the time series of the timbre feature quantity FT and the chord feature quantity FC extracted by the feature extraction unit 22 have been observed, and estimates the music structure of the target music that maximizes the posterior distribution (MAP (maximum a posteriori) estimation). As illustrated in FIG. 6, the music structure of the target music is represented by a structure estimation sequence Y, in which codes s (hereinafter "section codes") serving as labels that identify the structural sections are arranged in time series, one per unit section of the acoustic signal X. In FIG. 6, each section code s is written as a single letter (A, B, C, ...). As FIG. 6 shows, a common section code s is assigned to the unit sections contained in any one structural section, and distinct section codes s are assigned to unit sections contained in different structural sections. The points in time at which the section code s changes are therefore candidates for the boundaries of the structural sections of the target music.

The process by which the estimation processing unit 24 of the first embodiment estimates the music structure of the target music (hereinafter the "estimation process") uses a music structure model MS, a timbre observation model MTk, and a chord observation model MCk. The music structure model MS is a probability model that describes the music structure probabilistically. The timbre observation model MTk is a probability model that describes the generation process of the timbre feature quantity FT probabilistically, and the chord observation model MCk is a probability model that describes the generation process of the chord feature quantity FC probabilistically. For each of the music structure model MS, the timbre observation model MTk, and the chord observation model MCk, the estimation processing unit 24 estimates the posterior distribution by an estimation process that uses the timbre feature quantity FT and the chord feature quantity FC extracted for each unit section by the feature extraction unit 22. The time series of section codes s that maximizes the posterior distribution of the music structure model MS is estimated as the structure estimation sequence Y. Specifically, the structure estimation sequence Y is generated so that the section code s changes at the points where the structural section of the target music is estimated to transition.

FIG. 7 is an explanatory diagram of the music structure model MS. The music structure model MS of the first embodiment is a state transition model (specifically, a hidden Markov model) in which a plurality of mutually chained states are arranged in a state space. Specifically, as illustrated in FIG. 7, a series of states corresponding to the different unit sections within one structural section (for example, one column of FIG. 7) is arranged in parallel for each of a plurality of structural sections. Any one state of the music structure model MS is specified by the combination of the section code s of a structural section (s = A, B, C, ...) and the length of time spent in that structural section (hereinafter the "residence time") u. The residence time u is expressed as the number of unit sections (the number of bars) counted from the beginning of the structural section. That is, a state (s, u) denotes the u-th unit section within the structural section designated by the section code s (in other words, that the music has stayed in that structural section for u unit sections from its start point).

In the music structure model MS of the first embodiment, a transition is possible from the last state of each structural section to the first state of another structural section. Specifically, when the residence time u is below a threshold D, the model transitions to the immediately following state within the current structural section (that is, the residence time u increases by one unit section (u = u + 1)); when the residence time u reaches the threshold D, the model transitions with transition probability τ to the first state (u = 1) of another structural section. The threshold D is set to a number of unit sections at which a structural-section transition is likely to occur. For example, in typical music, structural sections often change in units of 4 or 8 bars. In view of this tendency, setting the threshold D to, for example, 4 or 8 is preferable. The transition probability τ is the probability of transitioning from the last state of one structural section to the first state of another structural section, and is set for each combination of structural sections. A Dirichlet distribution is preferably adopted as the prior distribution of the transition probability τ.
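As a rough illustration of the transition rule just described, the following sketch enumerates the successors of a state (s, u). The function name `successors` and the representation of τ as a nested dictionary are assumptions. The text does not say whether any probability mass remains in the current section once u reaches D, so the sketch simply uses whatever destinations the caller supplies in `tau`.

```python
def successors(state, threshold_d, tau):
    """Return [(next_state, probability), ...] for a state (section code, residence time).

    tau -- assumed layout: tau[s][s2] is the probability of moving from the
           last state of section s to the first state of section s2
    """
    section, dwell = state
    if dwell < threshold_d:
        # Below the threshold D: advance one unit section inside the same section.
        return [((section, dwell + 1), 1.0)]
    # At the threshold D: jump to the first state (u = 1) of another section.
    return [((nxt, 1), prob) for nxt, prob in tau[section].items()]
```

For example, with D = 4, the state ("A", 2) has the single successor ("A", 3), while ("A", 4) branches to the first states of the sections listed in `tau["A"]`.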

The state of the k-th unit section among the plurality of unit sections of the target section is denoted (sk, uk). In the first embodiment, the probability model expressed by the following equation (1) is adopted as the timbre observation model MTk, which probabilistically represents the occurrence of the timbre feature quantity FT in the k-th unit section of the target music.

The symbol Normal(μ_{sk,η}^{(T)}, Λ_{sk,η}^{-1,(T)}) in equation (1) denotes the normal distribution defined by the mean μ_{sk,η}^{(T)} and the precision matrix (the inverse of the covariance matrix) Λ_{sk,η}^{-1,(T)}. A normal-Wishart distribution, for example, is preferably adopted as the prior distribution of the mean μ_{sk,η}^{(T)} and the precision matrix Λ_{sk,η}^{-1,(T)}. The symbol ω_η in equation (1) is the weight of the η-th normal distribution Normal(μ_{sk,η}^{(T)}, Λ_{sk,η}^{-1,(T)}). A GEM (Griffiths-Engen-McCloskey) distribution, for example, is preferably adopted as the prior distribution of the weight ω_η. As equation (1) shows, the first embodiment uses as the timbre observation model MTk an infinite mixture distribution, in which the number of mixed probability distributions is infinite (specifically, an infinite Gaussian mixture distribution, with the normal distribution as the component distribution). That is, the timbre observation model MTk is expressed as the weighted sum of infinitely many normal distributions Normal(μ_{sk,η}^{(T)}, Λ_{sk,η}^{-1,(T)}) with weights ω_η.
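Equation (1) is not reproduced in this text. Based only on the surrounding description (a weighted sum of infinitely many normal distributions with weights ω_η whose parameters depend on the section code but not on the residence time), it can plausibly be written as below. The density notation p(·), the symbol F_{T,k} for the timbre feature quantity FT of the k-th unit section, and the explicit conditioning on s_k (written sk in the text) are assumptions of this reconstruction.

```latex
p\bigl(F_{T,k} \mid s_k\bigr)
  = \sum_{\eta=1}^{\infty} \omega_{\eta}\,
    \mathrm{Normal}\bigl(F_{T,k} \,\big|\, \mu^{(T)}_{s_k,\eta},\ \Lambda^{-1,(T)}_{s_k,\eta}\bigr)
```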

The timbre of music tends to differ between different structural sections but to be broadly uniform within any one structural section (for example, the performance parts and the musical impression of a piece change from one structural section to the next). In view of this tendency, the timbre observation model MTk of the first embodiment is, as equation (1) also shows, a probability model that represents the generation process of the timbre feature quantity FT per structural section and does not depend on the residence time u within the structural section. That is, the timbre observation model MTk is shared by the plurality of states corresponding to any one structural section.

In the first embodiment, the probability model expressed by the following equation (2) is adopted as the chord observation model MCk, which probabilistically represents the occurrence of the chord feature quantity FC in the k-th unit section of the target music.

As equation (2) shows, the first embodiment uses as the chord observation model MCk the normal distribution Normal(μ_{sk,uk}^{(C)}, Λ_{sk,uk}^{-1,(C)}) defined by the mean μ_{sk,uk}^{(C)} and the precision matrix Λ_{sk,uk}^{-1,(C)}. As with the mean μ_{sk,η}^{(T)} and the precision matrix Λ_{sk,η}^{-1,(T)} applied to the timbre observation model MTk, a normal-Wishart distribution, for example, is preferably adopted as the prior distribution of the mean μ_{sk,uk}^{(C)} and the precision matrix Λ_{sk,uk}^{-1,(C)} of equation (2).
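Equation (2) is likewise not reproduced in this text. From the description (a single normal distribution whose parameters are indexed by both the section code and the residence time), a consistent reconstruction would be the following. The density notation p(·) and the symbol F_{C,k} for the chord feature quantity FC of the k-th unit section are assumptions; s_k and u_k correspond to sk and uk in the text.

```latex
p\bigl(F_{C,k} \mid s_k, u_k\bigr)
  = \mathrm{Normal}\bigl(F_{C,k} \,\big|\, \mu^{(C)}_{s_k,u_k},\ \Lambda^{-1,(C)}_{s_k,u_k}\bigr)
```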

As described above, the timbre of music tends to be broadly uniform within one structural section, whereas the chords of music tend not only to differ between different structural sections but also to change from one unit section (bar) to the next within a structural section. In view of this tendency, the chord observation model MCk of the first embodiment is, as equation (2) also shows, a probability model that represents the generation process of the chord feature quantity FC for each combination of structural section and unit section (for each state (sk, uk)), and depends on both the section code sk and the residence time uk.

The prior distributions of the probability models described above are not limited to the foregoing examples. For example, a configuration in which the prior distribution of the transition probability τ is a logistic normal distribution, or one in which a Pitman-Yor process is applied as the prior distribution of the weights ω_η of the timbre observation model MTk, may also be adopted. The total number of structural sections in the music structure model MS is also arbitrary; the model can be extended, for example, to a probability model that assumes a countably infinite number of states, such as a nonparametric Bayesian HMM (hidden Markov model).
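For orientation, the GEM prior that the first embodiment places on the weights ω_η is commonly realized by stick-breaking: each weight takes a Beta(1, α) fraction of whatever probability mass is still unassigned. The sketch below draws a truncated version. The function name `gem_weights`, the concentration parameter `alpha`, and the truncation length `n_sticks` are assumptions, since the text specifies none of them.

```python
import random


def gem_weights(alpha, n_sticks, rng):
    """Truncated stick-breaking draw from GEM(alpha).

    Returns (weights, remaining), where `remaining` is the mass left for the
    infinitely many components beyond the truncation point.
    """
    weights = []
    remaining = 1.0
    for _ in range(n_sticks):
        fraction = rng.betavariate(1.0, alpha)  # share of the remaining stick
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    return weights, remaining
```

Smaller values of `alpha` concentrate the mass on the first few components, and larger values spread it over more components.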

For each of the probability models described above (the music structure model MS, the timbre observation model MTk, and the chord observation model MCk), the estimation processing unit 24 of the first embodiment estimates the posterior distribution given the observed time series of the timbre feature quantity FT and the chord feature quantity FC extracted by the feature extraction unit 22, using an iterative estimation algorithm such as the variational Bayes method, and identifies the structure estimation sequence Y that maximizes the posterior distribution. The estimation process described above is repeated N times using the time series of the timbre feature quantity FT and the chord feature quantity FC extracted by the feature extraction unit 22.

The initial values drawn (sampled) from each prior distribution in the estimation process differ from one run of the estimation process to the next. For example, the mean μ_{sk,η}^{(T)} and the precision matrix Λ_{sk,η}^{-1,(T)} of the timbre observation model MTk, drawn with a normal-Wishart distribution as the prior, and the mean μ_{sk,uk}^{(C)} and the precision matrix Λ_{sk,uk}^{-1,(C)} of the chord observation model MCk, likewise drawn with a normal-Wishart distribution as the prior, can differ in each run of the estimation process (that is, in each trial). Consequently, when the N repetitions of the estimation process are complete, N mutually different structure estimation sequences Y have been generated.

As described above, in the first embodiment the estimation process estimates the posterior distribution of each of the timbre observation model MTk, defined per structural section, and the chord observation model MCk, defined per unit section within each structural section. That is, the estimation process uses a timbre observation model MTk that reflects the tendency for timbre to be broadly uniform within one structural section of the target music, and a chord observation model MCk that reflects the tendency for chords to change successively from one unit section to the next within a structural section. Compared with a configuration that does not take into account the difference between how timbre changes and how chords change within a piece, the structural sections of the target music can therefore be identified with high accuracy. In addition, because the first embodiment uses the infinite mixture distribution of equation (1) as the timbre observation model MTk, the number of mixture components Normal(μ_{sk,η}^{(T)}, Λ_{sk,η}^{-1,(T)}) varies according to the timbre characteristics of the acoustic signal X (specifically, the timbral complexity of the target music). This has the advantage that the posterior distribution of the timbre observation model MTk can be estimated appropriately for the characteristics of the acoustic signal X.

In the first embodiment, the state transition model applied to the estimation process as the music structure model MS contains, for each of a plurality of structural sections, a series of states corresponding to unit sections, and permits transitions from the last state of each structural section to the first state of another structural section. The music structure model MS thus appropriately represents the tendency for structural sections, each made up of a plurality of unit sections, to follow one another in the target music. The structural sections of the target music can therefore be estimated with high accuracy.

<Structure analysis unit 26>
The structure analysis unit 26 in FIG. 1 identifies the plurality of structural sections of the target music from the result of the estimation process by the estimation processing unit 24. FIG. 8 is an explanatory diagram of the process by which the structure analysis unit 26 identifies the structural sections. In the first embodiment, N runs of the estimation process with different initial values for the variables of each probability model (the music structure model MS, the timbre observation model MTk, and the chord observation model MCk) generate, as illustrated in FIG. 8, N structure estimation sequences Y (N = 5 in FIG. 8) whose arrangements of section codes s differ. The structure analysis unit 26 identifies the structural sections of the music from the N structure estimation sequences Y generated by the estimation processing unit 24.

In FIG. 8, the points in time at which the section code s changes in each structure estimation sequence Y (hereinafter "candidate boundaries") are shown by broken lines. Each candidate boundary is a point that is a candidate for a boundary between structural sections of the target music. As FIG. 8 shows, the positions of the candidate boundaries can differ among the N structure estimation sequences Y because of the different initial values of the probability models, but candidate boundaries tend to coincide across many of the structure estimation sequences Y. The more structure estimation sequences Y share a candidate boundary, the more likely that candidate boundary is an actual boundary between structural sections of the target music. In view of this tendency, the structure analysis unit 26 of the first embodiment, as illustrated in FIG. 8, selects as estimated boundaries the candidate boundaries that are common to more than a predetermined threshold NTH of the structure estimation sequences Y, and generates a structure estimation sequence Z that represents the music structure of the target music with each estimated boundary as a boundary between structural sections. An estimated boundary is thus a candidate boundary that is highly likely to be an actual boundary between structural sections of the target music. The threshold NTH is set to a predetermined positive number smaller than the total number N of structure estimation sequences Y (for example, N/2).
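The boundary selection described above amounts to a vote across the N structure estimation sequences Y. A minimal sketch follows, in which boundaries are represented as the index of the first unit section after a change of section code. The function names and this index convention are assumptions.

```python
from collections import Counter


def candidate_boundaries(labels):
    """Indices k at which the section code differs from that of unit section k - 1."""
    return {k for k in range(1, len(labels)) if labels[k] != labels[k - 1]}


def estimated_boundaries(sequences, n_th):
    """Candidate boundaries shared by more than n_th of the sequences, in time order."""
    votes = Counter()
    for labels in sequences:
        votes.update(candidate_boundaries(labels))
    return sorted(k for k, count in votes.items() if count > n_th)
```

With N = 5 sequences and `n_th = N / 2`, a boundary is kept only if at least three of the five runs agree on it.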

Like the structure estimation sequence Y, the structure estimation sequence Z is a time series of section codes (labels) sZ, one for each unit section of the target music. As illustrated in FIG. 8, a section code sZ is set for each unit section within each structural section delimited by the estimated boundaries. Specifically, for the unit sections within one structural section delimited by estimated boundaries, the structure analysis unit 26 computes a representative value (for example, the median) of the section codes s across the N structure estimation sequences Y, and sets a common section code sZ for the unit sections of those structural sections of the structure estimation sequence Z that share the same representative value. Structural sections that share a representative value of the section code s are presumed to be sections whose timbre and chord progressions are similar or common (that is, sections with the same musical role, such as "A-melody" (verse) or "chorus").
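The labelling step can be sketched as below, using the median as the representative value. The sketch assumes that section codes are integers that mean the same thing in every run, which the text does not state. `consensus_labels` is a hypothetical name, and `median_low` is chosen so that the result is always one of the observed codes.

```python
from statistics import median_low


def consensus_labels(sequences, boundaries):
    """Build Z: one representative section code per segment between estimated boundaries.

    sequences  -- N label sequences of equal length (integer section codes, assumed)
    boundaries -- estimated boundaries as indices of the first unit section of a segment
    """
    n_units = len(sequences[0])
    edges = [0] + sorted(boundaries) + [n_units]
    result = []
    for start, end in zip(edges, edges[1:]):
        # Pool the codes of every run over all unit sections of this segment.
        pooled = [seq[k] for seq in sequences for k in range(start, end)]
        result.extend([median_low(pooled)] * (end - start))
    return result
```

Segments that receive the same representative value automatically end up with the same code in the output, which matches the rule that such structural sections share a section code sZ.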

The structure analysis unit 26 of the first embodiment causes the display device 16 to display the structure estimation sequence Z generated by the above processing. Specifically, as illustrated in FIG. 9, the structure analysis unit 26 displays on the display device 16 an estimation result image 42 representing the structure estimation sequence Z, superimposed on the signal waveform 44 of the acoustic signal X. The estimation result image 42 is a rectangular figure extending along the time axis and is divided into a plurality of regions 46 at the positions of the estimated boundaries specified by the structure estimation sequence Z. Each region 46 of the estimation result image 42 thus represents a structural section of the target music. Among the regions 46, those corresponding to different section codes sZ in the structure estimation sequence Z are displayed in different display modes (color or gradation), and those with the same section code sZ are displayed in the same display mode. By viewing the estimation result image 42, the user can therefore grasp the temporal relationship between each structural section of the target music and the acoustic signal X.

As described above, in the first embodiment the structural sections of the target music (the structure estimation sequence Z) are identified from N structure estimation sequences Y generated by estimation processes that use different initial values. This has the advantage that the music structure of the target music can be estimated with high accuracy and robustly against variation in the initial values used in each estimation process.

FIG. 10 is a flowchart of the overall operation of the music analysis device 100 in the first embodiment. When the process of FIG. 10 is started, for example in response to an instruction from the user, the feature extraction unit 22 executes the feature extraction process SA1, described above with reference to FIGS. 2 and 4, on the acoustic signal X, thereby extracting the timbre feature quantity FT and the chord feature quantity FC for each unit section. The estimation processing unit 24 estimates a structure estimation sequence Y representing the music structure of the target music by an estimation process that uses the timbre feature quantity FT and the chord feature quantity FC extracted for each unit section by the feature extraction unit 22 (SA2). The estimation process by the estimation processing unit 24 is repeated until it has been run N times (SA3: NO). When the repetition has produced N structure estimation sequences Y (SA3: YES), the structure analysis unit 26 compares the N structure estimation sequences Y with one another to identify a plurality of estimated boundaries on the time axis, and generates a structure estimation sequence Z that represents the music structure of the target music with the estimated boundaries as the boundaries of the structural sections (SA4). The structure analysis unit 26 then causes the display device 16 to display the estimation result image 42 representing the structure estimation sequence Z (SA5).

<Second embodiment>
A second embodiment of the present invention will now be described. For elements in the aspects illustrated below whose operation and function are the same as in the first embodiment, the reference signs used in the description of the first embodiment are reused and detailed description is omitted as appropriate.

FIG. 11 is an explanatory diagram of the operation of the structure analysis unit 26 in the second embodiment. FIG. 11 illustrates structure estimation sequences Z identified by the structure analysis unit 26. In FIG. 11, a plurality of unit sections with a common section code sZ are drawn as one block. That is, one block in FIG. 11 corresponds to one structural section of the target music.

As FIG. 11 shows, a particular arrangement of section codes sZ may be observed multiple times in the structure estimation sequence Z. In the first row of FIG. 11, for example, the arrangement "A-B-A-B" and the arrangement "C-D-C-D" are each repeated multiple times in the target music. The structure analysis unit 26 of the second embodiment merges the plurality of structural sections corresponding to an arrangement of section codes sZ that is repeated multiple times in the target music into a single structural section. Specifically, in the structure estimation sequence Z of FIG. 11, the four structural sections corresponding to the arrangement "A-B-A-B" are merged into one structural section whose section code sZ is set to "G", and the four structural sections corresponding to the arrangement "C-D-C-D" are merged into one structural section whose section code sZ is set to "H". By carrying out this merging cumulatively, the structure analysis unit 26 of the second embodiment generates a plurality of hierarchical structure estimation sequences Z that differ in the length and total number of structural sections. Statistical processing such as a 2-gram model is preferably used for merging the structural sections. The structure analysis unit 26 selects one of the structure estimation sequences Z generated by the above procedure as the final analysis result and causes the display device 16 to display an estimation result image 42 representing that structure estimation sequence Z. It is also possible to display the estimation result images 42 corresponding to each of the structure estimation sequences Z side by side on the display device 16 and let the user select the desired estimation result image 42. The second embodiment achieves the same effects as the first embodiment.
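The text names 2-gram statistics as one way to merge repeated arrangements but gives no procedure. One possible reading, sketched here purely as an assumption, is to replace the most frequent adjacent pair of section codes with a new code and to repeat that step to build the hierarchy. The input is the section-level sequence (one code per structural section, not per unit section), and `merge_most_frequent_pair` is a hypothetical name. Note that this differs from FIG. 11, where the four sections "A-B-A-B" are merged into "G" in a single step.

```python
from collections import Counter


def merge_most_frequent_pair(sections, new_code):
    """Replace each non-overlapping occurrence of the most frequent bigram with new_code."""
    pairs = Counter(zip(sections, sections[1:]))
    if not pairs:
        return list(sections)
    (first, second), count = pairs.most_common(1)[0]
    if count < 2:
        return list(sections)  # nothing is repeated, so nothing is merged
    merged, i = [], 0
    while i < len(sections):
        if i + 1 < len(sections) and sections[i] == first and sections[i + 1] == second:
            merged.append(new_code)
            i += 2
        else:
            merged.append(sections[i])
            i += 1
    return merged
```

Applying the step once to A-B-A-B-C-D-C-D with the new code "G" gives G-G-C-D-C-D, and applying it again with "H" gives G-G-H-H. Each intermediate result is one level of a coarser structure estimation sequence.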

<Modifications>
The embodiments illustrated above can be modified in various ways. Specific modifications are exemplified below. Two or more modes arbitrarily selected from the following examples may be combined as appropriate, provided that they do not contradict each other.

(1) The dimension compression (SA14) applied to the basic timbre feature amount FT1 and the basic chord feature amount FC0 may be omitted. That is, a configuration in which the basic timbre feature amount FT1 of the first embodiment is used in the estimation processing as the timbre feature amount FT, or a configuration in which the basic chord feature amount FC0 is used in the estimation processing as the chord feature amount FC, may be adopted. It is also possible to omit the noise suppression processing SA13 and use the basic timbre feature amount FT0 as the timbre feature amount FT in the estimation processing.

(2) Each of the above embodiments exemplified a music structure model MS that transitions to another structural section with transition probability τ when the dwell time u reaches the threshold D. However, a hidden semi-Markov model (HSMM) that specifies a probability π of the duration of a structural section may also be used as the music structure model MS. For example, the probability π takes a large value at durations at which a transition between structural sections is likely to occur (for example, 4 bars or 8 bars). A Dirichlet distribution is suitable as the prior distribution of the probability π.
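As an illustration of why a Dirichlet prior is convenient for the duration probability π, the following sketch uses the standard conjugacy between the Dirichlet and categorical distributions: the posterior mean of π is the normalized sum of the prior pseudo-counts and the observed duration counts. The function name, the range of 1 to 8 bars, the pseudo-count values, and the observed durations are all hypothetical and are not taken from the embodiments.

```python
def posterior_duration_probs(prior_alpha, observed_durations):
    """Posterior mean of the duration probability pi under a Dirichlet prior.

    prior_alpha maps each candidate duration (in bars) to a Dirichlet
    pseudo-count; observed_durations lists the durations of observed
    structural sections.
    """
    counts = {d: 0 for d in prior_alpha}
    for d in observed_durations:
        if d not in counts:
            raise ValueError(f"duration {d} is outside the modelled range")
        counts[d] += 1
    total = sum(prior_alpha.values()) + len(observed_durations)
    return {d: (prior_alpha[d] + counts[d]) / total for d in prior_alpha}


if __name__ == "__main__":
    # Hypothetical prior: durations of 1 to 8 bars, with larger pseudo-counts
    # at 4 and 8 bars, where section transitions are assumed more likely.
    prior = {d: 1.0 for d in range(1, 9)}
    prior[4] = 4.0
    prior[8] = 4.0
    pi = posterior_duration_probs(prior, [4, 4, 8, 2])
    for d in sorted(pi):
        print(d, round(pi[d], 3))
```

Larger pseudo-counts at 4 and 8 bars encode the expectation stated above that structural sections tend to end at those lengths, while the observed durations can still shift the estimate.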

(3) Each of the above embodiments exemplified a music analysis device 100 having:
Configuration 1: a configuration that uses, for the estimation processing, a timbre observation model MTk for each structural section and a chord observation model MCk for each unit section within each structural section;
Configuration 2: a configuration that takes each bar of the target music as a unit section and extracts feature amounts for each unit section;
Configuration 3: a configuration that generates the timbre feature amount FT by noise suppression processing SA13 including first processing SA131 and second processing SA132;
Configuration 4: a configuration that uses an infinite mixture distribution as the timbre observation model MTk;
Configuration 5: a configuration that uses, for the estimation processing, a state transition model (music structure model MS) that includes, for each of a plurality of structural sections, a series of states (s, u) corresponding to the respective unit sections within the structural section, and that allows a transition from the last state of each structural section to the first state of another structural section; and
Configuration 6: a configuration that generates the structure estimation sequence Z from N structure estimation sequences Y generated by estimation processing with different initial values.
However, each of Configurations 1 to 6 can be established independently of the others. That is, none of Configurations 1 to 6 requires any of the other configurations. For example, in a configuration in which Configuration 6 is omitted, the estimation processing unit 24 executes the estimation processing only once, and the structure estimation sequence Y generated by that estimation processing is adopted as the structure estimation sequence Z.
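As an illustration of Configuration 3, the following sketch follows the wording of the claims: the first processing takes, for each element, the minimum of the basic timbre feature amount between a unit section and each of its peripheral unit sections, and the second processing takes, for each element, the maximum of those minima over the peripheral unit sections. The function name, the `width` parameter (number of peripheral sections on each side), and the handling of the first and last sections are assumptions introduced for this example.

```python
def suppress_noise(features, width=2):
    """Element-wise min-then-max filtering over neighbouring unit sections.

    features is a list of equal-length vectors, one per unit section.
    width is the assumed number of peripheral sections on each side.
    """
    n_sections = len(features)
    result = []
    for n, current in enumerate(features):
        lo = max(0, n - width)
        hi = min(n_sections, n + width + 1)
        neighbours = [features[m] for m in range(lo, hi) if m != n]
        if not neighbours:
            # Assumption: a section with no neighbours is left unchanged.
            result.append(list(current))
            continue
        # First processing: element-wise minimum with each peripheral section.
        minima = [[min(c, v) for c, v in zip(current, nb)] for nb in neighbours]
        # Second processing: element-wise maximum of those minima.
        result.append([max(column) for column in zip(*minima)])
    return result


if __name__ == "__main__":
    # Hypothetical two-element features; the value 9.0 is an isolated spike.
    basic = [[1.0, 5.0], [1.0, 0.0], [9.0, 5.0], [1.0, 5.0]]
    print(suppress_noise(basic, width=1))
```

An element that is large in only one unit section is pulled down to the level of its neighbours, whereas a value shared with at least one neighbouring section survives.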

(4) The music analysis device 100 may also be realized by a server device that communicates with a terminal device (for example, a mobile phone or a smartphone) via a communication network such as a mobile communication network or the Internet. Specifically, the music analysis device 100 generates the structure estimation sequence Z by analyzing the acoustic signal X received from the terminal device via the communication network, and transmits the structure estimation sequence Z to the terminal device.

(5) As illustrated in each of the above embodiments, the music analysis device 100 is realized by cooperation between the arithmetic processing device 12 and a program. A program according to a preferred aspect of the present invention causes a computer to function as: a feature extraction unit 22 that extracts, for each unit section of an acoustic signal X representing the sound of the target music, a timbre feature amount FT representing a feature of the timbre of the sound and a chord feature amount FC representing a feature of the chords of the sound; an estimation processing unit 24 that estimates a posterior distribution, by estimation processing to which the timbre feature amount FT and the chord feature amount FC extracted by the feature extraction unit 22 are applied, for each of a timbre observation model MTk that probabilistically represents the generation process of the timbre feature amount FT for each structural section, a structural section being a division of the musical structure within the target music and including at least one unit section, and a chord observation model MCk that probabilistically represents the generation process of the chord feature amount FC for each unit section within each structural section; and a structure analysis unit 26 that identifies a plurality of structural sections of the target music from the result of the estimation processing. The program exemplified above may be provided in a form stored on a computer-readable recording medium and installed on a computer. The recording medium is, for example, a non-transitory recording medium, a good example being an optical recording medium (optical disc) such as a CD-ROM, but it may be a recording medium of any known type, such as a semiconductor recording medium or a magnetic recording medium. The program may also be delivered to a computer by distribution via a communication network.

(6) The present invention is also specified as a method of operating the music analysis device 100 according to each of the above embodiments (a music analysis method). For example, in a music analysis method according to one aspect of the present invention, a computer system that realizes the music analysis device 100 (a single computer or a system composed of a plurality of computers) extracts, for each unit section of an acoustic signal X representing the sound of the target music, a timbre feature amount FT representing a feature of the timbre of the sound and a chord feature amount FC representing a feature of the chords of the sound (feature extraction processing SA1); estimates a posterior distribution, by estimation processing to which the timbre feature amount FT and the chord feature amount FC are applied, for each of a timbre observation model MTk that probabilistically represents the generation process of the timbre feature amount FT for each structural section, a structural section being a division of the musical structure within the target music and including at least one unit section, and a chord observation model MCk that probabilistically represents the generation process of the chord feature amount FC for each unit section within each structural section (estimation processing SA2); and identifies a plurality of structural sections of the target music from the result of the estimation processing (music structure analysis SA4).

DESCRIPTION OF SYMBOLS 100 ... Music analysis device, 12 ... Arithmetic processing device, 14 ... Storage device, 16 ... Display device, 22 ... Feature extraction unit, 24 ... Estimation processing unit, 26 ... Structure analysis unit.

Claims

A music analysis device comprising: a feature extraction unit that extracts, for each unit section of an acoustic signal representing the sound of a piece of music, a timbre feature amount representing a feature of the timbre of the sound and a chord feature amount representing a feature of the chords of the sound;
an estimation processing unit that estimates a posterior distribution, by estimation processing to which the timbre feature amount and the chord feature amount extracted by the feature extraction unit are applied, for each of a timbre observation model that probabilistically represents the generation process of the timbre feature amount for each structural section, a structural section being a division of the musical structure within the music and including at least one unit section, and a chord observation model that probabilistically represents the generation process of the chord feature amount for each unit section within each structural section; and
a structure analysis unit that identifies a plurality of structural sections of the music from the result of the estimation processing.

The music analysis device according to claim 1, wherein the feature extraction unit extracts, for each unit section, a basic timbre feature amount including a plurality of elements corresponding to the timbre of the sound from the acoustic signal, and generates the timbre feature amount by executing, for each of the plurality of unit sections, a first process of selecting, for each element, the minimum value of the basic timbre feature amount between that unit section and each of a plurality of peripheral unit sections before and after that unit section, and a second process of selecting, for each element, the maximum of those minimum values over the plurality of peripheral unit sections.

The music analysis device according to claim 1, wherein the estimation processing unit executes the estimation processing using, as the timbre observation model, an infinite mixture distribution in which the number of mixed probability distributions is infinite.

The music analysis device according to any one of claims 1 to 3, wherein the estimation processing unit uses, for the estimation processing, a state transition model that includes, for each of a plurality of structural sections, a series of a plurality of states corresponding to the respective unit sections within the structural section, and in which a transition is possible from the last state of each structural section to the first state of another structural section.

The music analysis device according to any one of claims 1 to 4, wherein the estimation processing unit repeats the estimation processing a plurality of times with different initial values, thereby specifying, for each execution of the estimation processing, a structure estimation sequence in which an identification code indicating a structural section is arranged for each unit section of the music, and
the structure analysis unit identifies the structural sections of the music from the plurality of structure estimation sequences specified by the plurality of executions of the estimation processing.

A music analysis method in which a computer system:
extracts, for each unit section of an acoustic signal representing the sound of a piece of music, a timbre feature amount representing a feature of the timbre of the sound and a chord feature amount representing a feature of the chords of the sound;
estimates a posterior distribution, by estimation processing to which the extracted timbre feature amount and chord feature amount are applied, for each of a timbre observation model that probabilistically represents the generation process of the timbre feature amount for each structural section, a structural section being a division of the musical structure within the music and including at least one unit section, and a chord observation model that probabilistically represents the generation process of the chord feature amount for each unit section within each structural section; and
identifies a plurality of structural sections of the music from the result of the estimation processing.