JP2018141899A

JP2018141899A - Musical instrument sound recognition apparatus and musical instrument sound recognition program

Info

Publication number: JP2018141899A
Application number: JP2017036746A
Authority: JP
Inventors: Hosei Matsuoka (松岡 保静)
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2017-02-28
Filing date: 2017-02-28
Publication date: 2018-09-13
Anticipated expiration: 2037-02-28
Also published as: JP6812273B2

Abstract

PROBLEM TO BE SOLVED: To improve the recognition accuracy of a plurality of pitches. SOLUTION: A musical instrument sound recognition apparatus 10 according to the present invention recognizes musical instrument sound, that is, the sound of a musical instrument. It comprises: a frequency analysis unit 13 that performs frequency analysis on recorded data of the instrument sound at the frequency of each pitch included in the 12-tone scale; an F0 estimation unit 14 that processes the recorded data with a low-pass filter that reduces frequency components higher than a predetermined frequency and estimates the pitch that is the fundamental frequency of the resulting reduced signal; a multiple sound analysis unit 15 that identifies a plurality of pitches in the recorded data based on the frequency analysis result for each pitch and the estimation result of the fundamental-frequency pitch; and an output unit 16 that outputs an identification result indicating the plurality of pitches identified by the multiple sound analysis unit 15. SELECTED DRAWING: Figure 1

Description

The present invention relates to a musical instrument sound recognition apparatus and a musical instrument sound recognition program.

In recent years, advances in machine learning have driven remarkable progress in musical sound recognition technology in the field of music and elsewhere. In pitch recognition of singing voices, performance sounds, and the like, a single tone can be recognized easily by techniques that extract the fundamental frequency (F0), such as the autocorrelation method. On the other hand, for a performance sound in which a plurality of pitches are mixed, such as one played on an instrument capable of producing multiple simultaneous tones (for example, a piano or a guitar), the F0 estimation described above may recognize only some of the pitches or may recognize a wrong pitch.

Conventionally, frequency analysis such as the Fourier transform has been used to recognize performance sounds in which a plurality of pitches are mixed. For example, Patent Document 1 describes a technique that picks up musical sound from a piano, converts it into an acoustic signal (recorded data), and recognizes chords by applying a fast Fourier transform to the acoustic signal.

Patent Document 1: JP 2004-163767 A

Frequency analysis by Fourier transform analyzes the signal at orthogonal frequencies, so a certain time frame length must be secured to raise the frequency resolution. In the low register in particular, the pitch changes with only a small change in frequency, so the Fourier transform needs a frequency resolution finer than the frequency difference between adjacent pitches, and a sufficiently long time frame is required. For this reason, with Fourier-transform-based frequency analysis it can be difficult to analyze a fast melody and recognize a plurality of pitches. In addition, for the pitches that make up a chord, for example, inversions must be taken into account, yet conventional frequency analysis alone makes it difficult to recognize each pitch with inversions considered. That is, with conventional frequency analysis alone, it is difficult to distinguish so-called inversions, in which the combination of pitches is the same but the bass (the lowest note) differs. Thus, the conventional techniques cannot adequately recognize a plurality of pitches, because, among other reasons, they cannot recognize the pitches of a fast melody and cannot recognize each pitch with inversions taken into account.
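As a rough illustration of the frame-length constraint described above, the following Python sketch estimates how long a frame a Fourier transform would need before its bin spacing becomes finer than the gap between two adjacent low pitches. The 55 Hz pitch and the 16000 Hz sampling frequency are example values used elsewhere in this description. The equal-temperament semitone ratio 2^(1/12) and the function name are assumptions made for the illustration.

```python
import math

# Equal-temperament semitone ratio (assumption for this illustration).
SEMITONE = 2 ** (1 / 12)


def min_fft_frame(sample_rate_hz, lowest_pitch_hz):
    """Return (samples, seconds) of the shortest frame whose FFT bin spacing
    (sample_rate / samples) is finer than the gap between lowest_pitch_hz
    and the pitch one semitone above it."""
    gap_hz = lowest_pitch_hz * (SEMITONE - 1)       # spacing of adjacent pitches
    samples = math.floor(sample_rate_hz / gap_hz) + 1
    return samples, samples / sample_rate_hz


samples, seconds = min_fft_frame(16000, 55.0)
print(samples, round(seconds, 3))
```

Under these assumptions the required frame is on the order of 0.3 seconds, which is long compared with the note durations of a fast melody.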

The present invention has been made in view of the above circumstances, and its object is to improve the recognition accuracy of a plurality of pitches.

A musical instrument sound recognition apparatus according to one aspect of the present invention recognizes musical instrument sound, which is the sound of a musical instrument, and comprises: an analysis unit that performs frequency analysis on recorded data of the instrument sound at the frequency of each pitch included in a predetermined scale; an estimation unit that processes the recorded data with a low-pass filter that reduces frequency components higher than a predetermined frequency, and estimates the pitch that is the fundamental frequency of the reduced signal, which is the signal after that processing; an identification unit that identifies a plurality of pitches in the recorded data based on the frequency analysis result for each pitch from the analysis unit and the estimation result of the fundamental-frequency pitch from the estimation unit; and an output unit that outputs an identification result indicating the plurality of pitches identified by the identification unit.

A musical instrument sound recognition program according to one aspect of the present invention causes a computer to function as: an analysis unit that performs frequency analysis on recorded data of musical instrument sound, which is the sound of a musical instrument, at the frequency of each pitch included in a predetermined scale; an estimation unit that processes the recorded data with a low-pass filter that reduces frequency components higher than a predetermined frequency, and estimates the pitch that is the fundamental frequency of the reduced signal, which is the signal after that processing; an identification unit that identifies a plurality of pitches in the recorded data based on the frequency analysis result for each pitch from the analysis unit and the estimation result of the fundamental-frequency pitch from the estimation unit; and an output unit that outputs an identification result indicating the plurality of pitches identified by the identification unit.

In the musical instrument sound recognition apparatus and the musical instrument sound recognition program according to the present invention, frequency analysis is performed on recorded data of the instrument sound at the frequency corresponding to each pitch included in a predetermined scale, and a plurality of pitches are identified. When frequency analysis is performed by Fourier transform as in the conventional art, a certain time frame length must be secured to raise the frequency resolution. In contrast, the apparatus according to the present invention performs frequency analysis at the frequency corresponding to each pitch, so there is no constraint on the time frame length and pitches can be analyzed with an arbitrary frame length. As a result, a fast melody, for example, can be analyzed over short time spans and can therefore be identified appropriately.

Further, in the apparatus according to the present invention, the recorded data is processed by a low-pass filter, the pitch that is the fundamental frequency of the reduced signal after processing is estimated, and the plurality of pitches are identified with the result of that estimation taken into account. Taking the estimated fundamental-frequency pitch into account fixes the lowest of the pitches in the recorded data, so the plurality of pitches can be recognized while distinguishing so-called inversions, in which the combination of pitches is the same but the bass (the lowest note) differs. If the fundamental frequency were simply estimated from recorded data in which a plurality of pitches are mixed, problems could arise, such as only some of the pitches being recognized. In the apparatus according to the present invention, the fundamental-frequency pitch is estimated from the reduced signal, in which frequency components higher than a predetermined frequency have been reduced by the low-pass filter. The fundamental-frequency pitch can therefore be estimated with the pitches other than the bass attenuated, that is, with the mixing of multiple pitches suppressed, which improves the accuracy of the fundamental frequency estimate. Inversions can thereby be distinguished and the plurality of pitches recognized appropriately. As described above, the apparatus according to the present invention can recognize the pitches of a fast melody and can recognize each pitch with inversions taken into account, thereby improving the recognition accuracy of a plurality of pitches.

According to the present invention, the recognition accuracy of a plurality of pitches can be improved.

A block diagram showing the functional configuration of the musical instrument sound recognition apparatus.
A diagram showing the hardware configuration of the musical instrument sound recognition apparatus shown in FIG. 1.
A diagram for explaining the frequency analysis by the frequency analysis unit shown in FIG. 1, showing the amplitude intensity of each pitch included in the 12-tone scale.
A diagram for explaining the frequency analysis by the frequency analysis unit shown in FIG. 1, illustrating the derivation of the scale vector for each pitch included in the 12-tone scale.
A diagram showing a configuration example of the neural network provided in the multiple sound analysis unit shown in FIG. 1.
A flowchart showing the series of processes of the musical instrument sound recognition method performed by the musical instrument sound recognition apparatus shown in FIG. 1.
A diagram for explaining conventional frequency analysis.
A block diagram showing the module configuration of the musical instrument sound recognition program.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description of the drawings, the same or equivalent elements are denoted by the same reference numerals, and redundant descriptions are omitted.

FIG. 1 is a block diagram showing the functional configuration of the musical instrument sound recognition apparatus. The musical instrument sound recognition apparatus 10 shown in FIG. 1 recognizes instrument sound, that is, sound played on a musical instrument, and recognizes a plurality of pitches included in that sound. In the present embodiment, the apparatus 10 is described as recognizing a chord as one example of a plurality of pitches included in the instrument sound, but the plurality of pitches may also be a combination of pitches that does not form a chord. The instrument sound is a performance sound played on an instrument capable of producing multiple simultaneous tones, such as a guitar or a piano. A chord is a sound in which a plurality of tones of different pitches (frequencies) are combined, and is a specific combination of pitches. In the Japanese text, a chord (和音) is also referred to by the loanword コード.

Functionally, the musical instrument sound recognition apparatus 10 includes a recording unit 11, a recorded data storage unit 12, a frequency analysis unit 13 (analysis unit), an F0 estimation unit 14 (estimation unit), a multiple sound analysis unit 15 (identification unit), and an output unit 16. The apparatus 10 is implemented by, for example, the hardware shown in FIG. 2.

FIG. 2 is a diagram showing the hardware configuration of the musical instrument sound recognition apparatus 10. As shown in FIG. 2, the apparatus 10 is physically configured as a computer system that includes one or more processors 1001, a memory 1002 serving as the main storage device, a storage 1003 such as a hard disk or semiconductor memory, a communication device 1004 that is a data transmission and reception device such as a network card, an input device 1005, and an output device 1006 such as a display. Each function shown in FIG. 1 is realized by loading predetermined computer software onto hardware such as the memory 1002 shown in FIG. 2, operating the input device 1005, the output device 1006, and the communication device 1004 under the control of the processor 1001, and reading and writing data in the memory 1002 and the storage 1003.

Referring again to FIG. 1, each function of the musical instrument sound recognition apparatus 10 will be described in detail.

The recording unit 11 records the instrument sound in predetermined time units and acquires it as recorded data. Specifically, the recording unit 11 samples the instrument sound at a predetermined interval and sequentially records the sampled sound as recorded data. The sampling frequency is, for example, 16000 Hz or 44100 Hz. Each sampled value of the instrument sound is called a sample (acoustic signal), and a predetermined number n (n being an integer of 1 or more) of samples arranged in time series is collectively called recorded data (a frame). Each sample is the amplitude value (volume) of the instrument sound at the time the sample was acquired and is represented by, for example, 16 bits.

The recording unit 11 arranges the samples in time series (in the order in which they were sampled) and groups every predetermined number of samples into one item of recorded data. The number n of samples in one item of recorded data is, for example, 256. When the sampling frequency is 16000 Hz, one item of recorded data corresponds to about 0.016 seconds of instrument sound. The recording unit 11 keeps sampling the instrument sound repeatedly and keeps acquiring recorded data, and sequentially outputs each item of recorded data to the recorded data storage unit 12.

The recorded data storage unit 12 sequentially receives recorded data from the recording unit 11 and stores it. The recorded data storage unit 12 is configured as, for example, a FIFO (First In First Out) buffer. In this case, once the storage unit holds as many items of recorded data as it can store, it discards the oldest (first stored) item and stores the new one. In other words, the recorded data storage unit 12 temporarily stores (buffers) a plurality of items of recorded data. The recorded data storage unit 12 outputs the stored recorded data to the frequency analysis unit 13 and the F0 estimation unit 14.
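The framing and buffering described above could be sketched in Python as follows. This is only an illustrative sketch: the class and function names and the buffer capacity of four frames are hypothetical, while the frame size of 256 samples and the 16000 Hz sampling frequency are the example values given in the description.

```python
from collections import deque

FRAME_SIZE = 256     # samples per frame (example value from the description)
SAMPLE_RATE = 16000  # Hz (example value from the description)
MAX_FRAMES = 4       # hypothetical capacity; the description gives no figure


class FrameBuffer:
    """FIFO store of fixed-size frames; the oldest frame is dropped when full."""

    def __init__(self, max_frames=MAX_FRAMES):
        self._frames = deque(maxlen=max_frames)

    def push(self, frame):
        if len(frame) != FRAME_SIZE:
            raise ValueError("frame must contain FRAME_SIZE samples")
        # A deque with maxlen discards its oldest entry automatically.
        self._frames.append(tuple(frame))

    def frames(self):
        """Return the buffered frames, oldest first."""
        return list(self._frames)


def split_into_frames(samples):
    """Group a sample stream into consecutive frames, ignoring a trailing
    partial frame."""
    usable = len(samples) - len(samples) % FRAME_SIZE
    return [samples[i:i + FRAME_SIZE] for i in range(0, usable, FRAME_SIZE)]


frame_seconds = FRAME_SIZE / SAMPLE_RATE  # 256 / 16000 = 0.016 s per frame
```

A bounded `deque` is used here because it gives the discard-oldest behaviour directly. A ring buffer over a preallocated array would be an equally valid choice.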

The frequency analysis unit 13 performs frequency analysis on the recorded data at the frequency of each pitch included in a predetermined scale. The predetermined scale is, for example, the 12-tone scale, and the present embodiment is described on that basis. In the 12-tone scale, one octave contains twelve pitches, for example, in order from the lowest, "A", "B♭", "B", "C", "C♯", "D", "E♭", "E", "F", "F♯", "G", and "A♭", each with its own frequency. Multiplying the frequency of a pitch by about 1.0592 gives the frequency of the next higher pitch. For example, if the frequency of the lowest "A" is 55.00 Hz, the frequency of the next pitch up ("B♭") is 55.00 × 1.0592 ≈ 58.25 Hz. Doubling the frequency of a pitch gives the frequency of the same pitch one octave higher. For example, if the frequency of "A" in a given octave is 55.00 Hz, the frequency of "A" one octave higher is 55.00 × 2 = 110.00 Hz. The frequency band (octaves) over which frequency analysis is performed is determined according to the frequencies the instrument produces, which are known in advance.
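The pitch frequencies described above can be tabulated with a short Python sketch. Two things here are assumptions rather than statements from the description: the function name, and the use of the exact equal-temperament ratio 2^(1/12) (about 1.0595) in place of the rounded factor of about 1.0592, so that twelve steps give exactly one octave.

```python
NOTE_NAMES = ["A", "Bb", "B", "C", "C#", "D", "Eb", "E", "F", "F#", "G", "Ab"]


def pitch_table(base_hz=55.0, octaves=3):
    """Return (name, octave_index, frequency_hz) for each pitch, lowest first.

    Frequencies follow base_hz * 2 ** (step / 12), so every 12th entry is
    exactly double the one an octave below it.
    """
    table = []
    for step in range(12 * octaves):
        name = NOTE_NAMES[step % 12]
        table.append((name, step // 12, base_hz * 2 ** (step / 12)))
    return table


for name, octave, freq in pitch_table(octaves=1):
    print(f"{name:<2} octave {octave}: {freq:7.2f} Hz")
```

With a base of 55 Hz and three octaves, this yields the 36 analysis frequencies used as an example later in the description.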

The frequency analysis unit 13 derives an amplitude intensity at the frequency of each pitch by integrating the recorded-data signal against a sine wave and a cosine wave at the frequency of each pitch included in the 12-tone scale, and uses that amplitude intensity as the feature value of the pitch. More specifically, the frequency analysis unit 13 integrates the waveform of the recorded data against the sine wave and the cosine wave at the frequency of each pitch and takes the sum of the squares of the two integration results as the amplitude intensity of that pitch. Specifically, the frequency analysis unit 13 derives the amplitude intensity at the frequency of each pitch based on the following equation (1).

$$\mathrm{Power}(f) = \left( \sum_{k=1}^{n} x(k)\,\sin\!\left( \frac{2\pi f k}{\mathrm{Sampling\ Rate}} \right) \right)^{2} + \left( \sum_{k=1}^{n} x(k)\,\cos\!\left( \frac{2\pi f k}{\mathrm{Sampling\ Rate}} \right) \right)^{2} \tag{1}$$

In equation (1), f is the frequency (the frequency of one of the pitches included in the 12-tone scale), n is the number of samples in the recorded data, x(k) is the amplitude value of the k-th sample, Sampling Rate is the sampling frequency, and Power(f) is the amplitude intensity of the pitch (one of the pitches included in the 12-tone scale) whose frequency is f.
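The per-pitch amplitude intensity can be sketched in Python as follows, following the verbal description of equation (1): correlate the frame with a sine and a cosine at the pitch frequency and add the squares of the two sums. The function names and the choice to start the sample index at 1 are assumptions made for this sketch.

```python
import math


def pitch_power(frame, freq_hz, sample_rate_hz):
    """Amplitude intensity of one pitch: squared sine correlation plus
    squared cosine correlation of the frame at freq_hz."""
    sin_sum = 0.0
    cos_sum = 0.0
    for k, x in enumerate(frame, start=1):
        phase = 2.0 * math.pi * freq_hz * k / sample_rate_hz
        sin_sum += x * math.sin(phase)
        cos_sum += x * math.cos(phase)
    return sin_sum * sin_sum + cos_sum * cos_sum


def pitch_powers(frame, freqs_hz, sample_rate_hz):
    """Evaluate pitch_power at each analysis frequency, in the given order."""
    return [pitch_power(frame, f, sample_rate_hz) for f in freqs_hz]
```

Because the analysis frequencies are chosen directly rather than fixed by the frame length, the same function can be applied to a frame of any length. Shorter frames trade sharper time resolution for more leakage between neighbouring pitches.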

Based on equation (1), the frequency analysis unit 13 derives the amplitude intensity at each frequency in steps of a factor of 1.0592, starting from, for example, 55 Hz, the frequency of the lowest "A". That is, as shown in FIG. 3, the frequency analysis unit 13 derives the amplitude intensity for m + 1 frequencies, from 55 Hz (the lowest "A") up to 55 × 1.0592^m Hz. Here m + 1 is at least the number of pitches in the 12-tone scale (12); that is, m is an integer of 11 or more. The frequency analysis unit 13 derives the amplitude intensity for, for example, the pitch frequencies of three octaves (12 × 3 = 36 frequencies).

For pitches that are the same but lie in different octaves, the frequency analysis unit 13 adds their amplitude intensities together, and finally derives the amplitude intensities of twelve mutually different pitches as a scale vector (chroma vector), which is the feature value of each pitch. In the example shown in FIG. 4, the amplitude intensities of the same pitch in three octaves, namely the octave from 55 Hz to 104 Hz, the octave from 110 Hz to 208 Hz, and the octave from 220 Hz to 416 Hz, are summed to derive the scale vector (chroma vector). The frequency analysis unit 13 normalizes the scale vector value of each pitch to a value from 0 to 1 (a first feature value based on the feature value of each pitch) and outputs it to the multiple sound analysis unit 15. The normalization is performed, for example, by dividing the scale vector value of each pitch by the largest scale vector value among the pitches.
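The octave folding and normalization described above could look like the following Python sketch. It assumes the per-pitch amplitude intensities are supplied lowest pitch first in semitone order. The function names and the handling of an all-zero input are assumptions.

```python
def chroma_vector(powers, pitches_per_octave=12):
    """Fold per-pitch intensities (lowest first, semitone steps) into one
    value per pitch class by summing the same pitch across octaves."""
    chroma = [0.0] * pitches_per_octave
    for index, power in enumerate(powers):
        chroma[index % pitches_per_octave] += power
    return chroma


def normalize_by_max(values):
    """Scale values into the range 0 to 1 by dividing by the largest one.
    An all-zero (silent) input is returned as zeros; that case is an
    assumption, since the description does not cover it."""
    peak = max(values)
    if peak <= 0.0:
        return [0.0] * len(values)
    return [v / peak for v in values]
```

For 36 input intensities (three octaves), entries 0, 12, and 24 are summed into the first chroma element, entries 1, 13, and 25 into the second, and so on.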

The F0 estimation unit 14 estimates the pitch that is the fundamental frequency (F0) of the recorded data. The fundamental frequency is the frequency of the lowest frequency component contained in a signal. The F0 estimation unit 14 processes the recorded data with a low-pass filter that reduces frequency components higher than a predetermined frequency, and estimates the pitch that is the fundamental frequency of the reduced signal, which is the signal after that processing. For example, the F0 estimation unit 14 may analyze the pitch of sounds up to the lowest octave band (55 Hz to 110 Hz). In this case, a low-pass filter that passes only sounds of 110 Hz or below is used.
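The description does not specify how the low-pass filter is designed. Purely as an illustration, the following Python sketch builds a Hamming-windowed sinc FIR filter and applies it by direct convolution. The filter type, the 201-tap length, and the function names are all assumptions. A filter this short at a 16000 Hz sampling frequency rolls off gradually, so separating 110 Hz cleanly from 220 Hz would need a considerably longer filter or prior downsampling.

```python
import math


def lowpass_fir(cutoff_hz, sample_rate_hz, num_taps=201):
    """Hamming-windowed sinc low-pass coefficients, scaled to unity DC gain."""
    fc = cutoff_hz / sample_rate_hz          # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        t = i - mid
        if t == 0:
            sinc = 2.0 * fc
        else:
            sinc = math.sin(2.0 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * i / (num_taps - 1))
        taps.append(sinc * window)
    total = sum(taps)
    return [h / total for h in taps]


def apply_fir(signal, taps):
    """Causal FIR filtering: out[n] = sum_j taps[j] * signal[n - j],
    treating samples before the start of the signal as zero."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for j, h in enumerate(taps):
            if n - j < 0:
                break
            acc += h * signal[n - j]
        out.append(acc)
    return out
```

The output of `apply_fir` corresponds to the reduced signal in the description. Its first `num_taps` samples contain the filter's start-up transient and are best ignored in later analysis.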

The purpose of the processing by the F0 estimation unit 14 is as follows. As described above, the frequency analysis unit 13 derives an amplitude intensity for each of the twelve pitches. The multiple sound analysis unit 15, described later, estimates the chord based on the combination of pitches with large amplitude intensities (details are given later). However, a chord cannot always be uniquely determined from the combination of pitches. Even when the combination of pitches is the same, there exist, for example, C (with the root as the bass, the lowest note), its first inversion C/E (with the third as the bass), and its second inversion C/G (with the fifth as the bass), so deriving the amplitude intensity of each pitch alone may not allow the chord to be estimated accurately. In contrast, having the F0 estimation unit 14 extract the strongest pitch in only the lowest octave band makes the bass note easier to determine. The bass note is often a single tone within the lowest octave band, which makes it easy to determine even with the autocorrelation method. By using a reduced signal in which the low-pass filter has reduced the frequency components higher than a predetermined frequency, and thus the frequency components of the chord's pitches other than the bass, the frequency of the bass (the fundamental frequency) becomes easier to estimate, and the multiple sound analysis unit 15 described later can estimate the chord even when inversions are taken into account.

For the reduced signal, which is the signal processed by the low-pass filter described above, the F0 estimation unit 14 derives for each pitch a first likelihood indicating how plausible it is that the pitch in the 12-tone scale is the fundamental frequency, using, for example, the autocorrelation method. The autocorrelation method is a commonly used technique for deriving a fundamental frequency: two identical copies of data containing a plurality of samples are prepared, the correlation value between the copies (the autocorrelation value) is derived with one copy shifted by k samples, and the fundamental frequency is derived from that autocorrelation value. The autocorrelation value is obtained by multiplying corresponding samples of the two copies in the shifted state and summing the products over all samples, and the correlation between the copies is judged from that sum. For example, the autocorrelation value at a shift of k samples is taken as the likelihood that (sampling frequency)/k (Hz) is the fundamental frequency. In the present embodiment, the frequency of each pitch in the 12-tone scale is fixed, so the autocorrelation value is derived at a shift of (sampling frequency)/(frequency of each pitch in the 12-tone scale) samples, which yields the first likelihood that each pitch in the 12-tone scale is the fundamental frequency. The F0 estimation unit 14 assigns a higher first likelihood to a pitch with a higher autocorrelation value. In this way, the F0 estimation unit 14 estimates the fundamental-frequency pitch by deriving the first likelihood of each pitch. The F0 estimation unit 14 normalizes the first likelihood of each pitch to a value from 0 to 1 (a second feature value based on the first likelihood of each pitch) and outputs it to the multiple sound analysis unit 15. The normalization is performed, for example, by dividing the first likelihood of each pitch by the largest first likelihood among the pitches.
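A minimal Python sketch of this per-pitch autocorrelation step is given below. It evaluates the autocorrelation only at the lag corresponding to each candidate pitch and then normalizes by the maximum. Rounding the lag to the nearest whole sample and clamping negative autocorrelation values to zero are assumptions made here so that the result stays within 0 to 1. The function names are also hypothetical.

```python
def autocorrelation(signal, lag):
    """Sum of products of the signal with itself shifted by lag samples."""
    return sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))


def first_likelihoods(signal, pitch_freqs_hz, sample_rate_hz):
    """Normalized likelihood (0 to 1) that each candidate pitch is the
    fundamental frequency of the signal."""
    scores = []
    for freq in pitch_freqs_hz:
        lag = round(sample_rate_hz / freq)      # samples per period of this pitch
        value = autocorrelation(signal, lag)
        scores.append(max(value, 0.0))          # assumption: negatives count as 0
    peak = max(scores)
    if peak <= 0.0:
        return [0.0] * len(scores)
    return [s / peak for s in scores]
```

In the arrangement described above, `signal` would be the reduced signal after low-pass filtering and `pitch_freqs_hz` the twelve pitch frequencies of the lowest octave. Note that the signal must be longer than the largest lag (about 291 samples for 55 Hz at 16000 Hz), so more than one 256-sample frame is needed for the lowest pitches.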

The multiple sound analysis unit 15 identifies the chord in the recording data based on the frequency analysis result for the frequency of each pitch from the frequency analysis unit 13 and the estimation result of the pitch that is the fundamental frequency from the F0 estimation unit 14. As described above, the multiple sound analysis unit 15 receives, as the frequency analysis result, first feature values obtained by normalizing the scale vector (chroma vector) that holds the feature value of each pitch, and receives, as the estimation result of the pitch that is the fundamental frequency, second feature values obtained by normalizing the first likelihood that each pitch is the fundamental frequency.

The multiple sound analysis unit 15 has the neural network N1 shown in FIG. 5, and uses the neural network N1 to identify and learn chords in the recording data. The neural network N1 takes as input the first feature values, which are based on the scale vector (chroma vector) holding the feature value of each pitch, and the second feature values, which are based on the first likelihood that each pitch is the fundamental frequency, and outputs a second likelihood indicating how plausible each chord is for the recording data. The second likelihood is, for example, a sigmoid function value and can take a value from 0 to 1. A larger second likelihood means that the chord is more likely to be the chord contained in the recording data. The multiple sound analysis unit 15 identifies, for example, the chord with the largest second likelihood as the chord in the recording data, and outputs information specifying that chord to the output unit 16.

The neural network N1 includes an input layer 31 containing a plurality of input nodes 311 corresponding to the first feature values of the pitches, an input layer 32 containing a plurality of input nodes 321 corresponding to the second feature values of the pitches, an intermediate layer 33 containing a plurality of intermediate nodes 331, and an output layer 34 containing output nodes 341. The input nodes 311 are provided one for each of the twelve first feature values received from the frequency analysis unit 13, and each of the twelve input nodes 311 receives the first feature value of one pitch. The input nodes 321 are provided one for each of the twelve second feature values received from the F0 estimation unit 14, and each of the twelve input nodes 321 receives the second feature value of one pitch. Each intermediate node 331 performs a predetermined calculation using the first feature values input to one or more input nodes 311 and the second feature values input to one or more input nodes 321, and outputs the result to the output nodes 341. There are as many output nodes 341 as chords to be estimated, and each output node 341 outputs the second likelihood of one chord. Each output node 341 calculates the second likelihood of its chord from the calculation results received from the intermediate nodes 331 and outputs it.
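
The layer structure described above can be sketched as a forward pass in plain Python. This is only an illustration under stated assumptions: the class name `ChordNet`, the tanh activation in the intermediate layer, the number of intermediate nodes, and the random (untrained) weights are all hypothetical, since the description specifies only "a predetermined calculation" and sigmoid-valued outputs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class ChordNet:
    """Hypothetical stand-in for neural network N1 (untrained weights)."""

    def __init__(self, n_hidden, chord_labels, seed=0):
        rng = random.Random(seed)
        self.chord_labels = list(chord_labels)
        n_in = 12 + 12  # input nodes 311 (first features) + 321 (second features)
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.w2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)]
                   for _ in self.chord_labels]

    def forward(self, first_features, second_features):
        """Return one second likelihood (0..1) per candidate chord."""
        x = list(first_features) + list(second_features)
        # Intermediate nodes 331; tanh is an assumed activation.
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)))
                  for row in self.w1]
        # Output nodes 341: one sigmoid value per chord.
        return [sigmoid(sum(w * h for w, h in zip(row, hidden)))
                for row in self.w2]

    def identify(self, first_features, second_features):
        """Pick the chord whose second likelihood is largest."""
        scores = self.forward(first_features, second_features)
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.chord_labels[best], scores
```

With trained weights, `identify` would correspond to selecting the chord with the largest second likelihood; with the random weights used here, the returned label is meaningless and only the data flow is illustrative.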

出力部１６は、多重音解析部１５によって識別された和音を示す識別結果を出力する。出力部１６は、多重音解析部１５から入力された和音を特定する情報を識別結果として、楽器音認識装置１０の外部に出力する。 The output unit 16 outputs an identification result indicating the chord identified by the multiple sound analysis unit 15. The output unit 16 outputs information specifying the chord input from the multiple sound analysis unit 15 to the outside of the instrument sound recognition apparatus 10 as an identification result.

Next, a series of processes of the instrument sound recognition method in the musical instrument sound recognition apparatus 10 will be described with reference to FIG. 6. FIG. 6 is a flowchart showing the series of processes of the instrument sound recognition method performed by the musical instrument sound recognition apparatus 10.

まず、録音部１１が、楽器音を所定の時間単位でサンプリングし、サンプリングした楽器音を録音データとして順次取得する（ステップＳ１）。そして、録音部１１は、各サンプルを時系列に配列して、所定数のサンプル毎に録音データとして録音データ格納部１２に順次出力する。 First, the recording unit 11 samples instrument sounds in predetermined time units, and sequentially acquires the sampled instrument sounds as recording data (step S1). Then, the recording unit 11 arranges the samples in time series and sequentially outputs the recorded data to the recording data storage unit 12 as recording data for each predetermined number of samples.

続いて、録音データ格納部１２は、録音部１１から録音データを順次受け取り、録音部１１によって取得された録音データを格納する（ステップＳ２）。録音データ格納部１２は、格納されている録音データを、周波数解析部１３及びＦ０推定部１４に出力する。 Subsequently, the recording data storage unit 12 sequentially receives the recording data from the recording unit 11 and stores the recording data acquired by the recording unit 11 (step S2). The recording data storage unit 12 outputs the stored recording data to the frequency analysis unit 13 and the F0 estimation unit 14.

Subsequently, the frequency analysis unit 13 performs frequency analysis on the recording data at the frequency of each pitch included in the twelve-tone scale (step S3). The frequency analysis unit 13 integrates the signal of the recording data against a sine wave and a cosine wave at the frequency of each pitch included in the twelve-tone scale, thereby deriving an amplitude intensity for the frequency of each pitch, and uses that amplitude intensity as the feature value of the pitch. The frequency analysis unit 13 outputs the values obtained by normalizing the scale vector, which holds the feature value of each pitch, to the range 0 to 1 (first feature values based on the feature value of each pitch) to the multiple sound analysis unit 15.
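
Step S3 can be sketched in Python as a correlation of the signal with a sine and a cosine at each pitch frequency. The pitch table (a single equal-tempered octave from 55 Hz), the function names, and the division by the frame length are assumptions for illustration; the description does not say how octaves are combined into the scale vector.

```python
import math

# Assumed pitch table: one equal-tempered octave starting at A = 55 Hz.
PITCH_FREQS = [55.0 * 2 ** (i / 12) for i in range(12)]


def pitch_amplitudes(signal, sample_rate, freqs=PITCH_FREQS):
    """Amplitude intensity of `signal` at each pitch frequency."""
    amps = []
    for freq in freqs:
        w = 2.0 * math.pi * freq / sample_rate
        # Discrete "integration" against the cosine and sine waves.
        re = sum(x * math.cos(w * n) for n, x in enumerate(signal))
        im = sum(x * math.sin(w * n) for n, x in enumerate(signal))
        amps.append(math.hypot(re, im) / len(signal))
    return amps


def normalize(values):
    """Scale values to 0..1 by dividing by the largest one."""
    peak = max(values)
    return [v / peak if peak > 0 else 0.0 for v in values]
```

Because the analysis frequencies are chosen directly rather than tied to multiples of 1/T, the frame passed as `signal` can have any length.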

Subsequently, the F0 estimation unit 14 estimates the pitch that is the fundamental frequency (F0) of the recording data (step S4). The F0 estimation unit 14 processes the recording data with a low-pass filter that reduces frequency components higher than a predetermined frequency, and estimates the pitch that is the fundamental frequency of the reduced signal, that is, the signal after this processing. For the reduced signal, the F0 estimation unit 14 uses, for example, the autocorrelation method to derive, for each pitch in the twelve-tone scale, a first likelihood indicating how plausible it is that the pitch is the fundamental frequency. The F0 estimation unit 14 outputs the values obtained by normalizing the first likelihood of each pitch to the range 0 to 1 (second feature values based on the first likelihood of each pitch) to the multiple sound analysis unit 15.
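
The low-pass filtering in step S4 can be illustrated with a small Python sketch. The description does not name a filter type or cutoff, so the one-pole IIR filter and the `cutoff_hz` parameter below are assumptions chosen for brevity, not the filter of the embodiment.

```python
import math


def low_pass(signal, sample_rate, cutoff_hz):
    """One-pole IIR low-pass: y[n] = y[n-1] + alpha * (x[n] - y[n-1])."""
    dt = 1.0 / sample_rate
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)
    out, y = [], 0.0
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out


def rms(signal):
    """Root-mean-square level, used here only to compare attenuation."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))
```

The output of `low_pass` plays the role of the reduced signal: components well above the cutoff are attenuated, so the bass note dominates the subsequent fundamental frequency estimation.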

Subsequently, the multiple sound analysis unit 15 identifies the chord in the recording data based on the frequency analysis result for the frequency of each pitch from the frequency analysis unit 13 and the estimation result of the pitch that is the fundamental frequency from the F0 estimation unit 14 (step S5). The multiple sound analysis unit 15 identifies the chord in the recording data using the neural network N1. The neural network N1 takes as input the first feature values, which are based on the scale vector (chroma vector) holding the feature value of each pitch, and the second feature values, which are based on the first likelihood that each pitch is the fundamental frequency, and outputs a second likelihood indicating how plausible each chord is for the recording data. The multiple sound analysis unit 15 identifies, for example, the chord with the largest second likelihood as the chord in the recording data, and outputs information specifying that chord to the output unit 16.

Then, the output unit 16 outputs the identification result indicating the chord identified by the multiple sound analysis unit 15 to the outside of the musical instrument sound recognition apparatus 10 (step S6). The above is an example of the series of processes of the instrument sound recognition method performed by the musical instrument sound recognition apparatus 10.

Next, an instrument sound recognition program P for causing a computer to function as the musical instrument sound recognition apparatus 10 will be described with reference to FIG. 8.

The instrument sound recognition program P includes a main module P10, a recording module P11, a recording data storage module P12, a frequency analysis module P13, an F0 estimation module P14, a multiple sound analysis module P15, and an output module P16. The main module P10 is the part that performs overall control of the processing of the musical instrument sound recognition apparatus 10. The functions realized by executing the recording module P11, the recording data storage module P12, the frequency analysis module P13, the F0 estimation module P14, the multiple sound analysis module P15, and the output module P16 are the same as the functions of the recording unit 11, the recording data storage unit 12, the frequency analysis unit 13, the F0 estimation unit 14, the multiple sound analysis unit 15, and the output unit 16, respectively.

楽器音認識プログラムＰは、例えば、ＣＤ−ＲＯＭ、ＤＶＤ若しくはＲＯＭ等の記録媒体又は半導体メモリによって提供される。また、楽器音認識プログラムＰは、搬送波に重畳されたコンピュータデータ信号としてネットワークを介して提供されてもよい。 The musical instrument sound recognition program P is provided by, for example, a recording medium such as a CD-ROM, a DVD, or a ROM, or a semiconductor memory. The musical instrument sound recognition program P may be provided via a network as a computer data signal superimposed on a carrier wave.

次に、本実施形態に係る楽器音認識装置１０の作用効果について、従来の楽器音認識技術と対比しながら説明する。 Next, the effect of the musical instrument sound recognition apparatus 10 according to the present embodiment will be described in comparison with the conventional musical instrument sound recognition technology.

In conventional instrument sound recognition technology, frequency analysis by Fourier transform is generally used to recognize chords. Frequency analysis by Fourier transform uses orthogonal frequencies: as shown in FIG. 7, when the time frame length is T, the basis frequencies are 1/T, 2/T, ..., n/T (for n samples), and the frequency spacing is 1/T. Consequently, to analyze down to, for example, the low A (55 Hz), whose difference from the B-flat one semitone above (58 Hz) is 3 Hz, a frequency resolution in which 1/T is smaller than 3 Hz is required. The time frame length T must therefore be at least 333 milliseconds, which makes it difficult to analyze a melody whose pitch changes faster than every 333 milliseconds (hereinafter sometimes referred to as the "first problem").
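
The frame-length bound in the preceding paragraph follows directly from the bin spacing of the Fourier transform. As a short worked derivation (the symbol $\Delta f$ for the bin spacing is introduced here for illustration):

```latex
\Delta f = \frac{1}{T}, \qquad
\Delta f < 58\,\mathrm{Hz} - 55\,\mathrm{Hz} = 3\,\mathrm{Hz}
\;\Longrightarrow\;
T > \frac{1}{3\,\mathrm{Hz}} \approx 0.333\,\mathrm{s} = 333\,\mathrm{ms}
```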

Furthermore, even when the combination of pitches is the same, a chord can take different forms: for example, C (with the root as the bass, that is, the lowest note), its first inversion C/E (with the third as the bass), and its second inversion C/G (with the fifth as the bass). Merely recognizing the combination of pitches may therefore fail to estimate the chord accurately (hereinafter sometimes referred to as the "second problem"). Because of the first and second problems described above, conventional instrument sound recognition technology could hardly be said to ensure sufficient chord recognition accuracy.
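
The second problem can be made concrete with a small Python example. The MIDI note numbers and helper names below are illustrative assumptions, not part of the embodiment; they show that C, C/E, and C/G share one set of pitch classes and differ only in the bass.

```python
def pitch_classes(midi_notes):
    """Set of pitch classes (0 = C, ..., 11 = B), ignoring octave."""
    return frozenset(n % 12 for n in midi_notes)


def bass_pitch_class(midi_notes):
    """Pitch class of the lowest note, i.e. the bass."""
    return min(midi_notes) % 12


# Example voicings (MIDI numbers: 48 = C3, 52 = E3, 55 = G3, 60 = C4, 64 = E4).
C_ROOT = [48, 52, 55]      # C   : root in the bass
C_OVER_E = [52, 55, 60]    # C/E : first inversion, third in the bass
C_OVER_G = [55, 60, 64]    # C/G : second inversion, fifth in the bass

# All three chords contain the same pitch classes {C, E, G} ...
same_set = (pitch_classes(C_ROOT) == pitch_classes(C_OVER_E)
            == pitch_classes(C_OVER_G))
# ... but their bass notes differ: 0 (C), 4 (E), 7 (G).
basses = [bass_pitch_class(c) for c in (C_ROOT, C_OVER_E, C_OVER_G)]
```

A feature that depends only on the pitch-class set cannot separate these three chords, which is why an estimate of the lowest pitch is needed in addition.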

In contrast, in the musical instrument sound recognition apparatus 10 according to this embodiment, to solve the first problem, frequency analysis is performed on the recording data of the instrument sound at the frequency corresponding to each pitch in the twelve-tone scale, and a plurality of pitches are identified. That is, the musical instrument sound recognition apparatus 10 replaces the basis frequencies of the discrete Fourier transform used in conventional instrument sound recognition technology with the frequencies corresponding to the pitches of the twelve-tone scale, and performs frequency analysis without using orthogonal frequencies. In this configuration the frequency spacing is determined independently of the time frame length, so the frequency analysis places no restriction on the time frame length and pitches can be analyzed with an arbitrary time frame length. As a result, a fast melody, for example, can be analyzed over a short time and thus identified appropriately, while a slow melody can be analyzed over a long time. As shown in FIG. 3, the spacing between the frequencies of the pitches in the twelve-tone scale widens toward the higher range, so pitches can be recognized without the fine-grained frequency analysis of a Fourier transform. This also has the advantage that the computation is light enough to implement on an inexpensive device.

In the musical instrument sound recognition apparatus 10 according to this embodiment, to solve the second problem, the recording data is processed by a low-pass filter, the pitch that is the fundamental frequency of the resulting reduced signal is estimated, and the plurality of pitches are identified taking that estimation result into account. Taking the estimated fundamental pitch into account determines the lowest of the pitches in the recording data, so inversions, in which the combination of pitches is the same but the bass (the lowest note) differs, can be distinguished, and complex slash chords can also be recognized. If fundamental frequency estimation were simply applied to recording data in which a plurality of pitches are mixed, problems could arise, such as only some of the pitches being recognized. In this respect, the musical instrument sound recognition apparatus 10 estimates the fundamental pitch from the reduced signal, in which frequency components above a predetermined frequency have been reduced by the low-pass filter. The fundamental pitch can therefore be estimated with the pitches other than the bass attenuated, that is, with the mixing of multiple pitches suppressed, which improves the accuracy of fundamental frequency estimation. As a result, inversions can be distinguished and the plurality of pitches can be recognized appropriately.

As described above, the musical instrument sound recognition apparatus 10 according to this embodiment can recognize the pitches of a fast melody and can recognize each pitch while taking inversions into account, which improves the recognition accuracy for a plurality of pitches.

The frequency analysis unit 13 integrates the signal of the recording data against a sine wave and a cosine wave at the frequency of each pitch included in the twelve-tone scale, thereby deriving an amplitude intensity for the frequency of each pitch, and uses that amplitude intensity as the feature value of the pitch. The multiple sound analysis unit 15 identifies the chord in the recording data using the feature value of each pitch as the frequency analysis result for the frequency of that pitch. This allows frequency analysis to be performed appropriately at the frequency of each pitch in the twelve-tone scale and improves the recognition accuracy for each pitch in the twelve-tone scale.

For the reduced (low-frequency) signal described above, the F0 estimation unit 14 uses the autocorrelation method to derive, for each pitch in the twelve-tone scale, a first likelihood indicating how plausible it is that the pitch is the fundamental frequency, and estimates the pitch that is the fundamental frequency based on the first likelihood of each pitch. This allows the pitch that is the fundamental frequency to be estimated with high accuracy.

The multiple sound analysis unit 15 identifies the chord in the recording data using the neural network N1, which takes as input the first feature values based on the feature value of each pitch and the second feature values based on the first likelihood of each pitch, and outputs a second likelihood indicating the plausibility of the plurality of pitches in the recording data. The chord can thus be estimated appropriately by considering both the first feature values, which reflect the frequency analysis result for the frequency of each pitch, and the second feature values, which are based on the first likelihood that each pitch is the fundamental frequency.

The block diagrams used in the description of the above embodiment show blocks in functional units. These functional blocks (components) are realized by any combination of hardware and/or software. The means for realizing each functional block is not particularly limited. That is, each functional block may be realized by one device that is physically and/or logically coupled, or by two or more physically and/or logically separate devices that are connected directly and/or indirectly (for example, by wire and/or wirelessly).

例えば、上記実施形態における楽器音認識装置１０などは、上記実施形態の楽器音認識装置１０の処理を行うコンピュータとして機能してもよい。図２は、本実施形態に係る楽器音認識装置１０のハードウェア構成の一例を示す図である。上述の楽器音認識装置１０は、物理的には、プロセッサ１００１、メモリ１００２、ストレージ１００３、通信装置１００４、入力装置１００５、出力装置１００６、及びバス１００７などを含むコンピュータ装置として構成されてもよい。 For example, the instrument sound recognition apparatus 10 in the above embodiment may function as a computer that performs processing of the instrument sound recognition apparatus 10 in the above embodiment. FIG. 2 is a diagram illustrating an example of a hardware configuration of the musical instrument sound recognition apparatus 10 according to the present embodiment. The musical instrument sound recognition apparatus 10 described above may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and the like.

なお、以下の説明では、「装置」という文言は、回路、デバイス、ユニットなどに読み替えることができる。楽器音認識装置１０のハードウェア構成は、図２に示された各装置を１つ又は複数含むように構成されてもよいし、一部の装置を含まずに構成されてもよい。 In the following description, the term “apparatus” can be read as a circuit, a device, a unit, or the like. The hardware configuration of the musical instrument sound recognition device 10 may be configured to include one or a plurality of the devices illustrated in FIG. 2 or may be configured not to include some devices.

Each function of the musical instrument sound recognition apparatus 10 is realized by loading predetermined software (a program) onto hardware such as the processor 1001 and the memory 1002, so that the processor 1001 performs computation and controls communication by the communication device 1004 and the reading and/or writing of data in the memory 1002 and the storage 1003.

プロセッサ１００１は、例えば、オペレーティングシステムを動作させてコンピュータ全体を制御する。プロセッサ１００１は、周辺装置とのインターフェース、制御装置、演算装置、レジスタなどを含む中央処理装置（ＣＰＵ：Central Processing Unit）で構成されてもよい。 For example, the processor 1001 controls the entire computer by operating an operating system. The processor 1001 may be configured by a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic device, a register, and the like.

また、プロセッサ１００１は、プログラム（プログラムコード）、ソフトウェアモジュール、及び／又はデータを、ストレージ１００３及び／又は通信装置１００４からメモリ１００２に読み出し、これらに従って各種の処理を実行する。プログラムとしては、上述の実施の形態で説明した動作の少なくとも一部をコンピュータに実行させるプログラムが用いられる。例えば、楽器音認識装置１０の周波数解析部１３は、メモリ１００２に格納され、プロセッサ１００１で動作する制御プログラムによって実現されてもよく、他の機能ブロックについても同様に実現されてもよい。上述の各種処理は、１つのプロセッサ１００１で実行される旨を説明してきたが、２以上のプロセッサ１００１により同時又は逐次に実行されてもよい。プロセッサ１００１は、１以上のチップで実装されてもよい。なお、プログラムは、電気通信回線を介してネットワークから送信されてもよい。 Further, the processor 1001 reads a program (program code), a software module, and / or data from the storage 1003 and / or the communication device 1004 to the memory 1002, and executes various processes according to these. As the program, a program that causes a computer to execute at least a part of the operations described in the above embodiments is used. For example, the frequency analysis unit 13 of the musical instrument sound recognition apparatus 10 may be realized by a control program stored in the memory 1002 and operated by the processor 1001, and may be realized similarly for other functional blocks. Although the above-described various processes have been described as being executed by one processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001. The processor 1001 may be implemented by one or more chips. Note that the program may be transmitted from a network via a telecommunication line.

メモリ１００２は、コンピュータ読み取り可能な記録媒体であり、例えば、ＲＯＭ（Read Only Memory）、ＥＰＲＯＭ（Erasable Programmable ＲＯＭ）、ＥＥＰＲＯＭ（Electrically Erasable Programmable ＲＯＭ）、ＲＡＭ（Random Access Memory）などの少なくとも１つで構成されてもよい。メモリ１００２は、レジスタ、キャッシュ、メインメモリ（主記憶装置）などと呼ばれてもよい。メモリ１００２は、上記実施形態に係る楽器音認識方法を実施するために実行可能なプログラム（プログラムコード）、ソフトウェアモジュールなどを保存することができる。 The memory 1002 is a computer-readable recording medium and includes, for example, at least one of ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), RAM (Random Access Memory), and the like. May be. The memory 1002 may be called a register, a cache, a main memory (main storage device), or the like. The memory 1002 can store programs (program codes), software modules, and the like that can be executed to implement the instrument sound recognition method according to the above-described embodiment.

ストレージ１００３は、コンピュータ読み取り可能な記録媒体であり、例えば、ＣＤ−ＲＯＭ（Compact Disc ＲＯＭ）などの光ディスク、ハードディスクドライブ、フレキシブルディスク、光磁気ディスク（例えば、コンパクトディスク、デジタル多用途ディスク、Ｂｌｕ−ｒａｙ（登録商標）ディスク）、スマートカード、フラッシュメモリ（例えば、カード、スティック、キードライブ）、フロッピー（登録商標）ディスク、磁気ストリップなどの少なくとも１つで構成されてもよい。ストレージ１００３は、補助記憶装置と呼ばれてもよい。上述の記憶媒体は、例えば、メモリ１００２及び／又はストレージ１００３を含むデータベース、サーバ、その他の適切な媒体であってもよい。 The storage 1003 is a computer-readable recording medium such as an optical disc such as a CD-ROM (Compact Disc ROM), a hard disc drive, a flexible disc, a magneto-optical disc (eg, a compact disc, a digital versatile disc, a Blu-ray). (Registered trademark) disk, smart card, flash memory (for example, card, stick, key drive), floppy (registered trademark) disk, magnetic strip, and the like. The storage 1003 may be referred to as an auxiliary storage device. The storage medium described above may be, for example, a database, a server, or other suitable medium including the memory 1002 and / or the storage 1003.

通信装置１００４は、有線及び／又は無線ネットワークを介してコンピュータ間の通信を行うためのハードウェア（送受信デバイス）であり、例えばネットワークデバイス、ネットワークコントローラ、ネットワークカード、通信モジュールなどともいう。 The communication device 1004 is hardware (transmission / reception device) for performing communication between computers via a wired and / or wireless network, and is also referred to as a network device, a network controller, a network card, a communication module, or the like.

入力装置１００５は、外部からの入力を受け付ける入力デバイス（例えば、キーボード、マウス、マイクロフォン、スイッチ、ボタン、センサなど）である。出力装置１００６は、外部への出力を実施する出力デバイス（例えば、ディスプレイ、スピーカー、ＬＥＤランプなど）である。なお、入力装置１００５及び出力装置１００６は、一体となった構成（例えば、タッチパネル）であってもよい。 The input device 1005 is an input device (for example, a keyboard, a mouse, a microphone, a switch, a button, a sensor, or the like) that accepts an external input. The output device 1006 is an output device (for example, a display, a speaker, an LED lamp, etc.) that performs output to the outside. The input device 1005 and the output device 1006 may have an integrated configuration (for example, a touch panel).

また、プロセッサ１００１及びメモリ１００２などの各装置は、情報を通信するためのバス１００７で接続される。バス１００７は、単一のバスで構成されてもよいし、装置間で異なるバスで構成されてもよい。 Each device such as the processor 1001 and the memory 1002 is connected by a bus 1007 for communicating information. The bus 1007 may be configured with a single bus or may be configured with different buses between apparatuses.

また、楽器音認識装置１０は、マイクロプロセッサ、デジタル信号プロセッサ（ＤＳＰ：Digital Signal Processor）、ＡＳＩＣ（Application Specific Integrated Circuit）、ＰＬＤ（Programmable Logic Device）、ＦＰＧＡ（Field Programmable Gate Array）などのハードウェアを含んで構成されてもよく、当該ハードウェアにより、各機能ブロックの一部又は全てが実現されてもよい。例えば、プロセッサ１００１は、これらのハードウェアの少なくとも１つで実装されてもよい。 The musical instrument sound recognition apparatus 10 includes hardware such as a microprocessor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a programmable logic device (PLD), and a field programmable gate array (FPGA). A part or all of each functional block may be realized by the hardware. For example, the processor 1001 may be implemented by at least one of these hardware.

以上、本発明について詳細に説明したが、当業者にとっては、本発明が本明細書中に説明した実施形態に限定されるものではないということは明らかである。本発明は、特許請求の範囲の記載により定まる本発明の趣旨及び範囲を逸脱することなく修正及び変更された態様として実施することができる。したがって、本明細書の記載は、例示説明を目的とするものであり、本発明に対して何ら制限的な意味を有するものではない。 Although the present invention has been described in detail above, it will be apparent to those skilled in the art that the present invention is not limited to the embodiments described herein. The present invention can be implemented as a modified and changed mode without departing from the spirit and scope of the present invention defined by the description of the scope of claims. Therefore, the description of the present specification is for illustrative purposes and does not have any limiting meaning to the present invention.

本明細書で説明した各態様／実施形態の処理手順、シーケンス、フローチャートなどは、矛盾の無い限り、順序を入れ替えてもよい。例えば、本明細書で説明した方法については、例示的な順序で様々なステップの要素を提示しており、提示した特定の順序に限定されない。 As long as there is no contradiction, the order of the processing procedures, sequences, flowcharts, and the like of each aspect / embodiment described in this specification may be changed. For example, the methods described herein present the elements of the various steps in an exemplary order and are not limited to the specific order presented.

入出力された情報等は特定の場所（例えば、メモリ）に保存されてもよいし、管理テーブルで管理されてもよい。入出力される情報等は、上書き、更新、又は追記され得る。出力された情報等は削除されてもよい。入力された情報等は他の装置へ送信されてもよい。 Input / output information or the like may be stored in a specific location (for example, a memory) or may be managed by a management table. Input / output information and the like can be overwritten, updated, or additionally written. The output information or the like may be deleted. The input information or the like may be transmitted to another device.

A determination may be made by a value represented by one bit (0 or 1), by a truth value (Boolean: true or false), or by a numerical comparison (for example, a comparison with a predetermined value).

Each aspect/embodiment described in this specification may be used alone, may be used in combination, or may be switched during execution. In addition, notification of predetermined information (for example, notification of "being X") is not limited to being performed explicitly, and may be performed implicitly (for example, by not performing notification of the predetermined information).

Software, whether referred to as software, firmware, middleware, microcode, hardware description language, or by any other name, should be interpreted broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like.

Software, instructions, and the like may also be transmitted and received via a transmission medium. For example, when software is transmitted from a website, server, or other remote source using wired technology such as coaxial cable, fiber optic cable, twisted pair, and digital subscriber line (DSL) and/or wireless technology such as infrared, radio, and microwave, these wired and/or wireless technologies are included within the definition of a transmission medium.

The information, signals, and the like described in this specification may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.

The terms described in this specification and/or terms necessary for understanding this specification may be replaced with terms having the same or similar meanings.

As used herein, the terms “system” and “network” are used interchangeably.

In addition, the information, parameters, and the like described in this specification may be represented by absolute values, by values relative to a predetermined value, or by other corresponding information.

The names used for the parameters described above are not limiting in any respect. Furthermore, the mathematical formulas and the like that use these parameters may differ from those explicitly disclosed in this specification.

The terms “connected” and “coupled”, and any variations thereof, mean any direct or indirect connection or coupling between two or more elements, and can include the presence of one or more intermediate elements between two elements that are “connected” or “coupled” to each other. The coupling or connection between elements may be physical, logical, or a combination thereof. As used herein, two elements can be considered “connected” or “coupled” to each other by the use of one or more electrical wires, cables, and/or printed electrical connections and, as some non-limiting and non-exhaustive examples, by the use of electromagnetic energy having wavelengths in the radio frequency, microwave, and optical (both visible and invisible) regions.

As used herein, the phrase “based on” does not mean “based only on” unless expressly stated otherwise. In other words, the phrase “based on” covers both “based only on” and “based at least on.”

Any reference to elements using designations such as “first” and “second” as used herein does not generally limit the quantity or order of those elements. These designations may be used herein as a convenient way of distinguishing between two or more elements. Thus, a reference to first and second elements does not mean that only two elements can be employed, or that the first element must precede the second element in some way.

To the extent that “including”, “comprising”, and variations thereof are used in this specification or the claims, these terms are intended to be inclusive, in the same manner as the term “comprise” (備える). Furthermore, the term “or” as used in this specification or the claims is not intended to be an exclusive OR.

In this specification, a reference to a device is intended to include a plurality of devices unless the context or the technology clearly indicates that only one such device exists.

Description of reference signs: 10: musical instrument sound recognition device; 13: frequency analysis unit (analysis unit); 14: F0 estimation unit (estimation unit); 15: multi-tone analysis unit (identification unit); 16: output unit; N1: neural network; P: musical instrument sound recognition program.

Claims

1. A musical instrument sound recognition device for recognizing a musical instrument sound, which is a sound of a musical instrument, the device comprising:
an analysis unit that performs frequency analysis on recording data, in which the musical instrument sound is recorded, for each frequency of each pitch included in a predetermined scale;
an estimation unit that processes the recording data with a low-pass filter that reduces frequency components higher than a predetermined frequency, and estimates a pitch that is a fundamental frequency of a reduced signal, which is the signal after the processing;
an identification unit that identifies a plurality of pitches relating to the recording data based on the frequency analysis result for each frequency of each pitch obtained by the analysis unit and on the estimation result of the pitch that is the fundamental frequency obtained by the estimation unit; and
an output unit that outputs an identification result indicating the plurality of pitches identified by the identification unit.

2. The musical instrument sound recognition device according to claim 1, wherein
the analysis unit derives an amplitude intensity for each frequency of each pitch by integrating the signal of the recording data with a sine wave and a cosine wave corresponding to the frequency of each pitch included in the twelve-tone scale, and uses the amplitude intensity as a feature value of each pitch, and
the identification unit identifies the plurality of pitches relating to the recording data by using the feature value of each pitch as the frequency analysis result for each frequency of each pitch.

3. The musical instrument sound recognition device according to claim 2, wherein the estimation unit derives, for each pitch included in the twelve-tone scale, a first likelihood indicating how likely that pitch is to be the fundamental frequency by applying an autocorrelation method to the reduced signal, and estimates the pitch that is the fundamental frequency based on the first likelihood of each pitch.

4. The musical instrument sound recognition device according to claim 3, wherein the identification unit identifies the plurality of pitches relating to the recording data using a neural network that receives, as inputs, a first feature value based on the feature value of each pitch and a second feature value based on the first likelihood of each pitch, and that outputs a second likelihood indicating the likelihood of the plurality of pitches relating to the recording data.

5. A musical instrument sound recognition program that causes a computer to function as:
an analysis unit that performs frequency analysis, for each frequency of each pitch included in a predetermined scale, on recording data in which a musical instrument sound, which is a sound of a musical instrument, is recorded;
an estimation unit that processes the recording data with a low-pass filter that reduces frequency components higher than a predetermined frequency, and estimates a pitch that is a fundamental frequency of a reduced signal, which is the signal after the processing;
an identification unit that identifies a plurality of pitches relating to the recording data based on the frequency analysis result for each frequency of each pitch obtained by the analysis unit and on the estimation result of the pitch that is the fundamental frequency obtained by the estimation unit; and
an output unit that outputs an identification result indicating the plurality of pitches identified by the identification unit.
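
Illustrative note (not part of the claims): the two per-pitch features recited in the dependent claims above, namely the sine/cosine amplitude intensity and the autocorrelation-based first likelihood computed on the low-pass-filtered signal, can be sketched in Python with NumPy as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The function names, the A4 = 440 Hz equal-tempered tuning, the MIDI note range, the moving-average low-pass filter, and the normalization of the autocorrelation are all hypothetical choices made for illustration.

```python
import numpy as np

A4_HZ = 440.0  # assumed tuning reference; the claims do not specify one


def pitch_frequencies(midi_low=55, midi_high=83):
    """Equal-tempered frequencies for an assumed range of MIDI note numbers."""
    notes = np.arange(midi_low, midi_high + 1)
    return notes, A4_HZ * 2.0 ** ((notes - 69) / 12.0)


def amplitude_features(signal, sample_rate, freqs):
    """Amplitude intensity per pitch: accumulate the signal against a sine
    and a cosine at each pitch frequency, then combine the two sums."""
    t = np.arange(len(signal)) / sample_rate
    phase = 2.0 * np.pi * np.outer(freqs, t)   # shape (n_pitches, n_samples)
    sin_sum = np.sin(phase) @ signal
    cos_sum = np.cos(phase) @ signal
    # Scaling by 2/N makes a unit-amplitude sinusoid score roughly 1.0.
    return 2.0 * np.hypot(sin_sum, cos_sum) / len(signal)


def moving_average_lowpass(signal, width):
    """Stand-in low-pass filter (assumption): a simple moving average."""
    kernel = np.ones(width) / width
    return np.convolve(signal, kernel, mode="same")


def autocorr_likelihoods(reduced, sample_rate, freqs):
    """Per-pitch score: normalized autocorrelation of the reduced signal
    at the lag closest to one period of each pitch."""
    x = reduced - reduced.mean()
    energy = np.dot(x, x)
    lags = np.rint(sample_rate / freqs).astype(int)
    scores = [np.dot(x[:-lag], x[lag:]) / energy for lag in lags]
    return np.array(scores)
```

In such a sketch, the vector returned by `amplitude_features` and the vector returned by `autocorr_likelihoods` would play the roles of the inputs to the identification step; how they are combined (for example, by the neural network N1) is left outside this example.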