JP2008145716A - Voice signal processor

Info

Publication number: JP2008145716A
Application number: JP2006332616A
Authority: JP
Inventors: Sadahiro Yasura; 定浩安良
Original assignee: Victor Company of Japan Ltd
Current assignee: Victor Company of Japan Ltd
Priority date: 2006-12-08
Filing date: 2006-12-08
Publication date: 2008-06-26
Anticipated expiration: 2026-12-08
Also published as: JP4862136B2

Abstract

PROBLEM TO BE SOLVED: To achieve high-speed calculation of a feature quantity of a voice signal contained in a voice coded stream by minimizing the amount of decoding of the variable-length coded data of the voice coded stream (or by not decoding the variable-length coded data at all).

SOLUTION: A voice coded stream, obtained by subjecting the voice signal to frequency conversion and coding, is stored in a buffer memory 11. A synchronization signal detection section 12 detects a synchronization signal in the voice coded stream. An auxiliary information extraction section 13 extracts frame length information expressing the length from the detected synchronization signal to the following synchronization signal, and reads the auxiliary information up to the desired auxiliary information (for example, global_gain for the L channel). A feature quantity calculation section 15 calculates the feature quantity of the voice signal from the desired auxiliary information. A data skip section 14 skips the data from the desired auxiliary information to the following synchronization signal.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to an audio signal processing apparatus capable of calculating a feature quantity of the audio signal carried by an audio encoded stream from parameters (auxiliary information) included in that stream.

Analog television broadcasting is scheduled to end in 2011, completing the transition to digital broadcasting, but audio-video equipment has been moving from analog to digital broadcasting ahead of that date. Among current DVD (Digital Versatile Disc) recorders, a growing number of models have tuners that can receive a variety of broadcasts, both analog and digital, such as BS (Broadcasting Satellite) digital, 110-degree CS (Communication Satellite) digital, terrestrial digital, and terrestrial analog. For digital broadcasts in particular, the encoded stream can be recorded as-is on an HDD (Hard Disk Drive). The audio signal in BS digital, 110-degree CS digital, and terrestrial digital broadcasting is encoded with the MPEG (Moving Picture Experts Group)-2 AAC (Advanced Audio Coding) method, multiplexed with the video encoded stream, and transmitted.

Apart from support for digital broadcasting, technology that captures feature points of audio and video, used for functions such as chapter marking (for example, setting scene-change positions) and CM (commercial) cut, has become important for differentiating products from those of other companies in the DVD recorder market, where products are otherwise much alike. With the move to digital broadcasting, there is a demand to capture feature points of the audio signal by using auxiliary information included in the audio encoded stream, without decoding the stream all the way to an audio signal (a process that imposes high load and delay).

Meanwhile, Patent Document 1 below discloses a technique in which, when an input audio signal containing multiple contents in time-division form has been frequency-decomposed and encoded together with a scale factor, the scale factor is extracted as an amplitude, and change points in the content of the input audio signal are detected from the temporal change of that amplitude.
Patent Document 1: JP2003-29772A (paragraphs 0015, 0016, 0064)

However, in the MPEG-2 AAC encoding method, the information called a scale factor is not used as a normalization value that fits the frequency signal into a fixed range, as described in Patent Document 1; it is merely a parameter for scaling the frequency signal up or down. The scale factor is therefore only weakly related to the amplitude of the audio signal, and even if it is extracted, it cannot be used as a feature quantity of the audio signal. In addition, in MPEG-2 AAC the scale factor is itself written into the audio encoded stream with a variable-length code, so extracting it requires a large amount of computation for variable-length decoding.

To solve these problems, an object of the present invention is to provide an audio signal processing apparatus that minimizes the amount of decoding of the variable-length encoded data of an audio encoded stream (or avoids decoding that data altogether) and thereby speeds up the calculation of feature quantities of the audio signal contained in the stream.

To achieve the above object, the present invention provides an audio signal processing apparatus comprising:
audio encoded stream storage means for storing an audio encoded stream made up of a sequence of frames, each frame having a header portion, in which frame length information indicating the length of the frame and a plurality of other items of auxiliary information are arranged in order after a synchronization signal indicating the head of the frame, and an audio stream portion containing an audio signal;
synchronization signal detection means for identifying the head of a frame by detecting the synchronization signal of that frame in the audio encoded stream stored in the audio encoded stream storage means;
auxiliary information reading means for reading, in order from the head of the frame identified by the synchronization signal detection means, the auxiliary information included in the header portion of that frame in the stored audio encoded stream;
data skip means for, immediately after the auxiliary information set as the desired auxiliary information has been read by the auxiliary information reading means, skipping the data up to the head of the next frame, based on the amount of data from the synchronization signal of the frame to the desired auxiliary information and the amount of data in the frame obtained from the frame length information, and thereby controlling when the synchronization signal detection means starts its detection operation; and
feature quantity calculation means for calculating a feature quantity of the audio signal using the desired auxiliary information.

Furthermore, the present invention provides, in addition to the above configuration, an audio signal processing apparatus in which quantization step information used when the audio encoded stream was encoded is set as the desired auxiliary information to be read from the stream.

Furthermore, the present invention provides, in addition to the above configuration, an audio signal processing apparatus in which the quantization step information includes scale factor information representing the amount of scaling applied when the audio encoded stream was encoded, the scale factor information is set as the desired auxiliary information, and the data skip means skips the data from the scale factor information first read by the auxiliary information reading means to the head of the next frame.

Furthermore, the present invention provides, in addition to the above configuration, an audio signal processing apparatus in which the feature quantity calculation means uses the auxiliary information read by the auxiliary information reading means to calculate, as a feature quantity for detecting changes in the volume level of the audio signal, an approximate number of bits obtained by taking the logarithm of each frequency spectrum value produced by frequency conversion of the audio signal and summing the results.

The present invention extracts auxiliary information from the audio encoded stream while keeping the amount of decoding of variable-length encoded data to a minimum (or without decoding that data at all), and calculates the feature quantities of the audio signal contained in the stream from the extracted auxiliary information. This enables high-speed processing; for example, the feature quantities of the audio signal can be calculated in parallel while a DVD recorder is recording the audio encoded stream to a DVD or HDD.

Embodiments of the present invention are described below with reference to the drawings. The description assumes, as an example, an audio encoded stream encoded with the MPEG-2 AAC method, but the audio encoding method to which the present invention applies is not limited to MPEG-2 AAC. MPEG-2 AAC refers to the international standard for audio compression "MPEG-2/Advanced Audio Coding (ISO/IEC 13818-7)", established by MPEG (Moving Picture Experts Group), a working group of ISO (International Organization for Standardization). In the following, MPEG-2 AAC may be referred to simply as AAC.

First, an example of the configuration of the audio signal processing apparatus according to an embodiment of the present invention is described with reference to FIG. 1, which is a block diagram of that configuration. The apparatus shown in FIG. 1 includes a buffer memory 11, a synchronization signal detection unit 12, an auxiliary information extraction unit 13, a data skip unit 14, and a feature quantity calculation unit 15.

The buffer memory 11 temporarily stores an audio encoded stream obtained by frequency-converting and encoding an audio signal. The stream stored in the buffer memory 11 is read as needed by the synchronization signal detection unit 12, the auxiliary information extraction unit 13, and the data skip unit 14.

The synchronization signal detection unit 12 detects the synchronization signal in the audio encoded stream and thereby identifies the start point of one AAU (Audio Access Unit). It starts the detection process on receiving the skip completion flag from the data skip unit 14. When it finds a start point in the stream, the synchronization completion flag is asserted and supplied to the auxiliary information extraction unit 13.

The auxiliary information extraction unit 13 locates the AAU from the start point identified by the synchronization signal detection unit 12 and extracts, from that AAU, the auxiliary information needed to calculate the feature quantity. It starts the extraction process on receiving the synchronization completion flag from the synchronization signal detection unit 12. When the auxiliary information has been extracted, the extraction completion flag is asserted and supplied to the data skip unit 14.

The data skip unit 14 skips the remaining data that has not yet been read (the rest of the current AAU) and moves the read position to the position of the next synchronization signal. It starts moving the read position on receiving the extraction completion flag from the auxiliary information extraction unit 13. When the read position has been moved, the skip completion flag is asserted and supplied to the synchronization signal detection unit 12.

The feature quantity calculation unit 15 calculates the feature quantity of the audio signal using the auxiliary information extracted by the auxiliary information extraction unit 13 and outputs the result.

Next, an example of the structure of an AAC encoded stream, one kind of audio encoded stream, is described with reference to FIG. 2. FIG. 2 shows an example of the structure of an AAC encoded stream handled by the audio signal processing apparatus according to the embodiment, namely the stream produced when a stereo audio signal is encoded.

The AAC encoded stream shown in FIG. 2 assumes the ADTS (Audio Data Transport Stream) format used in digital broadcasting. In FIG. 2, an adts_frame, which corresponds to one AAU, consists of adts_fixed_header, adts_variable_header, adts_error_check, and raw_data_block.

adts_fixed_header carries information whose values do not change from frame to frame, and it begins with the synchronization signal (syncword). adts_variable_header carries information whose values change from frame to frame, including the frame length of one AAU (frame_length) and the state of the decoder buffer (adts_buffer_fullness). adts_error_check carries the CRC (Cyclic Redundancy Check) error check code.
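The header fields described above can be read with plain bit masking. The following is a minimal sketch, assuming the usual 7-byte ADTS header layout (12-bit syncword, 13-bit frame_length, 11-bit adts_buffer_fullness); the function name `parse_adts_header` and the returned dictionary are illustrative, not part of the apparatus described here.

```python
def parse_adts_header(buf: bytes, pos: int = 0) -> dict:
    """Read selected ADTS header fields from the frame starting at buf[pos]."""
    hdr = buf[pos:pos + 7]
    if len(hdr) < 7:
        raise ValueError("need at least 7 header bytes")
    # syncword: 12 bits, all ones, at the very start of adts_fixed_header
    if hdr[0] != 0xFF or (hdr[1] & 0xF0) != 0xF0:
        raise ValueError("syncword not found at this position")
    return {
        # 1 means no adts_error_check (CRC) follows the header
        "protection_absent": hdr[1] & 0x01,
        "sampling_frequency_index": (hdr[2] >> 2) & 0x0F,
        "channel_configuration": ((hdr[2] & 0x01) << 2) | (hdr[3] >> 6),
        # frame_length: 13 bits spread over bytes 3..5, counted in bytes
        "frame_length": ((hdr[3] & 0x03) << 11) | (hdr[4] << 3) | (hdr[5] >> 5),
        # adts_buffer_fullness: 11 bits spread over bytes 5..6
        "adts_buffer_fullness": ((hdr[5] & 0x1F) << 6) | (hdr[6] >> 2),
    }
```

Because every field here is fixed-length, no Huffman decoding is needed to reach frame_length, which is what makes the skip operation cheap.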

raw_data_block is made up of units called elements. These include the CPE (Channel Pair Element) for the L/R channels, FILL (Fill Element) for inserting stuffing bytes, and END (Term Element), which marks the end of one AAU. FILL may be absent.

The CPE contains information indicating whether the L and R channels share a window function (common_window) and per-channel information (individual_channel_stream). The individual_channel_stream contains information on window sequence processing (window_sequence), band limitation (max_sfb), the quantization step (global_gain), scaling parameters (scale_factor_data), and the quantized data (spectral_data).

The scaling parameters (scale_factor_data) and the quantized data (spectral_data) are variable-length encoded with Huffman codes, so, unlike fixed-length fields, they must be extracted in order from the start of the data. For example, if the data is encoded in the order L-channel scale_factor_data, L-channel spectral_data, R-channel scale_factor_data, R-channel spectral_data, then obtaining the scale_factor_data of both channels requires decoding the L-channel scale_factor_data, then decoding the L-channel spectral_data just to find where the R-channel scale_factor_data begins, and only then decoding the R-channel scale_factor_data.

The auxiliary information extraction unit 13 can therefore extract auxiliary information including frame_length and the L-channel global_gain while leaving the R-channel global_gain unread. It can also additionally extract the L-channel scale_factor_data while leaving the R-channel scale_factor_data unread. The data skip unit 14 then skips an amount of data equal to frame_length converted to bits minus the amount of data already read (for example, the number of bits from the syncword to the L-channel global_gain), which moves the read position to the head of the next adts_frame. In this way, decoding of variable-length encoded data can be omitted entirely or kept to a minimum.
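The skip arithmetic and the resulting frame-to-frame walk can be sketched as follows. This is an illustration only: it assumes ADTS framing with frame_length counted in bytes, and the names `bits_to_skip` and `iter_frame_lengths` are hypothetical. The loop mirrors the roles of units 12, 13, and 14 but reads only the header.

```python
def bits_to_skip(frame_length_bytes: int, bits_read: int) -> int:
    """Bits remaining from the current read position to the next frame head."""
    total_bits = frame_length_bytes * 8  # frame_length converted to bits
    if bits_read > total_bits:
        raise ValueError("read past the end of the frame")
    return total_bits - bits_read


def iter_frame_lengths(stream: bytes):
    """Yield (byte offset, frame_length) for each ADTS frame in the stream."""
    pos = 0
    while pos + 7 <= len(stream):
        # Sync detection: look for the 12-bit syncword (all ones).
        if stream[pos] != 0xFF or (stream[pos + 1] & 0xF0) != 0xF0:
            pos += 1
            continue
        # Extraction: frame_length is a fixed-length 13-bit header field.
        frame_length = (((stream[pos + 3] & 0x03) << 11)
                        | (stream[pos + 4] << 3)
                        | (stream[pos + 5] >> 5))
        if frame_length < 7:
            pos += 1  # shorter than the header itself: treat as a false sync
            continue
        yield pos, frame_length
        # Data skip: jump straight to the head of the next frame.
        pos += frame_length
```

For example, after reading 80 bits of a 371-byte frame, `bits_to_skip(371, 80)` gives the 2888 bits that can be passed over without any Huffman decoding.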

The element layout of raw_data_block in the above format depends on the type of audio signal. Writing each element in angle brackets, the elements in an audio encoded stream for stereo audio, as processed by the audio signal processing apparatus according to the embodiment, can be written as follows.

Stereo audio: <CPE1><FILL><TERM> (where CPE1 = L/R: L/R channels)

For monaural audio and multi-channel audio (for example, 5.1 ch), assuming no FILL is present, the element layout can be written as follows, using SCE (Single Channel Element) for a single channel and LFE (Low Frequency Enhancement Channel Element) for the low-frequency enhancement channel.

Monaural audio: <SCE1><TERM>
(where SCE1 = C: center channel)
Multi-channel audio: <SCE1><CPE1><CPE2><LFE><TERM>
(where SCE1 = C: center channel, CPE1 = L/R: L/R channels, CPE2 = Ls/Rs: surround L/R channels, LFE = low-frequency enhancement channel)

When the auxiliary information is taken from the first element that appears (<SCE1> in the monaural and multi-channel layouts above), the present invention can handle not only stereo audio but also monaural and multi-channel audio.
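Identifying the first element can be sketched as below. The sketch assumes an ADTS frame holding a single raw_data_block, a 7-byte header (9 bytes when a CRC is present), and the 3-bit element ID codes of the AAC syntax; the table `ELEMENT_NAMES` and the function `first_element_name` are illustrative names, and the ID codes come from the AAC bitstream syntax rather than from the description above.

```python
# 3-bit element ID codes as defined by the AAC bitstream syntax (assumption:
# these are not listed in the text above).
ELEMENT_NAMES = {0: "SCE", 1: "CPE", 2: "CCE", 3: "LFE",
                 4: "DSE", 5: "PCE", 6: "FIL", 7: "END"}


def first_element_name(frame: bytes) -> str:
    """Return the name of the first element in the frame's raw_data_block."""
    protection_absent = frame[1] & 0x01
    # adts_error_check (16-bit CRC) follows the header when protection is on.
    header_bytes = 7 if protection_absent else 9
    element_id = frame[header_bytes] >> 5  # top 3 bits of the first payload byte
    return ELEMENT_NAMES[element_id]
```

A stereo frame would be expected to report "CPE" here, and a monaural or 5.1 ch frame "SCE".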

The quantization and inverse quantization formulas used in the MPEG-2 AAC encoding method are expressed as equations (1) and (2) below, respectively.

In quantization formula (1) and inverse quantization formula (2), mdct_line(k) denotes the frequency spectrum. global_gain is the quantization step information mentioned above; it is used as a parameter that changes the quantization step of the entire frequency spectrum so that the number of bits used fits within a given bit budget. scalefactor(sfb) is the decoded form of scale_factor_data; it is used as a parameter that scales the frequency signal so that the quantization error arising from quantization and inverse quantization stays within the range that is acceptable from a psychoacoustic standpoint. For this reason, the scalefactor(sfb) information extracted on its own cannot be said to be approximately proportional to the amplitude of the frequency signal.

The feature quantity calculation unit 15 can, for example, use the auxiliary information to calculate an approximate number of bits, obtained by taking the logarithm of each frequency spectrum value produced by frequency conversion and summing the results, and output it as a feature quantity for detecting volume changes in the decoded audio signal. An example of the calculation in this case is shown below.

First, rearranging quantization formula (1) above gives equation (3) below.

Working out how many bits are needed to represent each quantized value without Huffman coding gives equation (4) below.

When the input audio signal is, for example, 16-bit PCM (Pulse Code Modulation) data, num_bit(k) in equation (4) is computed from an absolute value, so it can range up to 15 bits, that is, 16 bits minus the sign bit. Furthermore, if the quantized values are to be laid out bit by bit, as in an audio encoded stream, the original values are hard to recover unless auxiliary information is supplied stating how many bits each quantized value occupies. Four bits are therefore used as this auxiliary information, enough to express a value from 0 to 15. With this correction, num_bit(k) becomes equation (5) below.

Summing the num_bit(k) of equation (5) over the whole frequency spectrum (that is, over 1024 spectral lines) gives total_num_bit, as in equation (6) below.

If this total_num_bit is taken to be the number of bits used (used_bit), equating used_bit and total_num_bit gives equation (7) below.

Rearranging equation (7) then yields Σ(log2(x)), as shown in equation (8) below.

In equation (8), len is the number of spectral lines in each frequency band (sfb), and the last term is the total scalefactor contribution over all frequency bands. The relationship max_sfb × len = 1024 holds. Equation (8) applies when the encoded bitstream has been decoded as far as scale_factor_data and the extracted auxiliary information is used. When decoding of scale_factor_data is omitted, all scalefactor(sfb) values are taken to be zero and the result can be calculated with equation (9) below.

The Σ(log2(x)) given by equations (8) and (9) is an approximate number of bits. When the volume level of the audio signal is high, the amplitude level of the frequency spectrum is also high, so the approximate number of bits is large; when the volume level is low, the approximate number of bits is small. The approximate number of bits thus tracks the amplitude level (volume) of the audio signal and can be used as a feature quantity for detecting volume changes.

Because MPEG-2 AAC uses an encoder buffer (decoder buffer), the length of one AAU is not constant even with fixed-rate encoding; it may vary as long as the encoder buffer neither overflows nor underflows. When the audio signal is silent or nearly silent, the frame carries little information and frame_length is set to a small value. A frame whose frame_length indicates a short length can therefore be regarded as containing silence or near-silence, and frame_length can be used as a feature quantity for detecting silent portions of the audio signal.
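One simple way to use frame_length this way is to compare each frame against the average. The following is a sketch under assumptions: the threshold rule (a fixed fraction of the mean frame length) and the names `flag_quiet_frames` and `ratio` are illustrative choices, not something the description specifies.

```python
def flag_quiet_frames(frame_lengths, ratio=0.5):
    """Mark frames whose frame_length is well below the average.

    frame_lengths: frame_length values (in bytes) of consecutive frames.
    ratio: assumed threshold as a fraction of the mean frame length.
    """
    if not frame_lengths:
        return []
    mean_length = sum(frame_lengths) / len(frame_lengths)
    threshold = ratio * mean_length
    # A short frame carries little information: candidate silence.
    return [length < threshold for length in frame_lengths]
```

In practice the threshold would need tuning per bit rate, and a run of several consecutive short frames is likely a more reliable cue than a single one.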

With fixed-rate encoding, many stuffing bytes are spent on bit rate adjustment during silent portions of the audio signal. If there is enough processing capacity to decode all the variable-length codes in the encoded stream, the number of bytes inserted as stuffing in FILL can therefore serve as a useful feature quantity for detecting silent portions of the audio signal.

Among the other auxiliary information, max_sfb can be used as a feature quantity representing the frequency bandwidth of the audio signal. This is explained below with reference to the relationship between max_sfb and frequency bandwidth shown in FIG. 3.

FIG. 3 shows an example of a graph illustrating that, in the audio signal processing apparatus according to the embodiment of the present invention, max_sfb can be used as a feature quantity representing the frequency bandwidth of the audio signal. The vertical axis of the graph represents the amplitude level [dB], and the horizontal axis represents the frequency [Hz].

As shown in FIG. 3, some encoders vary the frequency bandwidth according to the configured bit rate. At a low bit rate (for example, 128 kbps) the available bits tend to be insufficient, so the band limitation is tightened (corresponding to max_sfb = 42) to prevent harsh quantization noise. At a high bit rate (for example, 256 kbps) the available bits are sufficient, so the band limitation is relaxed (corresponding to max_sfb = 48) and the signal is encoded as faithfully as possible. Because max_sfb is a value indicating the band limitation, the frequency bandwidth can be derived back from max_sfb. Assuming that the frequency bandwidth and the bit rate are correlated, the configured bit rate can also be estimated from max_sfb or from the frequency bandwidth, and the transition of the decoder buffer can be calculated from adts_buffer_fullness and frame_length. The instantaneous bit rate, for example, can easily be calculated from the value of frame_length by the following equation (10).
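Equation (10) itself is not reproduced in this text. As a rough sketch of the kind of calculation involved, the instantaneous bit rate of a frame can be estimated from frame_length under the assumption that each frame carries 1024 samples per channel, as is usual for AAC with one raw data block per frame. The function name and the constant are illustrative, and the formula is this sketch's assumption rather than a confirmed transcription of equation (10).

```python
SAMPLES_PER_FRAME = 1024  # assumed: one AAC raw data block per frame


def instantaneous_bitrate(frame_length: int, sampling_frequency: int) -> float:
    """Estimate the bit rate (bits per second) implied by a single frame.

    frame_length is the frame size in bytes, so it is multiplied by 8 to get
    bits, then divided by the frame duration (1024 / sampling_frequency).
    """
    return frame_length * 8 * sampling_frequency / SAMPLES_PER_FRAME
```

For instance, under these assumptions a 512-byte frame at a 48000 Hz sampling frequency corresponds to 192000 bits per second.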

In addition, window_sequence, which is information indicating the window function sequence, can be used as a feature quantity for identifying the point at which a stationary sound changes to a non-stationary sound. FIG. 4 illustrates the relationship between the window functions and the values of window_sequence that can be used as such a feature quantity in the audio signal processing apparatus according to the embodiment of the present invention. FIG. 5 is a state transition diagram showing an example of the state transitions of the window function used in the audio signal processing apparatus according to the embodiment of the present invention.

In FIG. 4, the long window (long (value = 0)) is the window function used in the steady state; the start window (start (value = 1)) is used at the onset of a sound; the short window (short (value = 2)) is used in the transitional state from the onset of a sound until it settles; and the stop window (stop (value = 3)) is used when the sound settles. For example, when the signal goes from silence through a sudden onset and back to silence, as when castanets are struck, the window function state in the state transition diagram of FIG. 5 changes in the order long(0) → start(1) → short(2) → stop(3) → long(0). Monitoring the value of window_sequence therefore makes it possible to grasp the tendency of the audio signal.
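Monitoring window_sequence in this way can be sketched as a scan over the per-frame values. The example below is only an illustration: it assumes the window_sequence values have already been extracted frame by frame, and it treats a change from the long or stop window to the start window as the onset of a non-stationary sound. The function name is hypothetical.

```python
# window_sequence values as listed for FIG. 4
LONG, START, SHORT, STOP = 0, 1, 2, 3


def find_onsets(window_sequences):
    """Return the frame indices at which a start window follows a long or stop window."""
    onsets = []
    prev = LONG  # assume the stream begins in the steady state
    for index, value in enumerate(window_sequences):
        if prev in (LONG, STOP) and value == START:
            onsets.append(index)
        prev = value
    return onsets


# Two castanet-like hits: long -> start -> short -> stop -> long, twice.
frames = [0, 0, 1, 2, 2, 3, 0, 0, 1, 2, 3, 0]
print(find_onsets(frames))
```

Each reported index marks a frame where the encoder switched away from the steady-state window, which is the point the text describes as a change from stationary to non-stationary sound.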

common_window is information indicating that the L and R channels share a common window function. When content close to monaural audio, such as a news program, is encoded, the information in common_window is used to reduce the number of bits. Therefore, by detecting changes in common_window, it is possible to determine whether the content contains an audio signal close to stereo or one close to monaural, so common_window can be used as a feature quantity for distinguishing stereo audio from monaural audio.
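One simple way to turn common_window into a feature quantity is to measure how often it is set over a run of frames. The sketch below assumes the common_window flag has already been read for each frame. The function names and the 0.9 threshold are hypothetical and are not taken from the text; a real classifier would need tuning against actual broadcast material.

```python
def common_window_ratio(flags):
    """Return the fraction of frames in which common_window is set."""
    flags = list(flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


def looks_monaural(flags, threshold=0.9):
    """Guess that a segment is close to monaural if most frames share a window.

    The threshold is an assumed value chosen only for illustration.
    """
    return common_window_ratio(flags) >= threshold
```

A segment in which 19 of 20 frames have common_window set would be classified as close to monaural with this threshold, while a segment with the flag set in only a quarter of the frames would not.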

Although the above embodiment has been described on the premise of the MPEG-2 AAC encoding scheme, the present invention can also be applied to MPEG-2 AAC SBR (Spectral Band Replication), MPEG-4 AAC, MPEG-4 AAC SBR, and MPEG-1 Layer 3. Each functional block of the audio signal processing apparatus shown in FIG. 1 can be implemented in hardware and/or software. The functions of the audio signal processing apparatus of the above embodiment may also be implemented on a computer by a program. The program may be read from a recording medium and loaded into the computer, or it may be transmitted over a communication network and loaded into the computer.

The present invention extracts auxiliary information from an audio encoded stream while keeping the decoding of the stream's variable-length-coded data to a minimum (or without decoding the variable-length-coded data at all), and calculates the feature quantity of the audio signal contained in the stream from the extracted auxiliary information. It therefore has the effect of enabling high-speed processing, and it is applicable to techniques for calculating the feature quantity of the audio signal of an audio encoded stream from parameters (auxiliary information) contained in that stream.

FIG. 1 is a block diagram showing an example of the configuration of the audio signal processing apparatus according to the embodiment of the present invention.
FIG. 2 is a diagram showing an example of the structure of the AAC encoded stream used in the audio signal processing apparatus according to the embodiment of the present invention.
FIG. 3 is a diagram showing an example of a graph illustrating that max_sfb can be used as a feature quantity representing the frequency bandwidth of the audio signal in the audio signal processing apparatus according to the embodiment of the present invention.
FIG. 4 is a diagram illustrating the relationship between the window functions and the values of window_sequence, which can be used as a feature quantity for identifying the point at which a stationary sound changes to a non-stationary sound, in the audio signal processing apparatus according to the embodiment of the present invention.
FIG. 5 is a state transition diagram showing an example of the state transitions of the window function used in the audio signal processing apparatus according to the embodiment of the present invention.

Explanation of symbols

11 Buffer memory
12 Synchronization signal detection section
13 Auxiliary information extraction section
14 Data skip section
15 Feature quantity calculation section

Claims

An audio signal processing apparatus comprising:
audio encoded stream storage means for storing an audio encoded stream composed of a sequence of frames, each frame having a header portion, in which frame length information indicating the length of the frame and a plurality of other items of auxiliary information are arranged in order following a synchronization signal indicating the head of the frame, and an audio stream portion containing the encoded audio signal;
synchronization signal detection means for identifying the head of a frame by detecting the synchronization signal of the frame in the audio encoded stream stored in the audio encoded stream storage means;
auxiliary information reading means for sequentially reading, in the audio encoded stream stored in the audio encoded stream storage means, the auxiliary information contained in the header portion of the frame, starting from the head of the frame identified by the synchronization signal detection means;
data skip means for, immediately after the auxiliary information set as desired auxiliary information has been read by the auxiliary information reading means, skipping the data up to the head of the next frame, based on the amount of data from the synchronization signal of the frame to the desired auxiliary information and on the amount of data contained in the frame as obtained from the frame length information, thereby controlling the timing at which the synchronization signal detection means starts its synchronization signal detection operation; and
feature quantity calculating means for calculating a feature quantity of the audio signal using the desired auxiliary information.

The audio signal processing apparatus according to claim 1, wherein quantization step information used when the audio encoded stream was encoded is set as the desired auxiliary information to be read from the audio encoded stream.

The audio signal processing apparatus according to claim 2, wherein the quantization step information includes scale factor information indicating the amount of scaling applied when the audio encoded stream was encoded, the scale factor information is set as the desired auxiliary information, and the data skip means performs control so as to skip the data from the scale factor information first read by the auxiliary information reading means to the head of the next frame.

The audio signal processing apparatus according to claim 1, wherein the feature quantity calculating means calculates, using the auxiliary information read by the auxiliary information reading means, an approximate number of bits, summed logarithmically for each frequency spectrum obtained by frequency conversion of the audio signal, as a feature quantity for detecting a change in the volume level of the audio signal.