HK1199344B - Audio fingerprint for content identification - Google Patents

Audio fingerprint for content identification

Info

Publication number
HK1199344B
Authority
HK
Hong Kong
Prior art keywords
content
audio signal
audio
particular segment
television
Prior art date
Application number
HK14112822.3A
Other languages
Chinese (zh)
Other versions
HK1199344A1 (en)
Inventor
马尔科姆‧斯莱尼
安德瑞斯‧赫尔南德斯‧沙夫霍瑟
Original Assignee
Verizon Patent And Licensing Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US13/332,331 external-priority patent/US8949872B2/en
Application filed by Verizon Patent And Licensing Inc. filed Critical Verizon Patent And Licensing Inc.
Publication of HK1199344A1 publication Critical patent/HK1199344A1/en
Publication of HK1199344B publication Critical patent/HK1199344B/en

Abstract

A method and system for identifying multimedia content streaming through a television include retrieving an audio signal from multimedia content selected for rendering at the television. The retrieved audio signal is partitioned into a plurality of segments of small intervals. A particular segment is analyzed to identify acoustic modulation and to generate a distinct vector for the particular segment based on the acoustic modulation, wherein the vector defines a unique fingerprint of the particular segment of the audio signal. A content database on a server is queried using the vector of the particular segment to obtain content information for multimedia content that matches the fingerprint of the particular segment. The content information is used to identify the multimedia content, and the source of the multimedia content, that matches the audio signal received for rendering.

Description

Audio fingerprinting for content identification
Technical Field
The present invention relates to audio fingerprinting, and more particularly to audio fingerprinting for connected televisions.
Background
Television viewing has changed over the years. Advances in technology have allowed television manufacturers to integrate internet and web features into television sets to provide the ability to connect and access online interactive media, internet TV, OTT content, and on-demand streaming media through these television sets. In addition to televisions, some external devices, such as set-top boxes, blu-ray players, game controllers, and other cooperating devices, are equipped with these internet and web features to enable traditional televisions without these integrated features to access the internet and web features through these external devices. With these internet-enabled televisions, viewers can search for and find videos, movies, photos, and other content available on the web, available locally, or provided directly by content providers, such as cable content providers, satellite content providers, other users, and so forth. The internet features incorporated into the TV and external devices also provide integration with social networking sites, allowing viewers to socially interact while performing traditional TV viewing.
Internet-enabled televisions have numerous applications that allow users to search for and select content for viewing. However, the identity of the content to be viewed and/or the source of the content may not be available at the television. It would be advantageous if content selected for viewing could be identified by fingerprinting, so that additional information and promotional content (including content-related events) could be presented to the viewer. In the current information age, showing additional information about the content can increase user engagement and satisfaction.
It is in this context that embodiments of the present invention arise.
Disclosure of Invention
Embodiments of the present invention describe methods and systems on a television that allow identification of multimedia content selected for viewing. An algorithm executed by a processor of an internet-enabled television or external device retrieves an audio signal from multimedia content selected for presentation at the television device, performs fingerprinting of a portion of the audio signal by examining modulation characteristics of the audio signal, and uses the fingerprint to identify information related to the content from a content provider. The content information may be used to identify additional information or promotional media related to the content or to generate events to be presented alongside the content.
Embodiments provide a way of determining the source of multimedia content, such as video content, using an audio signal. Since most protected content is recognizable from its audio alone, analyzing the images of multimedia content is less important than analyzing the broadcast speech and music. The current embodiments identify the entire content from a small piece of the audio signal by extracting an audio portion of the multimedia content selected for presentation, fingerprinting the audio portion, and matching the fingerprint to a corresponding audio portion of multimedia content available in a database. The present embodiments provide an efficient algorithm that focuses on the modulation characteristics of a portion of an audio signal to match multimedia content obtained from multiple content providers. The algorithm can also verify that the audio signal directed to the television continues to belong to the same content, by storing information related to the content in a local cache and periodically verifying the audio signal against it. For each periodic verification, a new fingerprint of the streaming audio signal is generated and compared to the content information in the local cache to determine whether the signal still matches, or has deviated from, the cached content. If there is a deviation, the algorithm searches the database server for content matching the new fingerprint, and the matching cycle continues. If there is no deviation, there is no need to query the database server, which saves resources and speeds up matching while keeping the match efficient and accurate.
It should be appreciated that the present invention can be implemented in numerous ways, such as a method and a system. Several inventive embodiments of the present invention are described below.
In one embodiment, a method for identifying multimedia content streaming through a television is disclosed. The method includes retrieving an audio signal from multimedia content selected for presentation at a television. The retrieved audio signal is divided into a plurality of segments of small intervals. A particular segment is analyzed to identify an acoustic modulation, and a discriminative vector for the particular segment is generated based on the acoustic modulation. The vector defines a unique fingerprint of the particular segment of the audio signal. The content database on the server is queried using the vector for the particular segment to obtain content information for the multimedia content that matches the fingerprint of the particular segment. The content information is used to identify the multimedia content, and the source of the multimedia content, that matches the audio signal received for presentation.
In another embodiment, a method for identifying content flowing through a television is disclosed. The method includes retrieving an audio signal from content selected for presentation at a television. The audio signal is divided into a plurality of sections at smaller intervals. A particular segment of the audio signal is analyzed to identify an acoustic modulation to generate a vector for the particular segment based on the acoustic modulation. The vector identifies a plurality of floating point numbers related to the data points of the particular segment and defines a unique audio fingerprint of the particular segment of the audio signal. The content database is searched to identify one or more contents of the audio segment with data points having a plurality of floating point numbers closest to the particular segment. The content database is a repository of pre-computed data points for a plurality of audio segments representing different portions of a plurality of audio signals of a plurality of content obtained from a plurality of content providers. Content with an audio segment having a data point closest to the floating point number of the particular segment is identified. The content provider database is queried using the content identifier with the content of the audio segment matching the particular segment. In response to the query, a portion of the content is received from a content provider database. The portion of the content includes a content record matching the particular segment and an additional record for a predetermined amount of time. A portion of the content received from the content provider database is used for subsequent matching of the audio signal flowing through the television.
In another embodiment, a method for matching promotional media to content streaming through a television is disclosed. The method includes retrieving an audio signal from content selected for presentation at a television. The audio signal is divided into a plurality of segments of small intervals. A particular segment of the audio signal is analyzed to identify a modulation characteristic and to generate a vector of a plurality of floating point numbers related to data points associated with the audio segment. The vector defines a unique fingerprint of the audio segment. The content database is searched to identify the content having an audio segment whose data points are closest to the floating point numbers of the particular segment of the audio signal. The content database is a repository of pre-computed data points for a plurality of audio segments representing different portions of a plurality of audio signals associated with a plurality of content obtained from a plurality of content providers. The promotional media associated with the content is identified from the service database using the fingerprint of the particular segment. A portion of the content is received from a content provider database, and metadata and assets related to the identified promotional media are received from an ad campaign database. The multimedia content of the promotional media is assembled using the retrieved metadata and assets for presentation on the television alongside the content related to the audio signal stream.
Accordingly, embodiments of the invention provide an efficient search and match algorithm for identifying the source of content flowing through a television set by fingerprinting a portion of an audio signal extracted from the content using acoustic modulation and matching the fingerprint to content stored in a content database. The matching algorithm uses optimal system resources while providing efficient matching. The algorithm continues to verify the validity of the match by periodically fingerprinting and matching. The algorithm uses the results of the periodic matching to identify and update events or additional information presented alongside the content. The additional information is related to the content currently flowing through the television and is provided alongside the content in a seamless manner, thereby enhancing the user's television viewing experience. The satisfaction of the user experience can be leveraged to increase monetization by targeting appropriate promotional media to the user.
Other aspects of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.
Drawings
The invention may best be understood by referring to the following description taken in conjunction with the accompanying drawings.
Fig. 1 shows a simplified overview of a system equipped with an algorithm, including various modules within the algorithm for identifying the source and content of multimedia content flowing through a television, in one embodiment of the invention.
Figs. 2a-2f show diagrams comparing the modulation characteristics of sampled audio segments of an audio signal, as produced by C and Matlab implementations of the algorithm, in an embodiment of the invention.
FIG. 3 illustrates a graphical representation of a Locality Sensitive Hashing (LSH) technique used to match particular segments to corresponding segments of content in one embodiment of the invention.
FIG. 4 shows a schematic modulation flow diagram used in one embodiment to generate a discriminative vector by analyzing modulation characteristics of an audio segment.
FIG. 5 shows an illustrative audio fingerprint flow diagram for an algorithm to follow to generate a fingerprint of an audio segment in one embodiment of the invention.
Fig. 6 shows a flow diagram of the processing flow operations used by an algorithm to identify multimedia content flowing through a television in one embodiment of the invention.
Fig. 7 shows a flow diagram of various process flow operations used by an algorithm to identify multimedia content flowing through a television in an alternative embodiment of the invention.
Fig. 8 shows a flow diagram of process flow operations for matching promotional media to content flowing through a television in an alternative embodiment of the invention.
Detailed Description
Broadly stated, embodiments of the present invention provide a method and system for identifying multimedia content flowing through a television. An algorithm executing on a processor of an internet-enabled television or an internet-enabled external device connected to the television selects an audio segment from the content selected for presentation, generates an audio fingerprint, and uses the audio fingerprint to identify a source of multimedia content and multimedia content information. The algorithm utilizes the acoustic modulation characteristics of the audio segments to perform matching and ensures correct matching by periodic verification while using network resources in an optimal and efficient manner. The algorithm employs a local cache available to the algorithm to store matching content and perform periodic validation to ensure that the identified content continues to be related to streaming content at the television. The algorithm also uses the multimedia content information to identify additional information (e.g., promotional media and/or events related to the content) to present alongside the content.
Having briefly summarized the invention, various embodiments will now be described in detail with reference to the drawings. Fig. 1 shows a simplified overview of a system, identifying high-level software/hardware modules for identifying multimedia content streaming to a television. The system includes a rendering device (e.g., a television) to request and receive content from a content provider. In one embodiment, the television includes an internet connection interface integrated into the television. In another embodiment, the television is connected to an external device, such as a set-top box, with an integrated internet-enabled interface. The internet connection/enabling interface may include, for example, the internet protocol suite for receiving television services over the internet instead of delivery over traditional modes such as satellite signals or cable television formats. Television services may include live television, time-shifted television, and video-on-demand (VOD) content. Typically, in internet-enabled televisions, the content remains on the content provider's web server and the requested program is streamed to the television. As a result, the internet connection interface in the television is unaware of the source of the requested content and of the information related to the content. The television is also equipped with a hardware audio capture system (HAC) configured to interact with the internet-enabled/connected interface and to extract a portion of an audio signal from content selected for streaming to the television from a content provider's web server, wherein the content selected for streaming is responsive to a viewer's request and can be any of live television, time-shifted television, and VOD content. The HAC interacts with algorithms (e.g., audio processing algorithms) available at the television to pass on the audio signals captured from the internet connection interface for further processing.
The algorithm receives a portion of an audio signal and divides it into a plurality of segments of shorter duration. In one embodiment, the portion of the audio signal received by the algorithm may be divided into 5-second segments. The algorithm then selects a particular segment for analysis. In one embodiment, the algorithm may select the particular segment for analysis based on the payload data of the content included therein. The algorithm then analyzes the particular audio segment to determine the acoustic modulation of the audio signal and generates a discriminative vector of floating point numbers. The vector defines an audio fingerprint of the audio signal based on the modulation characteristics of the particular segment. The process of generating a discriminative vector that defines an audio fingerprint is described further below with reference to FIG. 1. In one embodiment, the algorithm uses the generated vector to query a content database, available on a local server associated with the television, for data that matches the fingerprint. The process of matching fingerprints to content in the content database is described in detail below with reference to other figures. After a match is found, the algorithm obtains content information, including the source of the multimedia content, from the content database. The algorithm may use the content information to retrieve the content record covering the time of the particular segment, together with additional records for a predetermined amount of time, and store them in a local cache. The information in the local cache may be used by the algorithm to further verify the content flowing through the television.
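As an illustration only, the segmentation step described above can be sketched in Python. The function name, the parameters, and the choice to drop a trailing partial segment are assumptions made for this sketch; the description states only that the captured audio may be divided into 5-second segments.

```python
from typing import List, Sequence


def split_into_segments(samples: Sequence[float], sample_rate: int,
                        segment_seconds: float = 5.0) -> List[List[float]]:
    """Split a captured audio buffer into consecutive fixed-length segments.

    Trailing samples that do not fill a whole segment are dropped. This is
    an assumption of the sketch; the description does not say how a partial
    segment is handled.
    """
    if sample_rate <= 0 or segment_seconds <= 0:
        raise ValueError("sample_rate and segment_seconds must be positive")
    seg_len = int(round(sample_rate * segment_seconds))
    n_full = len(samples) // seg_len
    return [list(samples[i * seg_len:(i + 1) * seg_len]) for i in range(n_full)]
```

Any one of the returned segments could then be handed to the fingerprinting stage; which segment is chosen is left to the caller.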
In another embodiment, the local cache may be pre-filled with content and corresponding fingerprints, and the algorithm may use the information in the local cache to find a match for a segment of the audio signal. In this embodiment, the back-end server dynamically collects content-related information and corresponding fingerprint information based on factors such as which programs users of the television equipment typically view, which programs are viewed more frequently, and which programs are popular with users in a particular geographic area (using the user's zip code). When the user selects content to view on the television, the algorithm at the television requests the server to download the cache. In response to the request from the algorithm, the server pushes different subsets of content and the corresponding fingerprints onto the local cache of the television. The algorithm then uses the information in the local cache to identify the content selected by the user. The information in the local cache can be used until it expires. When the information expires, the algorithm sends an update request for the content and the fingerprints associated with the content to the back-end server, and the back-end server forwards the appropriate content and fingerprint information to reload the local cache.
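A pre-filled cache with an expiry time might look roughly like the following Python sketch. The class name, the time-to-live parameter, the injectable clock, and the distance threshold used for lookups are all hypothetical; the description says only that cached information is used until it expires and is then refreshed from the back-end server.

```python
import math
import time
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


class PrefilledCache:
    """Local store of (fingerprint, content_id) pairs pushed by a server."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised in tests
        self._entries: List[Tuple[List[float], str]] = []
        self._loaded_at: Optional[float] = None

    def load(self, entries: Iterable[Tuple[Sequence[float], str]]) -> None:
        """Replace the cache contents with a fresh push from the server."""
        self._entries = [(list(fp), cid) for fp, cid in entries]
        self._loaded_at = self._clock()

    def needs_refresh(self) -> bool:
        """True when the cache was never loaded or its contents have expired."""
        return (self._loaded_at is None
                or self._clock() - self._loaded_at >= self._ttl)

    def lookup(self, fingerprint: Sequence[float],
               max_distance: float) -> Optional[str]:
        """Return the closest cached content ID, or None if nothing is close."""
        if self.needs_refresh():
            return None
        best = min(self._entries,
                   key=lambda e: math.dist(fingerprint, e[0]), default=None)
        if best is None or math.dist(fingerprint, best[0]) > max_distance:
            return None
        return best[1]
```

A caller would check `needs_refresh()` before each lookup and, when it returns true, request a new push from the server and pass the result to `load()`.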
In one embodiment, the algorithm performs fingerprint matching by querying one or more databases available on one or more network servers. For example, the algorithm may first generate a fingerprint of a selected section of the audio signal and query a content database on a network server for a match of the fingerprints. The content database may be a repository of fingerprints for a plurality of portions of a plurality of audio signals obtained from a plurality of content providers. In one embodiment, content information from multiple content providers may be obtained in advance and stored in a content database on a server locally available to the algorithm, enabling the content to be easily identified regardless of the location and time at which it was broadcast. The audio portion of the content in the content database may be fingerprinted and these fingerprints may be stored alongside the content or in a separate database on a server equipped with search software and used for matching of content currently selected for viewing at the television. Search software on the server assists in searching the database and finding a match for the content. Using this information, an algorithm executing on the processor of the television then queries a second server (e.g., an event server or Business Information Service (BIS) server) to determine if there are any BIS service(s), ad campaigns, or events for this audio scheduled for a particular date when the selected content was streamed. If a service, event or ad campaign is found for the time period, the algorithm grabs the metadata and assets of the service/event/ad campaign from the ad campaign database to create an application or video of the service/ad campaign. The application or video is presented alongside the content streamed into the television and provides additional information or promotional media related to the content. 
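The second query, which checks whether a service, event or ad campaign is scheduled for the identified content on a given date, could be sketched as below. The `Campaign` record and its fields are hypothetical stand-ins for whatever the event or BIS server actually stores.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List


@dataclass(frozen=True)
class Campaign:
    """Hypothetical campaign record; field names are illustrative only."""
    name: str
    content_id: str
    start: date
    end: date  # inclusive


def active_campaigns(campaigns: Iterable[Campaign], content_id: str,
                     on_day: date) -> List[Campaign]:
    """Return the campaigns scheduled for this content on the given day."""
    return [c for c in campaigns
            if c.content_id == content_id and c.start <= on_day <= c.end]
```

If the returned list is non-empty, the metadata and assets for those campaigns would then be fetched and assembled for presentation alongside the content.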
The viewer viewing the selected content is provided with additional information most relevant to the content being viewed, thereby enriching the viewing experience of the user. Algorithms provide the ability to extract features of a small portion of an audio signal and use it to match and describe the complete video content selected for streaming.
Feature extraction and fingerprinting will now be described in detail with reference to fig. 1. In a typical audio/video recording, the peaks and transitions of the calculated features of the media do not change much during editing, compression and transmission. Furthermore, in the speech domain, it has been determined that most of the speech information is concentrated at modulation frequencies around 4 Hz. As a result, the algorithm uses a modulation spectrogram to capture the modulation characteristics of the audio signal and uses an audio modulation fingerprinting technique to identify fingerprints of the video. The algorithm generates a spectrogram over time for a particular segment of the selected audio signal and looks for energy distributed around different frequencies. To accomplish this, band-pass filters are used to divide the audio signal within the selected segment into different bands/channels. In one embodiment, the selected audio segment is divided using 13 linearly spaced filters to obtain 13 different channels. Additional information on the use of band-pass filters to divide audio signals is described in the Auditory Toolbox, available from https://engineering.purdue.edu/~malcolm/interval/1998-010/, which is incorporated herein by reference. One or more channels may be combined to provide a wider channel for analysis.
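The division into 13 linearly spaced channels can be illustrated with a small filter bank. This sketch substitutes a second-order (biquad) band-pass filter with the widely used "Audio EQ Cookbook" coefficients for whatever filters the referenced toolbox provides; the band edges of 200 Hz and 3800 Hz and the one-step bandwidth are assumptions chosen only to make the example concrete.

```python
import math
from typing import List, Sequence


def bandpass_biquad(samples: Sequence[float], sample_rate: float,
                    center_hz: float, bandwidth_hz: float) -> List[float]:
    """Second-order band-pass filter with unity gain at the center frequency."""
    q = center_hz / bandwidth_hz
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    # Normalized coefficients; the b1 term is zero for this filter type.
    b0, b2 = alpha / a0, -alpha / a0
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    out: List[float] = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b0 * x + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def channel_centers(n_channels: int = 13, low_hz: float = 200.0,
                    high_hz: float = 3800.0) -> List[float]:
    """Linearly spaced center frequencies (band edges are assumed values)."""
    step = (high_hz - low_hz) / (n_channels - 1)
    return [low_hz + k * step for k in range(n_channels)]


def filter_bank(samples: Sequence[float], sample_rate: float,
                n_channels: int = 13) -> List[List[float]]:
    """Split one audio segment into n_channels band-limited signals."""
    centers = channel_centers(n_channels)
    bandwidth = centers[1] - centers[0]
    return [bandpass_biquad(samples, sample_rate, c, bandwidth)
            for c in centers]
```

With these assumed settings, a pure tone placed at one of the center frequencies produces most of its energy in the corresponding channel, with the adjacent channels picking up a smaller share.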
After the audio signals of the different channels are obtained, the algorithm calculates the modulation energy in each channel by taking the absolute value of the signal in each channel and then smoothing the response using a low-pass filter with a cut-off frequency of 6 Hz. The modulation energy is a rough measure of the temporal information in the channel and provides an important measure of how the audio signal varies over time. In one embodiment, the algorithm uses a Fast Fourier Transform (FFT) algorithm to analyze the modulation in each channel. The magnitude obtained from the FFT provides a measure of how much energy is in each channel at each modulation frequency. FIG. 5 shows an audio fingerprint flow diagram followed by the algorithm for generating an audio fingerprint of an audio segment extracted from content streamed to a television in one embodiment of the invention. As shown in fig. 5, fingerprints are generated by extracting an audio signal from streaming content and passing particular segments of the audio signal through a filter bank to divide the audio segments into channels at a plurality of different frequencies. The magnitude of the modulation at each modulation frequency in each channel is measured to determine the energy distribution at each frequency in each channel.
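The envelope computation for a single channel can be sketched as follows. The one-pole smoother is an assumed stand-in for the unspecified 6 Hz low-pass filter, and a single-frequency discrete Fourier sum replaces a full FFT so that the example needs only the standard library.

```python
import math
from typing import List, Sequence


def modulation_envelope(channel: Sequence[float], sample_rate: float,
                        cutoff_hz: float = 6.0) -> List[float]:
    """Rectify a channel signal and smooth it with a one-pole low-pass."""
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    envelope: List[float] = []
    state = 0.0
    for x in channel:
        state += alpha * (abs(x) - state)  # move toward the rectified sample
        envelope.append(state)
    return envelope


def modulation_magnitude(envelope: Sequence[float], sample_rate: float,
                         freq_hz: float) -> float:
    """Magnitude of the envelope's Fourier component at one frequency."""
    step = 2.0 * math.pi * freq_hz / sample_rate
    re = im = 0.0
    for n, e in enumerate(envelope):
        re += e * math.cos(step * n)
        im -= e * math.sin(step * n)
    return math.hypot(re, im) / len(envelope)
```

For a carrier whose amplitude is modulated at 4 Hz, the magnitude measured at 4 Hz stands well above the magnitudes at neighbouring modulation frequencies, which is the property the fingerprint relies on.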
Focusing only on the magnitude of the spectrum and ignoring its phase enables the algorithm to obtain the same fingerprint of the content even if the audio data is slightly shifted within the analysis window. Using the modulation spectrogram, the algorithm calculates, for each band-pass channel, 18 measurements of the modulation at frequencies from 0 Hz (DC) to about 6 Hz. The measurements are selected from a two-dimensional array indexed by channel number and modulation frequency. Thus, using the modulation spectra of 13 channels and 18 independent frequency measurements in each channel, the algorithm calculates a single discriminative vector of 234 elements (i.e., 13 × 18) for the selected segment of the audio signal. Each element in the vector is a data point represented as a floating point number. The discriminative vector concisely describes the modulation of the audio signal within the short segment and forms a fingerprint of the audio signal.
Fig. 4 shows a modulation flow diagram followed by the algorithm for generating a discriminative vector for an audio segment of an audio signal extracted from content selected for streaming at a television. The algorithm examines the acoustic modulation of a particular channel and uses an FFT to generate an acoustic spectrum for the particular channel. Selected data points (234 data points in total) from the acoustic spectra are used to compute the vector for the audio segment.
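The 13 × 18 = 234 layout, and the effect of discarding phase, can be shown with a short sketch. It assumes the per-channel envelopes are already available and takes the first 18 DFT bins of each one; how the 18 measurements are actually chosen is not specified, so that choice is a placeholder. A circular shift is used as a simple stand-in for a small time offset.

```python
import cmath
import math
from typing import List, Sequence

N_CHANNELS = 13   # band-pass channels, as in the described embodiment
N_MOD_BINS = 18   # modulation measurements kept per channel


def dft_magnitudes(envelope: Sequence[float], n_bins: int) -> List[float]:
    """Magnitudes of the first n_bins DFT bins (phase is discarded)."""
    n = len(envelope)
    mags = []
    for k in range(n_bins):
        acc = sum(envelope[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        mags.append(abs(acc) / n)
    return mags


def fingerprint(envelopes: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate per-channel modulation magnitudes into one flat vector."""
    if len(envelopes) != N_CHANNELS:
        raise ValueError(f"expected {N_CHANNELS} channel envelopes")
    vector: List[float] = []
    for env in envelopes:
        vector.extend(dft_magnitudes(env, N_MOD_BINS))
    return vector  # 13 * 18 = 234 floating point numbers
```

Because only magnitudes are kept, rotating every envelope by the same number of samples leaves the resulting vector unchanged up to floating point rounding.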
Figs. 2a-2f show audio signal spectrograms generated by the algorithm and used to match content from a content provider. Figs. 2a, 2b and 2c were generated using a Matlab implementation in a test of three modulated tones, with tone frequencies of 441 Hz, 881 Hz and 1201 Hz modulated at 2 Hz, 3 Hz and 4 Hz. When a lower-frequency modulator (e.g., 2 Hz) is used, a lower channel with a lower modulation frequency is recorded, as shown in fig. 2a (Matlab implementation). Similarly, fig. 2b shows the result for a slightly higher modulator frequency of 3 Hz, and fig. 2c shows the result for a higher modulator frequency of 4 Hz. It should be noted that the spectrograms generated using the Matlab implementation are exemplary and should not be considered limiting. Other implementations, such as a C implementation, may be used, as shown in figs. 2d, 2e and 2f. It can be seen from figs. 2a-2f that the results from the C implementation are similar to the results from the Matlab implementation at each of the 3 modulator frequencies. In addition, each frequency of sound has its own unique fingerprint, and audio signals with these different frequencies will generate their own unique fingerprint combination. The larger the fingerprint, the easier it is to match. To get a better sample, a 5-second window is selected for segmentation and fingerprinting in one embodiment. The time period, number of channels and number of frequencies used for segmenting the audio signal are exemplary and should not be considered limiting.
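Test signals of the kind described for figs. 2a-2c could be synthesized as in the sketch below. It assumes amplitude modulation, full modulation depth, and a pairing of each tone with one modulator (441 Hz with 2 Hz, 881 Hz with 3 Hz, 1201 Hz with 4 Hz); none of these details are stated explicitly in the description.

```python
import math
from typing import List, Tuple

# Assumed pairing of tone frequency and modulation frequency, in Hz.
TEST_TONES: List[Tuple[float, float]] = [(441.0, 2.0), (881.0, 3.0),
                                         (1201.0, 4.0)]


def am_tone(carrier_hz: float, modulation_hz: float, duration_s: float = 5.0,
            sample_rate: int = 8000, depth: float = 1.0) -> List[float]:
    """Sine tone whose amplitude varies at modulation_hz, scaled to [-1, 1]."""
    n = int(round(duration_s * sample_rate))
    out = []
    for i in range(n):
        t = i / sample_rate
        envelope = 1.0 + depth * math.cos(2.0 * math.pi * modulation_hz * t)
        sample = envelope * math.sin(2.0 * math.pi * carrier_hz * t)
        out.append(sample / (1.0 + depth))
    return out


def three_tone_test_signals(duration_s: float = 5.0,
                            sample_rate: int = 8000) -> List[List[float]]:
    """One amplitude-modulated signal per (tone, modulator) pair."""
    return [am_tone(c, m, duration_s, sample_rate) for c, m in TEST_TONES]
```

Feeding each signal through a filter bank and envelope analysis should then show energy concentrated in the channel nearest the tone and at the modulation frequency that was applied.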
After generating a spectrogram for a particular audio segment and generating a discriminative vector, the algorithm uses the vector to find a match for the content in the content database. The content database may be located on a server and made available to the algorithm over a network (e.g., the internet). The content database is a repository of content received from a plurality of content providers, wherein the audio signals of the content have been fingerprinted. The fingerprints of the audio signals are stored alongside the content or in a separate database, and each fingerprint is mapped to the content. The algorithm may use various techniques to find a match for the vector. In one embodiment, the algorithm uses a randomized algorithm (e.g., a Locality-Sensitive Hashing (LSH) method) to search for and find a match for the content in the content database. When new content is selected for streaming to the television, the algorithm captures the audio portion of the content and divides it into shorter segments of, for example, 5 seconds. The algorithm then performs the same analysis (already described above) to obtain the fingerprint of a particular segment of the captured audio signal and matches that fingerprint against those stored in the database using the floating point numbers of the vectors. It should be noted that even if the content of the captured audio signal is the same as that of an audio signal in the content database, the signals may not be an exact match. This may be because the audio signals in the database have undergone different compression techniques and have different time offsets compared to the audio signal associated with the particular segment being matched. Direct, conventional matching will therefore not provide the desired results. To accommodate these variations, the algorithm may use the LSH technique to find the nearest neighbor match.
Fig. 3 shows a comparison of the fingerprint of a particular audio segment with a pre-computed fingerprint from a content database using the LSH matching technique. LSH matching takes each of the 234 floating point numbers from the segment of the audio signal of the new content streaming to the television and attempts to match it with the corresponding data point of the audio signal of content in the content database. As described above, the 234 floating point numbers are obtained using a modulation spectrogram. It should be understood that generating a vector of 234 floating point numbers and matching it using the LSH technique is exemplary and should not be considered limiting; the segments of the audio signal may be matched in alternative ways. The algorithm calculates the distance between each data point of an audio segment in the content database and the corresponding floating point number of the particular segment of the audio signal. When the algorithm finds several audio signals whose data points are close to the corresponding data points of the particular audio segment, it selects the content whose audio signal has data points closest to those defined by the floating point numbers in the vector for the particular audio segment. When more than one content has an audio signal that is closest to the data points of the particular audio segment, the algorithm takes a further sample by using a subsequent audio segment of the content selected for streaming, analyzes the subsequent audio segment to define a second vector, and uses the second vector to find a match. The sampling, analysis and matching may be continued until a good match is found.
For more information on Locality-Sensitive Hashing techniques, reference may be made to the IEEE publication by Malcolm Slaney and Michael Casey entitled "Locality-Sensitive Hashing for Finding Nearest Neighbors" (IEEE Signal Processing Magazine, March 2008), which is incorporated herein by reference.
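A nearest-neighbor lookup in the spirit of LSH can be sketched as follows. The hash family used here (random Gaussian projections quantized into fixed-width buckets, with candidates re-ranked by exact Euclidean distance) is one common choice and is an assumption of this sketch; the number of tables, hashes per table, and bucket width are arbitrary illustrative values, and the content identifiers are hypothetical.

```python
import math
import random
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


class EuclideanLSH:
    """Approximate nearest-neighbor index over fixed-length float vectors."""

    def __init__(self, dim: int, n_tables: int = 8, hashes_per_table: int = 4,
                 bucket_width: float = 4.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        self._w = bucket_width
        self._tables = []
        for _ in range(n_tables):
            # Each hash is a random direction plus a random offset.
            projections = [([rng.gauss(0.0, 1.0) for _ in range(dim)],
                            rng.uniform(0.0, bucket_width))
                           for _ in range(hashes_per_table)]
            self._tables.append((projections, defaultdict(list)))
        self._vectors: Dict[str, List[float]] = {}

    def _key(self, projections, vec: Sequence[float]) -> Tuple[int, ...]:
        return tuple(
            math.floor((sum(a * x for a, x in zip(direction, vec)) + offset)
                       / self._w)
            for direction, offset in projections)

    def add(self, content_id: str, vec: Sequence[float]) -> None:
        """Store a pre-computed fingerprint under a content identifier."""
        self._vectors[content_id] = list(vec)
        for projections, buckets in self._tables:
            buckets[self._key(projections, vec)].append(content_id)

    def query(self, vec: Sequence[float]) -> Optional[Tuple[str, float]]:
        """Return (content_id, distance) of the closest candidate, if any."""
        candidates = set()
        for projections, buckets in self._tables:
            candidates.update(buckets.get(self._key(projections, vec), ()))
        if not candidates:
            return None  # no bucket hit; a caller might sample another segment
        best = min(candidates,
                   key=lambda cid: math.dist(vec, self._vectors[cid]))
        return best, math.dist(vec, self._vectors[best])
```

Because the lookup is approximate, a query can occasionally miss its true nearest neighbor; the description's strategy of fingerprinting a subsequent segment and trying again is one way to recover from that.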
The matching of content enables the algorithm to identify the source of the content and to retrieve information associated with the content selected for streaming to the television. In one embodiment, the algorithm requests and receives from the server content that includes the matching fingerprints for the period of the matched segment, together with additional upcoming fingerprints for a predetermined amount of time. The server interacts with a plurality of content providers and receives content from these sources. The additional content is used for subsequent matching to the audio signal. In one embodiment, the content and additional content are received and stored in a local cache available to the algorithm. The algorithm may ensure that the audio segment is matched to the correct content by verifying that one or more subsequent segments of the audio signal continue to match audio segments stored with content in the local cache. If a subsequent audio segment of the audio signal matches an audio segment of the cached content, then the server need not be queried to obtain the content. Instead, the content may be provided from the local cache. On the other hand, if the subsequent audio segment does not match the content stored in the local cache, new content from the content database that matches the particular audio segment is retrieved and stored in the local cache for subsequent matching.
There are multiple options to cache and distribute work using the audio fingerprint matching of the current embodiment. Some of the most important options include advance hinting, local caching, and verification. Advance hinting is a method of answering a single fingerprint request with a sequence of matched content identifiers and upcoming fingerprints. The newly received fingerprints, along with the content ID, are stored in a local cache on the TV for subsequent reference and verification. The upcoming fingerprints allow the TV, or a set-top box connected to the TV, to identify what content will come later and to simply check the newly calculated fingerprint of the content against the upcoming fingerprints stored in the local cache. If the newly computed fingerprint matches the expected upcoming fingerprint, then there is no change to the content provider source and there is no need to query the content provider for the content identifier.
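The advance hinting flow can be sketched as a small client-side cache. This is an illustrative sketch under stated assumptions: the class `HintCache`, the function `identify_with_hints`, the per-component `tolerance` threshold, and the shape of the server reply (a content ID plus a list of upcoming fingerprints) are all hypothetical and are not taken from the patent.

```python
from collections import deque


class HintCache:
    """Holds the content ID and the upcoming fingerprints sent by the server."""

    def __init__(self, tolerance=0.05):
        self.content_id = None
        self.upcoming = deque()
        self.tolerance = tolerance  # assumed per-component match threshold

    def store(self, content_id, upcoming_fingerprints):
        """Replace the cache contents with a fresh server reply."""
        self.content_id = content_id
        self.upcoming = deque(upcoming_fingerprints)

    def check(self, fingerprint):
        """Return the cached content ID if `fingerprint` is the expected
        next fingerprint, otherwise None."""
        if not self.upcoming:
            return None
        expected = self.upcoming[0]
        same = len(expected) == len(fingerprint) and all(
            abs(e - f) <= self.tolerance for e, f in zip(expected, fingerprint)
        )
        if not same:
            return None
        self.upcoming.popleft()  # this hint has been consumed
        return self.content_id


def identify_with_hints(fingerprint, cache, query_server):
    """Try the local hints first; fall back to the server and refill the cache.

    Returns (content_id, queried_server).
    """
    content_id = cache.check(fingerprint)
    if content_id is not None:
        return content_id, False  # no server round trip needed
    content_id, upcoming = query_server(fingerprint)
    cache.store(content_id, upcoming)
    return content_id, True
```

With this arrangement, one server reply covers as many subsequent segments as it carries hints for, and the server is contacted again only when the hints run out or the audio stops matching them.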
In one embodiment, a local cache option is invoked, wherein content and fingerprints that match fingerprints of the audio signal are downloaded and stored in a local cache to match incoming fingerprints of the audio signal. In another embodiment, content and a set of fingerprints related to a plurality of content are downloaded to a local device (e.g., a TV) and stored in a local cache. In this embodiment, the set of fingerprints may relate to content scheduled for a particular period of time. The client can request and receive the set of fingerprints periodically, e.g., once per day or once every three hours. In one embodiment, the client computes a fingerprint from the audio signal and performs an action on the content only if the fingerprint matches one of the known fingerprints stored in the local cache. By performing the action only when there is a match, network resources are conserved because the algorithm avoids unnecessary server access to find a match.
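The periodic download and the match-before-acting rule can be illustrated with a short sketch. The names `ScheduleCache`, `act_on_match`, `fetch_schedule` and the `tolerance` threshold are assumptions made for this example; the three-hour default mirrors one of the refresh periods mentioned above.

```python
import time


class ScheduleCache:
    """Local cache of fingerprints for scheduled content, refreshed periodically."""

    def __init__(self, fetch_schedule, refresh_seconds=3 * 60 * 60,
                 clock=time.monotonic):
        self._fetch = fetch_schedule      # callable returning {content_id: fingerprint}
        self._refresh = refresh_seconds
        self._clock = clock               # injectable so the refresh can be tested
        self._entries = {}
        self._loaded_at = None

    def _ensure_fresh(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._refresh:
            # Replacing the dictionary discards the old set before storing the new one.
            self._entries = dict(self._fetch())
            self._loaded_at = now

    def lookup(self, fingerprint, tolerance=0.05):
        """Return the content ID of the first cached fingerprint that matches."""
        self._ensure_fresh()
        for content_id, known in self._entries.items():
            if len(known) == len(fingerprint) and all(
                abs(k - f) <= tolerance for k, f in zip(known, fingerprint)
            ):
                return content_id
        return None


def act_on_match(cache, fingerprint, action):
    """Run `action(content_id)` only when the fingerprint is known locally."""
    content_id = cache.lookup(fingerprint)
    if content_id is None:
        return False  # no local match: no action and no server access
    action(content_id)
    return True
```

Passing the clock in as a parameter is a design convenience for the sketch: it lets the refresh interval be exercised without waiting three hours.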
In one embodiment, a verification option is invoked, wherein the algorithm sends a request to the server along with a content identifier based on a best guess for the content. In one embodiment, the best guess for the content may be based on previous queries. The server receiving such a request only verifies and confirms that the fingerprint received from the algorithm in the TV is indeed the expected fingerprint for the content related to the content identifier supplied in the request. This option also saves network resources because the server has been provided with enough information about the content to identify the content. Thus, the local cache, together with the fingerprint, provides a faster and more accurate match to the content selected for presentation at the TV, while conserving network resources.
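The verification option can be sketched as a check against a single content entry, with a full search as the fallback. The function names, the `index` layout (content ID mapped to that content's fingerprints) and the `tolerance` threshold are hypothetical choices for this illustration.

```python
def fingerprints_match(a, b, tolerance=0.05):
    """True when two fingerprints agree component by component."""
    return len(a) == len(b) and all(abs(x - y) <= tolerance for x, y in zip(a, b))


def server_verify(index, content_id, fingerprint):
    """Server side: compare only against the fingerprints of the guessed content.

    `index` maps content ID -> list of fingerprints stored for that content.
    """
    return any(fingerprints_match(known, fingerprint)
               for known in index.get(content_id, ()))


def identify_with_guess(index, fingerprint, guessed_id, full_search):
    """Client side: verify the best guess first, search only if that fails.

    Returns (content_id, how), where `how` is "verified" or "searched".
    """
    if guessed_id is not None and server_verify(index, guessed_id, fingerprint):
        return guessed_id, "verified"
    return full_search(fingerprint), "searched"
```

The saving comes from `server_verify` touching one entry instead of the whole database; `full_search` stands in for whatever database-wide matching the server would otherwise perform.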
In one embodiment of the invention, the content identification information is used by the algorithm to identify an event, promotional media, or ad campaign and to capture the metadata and assets of the ad campaign or event. In this embodiment, the metadata and assets are used to assemble a video or application presented alongside the content. Once the video or application is presented alongside the content, the algorithm continues to verify the validity of the match by continuing to perform the match for subsequent segments of the audio signal, thereby ensuring that the content has not changed over time. If the content has changed, the algorithm reinitializes the data in the local cache and begins the extraction of the audio signal, generation of the discriminative vector, and matching of the vector to the content in the content database to identify the source of the new content and information related to the new content, thereby enabling promotional media or events to be identified and assembled for presentation with the new content.
Fig. 6 shows a flow diagram of operations for identifying multimedia content flowing through a television in one embodiment of the invention. The method begins at operation 710, where an audio signal is retrieved from multimedia content selected for presentation at a television. Multimedia content may be obtained from any of a variety of content sources including satellite providers, cable providers, DVRs, Blu-ray providers, live media from the internet, etc. Multimedia content may be stored on a content provider server and streamed to a television at the request of a viewer. As a result, the source of the content or the content information is not available at the internet connection interface of the television or of the external device connected to the television. To identify the source of the content and the content information, the algorithm may divide the audio signal into a plurality of segments of small intervals, as described in operation 720.
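The division step of operation 720 can be sketched as a simple slicing routine. The function name `split_into_segments` and its parameters are assumptions for this example; the 5-second default follows the preset duration given for another embodiment in this description, and dropping a short trailing segment is a choice made here, not something the text specifies.

```python
def split_into_segments(samples, sample_rate, segment_seconds=5.0,
                        keep_partial=False):
    """Split a sequence of audio samples into fixed-duration segments.

    `sample_rate` is in samples per second. A trailing segment shorter
    than the full duration is dropped unless `keep_partial` is True.
    """
    size = int(round(sample_rate * segment_seconds))
    if size <= 0:
        raise ValueError("segment must contain at least one sample")
    segments = [samples[i:i + size] for i in range(0, len(samples), size)]
    if segments and not keep_partial and len(segments[-1]) < size:
        segments.pop()
    return segments


# 23 samples at 2 samples per second: two full 5-second segments, 3 samples left over.
segments = split_into_segments(list(range(23)), sample_rate=2)
```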
A particular segment of the audio signal is analyzed to identify acoustic modulations in the particular segment, as described in operation 730. The particular segment is selected for analysis based on the payload data included therein. The result of the analysis of the particular segment is an identification of a plurality of data points that are represented by distinct floating point numbers. The plurality of floating point numbers is used to generate the vector. The content database on the server is queried using the vector of floating point numbers, as described in operation 740. The server is equipped with a search algorithm that helps determine the location of content from a particular content provider, where the content of the particular content provider includes a data segment whose data points match or are in close proximity to the floating point numbers of the particular segment. The content in the content database is obtained from a plurality of sources, and the audio signals of this content are pre-fingerprinted and either stored with the content or stored in a separate database and mapped to the content in the content database. As a result, information related to the content and the source of the content is retrieved from a particular content provider when an audio segment of the content from the content provider matches a particular segment of the content streamed to the television. The retrieved information may be stored in a local cache and used for further verification of the content flowing through the television.
Fig. 7 shows an alternative embodiment of the present invention for identifying content flowing through a television. Processing begins with operation 810, where an algorithm within the television identifies a selection of particular content flowing through the television. The content can come from any content provider. An audio signal from the selected content is retrieved. The audio signal is divided into a plurality of segments of smaller intervals, as set forth in operation 820. In one embodiment, each segment has a preset duration, for example 5 seconds. A particular segment within the plurality of segments is selected and analyzed to identify acoustic modulations within the particular segment, as set forth in operation 830. The acoustic modulation is obtained by passing the audio segment through a band pass filter and examining the modulation characteristics of that particular segment using an FFT to identify the energy distribution at each channel for each frequency of the audio segment. The result of the examination of the modulation characteristics is the identification of a selective set of data points represented by floating point numbers. The set of floating point numbers is used to compute the discriminative vector. The vector defines a unique audio fingerprint for the particular segment.
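The analysis of operation 830 can be approximated by the following sketch, which computes per-channel energies over time and then measures how those energies are modulated. It is an illustration under several assumptions: a naive discrete Fourier transform stands in for both the band pass filter bank and the FFT, the channels are equal-width frequency bands, and the names and sizes (`frame_len`, `num_channels`, `num_mod`) are invented for the example. It yields 12 numbers rather than the exemplary 234 so that it stays small.

```python
import cmath
import math


def dft_magnitudes(values, count):
    """Magnitudes of the first `count` DFT bins of a real sequence (naive DFT)."""
    n = len(values)
    return [
        abs(sum(v * cmath.exp(-2j * math.pi * k * t / n)
                for t, v in enumerate(values)))
        for k in range(count)
    ]


def band_energies(frame, num_channels):
    """Energy of one frame in each of `num_channels` equal-width bands."""
    half = len(frame) // 2
    mags = dft_magnitudes(frame, half)
    width = half // num_channels
    return [sum(m * m for m in mags[c * width:(c + 1) * width])
            for c in range(num_channels)]


def modulation_fingerprint(segment, frame_len=32, num_channels=4, num_mod=3):
    """Return num_channels * num_mod floats describing the segment.

    For each channel, the energy envelope across frames is transformed
    again, and the magnitudes of its lowest modulation frequencies are kept.
    """
    frames = [segment[i:i + frame_len]
              for i in range(0, len(segment) - frame_len + 1, frame_len)]
    if len(frames) < num_mod or frame_len // 2 < num_channels:
        raise ValueError("segment too short for the requested fingerprint size")
    envelopes = [band_energies(f, num_channels) for f in frames]  # frames x channels
    vector = []
    for c in range(num_channels):
        envelope = [row[c] for row in envelopes]
        vector.extend(dft_magnitudes(envelope, num_mod))
    return vector
```

The naive transform is quadratic in the frame length, which is acceptable for a short demonstration; an actual implementation would use an FFT routine as the description states.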
The content database is searched to identify one or more contents having an audio segment with data points that match or are in close proximity to the floating point numbers of the vector of the particular segment, as set forth in operation 840. As previously mentioned, the content database includes content from multiple content providers with audio segments that have been fingerprinted by the algorithm using the same technique. When more than one audio segment from one or more content providers includes data points that match the data points of the particular audio segment, the algorithm identifies the content having the audio segment closest to the floating point numbers of the particular segment. The algorithm then obtains the content identifier of the content whose audio segment most closely matches the particular segment, as set forth in operation 850. The content provider database is queried using information obtained from the content database, such as the content identifier, as set forth in operation 860. In response to the query, a portion of the identified content is received from the content provider database, as set forth in operation 870. The portion may include a record that matches the particular segment and additional fingerprints for a predetermined amount of time. In one embodiment, the additional record may cover a further 15 seconds beyond the 5 seconds associated with the particular segment. The records obtained from the content provider are stored in a local cache and used to further verify the content and to match promotional media or events.
Fig. 8 illustrates another alternative embodiment, for matching promotional media to content flowing through a television. The method begins at operation 910, where an audio signal is retrieved from content selected for presentation at a television. The audio signal is divided into a plurality of segments of smaller intervals, as described in operation 920. A particular segment of the audio signal is selected for analysis to identify modulation characteristics, as described in operation 930. The particular audio segment may be selected based on the payload contained therein. The analysis of the particular segment includes generating an acoustic spectrogram of the particular segment and identifying, in the acoustic spectrogram, a plurality of floating point numbers related to the data points that define the acoustic modulation of the particular segment of the audio signal. The discriminative vector is computed as a function of the floating point numbers. The vector defines a unique audio fingerprint of the audio segment.
At operation 940, the content database is searched to identify content that includes an audio segment with data points that match or are in close proximity to a plurality of floating point numbers of a particular audio segment. The content database is a repository of pre-computed data points for a plurality of audio segments representing different portions of a plurality of audio signals for a plurality of content obtained from a plurality of content providers. Upon identifying content with an audio signal matching a particular audio segment, content information related to the content and a source of the content may be retrieved from a content provider using a content identifier.
The promotional media or event associated with the content is identified from the service database using the content identifier and the fingerprint of the particular segment, as described in operation 950. The content provider database is queried to obtain content from the content provider database, and the ad campaign database is queried to obtain metadata and assets related to the identified promotional media, as described in operation 960. The process ends with the assembly of multimedia content from the content obtained from the content provider database and the assembly of promotional media content/applications for presentation at the television using the metadata and assets retrieved from the ad campaign database, as described in operation 970. In one embodiment of the invention, promotional media content may be presented alongside the content or separately in the form of a widget.
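The lookups of operations 950 through 970 can be sketched with two dictionaries standing in for the service database and the ad campaign database. The function `assemble_promotion`, the record layout, and the sample entries are hypothetical and serve only to show the order of the lookups.

```python
def assemble_promotion(content_id, service_db, campaign_db, placement="widget"):
    """Build a presentable promotion record for identified content.

    `service_db` maps content ID -> campaign ID (operation 950).
    `campaign_db` maps campaign ID -> {"metadata": {...}, "assets": [...]}
    (operation 960). Returns None when no promotion is associated.
    """
    campaign_id = service_db.get(content_id)
    if campaign_id is None:
        return None
    campaign = campaign_db.get(campaign_id)
    if campaign is None:
        return None
    # Operation 970: combine metadata and assets into one unit for display.
    return {
        "content_id": content_id,
        "campaign_id": campaign_id,
        "title": campaign["metadata"].get("title", ""),
        "assets": list(campaign["assets"]),
        "placement": placement,
    }


service_db = {"show_a": "camp_1"}
campaign_db = {"camp_1": {"metadata": {"title": "Spring Sale"},
                          "assets": ["banner.png"]}}
promo = assemble_promotion("show_a", service_db, campaign_db)
```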
By audio fingerprinting a small segment of the audio signal related to the content, the algorithm extracts features of the content to determine what content a particular user is watching on his/her television and to identify a particular application or promotional multimedia related to the content for presentation alongside the content. In doing so, the algorithm effectively creates a potential bridge to Broadcast Interaction Services (BIS) for the user. Using a modulation detection process that matches two signals based on their modulation similarity, a small segment of audio is matched with the audio of multiple content received from content providers/broadcasters and scheduled for a particular time period. This approach uses fewer CPU resources and less time while providing more efficient and accurate matching. In addition to modulation matching, the algorithm provides faster matches by storing records of the matching content, for the time of the segment and for an additional predetermined amount of time, in a local cache of the television, and by continuing to verify that the identified content continues to match the audio signal of the multimedia content selected for presentation at the television. When the user changes the multimedia content selected for viewing, the algorithm determines that the content stored in the local cache no longer matches and flushes the content. The algorithm then performs audio fingerprinting using HAC and LSH techniques as previously described, making it a more robust and efficient tool.
Embodiments of the invention may be practiced with a variety of computer system configurations, including hand-held devices, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a wired or wireless network.
With the above embodiments in mind, it should be understood that the invention is capable of use in various computer-implemented operations involving data stored in computer systems. These operations can include physical transformation of data, preservation of data, and display of data. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or electromagnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Data can also be stored in the network during capture and transmission over the network. The storage may be, for example, at network nodes and in memory associated with a server, as well as in other computing devices, including portable devices.
Any of the operations described herein (which form part of the invention) are useful machine operations. The invention also relates to an apparatus or device for performing these operations. The apparatus may be specially constructed for the required purposes, or it may be a general purpose computer selectively activated or configured by a computer program stored on the computer. In particular, various general-purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.
The invention can also be embodied as computer readable code on a computer readable medium. The computer readable medium is any data storage device that can store data which can thereafter be read by a computer system. The computer readable medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion.
Although the foregoing invention has been described in some detail for purposes of clarity of understanding, it will be apparent that certain changes and modifications may be practiced within the scope of the appended claims. Accordingly, the present embodiments are to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

Claims (20)

1. A method for identifying multimedia content flowing through a television, the method being performed by a processor, comprising:
retrieving an audio signal from multimedia content selected for presentation at the television;
dividing the audio signal into a plurality of sections;
analyzing a particular segment to identify acoustic modulations in the particular segment, the analyzing generating a discriminative vector for the particular segment based on the acoustic modulations, the vector defining a unique audio fingerprint for the particular segment of the audio signal; and
querying a content database on a server using the vector for the particular segment of audio signals to obtain content information for multimedia content that matches the fingerprint for the particular segment, the content information being used to retrieve information data from a content provider, the information data including multimedia content that matches the particular segment and additional multimedia content related to the audio signal currently being rendered at the television for a predetermined amount of time, wherein the retrieved information data is stored in a local cache of the television for subsequent verification of audio signals that continue to flow through content of the television.
2. The method of claim 1, wherein the audio signal is captured from multimedia content streamed to the television by a content provider or obtained from a digital multimedia recording device.
3. The method of claim 1, wherein each of the plurality of segments has a predefined interval of 5 seconds.
4. The method of claim 1, wherein analyzing further comprises:
generating an acoustic spectrogram to identify acoustic modulation characteristics of the particular section of audio signal at one or more frequencies, wherein the acoustic modulation characteristics are spread across a plurality of channels;
examining the acoustic modulation at each channel to measure a magnitude that identifies an amount of energy in each channel at each frequency; and
calculating the vector of the particular segment of the audio signal as a function of the measured magnitude in each channel for each frequency over the time period associated with the particular segment of the audio signal, wherein the vector identifies a plurality of floating point numbers of data points representing unique fingerprints for the particular segment of the audio signal.
5. The method of claim 4, wherein the examination of the acoustic modulation and the measurement of magnitude are accomplished using a fast Fourier transform technique.
6. The method of claim 4, wherein querying further comprises:
searching the content database to identify one or more multimedia content with an audio segment having data points closest to the plurality of floating point numbers of a particular segment of the audio signal, the content database being a repository of pre-computed data points for a plurality of audio segments representing different portions of a plurality of audio signals of multimedia content obtained from a plurality of content providers;
calculating a distance between a data point of each audio section of the identified multimedia content and a floating point number of the particular section using an iterative computation method; and
selecting multimedia content having a data point closest to the floating point number, wherein the multimedia content is referenced using the unique identifier.
7. The method of claim 6, further comprising retrieving multimedia content associated with an entry from the content provider using the unique identifier.
8. The method of claim 6, further comprising:
when more than one multimedia content has a data point closest to the floating point number of the particular segment,
performing additional matching by selecting one or more additional sections of an audio signal of the multimedia content currently selected for presentation at the television.
9. The method of claim 1, further comprising:
identifying an event or promotional media from a service database relating to multimedia content scheduled for presentation, the event or promotional media identified by using information in the fingerprint from the particular segment;
retrieving metadata and assets related to the identified event or promotional media from an ad campaign database; and
using the retrieved metadata and assets to assemble application or multimedia content associated with the event or promotional media, the assembled application or multimedia content related to the event or promotional media being presented at the television alongside the multimedia content related to the audio signal.
10. A method for identifying content flowing through a television, the method performed by a processor of the television, comprising:
retrieving an audio signal from content selected for presentation at the television;
dividing the audio signal into a plurality of sections;
analyzing a particular segment to identify acoustic modulations in the particular segment, the analyzing generating a vector for the particular segment based on the acoustic modulations, the vector identifying a plurality of floating point numbers related to data points of the particular segment, the vector defining a unique audio fingerprint for the particular segment of the audio signal;
searching a content database to identify one or more content with an audio segment having data points closest to the plurality of floating point numbers of the particular segment, the content database being a repository of pre-computed data points for a plurality of audio segments representing different portions of a plurality of audio signals for a plurality of content obtained from a plurality of content providers;
obtaining a content identifier for content having an audio segment with a data point closest to a floating point number of the particular segment;
querying a content provider database for information about content with audio segments matching a particular segment using the content identifier; and
receiving, in response to the query, a portion of the content from the content provider database, the portion of content comprising a content record matching the particular segment and an additional record for a predetermined amount of time, the additional record defining a sequence of audio fingerprints of the content, the portion of the content record and the additional record received from the content provider database being used for further matching of subsequent segments of the audio signal.
11. The method of claim 10, wherein analyzing further comprises:
generating an acoustic spectrogram to identify acoustic modulation characteristics of the particular section of audio signal at one or more frequencies, wherein the acoustic modulation characteristics are spread across a plurality of channels;
examining the acoustic modulation at each channel to measure a magnitude, the magnitude identifying an energy value in each channel at each frequency, the examining identifying data points related to the acoustic modulation of the particular segment of the audio signal; and
calculating the vector of a particular segment of the audio signal as a function of the measured magnitude in each channel for each frequency over a time period associated with the particular segment of the audio signal, wherein the vector identifies a plurality of floating point numbers related to data points of the particular segment, the vector representing a unique fingerprint of the particular segment of the audio signal.
12. The method of claim 10, wherein identifying the content identifier further comprises:
calculating a distance between a data point of each content in the content database and a corresponding floating point number of the particular segment using an iterative computation method; and
identifying content with a set of data points that are closest to a corresponding floating point number of the particular segment.
13. The method of claim 10, further comprising:
storing the portion of the content record and the additional record received from the content provider database in a local cache accessible to a processor of the television for further verification of the content of the audio signal flowing through the television.
14. The method of claim 13, further comprising:
periodically generating additional fingerprints for additional segments of the streaming audio signal; and
comparing the additional fingerprints with fingerprints and fingerprint sequences of the content and additional records stored in the local cache to determine whether the streaming audio signal continues to match content in the local cache.
15. The method of claim 14, further comprising:
when the additional fingerprint does not match the fingerprint of the content stored in the local cache,
purging the content from the local cache;
initiating a search by querying the content database to identify content matching the additional segment using the additional fingerprint; and
retrieving content from the content provider database for storage in the local cache for subsequent validation.
16. The method of claim 10, further comprising:
identifying promotional media from a service database related to the content, the promotional media identified by using information in the fingerprint from the particular segment;
retrieving metadata and assets related to the identified promotional media from an ad campaign database; and
assembling multimedia content for the promotional media using the retrieved metadata and assets, the assembled multimedia content related to the promotional media being presented at a television alongside content related to the audio signal.
17. A method for identifying content flowing through a television, the method performed by a processor of the television, comprising:
retrieving a set of audio fingerprints associated with a plurality of content arranged for presentation;
storing the set of audio fingerprints in a local cache associated with the television;
receiving a request to present content on the television;
retrieving an audio signal of the content selected for presentation at the television;
analyzing a particular segment of the audio signal to identify acoustic modulations in the particular segment, the analyzing generating a vector for the particular segment based on the acoustic modulations, the vector identifying a plurality of floating point numbers related to data points of the particular segment, the vector defining a unique audio fingerprint for the particular segment of the audio signal;
determining whether a match of the audio fingerprint for a particular segment of the audio signal is found within the local cache by comparing the audio fingerprint of the particular segment with the audio fingerprints of the plurality of content;
when a match is found in the local cache, querying a content provider database using a content identifier of the particular content that matches the audio fingerprint of the particular segment to obtain a portion of the particular content; and
presenting the particular content obtained from the content provider database in response to the request.
18. The method of claim 17, further comprising:
when the audio fingerprint of a particular section of the audio signal does not match the fingerprint of any of the plurality of content stored in the local cache,
forwarding a request to a content database to verify a possible match of the audio fingerprint associated with the audio signal, wherein the request includes a content identifier of content from a previous query; and
receiving confirmation of a possible match for an audio fingerprint of the audio signal from the content database.
19. The method of claim 17, further comprising:
periodically generating additional fingerprints for additional segments of the streaming audio signal; and
verifying whether the additional fingerprint continues to match the particular content in the local cache by comparing the additional fingerprint to a corresponding fingerprint of the particular content stored in the local cache.
20. The method of claim 17, wherein the set of audio fingerprints arranged for rendering are periodically retrieved and stored in the local cache, and wherein the local cache is purged prior to storing the retrieved audio fingerprints.
HK14112822.3A 2011-12-20 2012-11-30 Audio fingerprint for content identification HK1199344B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US13/332,331 2011-12-20
US13/332,331 US8949872B2 (en) 2011-12-20 2011-12-20 Audio fingerprint for content identification
PCT/US2012/067487 WO2013095893A1 (en) 2011-12-20 2012-11-30 Audio fingerprint for content identification

Publications (2)

Publication Number Publication Date
HK1199344A1 HK1199344A1 (en) 2015-06-26
HK1199344B true HK1199344B (en) 2019-02-01

Similar Documents

Publication Publication Date Title
CN103999473B (en) Audio-frequency fingerprint for content recognition
JP7804711B2 (en) Audio processing for detecting crowd noise onset in sporting event television programming
US9264785B2 (en) Media fingerprinting for content determination and retrieval
US10509815B2 (en) Presenting mobile content based on programming context
JP5175908B2 (en) Information processing apparatus and program
US11223433B1 (en) Identification of concurrently broadcast time-based media
US20130148898A1 (en) Clustering objects detected in video
US20150301718A1 (en) Methods, systems, and media for presenting music items relating to media content
US20160249115A1 (en) Providing Interactivity Options for Television Broadcast Content
US20170134810A1 (en) Systems and methods for user interaction
HK1199344B (en) Audio fingerprint for content identification