HK40036392B - System and method for continuous media segment identification - Google Patents
System and method for continuous media segment identification
- Publication number
- HK40036392B HK42021026403.2A
- Authority
- HK
- Hong Kong
- Prior art keywords
- audio
- media segment
- coefficients
- computer
- unknown
- Prior art date
Description
This application is a divisional application of an invention patent application having an international filing date of November 30, 2015, PCT International Application No. PCT/US2015/062945, national application No. 201580074972.8, entitled "System and method for continuous media segment identification".
Priority declaration
The present application claims the benefit of U.S. Provisional Patent Application No. 62/086,113, entitled "AUDIO MATCHING USING PATH PURSUIT", filed December 1, 2014, and invented by W. Leo Hoarty. The above-identified application is currently co-pending, or is an application of which a currently co-pending application is entitled to the benefit of the filing date.
Technical Field
The present invention relates generally to a media identification client server system with significant improvements in the efficient presentation and identification of multimedia information. More particularly, the present invention relates to a computationally efficient and accurate media identification system that requires minimal processing of media during client device processing prior to communication with a server device for continuous identification.
Background
Applications for automatic content recognition are experiencing considerable growth and are expected to continue to grow, driven by demand from many new business opportunities, including: interactive television applications that provide contextually relevant content; targeted advertising; and the tracking of media consumption. To cope with this growth, a comprehensive solution is needed that addresses the problems of creating a media database and of identifying specific media segments within that database in a manner tolerant of alterations to the original transmitted image, such as graphics generated locally within the client device, or the distortion introduced when a user views a standard definition broadcast using the zoom or stretch mode of an HDTV. These changes may occur due to user actions such as using an Electronic Program Guide (EPG), requesting additional program information that appears in a pop-up window generated by the set-top box, or selecting a non-standard video mode on a remote control.
Automatic content recognition systems typically ingest large amounts of data and tend to operate on a continuous 24-hour schedule. The amount of data consumed and managed by such a system qualifies it, in current popular parlance, as a big data system. The system must therefore operate as efficiently as possible in terms of data processing, storage resources, and data communication requirements. A basic means of increasing operating efficiency while still achieving the necessary accuracy is to generate a compressed representation of the data to be identified. The compressed representation, often referred to as a fingerprint, is typically associated with identifying data from audio or video content. Despite the use of algorithms of varying complexity, most rely on a common set of basic principles with several important characteristics: the fingerprint should be much smaller than the original data; a set of fingerprints representing a media sequence or media segment should be unique, such that the set can be identified within a large database of fingerprints; the original media content cannot be reconstructed from a set of fingerprints, even in a degraded form; and the system should be able to identify a copy of the original media even if the copy is degraded or distorted, whether intentionally or by the way the media is copied or otherwise reproduced. Examples of common media distortions include: rescaling or cropping the image data, such as converting between high definition and standard definition video formats; re-encoding the image or audio data at a lower quality level; and changing the frame rate of the video. Other examples include decoding digital media into analog form and then re-encoding the media digitally.
One useful example of a typical media fingerprinting method is illustrated by the popular mobile phone application (app) known as Shazam. The Shazam application and many similar applications are commonly used to identify songs unknown to a user, particularly songs heard in a public place such as a bar or restaurant. These applications sample audio from the microphone of a mobile device (such as a smartphone or tablet) and then generate a so-called "fingerprint" of the unknown audio to be identified. The "fingerprint" is typically constructed by detecting frequency events, such as the center frequency of a particular sound event being higher than the average of the surrounding sounds. This type of acoustic event is referred to as a "landmark" in the Shazam patent, U.S. Patent No. 6,990,453. The system then continues analyzing the audio for another such event. When a first and a second "landmark" are found, they and the time interval separating them are sent to a remote processing device as a unit of data called a "fingerprint". The remote processing device accumulates additional "fingerprints" over a period of time, typically twenty to thirty seconds. The series of "fingerprints" is then used to search a reference database of known musical compositions, the database having been built by the same fingerprinting means. The match result is then sent back to the mobile device, and when the match result is positive, the unknown music playing at the user's location is identified.
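The landmark-pairing approach described above can be sketched as follows. This is an illustrative reconstruction, not Shazam's actual implementation; the function name, the fan-out parameter, and the (time, frequency) peak representation are assumptions made for this example.

```python
# Illustrative sketch of landmark-pair fingerprinting: spectral peaks
# ("landmarks") are paired with the time offset between them, and each
# (f1, f2, dt) triple becomes a hashable "fingerprint".
def landmark_fingerprints(peaks, fan_out=3):
    """peaks: list of (time, frequency) tuples sorted by time.
    Pairs each peak with up to fan_out following peaks."""
    fingerprints = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            fingerprints.append((f1, f2, t2 - t1))
    return fingerprints

# Example with three peaks (times in milliseconds, frequencies in Hz).
fps = landmark_fingerprints([(0, 440), (100, 880), (250, 660)])
```

A real system would accumulate such triples over twenty to thirty seconds before querying the reference database.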
Another service, known as Viggle, identifies television audio through a software application downloaded to a user's mobile device. The application relays audio samples from the user's listening location to a central server device, which identifies the audio by means of an audio matching system. After a television program is identified while the user is watching it, the service awards the user loyalty points, which the user may then redeem for goods or services, similar to other consumer loyalty programs.
Identifying unknown television segments typically requires distinct methods for video identification and audio identification, because video is presented in discrete frames while audio is played as a continuous signal. Despite the different presentation formats, however, a video identification system proceeds much like the audio identification process: it compresses video segments into representative fingerprints and then searches a database of known video fingerprints to identify the unknown segment. The video fingerprint may be generated by many means, but typically fingerprint generation requires identifying various video attributes, such as finding image boundaries (e.g., light-dark edges in a video frame) or other patterns in the video that can be segmented and marked and then grouped together with similar events in adjacent video frames to form a video fingerprint.
In principle, the same process used to register known video segments into a reference database should be used to process unknown video from the client devices of a media matching service. However, if a smart TV is taken as the example client device, several problems arise, because the processing means of the smart TV samples the video as it arrives at the television set. One such problem arises from the fact that most television equipment is connected to some form of set-top box device. In the United States, 62% of households subscribe to cable television services, 27% subscribe to satellite television, and a growing number of televisions are fed by Internet-connected set-top boxes. Fewer than 10% of television receivers in the United States receive television signals from an over-the-air source. In the case of television signals provided to a television set by a set-top box, as opposed to over-the-air transmission received via an antenna, the set-top box will often overlay the received video image with a locally generated graphical display, such as program information displayed when the user presses an "info" button on a remote control. Similarly, when a user requests a program guide, the television image will typically shrink to a quarter of its size or less and be positioned in a corner of the display, surrounded by the program guide grid. Likewise, alerts and other messages generated by the set-top box may appear in a window overlaid on the video program. Other forms of disruptive video distortion may occur when a user selects a video zoom mode to magnify an image, or when a user watching a standard definition broadcast selects a stretch mode to fill the 16:9 screen of a high definition television with the 4:3 aspect ratio image. In each of these cases, the video identification process will be unable to match the unknown video sampled from the set-top box configuration.
Thus, existing automatic content recognition systems that rely solely on video recognition will be interrupted when several common situations arise in which video programming information is changed by an attached set-top box device as described above. Further problems arise with identifying the video even when the video is not changed by the set-top box device. For example, prior art video recognition systems may lose the ability to recognize unknown video segments when the video image is fading to black or even when the video image is depicting a very dark scene.
It is worth noting that the audio signal of a television program is rarely altered; it is passed to the television essentially as received by the attached set-top box device. In all of the above examples of graphical overlays, and likewise during a fade to black or a dark video scene, the program audio will continue to play generally unchanged, and thus may be used for reliable program segment identification by a suitable automatic content identification system operating on the audio signal. There is therefore a clear need for an automatic content recognition system that utilizes audio recognition alone, or in addition to video, for the purpose of identifying unknown television program segments. However, the techniques employed by the above-described music recognition systems (e.g., Shazam) are generally not suitable for recognizing continuous content such as television programs. These mobile phone music recognition applications are typically designed to process audio captured by a microphone in the open air, which introduces significant ambient noise interference, such as that present in a noisy restaurant or bar. Furthermore, the mode of operation of these audio recognition applications is generally based on the particular use envisaged and is not designed for continuous automatic content recognition. The technical architecture of such special-purpose music identification programs is therefore unsuited to continuous identification of audio, owing to the many technical difficulties of identifying audio from high-interference sources. A further shortcoming of such systems is that they are not designed to operate both continuously and with a very large number of simultaneous devices, such as the population of television set-top boxes or smart TVs of a country or even a region.
When a television program is displayed on a television set, there are many uses for identifying that program. Examples include interactive television applications, in which supplemental information for the currently displayed television program is typically provided to the viewer in the form of a pop-up window, either on the same television display on which the media was identified or on the auxiliary display of a device such as a smartphone or tablet. Such context-dependent information typically needs to be synchronized with the main program currently being viewed. Another application for detecting television programs is commercial replacement, also known as targeted advertising. Yet another use is media census, such as audience ratings measurement of one or more television programs. All of these uses, and others not mentioned, benefit from timely detection of unknown program segments. Thus, continuous audio recognition, alone or in conjunction with video recognition, may provide or enhance the reliability and consistency of an automatic content recognition system.
Disclosure of Invention
The present invention is used to identify video and/or audio segments so as to enable an interactive television application to provide various interactive television services in a client set-top box or smart television. In addition, the present invention provides a reliable means of gathering program viewing statistics for audience ratings measurements.
The present invention provides an audio and video segment identification apparatus in which, at enrollment time, as shown in FIG. 1, video frames and seconds of audio are converted into a common-format continuous coefficient stream 101, which can be marked and stored in a reference database 102 to provide candidate data for identifying unknown audio or video segments when they are presented to the system from a client device enabled with the invention. The present invention can operate in a variety of modes, using only video, only audio, or a combination of video and audio, and the system will provide accurate results in three to ten seconds. Audio and video segment information is prepared 103 for the recognition process 104 of FIG. 1 in a manner similar to the registration process 101. The result of a successful match is a unique identification code or metadata for the audio/video segment 110.
In one embodiment of the invention, video segments may be used as the primary means of identifying unknown media segments. Video recognition by the present invention may be interrupted if a consumer device, such as a set-top box, displays a locally generated graphic overlaid on the main video image. When such an interruption occurs, the system of the present invention can seamlessly switch to audio segment information to continue identifying the unknown media content sent from the consumer device to the central matching server means.
The ability to dynamically switch between audio and video segment identification is further enhanced by an embodiment of the present invention in which the audio segment information is converted, by the Linear Predictive Coding (LPC) means of the present invention, from a stream of digital audio samples into a stream of coefficients or symbols having characteristics similar to those of the video segment transform process. These characteristics include a broad set of symbols (called coefficients) that exhibit wide variability; unlike other time-frequency transforms, such as the well-known Fourier series, the coefficients are not directly related to frequency. Furthermore, the coefficient processing will reliably repeat its values for identical or largely similar audio segments, thus exhibiting the highly desirable property of high entropy while maintaining repeatability. Another important feature of the LPC process of the present invention is that the coefficient values remain substantially fixed for a time interval of at least 20 milliseconds (ms) and as long as 100 ms. This fixed time frame allows the coefficients to be processed with means similar to the video pixel sampling process of Neumeier, U.S. Patent No. 8,595,781, which is incorporated herein by reference in its entirety. This provides the further advantage of allowing the use of a continuous data matching scheme employing high-dimensional algebraic suspect selection combined with time-discounted scoring means, such as the Path Pursuit method taught by Neumeier. This is in sharp contrast to the prior art, in which feature vectors and other means are used to find landmarks that are combined to form a fingerprint, as exemplified by the popular Shazam music recognition service and many other audio recognition systems.
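LPC analysis of the kind described above is conventionally computed with the Levinson-Durbin recursion over a frame's autocorrelation sequence. The patent does not disclose its exact routine, so the following is a textbook sketch, assuming the autocorrelation values are already available; the sign convention follows the predictor polynomial A(z) = 1 + sum(a_k * z^-k).

```python
# Textbook Levinson-Durbin recursion (a sketch, not the patent's routine).
def levinson_durbin(r, order):
    """r: autocorrelation values r[0..order] of one audio frame.
    Returns the LPC coefficients a[1..order] of A(z) = 1 + sum a_k z^-k."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    error = r[0]                       # prediction error energy
    for i in range(1, order + 1):
        # Reflection coefficient for this recursion order.
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / error
        # Update the predictor polynomial using the previous-order values.
        a_prev = a[:]
        for j in range(1, i):
            a[j] = a_prev[j] + k * a_prev[i - j]
        a[i] = k
        error *= (1.0 - k * k)
    return a[1:]                       # drop the leading 1.0
```

For a first-order autoregressive signal with autocorrelation [1.0, 0.9, 0.81], the recursion recovers a single dominant coefficient near -0.9 and a negligible second coefficient, illustrating the repeatability the text relies on.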
Audio data differs greatly from video data in most respects, but the audio signal is transformed by the present invention into coefficient groups or coefficient frames, also referred to in the art as "cues," in a manner that makes it similar to the sampled pixel values of the video information. This data similarity aspect between video and audio cues allows the advantageous center matching means of the present invention to be used interchangeably to match unknown audio against reference audio or unknown video against reference video data, or both (if the application so requires).
The present invention provides a means to continuously identify media information from multiple client devices, such as smart televisions, cable or satellite set-top boxes, or internet media terminals. The present invention provides a means for transforming media samples received by the device into successive frames of compressed media information for recognition by a central server device. The central server means will identify the unknown media segment within three to ten seconds and provide the identity of the previously unknown segment back to the respective client device which provided the segment for use in the interactive television application, for example displaying contextually relevant content in the form of an overlay window, or for advertising replacement purposes. In addition, the identification of the media segments may be provided to other processes of the server or external systems over a network for media census, such as a ratings measurement application.
The invention is based on transforming audio into time-frozen coefficient frames in a continuous process similar to the continuous video frame process of the prior art (the Neumeier patent), and on the understanding that video information is processed in Neumeier by finding average pixel values at a plurality of positions within each video frame. The video frame information is typically registered continuously in the matching system at a rate of at least several frames per second, though not necessarily at the full video frame rate of a common television signal. Similarly, the identification phase of the Neumeier patent allows the video frame information to be collected and transmitted to the central matching device of the present invention at a video frame rate lower than the full frame rate of the unknown video segment, so long as that rate is not greater than the registration frame rate. The audio information is processed as overlapping frames of short duration, typically 20 to 100 milliseconds. It is known that certain audio channel characteristics, such as the power spectral density of the signal, are effectively fixed over short time intervals of between 20 and 100 milliseconds and can be converted into coefficients that do not change significantly during the frame time. Thus, a means of transforming continuous audio data into substantially time-frozen frames of coefficients may be used, which provides an efficient means of storing known audio information in a database that can then be searched by algorithmic means to identify unknown audio segments.
In addition, it has been determined during the development of the present invention that these coefficients have entropy characteristics similar to the video coefficients (cues) of U.S. Patent No. 8,595,781, providing the ability to store the coefficients by means of a locality-sensitive hash index to form a searchable reference database. As with video, during the recognition stage the database can be searched by linear algebra (matrix mathematics) means to find candidates in a multidimensional space. The candidates, also referred to as suspects, may be represented by tokens arranged in bins having characteristics similar to leaky buckets, providing an effective scoring means, known in the industry as time-discounted binning, to find a match from among the harvested suspects. Another effective means of scoring candidate matches is to exploit the correlation of the unknown cue with one or more candidate (known) cues. The correlation means, not to be confused with the autocorrelation used elsewhere herein, is well known to those skilled in the art for finding the closest match of a reference data item to one of a set of test data items. A scoring means using such a mathematical correlation process thus yields the best match achieved by the recognition system, and may be used in place of time-discounted binning.
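Time-discounted binning can be sketched as a set of leaky buckets: every frame, all bins decay, and the bins of candidates whose cue matched the incoming unknown cue are topped up, so a candidate that matches consistently accumulates the highest score. The decay rate and reward values below are illustrative assumptions, not figures from the patent.

```python
# Leaky-bucket ("time-discounted binning") scoring sketch.
def score_candidates(match_events, num_candidates, decay=0.9, reward=1.0):
    """match_events: one list per frame of candidate indices whose cue
    matched the unknown cue in that frame. Returns final bin scores."""
    bins = [0.0] * num_candidates
    for frame_matches in match_events:
        bins = [b * decay for b in bins]   # every bin leaks each frame
        for idx in frame_matches:
            bins[idx] += reward            # top up bins that matched
    return bins

# Candidate 0 matches in all three frames, candidate 1 only once,
# so candidate 0 ends with the higher score.
scores = score_candidates([[0], [0, 1], [0]], num_candidates=2)
```

The leak ensures that stale matches are forgotten, so the winning bin tracks the segment currently playing rather than one matched long ago.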
It should be appreciated that the coefficient frame generation rate during the recognition process may be less than the coefficient frame generation rate used during the enrollment process, yet still provide sufficient information for the matching system to accurately determine the identity of the unknown audio segment over a three to ten second interval. For example, the present invention allows the registration process to operate at 20 millisecond intervals with 50% overlap, equal to 100 frames per second. The client device may send frames to the matching server apparatus for identification at 100 frames per second, or at any reasonable divisor thereof, perhaps 50, 25, or 10 frames per second, for efficient matching by the identification mechanism of the present invention.
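This rate relationship can be illustrated by simple decimation of the cue-frame stream before transmission. The helper below is hypothetical, assuming the client rate divides the registration rate evenly.

```python
# Illustrative cue-frame decimation: the reference side registers at
# 100 frames/s; a client may transmit every nth frame (e.g. every 4th
# for 25 frames/s), provided the query rate never exceeds the
# registration rate.
def decimate_cues(cue_frames, registration_fps=100, client_fps=25):
    assert registration_fps % client_fps == 0, "client rate must divide registration rate"
    step = registration_fps // client_fps
    return cue_frames[::step]
```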
Once the audio is transformed from a time-based representation to a frequency-based representation, additional transformations may be applied to further refine the set of coefficient frames (cues). Many suitable algorithms can be found for this step. The goal is to reduce the data dimensionality while increasing the invariance between the enrolled samples and the samples to be identified. There is thus a large space of possible coefficient generation schemes, any one of which may be selected for data registration and identification, provided that the same specific scheme is used for both registration and identification at any given time.
The present invention provides a means of identifying audio or video information from any media source, such as cable, satellite or internet delivered programming. Once identified, the present invention can send a signal from the centralized identification appliance over the data network to the client application of the present invention, causing the application to display contextually-targeted or other content on the television display associated with the client device providing the unknown media information. Likewise, the context-coordinated content may be supplied by the recognition means to a second screen device, such as a smartphone or tablet. Similarly, in identifying unknown media segments, the present invention may maintain a general view of the audience measurement for a particular television program for use by a third party, such as a television advertising agency or television network.
In one or more aspects, the related systems include, but are not limited to, circuitry and/or programming for implementing the herein-referenced method aspects; the circuitry and/or programming can be virtually any combination of hardware, software, and/or firmware configured to effect the herein-referenced method aspects depending upon the design choices of the system designer.
In addition to the foregoing, various other method, system, and/or program product embodiments are set forth and described in the teachings of the present disclosure, such as in the text (e.g., claims, drawings, and/or specification) and/or drawings.
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, embodiments, features, and advantages of the devices and/or processes and/or other subject matter described herein will become apparent in the teachings set forth herein.
Drawings
FIG. 1 is a high-level block diagram of the basic functionality of an automatic content recognition system. Known audio/video information 101, consisting of audio and/or video segments 102 and metadata (program information) 103, is processed and transformed into coefficient frames 104 that are stored in a reference database 105. Unknown audio and/or video information 106 is processed into coefficient frames 107 by a process similar to 104 and supplied to an Automatic Content Recognition (ACR) system 108, which ACR system 108 compares the data with the reference database 105. When the unknown audio/video segment is identified, audio and/or video metadata (program information or segment ID)109 is output.
Fig. 2 is a block diagram of a server 202 and a client device 203 of the present invention. One or more content sources 201a are supplied to the media ingest device 201, which generates audio and/or video cue data 201c and provides associated metadata in the form of program identification and time code 201b information for each media segment. The media information is input to a reference matching database 204 queried by an Automatic Content Recognition (ACR) processor 205 to process and identify segments of unknown audio 203b and/or video 203a supplied by one or more client devices 203. The client device consists of an ACR client 208 that converts the contents of the TV frame buffer 209 and/or TV audio buffer 211 into corresponding sets of cues to be sent to the server 202. After a successful match of an audio or video segment, the ACR processor 205 sends a message to the match processing device 207, which checks the Interactive Television (ITV) content database for the presence of instructions, and possibly data, to be transmitted over the network to the client device application 210 for local processing by the client device 203. The processing may include displaying, in a window of the television display, supplemental information associated with the program segment detected by the method of the present invention. In addition, the matching process 207 may provide results to a measurement database, such as an audience share measurement system 207b.
Fig. 3 is a block diagram of an advantageous system of the present invention that receives media information (such as a radio or television program broadcast) from a content delivery network 302 through, for example, an optical transport 303. The matching server system 306 thereby receives each program before a client device (such as a smart television) does, so that the content can be processed and stored in a reference media database 307 in sufficient time for the system to be ready before unknown media arrives from the client devices 309-312. Network distribution of radio or television programs is often provided to service providers (such as satellite and cable providers) over fiber optic networks, which typically exhibit network delays of fractions of a second, while client devices may receive content via satellite, or the content may undergo additional processing in the head end of a cable system, such that the delay is about two to five seconds or possibly longer. This distribution time difference between backbone and home delivery is sufficient to allow the server device of the present invention to provide real-time processing of unknown audio or video segments, since known data from the same source will have been processed and stored, ready for use by the matching server device, before the client device that received the same content makes any query to the matching service. Accordingly, interactive television services, such as context-related information display or advertisement replacement, may be performed very close to the beginning of the play of the identified segment.
FIG. 4 is a process flow diagram of raw audio input 401 from a receiver, showing the following steps: pre-processing 402; pre-emphasis (if applicable) 403; framing, shaping, and overlapping 404 of audio segments; autocorrelation 405 to prepare the signal for linear predictive coding 406 processing; the LPC coefficients are then converted to line spectral pairs or immittance spectral frequencies 407; then the coefficients are post-processed by normalization and quantization 408; and forming the quantized coefficients into 'cue' groups 409 to be sent to the audio matching system 410, which audio matching system 410 provides audio metadata (identification) 411 when an audio piece is successfully identified by the matching system.
FIG. 5 is a frequency response graph of an audio pre-emphasis filter used to enhance the informational content of high frequency audio components.
fig. 6 is a plot (a) of a typical television audio spectrum before the pre-emphasis filter of fig. 5 is applied to the signal. The measure 601 of the difference in amplitude of the audio signal from the low frequency average peak (about 500Hz) to the high frequency average peak appears to be in the range of about 45 dB. Curve (b) shows the increased signal strength of the high frequency audio component after processing graph (a) via the filter of fig. 5, where the high frequency information increases to a beneficially narrower range 602 of 30dB between the frequencies.
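A pre-emphasis filter of the kind plotted in FIG. 5 is commonly realized as a one-tap high-pass difference, y[n] = x[n] - a*x[n-1]. The sketch below follows that standard form; the coefficient value 0.95 is a typical speech-processing choice, not a value specified by the patent.

```python
# Standard first-order pre-emphasis filter (illustrative coefficient).
def pre_emphasis(samples, a=0.95):
    """y[n] = x[n] - a*x[n-1]; boosts high frequencies, attenuates DC."""
    out = [samples[0]]                  # first sample passes through
    for n in range(1, len(samples)):
        out.append(samples[n] - a * samples[n - 1])
    return out
```

A constant (low-frequency) input is strongly attenuated while a rapidly alternating (high-frequency) input is amplified, which is exactly the narrowing of the low-to-high amplitude gap shown in FIG. 6(b).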
Fig. 7 shows audio segment overlaps 701 to 704 employed by the present invention. In one embodiment, the present invention uses 20 millisecond audio clips with 10 millisecond overlap. In certain embodiments, segment lengths may advantageously utilize up to 100 milliseconds of segment length, and overlap of 10% to 90% of the segment length may be advantageously achieved.
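The overlapping segmentation of FIG. 7 can be sketched as follows, assuming the samples are already decoded into a flat list. With 20 ms frames and 10 ms overlap at an assumed 16 kHz sample rate, the frame length would be 320 samples with a hop of 160.

```python
# Split a sample stream into overlapping fixed-length frames.
def overlapping_frames(samples, frame_len, hop):
    """frame_len: samples per frame; hop: samples between frame starts
    (hop = frame_len // 2 gives the 50% overlap of one embodiment)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames
```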
Fig. 8 is a signal framing graph illustrating the spectral effects of various shaping functions applied to audio frames. Graph 801 shows a simple rectangular frame with an abrupt start and stop that results in a fourier transform 802 exhibiting significant sideband noise added to the signal of interest as a result of the abrupt break. Graph 803 shows a Hamming Window (Hamming Window) that is widely used in voice communication systems. The resulting fourier transform 804 shows an optimized signal with harmonic information suppressed by more than 50 dB. Graph 805 shows a relatively simple triangular window function with a fourier graph 806, which fourier graph 806 has a quality close to that of hamming window graph 804, but requires much less computation to be applied to the audio frame and is therefore most advantageous for applications using consumer electronics devices with limited computational means, such as smart televisions or set-top boxes.
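For reference, the Hamming and triangular windows compared in FIG. 8 can be written directly from their standard textbook definitions; the triangular window trades a little sideband suppression for far less computation, which suits resource-limited client devices.

```python
import math

# Standard window-function definitions (textbook forms, not patent text).
def hamming(n, N):
    """Hamming window value at sample n of an N-sample frame."""
    return 0.54 - 0.46 * math.cos(2 * math.pi * n / (N - 1))

def triangular(n, N):
    """Triangular window: 1.0 at the center, 0.0 at the edges."""
    half = (N - 1) / 2
    return 1.0 - abs((n - half) / half)

def apply_window(frame, window_fn):
    """Multiply a frame by a shaping function before transformation."""
    N = len(frame)
    return [frame[n] * window_fn(n, N) for n in range(N)]
```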
Fig. 9 is a graph of the coefficient output of an autocorrelation function employed by the present invention and applied to typical television audio.
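The autocorrelation step that produces the coefficients of FIG. 9 can be sketched as follows; only lags up to the LPC order are needed, so the computation stays cheap even on limited client hardware.

```python
# Short-lag autocorrelation of one windowed audio frame, as used to
# condition the signal for LPC analysis.
def autocorrelate(frame, max_lag):
    """Returns r[0..max_lag], where r[k] = sum_n frame[n]*frame[n-k]."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]
```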
Fig. 10 is a graph 1002 of the Linear Prediction (LP) spectrum and also shows a graph 1001 of a weighting filter suitable for normalizing coefficients for optimal quantization.
FIG. 11 is a graph of the coefficient output of the LPC process applied to the autocorrelation output of FIG. 9, showing typical values for a 20 millisecond audio sample of a speech signal.
Fig. 12 is the result of converting the LPC coefficient output of FIG. 11 into Immittance Spectral Frequency (ISF) coefficients. It is well known in the art that a suitable alternative exists, namely the Line Spectral Pair (LSP) transform, which produces similar coefficients; both ISF and LSP coefficients may be better suited to quantization than the unprocessed coefficients of the LPC process.
Fig. 13 is a polar diagram of an ISF coefficient map of coefficient output of the ISF process to a complex plane (Z plane) unit circle. The ISF coefficients exist as symmetrical conjugate pairs and only the first half of the unit circle contributes to the output value. The pole (x) of the LPC forming the input to the ISF process is shown inside the circle.
Fig. 14 is a graph of 15 ISF coefficients plotted over time, showing the relative sensitivity of the unmodified transform output with respect to position 1203 on the unit circle of the Z plane.
Fig. 15 is an ingestion process of an audio source 1501, where the audio source 1501 is decoded in a receiver/decoder audio buffer 1502 and then split into fixed-length audio frames 1503. In this embodiment, the audio frame is transformed by autocorrelation 1504, then processed into coefficients by linear predictive coding 1505, and further transformed into coefficients using an ISF transform 1506. Program information metadata 1509 is added to the program timecode 1508 and processed coefficients 1507 to form an audio data cue record 1510.
Fig. 16 is a schematic diagram of a reference audio cue 1601 hashed by an audio hashing function 1602 and stored in a reference database 1604, the reference database 1604 indexed by parsing an output of the hashing function 1602 with the most significant bits addressing a memory sector and the remaining bits addressing a "bucket" (location) 1606 within the memory sector.
Fig. 17 is a schematic diagram of audio cue formation 1706 from an unknown audio source received by the television monitor 1701 and decoded in the television audio buffer 1703, then processed by the client software of the present invention to form audio frames 1702 of predetermined length, which are converted to coefficients 1705. The client cue formation includes adding a current processing time 1707, referred to in the art as "real time".
Fig. 18 is a schematic diagram of an unknown audio cue 1801, which unknown audio cue 1801 uses a hash function 1804 to generate a hash index, which is then used to address a reference database bucket 1805. The candidate audio cues 1802 are retrieved from the database and supplied to a matching process 1803, which matching process 1803 outputs results 1807 after successful matching of the unknown media segment with the known segment from the reference database 1806.
Fig. 19 is a representative diagram of a time discount binning process 1901 that supplies tokens to bucket 1902 until the bucket contains enough tokens to cross a threshold 1904, which threshold 1904 indicates a high probability of a media fragment match result in the present invention. The buckets are "leaky" and will drain tokens over time such that consistent matching results are required within a predetermined time domain to cause tokens to fill the respective bucket faster than the leak rate so that tokens in the bucket successfully cross the threshold.
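The time-discounted ("leaky") bucket of Fig. 19 can be sketched as follows (a minimal illustration; the class name, leak rate, and threshold are hypothetical, not values from the patent):

```python
class LeakyBucket:
    """Token bucket that drains over time; a match is declared only when
    tokens arrive faster than the leak rate and the level crosses a threshold."""

    def __init__(self, leak_per_sec: float, threshold: float):
        self.leak_per_sec = leak_per_sec
        self.threshold = threshold
        self.level = 0.0
        self.last_t = 0.0

    def add_token(self, t: float, weight: float = 1.0) -> bool:
        # Drain according to elapsed time, then deposit the new token.
        self.level = max(0.0, self.level - (t - self.last_t) * self.leak_per_sec)
        self.last_t = t
        self.level += weight
        return self.level >= self.threshold  # True => high-probability match

# Sparse, inconsistent hits leak away and never accumulate...
b = LeakyBucket(leak_per_sec=1.0, threshold=5.0)
sparse = [b.add_token(t) for t in [0.0, 2.0, 4.0, 6.0]]
# ...but consistent hits within a short window cross the threshold.
b2 = LeakyBucket(leak_per_sec=1.0, threshold=5.0)
dense = [b2.add_token(t * 0.1) for t in range(10)]
```

This mirrors the requirement that consistent matching results arrive within a predetermined time window before a bucket's count can exceed threshold 1904.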
Fig. 20 is a matrix diagram of possible combinations of transforms from audio input to coefficient or hash string output. In all paths through the matrix, the coefficients are quantized either by linear processing 2014 or by vector quantization 2015, except for output 2013, and then output from the system at 2016. In all of these processes, the audio is converted into high-entropy coefficient sets that represent audio frames with a nearly fixed power spectrum over the duration of the audio frame, thus yielding coefficients that can be properly hashed and applied to the search and scoring approach of Path Pursuit for continuous identification of audio segments.
Fig. 21 is a flowchart of steps that may perform content audio matching.
The flowchart of fig. 22 defines the steps of matching a series of coefficient frames representing an unknown audio segment. Candidate harvesting (determination) and time-discounted binning are the same as taught by the Neumeier patent.
The flowchart of fig. 22a defines the steps of matching a series of coefficient frames representing an unknown audio segment. The candidate harvest (determination) is provided to a process that correlates the unknown set of cues with one or more suspect (candidate) cues. The closest match is further evaluated and then output as a result if above a threshold.
FIG. 23 illustrates an operational flow representing example operations related to continuous audio matching.
Fig. 24-28 illustrate an alternative embodiment of the operational procedure of fig. 23.
Detailed Description
In one embodiment, as shown in FIG. 2, the system utilizes the client application 203 of the present invention running within the processor device of a cable, satellite or Internet connected set-top box or within the processor device of a smart television to identify audio 203b and video 203a information from a television program. In one exemplary embodiment, the client application process typically operates on audio 211 and/or video information just prior to playing the information to the speakers and/or display of the television device. The audio and/or video information is processed by the present invention to generate a highly compressed continuous stream of frame representations of the corresponding audio and/or video signals using the ACR client 208. The frame representation is sent 203a and/or 203b via a network (typically the internet) to the server device 202 of the present invention for identification. The frame representation is in the form of selected average pixel values for video frames and transformed power spectral coefficients for audio information.
In order to identify unknown media segments of audio and/or video information, said information must first be registered by the inventive identification server devices 104 and 105 of fig. 1. The registration process is typically the same or similar to the process implemented by the client device 107 to send the coefficient representation to the server 108. The registration data is received by the server 102, processed by the server and then stored by the server at 105 for later utilization by the recognition process 108.
Referring again to fig. 2, upon successful identification of an unknown media segment at the ACR processor 205, the system of the present invention may use the matching process 207 to search the ITV content database 206 for client services that may be notified or triggered by the presence of the media segment. The client event may include sending a trigger signal 202a to the client application 210 of the present invention, the trigger signal 202a displaying context-related information (such as information about the plot of the program or actors in the program) or any of a variety of interactive television services available from smart televisions or set-top boxes. Similarly, the trigger signal may cause the currently displayed television advertisement to be replaced with a different advertisement that is more relevant to the viewer. The advertisement replacement process is also known to those skilled in the art as targeted advertising. Another use of the trigger is to update the ratings database via 207b to maintain a viewing census for ratings measurement purposes. The time sensitivity of ratings measurement is typically lower than that of the other interactive television applications described above.
The audio and video matching data streams are created by separate and distinct processes; however, each process yields a data structure with similar characteristics, which is then applied to a separate database served by equivalent server means of the present invention, both to register the data in the reference database and to use the data by the media matching means of the present invention to identify unknown media segments from the client device. The video and audio coefficients, while somewhat similar in dimensionality and entropy characteristics, remain in separate databases, and it will be apparent to those skilled in the art that audio data cannot be used to search the video database, and vice versa. However, the processing means and database structure are similar and largely identical for both types of media, thus providing advantageous economies of scale for systems employing both video and audio matching.
The video coefficients are generated from video information as taught by patent US8,595,781. The searchable audio representation of the present invention must be formed from media that differs in kind from video information. However, the end result of this process is a continuous stream of coefficient frames with characteristics similar to the video frame information created by the referenced patent.
To create searchable frames of audio coefficients from audio information, a basic premise of the present invention is that the power spectral density of a typical audio signal, such as television audio, remains substantially fixed for periods of 20 milliseconds up to 100 milliseconds (ms), which is in the range of a single television frame: about 33 milliseconds for U.S.-based standards and 40 milliseconds for non-U.S. television. Thus, an audio signal may be divided into a plurality of frames, converted to a power spectral representation, and stored in a searchable multidimensional reference database, using a process similar to that taught by Neumeier for video frames, in which a subset of pixels is sampled and stored in a matching database. One embodiment of the invention, which provides the necessary audio data transformation, uses Linear Predictive Coding (LPC) as the main step to convert the audio signal into said audio coefficient representation, which is then sent to the inventive server. The use of LPC or equivalent transforms allows flexible and efficient conversion of the audio signal into a highly compressed form that can be further manipulated to enhance the search and selection efficiency of the entire automatic content recognition system.
In contrast, prior art techniques for audio matching may use, for example, Modified Discrete Cosine Transform (MDCT), mel-frequency cepstral coefficient (MFCC) processing, or discrete fourier transforms, etc., to convert, for example, an audio signal from a time representation to a frequency representation. Once the signal is transformed, the prior art can find frequency events (sometimes called landmarks) above a certain amplitude and then measure the time interval between events or landmarks to form a so-called fingerprint for storing reference media segments. The client device then uses the same process to generate a fingerprint to be submitted to identify the unknown media segment.
To match audio information, the present invention does not use prior art fingerprinting means. Instead, it creates a continuous stream of coefficients from fixed audio frames for building a reference database; then, for matching unknown media segments, the client device applies a similar process to unknown audio segments and provides the coefficients to a matching server means that utilizes the reference database. It should be understood that the coefficient processing of the present invention can be accomplished by a variety of different but related mathematical transforms, as shown in fig. 20, which are similar to those used in the prior art. However, many of the additional steps performed by the prior art in forming a fingerprint constructed from identified landmarks or other unique constructs are not utilized in any way by the present invention. Thus, the present invention is capable of handling continuous media streams, which the prior art cannot. Furthermore, the invention is scalable to support millions of client devices with high accuracy and has the further advantage of low processing overhead in the client devices.
Returning to fig. 2 of the present invention, which shows the basic functionality and communication path from the client to the server, the client device 203 contains processor means capable of executing computer programs, and the client device provides the processor means with access to the client's video 209 and audio 211 buffers. The ACR client 208 application periodically samples the data from the video and audio buffers and processes the video 203a and audio 203b cues, where a cue consists of element 1706 of fig. 17. In this embodiment, the elements of the cue consist of 16 coefficients and a time code consisting of local time (also referred to as real time). The cue is sent over a network to the server device 202 of the present invention. An Automatic Content Recognition (ACR) processor 205 receives the cues and performs a matching process in which the received cues are identified by searching a reference media matching database 204. The processor 205 may provide useful matching results by various means, such as by using Neumeier's Path Pursuit or by correlation of unknown cue groups with a set of suspect cues. The correlation process is illustrated in fig. 22a. A positive identification from 205 is communicated to the matching processing means 207, and the matching processing means 207 may perform various functions, such as providing context-dependent content to the client device, as taught by the Zeev Neumeier patent US8,769,584 B2, the entire contents of which are incorporated herein by reference. The matching process 207 may also provide statistical information to the matching results service 207b for ratings purposes or other audience measurement services.
Figure 3 shows how the invention provides continuous identification of, for example, television programs. Many interactive television applications may be implemented with a system that is aware, in a timely manner, of the current program being displayed on the television receiver. Such applications include targeted advertising as well as contextual trigger information display. While not necessarily time sensitive, the system of the present invention also enables accurate audience share measurements. Fig. 1 shows media information processed by a registration system to populate a reference database against which unknown media information is tested for identification. The obvious problem is how to get data, such as television programs, into a central database quickly enough to match the same television programs entering the system from the client device without delay. The answer lies in the following fact: the central registration system receives media content from the television distribution backbone network, which typically arrives at the central facility of the present invention four to ten seconds before the same program arrives at the television receiver of the client device. Thus, the system has sufficient time to process the incoming reference media before any query of the data is needed.
In a preferred embodiment of the present invention, FIG. 4 depicts the steps of converting client television receiver audio 401 to data suitable for transmission to the audio matching system 410. The transformation process begins with an audio pre-processing function 402. Digital audio received from the audio buffer of the television receiving device may be provided at a higher sampling rate (e.g., 48 kHz) but is processed by the present invention at, for example, 16 kHz; it is converted from stereo to mono by summing the stereo channels and may be further processed by a down-sampling step. Other pre-processing steps may include volume normalization and band filtering. Process 403 applies pre-emphasis, in which the audio signal is passed through a high-pass filter having the filter characteristics shown in fig. 5. The original audio of a representative television audio segment is depicted in the spectrogram of fig. 6a, and the post-equalization audio is depicted in fig. 6b, where the audio is enhanced according to the filter parameters of fig. 5. The pre-emphasis process of 403 enhances the dynamic range of some coefficients and thus improves the quantization process 408 of the coefficients. The data is then divided into frames of 20 ms with 50% overlap with the previous frame, as shown in fig. 7. The framed audio is then shaped with a triangular window function 805, as shown in fig. 8, to obtain a spectral distribution as shown at 806. The next step of the process is to autocorrelate the framed audio 405 and then apply an LPC process 406, whose coefficients are further transformed by an ISF function at 407 and then normalized in step 408 by a weighting function similar to 1001 of fig. 10; step 408 also includes a quantization step. The data is then framed into a cue group 409 and sent to the audio matching system 410, either for registering reference audio information or for the identification process of unknown media segments.
In a preferred embodiment of the invention, Linear Predictive Coding (LPC) is used as the main step of coefficient generation, but alternative embodiments include: mel-frequency cepstral coefficients (MFCCs), Modified Discrete Cosine Transforms (MDCTs), and/or wavelets, among others. FIG. 20 represents a block diagram matrix of the various alternative transforms that may be used by this disclosure to convert audio into usable coefficients. The matrix maps four families 2002, 2003, 2004, 2005 of possible algorithm combinations applicable to audio transforms into coefficient frame outputs usable by the present invention. The process chain 2002 includes four variants built on a common basis of autocorrelation 2002a applied to the audio signal 2001. The autocorrelation may directly provide one 2017 of the four coefficient outputs. The second process of the 2002 family applies Linear Predictive Coding (LPC) 2006 to the output of 2002a to output LPC coefficients at 2009. Alternatively, the LPC 2006 values may be further transformed by LSP 2007 or ISF 2008. In all four cases, the coefficient output is further processed with one of the two possible quantization steps 2014 or 2015. The second processing family is the mel-frequency cepstral (MFC) coefficient process, which starts with taking the Log values 2003 of the audio, followed by further processing with the MFC process 2010 before the final quantization step 2014 or 2015. The wavelet transform 2004 may be used with a suitable coefficient generation step 2011 and, finally, the modified discrete cosine transform 2005 process may produce a candidate cue set (coefficient frame) either by direct coefficient generation 2012 or by bit derivation 2013 that produces a hash string output. In all outputs except output 2013, the coefficients are quantized either by a linear process 2014 or by vector quantization 2015 and then output from the system 2016.
In all of these processes, the audio is converted into sets of high-entropy coefficients that represent audio frames with a nearly fixed power spectrum over the duration of the audio frame, thus yielding coefficients that can be appropriately hashed and applied to the Path Pursuit's search and scoring approach, thereby providing the possibility of accurately and continuously identifying audio segments.
Figure 13 is a graph of the coefficients of LPC processing as the poles of the Z plane, represented by x 1302. The transformation of the LPC coefficients to ISF coefficients results in zeros on the unit circle 1301. Fig. 14 is a graph of ISF coefficients over time, illustrating their high entropy and thus their applicability to the Path Pursuit matching process. It should be noted that in another embodiment of the present invention, the audio conversion process of the present invention may function with only the LPC output coefficients, without employing a step of conversion to LSP or equivalent ISF coefficients, as this LSP/ISF step was developed in the prior art primarily for improving audio quality in vocoder applications. It has been found that such improvements in audio quality may not significantly improve the accuracy of the audio matching system.
Fig. 15 shows the formation of an audio cue data set from coefficient data 1507 by adding program time codes 1508 and some program identifying information, also referred to as metadata 1509. In fig. 16, once formed, audio cues 1601 are supplied to a media search database where audio cues 1601 are processed by an audio hashing function 1602, creating hash keys 1603 for storage in search database 1604 where hash keys 1603 result in similar audio data cues grouped nearby to minimize search distance and thus improve overall system efficiency.
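The sector/bucket addressing described for the search database of fig. 16 can be sketched as follows (the hash function, bit widths, and names are hypothetical illustrations, not the patented scheme):

```python
import hashlib

SECTOR_BITS = 8    # most significant bits address a memory sector
BUCKET_BITS = 16   # remaining bits address a bucket within the sector

def cue_hash(coefficients: tuple) -> int:
    """Hash a quantized coefficient frame into a fixed-width integer key."""
    digest = hashlib.sha1(repr(coefficients).encode()).digest()
    return int.from_bytes(digest[:3], "big")  # 24-bit key: 8 + 16 bits

def locate(key: int) -> tuple:
    sector = key >> BUCKET_BITS               # high bits -> memory sector
    bucket = key & ((1 << BUCKET_BITS) - 1)   # low bits -> bucket in sector
    return sector, bucket

key = cue_hash((3, 1, 4, 1, 5, 9, 2, 6))
sector, bucket = locate(key)
```

Because similar cues hash to nearby keys in the actual system, parsing the key this way groups them in the same sector and bucket, minimizing search distance as the passage describes.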
A client of the present invention is shown in fig. 17, in which a process similar to the registration function runs in the client device 1701. The audio from the client device is processed into audio cues 1705 with local time 1707 (also referred to as "real time") added to the cues; the local time 1707 provides relative time differences between the cues. Figure 18 shows an unknown data cue addressing a reference media database by the same hash function as used to address the database during the registration process of the reference media. One or more candidates 1802 are retrieved from the database and supplied to the matching process 1803 as described above. The candidates are evaluated using a linear algebraic function that selects candidate data by evaluating Euclidean distances in a high-dimensional space, for example using probabilistic point location in equal balls (PPLEB), a process also known as suspect selection. Another step in the possible candidate (suspect) selection process is performed by Time Discounted Binning (TDB) over a known period of time. Fig. 19 shows candidates (suspects), where each candidate is represented by a bucket 1902 that is allocated after the process of collecting the suspects. The bucket is leaky, meaning that tokens have a preset time value and time out, which equates to a leaky bucket that drains over time. As unknown data cues arrive and more suspects are collected from the reference database, the number of tokens in the bucket identifying the unknown cue will rise above the threshold 1904 after a period of three to ten seconds, identifying the unknown data. This entire process can be understood by reference to the appendix of invention US8,595,781. An alternative means of scoring candidate matches may be achieved by applying correlation of the unknown cues 1801 with one or more candidate cues 1802.
The correlation means, not to be confused with the autocorrelation used herein, is well known to those skilled in the art for finding the closest match of a reference data item to one of a set of test data items. Thus, scoring by a mathematical correlation process instead of time-discounted binning yields the best match achieved by the recognition system. This process is also shown in FIG. 22a, where the various steps from start 2202a to within range 2206a are similar to the process leading to the time-discounted binning described above for FIG. 22. In step 2207a, a correlation process is applied in lieu of creating the token bin. Step 2209a selects the closest fit from the correlation process 2207a. The winning value is further evaluated at 2211a and, if affirmative, the candidate's identification is output as result 2212a.
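The correlation-based scoring of steps 2207a through 2211a can be sketched as follows (a minimal illustration with hypothetical names, cue dimensions, and threshold; the actual evaluation steps of fig. 22a are not specified here):

```python
import numpy as np

def best_candidate(unknown: np.ndarray, candidates: np.ndarray,
                   threshold: float = 0.9):
    """Score each candidate cue group against the unknown cue group by
    normalized correlation; return (index, score) if the best score
    clears the threshold, else (None, score)."""
    u = (unknown - unknown.mean()) / unknown.std()
    scores = []
    for c in candidates:
        cn = (c - c.mean()) / c.std()
        scores.append(float(np.mean(u * cn)))
    best = int(np.argmax(scores))
    return (best if scores[best] >= threshold else None, scores[best])

# Four unrelated candidate cues plus one noisy copy of the unknown cue
rng = np.random.default_rng(2)
true_cue = rng.standard_normal(16)
cands = np.stack([rng.standard_normal(16) for _ in range(4)]
                 + [true_cue + 0.05 * rng.standard_normal(16)])
idx, score = best_candidate(true_cue, cands)
```

The noisy copy of the unknown cue wins with a score near 1.0, while unrelated cues score near zero, which is why a simple threshold on the winning value suffices at step 2211a.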
The above process is one of many embodiments of the present invention. The following description is of the inventive means for generating coefficients from an audio signal and is common to most embodiments.
The present disclosure discloses that Linear Predictive Coding (LPC) coefficients and variants thereof may be used in place of feature vectors or fingerprints to reliably detect audio segments, typically within a few seconds of analyzing an unknown audio signal. The theory underlying LPC is well understood and widely practiced in signal communication systems as a fundamental procedure for transcoding audio signals for packet-based digital communication systems. A subset of these general processes is used in the present invention. The rationale behind the selected process is provided, along with a detailed description of the many steps used to generate coefficients useful for Automatic Content Recognition (ACR).
Referring again to FIG. 4, there is shown a simplified block diagram of a process for processing audio from a television audio 401 source. It should be appreciated that applying the audio signal processing steps 402 to 409 and submitting the result to the audio matching system 410 is the same process whether it is used for enrollment, adding known audio clip cues to the reference database 307 of fig. 3, or used to process audio from, for example, a client smart TV and submit the audio clip cues to the audio matching system 410 via a network such as the Internet to determine the identity of unknown segments from the cue values.
Describing in more detail the many steps of applying the audio representation to the audio matching system 410: certain necessary pre-processing 402 steps are applied to the audio, which may include stereo-to-mono conversion and down-sampling or up-sampling of the audio, followed by pre-emphasis (whitening) 403, followed by framing, shaping and overlap 404. At 404, the audio is segmented into 20 to 100 millisecond frames, and the triangular window function 805 of FIG. 8 is then applied to the signal of each frame, such as 701 of FIG. 7, to mitigate the abrupt start and stop of the signal at the frame boundaries. The final step of 404 is to overlap the frames, by 50% in this embodiment. The 50% overlap, as shown at 701 to 704 of fig. 7, is implemented by starting the next audio frame at the halfway point of the previous frame's audio, so that the first half of the next frame is the same audio as the second half of the previous frame, and so on. This process accommodates alignment differences between the reference database of known audio segments and the unknown audio segments received by the matching system server device 306 of fig. 3. The pre-processed digital audio is passed through an autocorrelation process 405 in preparation for conversion by a Linear Predictive Coding (LPC) process 406. As the audio passes through block 406, it is evaluated by the Z-plane transform 1/A(z). The key to the usefulness of this process for matching unknown audio clips to a database of reference audio clips lies in the fact that LPC converts time-domain audio to a power-spectrum representation in the frequency domain, much like a Fourier transform, but in the Laplace domain. Thus, the resulting transformed audio information is quasi-stationary with respect to its power spectral density, remaining relatively unchanged for at least tens of milliseconds.
The transfer function 1/A(z) is an all-pole representation of the full-bandwidth audio transfer function. A(z) is a set of coefficients of a polynomial in the z domain, where z represents e^(-iωt). In a preferred embodiment, for wideband audio coding, a 16th-order LPC polynomial (LPC16) is used. Higher-order polynomials, up to at least LPC48, may be used. Higher-order polynomials benefit further from audio band pre-emphasis 403 applied to the audio before LPC processing. A further improvement of the relatively high entropy distribution among the coefficients is to apply an LP weighting function, such as 1001, applied to the representative LP spectrum 1002 of fig. 10. In one embodiment of the encoder, an audio segment of 20 milliseconds duration is analyzed and converted into groups of 16 coefficients, which in turn represent the channel information of an audio signal having, for example, a bandwidth of 8 kHz. In another embodiment of the encoder, a 100 millisecond audio clip with a 16 kHz frequency bandwidth is converted into groups of 48 coefficients. Fig. 5 illustrates an example pre-emphasis filter that is applied to the audio prior to processing by the LPC transform. Fig. 6(a) shows the spectral characteristics of the audio before pre-emphasis, and fig. 6(b) shows the audio spectrum after the pre-emphasis step. The particular filter of fig. 5 provides a frequency rise of +15 dB from 1 kHz to the top of the audio band (which is 16 kHz in this embodiment).
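Pre-emphasis before LPC analysis is commonly implemented as a first-order high-pass filter. The sketch below uses the textbook form y[n] = x[n] − α·x[n−1]; the exact filter of fig. 5 (+15 dB rise above 1 kHz) would use different coefficients, which are not given in the text:

```python
import numpy as np

def pre_emphasis(x: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """First-order pre-emphasis (whitening): y[n] = x[n] - alpha * x[n-1].

    A standard pre-LPC high-pass; alpha = 0.97 is a conventional value,
    not one taken from the patent.
    """
    return np.append(x[0], x[1:] - alpha * x[:-1])

# The filter attenuates low frequencies and boosts high frequencies:
fs = 16000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 100 * t)    # 100 Hz tone
high = np.sin(2 * np.pi * 6000 * t)  # 6 kHz tone
gain_low = np.std(pre_emphasis(low)) / np.std(low)
gain_high = np.std(pre_emphasis(high)) / np.std(high)
```

The low tone is strongly attenuated while the high tone is boosted, flattening (whitening) the spectrum so that the LPC coefficients carry a more even entropy distribution, as the passage notes.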
The continuous coefficient frames generated by the LPC process of the present invention can be used in an audio matching device in place of the fingerprints used in the prior art where the processing of the Path Pursuit provides the matching mechanism. When the LPC process is used in an audio vocoder (such as for audio communications), the LPC excitation encoding sub-process provides two values every 20ms frame, which are codebook representations of the waveform and amplitude values of the signal. Iterative algorithms are used to convert the excitation into a codebook and are computationally expensive. A relatively small change in the codebook value results in a large improvement in the perceived speech quality, and therefore this process is valuable for audio communication systems. However, for audio matching systems, small differences in codebook values do not result in large euclidean distances between coefficients required for audio matching applications. Due to the large processing requirements of the codebook and the sub-optimal distance characteristics, the excitation parameters do not benefit the invention and are therefore not used.
In one embodiment, the LPC coefficients are not used directly from the output of the 1/A(z) model. Audio codecs for typical audio communications have led to computationally efficient processing means. In one widely used embodiment, the LPC coefficients are calculated using an iterative forward and backward prediction algorithm known as Levinson-Durbin. An attractive property of this method is that the reflection coefficients are easily derived as a by-product. These coefficients are used to generate the lattice filter and the prediction filter for synthesis. This filter topology also provides robust performance with low sensitivity to coefficient accuracy, which is also a useful attribute for an audio matching system.
Thus, the present invention does not require all of the steps for voice communication applications of LPC, and therefore useful coefficients can be generated by a subset of the steps. In one embodiment, an example of the reduction step is as follows:
Capture 320 audio samples per 20 milliseconds at a 16 kHz Sampling Rate (SR)
Alternatively, capture 320 audio samples per 10 milliseconds at a 32 kHz SR
Alternatively, capture 2400 audio samples per 50 milliseconds at a 48 kHz SR
Omit high-pass filtering (typically set at 50 Hz), because this is already applied to television audio prior to transmission
Apply pre-emphasis with a 4 kHz HPF, boosting +25 dB at 16 kHz
Apply a 50% overlap of audio frames
Autocorrelate the audio output to 16, 32 or 48 coefficients
Compute 16, 32 or 48 LPC coefficients using Levinson-Durbin
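The autocorrelation step in the list above, which feeds the Levinson-Durbin recursion, can be sketched as follows (a minimal numpy illustration; the test tone and order-16 choice are assumptions for demonstration):

```python
import numpy as np

def autocorr(frame: np.ndarray, order: int) -> np.ndarray:
    """Autocorrelation lags r[0..order] of one windowed audio frame,
    the input required by the Levinson-Durbin recursion."""
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:])
                     for k in range(order + 1)])

# One 20 ms frame (320 samples at 16 kHz) of a synthetic noisy tone
rng = np.random.default_rng(0)
frame = (np.sin(2 * np.pi * 440 * np.arange(320) / 16000)
         + 0.01 * rng.standard_normal(320))
r = autocorr(frame, order=16)
```

Lag zero, r[0], is the frame energy and always dominates the other lags; the 17 values (lags 0 through 16) are exactly what an order-16 LPC analysis consumes.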
The audio input from a typical source, such as found in smart televisions, is stereo and is delivered at a sampling rate of 48 kHz. For a processing sampling rate lower than the 48 kHz receive rate, audio down-conversion is performed by low-pass filtering to remove frequency components above the Nyquist frequency (half the target sampling rate), followed by a decimation process to down-convert the audio to the required sampling rate. For example, a conversion from 48 kHz to 16 kHz requires a low-pass filter to eliminate frequency components above 8 kHz. The filter output is then decimated by a factor of three, converting it to the lower sampling rate of 16 kHz. It is also apparent that, for automatic content recognition, a stereo input is not necessary for good audio detection. Thus, the stereo input is converted to mono by combining the left and right channels; alternatively, the left or right channel alone may be used as the representative mono channel.
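The 48 kHz stereo to 16 kHz mono down-conversion described above can be sketched with a windowed-sinc low-pass filter followed by decimation (a minimal numpy illustration; a production resampler would use a longer, better-designed filter):

```python
import numpy as np

def downconvert(stereo: np.ndarray, factor: int = 3, taps: int = 63) -> np.ndarray:
    """48 kHz stereo -> 16 kHz mono: combine channels, low-pass below the
    new Nyquist frequency with a windowed-sinc FIR, then decimate."""
    mono = stereo.mean(axis=1)                 # stereo -> mono
    cutoff = 0.5 / factor                      # new Nyquist, fraction of input fs
    n = np.arange(taps) - (taps - 1) / 2
    h = 2 * cutoff * np.sinc(2 * cutoff * n)   # ideal low-pass impulse response
    h *= np.hamming(taps)                      # window the sinc
    h /= h.sum()                               # unity DC gain
    filtered = np.convolve(mono, h, mode="same")
    return filtered[::factor]                  # decimate 48 kHz -> 16 kHz

stereo = np.random.default_rng(1).standard_normal((4800, 2))  # 100 ms at 48 kHz
mono_16k = downconvert(stereo)
```

Decimating by three after filtering below 8 kHz yields the 16 kHz rate the text describes; the filter removes roughly two-thirds of the white-noise power before the sample-rate reduction.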
To improve the power spectral distribution, a whitening filter is then added to the data path of the present invention. The filter boosts frequencies above 4 kHz, by up to 20 dB at the highest frequency. Each 20 milliseconds of audio (320 samples at 16 kHz) is packed into one frame.
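A first-order pre-emphasis filter is one common way to realize such a high-frequency boost; the sketch below assumes that form and an illustrative coefficient, neither of which is specified by the text.

```python
def pre_emphasis(samples, alpha=0.6):
    """First-order high-frequency boost: y[n] = x[n] - alpha * x[n-1].
    A common whitening stand-in; alpha is an illustrative assumption."""
    if not samples:
        return []
    out = [samples[0]]
    for i in range(1, len(samples)):
        out.append(samples[i] - alpha * samples[i - 1])
    return out
```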
A simple triangular window function is applied to each audio frame to prepare it for LPC processing. Frame shaping is required to reduce spurious signal generation at the frame edges caused by the abrupt start and stop of the signal in each frame. Typically, a Hamming window is employed to maximize audio fidelity. However, since encoding fidelity is not important to the media identification process, only a simple triangular function is required by the present invention.
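The triangular frame shaping can be sketched as follows (a minimal illustration; names are hypothetical):

```python
def triangular_window(n):
    # Bartlett-style triangle ramping 0 -> 1 -> 0 across the frame
    if n == 1:
        return [1.0]
    return [1.0 - abs(2.0 * i / (n - 1) - 1.0) for i in range(n)]

def shape_frame(frame):
    """Taper the frame edges toward zero before LPC analysis."""
    return [s * w for s, w in zip(frame, triangular_window(len(frame)))]
```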
Levinson-Durbin uses the autocorrelation of the audio samples as input to compute the LPC coefficients. From the 17 autocorrelation values (lags 0-16) per frame, 16 coefficients are calculated using Levinson-Durbin, in addition to the leading "1". The details of such encoding are well known to those skilled in the art. Since no DC component is present in the audio, as described above, the autocorrelation function is equivalent to the covariance of the signal. Inversion of the covariance matrix yields an all-pole representation of the signal channel. Any matrix inversion method may be used, such as Gaussian elimination or Cholesky decomposition. The matrix is by definition real-valued and symmetric about the diagonal, and is known as a Toeplitz matrix. Levinson-Durbin computes the solution recursively using iterative forward/backward estimation. This method is almost universally used for LPC analysis. Not only is the method numerically stable and computationally efficient, it also provides the reflection coefficients as a by-product with little additional computation. The lattice filter representation of the channel using reflection coefficients is particularly suitable for fixed-point implementations, is used in vocoders generally, and may be advantageously used by the present invention. In one embodiment of the invention, the autocorrelation coefficients obtained from a 20 millisecond segment of audio are shown in fig. 9. The LPC coefficients calculated from the autocorrelation values are shown in fig. 11.
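The autocorrelation and Levinson-Durbin recursion described above can be sketched as follows. This is the illustrative textbook form, not the invention's exact implementation; note how the reflection coefficients fall out as a by-product:

```python
def autocorrelation(x, order):
    """Autocorrelation values for lags 0..order."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz system recursively. Returns (a, refl):
    LPC polynomial a[0..order] with leading 1, and the reflection
    coefficients produced as a by-product of each iteration."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    e = r[0]                      # prediction error energy
    refl = []
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / e              # reflection coefficient for stage m
        refl.append(k)
        new_a = a[:]
        for j in range(1, m):     # symmetric coefficient update
            new_a[j] = a[j] + k * a[m - j]
        new_a[m] = k
        a = new_a
        e *= (1.0 - k * k)
    return a, refl
```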
In another embodiment, it may be found beneficial to follow the LPC process with further processing of the LPC coefficients into the form of Line Spectral Pairs (LSPs) or the equivalent Immittance Spectral Frequencies (ISFs), as shown in fig. 12. The ISF is derived from the LPC coefficients by first generating symmetric and anti-symmetric functions f1' and f2' having the same order as the LPC filter:
f1'(z) = A(z) + z^(-16)A(z^(-1)) and f2'(z) = A(z) - z^(-16)A(z^(-1))
The roots of these two equations lie on the unit circle and are the ISFs. Like the LPC coefficients, the roots of f1 and f2 are conjugate symmetric, and only the roots on the upper half of the unit circle need be evaluated. Using this symmetry, two new functions f1 and f2 are created. f1 consists of only the first 8 coefficients of f1'. f2 consists of the first 7 coefficients of f2', filtered using a difference equation to remove the roots at 1 and -1. The roots of f1(z) = 0 and f2(z) = 0 are the ISFs. The roots of these functions may be obtained using classical methods such as Newton-Raphson or Laguerre's method. However, due to the special properties of these polynomials, a computationally efficient method using Chebyshev polynomials can be used.
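The construction of the symmetric and anti-symmetric polynomials can be sketched as follows (shown at a small toy order rather than the order 16 used above; the function name is hypothetical):

```python
def isf_polynomials(a):
    """Form f1'(z) = A(z) + z^(-p) A(z^(-1)) and
    f2'(z) = A(z) - z^(-p) A(z^(-1)) from LPC coefficients
    a[0..p] (a[0] = 1). By construction f1' is symmetric and
    f2' is anti-symmetric about the middle coefficient."""
    p = len(a) - 1
    f1 = [a[i] + a[p - i] for i in range(p + 1)]
    f2 = [a[i] - a[p - i] for i in range(p + 1)]
    return f1, f2
```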
Using the above method, f1 and f2 for the LPC coefficients of this example are shown in fig. 14. The zero crossings of f1 and f2 are the ISFs. The x-axis corresponds to θ, the angle on the unit circle from 0 to 180 degrees. f1 and f2 are evaluated using only the real components. For example, at x = 10 the angle is 18 degrees, and the input to f1 and f2 is cos(18 × π/180) = 0.95106. A zero crossing is an ISF position, where the ISF is cos(θ). The first and last zero crossings are roots of f1, and the roots alternate between f1 and f2. An efficient zero-crossing detection algorithm can be written that exploits these characteristics to minimize the required processing. Fig. 13 shows the LPC coefficients generated by the Levinson-Durbin algorithm as X and the resulting ISFs as O.
A plot of the ISF coefficients over time is shown in fig. 14, which also illustrates the desired entropy characteristics of the coefficients: they are largely decorrelated from the underlying audio signal from which they are indirectly derived. It should be understood that the LPC coefficients would appear in a similar shape in such a graph.
It is interesting to note that the reflection coefficients and ISFs are derived from the autocorrelation coefficients by a series of linear transformations. Although a division exists in the Levinson-Durbin algorithm, and division is not a linear process, the division is only used for scaling and can thus be interpreted as a linear multiplication. As evidence, if the division is omitted in a double-precision floating-point implementation, the result is the same up to scaling. This observation is important because it indicates that the statistical properties of the autocorrelation coefficients, LPC coefficients, reflection coefficients, and ISFs should be very similar. Thus, in another embodiment of the present invention, the system can perform automatic content recognition using coefficients created from the autocorrelation data alone (without the LPC or ISF processes), thereby further improving the efficiency of the overall ACR system.
It should be appreciated from the foregoing detailed description that the present invention provides a means for converting audio information into semi-stationary frames of audio coefficients useful for data registration and recognition by an automatic content recognition system. The process provides the ability to continuously match audio information from a large population of audio sources, such as smart televisions. With suitable central server scaling, the population may include tens of millions of devices. Furthermore, the audio ACR system can be effectively combined with a video matching system such as that taught by Neumeier and Liberty in U.S. 8,595,781, where both the audio and video matching processes can share a common central processing architecture such as the path tracking means of Neumeier. The present invention differs from the prior art in that no fingerprinting means is employed to identify the audio; it is more accurate, with fewer false-positive results, while offering greater scalability so that it can be used for continuous identification of media while minimizing the processing overhead at each client device.
Fig. 23 illustrates a system and/or operational flow 2300 representing example operations related to continuous audio matching. In fig. 23 and the following figures that include various examples of operational flows, discussion and explanation may be provided with respect to the above-described examples of fig. 1-22 and/or with respect to other examples and contexts. It should be understood, however, that the circuits, devices, and/or operational flows may be executed in many other environments and contexts, and/or in modified versions of fig. 1-22. Further, while the various operational flows are presented in the sequence illustrated, it should be understood that the various operations may be performed in orders other than those illustrated, or may be performed concurrently. An "operational flow" as used herein may include circuitry for performing the flow. A processing device, such as a microprocessor, becomes "circuitry configured to perform certain operations" by executing one or more instructions or other code that causes the processing device to perform those operations.
After the start operation, operational flow 2300 moves to operation 2310. Operation 2310 depicts maintaining a reference match database that includes at least one coefficient corresponding to at least one audio frame of at least one ingested content and at least one content identification corresponding to the at least one ingested content. For example, as shown in fig. 1-22 and/or described in connection with fig. 1-22, content is supplied to a media ingest operation that generates audio and/or video cue data and provides associated metadata (e.g., an identification of the received content, such as a title, episode, or other identifier). The audio and/or video cue data is stored in the database in real time (i.e., as the content is received) along with the corresponding identification. The audio and/or video data is transformed into values using particular algorithms, functions, and/or sets of functions. The client device uses the same algorithms, functions, and/or sets of functions when it processes audio and/or video data. When the same point in the program content is processed by the ingest operation and by the client device, the resulting audio and/or video coefficients will be the same or nearly the same, since both use the same algorithms, functions, and/or sets of functions. Rather than storing the entire program content, or even only its audio portion, frames of the audio content are converted into much smaller coefficient sets and stored with the identifier. The coefficients cannot be used to reproduce the audio, but contain enough data to match the corresponding coefficients sent by the client device, so that the associated content identification can be retrieved from the reference match database.
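The ingest-and-match storage just described can be sketched as follows. This is a toy in-memory illustration with hypothetical names; a production system would use the distributed, hash-indexed storage described elsewhere herein. Note that matching tolerates small differences, since client coefficients are only nearly identical to the ingested ones.

```python
import math

class ReferenceMatchDatabase:
    """Stores coefficient vectors with content identifiers,
    not the audio itself (a minimal sketch)."""
    def __init__(self):
        self.entries = []  # list of (coefficient_vector, content_id)

    def ingest(self, coeffs, content_id):
        self.entries.append((list(coeffs), content_id))

    def search(self, coeffs, tolerance=0.5):
        """Return suspect content IDs whose stored coefficients lie
        within `tolerance` (Euclidean distance) of the query."""
        suspects = []
        for stored, cid in self.entries:
            d = math.sqrt(sum((a - b) ** 2 for a, b in zip(stored, coeffs)))
            if d <= tolerance:
                suspects.append(cid)
        return suspects
```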
Operation 2320 then shows receiving at least one transmission from at least one client device, the at least one transmission including at least one client coefficient corresponding to at least one audio frame that may be rendered by the at least one client device. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, when audio and/or video is capable of being rendered by a client device (i.e., played through a speaker or other audio output of the client device), the audio and/or video data is thus transformed into coefficients in the client device using the same algorithms, functions, and/or sets of functions (not necessarily at the same rate as described elsewhere herein) as are used for the ingest operation. The resulting coefficients are typically transmitted over the internet to a matching server system that has access to a reference matching database.
Then, operation 2330 depicts identifying at least one content associated with the at least one client device based, at least in part, on searching the reference match database using the at least one client coefficient as a search term. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the matching server system may use coefficients received from the client system to retrieve suspects from the reference match database. A plurality of successively received coefficients are used to retrieve a plurality of suspects, which are placed in bins associated with possible program matches. Time-discounted binning is applied to the results of successive database searches to determine and/or identify the most likely program being presented by the client device. The operational flow then proceeds to an end operation.
Fig. 24 illustrates an alternative embodiment of the exemplary operational flow 2300 of fig. 23. FIG. 24 illustrates an example embodiment wherein the operational flow 2310 may include at least one additional operation. Additional operations may include operations 2410, 2420, 2430, 2440, 2450, and/or 2460.
Operation 2410 shows obtaining at least one real-time feed of at least one broadcast of at least one content. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the matching server system may retrieve programs via the satellite downlink of the national broadcast facility of the network. The matching server system may receive multiple channels of content at once. By forming the downlink directly from the network national broadcast facility, the matching server system receives the content before the client device due to client delays introduced by additional downlink and retransmission operations by local affiliates, cable operators, network headends, and the like.
Operation 2420 then shows encoding at least one audio sample of at least one real-time feed. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, audio data for one or more channels is converted into a coefficient stream for storage in the reference media database. A continuous audio waveform is sampled into a plurality of frames, for example 20 ms frames occurring 50 times per second. The sampling rate is selected to maintain an effectively fixed power spectral density of the audio information within each sample. In some embodiments, overlapping of adjacent audio frames is performed to compensate for any mismatch between the audio frame start times at the matching server system and at the client devices. The frame data is then transformed using a function that repeatably produces the same coefficient values as are produced when the audio data is transformed at the client device.
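The frame overlap described above can be sketched as follows (a minimal illustration; names are hypothetical):

```python
def overlapping_frames(samples, frame_len=320, overlap=0.5):
    """Split a continuous waveform into frames, with adjacent frames
    overlapping to compensate for framing-offset mismatch between
    the ingest server and the client device."""
    hop = int(frame_len * (1 - overlap))
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]
```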
Operation 2430 then shows storing the at least one encoded audio sample in association with the at least one content identification. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the coefficients may be stored with an identification of the program name obtained via the ingestion arrangement (e.g., the satellite feed). The data is stored in a manner that facilitates its retrieval by a path tracking device employing leaky buckets and time-discounted binning of the results of successive data retrieval operations.
Operation 2420 may include at least one additional operation. Additional operations may include operation 2440. Operation 2440 shows transforming at least one audio sample into at least one coefficient, the transforming based at least in part on at least one normalization, the at least one normalization capable of repeatedly providing coefficients associated with ingested audio content, the ingested audio content not being correlated with a particular frequency. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the transformation process may include an algorithm and/or function that: it is designed to "spread out" the coefficient values along the range of values in order to maximize the use of the entire range, rendering the data high entropy. Without such expansion, the coefficients would tend to cluster around a single point along the range of possible values for the coefficients. For example, consider a conversation that includes a speaker whose speech features include tones corresponding to particular frequencies. Without the above-described transformation designed to make the data appear high entropy, the coefficients corresponding to the speaker would tend to cluster around a value corresponding to that frequency. By applying the functionality disclosed herein, the coefficients are then expanded around their range of possible values, such that they appear to be high entropy and any association of the resulting coefficients with particular audio frequencies is eliminated. 
However, the functionality is repeatable because two different systems (e.g., the matching server system and the client device) operating on the same audio content will output the same or nearly the same coefficient values (note that they need not be exactly the same, as subsequent time-discounted binning to determine the likelihood of a match between multiple suspect allows for slight variations in the coefficients corresponding to the same portion of the content).
Operation 2450 shows maintaining a reference match database, including storing at least one coefficient corresponding to at least one audio frame using a locality-sensitive hash index. In some embodiments, as shown in fig. 1-22 and/or described with respect to fig. 1-22, to enable rapid data retrieval, some of the most significant bits may indicate the particular database server on which the coefficients and program identification should be stored.
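One possible reading of this sharding scheme is sketched below; the quantization and bit layout are assumptions, as the text does not specify them:

```python
def shard_for_coefficients(coeffs, num_servers=4):
    """Pick a database server from the most significant bits of a
    quantized coefficient key (a sketch; the actual hash layout
    used by the invention is not specified)."""
    # Quantize the first coefficient, assumed to lie in [-1, 1), to 8 bits
    clipped = max(-1.0, min(coeffs[0], 0.999))
    q = int((clipped + 1.0) * 128)  # 0..255
    msb = q >> 6                    # top 2 bits select the server
    return msb % num_servers
```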
Operation 2460 shows maintaining at least two reference matching databases, including at least one audio reference matching database and at least one video reference matching database, with which the system can independently identify the at least one content associated with the at least one client device in response to receiving at least one client coefficient corresponding to at least one audio frame presented by the at least one client device or at least one client coefficient corresponding to at least one video frame presented by the at least one client device. In some embodiments, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the system may receive video ingest in addition to audio ingest, thereby facilitating identification of programs using one or both of the stream of audio coefficients and/or the stream of video coefficients, which may be used to provide a more robust match by confirming identification using audio coefficients and using video coefficients or by providing the ability to switch between audio and video matches as needed if the signal is interrupted.
Fig. 25 illustrates an alternative embodiment of the exemplary operational flow 2300 of fig. 23. FIG. 25 illustrates an example embodiment in which the operational flow 2320 may include at least one additional operation. Additional operations may include operations 2510, 2520, 2530, and/or 2540.
Operation 2510 shows receiving at least one transmission from at least one client device comprising one or more of at least one television, at least one smart television, at least one media player, at least one set-top box, at least one gaming console, at least one a/V receiver, at least one internet-connected device, at least one computing device, or at least one streaming media device. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, a desktop applet may operate on a client device to convert an audio stream presentable on the client device into a coefficient stream for transmission to a matching server system. Many client devices present content and have the ability to perform data processing tasks simultaneously. In some cases, client actions may occur on the smart tv; in various embodiments, the client action occurs on a set-top box (e.g., a cable or satellite receiver) that receives the content and provides it to the television for playback.
Operation 2520 shows receiving at least one transport stream from at least one client device, the at least one transport stream including at least one sequence of client coefficients associated with one or more of at least one audio frame or at least one video frame renderable by the at least one client device to identify at least one content renderable by the at least one client device, the at least one sequence including at least some audio client coefficients. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the client device of the present invention sends coefficients corresponding to samples of audio content to the matching server system, the generation and sending of the coefficients occurring at certain intervals (which may be periodic or aperiodic and may change mid-stream). The client device may additionally transmit coefficients generated using pixel data from what the client device receives, but the invention disclosed herein at least sometimes transmits audio coefficients regardless of whether video coefficients are transmitted.
Operation 2530 shows receiving at least one transmission from at least one client device, the at least one transmission comprising at least one client coefficient corresponding to at least one audio frame renderable by the at least one client device, the at least one client coefficient corresponding to the at least one audio frame renderable by the at least one client device, the client coefficient determined at least in part by at least one transform that is the same as at least one transform used in maintaining the reference matching database. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the client device uses the same transform function (although not necessarily at the same rate as disclosed elsewhere herein) as utilized by the matching server system to obtain coefficients corresponding to audio content to be played on the speakers or audio external to the client device. The use of the same transformation by both systems means that at the same point in the program content, the resulting coefficient values produced by the client device and the matching server system will be substantially the same (subject to an overlap function that aligns the audio frames if framing starts with different time offsets on both systems).
Operation 2540 shows receiving at least one transmission from at least one client device, the at least one transmission including at least one client coefficient corresponding to at least one audio frame renderable by the at least one client device, the at least one client coefficient corresponding to the at least one audio frame renderable by the at least one client device, the client coefficient determined at least in part by sampling at least one audio stream into one or more frames and overlapping the one or more frames and then normalizing the overlapped one or more frames. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, where framing begins on a client device at a different time offset from that on the matching server system, frame overlap aligns the audio frames, which may occur when a client device is tuned to a new channel, for example, midway through a program being broadcast.
Fig. 26 illustrates an alternative embodiment of the exemplary operational flow 2300 of fig. 23. FIG. 26 illustrates an example embodiment where the operational flow 2330 may include at least one additional operation. Additional operations may include operations 2610, 2620, 2630, 2640, 2650, and/or 2660.
Operation 2610 illustrates obtaining one or more suspects from a reference match database associated with the video coefficients using the one or more video coefficients received from the at least one client device. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the path tracking algorithm obtains a plurality of suspects corresponding to consecutive video coefficients received by the matching server system. Video matching may work if the client device is generating an unchanged display of content; activation of an on-screen menu or television zoom mode or on-screen graphics (e.g., watermarks) added by the local broadcast device may cause the video matching to fail.
Operation 2620 then shows detecting a change in one or more media content from at least one client device. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the matching server system may detect that the probability of a particular bin identifying the correct program has fallen below a particular threshold, and thus no longer declare that bin a likely content identification. This may occur when the received video coefficients (sent during on-screen channel guide activity) do not adequately match the coefficients in the database. Alternatively, the desktop applet of the client device may detect activation of the on-screen channel guide and begin transmission of audio coefficients, or signal the matching server system to notify it of the activation.
Operation 2630 then shows toggling content identification to utilize the one or more audio coefficients received from the at least one client device to obtain further suspect(s) from the reference match database associated with the audio coefficients. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, when an interference occurs with the video match (e.g., detection and/or signaling related to an on-screen channel guide), the matching server system may switch to using the matching of the audio coefficients because the audio signal is typically not interrupted by the on-screen channel guide, or added watermarks, or other interference with the on-screen video (i.e., media content changes).
Operation 2620 may include at least one additional operation. Additional operations may include operation 2640 and/or operation 2650.
Operation 2640 shows receiving at least one indication of at least one of an on-screen graphic, a fade to black, or a video zoom mode associated with the at least one client device. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, as described above, the matching server system may detect specific media content changes, such as on-screen graphics, a fade to black, or a video zoom pattern, that would interfere with matching using video coefficients. Such detection may be performed when a content match does not match a program with sufficient confidence, likelihood, and/or probability. Alternatively, the client device may signal to the matching server system that a media content change, such as a zoom mode, is occurring. Such signals may cause the matching server system to begin using the audio coefficients.
Then, operation 2650 shows sending a signaling to switch to audio content recognition based at least in part on the at least one indication. In some embodiments, as shown in fig. 1-22 and/or described with respect to fig. 1-22, in the event that video matching does not work, the system may switch to using audio coefficients for identification. In some cases, the leaky bucket created in association with the video match is recreated and the time discount binning resumes after switching to the audio match. In other cases, the content matching operation retains the suspect from the video match in an existing bin and begins adding the suspect from the audio match to that bin, so that in the time interval that occurs immediately after switching to audio, the bin may have both a video suspect and an audio suspect, where the video suspect may leak out of the bucket first, but both the video suspect and the audio suspect will be used to declare identification.
Operation 2660 shows determining at least one identification of at least one content associated with at least one client device based at least in part on time-discounted binning of one or more suspects retrieved from the reference match database using at least one client coefficient corresponding to at least one audio frame presentable by the at least one client device. In some embodiments, as shown in fig. 1-22 and/or described with respect to fig. 1-22, upon receiving audio coefficients from a client device, the audio coefficients are used as a search query against the reference media database. One or more suspects corresponding to the audio coefficients are retrieved, each suspect being linked to a particular program identifier. Each suspect is placed in the bin assigned to its particular program. The process is repeated using each successively received audio coefficient, and the bin that receives the most suspects most likely corresponds to the program being viewed. Over time, the oldest suspects are removed (i.e., a "leaky bucket"), and when the channel on the client changes, suspects begin to enter a different bin in response to the different audio coefficients resulting from the channel change.
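The time-discounted ("leaky bucket") binning described above can be sketched as follows. This is a minimal illustration with hypothetical names; the decay constant is an assumption:

```python
class TimeDiscountBins:
    """Each search's suspects add weight to their program's bin
    while older evidence decays ("leaks") over time."""
    def __init__(self, decay=0.8):
        self.decay = decay
        self.bins = {}  # program_id -> accumulated weight

    def add_suspects(self, program_ids):
        for pid in self.bins:
            self.bins[pid] *= self.decay      # time discount (leak)
        for pid in program_ids:
            self.bins[pid] = self.bins.get(pid, 0.0) + 1.0

    def most_likely(self):
        return max(self.bins, key=self.bins.get) if self.bins else None
```

After a channel change, new suspects flow into a different bin, whose weight soon overtakes the decaying bin of the previous program.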
Fig. 27 illustrates an alternative embodiment of the exemplary operational flow 2300 of fig. 23. Fig. 27 illustrates an exemplary embodiment where the operational flow 2310 may include at least one additional operation 2710 and where the operational flow 2330 may include at least one additional operation 2720.
Operation 2710 illustrates storing one or more transformed power spectral coefficients associated with at least one audio portion of at least one ingested content, the at least one audio portion associated with at least one content identification. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the audio coefficients of the media ingest operation begin as frames of audio content ingested during a sample having a frame size small enough that the power spectral density corresponding to the ingested audio signal remains effectively constant throughout the sample. The frames are converted into data using the operations disclosed herein, which is then stored in a reference media database and associated with the identity of the program being ingested.
Operation 2720 then shows time-discounted binning of one or more suspects obtained from the reference match database based, at least in part, on one or more received transformed power spectral coefficients associated with at least one audio content renderable by at least one client device. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the audio coefficients transmitted by the client device likewise begin as frames of audio content corresponding to an audio portion of the program being played on the client device, the frames being obtained during sampling with a frame size small enough that the power spectral density of the audio signal of the program being played remains effectively constant throughout the sample. Matching the coefficients of the ingested known content against the coefficients of the client device playing unknown content results in the identification of the content being played by the client device.
Fig. 28 illustrates an alternative embodiment of the exemplary operational flow 2300 of fig. 23. Fig. 28 illustrates an exemplary embodiment in which operational flow 2300 may include at least one additional operation. Additional operations may include operations 2810, 2820, 2830, 2840, and/or 2850.
Operation 2810 shows continuously identifying at least one content associated with at least one client device based at least in part on continuously maintaining a reference matching database, continuously receiving transmissions from the at least one client device, and continuously searching the reference matching database using client coefficients associated with subsequent transmissions as search terms. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, coefficients received from a client device are used as a search query against a reference media database, and the query results are used for a time discount binning operation. Subsequent coefficients are received from the client device and used as a subsequent database search, the results of which are used for the time discount binning operation. Program identification occurs if sufficient audio coefficients are received from the client device. If the channel on the client device changes, the coefficient flow continues and a different program identification may follow. Thus, the audio match is a continuous audio match that will continue even when the channel changes. The operational flow then proceeds to an end operation.
Operation 2820 shows maintaining a second reference match database that includes at least one coefficient corresponding to at least one video frame of at least one ingested content and at least one content identification corresponding to the at least one ingested content. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, in addition to generating the audio coefficient stream for storage in the reference match database during the ingest operation, a video coefficient stream may be generated for storage in a reference match database corresponding to the video. The databases may be placed on different servers or server farms for optimal performance.
Operation 2830 shows changing a content identification method associated with the at least one client device, the changing of the content identification method including at least one of: switching from content identification based on video coefficients to content identification based on audio coefficients, or vice versa. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the content identification operation may switch between matching using audio coefficients and matching using video coefficients as needed; for example, if an interruption occurs in either the audio or the video, the matching may switch to the other method. The operational flow then proceeds to an end operation.
Operation 2840 shows controlling at least one client device, including at least signaling the at least one client device to switch from transmitting client coefficients corresponding to video frames to transmitting client coefficients corresponding to audio frames. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, if the content recognition operation cannot reliably select a program identification based on a stream of video coefficients from the client device, the matching server system may send a command to the client device over the internet to begin sending audio coefficients instead of, or in addition to, the video coefficients so that content recognition may be attempted using the audio coefficients. The reverse process is also possible (i.e., the matching server system may instruct the client to begin sending video coefficients instead of, or in addition to, audio coefficients). The operational flow then proceeds to an end operation.
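The server-side decision in operation 2840 could be sketched as a small policy function: when matching on the current coefficient stream is unreliable, signal the client to switch to the other stream. The confidence score and threshold here are hypothetical assumptions for illustration; the specification does not define them.

```python
from enum import Enum

class CueType(Enum):
    AUDIO = "audio"
    VIDEO = "video"

def choose_cue_command(current: CueType, match_confidence: float,
                       low: float = 0.2) -> CueType:
    """Return the coefficient type the server should request from the client.

    If confidence in the current stream falls below `low` (an illustrative
    threshold), switch to the other stream; otherwise keep the current one.
    """
    if match_confidence < low:
        return CueType.VIDEO if current is CueType.AUDIO else CueType.AUDIO
    return current
```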
Operation 2850 shows controlling at least one client device, including at least signaling the at least one client device to transmit client coefficients corresponding to audio frames at a particular rate. For example, as shown in fig. 1-22 and/or described with respect to fig. 1-22, the rate at which audio coefficients are sent by the client device need not be the same as the rate at which audio coefficients are generated during ingestion. Once a program is initially identified, the matching server system may instruct the client device to send coefficients less frequently. Alternatively, when more accurate and/or faster identification is important, the matching server system may instruct the client device to send coefficients more frequently. The operational flow then proceeds to an end operation.
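The rate control in operation 2850 might be sketched as follows; the interval values are illustrative assumptions only, not rates from the specification.

```python
def coefficient_send_interval(identified: bool,
                              base_interval: float = 0.25,
                              relaxed_interval: float = 1.0) -> float:
    """Return the interval (seconds) at which the client should send coefficients.

    Before a program is identified (or after a channel change invalidates the
    identification) the server requests the faster rate; once identified, the
    client can be told to send less frequently.
    """
    return relaxed_interval if identified else base_interval
```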
Certain aspects of the present invention include the process steps and instructions described herein in the form of an algorithm. It should be noted that the process steps and instructions of the present invention could be embodied in software, firmware or hardware, and when embodied in software, could be downloaded to reside on and be operated from different platforms used by real time network operating systems.
The present invention also relates to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may comprise a general-purpose computer selectively activated or reconfigured by a computer program stored in the computer. Such a computer program may be stored in a computer readable storage medium, such as, but not limited to, any type of disk including floppy disks, optical disks, CD-ROMs, magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), EPROMs, EEPROMs, magnetic or optical cards, Application Specific Integrated Circuits (ASICs), or any type of media suitable for storing electronic instructions, each coupled to a computer system bus.
Furthermore, the computers or computing devices referred to in this specification may include a single processor or may employ a multi-processor design to increase computing power.
The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Various general-purpose systems may also be used with programs in accordance with the teachings herein, or it may prove convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these systems will appear from the description above. In addition, the present invention is not described with reference to any particular programming language or operating system. It will be appreciated that a variety of programming languages and operating systems may be used to implement the teachings of the invention as described herein.
The systems and methods, flow diagrams, and structure block diagrams described in this specification can be implemented in computer processing systems that include program code comprising program instructions that are executable by the computer processing systems. Other implementations may also be used. Furthermore, the flowchart and block diagrams described herein describe specific methods and/or corresponding acts in support of steps and corresponding functions in support of disclosed structural means, which may also be utilized to implement corresponding software structures and algorithms and their equivalents.
Embodiments of the subject matter described in this specification can be implemented as one or more computer program products, i.e., one or more modules of computer program instructions encoded on a tangible program carrier for execution by, or to control the operation of, data processing apparatus. The computer readable medium can be a machine-readable storage device, a machine-readable storage substrate, a storage device, or a combination of one or more of them.
A computer program (also known as a program, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a suitable communication network.
The processes and logic flows described in this specification can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic disks, magneto-optical disks, or optical disks. However, a computer need not have such devices. Processors suitable for the execution of a computer program include, by way of example only, and not by way of limitation, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both.
To provide for interaction with a user or administrator of the system described herein, embodiments of the subject matter described in this specification can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other types of devices may also be used to provide for interaction with a user. For example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input.
Embodiments of the subject matter described in this specification can be implemented in a computing system that includes one or more back-end components (e.g., one or more data servers), or that includes one or more middleware components (e.g., application servers), or that includes a front-end component (e.g., a client computer having a graphical user interface or a Web browser through which a user or administrator can interact with some implementations of the subject matter described in this specification), or any combination of one or more such back-end components, middleware components, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, such as a communication network. The computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The foregoing detailed description has set forth various embodiments of the devices and/or processes via the use of block diagrams, flowcharts, and/or examples. Insofar as such block diagrams, flowcharts, and/or examples contain one or more functions and/or operations, it will be understood by those within the art that each function and/or operation within such block diagrams, flowcharts, or examples can be implemented, individually and/or collectively, by a wide range of hardware, software, firmware, or virtually any combination thereof. In one embodiment, portions of the subject matter described herein may be implemented by an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), or other integrated format. In any event, those skilled in the art will recognize that some aspects of the embodiments disclosed herein, in whole or in part, can be equivalently implemented in standard integrated circuits, as one or more computer programs running on one or more computers (e.g., as one or more programs running on one or more computer systems), as one or more programs running on one or more processors (e.g., as one or more programs running on one or more microprocessors), as firmware, or as virtually any combination thereof, and that designing the circuitry and/or writing the code for the software and/or firmware would be well within the skill of one skilled in the art in light of this disclosure. Moreover, those skilled in the art will appreciate that the mechanisms of the subject matter described herein are capable of being distributed as a program product in a variety of forms, and that an illustrative embodiment of the subject matter described herein applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution.
Examples of signal bearing media include, but are not limited to, the following: recordable type media such as floppy disks, hard disk drives, CD ROMs, digital magnetic tape, and computer memory; and transmission type media such as digital and analog communication links using TDM or IP based communication links (e.g., packet links).
Those skilled in the art will recognize that the state of the art has progressed to the point that there is little distinction left between hardware and software implementations of aspects of systems; the use of hardware or software is generally (but not always, in that in certain contexts the choice between hardware and software can become significant) a design choice representing a cost versus efficiency tradeoff. Those skilled in the art will appreciate that there are a variety of vehicles by which processes and/or systems and/or other technologies described herein can be effected (e.g., hardware, software, and/or firmware), and that the preferred vehicle will vary with the context in which the processes and/or systems and/or other technologies are deployed. For example, if the implementer determines that speed and accuracy are paramount, the implementer may opt for a mainly hardware and/or firmware vehicle; alternatively, if flexibility is paramount, the implementer may opt for a mainly software implementation; or, yet again alternatively, the implementer may opt for some combination of hardware, software, and/or firmware. Thus, there are several possible vehicles by which the processes and/or apparatus and/or other techniques described herein may be implemented, none of which is inherently superior to the other, as any vehicle to be used is a choice dependent upon the context in which the vehicle is to be deployed and the particular considerations of the implementer (e.g., speed, flexibility, or predictability), any of which may vary. Those skilled in the art will recognize that optical aspects of the embodiments will typically employ optically oriented hardware, software, and/or firmware.
Aspects described herein depict different components contained within, or associated with, different other components. It is to be understood that such depicted architectures are merely exemplary, and that in fact many other architectures can be implemented which achieve the same functionality. In a conceptual sense, any arrangement of components to achieve the same functionality is effectively "associated" such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality can be seen as "associated with" each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being "operably connected," or "operably coupled," to each other to achieve the desired functionality, and any two components capable of being so associated can also be viewed as being "operably couplable," to each other to achieve the desired functionality. Specific examples of components that may be operably coupled include, but are not limited to, physically mateable and/or physically interacting components and/or wirelessly interactable and/or wirelessly interacting components and/or logically interacting and/or logically interactable components.
While particular aspects of the present subject matter described herein have been shown and described, it will be obvious to those skilled in the art that, based upon the teachings herein, changes and modifications may be made without departing from the subject matter described herein and its broader aspects and, therefore, the appended claims are to encompass within their scope all such changes and modifications as are within the true spirit and scope of the present subject matter described herein. Furthermore, it is to be understood that the invention is defined by the appended claims. It will be understood by those within the art that, in general, terms used herein, and especially in the appended claims (e.g., bodies of the appended claims), are generally intended as "open" terms (e.g., the term "including" should be interpreted as "including but not limited to," the term "having" should be interpreted as "having at least," the term "includes" should be interpreted as "includes but is not limited to," etc.). It will be further understood by those within the art that if a specific number of an introduced claim recitation is intended, such an intent will be explicitly recited in the claim, and in the absence of such recitation no such intent is present. For example, as an aid to understanding, the following appended claims may contain usage of the introductory phrases "at least one" and "one or more" to introduce claim recitations.
However, the use of such phrases should not be construed to imply that the introduction of a claim recitation by the indefinite articles "a" or "an" limits any particular claim containing such introduced claim recitation to inventions containing only one such recitation, even when the same claim includes the introductory phrases "one or more" or "at least one" and indefinite articles such as "a" or "an" (e.g., "a" and/or "an" should typically be interpreted to mean "at least one" or "one or more"); the same holds true for the use of definite articles used to introduce claim recitations. Furthermore, even if a specific number of an introduced claim recitation is explicitly recited, those skilled in the art will recognize that such recitation should typically be interpreted to mean at least the recited number (e.g., the bare recitation of "two recitations," without other modifiers, typically means at least two recitations, or two or more recitations). Further, in those instances where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.). In those instances where a convention analogous to "at least one of A, B, or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, or C" would include, but not be limited to, systems having A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any inventions or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment.
Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Furthermore, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Related art:
KABAL (P.), RAMACHANDRAN (R.P.): The computation of line spectral frequencies using Chebyshev polynomials, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 34, no. 6, pp. 1419-1426, 1986.
ITAKURA (F.): Line spectrum representation of linear predictor coefficients of speech signals, J. Acoust. Soc. Amer., vol. 57, supplement no. 1, S35, 1975.
BISTRITZ (Y.), PELLER (S.): Immittance Spectral Pairs (ISP) for speech encoding, Proc. ICASSP'93 (1993 International Conference on Acoustics, Speech, and Signal Processing), pages II-9 to II-12.
Neumeier, U.S. 8,595,781: Methods for identifying video segments and displaying contextually targeted content on a connected television.
Neumeier, U.S. 8,769,584 B2: Methods for displaying contextually targeted content on a connected television.
Neumeier, U.S. 9,055,335: Systems and methods for addressing a media database using distance associative hashing.
Audible Magic, U.S. 5,918,223: Method and article of manufacture for content-based analysis, storage, retrieval, and segmentation of audio information.
- Civolution, U.S. 8,959,202 B2
- Shazam, U.S. 6,990,453
- Zeitera, Audio Matching, U.S. application No. 14/589,366
Claims (20)
1. A computer-implemented method for identifying one or more unknown media segments, the method comprising:
receiving an audio cue related to an unknown media segment, wherein the audio cue comprises an autocorrelation representation of audio frames identified in the unknown media segment;
identifying a plurality of reference audio cues in a reference audio cue database, wherein the plurality of reference audio cues are determined to match the received audio cue, and wherein a reference audio cue of the plurality of reference audio cues comprises an autocorrelation representation of audio frames identified in a known media segment;
adding a first token to a first bin associated with a first known media segment, wherein the first token is added to the first bin based on determining a match between the received audio cue associated with the unknown media segment and a first reference audio cue associated with the first known media segment;
adding a second token to a second bin related to a second known media segment, wherein the second token is added to the second bin based on determining a match between the received audio cue related to the unknown media segment and a second reference audio cue related to the second known media segment;
determining that a number of tokens in the first bin exceeds a value; and
identifying the unknown media segment as matching the first known media segment when it is determined that the number of tokens in the first bin exceeds the value.
2. The computer-implemented method of claim 1, wherein the autocorrelation representation of the audio frame comprises one or more coefficients.
3. The computer-implemented method of claim 2, the method further comprising generating the one or more coefficients, wherein generating the one or more coefficients comprises: applying an autocorrelation function to the audio frame.
4. The computer-implemented method of claim 3, the method further comprising applying one or more transform functions to the one or more coefficients to generate one or more transformed coefficients.
5. The computer-implemented method of claim 4, wherein the one or more transform functions comprise at least a linear predictive coding function.
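Claims 2-5 describe deriving autocorrelation coefficients from an audio frame and transforming them with linear predictive coding. A self-contained sketch of that pipeline follows, using the standard Levinson-Durbin recursion; the function names, frame handling, and analysis order are assumptions of this illustration, not the claimed implementation.

```python
import numpy as np

def autocorrelation(frame: np.ndarray, order: int) -> np.ndarray:
    """Short-time autocorrelation r[0..order] of one audio frame."""
    frame = frame - frame.mean()
    return np.array([np.dot(frame[: len(frame) - k], frame[k:])
                     for k in range(order + 1)])

def lpc_from_autocorr(r: np.ndarray) -> np.ndarray:
    """Levinson-Durbin recursion: autocorrelation -> LPC coefficients.

    Returns [1, a1, .., ap] for the prediction error filter
    A(z) = 1 + a1*z^-1 + .. + ap*z^-p.
    """
    order = len(r) - 1
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # updated prediction error
    return np.array(a)
```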
6. The computer-implemented method of claim 4, further comprising applying one or more normalization functions to the one or more transformed coefficients.
7. The computer-implemented method of claim 6, wherein applying the one or more normalization functions to the one or more transformed coefficients comprises: quantizing the one or more transformed coefficients.
8. The computer-implemented method of claim 4, the method further comprising applying one or more additional transform functions to the one or more transformed coefficients to generate one or more further transformed coefficients.
9. The computer-implemented method of claim 8, wherein the one or more additional transform functions comprise at least one of a Line Spectral Pair (LSP) transform function or an Immittance Spectral Frequency (ISF) transform function.
10. The computer-implemented method of claim 8, the method further comprising applying one or more normalization functions to the one or more further transformed coefficients.
11. The computer-implemented method of claim 10, wherein applying the one or more normalization functions to the one or more further transformed coefficients comprises: quantizing the one or more further transformed coefficients.
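Claims 8-11 add a further transform such as a Line Spectral Pair (LSP) representation. One common construction is sketched below under the usual assumptions (a stable, minimum-phase LPC polynomial): form the sum and difference polynomials, whose roots lie on the unit circle, and take their angles as the line spectral frequencies. This is an illustrative sketch, not the claimed method's actual transform.

```python
import numpy as np

def lsp_from_lpc(a: np.ndarray) -> np.ndarray:
    """LPC coefficients [1, a1, .., ap] -> sorted line spectral frequencies (rad).

    P(z) = A(z) + z^-(p+1) * A(1/z)  (sum polynomial)
    Q(z) = A(z) - z^-(p+1) * A(1/z)  (difference polynomial)
    For a minimum-phase A(z), the roots of P and Q lie on the unit circle and
    their angles interleave; the positive angles form the LSP representation.
    """
    rev = a[::-1]                                  # reversed coefficient vector
    P = np.append(a, 0.0) + np.append(0.0, rev)    # sum polynomial coefficients
    Q = np.append(a, 0.0) - np.append(0.0, rev)    # difference polynomial coefficients
    roots = np.concatenate([np.roots(P), np.roots(Q)])
    freqs = np.angle(roots)
    # keep the positive angles (drop the trivial root at z = 1 and conjugates)
    return np.sort(freqs[freqs > 1e-9])
```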
12. The computer-implemented method of claim 1, wherein an audio frame comprises a period of time in an audio portion of a media segment, and wherein the audio portion has fixed audio signal characteristics during the period of time.
13. The computer-implemented method of claim 1, wherein the audio frames have a fixed size.
14. The computer-implemented method of claim 1, wherein a reference audio cue is determined to match the received audio cue when the autocorrelation representation of an audio frame identified in the known media segment is within a range of the autocorrelation representation of an audio frame identified in the unknown media segment.
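The "within a range" comparison of claim 14 might look like the following sketch; the per-coefficient tolerance and the element-wise comparison are illustrative assumptions, not the claimed method's actual parameters (a distance metric over the whole coefficient vector would serve the same purpose).

```python
import numpy as np

def cues_match(unknown: np.ndarray, reference: np.ndarray,
               tolerance: float = 0.05) -> bool:
    """Return True when every coefficient of the unknown frame's
    autocorrelation representation lies within `tolerance` of the
    corresponding reference coefficient."""
    return bool(np.all(np.abs(unknown - reference) <= tolerance))
```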
15. The computer-implemented method of claim 1, the method further comprising:
determining content related to the known media segment;
retrieving the content from a database;
sending the retrieved content, wherein the retrieved content is addressed to a media system, and wherein the received audio cue is received from the media system.
16. The computer-implemented method of claim 1, the method further comprising:
removing one or more tokens from the first bin when a period of time has elapsed.
17. A computing device for identifying one or more unknown media segments, the computing device comprising:
a storage device;
one or more processors; and
a non-transitory machine-readable storage medium comprising instructions that, when executed on the one or more processors, cause the one or more processors to perform operations comprising:
receiving an audio cue related to an unknown media segment, wherein the audio cue comprises an autocorrelation representation of audio frames identified in the unknown media segment;
identifying a plurality of reference audio cues in a reference audio cue database, wherein the plurality of reference audio cues are determined to match the received audio cue, and wherein a reference audio cue of the plurality of reference audio cues comprises an autocorrelation representation of an audio frame identified in a known media segment;
adding a first token to a first bin associated with a first known media segment, wherein the first token is added to the first bin based on determining a match between the received audio cue associated with the unknown media segment and a first reference audio cue associated with the first known media segment;
adding a second token to a second bin related to a second known media segment, wherein the second token is added to the second bin based on determining a match between the received audio cue related to the unknown media segment and a second reference audio cue related to the second known media segment;
determining that a number of tokens in the first bin exceeds a value; and
identifying the unknown media segment as matching the first known media segment when it is determined that the number of tokens in the first bin exceeds the value.
18. The computing device of claim 17, wherein the autocorrelation representation of the audio frame comprises one or more coefficients, the computing device further comprising instructions that, when executed on the one or more processors, cause the one or more processors to perform operations comprising generating the one or more coefficients by applying an autocorrelation function to the audio frame.
19. The computing device of claim 18, further comprising instructions that when executed on the one or more processors cause the one or more processors to perform operations comprising applying one or more transform functions to the one or more coefficients to generate one or more transformed coefficients.
20. A computer program product, the computer program product being tangibly embodied in a non-transitory machine-readable storage medium of a computing device, the non-transitory machine-readable storage medium including instructions configured to cause one or more processors to:
receiving an audio cue related to an unknown media segment, wherein the audio cue comprises an autocorrelation representation of audio frames identified in the unknown media segment;
identifying a plurality of reference audio cues in a reference audio cue database, wherein the plurality of reference audio cues are determined to match the received audio cue, and wherein a reference audio cue of the plurality of reference audio cues comprises an autocorrelation representation of audio frames identified in a known media segment;
adding a first token to a first bin associated with a first known media segment, wherein the first token is added to the first bin based on determining a match between the received audio cue associated with the unknown media segment and a first reference audio cue associated with the first known media segment;
adding a second token to a second bin associated with a second known media segment, wherein the second token is added to the second bin based on determining a match between the received audio cue associated with the unknown media segment and a second reference audio cue associated with the second known media segment;
determining that a number of tokens in the first bin exceeds a value; and
identifying the unknown media segment as matching the first known media segment when it is determined that the number of tokens in the first bin exceeds the value.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US62/086,113 | 2014-12-01 | | |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK40036392A HK40036392A (en) | 2021-05-28 |
| HK40036392B true HK40036392B (en) | 2022-12-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11863804B2 (en) | System and method for continuous media segment identification | |
| US11990143B2 (en) | Multi-mode audio recognition and auxiliary data encoding and decoding | |
| US10236006B1 (en) | Digital watermarks adapted to compensate for time scaling, pitch shifting and mixing | |
| US8825188B2 (en) | Methods and systems for identifying content types | |
| US10026410B2 (en) | Multi-mode audio recognition and auxiliary data encoding and decoding | |
| JP5826291B2 (en) | Extracting and matching feature fingerprints from speech signals | |
| JP4743228B2 (en) | DIGITAL AUDIO SIGNAL ANALYSIS METHOD, ITS DEVICE, AND VIDEO / AUDIO RECORDING DEVICE | |
| US20260052283A1 (en) | System and method for continuous media segment identification | |
| HK40036392A (en) | System and method for continuous media segment identification | |
| HK40036392B (en) | System and method for continuous media segment identification | |
| Kim et al. | Robust audio fingerprinting method using prominent peak pair based on modulated complex lapped transform | |
| HK1248943B (en) | System and method for continuous media segment identification | |
| BR112017011522B1 (en) | COMPUTER IMPLEMENTED METHOD | |
| Yang et al. | Multi-stage encoding scheme for multiple audio objects using compressed sensing |