US20140181668A1 - Visual summarization of video for quick understanding - Google Patents
Visual summarization of video for quick understanding
- Publication number
- US20140181668A1 (application US 13/722,754)
- Authority
- US
- United States
- Prior art keywords
- video
- window
- emotion
- audio
- indicia
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/47—End-user applications
- H04N21/472—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content
- H04N21/47217—End-user interface for requesting content, additional data or services; End-user interface for interacting with content, e.g. for content reservation or setting reminders, for requesting event notification, for manipulating displayed content for controlling playback functions for recorded or on-demand content, e.g. using progress bars, mode or play-point indicators or bookmarks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/01—Input arrangements or combined input and output arrangements for interaction between user and computer
- G06F3/048—Interaction techniques based on graphical user interfaces [GUI]
- G06F3/0484—Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234318—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by decomposing into objects, e.g. MPEG-4 objects
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/266—Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/83—Generation or processing of protective or descriptive data associated with content; Content structuring
- H04N21/845—Structuring of content, e.g. decomposing content into time segments
- H04N21/8456—Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/85—Assembly of content; Generation of multimedia applications
- H04N21/854—Content authoring
- H04N21/8549—Creating video summaries, e.g. movie trailer
Definitions
- This disclosure relates generally to graphical user interfaces, and more particularly, to visually summarizing a video in a way that facilitates quick understanding by a viewer of the types and locations of particular types of content.
- a television show, movie, internet video, or other similar content may be stored on a disc or in other memory using a container or wrapper file format.
- the container format may be used to specify how multiple different data files are to be used.
- the container format for a video may identify different data types and describe how they are to be interleaved when the video is played.
- a container may contain video files, audio files, subtitle files, chapter-information files, metadata, and other files.
- a container also typically includes a file that specifies synchronization information needed for simultaneous playback of the various files.
- One format for digital video files is the DVD-Video format.
- Another format for digital video files is Audio Video Interleaved (“AVI”). Audio may be stored in various formats, such as the PCM, DTS, MPEG-1 Audio Layer II (MP2), or Dolby Digital (AC-3) formats.
- a multimedia video generally includes a large amount of perceptual information, i.e., information such as images and sounds that are perceived by viewers.
- the frames of a video file may show humans, who may or may not be actors, and a wide variety of nonhuman objects.
- a nonhuman object may be a background, such as a natural indoor or outdoor location, or a professional stage or set.
- a nonhuman object may also be a prop or other visual element in front of the background object.
- Yet another type of nonhuman object that may be shown in a video frame is text. For instance, words spoken by humans may be displayed as text in a particular area of the frames.
- Segments of an audio file may be synchronously played with the display of video frames. These segments may include spoken words, music, and a wide variety of sound effects.
- an audio-video file may be as short as a few minutes
- the typical video, such as a television show or a full-length movie, ranges in length from 20 minutes to over two hours.
- the typical video may include many scenes, each corresponding with a particular segment of the video. For example, a movie may have between 50 and 200 scenes. A minor scene may be one minute or less. A major scene may be three or more minutes. Each scene may include many frames and may include one or more camera shots.
- a scene may be accompanied by spoken dialog, a particular musical score or set of sound effects, or a combination of sound types. Particular human and nonhuman objects may appear in a scene.
- a scene may be intended by the creator to invoke particular emotions or moods, or to convey a theme of the story.
- One embodiment is directed to a method that visually summarizes the types and locations of particular types of content in a video in a way that facilitates understanding by a viewer.
- the method may include determining one or more semantic segments of the video.
- the method may include determining one or more emotion objects for at least one of the semantic segments.
- the method may include generating a user interface on a display screen.
- the user interface may include one window, and in another embodiment, the user interface may include two windows.
- the method may include displaying first indicia of the emotion object in a first window. The horizontal extent of the first window corresponds with the temporal length of the video and the first indicia are displayed at a location corresponding with the temporal appearance of the emotion object in the video.
- Additional embodiments are directed to a non-transitory computer-readable storage medium having executable code stored thereon to cause a machine to perform a method for rendering a summary of a video, and to a system that visually summarizes the types and locations of particular types of content in a video in a way that facilitates understanding by a viewer.
- FIG. 1 depicts a high-level block diagram of an exemplary computer system for implementing various embodiments.
- FIG. 2 is a block diagram of an exemplary audio-visual file container according to one embodiment.
- FIG. 3 is a block diagram of a process for visually summarizing a video in a way that facilitates quick understanding by a viewer of the types and locations of particular types of content according to an embodiment.
- FIG. 4 depicts a display screen displaying a user interface according to one embodiment.
- FIG. 5 illustrates one embodiment of a process for generating visual tags.
- FIG. 6 illustrates a process for generating audio and key word tags according to one embodiment.
- FIG. 7 depicts a display screen displaying a user interface according to an embodiment.
- FIG. 8 depicts a display screen displaying a user interface according to an embodiment.
- a multimedia video generally includes a large amount of perceptual information, i.e., information such as images and sounds that may be perceived by viewers.
- a video may show human and nonhuman objects.
- a video may include spoken words, music, and other sounds, which may be referred to herein as audio objects.
- a video may evoke various emotions, moods, or themes, which may be referred to herein as emotion objects.
- the spoken words may include “key words.”
- a key word may be a word that provides significant information content about a scene in a video. These objects and key words may be used to describe a scene to a viewer.
- visual representations of key words, and human, nonhuman, audio, and emotion objects may be used to describe the scenes of a video to a viewer.
- visual representations of the relationships between these objects and key words may be used to describe the scenes of a video to a viewer. By visually presenting this information to the viewer, he or she may be enabled to generally understand the scene. The information may enable the viewer to determine whether a particular scene is of interest or is objectionable. In various embodiments, visual information summarizing all of the scenes of a video may be presented to the viewer in a single display screen.
- a viewer selects a video, and human, nonhuman, and audio objects of the video are identified.
- key words that are spoken by human objects in the video are identified.
- Human, nonhuman, and audio objects may be used to classify a particular segment of a video as a scene.
- the objects and key words are then associated with the scenes of the video.
- the objects, key words, and other data may be used to determine an emotion, mood, or theme for one or more of the scenes, and to generate corresponding emotion objects.
- the objects and key words may be compared with profile information to determine an attitude or preference of a viewer regarding the scenes of the video.
- a viewer's attitude may be, for example, that he or she likes, dislikes, or finds a particular type of content objectionable.
- visual representations of key words, and human, nonhuman, and audio objects summarizing all of the scenes of a video are presented to the viewer in a single display screen.
- visual representations of a viewer's attitudes or preferences toward a particular object or key word may be displayed.
- a display screen may include a first window for playing the video and a second window for rendering text, symbols, and icons corresponding with human, nonhuman, audio, and emotion objects, and key words.
- the second window may also include a visual indication of a viewer's attitude regarding particular human, nonhuman, audio, and emotion objects, and key words.
- a viewer may select one or more scenes for playing in the first window.
- One or more other scenes of the video may be identified as scenes to be recommended to the viewer.
- the recommended scenes may be other scenes that have human, nonhuman, audio, and emotion objects, and key words that are similar to the scene selected by the viewer.
- FIG. 1 depicts a high-level block diagram of an exemplary computer system 100 for implementing various embodiments.
- the mechanisms and apparatus of the various embodiments disclosed herein apply equally to any appropriate computing system.
- the major components of the computer system 100 include one or more processors 102 , a memory 104 , a terminal interface 112 , a storage interface 114 , an I/O (Input/Output) device interface 116 , and a network interface 118 , all of which are communicatively coupled, directly or indirectly, for inter-component communication via a memory bus 106 , an I/O bus 108 , bus interface unit 109 , and an I/O bus interface unit 110 .
- the computer system 100 may contain one or more general-purpose programmable central processing units (CPUs) 102 A and 102 B, herein generically referred to as the processor 102 .
- the computer system 100 may contain multiple processors typical of a relatively large system; however, in another embodiment, the computer system 100 may alternatively be a single CPU system.
- Each processor 102 executes instructions stored in the memory 104 and may include one or more levels of on-board cache.
- the memory 104 may include a random-access semiconductor memory, storage device, or storage medium (either volatile or non-volatile) for storing or encoding data and programs.
- the memory 104 represents the entire virtual memory of the computer system 100 , and may also include the virtual memory of other computer systems coupled to the computer system 100 or connected via a network.
- the memory 104 is conceptually a single monolithic entity, but in other embodiments the memory 104 is a more complex arrangement, such as a hierarchy of caches and other memory devices.
- memory may exist in multiple levels of caches, and these caches may be further divided by function, so that one cache holds instructions while another holds non-instruction data, which is used by the processor or processors.
- Memory may be further distributed and associated with different CPUs or sets of CPUs, as is known in any of various so-called non-uniform memory access (NUMA) computer architectures.
- the memory 104 may store all or a portion of the following: an audio visual file container 150 (shown in FIG. 2 as container 202 ), a video processing module 152 , an audio processing module 154 , and a control module 156 . These modules are illustrated as being included within the memory 104 in the computer system 100 , however, in other embodiments, some or all of them may be on different computer systems and may be accessed remotely, e.g., via a network.
- the computer system 100 may use virtual addressing mechanisms that allow the programs of the computer system 100 to behave as if they only have access to a large, single storage entity instead of access to multiple, smaller storage entities.
- Although the audio visual file container 150 , video processing module 152 , audio processing module 154 , and control module 156 are illustrated as being included within the memory 104 , these components are not necessarily all completely contained in the same storage device at the same time. Further, although the audio visual file container 150 , video processing module 152 , audio processing module 154 , and control module 156 are illustrated as being separate entities, in other embodiments some of them, portions of some of them, or all of them may be packaged together.
- the video processing module 152 , audio processing module 154 , and control module 156 may include instructions or statements that execute on the processor 102 or instructions or statements that are interpreted by instructions or statements that execute on the processor 102 to carry out the functions as further described below.
- the video processing module 152 , audio processing module 154 , and control module 156 are implemented in hardware via semiconductor devices, chips, logical gates, circuits, circuit cards, and/or other physical hardware devices in lieu of, or in addition to, a processor-based system.
- the video processing module 152 , audio processing module 154 , and control module 156 may include data in addition to instructions or statements.
- the video processing module 152 may include various processes that generate visual tags according to one embodiment.
- the audio processing module 154 may include various processes for generating audio and key word tags according to one embodiment.
- the control module 156 may include various processes for visually summarizing a video in a way that facilitates quick understanding by a viewer of the types and locations of particular types of content according to an embodiment.
- the control module 156 may include various processes for rendering all or selected portions of a video, and rendering a user interface, such as the one shown in FIG. 4 .
- the control module 156 may include various processes for identifying scenes to be recommended to a viewer, as well as other processes described herein.
- the computer system 100 may include a bus interface unit 109 to handle communications among the processor 102 , the memory 104 , a display system 124 , and the I/O bus interface unit 110 .
- the I/O bus interface unit 110 may be coupled with the I/O bus 108 for transferring data to and from the various I/O units.
- the I/O bus interface unit 110 communicates with multiple I/O interface units 112 , 114 , 116 , and 118 , which are also known as I/O processors (IOPs) or I/O adapters (IOAs), through the I/O bus 108 .
- the display system 124 may include a display controller, a display memory, or both.
- the display controller may provide video, audio, or both types of data to a display device 126 .
- the display memory may be a dedicated memory for buffering frames of video data.
- the display system 124 may be coupled with a display device 126 , such as a standalone display screen, computer monitor, television, or a tablet or handheld device display.
- the display device 126 may include one or more speakers for rendering audio.
- one or more speakers for rendering audio may be coupled with an I/O interface unit.
- one or more of the functions provided by the display system 124 may be on board a processor 102 integrated circuit.
- one or more of the functions provided by the bus interface unit 109 may be on board a processor 102 integrated circuit.
- the I/O interface units support communication with a variety of storage and I/O devices.
- the terminal interface unit 112 supports the attachment of one or more viewer I/O devices 120 , which may include viewer output devices (such as a video display device, speaker, and/or television set) and viewer input devices (such as a keyboard, mouse, keypad, touchpad, trackball, buttons, light pen, or other pointing device).
- a viewer may manipulate the user input devices using a user interface, in order to provide input data and commands to the user I/O device 120 and the computer system 100 , and may receive output data via the user output devices.
- a user interface may be presented via the user I/O device 120 , such as displayed on a display device, played via a speaker, or printed via a printer.
- the storage interface 114 supports the attachment of one or more disk drives or direct access storage devices 122 (which are typically rotating magnetic disk drive storage devices, although they could alternatively be other storage devices, including arrays of disk drives configured to appear as a single large storage device to a host computer, or solid-state drives, such as flash memory).
- the storage device 122 may be implemented via any type of secondary storage device.
- the contents of the memory 104 , or any portion thereof, may be stored to and retrieved from the storage device 122 as needed.
- the I/O device interface 116 provides an interface to any of various other I/O devices or devices of other types, such as printers or fax machines.
- the network interface 118 provides one or more communication paths from the computer system 100 to other digital devices and computer systems; these communication paths may include, e.g., one or more networks.
- Although the computer system 100 shown in FIG. 1 illustrates a particular bus structure providing a direct communication path among the processors 102 , the memory 104 , the bus interface 109 , the display system 124 , and the I/O bus interface unit 110 , in other embodiments the computer system 100 may include different buses or communication paths, which may be arranged in any of various forms, such as point-to-point links in hierarchical, star or web configurations, multiple hierarchical buses, parallel and redundant paths, or any other appropriate type of configuration.
- Although the I/O bus interface unit 110 and the I/O bus 108 are shown as single respective units, the computer system 100 may, in fact, contain multiple I/O bus interface units 110 and/or multiple I/O buses 108 . While multiple I/O interface units are shown, which separate the I/O bus 108 from various communications paths running to the various I/O devices, in other embodiments, some or all of the I/O devices are connected directly to one or more system I/O buses.
- the computer system 100 is a multi-user mainframe computer system, a single-user system, or a server computer or similar device that has little or no direct user interface, but receives requests from other computer systems (clients).
- the computer system 100 may be implemented as a desktop computer, portable computer, laptop or notebook computer, tablet computer, pocket computer, telephone, smart phone, or any other suitable type of electronic device.
- FIG. 1 is intended to depict the representative major components of the computer system 100 . Individual components, however, may have greater complexity than represented in FIG. 1 , components other than or in addition to those shown in FIG. 1 may be present, and the number, type, and configuration of such components may vary. Several particular examples of additional complexity or additional variations are disclosed herein; these are by way of example only and are not necessarily the only such variations.
- the various program components illustrated in FIG. 1 may be implemented, in various embodiments, in a number of different manners, including using various computer applications, routines, components, programs, objects, modules, data structures, etc., which may be referred to herein as “software,” “computer programs,” or simply “programs.”
- FIG. 2 is a block diagram of an exemplary audio-visual file container 202 that may contain a video file 204 , an audio file 206 , a subtitle file 208 , and a metadata file 210 according to one embodiment.
- the container may also include other files, such as a file that specifies synchronization information.
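- For illustration only, the container of FIG. 2 might be represented in code roughly as follows; this is a minimal Python sketch with hypothetical field and file names, not a description of any actual container format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AVFileContainer:
    """Illustrative stand-in for the container 202 of FIG. 2 (field names are hypothetical)."""
    video_file: str                                 # path to the video stream
    audio_file: str                                 # path to the audio stream
    subtitle_file: Optional[str] = None             # optional subtitle file
    metadata_file: Optional[str] = None             # optional metadata file
    sync_info: dict = field(default_factory=dict)   # synchronization / interleaving hints

# Example instantiation with made-up file names.
container = AVFileContainer(
    video_file="movie.m2v",
    audio_file="movie.ac3",
    subtitle_file="movie.srt",
    metadata_file="movie.json",
)
```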
- FIG. 3 is a block diagram of a process 300 for visually summarizing a video in a way that facilitates quick understanding by a viewer of the locations of particular types of content according to an embodiment.
- the process 300 may receive as input a visual tag file 302 , an audio tag file 304 , a key word tag file 306 , an attribute tag file 308 , and a metadata file 210 .
- the visual tag file 302 includes tags that correspond with visually perceivable objects, such as human and nonhuman objects.
- the audio tag file 304 includes tags that correspond with aurally perceivable objects.
- the key word tag file 306 includes tags that correspond with key word objects.
- the attribute tag file 308 includes tags that correspond with attribute objects.
- Each tag may be associated with a time stamp that indicates the start and stop time in which the object or attribute is rendered or otherwise associated. Exemplary embodiments for automatically determining tags are described below with reference to FIGS. 5-6 . In addition, in some embodiments, tags of one or more types may be wholly or partially determined using manual methods.
- the operation 310 may include comparing a tag with one or more other tags associated with the same shot or scene for consistency.
- a shot may be a continuous sequence of frames captured without interruption by a camera oriented in a single direction or camera angle.
- a visual tag may indicate that a particular human object appears in a shot and a key word tag identifying the name of the human object is associated with the shot.
- a visual tag may indicate that a particular human object appears in a shot and an audio tag identifying an audio signature of the human object is associated with the shot.
- If the tags that are compared indicate the same object, the positive or consistent result of the comparison may be used in operation 310 to validate that the human object was correctly identified.
- the operation 310 may include modifying a tag determined to be inconsistent with other tags associated with the same shot.
- the modification may include adding an indication to the tag that it should not be used in other processes.
- If a probability or confidence parameter associated with the particular tag is above a threshold, it may be determined that the object was correctly identified and that the shot or scene includes multiple objects.
- the modification may include adding an indication to the tag that it may be relied on to a particular extent.
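- A minimal sketch of the operation 310 consistency check follows, assuming tags are simple dictionaries with an object label and a confidence value; the field names and threshold are hypothetical.

```python
def cross_check_shot_tags(visual_tags, audio_tags, keyword_tags, confidence_floor=0.5):
    """Toy version of the operation 310 consistency check.

    Each tag is assumed to be a dict such as {"object": "actor_a", "confidence": 0.8};
    real tags would also carry time stamps and type attributes.
    """
    corroborating = ({t["object"] for t in audio_tags}
                     | {t["object"] for t in keyword_tags})
    checked = []
    for tag in visual_tags:
        tag = dict(tag)
        if tag["object"] in corroborating:
            tag["validated"] = True        # consistent with another modality
        elif tag.get("confidence", 0.0) >= confidence_floor:
            tag["validated"] = True        # accepted on its own confidence
        else:
            tag["validated"] = False       # flagged: do not use in later processing
        checked.append(tag)
    return checked
```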
- an emotion tag file 314 may be created from the attribute tag file 308 and the consistency-corrected visual tag 302 , audio tag 304 , key word tag 306 , and metadata 210 files.
- the emotion tag file 314 includes tags that are associated with emotion objects.
- an emotion object may be associated with an emotion, mood, or theme that a typical viewer might be expected to perceive or that the creators of a video intended the audience to perceive.
- Each emotion object may be of a predefined type and associated with a time stamp.
- An emotion object may include parameters corresponding with intensity of the perceived emotion or a confidence level that the perceived emotion accurately represents a ground truth emotion.
- An emotion object may be generated directly from the attribute file 308 , such as where the attribute file identifies an association or correlation of an attribute with a perceived emotion.
- an emotion object may be generated directly from the visual tag 302 , such as where the tag identifies a human object displaying a particular emotion.
- an emotion object may be generated directly from the audio tag 304 or key word tag 306 files, such as where an audio tag identifies a segment of sound associated or correlated with an emotion, mood, or theme, or a key word is associated with an emotion, mood, or theme.
- an emotion object may be generated in operation 312 by identifying patterns of visual, audio, key word, and attribute tags that correspond or correlate with an emotion object.
- an emotion object may be generated in operation 312 using contextual data provided in the metadata file 210 , such as metadata designating that the video is of a particular genre, e.g., comedy, horror, drama, or action.
- visual, audio, and attribute tags for a shot or scene may all be associated with a particular mood, e.g., amusement, fear, sadness, suspense, or interest.
- an emotion object may be determined using manual methods.
- a tag may be generated for an emotion object.
- An emotion tag may include an intensity level of the emotion, mood, or theme.
- a single emotion tag may be associated with two or more emotion objects.
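- As a rough illustration of how operation 312 might map tag patterns and genre metadata to emotion objects, the following sketch applies a few hand-written rules; the attribute names, rules, and intensity values are purely illustrative assumptions.

```python
def derive_emotion_tags(scene_tags, genre=None):
    """Illustrative rule-based mapping from tag patterns to emotion tags (operation 312)."""
    attributes = {t.get("attribute") for t in scene_tags}
    sounds = {t.get("sound") for t in scene_tags}
    emotions = []
    if {"gun_shot", "explosion"} & sounds or "short_shots" in attributes:
        emotions.append({"emotion": "action", "intensity": 0.8})
    if "laughter" in sounds or genre == "comedy":
        emotions.append({"emotion": "amusement", "intensity": 0.6})
    if "low_key_lighting" in attributes and genre in ("horror", "thriller"):
        emotions.append({"emotion": "suspense", "intensity": 0.7})
    return emotions
```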
- one or more tags of the tag files 302 , 304 , 306 , 308 , and 314 may be rendered as one or more indicia on a display device according to known techniques.
- FIG. 4 depicts a display screen 402 of a display device, e.g., display 126 ( FIG. 1 ), for displaying a user interface.
- the user interface includes windows 404 and 406 , which may be rendered on the display screen 402 along with a variety of textual information, and control icons or buttons, e.g., buttons 403 , 405 , outside of the windows.
- the video may be played in the window 404 .
- a variety of text, symbols, lines, and icons (“indicia”) for summarizing the video may be rendered in the window 406 .
- the horizontal extent of the window 406 may correspond with the duration or total time of the video.
- While FIG. 4 depicts a user interface that includes a window 404 for playing a video, a user interface may omit the window 404 , i.e., in other embodiments, a user interface may include only the window 406 for rendering text, symbols, and icons for summarizing the video (along with control icons or buttons 403 , 405 outside of the window 406 ).
- one or more object identifiers 408 may be rendered on the display screen 402 , such as to one side or the other of the window 406 , e.g., OBJECT 1 to OBJECT 8.
- one or more horizontal lines (time lines) having a length (or horizontal extent) and temporal position may be rendered horizontally adjacent to each object identifier. The length or horizontal extent may indicate the duration of the rendering of the associated object.
- OBJECT 1 is associated with lines 410 a
- OBJECT 5 is associated with lines 410 e .
- an icon rather than a line may be rendered to indicate the temporal location of an object.
- an icon 418 may be displayed to show where an audio object associated with music is located.
- the icon may be rendered at a point corresponding with the start of the time period in which the object is rendered.
- the icon may be rendered at a point corresponding with the midpoint or end of the time period in which the object is rendered.
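- The placement of indicia described above reduces to mapping a tag's time stamp onto the horizontal extent of the window 406 . A minimal sketch of that mapping, assuming pixel coordinates for the window, follows.

```python
def time_to_x(t_seconds, video_duration, window_left, window_width):
    """Map a time stamp to a horizontal pixel position inside window 406."""
    fraction = max(0.0, min(1.0, t_seconds / video_duration))
    return int(window_left + fraction * window_width)

def line_extent(start_s, stop_s, video_duration, window_left, window_width):
    """Start and end x-positions of a time line 410 for an object shown from start_s to stop_s."""
    return (time_to_x(start_s, video_duration, window_left, window_width),
            time_to_x(stop_s, video_duration, window_left, window_width))
```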
- Exemplary embodiments for automatically determining object identifiers are described below with reference to FIGS. 5 and 6 .
- the horizontal lines 410 a - 410 h facilitate a quick understanding by a viewer of the types and locations of various objects in the video.
- a viewer may quickly understand where different objects simultaneously appear in the video.
- OBJECTS 4 and 5 , which may be two particular actors, only appear together in the final quarter of the video.
- a viewer may quickly understand where preferred or objectionable objects appear in the video.
- horizontal lines for objectionable objects may be rendered in a different color than the color used for horizontal lines for objects generally.
- key words 414 (“KW#”) may be rendered in the second window 406 at horizontal locations corresponding with the temporal rendering of the particular key word in the video.
- key word 1 (KW1) 414 appears in the video at the start, at approximately the one-third time point, at approximately the two-thirds time point, and at a time point about eighty percent of the way through the video.
- a key word 414 may be rendered at any desired vertical coordinate or position within the second window 406 , i.e., it may but need not be associated with one of the object identifiers 408 . Exemplary embodiments for automatically determining key words are described below with reference to FIG. 6 .
- the display of key words 414 facilitates a quick understanding by a viewer of the types and locations of various key words in the video.
- a viewer may quickly understand where key words simultaneously occur with the appearance of various objects in the video. For example, key word KW4 occurs simultaneously with an appearance of object 8.
- emotion, mood, or theme denoting icons 416 may be rendered in the second window 406 .
- An emotion denoting icon 416 may be associated with and representative of an emotion tag.
- An emotion denoting icon 416 may be rendered at horizontal locations corresponding with the temporal location of the particular emotion tag in the video.
- an emotion or mood denoting icon 416 may be an “emoticon.”
- an emotion or mood denoting icon 416 may be a colored or gray-scale icon. While depicted as circular, an icon 416 may be any shape.
- the size, color, or shade of an icon 416 may correspond with an intensity of the associated emotion tag.
- an icon 416 associated with amusement or a funny mood may be relatively large if the mood or emotion would be expected to be perceived intensely, but the same icon may be relatively small if the mood or emotion would be expected to be perceived mildly.
- the display of emotion denoting icons 416 facilitates a quick understanding by a viewer of the types and locations of various emotions, moods, or themes in the video.
- a viewer can determine in a single view the proportion of the video that is associated with a particular emotion, mood, or theme, such as action or comedy.
- a viewer can determine in a single view where emotion objects of a particular type are located, e.g., funny portions of the video.
- FIG. 5 illustrates a process for generating visual tags according to one embodiment.
- a video file 204 may be parsed into shot files.
- a shot may be a continuous sequence of frames captured without interruption by a camera oriented in a single direction or camera angle.
- the camera may have a single field of view and field size, or may have a variable field of view, such as a zoom-in or -out shot.
- the camera may remain fixed, or be moved in a panning, tilting, or tracking motion.
- a fixed field of view shot may be a long shot, a full shot, a medium shot, or a close up shot.
- the video file 204 may be parsed into shot files according to any known method. For example, in one embodiment, a histogram may be computed for each frame of the video file and the histograms for consecutive frames compared. If the histogram intersection of first and second consecutive frames is greater than a threshold, it may be inferred that the frames are similar, and consequently that the two frames are part of the same shot. On the other hand, if the histogram intersection of first and second consecutive frames is less than the threshold, it may be inferred that the two frames form a shot boundary. In addition, it may be inferred that the first consecutive frame is the last frame of a preceding shot and the second consecutive frame is the first frame of a succeeding shot.
- the histograms of two or more consecutive first frames may be compared with the histograms of two or more consecutive second frames (the group of first and second frames being consecutive), and a shot boundary may be defined by more consecutive frames than merely two frames.
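- A minimal sketch of the histogram-intersection approach described above, using OpenCV as one possible implementation; the bin count and threshold value are illustrative assumptions.

```python
import cv2

def detect_shot_boundaries(video_path, threshold=0.6):
    """Histogram-intersection cut detection; the bin count and threshold are illustrative."""
    cap = cv2.VideoCapture(video_path)
    boundaries, prev_hist, frame_idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        cv2.normalize(hist, hist, 1.0, 0.0, cv2.NORM_L1)      # histogram sums to 1
        if prev_hist is not None:
            intersection = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_INTERSECT)
            if intersection < threshold:                       # dissimilar -> shot boundary
                boundaries.append(frame_idx)                   # first frame of the next shot
        prev_hist, frame_idx = hist, frame_idx + 1
    cap.release()
    return boundaries
```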
- the shot transition between shots may be a “fade” rather than a “cut.”
- a time code and type of shot transition (fade or cut) may be recorded as metadata for use in content analysis described below.
- Other known methods for parsing a video file into shot files may be employed in operation 504 .
- operation 504 may include parsing the video file so that sequential frames between determined shot boundaries are grouped together or otherwise identified or tagged as being associated with a particular shot. Sequential frames associated with a particular shot may be referred to herein as a shot file.
- a key frame may be determined for a shot file.
- the key frame may be deemed to be representative of all frames in the shot, permitting descriptive data for the shot to be determined only for the key frame and not for every frame of the shot.
- a key frame may be determined for each shot file.
- the operation 506 of determining a key frame may be omitted. Any known method for determining a key frame may be employed.
- a key frame may be determined by selecting a middle frame of the shot file.
- descriptive data for the shot may be determined for each of two or more key frames for a shot. Other known methods for determining a key frame may be employed in operation 506 .
- shot attributes may be determined and recorded as metadata.
- shot attributes may include shot length, color variance, type of illumination or lighting, amount of motion, and shot type (zooming, panning, tilting, tracking motion, long, full, medium, or close up).
- Shot length may be determined by counting the number of frames of a shot.
- Color variance and illumination or lighting properties may be determined by analyzing pixel values of key frames using known techniques.
- the amount of motion may be determined by evaluating the number of times individual pixels change value from frame-to-frame in a shot using known techniques.
- Shot type may be determined using known techniques.
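- The following sketch computes a few of the shot attributes named above from a list of grayscale frames; the measures used (mean brightness, pixel variance, mean frame-to-frame difference) are simplified stand-ins for the known techniques referenced.

```python
import numpy as np

def shot_attributes(frames, fps=24.0):
    """Simplified shot attributes for operation 508; `frames` is a list of grayscale frames."""
    stack = np.stack(frames).astype(np.float32)
    motion = float(np.abs(np.diff(stack, axis=0)).mean()) if len(frames) > 1 else 0.0
    return {
        "length_frames": len(frames),                 # shot length by counting frames
        "length_seconds": len(frames) / fps,
        "brightness": float(stack.mean()),            # rough illumination level
        "variance": float(stack.var()),               # rough tonal variance proxy
        "motion": motion,                             # mean frame-to-frame pixel change
    }
```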
- a shot attribute may correspond with known cinematic techniques for evoking a particular mood. For example, particular lighting may be used to evoke a suspense theme.
- Metadata for a shot may include mood, emotion, or theme where another shot attribute is associated with a known cinematic technique for evoking the mood, emotion, or theme.
- visual objects in a shot may be identified and tagged.
- visual objects in a shot may be identified by application of one or more known image recognition processes to the shot.
- the operation 510 may operate on one or more key frames of the shot.
- a shot may include the human and nonhuman visual objects. Both human and nonhuman visual objects may be identified in operation 510 .
- a human visual object may be identified by identifying a face (“human facial object”) in a frame.
- the operation 510 may include determining whether or not a particular visual object is present in a shot and, if present, to identify its location in the frame.
- the operation 510 may include extracting an identified object for further processing. For example, an extracted human facial object may be further processed to determine the identity of the person or to determine a facial expression of the person.
- the position or location within a frame of an object may be determined using any known method.
- a method may be of a type that employs rules that code typical attributes of the object.
- Attributes of a facial object may include, for example, eyes, eye brows, nose, hair line, hair texture, lips, and mouth.
- a rule may identify a face only if a particular facial feature, e.g., a first eye, is in a prescribed relationship to another feature, e.g., a second eye.
- a method may be of a type that employs rules that identify so-called “invariant features” that are present in a frame regardless of the position or pose of the object, the lighting, or camera viewpoint.
- Methods of this type may employ an image recognition processes that identifies: (i) facial features using edge detectors (e.g., a Sobel filter) and templates; (ii) skin or hair texture using a neural network; and (iii) skin color using a pixel chrominance classifier.
- methods may employ multiple techniques in stages, such as identifying global features such as skin color and face shape first, then verifying that the region is in fact a face by locating and detecting particular facial features within the region.
- the object may be identified as an object of a particular type or instance using any known method in operation 510 .
- known template matching methods may be employed.
- In a first type of template matching method, several standard patterns of a face are used. The standard patterns may describe the face as a whole or the facial features separately. Correlations between an image extracted from a frame and the standard patterns may be computed. If the correlations are statistically significant, it may be determined that a human facial object is found.
- the patterns are “learned” from training images using known statistical analysis and machine learning techniques.
- patterns may be learned from training images using: (i) Eigenfaces; (ii) Distribution-based Methods (including Principle Component Analysis, Factor Analysis, and Fisher's Linear Discriminant); (iii) Neural Networks; (iv) Support Vector Machines; (v) Sparse Network of Winnows (SNoW); (vi) Naive Bayes Classifiers; (vii) Hidden Markov Models; (viii) Information-Theoretical Approaches (including Kullback relative information); and (ix) Inductive Learning Algorithms.
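- A minimal sketch of the first (template matching) approach follows, using OpenCV's normalized cross-correlation; the template library and threshold are hypothetical, and the learned-pattern methods listed above are not shown.

```python
import cv2

def match_face_templates(key_frame_gray, templates, threshold=0.7):
    """Template-matching sketch: normalized cross-correlation against standard face patterns.

    `templates` maps an identity label to a grayscale template image smaller than the
    key frame; a correlation peak above `threshold` is treated as a match.
    """
    hits = []
    for label, template in templates.items():
        result = cv2.matchTemplate(key_frame_gray, template, cv2.TM_CCOEFF_NORMED)
        _, max_val, _, max_loc = cv2.minMaxLoc(result)
        if max_val >= threshold:
            hits.append({"object": label, "score": float(max_val), "location": max_loc})
    return hits
```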
- a nonhuman object such as a prop may be identified through color values and object-specific features. Patterns and templates for nonhuman objects will be different from those for facial objects.
- a musical instrument, such as an acoustic guitar, may be identified by determining regions of pixels having wood color values. Appropriately colored pixel regions may then be compared with patterns or templates for neck and body parts of the acoustic guitar, as viewed in different orientations.
- a human facial object may be processed to determine the emotion expressed on the facial object.
- a process may, in one embodiment, employ a Gabor filter to determine facial features and their orientation, and a support vector machine to determine an emotion corresponding with detected facial features.
- a sequence of frames in which a facial expression morphs from one emotion to another may be analyzed to determine an emotional category of a human facial object.
- the sequence of frames need not include every consecutive frame, e.g., two or more key frames may be analyzed.
- the sequence of frames may be analyzed using a Tree-Augmented-Naive Bayes classifier.
- a category of emotion may be determined by comparing motion vectors with a template.
- the motion vectors may be based on deformation of facial features as reflected in an optical flow that occurs in a sequence of frames.
- Optical flow may be determined using differential, matching, energy-, or phase-based techniques.
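- As one example of a differential optical-flow technique, the following sketch computes dense flow between two cropped face frames with OpenCV and reduces it to a mean deformation vector; comparing that vector against per-emotion motion templates is not shown.

```python
import cv2

def expression_motion_vector(prev_face_gray, next_face_gray):
    """Dense optical flow over a cropped face region (a differential-technique example).

    Returns the mean (dx, dy) deformation across the face.
    """
    # Farneback parameters: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev_face_gray, next_face_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return flow.mean(axis=(0, 1))
```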
- Emotions that may be determined may include amusement, joy, anger, disgust, embarrassment, fear, sadness, surprise, and a neutral state.
- Other emotions or moods may be determined in alternative embodiments.
- the operation 510 may include associating a determined emotion with a human object.
- the operation 510 may include generating an emotion tag that is associated with the scene of the video in which the facial emotion was detected.
- the emotion of a facial object may be determined in operation 510 using any known method.
- the amount of motion in a shot may be determined in operation 508 , or it may be determined in operation 510 after identifying an object. For example, the position of the identified object in various key frames between the beginning and ending frames of the shot may be compared.
- a nonhuman object may be a background, such as an indoor or outdoor location set.
- a background nonhuman object may be determined using known techniques, including techniques that consider the size (number of pixels), color, and distribution of pixels in a frame.
- a background object may be identified using a pattern matching technique that employs patterns or templates of various background objects. Templates may be developed from training images in the video or identified in a metadata file.
- a background object may be determined in operation 510 using any known method.
- a segment of two or more video frames that includes common objects, that is intended to convey common emotional content, that is intended to convey an element of a story, that is accompanied by a common audio segment, or some combination of the foregoing may be classified as a scene.
- a scene may also be referred to in this description and the claims as a “semantic segment.”
- One or more of the various tags described herein may be associated with a particular scene or semantic segment if the particular tag is determined from content in the scene.
- a visual scene may be determined according to any known method.
- a visual scene may include one or more camera shots and one or more human and nonhuman objects.
- scenes may be determined by grouping together consecutive shots having visual or audio objects corresponding with the same ground truth. For example, two consecutive shots having the same background object or other non-human object may be grouped together as a scene.
- a scene may include a first shot that is a long shot of a particular person and a second shot that is a medium shot of the same person.
- a sequence of four consecutive shots in which the first and third shots have a first human object and the second and fourth shots have a second human object may be grouped together as a scene.
- visual scenes may be determined if a preceding and following shot include related visual objects.
- For example, the first shot may include a particular person, the second shot may include another person, and the two may be interacting.
- visual scenes may be determined by comparing histogram data. For example, histogram data for the first of three consecutive shots may be compared with histogram data for the third shot in the series. If the histogram intersection of the first and third shots is greater than a threshold, it may be inferred that the shots are similar and part of the same scene, such as where the video shows an interaction between person A and person B, the camera first capturing person A, second capturing person B, and third capturing person A.
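- A rough sketch of this grouping follows, assuming each shot is summarized by one L1-normalized key-frame histogram; the A-B-A comparison and threshold are illustrative.

```python
import cv2

def group_shots_into_scenes(shot_histograms, threshold=0.5):
    """Group consecutive shots into scenes using the A-B-A comparison described above.

    `shot_histograms` holds one normalized key-frame histogram per shot; shot i joins
    the current scene if it is similar to shot i-1 or shot i-2 of that scene.
    """
    scenes = [[0]]
    for i in range(1, len(shot_histograms)):
        candidates = [j for j in (i - 1, i - 2) if j >= 0 and j in scenes[-1]]
        similar = any(
            cv2.compareHist(shot_histograms[i], shot_histograms[j],
                            cv2.HISTCMP_INTERSECT) >= threshold
            for j in candidates)
        if similar:
            scenes[-1].append(i)
        else:
            scenes.append([i])
    return scenes   # e.g. [[0, 1, 2], [3], [4, 5]] as lists of shot indices
```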
- the determination of a visual scene in operation 512 may include associating the scene with a probability or confidence parameter that is a measure of how likely the identified and grouped shots define a scene according to a ground truth specifying the temporal boundaries of a scene.
- the validity of a scene determined in operation 512 may be tested by comparing the temporal span of the scene with other scene determiners, such as a temporal span associated with an audio object.
- the determination of a visual scene in operation 512 may include associating an attribute tag with the scene.
- the attribute tag may correspond with known cinematic techniques for evoking a particular mood, e.g., amusement, fear, sadness, suspense, or interest.
- an attribute tag designating an action theme may be associated with a scene with a relatively large number of shots of short duration.
- visual tags may be associated or set for each scene.
- a visual tag corresponds with visual objects, such as human and nonhuman objects.
- When a tag is generated, it may be associated with a time or time span. However, the segments of the video that correspond with the various scenes may not be known at the time a tag is generated. Operation 514 may be performed at a time when the various scenes of the video are known so that a previously generated visual tag may be associated with a particular scene.
- FIG. 6 illustrates a process for generating audio and key word tags according to one embodiment.
- one or more audio features or audio signal descriptors may be extracted from an audio file 206 .
- An audio feature may be a time domain feature, such as zero crossing rate, energy contour, volume contour, or fundamental frequency, or a frequency domain feature, such as short term energy, bandwidth, entropy, spectral centroid, Mel-Frequency Cepstral Coefficients, or a Discrete Wavelet Transform.
- Many audio features are known in the art and any known audio feature or features that are suitable may be extracted in operation 602 .
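- For illustration, a few of these descriptors could be extracted with the librosa library as follows; the choice of library and of the particular features is an assumption, not part of the disclosure.

```python
import librosa

def extract_audio_features(audio_path):
    """Extract a few of the descriptors named above using librosa (one possible toolkit)."""
    y, sr = librosa.load(audio_path, sr=None)
    return {
        "zero_crossing_rate": librosa.feature.zero_crossing_rate(y),      # time domain
        "energy_contour": librosa.feature.rms(y=y),                       # time domain
        "spectral_centroid": librosa.feature.spectral_centroid(y=y, sr=sr),
        "mfcc": librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13),              # frequency domain
    }
```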
- audio features or audio signal descriptors extracted from an audio file 206 may be classified. Each classification may be defined by a set of characteristic audio feature values. In one embodiment, audio features may be classified as silence, speech (spoken words), music, and a fourth category of other sounds that will be referred to herein as “sound effect.”
- Segments of the video for which sound is not detectable may be classified as silent.
- an audio tag with a silent type attribute may be associated with a silent audio feature, the tag having a time stamp that indicates the start and stop time of the silent period.
- Segments of the video for which the audio feature values are similar to those that are characteristic of speech may be classified as speech.
- An audio tag with a speech type attribute may be associated with the audio feature, the tag having a time stamp of the period of speech.
- Segments of the video for which the audio feature values are similar to those that are characteristic of music may be classified as music.
- An audio tag with music type attribute may be associated with the audio feature, the tag having a time stamp of the period of music.
- Segments of the video for which the audio feature values are not similar to those that are characteristic of speech or music (and are not silent) may be classified as a sound effect.
- An audio tag with sound effect type attribute may be associated with the audio feature, the tag having a time stamp of the period of the sound effect.
- the sound effect category may include sounds conventionally understood to be movie or television sound effects, such as an explosion, a door being slammed, a motor vehicle engine, a scream, laughter, applause, wind, and rain.
- the sound effect category may include any sound that may not be classified as speech, music, or silence, even if the sound may not be conventionally understood to be a theatrical sound effect.
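- A toy sketch of such a classification follows, using only frame-level energy and zero-crossing rate with hand-picked thresholds; a practical classifier would use many more of the features listed above together with learned characteristic values for each class.

```python
import numpy as np

def classify_audio_frames(energy, zcr, silence_energy=0.01, speech_zcr=(0.02, 0.15)):
    """Toy frame-level classification into silence, speech, music, and sound effect.

    `energy` and `zcr` are 1-D arrays of frame RMS energy and zero-crossing rate
    (e.g., the flattened outputs of the feature sketch above); all thresholds are
    hand-picked placeholders, not characteristic values from the disclosure.
    """
    labels = []
    for e, z in zip(energy, zcr):
        if e < silence_energy:
            labels.append("silence")
        elif speech_zcr[0] <= z <= speech_zcr[1]:
            labels.append("speech")
        elif z < speech_zcr[0]:
            labels.append("music")
        else:
            labels.append("sound_effect")
    return np.array(labels)
```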
- audio features classified as sound effects may be further classified by sound effect type.
- Each sound effect sub-classification may be defined by a set of characteristic audio feature values.
- a gun shot may be defined by particular audio feature values.
- a library of audio feature values that are characteristic of a variety of sound effects may be provided.
- Each audio feature classified as a sound effect may be compared with the library of characteristic features. Where matches are found, the sound effect audio tag may have additional data added to it, specifying the particular sound, e.g., a crying baby sound effect.
- An optional operation 607 may include associating an attribute tag with a sound effect audio feature.
- the attribute tag may correspond with known cinematic techniques for evoking a particular mood.
- an attribute tag designating an action theme may be associated with gun shot or explosion sound effects.
- an attribute tag designating a suspense theme or amusement theme may be associated with a sound effect.
- an audio or acoustic fingerprint may be determined for audio features classified as music.
- An audio fingerprint is a content-based compact signature that may summarize a music recording.
- an audio fingerprint need not correspond with an exact copy of a particular music recording.
- An audio fingerprint may be found to match an extracted music recording where small variations from the particular music recording are present in the extracted audio features.
- An audio fingerprint is derived from the extracted audio features and may include a vector, a trace of vectors, a codebook, a sequence of Hidden Markov model sound classes, a sequence of error correcting words, or musically meaningful high-level attributes.
- a library of audio fingerprints for various music recordings may be provided.
- audio features classified as music may be compared with the library. Where matches are found, the music audio tag may have additional data added to it, specifying an identification of the particular song.
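- As a greatly simplified illustration of matching against such a library, the following sketch compares fixed-length fingerprint vectors by cosine similarity; real acoustic fingerprints use far more robust representations, as noted above.

```python
import numpy as np

def match_fingerprint(query_fp, library, min_similarity=0.95):
    """Toy fingerprint lookup by cosine similarity against a library of known recordings.

    `query_fp` and the library values are fixed-length vectors; real systems use much
    more robust signatures (vector traces, codebooks, hashed sub-fingerprints).
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best_title, best_score = None, 0.0
    for title, fingerprint in library.items():
        score = cosine(np.asarray(query_fp), np.asarray(fingerprint))
        if score > best_score:
            best_title, best_score = title, score
    return (best_title, best_score) if best_score >= min_similarity else (None, best_score)
```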
- an attribute tag designating an emotion, mood, or theme may be associated with a music audio tag. Particular cinematic techniques are known to employ certain types of music to evoke particular moods.
- a music audio tag may include attribute data designating that the music is associated with action, suspense, or sad themes if the music is of a particular type.
- an audio transcript may be determined.
- An audio transcript may include all of the words spoken in the video.
- an audio transcript may be provided with the video in the form of a closed caption file included in the AV file container.
- spoken words may be determined from audio features classified as speech using any known technique.
- spoken words may be manually determined.
- key words may be determined from the audio transcript.
- a key word may be a word that provides significant information content about a scene in a video.
- a key word may be a name of an actor that appears in a scene.
- a key word may be a name of a concept or idea that is central to a plot or story.
- the word “run” may be a key word for the movie Forrest Gump.
- a key word may be a name of a song.
- a key word may be a word that is predefined to be objectionable or liked by a viewer. For example, a vulgar word may be predefined as a key word.
- a key word may be determined from the audio transcript by counting the frequency of occurrences of words, the most frequently occurring verbs and nouns being determined to be key words.
- the operation 614 may include generating key word objects for each determined key word.
- key word tags may be created and stored in the key word tag file 306 (shown in FIG. 3 ).
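- A minimal sketch of frequency-based key word selection from a transcript follows; the stop-word list is illustrative, and restricting the count to nouns and verbs (as described above) would require part-of-speech tagging, which is not shown.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
              "that", "i", "you", "he", "she", "we", "they"}   # illustrative list

def extract_key_words(transcript, top_n=10):
    """Frequency-based key word selection from an audio transcript.

    A fuller implementation would keep only nouns and verbs and could also add
    predefined preferred or objectionable words as key words.
    """
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(top_n)]
```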
- a viewing pattern of a viewer may be gathered during the viewing of various videos. Using the viewing pattern, a viewing profile for a viewer may be generated. The viewing profile may identify categories of objects the viewer prefers. In addition, a viewer may manually input content types that he or she prefers or finds objectionable.
- FIGS. 7 and 8 depict the display screen 402 for displaying a user interface according to various embodiments.
- a viewer may select one or more time segments to create a playlist.
- a viewer has selected time segments 702 and 704 .
- the viewer desires to view a playlist that includes time segments in which both OBJECT 2 and OBJECT 7 appear.
- a viewer may select a time segment using a pointing device, such as a mouse or a touch screen.
- the Play Selected button 403 may be activated to play the selected time segment.
- additional time segments may be recommended to a viewer.
- One or more OBJECTS in the selected segments may be automatically determined or manually designated by a viewer. An automated search for any other segments that include these OBJECTS may be performed. Segments that are found to include these OBJECTS may then be recommended to a viewer. In the example of FIG. 8 , the time segments 802 and 804 are recommended to a viewer. The time segments are segments in which both OBJECT 2 and OBJECT 7 appear.
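- A minimal sketch of this recommendation step follows, assuming each time segment is summarized by the set of object identifiers appearing in it.

```python
def recommend_segments(segments, selected_ids):
    """Recommend additional segments containing the objects shared by the selected segments.

    `segments` maps a segment id to the set of object identifiers appearing in it,
    e.g. {"seg_702": {"OBJECT 2", "OBJECT 7"}, ...}.
    """
    shared = set.intersection(*(segments[s] for s in selected_ids))
    return [seg_id for seg_id, objects in segments.items()
            if seg_id not in selected_ids and shared <= objects]

# Example: with segments 702 and 704 selected, any other segment that also contains
# both OBJECT 2 and OBJECT 7 (such as 802 and 804 in FIG. 8) would be returned.
```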
- aspects of the present disclosure may be embodied as a system, method, or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- User Interface Of Digital Computer (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
Description
- This disclosure relates generally to graphical user interfaces, and more particularly, to visually summarizing a video in a way that facilitates quick understanding by a viewer of the types and locations of particular types of content.
- A television show, movie, internet video, or other similar content may be stored on a disc or in other memory using a container or wrapper file format. The container format may be used to specify how multiple different data files are to be used. The container format for a video may identify different data types and describe how they are to be interleaved when the video is played. A container may contain video files, audio files, subtitle files, chapter-information files, metadata, and other files. A container also typically includes a file that specifies synchronization information needed for simultaneous playback of the various files.
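- As a rough illustration only (the class and field names below are illustrative and are not taken from any particular container specification), a container of this kind can be modeled as a set of typed tracks plus a synchronization table:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Track:
    kind: str        # "video", "audio", "subtitle", "chapter", or "metadata"
    codec: str       # e.g. "h264", "ac3", "srt"
    payload: bytes = b""

@dataclass
class SyncEntry:
    timestamp_ms: int            # presentation time
    track_offsets: Dict[int, int]  # track index -> byte offset to read at this time

@dataclass
class AVContainer:
    tracks: List[Track] = field(default_factory=list)
    sync_table: List[SyncEntry] = field(default_factory=list)

    def tracks_of_kind(self, kind: str) -> List[Track]:
        """Return every track of the requested kind (e.g. all audio tracks)."""
        return [t for t in self.tracks if t.kind == kind]
```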
- One format for digital video files is the DVD-Video format. Another format for digital video files is Audio Video Interleaved (“AVI”). Audio may be stored in various formats, such as the PCM, DTS, MPEG-1 Audio Layer II (MP2), or Dolby Digital (AC-3) formats.
- A multimedia video generally includes a large amount of perceptual information, i.e., information such as images and sounds that are perceived by viewers. The frames of a video file may show humans, who may or may not be actors, and a wide variety of nonhuman objects. A nonhuman object may be a background, such as a natural indoor or outdoor location, or a professional stage or set. A nonhuman object may also be a prop or other visual element in front of the background object. Yet another type of nonhuman object that may be shown in a video frame is text. For instance, words spoken by humans may be displayed as text in a particular area of the frames. Segments of an audio file may be synchronously played with the display of video frames. These segments may include spoken words, music, and a wide variety of sound effects.
- While an audio-video file may be as short as a few minutes, the typical video, such as a television show or a full length movie, ranges in length from 20 minutes to over two hours. The typical video may include many scenes, each corresponding with a particular segment of the video. For example, a movie may have between 50 and 200 scenes. A minor scene may be one minute or less. A major scene may be three or more minutes. Each scene may include many frames and may include one or more camera shots. A scene may be accompanied by spoken dialog, a particular musical score or set of sound effects, or a combination of sound types. Particular human and nonhuman objects may appear in a scene. A scene may be intended by the creator to invoke particular emotions or moods, or to convey a theme of the story.
- One embodiment is directed to a method that visually summarizes the types and locations of particular types of content in a video in a way that facilitates understanding by a viewer. The method may include determining one or more semantic segments of the video. In addition, the method may include determining one or more emotion objects for at least one of the semantic segments. Further, the method may include generating a user interface on a display screen. The user interface may include one window, and in another embodiment, the user interface may include two windows. Moreover, the method may include displaying first indicia of the emotion object in a first window. The horizontal extent of the first window corresponds with the temporal length of the video and the first indicia are displayed at a location corresponding with the temporal appearance of the emotion object in the video.
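- By way of a hedged illustration of the correspondence described above (the helper names are assumptions, not part of this disclosure), an indicium's horizontal position can be computed by proportional scaling of its time stamp against the window width:

```python
def time_to_x(timestamp_s: float, video_duration_s: float, window_width_px: int) -> int:
    """Map a time stamp to an x coordinate in a window whose width represents the whole video."""
    if video_duration_s <= 0:
        raise ValueError("duration must be positive")
    fraction = min(max(timestamp_s / video_duration_s, 0.0), 1.0)
    return round(fraction * (window_width_px - 1))

def indicia_extent(start_s: float, stop_s: float, duration_s: float, width_px: int):
    """Return the (x_start, x_stop) pixel extent for an object appearing from start_s to stop_s."""
    return (time_to_x(start_s, duration_s, width_px),
            time_to_x(stop_s, duration_s, width_px))

# Example: an emotion object perceived 45 minutes into a 2-hour video,
# drawn in a 1000-pixel-wide summary window.
x = time_to_x(45 * 60, 2 * 60 * 60, 1000)   # -> 375
```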
- Additional embodiments are directed to a non-transitory computer-readable storage medium having executable code stored thereon to cause a machine to perform a method for rendering a summary of a video, and to a system that visually summarizes the types and locations of particular types of content in a video in a way that facilitates understanding by a viewer.
-
FIG. 1 depicts a high-level block diagram of an exemplary computer system for implementing various embodiments. -
FIG. 2 is a block diagram of an exemplary audio-visual file container according to one embodiment. -
FIG. 3 is a block diagram of a process for visually summarizing a video in a way that facilitates quick understanding by a viewer of the types and locations of particular types of content according to an embodiment. -
FIG. 4 depicts a display screen displaying a user interface according to one embodiment. -
FIG. 5 illustrates a process for generating visual tags according to one embodiment. -
FIG. 6 illustrates a process for generating audio and key word tags according to one embodiment. -
FIG. 7 depicts a display screen displaying a user interface according to an embodiment. -
FIG. 8 depicts a display screen displaying a user interface according to an embodiment. - A multimedia video generally includes a large amount of perceptual information, i.e., information such as images and sounds that may be perceived by viewers. For example, a video may show human and nonhuman objects. A video may include spoken words, music, and other sounds, which may be referred to herein as audio objects. A video may evoke various emotions, moods, or themes, which may be referred to herein as emotion objects. The spoken words may include “key words.” A key word may be a word that provides significant information content about a scene in a video. These objects and key words may be used to describe a scene to a viewer. In particular, according to various embodiments, visual representations of key words, and human, nonhuman, audio, and emotion objects may be used to describe the scenes of a video to a viewer. In addition, visual representations of the relationships between these objects and key words may be used to describe the scenes of a video to a viewer. By visually presenting this information to the viewer, he or she may be enabled to generally understand the scene. The information may enable the viewer to determine whether a particular scene is of interest or is objectionable. In various embodiments, visual information summarizing all of the scenes of a video may be presented to the viewer in a single display screen.
- According to various embodiments, a viewer selects a video, and human, nonhuman, and audio objects of the video are identified. In addition, key words that are spoken by human objects in the video are identified. Human, nonhuman, and audio objects may be used to classify a particular segment of a video as a scene. The objects and key words are then associated with the scenes of the video. Further, the objects, key words, and other data may be used to determine an emotion, mood, or theme for one or more of the scenes, and to generate corresponding emotion objects. The objects and key words may be compared with profile information to determine an attitude or preference of a viewer regarding the scenes of the video. A viewer's attitude may be, for example, that he or she likes, dislikes, or finds a particular type of content objectionable. In various embodiments, visual representations of key words, and human, nonhuman, and audio objects summarizing all of the scenes of a video are presented to the viewer in a single display screen. In addition, visual representations of a viewer's attitudes or preferences toward a particular object or key word may be displayed.
- In one embodiment, a display screen may include a first window for playing the video and a second window for rendering text, symbols, and icons corresponding with human, nonhuman, audio, and emotion objects, and key words. The second window may also include a visual indication of a viewer's attitude regarding particular human, nonhuman, audio, and emotion objects, and key words. In one embodiment, a viewer may select one or more scenes for playing in the first window. One or more other scenes of the video may be identified as scenes to be recommended to the viewer. The recommended scenes may be other scenes that have human, nonhuman, audio, and emotion objects, and key words that are similar to the scene selected by the viewer.
-
FIG. 1 depicts a high-level block diagram of anexemplary computer system 100 for implementing various embodiments. The mechanisms and apparatus of the various embodiments disclosed herein apply equally to any appropriate computing system. The major components of thecomputer system 100 include one ormore processors 102, amemory 104, aterminal interface 112, astorage interface 114, an I/O (Input/Output)device interface 116, and anetwork interface 118, all of which are communicatively coupled, directly or indirectly, for inter-component communication via amemory bus 106, an I/O bus 108,bus interface unit 109, and an I/Obus interface unit 110. - The
computer system 100 may contain one or more general-purpose programmable central processing units (CPUs) 102A and 102B, herein generically referred to as theprocessor 102. In an embodiment, thecomputer system 100 may contain multiple processors typical of a relatively large system; however, in another embodiment, thecomputer system 100 may alternatively be a single CPU system. Eachprocessor 102 executes instructions stored in thememory 104 and may include one or more levels of on-board cache. - In an embodiment, the
memory 104 may include a random-access semiconductor memory, storage device, or storage medium (either volatile or non-volatile) for storing or encoding data and programs. In another embodiment, thememory 104 represents the entire virtual memory of thecomputer system 100, and may also include the virtual memory of other computer systems coupled to thecomputer system 100 or connected via a network. Thememory 104 is conceptually a single monolithic entity, but in other embodiments thememory 104 is a more complex arrangement, such as a hierarchy of caches and other memory devices. For example, memory may exist in multiple levels of caches, and these caches may be further divided by function, so that one cache holds instructions while another holds non-instruction data, which is used by the processor or processors. Memory may be further distributed and associated with different CPUs or sets of CPUs, as is known in any of various so-called non-uniform memory access (NUMA) computer architectures. - The
memory 104 may store all or a portion of the following: an audio visual file container 150 (shown inFIG. 2 as container 202), avideo processing module 152, anaudio processing module 154, and acontrol module 156. These modules are illustrated as being included within thememory 104 in thecomputer system 100, however, in other embodiments, some or all of them may be on different computer systems and may be accessed remotely, e.g., via a network. Thecomputer system 100 may use virtual addressing mechanisms that allow the programs of thecomputer system 100 to behave as if they only have access to a large, single storage entity instead of access to multiple, smaller storage entities. Thus, while the audiovisual file container 150,video processing module 152,audio processing module 154, andcontrol module 156 are illustrated as being included within thememory 104, these components are not necessarily all completely contained in the same storage device at the same time. Further, although the audiovisual file container 150,video processing module 152,audio processing module 154, andcontrol module 156 are illustrated as being separate entities, in other embodiments some of them, portions of some of them, or all of them may be packaged together. - In an embodiment, the
video processing module 152,audio processing module 154, andcontrol module 156 may include instructions or statements that execute on theprocessor 102 or instructions or statements that are interpreted by instructions or statements that execute on theprocessor 102 to carry out the functions as further described below. In another embodiment, thevideo processing module 152,audio processing module 154, andcontrol module 156 are implemented in hardware via semiconductor devices, chips, logical gates, circuits, circuit cards, and/or other physical hardware devices in lieu of, or in addition to, a processor-based system. In an embodiment, thevideo processing module 152,audio processing module 154, andcontrol module 156 may include data in addition to instructions or statements. - The
video processing module 152 may include various processes that generate visual tags according to one embodiment. Theaudio processing module 154 may include various processes for generating audio and key word tags according to one embodiment. Thecontrol module 156 may include various processes for visually summarizing a video in a way that facilitates quick understanding by a viewer of the types and locations of particular types of content according to an embodiment. In addition, thecontrol module 156 may include various processes for rendering all or selected portions of a video, and rendering a user interface, such as the one shown inFIG. 4 . Further, thecontrol module 156 may include various processes for identifying scenes to be recommended to a viewer, as well as other processes described herein. - The
computer system 100 may include abus interface unit 109 to handle communications among theprocessor 102, thememory 104, adisplay system 124, and the I/Obus interface unit 110. The I/Obus interface unit 110 may be coupled with the I/O bus 108 for transferring data to and from the various I/O units. The I/Obus interface unit 110 communicates with multiple I/O interface units O bus 108. Thedisplay system 124 may include a display controller, a display memory, or both. The display controller may provide video, audio, or both types of data to adisplay device 126. The display memory may be a dedicated memory for buffering frames of video data. Thedisplay system 124 may be coupled with adisplay device 126, such as a standalone display screen, computer monitor, television, or a tablet or handheld device display. In one embodiment, thedisplay device 126 may include one or more speakers for rendering audio. Alternatively, one or more speakers for rendering audio may be coupled with an I/O interface unit. In alternate embodiments, one or more of the functions provided by thedisplay system 124 may be on board aprocessor 102 integrated circuit. In addition, one or more of the functions provided by thebus interface unit 109 may be on board aprocessor 102 integrated circuit. - The I/O interface units support communication with a variety of storage and I/O devices. For example, the
terminal interface unit 112 supports the attachment of one or more viewer I/O devices 120, which may include viewer output devices (such as a video display device, speaker, and/or television set) and viewer input devices (such as a keyboard, mouse, keypad, touchpad, trackball, buttons, light pen, or other pointing device). A viewer may manipulate the user input devices using a user interface, in order to provide input data and commands to the user I/O device 120 and thecomputer system 100, and may receive output data via the user output devices. For example, a user interface may be presented via the user I/O device 120, such as displayed on a display device, played via a speaker, or printed via a printer. - The
storage interface 114 supports the attachment of one or more disk drives or direct access storage devices 122 (which are typically rotating magnetic disk drive storage devices, although they could alternatively be other storage devices, including arrays of disk drives configured to appear as a single large storage device to a host computer, or solid-state drives, such as flash memory). In another embodiment, thestorage device 122 may be implemented via any type of secondary storage device. The contents of thememory 104, or any portion thereof, may be stored to and retrieved from thestorage device 122 as needed. The I/O device interface 116 provides an interface to any of various other I/O devices or devices of other types, such as printers or fax machines. Thenetwork interface 118 provides one or more communication paths from thecomputer system 100 to other digital devices and computer systems; these communication paths may include, e.g., one or more networks. - Although the
computer system 100 shown inFIG. 1 illustrates a particular bus structure providing a direct communication path among theprocessors 102, thememory 104, thebus interface 109, thedisplay system 124, and the I/Obus interface unit 110, in alternative embodiments thecomputer system 100 may include different buses or communication paths, which may be arranged in any of various forms, such as point-to-point links in hierarchical, star or web configurations, multiple hierarchical buses, parallel and redundant paths, or any other appropriate type of configuration. Furthermore, while the I/Obus interface unit 110 and the I/O bus 108 are shown as single respective units, thecomputer system 100 may, in fact, contain multiple I/Obus interface units 110 and/or multiple I/O buses 108. While multiple I/O interface units are shown, which separate the I/O bus 108 from various communications paths running to the various I/O devices, in other embodiments, some or all of the I/O devices are connected directly to one or more system I/O buses. - In various embodiments, the
computer system 100 is a multi-user mainframe computer system, a single-user system, or a server computer or similar device that has little or no direct user interface, but receives requests from other computer systems (clients). In other embodiments, thecomputer system 100 may be implemented as a desktop computer, portable computer, laptop or notebook computer, tablet computer, pocket computer, telephone, smart phone, or any other suitable type of electronic device. -
FIG. 1 is intended to depict the representative major components of thecomputer system 100. Individual components, however, may have greater complexity than represented inFIG. 1 , components other than or in addition to those shown inFIG. 1 may be present, and the number, type, and configuration of such components may vary. Several particular examples of additional complexity or additional variations are disclosed herein; these are by way of example only and are not necessarily the only such variations. The various program components illustrated inFIG. 1 may be implemented, in various embodiments, in a number of different manners, including using various computer applications, routines, components, programs, objects, modules, data structures, etc., which may be referred to herein as “software,” “computer programs,” or simply “programs.” -
FIG. 2 is a block diagram of an exemplary audio-visual file container 202 that may contain avideo file 204, anaudio file 206, asubtitle file 208, and ametadata file 210 according to one embodiment. The container may also include other files, such as a file that specifies synchronization information. -
FIG. 3 is a block diagram of aprocess 300 for visually summarizing a video in a way that facilitates quick understanding by a viewer of the locations of particular types of content according to an embodiment. Theprocess 300 may receive as input avisual tag file 302, anaudio tag file 304, a keyword tag file 306, anattribute tag file 308, and ametadata file 210. Thevisual tag file 302 includes tags that correspond with visually perceivable objects, such as human and nonhuman objects. Theaudio tag file 304 includes tags that correspond with aurally perceivable objects. The keyword tag file 306 includes tags that correspond with key word objects. Theattribute tag file 308 includes tags that correspond with attribute objects. Each tag may be associated with a time stamp that indicates the start and stop time in which the object or attribute is rendered or otherwise associated. Exemplary embodiments for automatically determining tags are described below with reference toFIGS. 5-6 . In addition, in some embodiments, tags of one or more types may be wholly or partially determined using manual methods. - The
operation 310 may include comparing a tag with one or more other tags associated with the same shot or scene for consistency. A shot may be a continuous sequence of frames captured without interruption by a camera oriented in a single direction or camera angle. As one example, a visual tag may indicate that a particular human object appears in a shot and a key word tag identifying the name of the human object is associated with the shot. As another example, a visual tag may indicate that a particular human object appears in a shot and an audio tag identifies an audio signature of the human object is associated with the shot. In these examples, if the tags that are compared indicate the same object, the positive or consistent result of the comparison may be used inoperation 310 to validate that the human object was correctly identified. If there are no tags that are consistent with a particular tag, it may be determined that the object associated with the particular tag was misidentified. Theoperation 310 may include modifying a tag determined to be inconsistent with other tags associated with the same shot. The modification may include adding an indication to the tag that it should not be used in other processes. Alternatively, if a probability or confidence parameter associated with the particular tag is above a threshold, it may be determined that the object was correctly identified and that the shot or scene includes multiple objects. In this circumstance, the modification may include adding an indication to the tag that it may be relied on to a particular extent. - In
operation 312, anemotion tag file 314 may be created from theattribute tag file 308 and the consistency-correctedvisual tag 302,audio tag 304,key word tag 306, andmetadata 210 files. Theemotion tag file 314 includes tags that are associated with emotion objects. In one embodiment, an emotion object may be associated with an emotion, mood, or theme that a typical viewer might be expected to perceive or that the creators of a video intended the audience to perceive. Each emotion object may be of a predefined type and associated with a time stamp. An emotion object may include parameters corresponding with intensity of the perceived emotion or a confidence level that the perceived emotion accurately represents a ground truth emotion. An emotion object may be generated directly from theattribute file 308, such as where the attribute file identifies an association or correlation of an attribute with a perceived emotion. In addition, an emotion object may be generated directly from thevisual tag 302, such as where the tag identifies a human object displaying a particular emotion. Further, an emotion object may be generated directly from theaudio tag 304 orkey word tag 306 files, such as where an audio tag identifies a segment of sound associated or correlated with an emotion, mood, or theme, or a key word is associated with an emotion, mood, or theme. Moreover, an emotion object may be generated inoperation 312 by identifying patterns of visual, audio, key word, and attribute tags that correspond or correlate with an emotion object. Further, an emotion object may be generated inoperation 312 using contextual data provided in themetadata file 210, such as metadata designating that the video is of a particular genre, e.g., comedy, horror, drama, or action. For example, visual, audio, and attribute tags for a shot or scene may all be associated with a particular mood, e.g., amusement, fear, sadness, suspense, or interest. In one embodiment, an emotion object may be determined using manual methods. In one embodiment, a tag may be generated for an emotion object. An emotion tag may include an intensity level of the emotion, mood, or theme. In addition, in one embodiment, a single emotion tag may be associated with two or more emotion objects. For example, a typical viewer might be expected to simultaneously perceive two emotions, such as happiness and surprise, when perceiving a particular scene. In arendering operation 316, one or more tags of the tag files 302, 304, 306, 308, and 314 may be rendered as one or more indicia on a display device according to known techniques. -
FIG. 4 depicts adisplay screen 402 of a display device, e.g., display 126 (FIG. 1 ), for displaying a user interface. In one embodiment, the user interface includeswindows display screen 402 along with a variety of textual information, and control icons or buttons, e.g.,buttons window 404. A variety of text, symbols, lines, and icons (“indicia”) for summarizing the video may be rendered in thewindow 406. The horizontal extent of thewindow 406 may correspond with the duration or total time of the video. The x axis shown in the figure represents the horizontal extent or time, while the y axis represents a vertical direction. WhileFIG. 4 depicts a user interface that includes awindow 404 for playing a video, in other embodiments, a user interface may omit thewindow 404, i.e., in other embodiments, a user interface may include only thewindow 406 for rendering text, symbols, and icons for summarizing the video (along with control icons orbuttons - In one embodiment, one or more object identifiers 408 may be rendered on the
display screen 402, such as to one side or the other of thewindow 406, e.g.,OBJECT 1 toOBJECT 8. In various embodiments, one or more horizontal lines (time lines) having a length (or horizontal extent) and temporal position may be rendered horizontally adjacent to each object identifier. The length or horizontal extent may indicate the duration of the rendering of the associated object. InFIG. 4 , for instance,OBJECT 1 is associated withlines 410 a andOBJECT 5 is associated withlines 410 e. In the example, it can be seen thatOBJECT 1 appears from time t3 to time t4, from time t5 to time t6, from time t7 to time t8. In contrast,OBJECT 3 appears from time t1 to time t2 and does not appear again. In one embodiment, an icon rather than a line may be rendered to indicate the temporal location of an object. For example, anicon 418 may be displayed to show where an audio object associated with music is located. In embodiments where an icon is rendered to indicate the temporal location of an object and the horizontal extent of the icon is smaller than the duration of the rendering of the object, the icon may be rendered at a point corresponding with the start of the time period in which the object is rendered. Alternatively, the icon may be rendered at a point corresponding with the midpoint or end of the time period in which the object is rendered. Exemplary embodiments for automatically determining object identifiers are described below with reference toFIGS. 5 and 6 . It will be appreciated that thehorizontal lines 410 a-410 h facilitate a quick understanding by a viewer of the types and locations of various objects in the video. In addition, a viewer may quickly understand where different objects simultaneously appear in the video. For example, OBJECTS 4 and 5, which may be two particular actors, only appear together in the final quarter of the video. Further, a viewer may quickly understand where preferred or objectionable objects appear in the video. For example, horizontal lines for objectionable objects may be rendered in a different color than the color used for horizontal lines for objects generally. - Still referring to
FIG. 4 , in various embodiments, key words 414 (“KW#”) may be rendered in thesecond window 406 at horizontal locations corresponding with the temporal rendering of the particular key word in the video. For example, inFIG. 4 , it may be seen that key word 1 (KW1) 414 appears in the video at the start, at approximately the one-third time point, at approximately the two-thirds time point, and at a time point about eighty percent of the way through the video. Akey word 414 may be rendered at any desired vertical coordinate or position within thesecond window 406, i.e., it may but need not be associated with one of the object identifiers 408. Exemplary embodiments for automatically determining key words are described below with reference toFIG. 6 . It will be appreciated that the display ofkey words 414 facilitates a quick understanding by a viewer of the types and locations of various key words in the video. In addition, a viewer may quickly understand where key words simultaneously occur with the appearance of various objects in the video. For example, key word KW4 occurs simultaneously with an appearance ofobject 8. - In various embodiments, as shown in
FIG. 4 , emotion, mood, ortheme denoting icons 416 may be rendered in thesecond window 406. Anemotion denoting icon 416 may be associated with and representative of an emotion tag. Anemotion denoting icon 416 may be rendered at horizontal locations corresponding with the temporal location of the particular emotion tag in the video. In one embodiment, an emotion ormood denoting icon 416 may be an “emoticon.” In other embodiments, an emotion ormood denoting icon 416 may be a colored or gray-scale icon. While depicted as circular, anicon 416 may be any shape. In various embodiments, the size, color, or shade of anicon 416 may correspond with an intensity of the associated emotion tag. For example, anicon 416 associated with amusement or a funny mood may be relatively large if the mood or emotion would be expected to be perceived intensely, but the same icon may be relatively small if the mood or emotion would be expected to be perceived mildly. It will be appreciated that the display ofemotion denoting icons 416 facilitates a quick understanding by a viewer of the types and locations of various emotions, moods, or themes in the video. A viewer can determine in a single view the proportion of the video that is associated with a particular emotion, mood, or theme, such as action or comedy. In addition, a viewer can determine in a single view where emotion objects of a particular type are located, e.g., funny portions of the video. -
FIG. 5 illustrates a process for generating visual tags according to one embodiment. Referring to FIG. 5, in operation 504, a video file 204 may be parsed into shot files. As mentioned, a shot may be a continuous sequence of frames captured without interruption by a camera oriented in a single direction or camera angle. During a shot, the camera may have a single field of view and field size, or may have a variable field of view, such as a zoom-in or zoom-out shot. The camera may remain fixed, or be moved in a panning, tilting, or tracking motion. For example, a fixed field of view shot may be a long shot, a full shot, a medium shot, or a close-up shot. - The
video file 204 may be parsed into shot files according to any known method. For example, in one embodiment, a histogram may be computed for each frame of the video file and the histograms for consecutive frames compared. If the histogram intersection of first and second consecutive frames is greater than a threshold, it may be inferred that the frames are similar, and consequently that the two frames are part of the same shot. On the other hand, if the histogram intersection of first and second consecutive frames is less than the threshold, it may be inferred that the two frames form a shot boundary. In addition, it may be inferred that the first consecutive frame is the last frame of a preceding shot and the second consecutive frame is the first frame of a succeeding shot. In one alternative, the histograms of two or more consecutive first frames may be compared with the histograms of two or more consecutive second frames (the group of first and second frames being consecutive), and a shot boundary may be defined by more consecutive frames than merely two frames. For example, the shot transition between shots may be a “fade” rather than a “cut.” A time code and type of shot transition (fade or cut) may be recorded as metadata for use in content analysis described below. Other known methods for parsing a video file into shot files may be employed in operation 504. In addition, operation 504 may include parsing the video file so that sequential frames between determined shot boundaries are grouped together or otherwise identified or tagged as being associated with a particular shot. Sequential frames associated with a particular shot may be referred to herein as a shot file.
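- As a rough illustration of the histogram comparison just described (a sketch only; the OpenCV calls are one possible realization, and the 8x8x8 binning and 0.7 similarity threshold are assumed values, not values from this disclosure), consecutive frames can be compared and a middle key frame recorded for each resulting shot:

```python
import cv2

def detect_shots(video_path: str, similarity_threshold: float = 0.7):
    """Split a video into shots by comparing color histograms of consecutive frames.

    Returns a list of (first_frame_index, last_frame_index, key_frame_index) tuples.
    The key frame is taken as the middle frame of each shot, as in one embodiment above.
    """
    cap = cv2.VideoCapture(video_path)
    shots, shot_start, prev_hist, index = [], 0, None, 0

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hist = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                            [0, 256, 0, 256, 0, 256])
        hist = cv2.normalize(hist, hist, alpha=1.0, norm_type=cv2.NORM_L1).flatten()
        if prev_hist is not None:
            # Histogram intersection: high value -> similar frames (same shot),
            # low value -> a shot boundary between index - 1 and index.
            similarity = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_INTERSECT)
            if similarity < similarity_threshold:
                shots.append((shot_start, index - 1, (shot_start + index - 1) // 2))
                shot_start = index
        prev_hist = hist
        index += 1

    cap.release()
    if index > 0:
        shots.append((shot_start, index - 1, (shot_start + index - 1) // 2))
    return shots
```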
- In operation 506, a key frame may be determined for a shot file. The key frame may be deemed to be representative of all frames in the shot, permitting descriptive data for the shot to be determined only for the key frame and not for every frame of the shot. In one embodiment, a key frame may be determined for each shot file. In another embodiment, the operation 506 of determining a key frame may be omitted. Any known method for determining a key frame may be employed. In one embodiment, a key frame may be determined by selecting a middle frame of the shot file. In alternative embodiments, descriptive data for the shot may be determined for each of two or more key frames for a shot. Other known methods for determining a key frame may be employed in operation 506. - In
operation 508, various shot attributes may be determined and recorded as metadata. Examples of shot attributes may include shot length, color variance, type of illumination or lighting, amount of motion, and shot type (zooming, panning, tilting, tracking motion, long, full, medium, or close up). Shot length may be determined by counting the number of frames of a shot. Color variance and illumination or lighting properties may be determined by analyzing pixel values of key frames using known techniques. The amount of motion may be determined by evaluating the number of times individual pixels change value from frame-to-frame in a shot using known techniques. Shot type may be determined using known techniques. A shot attribute may correspond with known cinematic techniques for evoking a particular mood. For example, particular lighting may be used to evoke a suspense theme. Metadata for a shot may include mood, emotion, or theme where another shot attribute is associated with a known cinematic technique for evoking the mood, emotion, or theme. - In
operation 510, visual objects in a shot may be identified and tagged. In one embodiment, visual objects in a shot may be identified by application of one or more known image recognition processes to the shot. The operation 510 may operate on one or more key frames of the shot. A shot may include human and nonhuman visual objects. Both human and nonhuman visual objects may be identified in operation 510. With respect to human visual objects, in one embodiment, a human visual object may be identified by identifying a face (“human facial object”) in a frame. The operation 510 may include determining whether or not a particular visual object is present in a shot and, if present, identifying its location in the frame. The operation 510 may include extracting an identified object for further processing. For example, an extracted human facial object may be further processed to determine the identity of the person depicted or to determine that person's facial expression. - In
operation 510, the position or location within a frame of an object may be determined using any known method. For example, a method may be of a type that employs rules that code typical attributes of the object. Attributes of a facial object may include, for example, eyes, eye brows, nose, hair line, hair texture, lips, and mouth. For instance, in the case of a human facial object, a rule may identify a face only if a particular facial feature, e.g., a first eye, is in a prescribed relationship to another feature, e.g., a second eye. In addition, a method may be of a type that employs rules that identify so-called “invariant features” that are present in a frame regardless of the position or pose of the object, the lighting, or camera viewpoint. Methods of this type, especially when employed to identify a human facial object, may employ an image recognition processes that identifies: (i) facial features using edge detectors (e.g., a Sobel filter) and templates; (ii) skin or hair texture using a neural network; and (iii) skin color using a pixel chrominance classifier. Further, methods may employ multiple techniques in stages, such as identifying global features such as skin color and face shape first, then verifying that the region is in fact a face by locating and detecting particular facial features within the region. - Further, once the position within a frame of an object is determined, the object may be identified as an object of a particular type or instance using any known method in
operation 510. Continuing the example of a human facial object, known template matching methods may be employed. In a first type of template matching method, several standard patterns of a face are used. The standard patterns may describe the face as a whole or the facial features separately. Correlations between an image extracted from a frame and the standard patterns may be computed. If the correlations are statistically significant, it may be determined that a human facial object is found. In a second type of template matching method, the patterns are “learned” from training images using known statistical analysis and machine learning techniques. In various embodiments, patterns may be learned from training images using: (i) Eigenfaces; (ii) Distribution-based Methods (including Principle Component Analysis, Factor Analysis, and Fisher's Linear Discriminant); (iii) Neural Networks; (iv) Support Vector Machines; (v) Sparse Network of Winnows (SNoW); (vi) Naive Bayes Classifiers; (vii) Hidden Markov Models; (viii) Information-Theoretical Approaches (including Kullback relative information); and (ix) Inductive Learning Algorithms. - While methods for object location and identification have been described with respect to a human facial object, it will be appreciated that these techniques may be generally employed with non-facial human objects and nonhuman objects. For example, a nonhuman object, such as a prop may be identified though color values and object-specific features. Patterns and templates for nonhuman objects will be different than those for facial objects. For example, a musical instrument, such as an acoustic guitar, may be identified by determining regions of pixels having wood color values. Appropriately colored pixel regions may then be compared with patterns or templates for neck and body parts of the acoustic guitar, as viewed in different orientations.
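- One possible, simplified realization of the locate-then-identify flow described above uses OpenCV's stock Haar cascade for face location and normalized cross-correlation for template matching; the cascade choice and the 0.6 score threshold are assumptions, not part of this disclosure:

```python
import cv2

# Locate candidate faces in a key frame using a stock Haar cascade
# (one classical detector; the disclosure does not mandate any specific one).
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def locate_faces(frame_bgr):
    """Return a list of (x, y, w, h) face regions found in a frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    return face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

def identify_face(face_gray, templates, score_threshold=0.6):
    """Compare an extracted face region against named standard patterns.

    `templates` maps a label (e.g. an actor name) to a grayscale template image of
    the same size as face_gray; the 0.6 correlation threshold is illustrative.
    """
    best_label, best_score = None, score_threshold
    for label, tmpl in templates.items():
        score = cv2.matchTemplate(face_gray, tmpl, cv2.TM_CCOEFF_NORMED).max()
        if score > best_score:
            best_label, best_score = label, score
    return best_label   # None if no template correlates strongly enough
```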
- In one embodiment, a human facial object may be processed to determine the emotion expressed on the facial object. To determine the emotion of a facial object, a process may, in one embodiment, employ a Gabor filter to determine facial features and their orientation, and a support vector machine to determine an emotion corresponding with detected facial features. In one embodiment, a sequence of frames in which a facial expression morphs from one emotion to another may be analyzed to determine an emotional category of a human facial object. The sequence of frames need not include every consecutive frame, e.g., two or more key frames may be analyzed. The sequence of frames may be analyzed using a Tree-Augmented-Naive Bayes classifier. In addition, a category of emotion may be determined by comparing motion vectors with a template. The motion vectors may be based on deformation of facial features as reflected in an optical flow that occurs in a sequence of frames. Optical flow may be determined using differential, matching, energy-, or phase-based techniques. In various embodiments, emotions that may be determined may include amusement, joy, anger, disgust, embarrassment, fear, sadness, surprise, and a neutral state. Other emotions or moods may be determined in alternative embodiments. The
operation 510 may include associating a determined emotion with a human object. In addition, theoperation 510 may include generating an emotion tag that is associated with the scene of the video in which the facial emotion was detected. In other embodiments, the emotion of a facial object may be determined inoperation 510 using any known method. - While the amount of motion in a shot may be determined in
operation 508, in one embodiment, the amount of motion in a shot may be determined inoperation 510 after identifying an object. For example, the position of the identified object in various key frames between the beginning and ending frames of the shot may be compared. - Another type of nonhuman object that may be determined in
operation 510 may be a background, such as an indoor or outdoor location or set. A background nonhuman object may be determined using known techniques, including techniques that consider the size (number of pixels), color, and distribution of pixels in a frame. A background object may be identified using a pattern matching technique that employs patterns or templates of various background objects. Templates may be developed from training images taken from the video or identified in a metadata file. In other embodiments, a background object may be determined in operation 510 using any known method.
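- A simple sketch of matching a frame against stored background patterns by color distribution (the histogram binning and the 0.8 correlation threshold are illustrative assumptions, not values from this disclosure):

```python
import cv2

def classify_background(frame_bgr, background_hists, threshold=0.8):
    """Match a frame against a library of named background color histograms.

    `background_hists` maps a label (e.g. "beach set") to a normalized 3-D color
    histogram computed the same way as below; the threshold is illustrative.
    """
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None, [8, 8, 8],
                        [0, 256, 0, 256, 0, 256])
    hist = cv2.normalize(hist, hist, alpha=1.0, norm_type=cv2.NORM_L1)

    best_label, best_score = None, threshold
    for label, ref in background_hists.items():
        score = cv2.compareHist(hist, ref, cv2.HISTCMP_CORREL)  # 1.0 = identical
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```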
- In
operation 512, a visual scene may be determined according to any known method. A visual scene may include one or more camera shots and one or more human and nonhuman objects. In one embodiment, scenes may be determined by grouping together consecutive shots having visual or audio objects corresponding with the same ground truth. For example, two consecutive shots having the same background object or other non-human object may be grouped together as a scene. As another example, a scene may include a first shot that is a long shot of a particular person and a second shot that is a medium shot of the same person. As a third example, a sequence of four consecutive shots in which the first and third shots have a first human object and the second and fourth shots have a second human object may be grouped together as a scene. - In one embodiment, visual scenes may be determined if a preceding and following shot include related visual objects. For example, the first shot may include a particular person, the second shot may include another person, and two may be interacting. In one embodiment, visual scenes may be determined by comparing histogram data. For example, histogram data for a first of three consecutive shots is compared with the third shot in the series. If the intersection of first and third consecutive shots is outside a threshold, it may be inferred that the shots are similar and part of the same scene, such as where the video shows an interaction between person A and person B, the camera first capturing person A, second capturing person B, and third capturing person A.
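- The grouping of consecutive shots into scenes can be sketched as follows; the Shot structure and the rule of sharing an object with one of the two most recent shots are illustrative simplifications, not the disclosure's method:

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Shot:
    first_frame: int
    last_frame: int
    objects: Set[str]   # labels of human/nonhuman objects identified in the shot

def group_shots_into_scenes(shots: List[Shot]) -> List[List[Shot]]:
    """Group consecutive shots into scenes when they share at least one object.

    A shot joins the current scene if it shares an object with either of the last
    two shots already in the scene, which also covers the alternating A/B
    conversation example described above.
    """
    scenes: List[List[Shot]] = []
    for shot in shots:
        if scenes and any(shot.objects & prev.objects for prev in scenes[-1][-2:]):
            scenes[-1].append(shot)
        else:
            scenes.append([shot])
    return scenes

# Example: a dialogue in which shots alternate between person A and person B.
scene_list = group_shots_into_scenes([
    Shot(0, 40, {"beach set"}),
    Shot(41, 90, {"person A", "kitchen set"}),
    Shot(91, 130, {"person B", "kitchen set"}),
    Shot(131, 170, {"person A", "kitchen set"}),
    Shot(171, 210, {"person B", "kitchen set"}),
])
# -> two scenes: [shot 0] and [shots 1-4]
```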
- The determination of a visual scene in
operation 512 may include associating the scene with a probability or confidence parameter that is a measure of how likely the identified and grouped shots define a scene according to a ground truth specifying the temporal boundaries of a scene. In one embodiment, the validity of a scene determined inoperation 512 may be tested by comparing the temporal span of the scene with other scene determiners, such as a temporal span associated with an audio object. - The determination of a visual scene in
operation 512 may include associating an attribute tag with the scene. The attribute tag may correspond with known cinematic techniques for evoking a particular mood, e.g., amusement, fear, sadness, suspense, or interest. In one embodiment, an attribute tag designating an action theme may be associated with a scene with a relatively large number of shots of short duration. - In
operation 514, visual tags may be associated or set for each scene. As mentioned, a visual tag corresponds with visual objects, such as human and nonhuman objects. When a tag is generated, it may be associated with a time or time span. However, the segments of the video that correspond with the various scenes may not be known at the time a tag is generated.Operation 514 may be performed at a time when the various scenes of the video are known so that a previously generated visual tag may be associated with a particular scene. -
FIG. 6 illustrates a process for generating audio and key word tags according to one embodiment. Referring toFIG. 6 inoperation 602, one or more audio features or audio signal descriptors may be extracted from anaudio file 206. An audio feature may be a time domain feature, such as zero crossing rate, energy contour, volume contour, or fundamental frequency, or a frequency domain feature, such as short term energy, bandwidth, entropy, spectral centroid, Mel-Frequency Cepstral Coefficients, or a Discreet Wavelet Transform. Many audio features are known in the art and any known audio feature or features that are suitable may be extracted inoperation 602. - In
operation 604, audio features or audio signal descriptors extracted from anaudio file 206 may be classified. Each classification may be defined by a set of characteristic audio feature values. In one embodiment, audio features may be classified as silence, speech (spoken words), music, and a fourth category of other sounds that will be referred to herein as “sound effect.” - Segments of the video for which sound is not detectable may be classified as silent. In
operation 605, an audio tag with a silent type attribute may be associated with a silent audio feature, the tag having a time stamp that indicates the start and stop time of the silent period. - Segments of the video for which the audio feature values are similar to those that are characteristic of speech may be classified as speech. An audio tag with a speech type attribute may be associated with the audio feature, the tag having a time stamp of the period of speech. Segments of the video for which the audio feature values are similar to those that are characteristic of music may be classified as music. An audio tag with music type attribute may be associated with the audio feature, the tag having a time stamp of the period of music.
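- A sketch of how silent periods might be located and converted into start/stop time stamps from short-term energy (the frame length and energy threshold are illustrative assumptions, not values from this disclosure):

```python
import numpy as np

def silent_segments(samples, sample_rate, frame_len=1024, energy_threshold=1e-4):
    """Return (start_seconds, stop_seconds) pairs for runs of low-energy frames.

    `samples` is a mono float array scaled to [-1, 1]; a real system would pick
    the frame length and threshold empirically.
    """
    n_frames = len(samples) // frame_len
    segments, run_start = [], None
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))          # short-term energy
        silent = energy < energy_threshold
        if silent and run_start is None:
            run_start = i
        elif not silent and run_start is not None:
            segments.append((run_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            run_start = None
    if run_start is not None:
        segments.append((run_start * frame_len / sample_rate,
                         n_frames * frame_len / sample_rate))
    return segments
```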
- Segments of the video for which the audio feature values are not similar to those that are characteristic of speech or music (and are not silent) may be classified as a sound effect. An audio tag with sound effect type attribute may be associated with a time stamp of the period of music. The sound effect category may include sounds conventionally understood to be movie or television sound effects, such as an explosion, a door being slammed, a motor vehicle engine, a scream, laughter, applause, wind, and rain. The sound effect category may include any sound that may not be classified as speech, music, or silence, even if the sound may not be conventionally understood to be a theatrical sound effect.
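- Continuing the sketch, a crude frame-level classifier can be derived from zero-crossing rate and short-term energy; the decision rules and thresholds below are illustrative heuristics only, not the classification contemplated above:

```python
import numpy as np

def classify_frame(frame, energy_threshold=1e-4, zcr_speech=0.1):
    """Label one audio frame as 'silence', 'speech', 'music', or 'sound effect'.

    Real systems would use richer features (MFCCs, spectral centroid, etc.) and a
    trained classifier; the simple rules here only illustrate the idea.
    """
    energy = float(np.mean(frame ** 2))
    if energy < energy_threshold:
        return "silence"
    # Zero-crossing rate: fraction of adjacent samples that change sign.
    zcr = float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))
    if zcr > zcr_speech:
        return "speech"        # speech alternates voiced/unvoiced sounds, raising ZCR
    # Energy variation over sub-blocks: music tends to be smoother than effects.
    blocks = np.array_split(frame, 8)
    variation = float(np.std([np.mean(b ** 2) for b in blocks]))
    return "music" if variation < energy * 0.5 else "sound effect"
```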
- In
operation 606, audio features classified as sound effects may be further classified by sound effect type. Each sound effect sub-classification may be defined by a set of characteristic audio feature values. For example, a gun shot may be defined by particular audio feature values. A library of audio feature values that are characteristic of a variety of sound effects may be provided. Each audio feature classified as a sound effect may be compared with the library of characteristic features. Where matches are found, the sound effect audio tag may have additional data added to it, specifying the particular sound, e.g., a crying baby sound effect. - An
optional operation 607 may include associating an attribute tag with a sound effect audio feature. The attribute tag may correspond with known cinematic techniques for evoking a particular mood. In one embodiment, an attribute tag designating an action theme may be associated with gun shot or explosion sound effects. In other embodiments, an attribute tag designating a suspense theme or amusement theme may be associated with a sound effect. - In
operation 608, an audio or acoustic fingerprint may be determined for audio features classified as music. An audio fingerprint is a content-based compact signature that may summarize a music recording. In one embodiment, an audio fingerprint need not correspond to an exact copy of a particular music recording. An audio fingerprint may be found to match an extracted music recording where small variations from the particular music recording are present in the extracted audio features. An audio fingerprint is derived from the extracted audio features and may include a vector, a trace of vectors, a codebook, a sequence of Hidden Markov model sound classes, a sequence of error correcting words, or musically meaningful high-level attributes. - A library of audio fingerprints for various music recordings may be provided. In
operation 610, audio features classified as music may be compared with the library. Where matches are found, the music audio tag may have additional data added to it, specifying an identification of the particular song. In addition, an attribute tag designating an emotion, mood, or theme may be associated with a music audio tag. Particular cinematic techniques are known to employ certain types of music to evoke particular moods. In one embodiment, a music audio tag may include attribute data designating that the music is associated with action, suspense, or sad themes if the music is of a particular type. - In
operation 612, an audio transcript may be determined. An audio transcript may include all of the words spoken in the video. In one embodiment, an audio transcript may be provided with the video in the form of a closed caption file included in the AV file container. In another embodiment, spoken words may be determined from audio features classified as speech using any known technique. In yet another embodiment, spoken words may be manually determined. - In
- In operation 614, key words may be determined from the audio transcript. A key word may be a word that provides significant information content about a scene in a video. For example, a key word may be a name of an actor that appears in a scene. A key word may be a name of a concept or idea that is central to a plot or story. For example, the word "run" may be a key word for the movie Forrest Gump. A key word may be a name of a song. A key word may be a word that is predefined to be objectionable or liked by a viewer. For example, a vulgar word may be predefined as a key word. In one embodiment, a key word may be determined from the audio transcript by counting the frequency of occurrences of words, the most frequently occurring verbs and nouns being determined to be key words. The operation 614 may include generating key word objects for each determined key word. In addition, key word tags may be created and stored in the key word tag file 306 (shown in FIG. 3). - In one embodiment, a viewing pattern of a viewer may be gathered during the viewing of various videos. Using the viewing pattern, a viewing profile for a viewer may be generated. The viewing profile may identify categories of objects the viewer prefers. In addition, a viewer may manually input content types that he or she prefers or finds objectionable.
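- A minimal sketch of the frequency-based key word selection in operation 614. Part-of-speech filtering (keeping only nouns and verbs) is omitted and approximated with a small stop-word list, and the handling of predefined objectionable or preferred words is an illustrative assumption.

```python
import re
from collections import Counter

# Small illustrative stop-word list; a real implementation might instead keep
# only nouns and verbs identified by a part-of-speech tagger.
STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "it",
              "that", "this", "was", "for", "on", "with", "you", "i"}


def key_words(transcript: str, predefined=frozenset(), top_n: int = 10) -> list:
    """Select key words by counting the frequency of words in the transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    frequent = [word for word, _ in counts.most_common(top_n)]
    # Predefined words (e.g., vulgar or viewer-specified words) are always key words.
    flagged = sorted(set(tokens) & set(predefined))
    return frequent + [w for w in flagged if w not in frequent]
```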
-
FIGS. 7 and 8 depict the display screen 402 for displaying a user interface according to various embodiments. In one embodiment, a viewer may select one or more time segments to create a playlist. In the example shown in FIG. 7, a viewer has selected time segments 702 and 704. In this example, the viewer desires to view a playlist that includes time segments in which both OBJECT 2 and OBJECT 7 appear. In one embodiment, a viewer may select a time segment using a pointing device, such as a mouse or a touch screen. Once a playlist has been created by a viewer, the Play Selected button 403 may be activated to play the selected time segments. In addition, in one embodiment additional time segments may be recommended to a viewer. One or more OBJECTS in the selected segments may be automatically determined or manually designated by a viewer. An automated search for any other segments that include these OBJECTS may be performed. Segments that are found to include these OBJECTS may then be recommended to a viewer. In the example of FIG. 8, additional time segments in which both OBJECT 2 and OBJECT 7 appear have been recommended to the viewer. - In the foregoing, reference is made to various embodiments. It should be understood, however, that this disclosure is not limited to the specifically described embodiments. Instead, any combination of the described features and elements, whether related to different embodiments or not, is contemplated to implement and practice this disclosure. Many modifications and variations may be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Furthermore, although embodiments of this disclosure may achieve advantages over other possible solutions or over the prior art, whether or not a particular advantage is achieved by a given embodiment is not limiting of this disclosure. Thus, the described aspects, features, embodiments, and advantages are merely illustrative and are not considered elements or limitations of the appended claims except where explicitly recited in a claim(s).
- As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method, or computer program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module,” or “system.” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable medium(s) having computer readable program code embodied thereon.
- Any combination of one or more computer readable medium(s) may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination thereof. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In the context of this disclosure, a computer readable storage medium may be any tangible medium that can contain, or store, a program for use by or in connection with an instruction execution system, apparatus, or device.
- A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
- Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire line, optical fiber cable, RF, etc., or any suitable combination thereof.
- Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including: an object oriented programming language such as Java, Smalltalk, C++, or the like; and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute as specifically described herein. In addition, the program code may execute entirely on the viewer's computer, partly on the viewer's computer, as a stand-alone software package, partly on the viewer's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the viewer's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).
- Aspects of the present disclosure have been described with reference to flowchart illustrations, block diagrams, or both, of methods, apparatuses (systems), and computer program products according to embodiments of this disclosure. It will be understood that each block of the flowchart illustrations or block diagrams, and combinations of blocks in the flowchart illustrations or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions or acts specified in the flowchart or block diagram block or blocks.
- These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function or act specified in the flowchart or block diagram block or blocks.
- The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus, or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions or acts specified in the flowchart or block diagram block or blocks.
- Embodiments according to this disclosure may be provided to end-users through a cloud-computing infrastructure. Cloud computing generally refers to the provision of scalable computing resources as a service over a network. More formally, cloud computing may be defined as a computing capability that provides an abstraction between the computing resource and its underlying technical architecture (e.g., servers, storage, networks), enabling convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. Thus, cloud computing allows a user to access virtual computing resources (e.g., storage, data, applications, and even complete virtualized computing systems) in “the cloud,” without regard for the underlying physical systems (or locations of those systems) used to provide the computing resources.
- Typically, cloud-computing resources are provided to a user on a pay-per-use basis, where users are charged only for the computing resources actually used (e.g., an amount of storage space used by a user or a number of virtualized systems instantiated by the user). A user can access any of the resources that reside in the cloud at any time, and from anywhere across the Internet. In the context of the present disclosure, a user may access applications or related data available in the cloud. For example, the nodes used to create a stream computing application may be virtual machines hosted by a cloud service provider. Doing so allows a user to access this information from any computing system attached to a network connected to the cloud (e.g., the Internet).
- The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which may include one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
- While the foregoing is directed to exemplary embodiments, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow.
Claims (12)
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/722,754 US20140181668A1 (en) | 2012-12-20 | 2012-12-20 | Visual summarization of video for quick understanding |
US14/166,158 US9961403B2 (en) | 2012-12-20 | 2014-01-28 | Visual summarization of video for quick understanding by determining emotion objects for semantic segments of video |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/722,754 US20140181668A1 (en) | 2012-12-20 | 2012-12-20 | Visual summarization of video for quick understanding |
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/166,158 Continuation US9961403B2 (en) | 2012-12-20 | 2014-01-28 | Visual summarization of video for quick understanding by determining emotion objects for semantic segments of video |
Publications (1)
Publication Number | Publication Date |
---|---|
US20140181668A1 true US20140181668A1 (en) | 2014-06-26 |
Family
ID=50974790
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/722,754 Abandoned US20140181668A1 (en) | 2012-12-20 | 2012-12-20 | Visual summarization of video for quick understanding |
US14/166,158 Active 2035-01-18 US9961403B2 (en) | 2012-12-20 | 2014-01-28 | Visual summarization of video for quick understanding by determining emotion objects for semantic segments of video |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US14/166,158 Active 2035-01-18 US9961403B2 (en) | 2012-12-20 | 2014-01-28 | Visual summarization of video for quick understanding by determining emotion objects for semantic segments of video |
Country Status (1)
Country | Link |
---|---|
US (2) | US20140181668A1 (en) |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140233916A1 (en) * | 2013-02-19 | 2014-08-21 | Tangome, Inc. | Integrating selected video frames into a social feed |
US20150206034A1 (en) * | 2014-01-21 | 2015-07-23 | Electronics And Telecommunications Research Institute | Apparatus and method for recognizing object using correlation between object and content-related information |
US20170047096A1 (en) * | 2015-08-10 | 2017-02-16 | Htc Corporation | Video generating system and method thereof |
US9762950B1 (en) * | 2013-09-17 | 2017-09-12 | Amazon Technologies, Inc. | Automatic generation of network pages from extracted media content |
WO2017157419A1 (en) * | 2016-03-15 | 2017-09-21 | Telefonaktiebolaget Lm Ericsson (Publ) | Associating metadata with a multimedia file |
US20180063253A1 (en) * | 2015-03-09 | 2018-03-01 | Telefonaktiebolaget Lm Ericsson (Publ) | Method, system and device for providing live data streams to content-rendering devices |
US20180091858A1 (en) * | 2015-05-22 | 2018-03-29 | Playsight Interactive Ltd. | Event based video generation |
CN108351965A (en) * | 2015-09-14 | 2018-07-31 | 罗技欧洲公司 | The user interface of video frequency abstract |
US20190139576A1 (en) * | 2017-11-06 | 2019-05-09 | International Business Machines Corporation | Corroborating video data with audio data from video content to create section tagging |
US10419790B2 (en) * | 2018-01-19 | 2019-09-17 | Infinite Designs, LLC | System and method for video curation |
US20190349650A1 (en) * | 2018-09-30 | 2019-11-14 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, and device for generating an essence video and storage medium |
US10511888B2 (en) * | 2017-09-19 | 2019-12-17 | Sony Corporation | Calibration system for audience response capture and analysis of media content |
US10565435B2 (en) * | 2018-03-08 | 2020-02-18 | Electronics And Telecommunications Research Institute | Apparatus and method for determining video-related emotion and method of generating data for learning video-related emotion |
US10592750B1 (en) * | 2015-12-21 | 2020-03-17 | Amazon Technlogies, Inc. | Video rule engine |
US10783679B2 (en) * | 2017-01-30 | 2020-09-22 | Disney Enterprises Inc. | Circular visual representation of media content |
US20210090449A1 (en) * | 2019-09-23 | 2021-03-25 | Revealit Corporation | Computer-implemented Interfaces for Identifying and Revealing Selected Objects from Video |
CN112740715A (en) * | 2018-09-20 | 2021-04-30 | 诺基亚技术有限公司 | An apparatus and method for artificial intelligence |
US11256741B2 (en) * | 2016-10-28 | 2022-02-22 | Vertex Capital Llc | Video tagging system and method |
WO2022051638A1 (en) * | 2020-09-03 | 2022-03-10 | Sony Interactive Entertainment Inc. | Multimodal game video summarization with metadata |
US20220279240A1 (en) * | 2021-03-01 | 2022-09-01 | Comcast Cable Communications, Llc | Systems and methods for providing contextually relevant information |
US11449720B2 (en) * | 2019-05-10 | 2022-09-20 | Electronics And Telecommunications Research Institute | Image recognition device, operating method of image recognition device, and computing device including image recognition device |
US11450353B2 (en) * | 2019-04-30 | 2022-09-20 | Sony Interactive Entertainment Inc. | Video tagging by correlating visual features to sound tags |
US20220343119A1 (en) * | 2017-03-24 | 2022-10-27 | Revealit Corporation | Contextual-based method and system for identifying and revealing selected objects from video |
US11551452B2 (en) * | 2018-04-06 | 2023-01-10 | Nokia Technologies Oy | Apparatus and method for associating images from two image streams |
US12277501B2 (en) | 2020-04-14 | 2025-04-15 | Sony Interactive Entertainment Inc. | Training a sound effect recommendation network |
Families Citing this family (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9508385B2 (en) * | 2013-11-21 | 2016-11-29 | Microsoft Technology Licensing, Llc | Audio-visual project generator |
US10467287B2 (en) * | 2013-12-12 | 2019-11-05 | Google Llc | Systems and methods for automatically suggesting media accompaniments based on identified media content |
US9798509B2 (en) | 2014-03-04 | 2017-10-24 | Gracenote Digital Ventures, Llc | Use of an anticipated travel duration as a basis to generate a playlist |
US9431002B2 (en) | 2014-03-04 | 2016-08-30 | Tribune Digital Ventures, Llc | Real time popularity based audible content aquisition |
US9454342B2 (en) | 2014-03-04 | 2016-09-27 | Tribune Digital Ventures, Llc | Generating a playlist based on a data generation attribute |
US10529383B2 (en) * | 2015-04-09 | 2020-01-07 | Avid Technology, Inc. | Methods and systems for processing synchronous data tracks in a media editing system |
US9607224B2 (en) * | 2015-05-14 | 2017-03-28 | Google Inc. | Entity based temporal segmentation of video streams |
US20170076156A1 (en) * | 2015-09-14 | 2017-03-16 | Logitech Europe S.A. | Automatically determining camera location and determining type of scene |
US10299017B2 (en) | 2015-09-14 | 2019-05-21 | Logitech Europe S.A. | Video searching for filtered and tagged motion |
US9805567B2 (en) | 2015-09-14 | 2017-10-31 | Logitech Europe S.A. | Temporal video streaming and summaries |
US10261963B2 (en) | 2016-01-04 | 2019-04-16 | Gracenote, Inc. | Generating and distributing playlists with related music and stories |
EP3327677B8 (en) * | 2016-11-25 | 2019-09-18 | Nokia Technologies Oy | An apparatus for spatial audio and associated method |
US10146100B2 (en) | 2016-12-12 | 2018-12-04 | Gracenote, Inc. | Systems and methods to transform events and/or mood associated with playing media into lighting effects |
US10019225B1 (en) | 2016-12-21 | 2018-07-10 | Gracenote Digital Ventures, Llc | Audio streaming based on in-automobile detection |
US10419508B1 (en) | 2016-12-21 | 2019-09-17 | Gracenote Digital Ventures, Llc | Saving media for in-automobile playout |
US10565980B1 (en) | 2016-12-21 | 2020-02-18 | Gracenote Digital Ventures, Llc | Audio streaming of text-based articles from newsfeeds |
US11528525B1 (en) | 2018-08-01 | 2022-12-13 | Amazon Technologies, Inc. | Automated detection of repeated content within a media series |
US10455297B1 (en) * | 2018-08-29 | 2019-10-22 | Amazon Technologies, Inc. | Customized video content summary generation and presentation |
US11037304B1 (en) * | 2018-09-10 | 2021-06-15 | Amazon Technologies, Inc. | Automated detection of static content within portions of media content |
US10848819B2 (en) | 2018-09-25 | 2020-11-24 | Rovi Guides, Inc. | Systems and methods for adjusting buffer size |
US11265597B2 (en) * | 2018-10-23 | 2022-03-01 | Rovi Guides, Inc. | Methods and systems for predictive buffering of related content segments |
US11636673B2 (en) * | 2018-10-31 | 2023-04-25 | Sony Interactive Entertainment Inc. | Scene annotation using machine learning |
US10977872B2 (en) | 2018-10-31 | 2021-04-13 | Sony Interactive Entertainment Inc. | Graphical style modification for video games using machine learning |
US11375293B2 (en) | 2018-10-31 | 2022-06-28 | Sony Interactive Entertainment Inc. | Textual annotation of acoustic effects |
CN109615682A (en) * | 2018-12-07 | 2019-04-12 | 北京微播视界科技有限公司 | Animation producing method, device, electronic equipment and computer readable storage medium |
US11030479B2 (en) * | 2019-04-30 | 2021-06-08 | Sony Interactive Entertainment Inc. | Mapping visual tags to sound tags using text similarity |
US11071182B2 (en) | 2019-11-27 | 2021-07-20 | Gracenote, Inc. | Methods and apparatus to control lighting effects |
US10904446B1 (en) | 2020-03-30 | 2021-01-26 | Logitech Europe S.A. | Advanced video conferencing systems and methods |
US10951858B1 (en) | 2020-03-30 | 2021-03-16 | Logitech Europe S.A. | Advanced video conferencing systems and methods |
US10972655B1 (en) | 2020-03-30 | 2021-04-06 | Logitech Europe S.A. | Advanced video conferencing systems and methods |
US10965908B1 (en) | 2020-03-30 | 2021-03-30 | Logitech Europe S.A. | Advanced video conferencing systems and methods |
US11615312B2 (en) | 2020-04-14 | 2023-03-28 | Sony Interactive Entertainment Inc. | Self-supervised AI-assisted sound effect generation for silent video using multimodal clustering |
US11277666B2 (en) | 2020-06-10 | 2022-03-15 | Rovi Guides, Inc. | Systems and methods to improve skip forward functionality |
US11184675B1 (en) * | 2020-06-10 | 2021-11-23 | Rovi Guides, Inc. | Systems and methods to improve skip forward functionality |
US11276433B2 (en) | 2020-06-10 | 2022-03-15 | Rovi Guides, Inc. | Systems and methods to improve skip forward functionality |
US11625922B2 (en) | 2021-03-08 | 2023-04-11 | Motorola Solutions, Inc. | Event summarization facilitated by emotions/reactions of people near an event location |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5734794A (en) * | 1995-06-22 | 1998-03-31 | White; Tom H. | Method and system for voice-activated cell animation |
US6544294B1 (en) * | 1999-05-27 | 2003-04-08 | Write Brothers, Inc. | Method and apparatus for creating, editing, and displaying works containing presentation metric components utilizing temporal relationships and structural tracks |
US20030118974A1 (en) * | 2001-12-21 | 2003-06-26 | Pere Obrador | Video indexing based on viewers' behavior and emotion feedback |
US20040263529A1 (en) * | 2002-05-31 | 2004-12-30 | Yuji Okada | Authoring device and authoring method |
US20070223871A1 (en) * | 2004-04-15 | 2007-09-27 | Koninklijke Philips Electronic, N.V. | Method of Generating a Content Item Having a Specific Emotional Influence on a User |
US20130204664A1 (en) * | 2012-02-07 | 2013-08-08 | Yeast, LLC | System and method for evaluating and optimizing media content |
Family Cites Families (39)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3036287B2 (en) | 1992-12-15 | 2000-04-24 | 富士ゼロックス株式会社 | Video scene detector |
WO1996025821A1 (en) | 1995-02-14 | 1996-08-22 | Index Systems, Inc. | Apparatus and method for allowing rating level control of the viewing of a program |
US6014183A (en) | 1997-08-06 | 2000-01-11 | Imagine Products, Inc. | Method and apparatus for detecting scene changes in a digital video stream |
US7380258B2 (en) | 2000-06-21 | 2008-05-27 | At&T Delaware Intellectual Property, Inc. | Systems and methods for controlling and managing programming content and portions thereof |
JP4683253B2 (en) | 2000-07-14 | 2011-05-18 | ソニー株式会社 | AV signal processing apparatus and method, program, and recording medium |
EP1312209B1 (en) | 2000-08-25 | 2017-03-08 | OpenTV, Inc. | Personalized remote control |
US7904814B2 (en) | 2001-04-19 | 2011-03-08 | Sharp Laboratories Of America, Inc. | System for presenting audio-video content |
US20050257242A1 (en) | 2003-03-14 | 2005-11-17 | Starz Entertainment Group Llc | Multicast video edit control |
KR101150748B1 (en) | 2003-06-30 | 2012-06-08 | 아이피지 일렉트로닉스 503 리미티드 | System and method for generating a multimedia summary of multimedia streams |
EP1666967B1 (en) * | 2004-12-03 | 2013-05-08 | Magix AG | System and method of creating an emotional controlled soundtrack |
US20080253617A1 (en) | 2005-09-29 | 2008-10-16 | Koninklijke Philips Electronics, N.V. | Method and Apparatus for Determining the Shot Type of an Image |
US8654848B2 (en) | 2005-10-17 | 2014-02-18 | Qualcomm Incorporated | Method and apparatus for shot detection in video streaming |
US20070129942A1 (en) | 2005-12-01 | 2007-06-07 | Ban Oliver K | Visualization and annotation of the content of a recorded business meeting via a computer display |
US8201080B2 (en) | 2006-05-24 | 2012-06-12 | International Business Machines Corporation | Systems and methods for augmenting audio/visual broadcasts with annotations to assist with perception and interpretation of broadcast content |
US8458595B1 (en) * | 2006-05-31 | 2013-06-04 | Adobe Systems Incorporated | Video editing including simultaneously displaying timelines and storyboards |
CN101449587A (en) | 2006-06-08 | 2009-06-03 | 汤姆逊许可公司 | Detection of scene switching for video |
US20080127270A1 (en) * | 2006-08-02 | 2008-05-29 | Fuji Xerox Co., Ltd. | Browsing video collections using hypervideo summaries derived from hierarchical clustering |
US8386257B2 (en) | 2006-09-13 | 2013-02-26 | Nippon Telegraph And Telephone Corporation | Emotion detecting method, emotion detecting apparatus, emotion detecting program that implements the same method, and storage medium that stores the same program |
KR100828371B1 (en) * | 2006-10-27 | 2008-05-08 | 삼성전자주식회사 | Method and apparatus for generating metadata of content |
KR100804678B1 (en) | 2007-01-04 | 2008-02-20 | 삼성전자주식회사 | How to classify scenes by video figures |
JP5175852B2 (en) | 2007-07-31 | 2013-04-03 | パナソニック株式会社 | Video analysis device, method for calculating evaluation value between persons by video analysis |
JP5181325B2 (en) | 2007-08-08 | 2013-04-10 | 国立大学法人電気通信大学 | Cut part detection system, shot detection system, scene detection system, and cut part detection method |
JP2009044423A (en) | 2007-08-08 | 2009-02-26 | Univ Of Electro-Communications | Scene detection system and scene detecting method |
KR101319544B1 (en) | 2007-10-25 | 2013-10-21 | 삼성전자주식회사 | Photographing apparatus for detecting appearance of person and method thereof |
US7889073B2 (en) | 2008-01-31 | 2011-02-15 | Sony Computer Entertainment America Llc | Laugh detector and system and method for tracking an emotional response to a media presentation |
US20090226046A1 (en) * | 2008-03-07 | 2009-09-10 | Yevgeniy Eugene Shteyn | Characterizing Or Recommending A Program |
US20100131993A1 (en) | 2008-11-24 | 2010-05-27 | Sanitate Paul A | Method and apparatus for the efficient generation, storage and delivery of multiple versions of a video |
US8769589B2 (en) | 2009-03-31 | 2014-07-01 | At&T Intellectual Property I, L.P. | System and method to create a media content summary based on viewer annotations |
US8805854B2 (en) * | 2009-06-23 | 2014-08-12 | Gracenote, Inc. | Methods and apparatus for determining a mood profile associated with media data |
US8661353B2 (en) * | 2009-05-29 | 2014-02-25 | Microsoft Corporation | Avatar integrated shared media experience |
US9167189B2 (en) | 2009-10-15 | 2015-10-20 | At&T Intellectual Property I, L.P. | Automated content detection, analysis, visual synthesis and repurposing |
US8990690B2 (en) * | 2011-02-18 | 2015-03-24 | Futurewei Technologies, Inc. | Methods and apparatus for media navigation |
US20130246063A1 (en) * | 2011-04-07 | 2013-09-19 | Google Inc. | System and Methods for Providing Animated Video Content with a Spoken Language Segment |
US8937620B1 (en) * | 2011-04-07 | 2015-01-20 | Google Inc. | System and methods for generation and control of story animation |
US9053750B2 (en) * | 2011-06-17 | 2015-06-09 | At&T Intellectual Property I, L.P. | Speaker association with a visual representation of spoken content |
KR20130055429A (en) * | 2011-11-18 | 2013-05-28 | 삼성전자주식회사 | Apparatus and method for emotion recognition based on emotion segment |
US10372758B2 (en) * | 2011-12-22 | 2019-08-06 | Tivo Solutions Inc. | User interface for viewing targeted segments of multimedia content based on time-based metadata search criteria |
US9785639B2 (en) * | 2012-04-27 | 2017-10-10 | Mobitv, Inc. | Search-based navigation of media content |
US9721010B2 (en) * | 2012-12-13 | 2017-08-01 | Microsoft Technology Licensing, Llc | Content reaction annotations |
- 2012-12-20: US application 13/722,754 filed; published as US20140181668A1 (abandoned)
- 2014-01-28: US application 14/166,158 filed; granted as US9961403B2 (active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5734794A (en) * | 1995-06-22 | 1998-03-31 | White; Tom H. | Method and system for voice-activated cell animation |
US6544294B1 (en) * | 1999-05-27 | 2003-04-08 | Write Brothers, Inc. | Method and apparatus for creating, editing, and displaying works containing presentation metric components utilizing temporal relationships and structural tracks |
US20030118974A1 (en) * | 2001-12-21 | 2003-06-26 | Pere Obrador | Video indexing based on viewers' behavior and emotion feedback |
US20040263529A1 (en) * | 2002-05-31 | 2004-12-30 | Yuji Okada | Authoring device and authoring method |
US20070223871A1 (en) * | 2004-04-15 | 2007-09-27 | Koninklijke Philips Electronic, N.V. | Method of Generating a Content Item Having a Specific Emotional Influence on a User |
US20130204664A1 (en) * | 2012-02-07 | 2013-08-08 | Yeast, LLC | System and method for evaluating and optimizing media content |
Cited By (47)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140233916A1 (en) * | 2013-02-19 | 2014-08-21 | Tangome, Inc. | Integrating selected video frames into a social feed |
US10257563B2 (en) | 2013-09-17 | 2019-04-09 | Amazon Technologies, Inc. | Automatic generation of network pages from extracted media content |
US9762950B1 (en) * | 2013-09-17 | 2017-09-12 | Amazon Technologies, Inc. | Automatic generation of network pages from extracted media content |
US10721519B2 (en) | 2013-09-17 | 2020-07-21 | Amazon Technologies, Inc. | Automatic generation of network pages from extracted media content |
US20150206034A1 (en) * | 2014-01-21 | 2015-07-23 | Electronics And Telecommunications Research Institute | Apparatus and method for recognizing object using correlation between object and content-related information |
US9412049B2 (en) * | 2014-01-21 | 2016-08-09 | Electronics And Telecommunications Research Institute | Apparatus and method for recognizing object using correlation between object and content-related information |
US20180063253A1 (en) * | 2015-03-09 | 2018-03-01 | Telefonaktiebolaget Lm Ericsson (Publ) | Method, system and device for providing live data streams to content-rendering devices |
US10616651B2 (en) * | 2015-05-22 | 2020-04-07 | Playsight Interactive Ltd. | Event based video generation |
US20180091858A1 (en) * | 2015-05-22 | 2018-03-29 | Playsight Interactive Ltd. | Event based video generation |
US20170047096A1 (en) * | 2015-08-10 | 2017-02-16 | Htc Corporation | Video generating system and method thereof |
CN108351965A (en) * | 2015-09-14 | 2018-07-31 | 罗技欧洲公司 | The user interface of video frequency abstract |
US10592750B1 (en) * | 2015-12-21 | 2020-03-17 | Amazon Technlogies, Inc. | Video rule engine |
US10915569B2 (en) | 2016-03-15 | 2021-02-09 | Telefonaktiebolaget Lm Ericsson (Publ) | Associating metadata with a multimedia file |
WO2017157419A1 (en) * | 2016-03-15 | 2017-09-21 | Telefonaktiebolaget Lm Ericsson (Publ) | Associating metadata with a multimedia file |
US11256741B2 (en) * | 2016-10-28 | 2022-02-22 | Vertex Capital Llc | Video tagging system and method |
US10783679B2 (en) * | 2017-01-30 | 2020-09-22 | Disney Enterprises Inc. | Circular visual representation of media content |
US12354028B2 (en) * | 2017-03-24 | 2025-07-08 | Revealit Corporation | Generative interactive video method and system |
US20240119321A1 (en) * | 2017-03-24 | 2024-04-11 | Revealit Corporation | Generative interactive video method and system |
US11893514B2 (en) * | 2017-03-24 | 2024-02-06 | Revealit Corporation | Contextual-based method and system for identifying and revealing selected objects from video |
US20220343119A1 (en) * | 2017-03-24 | 2022-10-27 | Revealit Corporation | Contextual-based method and system for identifying and revealing selected objects from video |
US10511888B2 (en) * | 2017-09-19 | 2019-12-17 | Sony Corporation | Calibration system for audience response capture and analysis of media content |
US11218771B2 (en) | 2017-09-19 | 2022-01-04 | Sony Corporation | Calibration system for audience response capture and analysis of media content |
US10714144B2 (en) * | 2017-11-06 | 2020-07-14 | International Business Machines Corporation | Corroborating video data with audio data from video content to create section tagging |
US20190139576A1 (en) * | 2017-11-06 | 2019-05-09 | International Business Machines Corporation | Corroborating video data with audio data from video content to create section tagging |
US10419790B2 (en) * | 2018-01-19 | 2019-09-17 | Infinite Designs, LLC | System and method for video curation |
US10565435B2 (en) * | 2018-03-08 | 2020-02-18 | Electronics And Telecommunications Research Institute | Apparatus and method for determining video-related emotion and method of generating data for learning video-related emotion |
US11551452B2 (en) * | 2018-04-06 | 2023-01-10 | Nokia Technologies Oy | Apparatus and method for associating images from two image streams |
CN112740715A (en) * | 2018-09-20 | 2021-04-30 | 诺基亚技术有限公司 | An apparatus and method for artificial intelligence |
US11140462B2 (en) * | 2018-09-30 | 2021-10-05 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, and device for generating an essence video and storage medium |
US20190349650A1 (en) * | 2018-09-30 | 2019-11-14 | Baidu Online Network Technology (Beijing) Co., Ltd. | Method, apparatus, and device for generating an essence video and storage medium |
US11450353B2 (en) * | 2019-04-30 | 2022-09-20 | Sony Interactive Entertainment Inc. | Video tagging by correlating visual features to sound tags |
US11449720B2 (en) * | 2019-05-10 | 2022-09-20 | Electronics And Telecommunications Research Institute | Image recognition device, operating method of image recognition device, and computing device including image recognition device |
US11893592B2 (en) * | 2019-09-23 | 2024-02-06 | Revealit Corporation | Incentivized neural network training and assurance processes |
US11580869B2 (en) * | 2019-09-23 | 2023-02-14 | Revealit Corporation | Computer-implemented interfaces for identifying and revealing selected objects from video |
US20230153836A1 (en) * | 2019-09-23 | 2023-05-18 | Revealit Corporation | Incentivized neural network training and assurance processes |
US20230196385A1 (en) * | 2019-09-23 | 2023-06-22 | Revealit Corporation | Virtual environment-based interfaces applied to selected objects from video |
US12051080B2 (en) * | 2019-09-23 | 2024-07-30 | Revealit Corporation | Virtual environment-based interfaces applied to selected objects from video |
US12361687B2 (en) * | 2019-09-23 | 2025-07-15 | Revealit Corporation | User-described video streams |
US20210090449A1 (en) * | 2019-09-23 | 2021-03-25 | Revealit Corporation | Computer-implemented Interfaces for Identifying and Revealing Selected Objects from Video |
US20240331041A1 (en) * | 2019-09-23 | 2024-10-03 | Revealit Corporation | User-described Video Streams |
US12277501B2 (en) | 2020-04-14 | 2025-04-15 | Sony Interactive Entertainment Inc. | Training a sound effect recommendation network |
WO2022051638A1 (en) * | 2020-09-03 | 2022-03-10 | Sony Interactive Entertainment Inc. | Multimodal game video summarization with metadata |
TWI797740B (en) * | 2020-09-03 | 2023-04-01 | 日商索尼互動娛樂股份有限公司 | Apparatus, method and assembly for multimodal game video summarization with metadata field |
US12003811B2 (en) | 2021-03-01 | 2024-06-04 | Comcast Cable Communications, Llc | Systems and methods for providing contextually relevant information |
US11516539B2 (en) * | 2021-03-01 | 2022-11-29 | Comcast Cable Communications, Llc | Systems and methods for providing contextually relevant information |
US20220279240A1 (en) * | 2021-03-01 | 2022-09-01 | Comcast Cable Communications, Llc | Systems and methods for providing contextually relevant information |
US12363378B2 (en) | 2021-03-01 | 2025-07-15 | Comcast Cable Communications, Llc | Systems and methods for providing contextually relevant information |
Also Published As
Publication number | Publication date |
---|---|
US20140178043A1 (en) | 2014-06-26 |
US9961403B2 (en) | 2018-05-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9961403B2 (en) | Visual summarization of video for quick understanding by determining emotion objects for semantic segments of video | |
US10733230B2 (en) | Automatic creation of metadata for video contents by in cooperating video and script data | |
US12126868B2 (en) | Content filtering in media playing devices | |
US10679063B2 (en) | Recognizing salient video events through learning-based multimodal analysis of visual features and audio-based analytics | |
US20210117685A1 (en) | System and method for generating localized contextual video annotation | |
US8804999B2 (en) | Video recommendation system and method thereof | |
CN108307229B (en) | Video and audio data processing method and device | |
Sundaram et al. | A utility framework for the automatic generation of audio-visual skims | |
US20230140369A1 (en) | Customizable framework to extract moments of interest | |
US8856636B1 (en) | Methods and systems for trimming video footage | |
KR20080114786A (en) | Method and apparatus for automatic generation of multiple image summaries | |
US11514924B2 (en) | Dynamic creation and insertion of content | |
Wang et al. | Affection arousal based highlight extraction for soccer video | |
CN113992973B (en) | Video abstract generation method, device, electronic equipment and storage medium | |
Le et al. | Learning multimodal temporal representation for dubbing detection in broadcast media | |
KR20060116335A (en) | A computer readable recording medium storing a video summary device and method using an event and a computer program controlling the device | |
CN116980718A (en) | Scenario recomposition method and device for video, electronic equipment and storage medium | |
Xu et al. | Fast summarization of user-generated videos: exploiting semantic, emotional, and quality clues | |
Zhang et al. | AI video editing: A survey | |
Nayak et al. | See me speaking? Differentiating on whether words are spoken on screen or off to optimize machine dubbing | |
CN118695044A (en) | Method, device, computer equipment, readable storage medium and program product for generating promotional video | |
Gagnon et al. | A computer-vision-assisted system for videodescription scripting | |
Valdés et al. | On-line video abstract generation of multimedia news | |
CN113392722B (en) | Method, device, electronic device and storage medium for identifying emotion of object in video | |
CN115407912A (en) | Interact with semantic video clips through interactive tiles |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AS | Assignment |
Owner name: INTERNATIONAL BUSINESS MACHINES CORPORATION, NEW YORK Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:KRITT, BARRY A.;RAKSHIT, SARBAJIT K.;SREMANIAK, SHAWN K.;SIGNING DATES FROM 20121219 TO 20121220;REEL/FRAME:029513/0266 |
|
AS | Assignment |
Owner name: LENOVO ENTERPRISE SOLUTIONS (SINGAPORE) PTE. LTD., SINGAPORE Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:INTERNATIONAL BUSINESS MACHINES CORPORATION;REEL/FRAME:034194/0353 Effective date: 20140926 |
|
STCB | Information on status: application discontinuation |
Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |