
MXPA01004561A - Systems and methods for interoperable multimedia content descriptions - Google Patents

Systems and methods for interoperable multimedia content descriptions

Info

Publication number
MXPA01004561A
MXPA01004561A MXPA/A/2001/004561A
Authority
MX
Mexico
Prior art keywords
descriptions
objects
hierarchy
video
event
Prior art date
Application number
MXPA/A/2001/004561A
Other languages
Spanish (es)
Inventor
Seungyup Paek
Ana Benitez
Shihfu Chang
Original Assignee
Ana Benitez
Shihfu Chang
Seungyup Paek
The Trustees Of Columbia University In The City Of New York
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ana Benitez, Shihfu Chang, Seungyup Paek, The Trustees Of Columbia University In The City Of New York
Publication of MXPA01004561A

Abstract

Systems and methods for generating standard description records from multimedia information are provided. The system includes at least one multimedia information input interface (180) that receives the multimedia information, a computer processor, and a data storage system (150), operatively coupled to said processor, for storing at least one description record. The processor performs object extraction processing to generate multimedia object descriptions (200, 201, 205) from the multimedia information, and object hierarchy processing (410, 420) to generate multimedia object hierarchy descriptions, so as to generate at least one description record that includes the multimedia object descriptions (200, 201, 205) and the multimedia object hierarchy descriptions for content embedded within the multimedia information.

Description

SYSTEMS AND METHODS FOR INTEROPERABLE MULTIMEDIA CONTENT DESCRIPTIONS

BACKGROUND OF THE INVENTION

I. Field of the Invention

The present invention relates to techniques for describing multimedia information, and more specifically, to techniques that describe both video and image information, as well as the content of such information.

II. Description of the Related Art

With the maturation of the global Internet and the widespread use of regional and local networks, digital multimedia information has become increasingly accessible to consumers and businesses. As a result, it has become progressively more important to develop systems that process, filter, search and organize digital multimedia information, so that useful information can be drawn from this growing mass of raw data.

At the time of filing of the present application, solutions exist that allow consumers and businesses to search textual information. Indeed, numerous text-based search engines, such as those provided by yahoo.com, goto.com, excite.com and others, are available on the World Wide Web and are among the most visited sites on the network, indicating the significant demand for such information retrieval technology. Unfortunately, the same is not true for multimedia content, since no generally recognized description of this material exists. In this regard, there have been past attempts to provide multimedia databases that allow users to search for images using characteristics such as color, texture and shape information of video objects contained in the image. However, at the close of the 20th century it is still not possible to perform a general search on the Internet, or on most regional or local networks, for multimedia content, since no widely recognized description of this material exists. Moreover, the need to search for multimedia content is not limited to databases, but extends to other applications, such as digital television broadcasting and multimedia telephony.

One broad industry attempt to develop such a standard is the multimedia description framework undertaken through the MPEG-7 standardization effort of the Moving Picture Experts Group (MPEG). Launched in October 1996, MPEG-7 aims to standardize descriptions of multimedia content in order to facilitate content-focused applications such as multimedia search, filtering, browsing and summarization. The International Organization for Standardization document ISO/IEC JTC1/SC29/WG11 N2460 (Oct. 1998), the content of which is incorporated herein by reference, contains a more complete description of the objectives of the MPEG-7 standard.

The MPEG-7 standard has the objective of specifying a standard set of descriptors, as well as structures (referred to as "description schemes") for the descriptors and their relationships, to describe various types of multimedia information. MPEG-7 also proposes to standardize ways of defining other descriptors as well as description schemes for the descriptors and their relationships. This description, that is, the combination of descriptors and description schemes, is to be associated with the content itself to allow fast and efficient searching and filtering of material of interest to a user. MPEG-7 also proposes to standardize a language for specifying description schemes, that is, a Description Definition Language ("DDL"), and schemes for the binary encoding of multimedia content descriptions.
At the time of filing of the present application, MPEG is soliciting proposals for techniques that will optimally implement the description schemes needed for future integration into the MPEG-7 standard.
In order to provide optimized description schemes, three different multimedia application frameworks can be considered: the distributed processing framework, the content exchange framework, and the framework that permits personalized viewing of multimedia content.

Regarding distributed processing, the description scheme must provide the ability to exchange descriptions of multimedia material independently of any platform, vendor or application, thereby enabling the distributed processing of multimedia content. The standardization of interoperable content descriptions will mean that data from a variety of sources can be plugged into a variety of distributed applications, such as multimedia processors, editors, retrieval systems, filtering agents, etc. Some of these applications may be provided by third parties, generating a sub-industry of multimedia tool providers that can work with standardized descriptions of multimedia data. A user may be permitted to access various content provider sites on the network to copy content and associated indexing data, obtained by low- or high-level processing, and may then access various tool provider sites on the network to copy tools (for example, Java applets) that manipulate the heterogeneous data descriptions in particular ways, according to the user's personal interests. An example of such a multimedia tool is a video editor. An MPEG-7 compliant video editor will be able to manipulate and process video content from a variety of sources if the description associated with each video is MPEG-7 compliant. Each video may come with varying degrees of description detail, such as camera motion, scene cuts, annotations and object segmentations.

A second framework that will benefit enormously from an interoperable content description standard is the exchange of multimedia content between heterogeneous multimedia databases. MPEG-7 seeks to provide the means to express, exchange, translate and reuse existing descriptions of multimedia material. Currently, television broadcasters, radio stations and other content providers manage and store an enormous amount of multimedia material. This material is currently described manually using textual information and proprietary databases. Without an interoperable content description, users of the content must invest human resources to manually translate the descriptions used by each broadcaster into their own proprietary scheme. The exchange of multimedia content descriptions would become possible if all content providers adopted the same content description scheme; this is one of the objectives of MPEG-7.

Finally, multimedia players and viewers that employ the description schemes must offer users innovative capabilities, such as multiple views of the data configured by the user. The user must be able to change the display configuration without requiring the data to be downloaded again in a different format from the content broadcaster.

The above examples only hint at the possible uses for richly structured data delivered in a standardized form based on MPEG-7. Unfortunately, no currently available prior art is capable of generically satisfying the distributed processing, content exchange or personalized viewing frameworks. Specifically, the prior art does not provide a technique for capturing content embedded in multimedia information based on either generic characteristics or semantic relationships, nor does it provide a technique for organizing such content.
Consequently, there is a need in the art for efficient content description schemes for generic multimedia information.

SUMMARY OF THE INVENTION

An object of the present invention is to provide content description schemes for generic multimedia information. Another object of the present invention is to provide techniques for implementing standardized multimedia content description schemes. A further object of the present invention is to provide an apparatus that allows users to carry out general searches for multimedia content on the Internet or on regional or local networks. Yet another object of the present invention is to provide a technique for capturing content embedded in multimedia information based on both generic characteristics and semantic relationships. Still another object of the present invention is to provide a technique for organizing the content embedded in multimedia information on the basis of either generic characteristics or semantic relationships.

In order to meet these and other objects that will become apparent with reference to the further disclosure set forth below, the present invention provides a system for generating at least one description record from multimedia information. The system includes at least one multimedia information input interface that receives the multimedia information, a computer processor, and a data storage system, operatively coupled to said processor, for storing the at least one description record. In order to meet the objects of the present invention, the processor performs object extraction processing to generate multimedia object descriptions from the multimedia information, and object hierarchy processing to generate multimedia object hierarchy descriptions, so as to generate at least one description record that includes the multimedia object descriptions and the multimedia object hierarchy descriptions for the content embedded within the multimedia information.

In a preferred arrangement, the multimedia information is image information, the multimedia object descriptions are image object descriptions, and the multimedia object hierarchy descriptions are image object hierarchy descriptions. In an alternative preferred arrangement, the multimedia information is video information, the multimedia object descriptions are video object descriptions, and the multimedia object hierarchy descriptions are video object hierarchy descriptions.
Where the multimedia information is image information, it is highly preferred that the object extraction processing include image segmentation processing to segment each image of the image information into regions, and feature extraction processing to generate one or more feature descriptions for one or more of the regions. The feature descriptions may include text annotation, color, texture, shape, size and position information. Likewise, it is advantageous for the object hierarchy processing to include physical object hierarchy organization, to generate physical object hierarchy descriptions of the image object descriptions based on spatial characteristics of the objects, and logical object hierarchy organization, to generate logical object hierarchy descriptions of the image object descriptions based on semantic characteristics of the objects, such that the image object hierarchy descriptions comprise both physical and logical descriptions. An encoder may be added to the system to encode the image object descriptions and the image object hierarchy descriptions into compressed description information.

Where the multimedia information is video information, it is highly preferred that the object extraction processing include video segmentation processing to temporally segment the video information into one or more video events or groups of video events; video object extraction processing to segment the video events into regions and generate feature descriptions for the regions; and feature extraction processing to generate one or more feature descriptions for the video events. The feature descriptions for events may include text annotations, shot transitions, camera motion, time and key frame information. The feature descriptions for objects may include text annotations, color, texture, shape, size, position, motion and time. Similarly, it is advantageous for the object hierarchy processing to include physical event and object hierarchy organization, to generate physical hierarchy descriptions of the video events and objects based on temporal characteristics of those events and objects, and logical event and object hierarchy organization, to generate logical hierarchy descriptions of the video events and objects based on semantic characteristics of those events and objects, such that hierarchy descriptions are generated for the events and objects contained within the video information.

The present invention also provides methods for providing a content description scheme for generic multimedia information. In one arrangement, the method includes the steps of receiving multimedia information; processing the multimedia information by performing object extraction processing to generate multimedia object descriptions; processing the generated multimedia object descriptions by performing object hierarchy processing to generate multimedia object hierarchy descriptions, such that at least one description record is generated that includes the object descriptions and the object hierarchy descriptions for the content embedded within the multimedia information; and storing the record. The multimedia information may be image or video information.

The present invention also provides computer readable media containing digital information comprising at least one multimedia description record that describes the multimedia content of corresponding multimedia information.
In one arrangement, the medium includes at least one object description for corresponding objects embedded in the multimedia information, one or more features characterizing each of the objects, and any available hierarchy information relating at least a portion of the objects according to at least one of the features. The multimedia information may be image or video information, and in the case of video information, the objects may be video events or video objects embedded within the video information.

The accompanying drawings, which are incorporated in and constitute part of this disclosure, illustrate a preferred embodiment of the invention and serve to explain the principles of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a system diagram of a preferred embodiment of the present invention; Figure 2 is a functional diagram of a multimedia content description system suitable for use in the system of Figure 1; Figure 3 is an illustrative diagram of an image showing exemplary image objects; Figures 4a and 4b are illustrative diagrams showing a set of exemplary image objects and hierarchical organizations for the exemplary image objects shown in Figure 3; Figure 5 is an illustrative diagram of a video showing exemplary video events; Figures 6a and 6b are illustrative diagrams showing a set of video events and an exemplary hierarchical organization for the exemplary video events shown in Figure 5; Figure 7 is a flow diagram of a process that can be implemented in the system of Figure 1 to generate image descriptions; and Figure 8 is a flow diagram of a process that can be implemented in the system of Figure 1 to generate video descriptions.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring to Figure 1, an exemplary embodiment of the present invention is shown. The architecture of the system 100 includes a client computer 110 and a server computer 120. The server computer 120 includes a display interface 130, a query dispatcher 140, a performance database 150, query translators 160, 161, 165, target search engines 170, 171, 175, an additional client computer 180, and multimedia content description systems 200, 201, 205, which will be described in greater detail below. While the following discussion refers to this exemplary client-server embodiment, those skilled in the art will understand that the particular system arrangement may be modified within the scope of the invention to include numerous well-known local or distributed architectures. For example, all functionality of the client-server system could be included within a single computer, or a plurality of server computers with shared or separate functionality could be employed. The multimedia content description systems 200, 201, 205 are preferably software routines executed by a general-purpose processor within the server computer 120.

Commercially available meta-search engines act as gateways that link users automatically and transparently to multiple text-based search engines. The system of Figure 1 builds on the architecture of such meta-search engines and is designed to intelligently select and interface with multiple multimedia search engines by ranking their performance for different classes of user queries. Accordingly, the query dispatcher 140, the query translators 160, 161, 165, and the display interface 130 of commercially available meta-search engines may be employed in the present invention.
The dispatcher 140 selects the target search engines to be queried by consulting the performance database 150 upon receipt of a user query. This database 150 contains performance scores of past query successes and failures for each supported search option. The query dispatcher selects only those search engines 170, 171, 175 that are capable of satisfying the user's query; for example, a query seeking color information will trigger color-enabled search engines. The query translators 160, 161, 165 translate the user query into suitable scripts conforming to the interfaces of the selected search engines. The display component 130 uses the performance scores to merge the results from each search engine and presents them to the user.

In accordance with the present invention, in order to enable a user to intelligently search multimedia content on the Internet or on a regional or local network, search queries may be made with respect to the content embedded in the multimedia information. Content-based search queries may be made by describing the multimedia content according to the description schemes of the present invention, for example, or by example or sketch. Each search engine 170, 171, 175 employs a description scheme, for example the description schemes described below, to describe the contents of the multimedia information accessible by that search engine and to implement the search.

In order to implement a content-based search query for multimedia information initiated at the client computer 110, the dispatcher 140 will match the query description, through a multimedia content description system 200, with that employed by each search engine 170, 171, 175, to ensure that the user's preferences in the query are satisfied. It will then select the target search engines 170, 171, 175 to be queried by consulting the performance database 150. For example, if the user of the client computer 110 wants to search by color and a search engine does not support any color descriptor, it will be useless to query that particular search engine. Next, the query translators 160 will adapt the query description into descriptions conforming to each selected search engine. This translation will also be based on the description schemes available from each search engine. This task may require executing extraction code for standard descriptors, or extraction code downloaded from specific search engines, to transform descriptors. For example, if the user specifies the color feature of an object using a 166-bin color coherence descriptor, the query translator will translate it into the specific color descriptors used by each search engine, for example, color coherence and a color histogram of x bins. Before displaying the results to the user, the query interface will merge the results from each search option by translating all result descriptions into a homogeneous one for comparison and ranking. Again, execution of similarity code for standard descriptors, or similarity code downloaded from the search engines, may be required. The user's preferences will determine how the results are displayed. Alternatively, a search query may be entered via the client computer 180, which has direct access to the target search engine 170. In contrast to a query entered through the client computer 110, the client computer 180 will not permit a meta-search across multiple search engines.
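By way of illustration only, the following Python sketch suggests how the selection and translation behavior just described might be realized. All names here (SearchEngine, dispatch, translate_color_query, the 64-bin re-binning) are illustrative assumptions and not part of the disclosed system.

# Hypothetical sketch of dispatcher/translator behavior; names are assumptions.
from dataclasses import dataclass

@dataclass
class SearchEngine:
    name: str
    supported_features: set          # e.g. {"color", "texture"}
    color_descriptor: str = "none"   # descriptor this engine understands

def dispatch(query_features, engines, performance_db):
    """Select only engines able to satisfy the query, ranked by past score."""
    capable = [e for e in engines if query_features <= e.supported_features]
    return sorted(capable, key=lambda e: performance_db.get(e.name, 0.0),
                  reverse=True)

def translate_color_query(histogram_166, engine):
    """Adapt a 166-bin color coherence query to an engine's own descriptor."""
    if engine.color_descriptor == "color_histogram_64":
        # Coarsen 166 bins to 64 by simple re-binning (illustrative only).
        step = len(histogram_166) / 64.0
        return [sum(histogram_166[int(i * step):int((i + 1) * step)])
                for i in range(64)]
    return histogram_166  # engine accepts the query descriptor as-is

# Usage: pick engines for a color query and translate the descriptor.
engines = [SearchEngine("A", {"color"}, "color_histogram_64"),
           SearchEngine("B", {"texture"})]
scores = {"A": 0.9, "B": 0.4}
for e in dispatch({"color"}, engines, scores):
    q = translate_color_query([1.0] * 166, e)
    print(e.name, len(q))

The point of the sketch is simply that capability matching happens before ranking, and descriptor translation happens per engine, exactly as described for the dispatcher 140 and translators 160, 161, 165 above.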
In any event, the multimedia content description system 200 may be employed in any arrangement that supports a content-based search in accordance with the present invention.

Referring now to Figure 2, the description system 200, which according to the present invention is employed by each search engine 170, 171, 175, will now be described. In the preferred embodiment set forth herein, the eXtensible Markup Language ("XML") is used to describe multimedia content. XML is a subset of the Standard Generalized Markup Language ("SGML"), the standard language for defining and using document formats. SGML allows documents to be self-describing, that is, they describe their own grammar by specifying the tag set used in the document and the structural relationships that those tags represent. XML retains the key advantages of SGML in a language that is designed to be considerably easier to learn, use and implement than full SGML. A full description of XML can be found on the World Wide Web Consortium's XML page, at http://www.w3.org/XML/, the content of which is incorporated herein by reference.

The description system 200 advantageously includes several image and video processing, analysis and annotation subsystems 210, 220, 230, 240, 250, 260, 270, 280 to generate a rich variety of descriptions for a collection of image and video items 205. Each subsystem is described in turn.

The first subsystem 210 is a region-based indexing and search system, which extracts visual features such as color, texture, motion, shape and size for automatically segmented regions of a video sequence. The system 210 decomposes the video into separate shots by scene-change detection, where a scene change may be abrupt or transitional (for example, dissolves, fade in/out, wipes). For each shot, the system 210 estimates both the global motion (that is, the motion of the dominant background) and the camera motion, and then segments, detects and tracks regions across the frames of the shot, computing the various visual features for each region. For each shot, the description generated by this system is a set of regions with visual and motion features, together with the camera motion. A complete description of the region-based indexing and search system 210 is contained in co-pending PCT Application Serial No. PCT/US98/09124, filed May 5, 1998, entitled "An Algorithm Architecture and Search System for Video Based on Object-Oriented Content", the content of which is incorporated herein by reference.

As used herein, a "video clip" refers to a sequence of frames of video information having one or more video objects with identifiable attributes, such as, by way of example and not limitation, a baseball player swinging a bat, a surfboard moving across the sea, or a horse running across a meadow. A "video object" is a contiguous set of pixels that is homogeneous in one or more features of interest, e.g., texture, color, motion and shape. Thus, a video object is formed by one or more video regions that exhibit consistency in at least one feature. For example, a shot of a person (the person here being the "object") walking could be segmented into a collection of adjoining regions differing in criteria such as shape, color and texture, while all the regions exhibit consistency in their motion attribute.

The second subsystem 220 is a face detection system that efficiently and automatically detects faces directly in the compressed MPEG domain.
The human face is an important object in video: it is ubiquitous in news, documentaries, films, etc., and it provides key information to the viewer for understanding the content of the video. This system provides a set of regions tagged as faces. A complete description of the system 220 is contained in PCT Application Serial No. PCT/US97/20024, filed November 4, 1997, entitled "A Highly Efficient System for Automatic Face Region Detection in MPEG Video", the content of which is incorporated herein by reference.

The third subsystem 230 is a video object segmentation system in which automatic segmentation is integrated with user input to track semantic objects in video sequences. For general video sources, the system allows users to define an approximate object boundary using a tracing interface. Given the approximate object boundary, the system automatically refines the boundary and tracks the movement of the object in subsequent frames of the video. The system is robust enough to handle many real-world situations that are difficult to model in existing approaches, including complex objects, fast and intermittent motion, complicated backgrounds, multiple moving objects and partial occlusion. The description generated by this system is a set of semantic objects with associated regions and features that can be annotated manually with text. A complete description of the system 230 is contained in U.S. Patent Application Serial No. 09/405,555, filed September 24, 1998, entitled "An Active System and Algorithm for Semantic Video Object Segmentation", the content of which is incorporated herein by reference.

The fourth subsystem 240 is a hierarchical video browsing system that parses compressed MPEG video streams to extract shot boundaries, moving objects, object features and camera motion, and also generates a hierarchical shot-based browsing interface for intuitive visualization and editing of video. A complete description of the system 240 is contained in PCT Application Serial No. PCT/US97/08266, filed May 16, 1997, entitled "A Method and Architecture for Indexing and Editing Compressed Video Over the World Wide Web", the content of which is incorporated herein by reference.

The fifth subsystem 250 is the entry of manual text annotations. It is often desirable to integrate visual features and textual features for scene classification. For images from on-line news sources, for example Clarinet, textual information in the form of captions or articles is often associated with each image. This textual information can be included in the descriptions.

The sixth subsystem 260 is a system for high-level semantic classification of images and video shots based on low-level visual features. The core of the system consists of various machine learning techniques, such as rule induction, clustering and nearest-neighbor classification. The system is used to classify images and video scenes into high-level semantic scene classes such as {nature landscape}, {city/suburb}, {indoors} and {outdoors}.
The system focuses on machine learning techniques because it has been found that a particular set of rules that works well for one corpus may not work well for another corpus, even for the same set of semantic scene classes. Since the core of the system is based on machine learning techniques, the system can be adapted to achieve high performance for different corpora by training the system with examples from each corpus. The description generated by this system is a set of text annotations indicating the scene class for each image or for each key frame associated with the shots of a video sequence. A complete description of the system 260 is contained in "Integration of Visual and Text-Based Approaches for the Content Labeling and Classification of Photographs," S. Paek et al., ACM SIGIR'99 Workshop on Multimedia Indexing and Retrieval, Berkeley, CA (1999), the content of which is incorporated herein by reference.

The seventh subsystem 270 is a model-based image classification system. Many automatic image classification systems are based on a predefined set of classes in which class-specific algorithms are used to carry out the classification. The system 270 allows users to define their own classes and provide examples that are used to automatically learn visual models. The visual models are based on automatically segmented regions, their associated visual features and their spatial relationships. For example, the user may construct a visual model of a portrait in which one person wearing a blue suit is seated on a brown sofa, and a second person is standing to the right of the seated person. The system uses a combination of lazy learning, decision trees and evolution programs during classification. The description generated by this system is a set of text annotations, that is, the user-defined classes, for each image. A complete description of the system 270 is contained in "Model-Based Classification of Visual Information for Content-Based Retrieval," A. Jaimes et al., Symp. Elec. Imaging: Multimedia Proc. and App. - Storage and Retrieval for Image and Video Databases VII, IS&T/SPIE '99 (1999), the content of which is incorporated herein by reference.

Other subsystems 280 may be added to the multimedia content description system 200, for example, a subsystem used by collaborators to generate descriptions.

In operation, the image and video content 205 may be a database of still images or moving video, a buffer receiving content from a browsing interface 206, or a receptacle for live image or video transmission. The subsystems 210, 220, 230, 240, 250, 260, 270, 280 operate on the image and video content 205 to generate descriptions 211, 221, 231, 241, 251, 261, 271, 281 that include low-level visual features of automatically segmented regions, user-defined semantic objects, high-level scene properties, classifications and associated textual information, as described above. Once all the descriptions for an image or video item have been generated and integrated 290, the descriptions are entered into a database 295, which the search engine 170 accesses. The process implemented by the subsystems 210, 220, 230, 240, 250, 260, 270, 280 to generate the descriptions 211, 221, 231, 241, 251, 261, 271, 281 in a standard format is described below with reference to Figures 7-8.
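Purely as an illustrative sketch of the integration step 290, descriptions produced independently by each subsystem might be merged into a single record as follows. The element names match the description schemes set out below, but merge_descriptions itself is a hypothetical helper, not the disclosed implementation.

# Illustrative sketch of integrating per-subsystem descriptions (step 290)
# into one description record; merge_descriptions is a hypothetical helper.
import xml.etree.ElementTree as ET

def merge_descriptions(fragments):
    """Combine XML description fragments from several subsystems into one
    <image> description record sharing a single <object_set>."""
    record = ET.Element("image")
    object_set = ET.SubElement(record, "object_set")
    for frag in fragments:
        root = ET.fromstring(frag)
        for obj in root.iter("object"):
            object_set.append(obj)   # collect objects from all subsystems
        for hier in root.iter("object_hierarchy"):
            record.append(hier)      # keep any hierarchies intact
    return record

# Usage: one subsystem contributed a region, another a face annotation.
frags = ['<image><object_set><object id="0" type="PHYSICAL"/></object_set></image>',
         '<image><object_set><object id="7" type="PHYSICAL">'
         '<text_annotation>face</text_annotation></object></object_set></image>']
print(ET.tostring(merge_descriptions(frags), encoding="unicode"))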
It should be noted that certain subsystems, that is, the region-based indexing and search subsystem 210 and the video object segmentation system 230, can implement the entire description generation process, while the remaining subsystems implement only portions of the process and may be called by the subsystems 210, 230 during processing. In a similar manner, the subsystems 210 and 230 may call one another for specific tasks in the process.

The standard description schemes for images will now be described with reference to Figures 3-4. Referring to Figure 3, an exemplary image 300 containing three people is shown. The <object> element is the fundamental description element. Each <object> element has an identifier that is unique within a given image description. The identifier is expressed as an attribute of the <object>, for example <object id="1">. The <object> element also requires an attribute named type to distinguish between physical and logical objects. Physical objects usually correspond to continuous regions of the image sharing some common descriptors (features, semantics, etc.), in other words, real objects in the image. Logical objects are groupings of objects based on some higher-level semantic relationship (for example, people's faces). The <object> elements may also include two further attributes, object_ref and object_node_ref. The first allows an object to be derived from another, already existing object, and the second links back to nodes in the object hierarchy. The set of all objects identified in an image is contained within the object set element (<object_set>).

Figure 3 shows nine exemplary objects: the complete family portrait 300, the father 310, the mother 320, the child 330, the parents 340, the children 350, the faces 360, the father's face 311 and the mother's face 321. These objects can be expressed as a set of objects 0, 1, 2, 3, 4, 5, 6, 7, 8, as shown in Figure 4a, with the complete family portrait 300 being object 0, the father 310 being object 1, the mother 320 being object 2, the child 330 being object 3, the parents 340 being object 4, the children 350 being object 5, the faces 360 being object 6, the father's face 311 being object 7 and the mother's face 321 being object 8. In this example, the identified objects are each physical objects, with the exception of the faces 360, which is a logical object. In XML, these image objects can be expressed as follows:

<object_set>
  <!-- Family portrait -->
  <object id="0" type="PHYSICAL"> ... </object>
  <!-- Father -->
  <object id="1" type="PHYSICAL"> ... </object>
  <!-- Mother -->
  <object id="2" type="PHYSICAL"> ... </object>
  <!-- Son -->
  <object id="3" type="PHYSICAL"> ... </object>
  <!-- Parents -->
  <object id="4" type="PHYSICAL"> ... </object>
  <!-- Children -->
  <object id="5" type="PHYSICAL"> ... </object>
  <!-- Faces -->
  <object id="6" type="LOGICAL"> ... </object>
  <!-- Father's face -->
  <object id="7" type="PHYSICAL"> ... </object>
  <!-- Mother's face -->
  <object id="8" type="PHYSICAL"> ... </object>
</object_set>

In this XML description, each object element has an identifier that is unique within the image description. The identifier is expressed as an attribute of the object element (id). Another attribute of the object element (type) distinguishes between physical and logical objects.
The content of each object element has been left empty to illustrate the general structure of the image description. The image description scheme consists of object elements that are combined hierarchically into one or more object hierarchy elements (<object_hierarchy>). A hierarchy is a way of organizing the object elements in the object set. Each object hierarchy consists of a tree of object node elements (<object_node>), where each object node points to an object. The objects in an image can be organized by their location in the image or by their semantic relationships. These two ways of grouping objects give rise to two types of hierarchies: physical and logical. A physical hierarchy describes the physical location of the objects in the image. A logical hierarchy, on the other hand, organizes the objects based on a higher-level understanding of their semantics, similar to semantic clustering.

Continuing with the example image of Figure 3, two possible hierarchies are shown in Figure 4b: an object hierarchy that organizes the objects physically 410, that is, objects 4 and 5 are physically inside object 0; and a second object hierarchy that organizes the objects logically 420, that is, objects 7 and 8 are associated with object 6. In XML, these two hierarchies can be expressed as follows:

<object_hierarchy type="PHYSICAL">
  <!-- Family portrait -->
  <object_node id="9" object_ref="0">
    <!-- Parents -->
    <object_node id="10" object_ref="4">
      <object_node id="11" object_ref="1">
        <object_node id="12" object_ref="7"/>
      </object_node>
      <object_node id="13" object_ref="2">
        <object_node id="14" object_ref="8"/>
      </object_node>
    </object_node>
    <!-- Children -->
    <object_node id="15" object_ref="5">
      <object_node id="16" object_ref="3"/>
    </object_node>
  </object_node>
</object_hierarchy>

<object_hierarchy type="LOGICAL">
  <!-- Faces -->
  <object_node id="17" object_ref="6">
    <object_node id="18" object_ref="7"/>
    <object_node id="19" object_ref="8"/>
  </object_node>
</object_hierarchy>

The type of hierarchy is included in the object hierarchy element as an attribute (type). The object node element has an associated unique identifier in the form of an attribute (id). The object node element refers to an object element by using the latter's unique identifier; the reference to the object element is included as an attribute (object_ref). An object element can include links back to its nodes in the object hierarchy as an attribute. An object set element and one or more object hierarchy elements form the image element (<image>). The <object> elements in the <object_set> are combined hierarchically in an <object_hierarchy> element. The object_node_ref attribute of the <object> elements points to their corresponding nodes in the <object_hierarchy>; conversely, the object_ref attribute of the <object_node> elements refers back to the <object>. An <object> element may contain an optional <location> element and feature descriptor elements such as <text_annotation>, <color>, <texture>, <shape>, <size>, <position>, <motion> and <time>. The <location> element contains a list of the physical locations of the image.
The <time> and <motion> elements are meaningful only when the objects belong to a video sequence, as described later. For example:

<!-- Father's face -->
<object id="7">
  <location> </location>
  <text_annotation> </text_annotation>
  <color> </color>
  <shape> </shape>
  <position> </position>
</object>

Appendix I presents the full image description for the example image shown in Figure 3. The image description scheme is summarized in Table I below.
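As a minimal sketch of how the id/object_ref cross-references defined above tie a hierarchy to its object set, the following Python fragment walks a description record and resolves each node to its object; resolve_hierarchy is an illustrative helper, not part of the description scheme itself.

# Minimal sketch of resolving an <object_hierarchy> against its <object_set>.
import xml.etree.ElementTree as ET

def resolve_hierarchy(image_xml):
    """Print each hierarchy node together with the object it references."""
    root = ET.fromstring(image_xml)
    objects = {o.get("id"): o for o in root.iter("object")}
    for hier in root.iter("object_hierarchy"):
        print("hierarchy type:", hier.get("type"))
        def walk(node, depth=1):
            ref = node.get("object_ref")
            obj = objects.get(ref)
            print("  " * depth, "node", node.get("id"),
                  "-> object", ref, obj.get("type") if obj is not None else "?")
            for child in node.findall("object_node"):
                walk(child, depth + 1)
        walk(hier.find("object_node"))

# Usage with a fragment of the family-portrait example above.
resolve_hierarchy("""
<image>
  <object_set>
    <object id="0" type="PHYSICAL"/><object id="4" type="PHYSICAL"/>
  </object_set>
  <object_hierarchy type="PHYSICAL">
    <object_node id="9" object_ref="0">
      <object_node id="10" object_ref="4"/>
    </object_node>
  </object_hierarchy>
</image>""")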
The location element contains pointers to the storage locations of the image. Note that the annotations can be textual, visual or multimedia. These features can be extracted or assigned automatically, semi-automatically or manually. When features are extracted automatically, the feature descriptors may include links to extraction and similarity matching code, and may even include annotation elements from external description schemes, as shown in the following example:

<object id="4" type="PHYSICAL" object_node_ref="12 16">
  <!-- Father's face -->
  <color> </color>
  <texture>
    <tamura>
      <tamura_value coarseness="0.01" contrast="0.39" orientation="0.7"/>
      <code type="EXTRACTION" language="JAVA" version="1.2">
        <location>
          <location_site href="ftp://extraction.tamura.java"/>
        </location>
      </code>
    </tamura>
  </texture>
  <shape> </shape>
  <position> </position>
  <text_annotation xmlns:ext_AnDS="http://www.other.ds/annotation.elements">
    <ext_AnDS:Object>Face</ext_AnDS:Object>
  </text_annotation>
</object>

A second example, contained in Appendix II, illustrates the content of a particular image that may include one or more different objects, in terms of the features of those objects, including the location where the image is stored; text annotations, that is, the name of the photograph, the names of the people in the photograph, the place where the photograph was taken, the event represented in the photograph and the date of the photograph; color features in terms of LUV color; texture features in terms of Tamura texture; and the size or dimensions of the image. Thus, the information concerning the entire photograph, for example the location where the image is stored, is descriptive of the object "id=o0", which represents the entire image. Other information concerns other objects within the image. The object "id=o1" is described in the example in terms of text annotation (including the name of the person), color, texture, shape using eigenvalue analysis, size, and position in terms of a segmentation mask analysis. For the object "id=o2", only text annotations are provided. The object "id=o3" is a logical object corresponding to the concept of faces.

The standard description schemes for video will now be described with reference to Figures 5-6. Referring to Figure 5, an exemplary video clip 500 with five temporal video events is shown. In the video description scheme, the <event> element is the fundamental description element. Each <event> element has an identifier that is unique within a given video description. The identifier is expressed as an attribute of the <event>, for example <event id="1">. The <event> element requires another attribute, named type, to distinguish different types of events. The type attribute can take three different values: shot, continuous_group_shots and discontinuous_group_shots. Discontinuous groups of shots will commonly be associated together based on common features (e.g., background color) or high-level semantic relationships (e.g., an on-screen actor). The <event> elements may also include two further attributes, basic_event_ref and event_node_ref. The first allows an event to be derived from another, already existing event, and the second links back to nodes in the event hierarchy.
The set of all events is contained within the <event_set> element. Several exemplary video events are shown in Figure 5, including the entire video sequence 500, a scene where a tiger is stalking prey 510, a scene where the tiger is chasing its prey 520, a scene where the tiger captures its prey 530, and a scene where the tiger is feeding 540. The last scene includes two events, one where the tiger is hiding food 550 and a second where the tiger is feeding the young 560. These video events, which parallel image objects, can be expressed as a set of events 0, 1, 2, 3, 4, 5, 6, as shown in Figure 6a, with the entire video sequence 500 being event 0, the scene where the tiger is stalking its prey 510 being event 1, the scene where the tiger is chasing its prey 520 being event 2, the scene where the tiger captures its prey 530 being event 3, the scene where the tiger is feeding 540 being event 4, the scene where the tiger is hiding the food 550 being event 5, and the scene where the tiger is feeding the young 560 being event 6. In this example, the identified events are each shots, with the exception of event 4, which is of the continuous group of shots type. Events that are not temporally adjacent, such as events 1 and 3, can be organized into discontinuous groups of shots. In XML, these video events can be expressed as follows:

<event_set>
  <!-- The Tiger -->
  <event id="0" type="SHOT"> ... </event>
  <!-- Stalking the prey -->
  <event id="1" type="SHOT"> ... </event>
  <!-- The chase -->
  <event id="2" type="SHOT"> ... </event>
  <!-- The capture -->
  <event id="3" type="SHOT"> ... </event>
  <!-- Feeding -->
  <event id="4" type="CONTINUOUS_GROUP_SHOTS"> ... </event>
  <!-- Feeding the young -->
  <event id="5" type="SHOT"> ... </event>
  <!-- Protecting the food -->
  <event id="6" type="SHOT"> ... </event>
</event_set>

Note that each <event> element is left empty in order to show clearly the general structure of the video description scheme. It is important to note that the selection and definition of an event in a given video is determined by the author of the description. The <event> may correspond to a shot or a scene in a video, or even a combination of these. The video description scheme basically consists of <event> elements that are combined hierarchically in an <event_hierarchy> element. The <event_hierarchy> element must contain a single <event_node> element. The <event_node> contains zero or more <event_node> elements and zero or more <event_hierarchy> elements, as described above for the image description scheme. Each <event_node> element is associated with a unique identifier, expressed as an attribute of the element, for example <event_node id="1">. The hierarchy is a way of organizing the <event> elements in the <event_set>. The different events that form a video sequence can be grouped or organized in two different ways: by their position in the video sequence or by their semantic relationships. The <event_hierarchy> includes an attribute, type, to distinguish between physical and logical hierarchies. A physical hierarchy describes the temporal relationships of the events in the video, while logical hierarchies organize events based on a high-level understanding of their semantics. Each <event_node> element contains a reference to an <event> element, using the unique identifier associated with each <event>.
The reference to the <event> is given as an event_ref attribute. The video in Figure 5 has the hierarchy shown in Figure 6b. This hierarchy is expressed in XML as follows:

<event_hierarchy type="PHYSICAL">
  <!-- The Tiger -->
  <event_node id="7" event_ref="0">
    <!-- Stalking the prey -->
    <event_node id="8" event_ref="1"/>
    <!-- The chase -->
    <event_node id="9" event_ref="2"/>
    <!-- The capture -->
    <event_node id="10" event_ref="3"/>
    <!-- Feeding -->
    <event_node id="11" event_ref="4">
      <!-- Hiding the food -->
      <event_node id="12" event_ref="5"/>
      <!-- Feeding the young -->
      <event_node id="13" event_ref="6"/>
    </event_node>
  </event_node>
</event_hierarchy>

An event set element and one or more event hierarchy elements form the video element (<video>). The video element represents the video sequence being described. The <event> elements in the <event_set> are combined hierarchically in an <event_hierarchy> element. The event_node_ref attribute of the <event> elements points to the corresponding nodes in the <event_hierarchy>; conversely, the event_ref attribute of the <event_node> elements refers back to the <event>. In the video description scheme, the <event> element may contain the following elements: <location> (optional), <transition> (optional), <text_annotation> (optional), <object_set> (optional), <camera_motion> (optional), <time> (optional) and <key_frame> (zero or more). The <object_set>, <text_annotation> and <location> elements are defined above with respect to the image description scheme. The <transition> element describes the transitions between shots. Thus, event 3 in the tiger video can be described as follows:

<!-- The capture -->
<event id="3">
  <text_annotation>
    <name_annotation>
      <concept> The capture of the prey </concept>
    </name_annotation>
  </text_annotation>
  <text_annotation> ... </text_annotation>
  <object_set> ... </object_set>
  <camera_motion> ... </camera_motion>
  <time> ... </time>
  <key_frame> ... </key_frame>
</event>

Appendix III presents the complete video description for the example shown in Figure 5. In the video description scheme, the event elements contain features including location, shot transition (that is, various special effects within or across shots), camera motion, time, key frame, annotation and object set elements, among others. The object element is defined in the image description scheme; it represents the relevant objects in the event. As in the image description, these features can be extracted or assigned automatically or manually. For features extracted automatically, the feature descriptors may include links to the extraction and similarity matching code.
A second example, contained in Appendix IV, describes the content of a particular video sequence that may include one or more different events, in terms of the features of those events, including the location where the video is stored; text annotations, that is, the name of the video, the names of the people in the video, the place where the video was taken, the event represented in the video and the date of the video; the objects within the video sequence; the camera motion; the total time of the video sequence in terms of the number of frames; and the key frame. This information concerning the entire video sequence is descriptive of the event id=E0. Other information concerns other objects within the sequence. The event hierarchy used to organize the described content is a physical hierarchy and describes temporal relationships. In this case, the only event is id=E0, which corresponds to the whole video. Within that event, two hierarchies are used to describe the objects within the event, that is, a physical and a logical hierarchy, paralleling the physical and logical hierarchies previously described for the image example.

The process implemented by the system 200 to generate the image and video descriptions described with reference to Figures 3-6 will now be described with reference to Figures 7-8. Figure 7 is a flow diagram illustrating a preferred process for generating descriptions for images. Digital image data 710 are applied to the computer system via link 711. The image data may be uncompressed, or may be compressed according to any suitable compression scheme, for example JPEG. The computer system, under the control of appropriate application software, first performs object extraction 720 on the image data 710 to generate image objects. The object extraction 720 may take the form of a fully automatic processing operation, a semi-automatic processing operation, or a substantially manual operation in which objects are defined primarily through user interaction, such as through a user input device. In a preferred method, the object extraction 720 consists of two subsidiary operations, namely image segmentation 725 and feature extraction and annotation 726. For the image segmentation step 725, any region tracking technique that divides digital images into regions sharing one or more common characteristics may be used. Likewise, for the feature extraction and annotation step 726, any technique that generates features from segmented regions may be used. The region-based indexing and search subsystem 210 described above is suitable for automated image segmentation and feature extraction; the video object segmentation system 230 described above is a suitable example of semi-automated image segmentation and feature extraction. Alternatively, manual segmentation and feature extraction may be used.
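As a rough, purely illustrative sketch of the object extraction step 720 (real segmentation algorithms such as those in subsystems 210 and 230 are far more involved), the two subsidiary operations might look as follows; all function names and the flood-fill criterion are assumptions made for the example.

# Toy sketch of object extraction 720: segment an image into regions (725)
# and compute simple features per region (726). Names are illustrative.
def segment_image(pixels, threshold=32):
    """Group pixels of a tiny grayscale image into regions of similar value
    by flood fill; returns a label for every pixel."""
    h, w = len(pixels), len(pixels[0])
    labels = [[None] * w for _ in range(h)]
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if labels[sy][sx] is not None:
                continue
            seed, stack = pixels[sy][sx], [(sy, sx)]
            labels[sy][sx] = next_label
            while stack:                       # flood fill from the seed pixel
                y, x = stack.pop()
                for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
                    if (0 <= ny < h and 0 <= nx < w and labels[ny][nx] is None
                            and abs(pixels[ny][nx] - seed) < threshold):
                        labels[ny][nx] = next_label
                        stack.append((ny, nx))
            next_label += 1
    return labels, next_label

def extract_features(pixels, labels, n_regions):
    """Per-region feature descriptions: mean value, size and centroid."""
    feats = [{"sum": 0, "n": 0, "ys": 0, "xs": 0} for _ in range(n_regions)]
    for y, row in enumerate(pixels):
        for x, v in enumerate(row):
            f = feats[labels[y][x]]
            f["sum"] += v; f["n"] += 1; f["ys"] += y; f["xs"] += x
    return [{"mean": f["sum"] / f["n"], "size": f["n"],
             "position": (f["ys"] / f["n"], f["xs"] / f["n"])} for f in feats]

# Usage on a 4x4 image with a bright region in the corner.
img = [[200, 200, 10, 10], [200, 200, 10, 10],
       [10, 10, 10, 10], [10, 10, 10, 10]]
labels, n = segment_image(img)
print(extract_features(img, labels, n))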
The object extraction process 720 generates a set of image objects 721 and optional related features, such as annotations (collectively, "image object descriptions"), which are preferably further processed by an object hierarchy extraction and construction module 730. Alternatively, the objects 721 may be stored directly in a database 740, or encoded by an XML encoder 750 or a binary encoder 760 and then stored 751, 752 in the database 740. The object hierarchy extraction and construction module 730 operates on the image object descriptions to generate image object hierarchy descriptions 731. Preferably, both physical object hierarchy organization 735 and logical object hierarchy organization 736 are carried out in parallel to generate the descriptions 731. The region-based indexing and search subsystem 210 described above is suitable for automated image object hierarchy construction; the video object segmentation system 230 described above is a suitable example of semi-automated object hierarchy construction. Alternatively, manual object hierarchy construction may be used. The image object hierarchy descriptions 731 are either stored directly in the database 740, or encoded by the XML encoder 750 or the binary encoder 760 and then stored 751, 752 in the database 740 as an image description record. Once the image description records have been stored in the database storage 740, they remain available in a format useful for access and use by other applications 770, such as search, filtering and archiving applications, for example via bidirectional link 771.

Figure 8 is a flow diagram illustrating a preferred process for generating descriptions for video. Digital video data 810 are applied to the computer system via link 811. The video data may be uncompressed, or may be compressed in accordance with any suitable compression scheme, e.g., MPEG-1, MPEG-2, MPEG-4, motion JPEG, H.261 or H.263. The computer system, under the control of appropriate application software, first performs video event and object extraction 820 on the video data 810 in order to temporally segment the video data 810 into video events and to locate video objects within the events. The video event and object extraction 820 may take the form of a fully automatic processing operation, a semi-automatic processing operation, or a substantially manual operation in which the objects are defined primarily through user interaction, such as through a user input device. In a preferred method, the video event and object extraction process 820 consists of three subsidiary operations, namely video time segmentation 825, object extraction 826, and feature extraction and annotation 827. For the segmentation step 825, the video is temporally divided into shots, continuous groups of shots or discontinuous groups of shots, which share one or more common characteristics. For the object extraction step 826, video objects are extracted from the video shots in a manner similar to the extraction of image objects from still images, except that motion and time information can also be used. The feature extraction and annotation step 827 can be carried out in parallel with the object extraction step 826, and operates on the temporally segmented video shots to generate features such as camera motion, key frames and text annotations.
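As an illustrative sketch only of the temporal segmentation step 825 (actual scene-change detection, e.g., in subsystem 210, is considerably more elaborate and also handles gradual transitions), abrupt shot boundaries might be detected by thresholding frame-to-frame histogram differences; the threshold and histogram representation are assumptions made for the example.

# Toy sketch of video time segmentation 825: detect abrupt shot boundaries
# by thresholding histogram differences between consecutive frames.
def histogram_difference(hist_a, hist_b):
    """L1 distance between two normalized frame histograms."""
    return sum(abs(a - b) for a, b in zip(hist_a, hist_b))

def segment_into_shots(frame_histograms, threshold=0.5):
    """Return (start, end) frame index pairs, one per detected shot."""
    shots, start = [], 0
    for i in range(1, len(frame_histograms)):
        if histogram_difference(frame_histograms[i - 1],
                                frame_histograms[i]) > threshold:
            shots.append((start, i - 1))   # abrupt change: close current shot
            start = i
    shots.append((start, len(frame_histograms) - 1))
    return shots

# Usage: three similar frames, then a cut to two different frames.
hists = [[0.9, 0.1], [0.88, 0.12], [0.9, 0.1], [0.1, 0.9], [0.12, 0.88]]
print(segment_into_shots(hists))   # -> [(0, 2), (3, 4)]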
The region-based classification and search subsystem 210 described above is suitable for automated segmentation, object extraction and feature extraction; the video object segmentation system 230 described above is a suitable example of semi-automated segmentation, object extraction and feature extraction. Alternatively, manual segmentation and extraction can be used. The event and object extraction processing 820 generates a set of video events and objects 821 and optional related features, such as annotations (collectively, "video object descriptions"), which are preferably further processed by an event and object hierarchy extraction and construction module 830. Alternatively, the video events and objects 821 can be stored directly in a database 840, or encoded by an XML encoder 850 or a binary encoder 860 and then stored in the database 840. The module 830 operates on the video object descriptions to generate video object hierarchy descriptions 831. Preferably, the video object descriptions are processed by both physical and logical organizations in parallel. Thus, the video object descriptions can be subjected to both physical (temporal) event hierarchy organization 835 and logical event hierarchy organization 836 in parallel, and then to physical object hierarchy organization 837 and logical object hierarchy organization 838, such that both the video events and the objects contained in those events are organized hierarchically. The region-based classification and search subsystem 210 described above is suitable for automated construction of video object hierarchies; the video object segmentation system 230 described above is a suitable example of semi-automated construction of video object hierarchies. Alternatively, manual construction of video object hierarchies can be used. The video object hierarchy descriptions 831 are either stored directly in a database 840 together with the video object descriptions, or are encoded by an XML encoder 850 or a binary encoder 860 and then stored in the database 840 as a video description record. Once the video description records have been stored in the database storage 840, they are available in a format useful for access and use by other applications 870, such as search, filtering and archiving applications, for example via bidirectional link 871.
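As a purely illustrative sketch, the following Python fragment suggests how the hierarchy organization steps (735, 736 for images; 835-838 for video) might arrange object descriptions into a physical hierarchy by spatial containment and a logical hierarchy by a shared semantic label. The containment test and the label-based grouping rule are simplifying assumptions for the example, not the techniques required by the description scheme.

# Illustrative sketch of object hierarchy construction: the physical
# hierarchy nests objects whose bounding boxes are spatially contained
# in one another; the logical hierarchy groups objects that share a
# semantic label. Both rules are assumptions made for this sketch.

def contains(outer, inner):
    # Bounding boxes given as (x0, y0, x1, y1).
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def physical_hierarchy(objects):
    # objects: list of {"id": ..., "bbox": (x0, y0, x1, y1), "labels": [...]}
    nodes = {o["id"]: {"object_ref": o["id"], "children": []} for o in objects}
    roots = []
    for o in objects:
        parents = [p for p in objects
                   if p is not o and contains(p["bbox"], o["bbox"])]
        if parents:
            # Attach each object to its smallest enclosing object.
            parent = min(parents, key=lambda p:
                         (p["bbox"][2] - p["bbox"][0]) * (p["bbox"][3] - p["bbox"][1]))
            nodes[parent["id"]]["children"].append(nodes[o["id"]])
        else:
            roots.append(nodes[o["id"]])
    return roots

def logical_hierarchy(objects, label):
    # Group every object carrying the given semantic label under one node.
    members = [{"object_ref": o["id"]} for o in objects if label in o["labels"]]
    return {"label": label, "children": members}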
">
The foregoing merely illustrates the principles of the invention. Various modifications and alterations to the described embodiments will be apparent to those skilled in the art in view of applicants' teachings herein. It will therefore be appreciated that those skilled in the art will be able to devise numerous systems and methods which, although not explicitly shown or described herein, embody the principles of the invention and are thus within its spirit and scope.

APPENDIX I

image_ds.dtd:

<!-- Image DS -->
<!ELEMENT image (object_set, object_hierarchy*)>
<!ELEMENT object_set (object+)>
<!ELEMENT object (location?, text_annotation?, color?, texture?, shape?, size?, position?, motion?, time?)>
<!ATTLIST object
  id ID #REQUIRED
  object_ref IDREF #IMPLIED
  object_node_ref IDREFS #IMPLIED
  type (PHYSICAL|LOGICAL) #REQUIRED>

<!-- External object location DTD -->
<!ENTITY % location SYSTEM "location.dtd">
%location;
<!-- External object annotation DTD -->
<!ENTITY % text_annotation SYSTEM "text_annotation.dtd">
%text_annotation;
<!-- External object color DTD -->
<!ENTITY % color SYSTEM "color.dtd">
%color;
<!-- External object texture DTD -->
<!ENTITY % texture SYSTEM "texture.dtd">
%texture;
<!-- External object shape DTD -->
<!ENTITY % shape SYSTEM "shape.dtd">
%shape;
<!-- External object size DTD -->
<!ENTITY % size SYSTEM "size.dtd">
%size;
<!-- External object position DTD -->
<!ENTITY % position SYSTEM "position.dtd">
%position;
<!-- External object motion DTD -->
<!ENTITY % motion SYSTEM "motion.dtd">
%motion;
<!-- External object time DTD -->
<!ENTITY % time SYSTEM "time.dtd">
%time;

<!-- Object hierarchy -->
<!-- The attribute "type" is the hierarchy binding type -->
<!ELEMENT object_hierarchy (object_node)>
<!ATTLIST object_hierarchy type (LOGICAL|SPATIAL) #REQUIRED>
<!ELEMENT object_node (object_node*)>
<!ATTLIST object_node
  id ID #REQUIRED
  object_ref IDREF #REQUIRED>

<!ENTITY mpeg7 "ISO/IEC JTC1/SC29/WG11 MPEG-7">
<!-- Image DS end -->

location.dtd:

<!-- Description of resources' location -->
<!-- Objects, images and videos can be located/accessed at different locations -->
<!ELEMENT location (location_site*)>
<!ATTLIST location
  xml-link CDATA #FIXED "EXTENDED"
  role CDATA #IMPLIED
  title CDATA #IMPLIED
  show (EMBED|REPLACE|NEW) "EMBED"
  actuate (AUTO|USER) "USER"
  behavior CDATA #IMPLIED>
<!-- One location site -->
<!ELEMENT location_site EMPTY>
<!ATTLIST location_site
  xml-link CDATA #FIXED "LOCATOR"
  role CDATA #IMPLIED
  href CDATA #REQUIRED
  title CDATA #IMPLIED
  show (EMBED|REPLACE|NEW) "NEW"
  actuate (AUTO|USER) "USER"
  behavior CDATA #IMPLIED>
<!ELEMENT code (location*)>
<!ATTLIST code
  type (EXTRACTION|DISTANCE) "EXTRACTION"
  language (C|JAVA|PERL) #REQUIRED
  version CDATA #REQUIRED>
<!-- Description of resources' storage location end -->

text_annotation.dtd:

<!-- Text annotation features -->
<!ELEMENT text_annotation (concept*, name_annotation?, people_annotation?, location_annotation?, event_annotation?, date_annotation?, object_annotation?)>
<!-- Name annotation -->
<!ELEMENT name_annotation (concept*)>
<!-- People annotation -->
<!ELEMENT people_annotation (concept*)>
<!-- Location annotation -->
<!ELEMENT location_annotation (concept*)>
<!-- Event annotation -->
<!ELEMENT event_annotation (concept*)>
<!-- Date annotation -->
<!ELEMENT date_annotation (concept*)>
<!-- Object annotation -->
<!ELEMENT object_annotation (concept*)>
<!-- Concept -->
<!ELEMENT concept (#PCDATA | code)*>
<!ATTLIST concept
  language CDATA "english"
  annotation (automatic|manual) "manual">
<!-- Text annotation features end -->

color.dtd:

<!-- Color features -->
<!ELEMENT color (color_hist*, luv_color*)>
<!-- Color histogram feature -->
<!ELEMENT color_hist (color_hist_value, code*)>
<!ATTLIST color_hist
  length CDATA #REQUIRED
  color_space (RGB|OHTA|HSV|LUV) #REQUIRED
  quantization (uniform|non-uniform) #REQUIRED>
<!ELEMENT color_hist_value (#PCDATA)>
<!ATTLIST color_hist_value format CDATA #REQUIRED>
<!-- LUV color feature -->
<!ELEMENT luv_color (luv_color_value, code*)>
<!ATTLIST luv_color length CDATA #REQUIRED>
<!ELEMENT luv_color_value (luv_bin*)>
<!ELEMENT luv_bin EMPTY>
<!ATTLIST luv_bin
  l CDATA #REQUIRED
  u CDATA #REQUIRED
  v CDATA #REQUIRED>
<!-- Color features end -->

texture.dtd:

<!-- Texture features -->
<!ELEMENT texture (tamura?)>
<!-- Tamura texture feature -->
<!ELEMENT tamura (tamura_value, code*)>
<!ELEMENT tamura_value EMPTY>
<!ATTLIST tamura_value
  coarseness CDATA #REQUIRED
  contrast CDATA #REQUIRED
  orientation CDATA #REQUIRED>
<!-- Texture features end -->

shape.dtd:

<!-- Shape features -->
<!ELEMENT shape (eigenvalue_analysis*)>
<!-- Eigenvalue analysis shape feature -->
<!ELEMENT eigenvalue_analysis (eigenvalue_analysis_value, code*)>
<!ATTLIST eigenvalue_analysis length CDATA #REQUIRED>
<!ELEMENT eigenvalue_analysis_value (eigenvalue*)>
<!ELEMENT eigenvalue EMPTY>
<!ATTLIST eigenvalue value CDATA #REQUIRED>
<!-- Shape features end -->

size.dtd:

<!-- Size features -->
<!ELEMENT size (size_dimensions | size_num_pixels)>
<!-- Dimensions (x, y) -->
<!ELEMENT size_dimensions EMPTY>
<!ATTLIST size_dimensions
  x CDATA #REQUIRED
  y CDATA #REQUIRED>
<!-- Number of pixels -->
<!ELEMENT size_num_pixels EMPTY>
<!ATTLIST size_num_pixels area CDATA #REQUIRED>
<!-- Size features end -->

position.dtd:

<!-- Position features -->
<!ELEMENT position (segmentation_mask_analysis*)>
<!-- Segmentation mask position feature -->
<!ELEMENT segmentation_mask_analysis (segmentation_mask_analysis_value, code*)>
<!ELEMENT segmentation_mask_analysis_value (left-top_vertex, centroid)>
<!ELEMENT left-top_vertex EMPTY>
<!ATTLIST left-top_vertex
  x CDATA #REQUIRED
  y CDATA #REQUIRED>
<!ELEMENT centroid EMPTY>
<!ATTLIST centroid
  x CDATA #REQUIRED
  y CDATA #REQUIRED>
<!-- Position features end -->

motion.dtd:

<!-- Motion features -->
<!ELEMENT motion (affine_model*)>
<!-- Affine motion feature -->
<!ELEMENT affine_model (affine_model_value, code*)>
<!ELEMENT affine_model_value (vector2d*)>
<!ELEMENT vector2d EMPTY>
<!ATTLIST vector2d
  x CDATA #REQUIRED
  y CDATA #REQUIRED>
<!-- Motion features end -->

time.dtd:

<!-- Time duration features -->
<!ELEMENT time (time_interval* | time_instant | time_span)>
<!-- Continuous duration time (seconds or frames in video clip) -->
<!ELEMENT time_interval EMPTY>
<!ATTLIST time_interval
  unit (SECONDS|FRAMES) "SECONDS"
  start CDATA #REQUIRED
  end CDATA #REQUIRED>
<!-- Instant in time -->
<!ELEMENT time_instant EMPTY>
<!ATTLIST time_instant
  unit (SECONDS|FRAMES) "SECONDS"
  instant CDATA #REQUIRED>
<!-- Continuous duration time -->
<!ELEMENT time_span EMPTY>
<!ATTLIST time_span
  unit (SECONDS|FRAMES) "SECONDS"
  span CDATA #REQUIRED>
<!-- Time duration features end -->

Family_Portrait.xml:

<?xml version="1.0" standalone="no"?>
<!DOCTYPE image PUBLIC "ISO//mpeg7//xml//dtd//image_ds" "http://www.ee.columbia.edu/mpeg7/xml/dtd/image_ds.dtd">
<image>
  <object_set>
    <!-- Family portrait -->
    <object id="0" type="PHYSICAL" object_node_ref="9">
      <location>
        <location_site href="http://www.family.portrait.gif"/>
      </location>
      <text_annotation>
        <name_annotation>
          <concept>Family Portrait</concept>
        </name_annotation>
        <date_annotation>
          <concept>September 26th, 1998</concept>
        </date_annotation>
      </text_annotation>
      <color>
        <luv_color length="1">
          <luv_color_value>
            <luv_bin l="56.70" u="4.67" v="78.56"/>
          </luv_color_value>
        </luv_color>
      </color>
    </object>
    <!-- Father -->
    <object id="1" type="PHYSICAL" object_node_ref="11">
      <text_annotation>
        <people_annotation>
          <concept>Father</concept>
        </people_annotation>
      </text_annotation>
    </object>
    <!-- Mother -->
    <object id="2" type="PHYSICAL" object_node_ref="13">
      <text_annotation>
        <people_annotation>
          <concept>Mother</concept>
        </people_annotation>
      </text_annotation>
    </object>
    <!-- Son -->
    <object id="3" type="PHYSICAL" object_node_ref="16">
      <text_annotation>
        <people_annotation>
          <concept>Son</concept>
        </people_annotation>
      </text_annotation>
    </object>
    <!-- Parents -->
    <object id="4" type="PHYSICAL" object_node_ref="10">
      <text_annotation>
        <people_annotation>
          <concept>Parents</concept>
        </people_annotation>
      </text_annotation>
    </object>
    <!-- Children -->
    <object id="5" type="PHYSICAL" object_node_ref="15">
      <text_annotation>
        <people_annotation>
          <concept>Children</concept>
        </people_annotation>
      </text_annotation>
    </object>
    <!-- Faces -->
    <object id="6" type="LOGICAL" object_node_ref="17">
      <text_annotation>
        <people_annotation>
          <concept>Faces</concept>
        </people_annotation>
      </text_annotation>
    </object>
    <!-- Father's face -->
    <object id="7" type="PHYSICAL" object_node_ref="12 18">
      <text_annotation>
        <people_annotation>
          <concept>Father's face</concept>
        </people_annotation>
      </text_annotation>
    </object>
    <!-- Mother's face -->
    <object id="8" type="PHYSICAL" object_node_ref="14 19">
      <text_annotation>
        <people_annotation>
          <concept>Mother's face</concept>
        </people_annotation>
      </text_annotation>
    </object>
  </object_set>
  <object_hierarchy type="PHYSICAL">
    <!-- Family portrait -->
    <object_node id="9" object_ref="0">
      <!-- Parents -->
      <object_node id="10" object_ref="4">
        <object_node id="11" object_ref="1">
          <object_node id="12" object_ref="7"/>
        </object_node>
        <object_node id="13" object_ref="2">
          <object_node id="14" object_ref="8"/>
        </object_node>
      </object_node>
      <!-- Children -->
      <object_node id="15" object_ref="5">
        <object_node id="16" object_ref="3"/>
      </object_node>
    </object_node>
  </object_hierarchy>
  <object_hierarchy type="LOGICAL">
    <!-- Faces -->
    <object_node id="17" object_ref="6">
      <object_node id="18" object_ref="7"/>
      <object_node id="19" object_ref="8"/>
    </object_node>
  </object_hierarchy>
</image>

APPENDIX II

<?xml version="1.0" standalone="no"?>
<!DOCTYPE image PUBLIC "ISO//mpeg7//xml//dtd//image_ds" "image_ds.dtd">
<image>
  <object_set>
    <object id="o0" type="PHYSICAL">
      <location>
        <location_site href="http://www.ee.columbia.edu/~syp/images/yosemite.gif"/>
      </location>
      <text_annotation>
        <name_annotation>
          <concept>Yosemite's Nevada Falls</concept>
          <concept language="spanish">Nevada Falls in Yosemite</concept>
        </name_annotation>
        <people_annotation>
          <concept>Seungyup Paek</concept>
          <concept>Alex Jaimes</concept>
        </people_annotation>
        <location_annotation>
          <concept>Yosemite's Nevada Falls</concept>
          <concept annotation="automatic">outdoor</concept>
          <concept annotation="automatic">landscape</concept>
        </location_annotation>
        <event_annotation>
          <concept>Trip to Nevada Falls in Yosemite</concept>
        </event_annotation>
        <date_annotation>
          <concept>September 26th, 1998</concept>
        </date_annotation>
      </text_annotation>
      <color>
        <luv_color length="1">
          <luv_color_value>
            <luv_bin l="56.70" u="4.67" v="78.56"/>
          </luv_color_value>
        </luv_color>
      </color>
      <texture>
        <tamura>
          <tamura_value coarseness="0.70" contrast="0.67" orientation="0.22"/>
        </tamura>
      </texture>
      <size>
        <size_dimensions x="512" y="734"/>
      </size>
    </object>
    <object id="o1" type="PHYSICAL">
      <text_annotation>
        <name_annotation>
          <concept>Seungyup Paek</concept>
        </name_annotation>
        <people_annotation>
          <concept>Seungyup Paek</concept>
        </people_annotation>
      </text_annotation>
      <color>
        <luv_color length="1">
          <luv_color_value>
            <luv_bin l="56.70" u="4.67" v="78.56"/>
          </luv_color_value>
        </luv_color>
      </color>
      <texture>
        <tamura>
          <tamura_value coarseness="0.70" contrast="0.67" orientation="0.22"/>
        </tamura>
      </texture>
      <shape>
        <eigenvalue_analysis length="2">
          <eigenvalue_analysis_value>
            <eigenvalue value="1.22"/>
            <eigenvalue value="0.22"/>
          </eigenvalue_analysis_value>
        </eigenvalue_analysis>
      </shape>
      <size>
        <size_num_pixels area="734"/>
      </size>
      <position>
        <segmentation_mask_analysis>
          <segmentation_mask_analysis_value>
            <left-top_vertex x="23" y="45"/>
            <centroid x="35" y="57"/>
          </segmentation_mask_analysis_value>
        </segmentation_mask_analysis>
      </position>
    </object>
    <object id="o2" type="PHYSICAL">
      <text_annotation>
        <concept>Seungyup's face</concept>
      </text_annotation>
      <!-- Other tags -->
    </object>
    <object id="o3" type="LOGICAL">
      <text_annotation>
        <concept>Faces</concept>
      </text_annotation>
    </object>
  </object_set>
  <object_hierarchy type="PHYSICAL">
    <object_node id="o4" object_ref="o0">
      <object_node id="o5" object_ref="o1">
        <object_node id="o6" object_ref="o2"/>
      </object_node>
    </object_node>
  </object_hierarchy>
  <object_hierarchy type="LOGICAL">
    <object_node id="o7" object_ref="o3">
      <object_node id="o8" object_ref="o2"/>
    </object_node>
  </object_hierarchy>
</image>

APPENDIX III

video_ds.dtd:

<!-- Video DS -->
<!ELEMENT video (event_set, event_hierarchy*)>
<!ELEMENT event_set (event+)>
<!ELEMENT event (location?, transition?, text_annotation?, object_set?, camera_motion?, time?, key_frame*)>
<!ATTLIST event
  id ID #REQUIRED
  event_ref IDREF #IMPLIED
  event_node_ref IDREFS #IMPLIED
  type (SHOT|CONTINUOUS_GROUP_SHOTS|DISCONTINUOUS_GROUP_SHOTS) #REQUIRED>

<!-- External image DS DTD -->
<!ENTITY % image_ds SYSTEM "image_ds.dtd">
%image_ds;

<!-- Examples of transitions:
  dissolve (additive, cross, non-additive),
  slide (band, slash, normal, boxes),
  wipe (radial, random, rectangular, moving, crossed, star-shaped, corner,
        soft, cross-shaped, band, gradient, diamond-shaped, pointed, circular),
  merge (center), peel (center, page, back),
  stretch (cross, image, in, over), spin (cube, rectangular, image, away),
  zoom, curtain, door, funnel, spiral boxes, paint splatter, motion,
  luminance, push, flip, fold up, etc. -->
<!ELEMENT transition EMPTY>
<!ATTLIST transition effect CDATA #REQUIRED>

<!-- External camera motion descriptor DTD -->
<!ENTITY % camera_motion SYSTEM "camera_motion.dtd">
%camera_motion;

<!ELEMENT key_frame (size_dimensions?, time_instant?)>

<!-- Event hierarchy -->
<!-- The attribute "type" is the hierarchy binding type -->
<!ELEMENT event_hierarchy (event_node)>
<!ATTLIST event_hierarchy type (LOGICAL|SPATIAL) #REQUIRED>
<!ELEMENT event_node (event_node*, object_hierarchy*)>
<!ATTLIST event_node
  id ID #REQUIRED
  event_ref IDREF #REQUIRED>
<!-- Video DS end -->

camera_motion.dtd:

<!-- Camera motion features -->
<!ELEMENT camera_motion (background_affine_motion*)>
<!-- Affine model for camera motion detection -->
<!ELEMENT background_affine_motion (background_affine_motion_value, code*)>
<!ELEMENT background_affine_motion_value (panning?, zoom?)>
<!ELEMENT panning EMPTY>
<!ATTLIST panning direction (N|NE|E|SE|S|SW|W|NW) #REQUIRED>
<!ELEMENT zoom EMPTY>
<!ATTLIST zoom direction (IN|OUT) #REQUIRED>
<!-- Camera motion features end -->

APPENDIX IV

<?xml version="1.0" standalone="no"?>
<!DOCTYPE video PUBLIC "ISO//mpeg7//xml//dtd//video_ds" "video_ds.dtd">
<video>
  <event_set>
    <event id="e0" type="CONTINUOUS_GROUP_SHOTS">
      <location>
        <location_site href="yosemite.avi"/>
      </location>
      <text_annotation>
        <name_annotation>
          <concept>Yosemite's Nevada Falls</concept>
          <concept language="spanish">Nevada Falls in Yosemite</concept>
        </name_annotation>
        <people_annotation>
          <concept>Seungyup Paek</concept>
        </people_annotation>
        <location_annotation>
          <concept>Yosemite's Nevada Falls</concept>
          <concept annotation="automatic">outdoor</concept>
          <concept annotation="automatic">landscape</concept>
        </location_annotation>
        <event_annotation>
          <concept>Trip to Nevada Falls</concept>
        </event_annotation>
        <date_annotation>
          <concept>September 26th, 1998</concept>
        </date_annotation>
      </text_annotation>
      <object_set>
        <object id="o0" type="PHYSICAL">
          <text_annotation>
            <name_annotation>
              <concept>Seungyup Paek</concept>
            </name_annotation>
            <people_annotation>
              <concept>Seungyup Paek</concept>
            </people_annotation>
          </text_annotation>
          <color>
            <luv_color length="1">
              <luv_color_value>
                <luv_bin l="56.70" u="4.67" v="78.56"/>
              </luv_color_value>
            </luv_color>
          </color>
          <texture>
            <tamura>
              <tamura_value coarseness="0.70" contrast="0.67" orientation="0.22"/>
            </tamura>
          </texture>
          <shape>
            <eigenvalue_analysis length="2">
              <eigenvalue_analysis_value>
                <eigenvalue value="1.22"/>
                <eigenvalue value="0.22"/>
              </eigenvalue_analysis_value>
            </eigenvalue_analysis>
          </shape>
          <size>
            <size_num_pixels area="734"/>
          </size>
          <position>
            <segmentation_mask_analysis>
              <segmentation_mask_analysis_value>
                <left-top_vertex x="23" y="45"/>
                <centroid x="35" y="57"/>
              </segmentation_mask_analysis_value>
            </segmentation_mask_analysis>
          </position>
          <motion>
            <affine_model>
              <affine_model_value>
                <vector2d x="12.3" y="2.34"/>
                <vector2d x="1.3" y="12.34"/>
                <vector2d x="0.3" y="23.34"/>
              </affine_model_value>
            </affine_model>
          </motion>
          <time>
            <time_interval unit="FRAMES" start="1" end="3"/>
          </time>
        </object>
        <object id="o1" type="PHYSICAL">
          <text_annotation>
            <concept>Seungyup's face</concept>
          </text_annotation>
          <!-- Other tags -->
        </object>
        <object id="o2" type="LOGICAL">
          <text_annotation>
            <concept>Faces</concept>
          </text_annotation>
        </object>
      </object_set>
      <camera_motion>
        <background_affine_motion>
          <background_affine_motion_value>
            <panning direction="SE"/>
            <zoom direction="IN"/>
          </background_affine_motion_value>
        </background_affine_motion>
      </camera_motion>
      <time>
        <time_interval unit="FRAMES" start="1" end="10"/>
      </time>
      <key_frame>
        <size_dimensions x="512" y="734"/>
        <time_instant unit="FRAMES" instant="5"/>
      </key_frame>
    </event>
  </event_set>
  <event_hierarchy type="PHYSICAL">
    <event_node id="e1" event_ref="e0">
      <object_hierarchy type="SPATIAL">
        <object_node id="o3" object_ref="o0">
          <object_node id="o4" object_ref="o1"/>
        </object_node>
      </object_hierarchy>
      <object_hierarchy type="LOGICAL">
        <object_node id="o5" object_ref="o2">
          <object_node id="o6" object_ref="o1"/>
        </object_node>
      </object_hierarchy>
    </event_node>
  </event_hierarchy>
</video>
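As a hedged illustration of how an application 770/870 might consume a stored description record, the following Python sketch parses an XML image description such as the Appendix II example and returns the identifiers of objects whose text annotations match a query term; the file name and the simple substring match are assumptions made for the example, not part of the description scheme.

# Minimal sketch of a search application operating on a stored
# description record: parse the XML and return the ids of objects
# whose <concept> annotations contain a query term.
import xml.etree.ElementTree as ET

def find_objects(description_file, term):
    root = ET.parse(description_file).getroot()   # the <image> element
    matches = []
    for obj in root.find("object_set").findall("object"):
        concepts = [c.text or "" for c in obj.iter("concept")]
        if any(term.lower() in c.lower() for c in concepts):
            matches.append(obj.get("id"))
    return matches

# e.g., find_objects("yosemite.xml", "faces") -> ["o3"]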

Claims (43)

CLAIMS
1. A system for generating a description record from multimedia information, comprising: a) at least one multimedia information input interface that receives the multimedia information; b) a computer processor, coupled to the at least one multimedia information input interface, which receives the multimedia information therefrom, processes the multimedia information by carrying out object extraction processing to generate multimedia object descriptions from the multimedia information, and processes the generated multimedia object descriptions by object hierarchy processing to generate multimedia object hierarchy descriptions, wherein at least one description record is generated that includes the multimedia object descriptions and the multimedia object hierarchy descriptions for content embedded within the multimedia information; and c) a data storage system, operatively coupled to the processor, for storing the at least one description record.
2. The system of claim 1, wherein the multimedia information comprises image information, the multimedia object descriptions comprise image object descriptions, and the multimedia object hierarchy descriptions comprise image object hierarchy descriptions.
3. The system of claim 2, wherein the object extraction processing comprises: a) image segmentation processing to segment each image in the image information into regions within the image; and b) feature extraction processing to generate one or more feature descriptions for one or more of the regions; wherein the generated image object descriptions comprise the one or more feature descriptions for the one or more regions.
4. The system of claim 3, wherein the one or more feature descriptions are selected from a group consisting of text annotations, color, texture, shape, size and position.
5. The system of claim 2, wherein the object hierarchy processing comprises physical object hierarchy organization to generate physical object hierarchy descriptions from the image object descriptions based on spatial characteristics of the objects, such that the image object hierarchy descriptions comprise physical descriptions.
6. The system of claim 5, wherein the object hierarchy processing further comprises logical object hierarchy organization to generate logical object hierarchy descriptions from the image object descriptions based on semantic characteristics of the objects, such that the image object hierarchy descriptions comprise both physical and logical descriptions.
7. The system of claim 6, wherein the object extraction processing comprises: a) image segmentation processing to segment each image in the image information into regions within said image; and b) feature extraction processing to generate object descriptions for one or more of the regions; and wherein the physical hierarchy organization and the logical hierarchy organization generate hierarchy descriptions from the object descriptions for the one or more regions.
8. The system of claim 7, further comprising an encoder that receives the image object hierarchy descriptions and the image object descriptions and encodes them into encoded description information, wherein the data storage system is operative to store the encoded description information as the at least one description record.
9. The system of claim 1, wherein the multimedia information comprises video information, the multimedia object descriptions comprise video object descriptions that include both event descriptions and object descriptions, and the multimedia hierarchy descriptions comprise video object hierarchy descriptions that include both event hierarchy descriptions and object hierarchy descriptions.
10. The system of claim 9, wherein the object extraction processing comprises: a) temporal video segmentation processing to temporally segment the video information into one or more video events or video event groups and to generate event descriptions for the video events; b) video object extraction processing to segment the one or more video events or video event groups into one or more regions, and to generate object descriptions for the regions; and c) feature extraction processing to generate one or more event feature descriptions for the one or more video events or video event groups, and one or more object feature descriptions for the one or more regions; wherein the generated video object descriptions include the event feature descriptions and the object descriptions.
11. The system of claim 10, wherein the one or more event feature descriptions are selected from a group consisting of text annotations, shot transition, camera motion, time and key frames, and wherein the one or more object feature descriptions are selected from a group consisting of color, texture, shape, size, position, motion and time.
12. The system of claim 9, wherein the object hierarchy processing comprises physical event hierarchy organization to generate physical event hierarchy descriptions from the video object descriptions based on temporal characteristics of the video objects, such that the video hierarchy descriptions comprise temporal descriptions.
13. The system of claim 12, wherein the object hierarchy processing further comprises logical event hierarchy organization to generate logical event hierarchy descriptions from the video object descriptions based on semantic characteristics of the video objects, such that the hierarchy descriptions comprise both temporal and logical descriptions.
14. The system of claim 13, wherein the object hierarchy processing further comprises logical and physical object hierarchy extraction processing, which receives the logical and temporal descriptions and generates object hierarchy descriptions for video objects contained within the video information, such that the video hierarchy descriptions comprise temporal and logical object and event descriptions.
15. The system of claim 14, wherein the object extraction processing comprises: a) temporal video segmentation processing to temporally segment the video information into one or more video events or video event groups and to generate event descriptions for the video events; b) video object extraction processing to segment the one or more video events or video event groups into one or more regions, and to generate object descriptions for the regions; and c) feature extraction processing to generate one or more event feature descriptions for the one or more video events or video event groups, and one or more object feature descriptions for the one or more regions; wherein the generated video object descriptions include the event feature descriptions and the object descriptions, wherein the physical event hierarchy organization and the logical event hierarchy organization generate hierarchy descriptions from the event feature descriptions, and wherein the physical object hierarchy organization and the logical object hierarchy organization generate hierarchy descriptions from the object feature descriptions.
16. The system of claim 15, further comprising an encoder that receives the video object hierarchy descriptions and the video object descriptions and encodes them into encoded description information, wherein the data storage system is operative to store the encoded description information as the at least one description record.
17. A method for generating a description record from multimedia information, comprising the steps of: a) receiving multimedia information; b) processing the multimedia information by performing object extraction processing to generate multimedia object descriptions from the multimedia information; c) processing the generated multimedia object descriptions by object hierarchy processing to generate multimedia object hierarchy descriptions, wherein at least one description record is generated that includes the multimedia object descriptions and the multimedia object hierarchy descriptions for content embedded within the multimedia information; and d) storing the at least one description record.
18. The method of claim 17, wherein the multimedia information comprises image information, the multimedia object descriptions comprise image object descriptions, and the multimedia object hierarchy descriptions comprise image object hierarchy descriptions.
19. The method of claim 18, wherein the object extraction processing step comprises the sub-steps of: a) image segmentation processing to segment each image in the image information into regions within the image; and b) feature extraction processing to generate one or more feature descriptions for one or more of the regions; wherein the generated image object descriptions comprise the one or more feature descriptions for the one or more regions.
20. The method of claim 19, wherein the one or more feature descriptions are selected from the group consisting of text annotations, color, texture, shape, size and position.
21. The method of claim 18, wherein the object hierarchy processing step includes the sub-step of physical object hierarchy organization to generate physical object hierarchy descriptions from the image object descriptions based on spatial characteristics of the objects, such that the image object hierarchy descriptions comprise physical descriptions.
22. The method of claim 21, wherein the object hierarchy processing step further includes the sub-step of logical object hierarchy organization to generate logical object hierarchy descriptions from the image object descriptions based on semantic characteristics of the objects, such that the image object hierarchy descriptions comprise both physical and logical descriptions.
23. The method of claim 22, wherein the object extraction processing step further includes the sub-steps of: a) image segmentation processing to segment each image in the image information into regions within the image; and b) feature extraction processing to generate object descriptions for one or more of the regions; and wherein the physical object hierarchy organization sub-step and the logical object hierarchy organization sub-step generate hierarchy descriptions from the object descriptions for the one or more regions.
24. The method of claim 23, further comprising the step of encoding the image object descriptions and the image object hierarchy descriptions into encoded description information prior to the storing step.
25. The method of claim 17, wherein the multimedia information comprises video information, the multimedia object descriptions comprise video object descriptions that include both event descriptions and object descriptions, and the multimedia hierarchy descriptions comprise video object hierarchy descriptions that include both event hierarchy descriptions and object hierarchy descriptions.
26. The method of claim 25, wherein the object extraction processing step comprises the sub-steps of: a) temporal video segmentation processing to temporally segment the video information into one or more video events or video event groups and to generate event descriptions for the video events; b) video object extraction processing to segment the one or more video events or video event groups into one or more regions, and to generate object descriptions for the regions; and c) feature extraction processing to generate one or more event feature descriptions for the one or more video events or video event groups, and one or more object feature descriptions for one or more of the regions; wherein the generated video object descriptions include the event feature descriptions and the object descriptions.
27. The method of claim 26, wherein the one or more event feature descriptions are selected from the group consisting of text annotations, shot transition, camera motion, time and key frames, and wherein the one or more object feature descriptions are selected from the group consisting of color, texture, shape, size, position, motion and time.
28. The method of claim 25, wherein the object hierarchy processing step includes the sub-step of physical event hierarchy organization to generate physical event hierarchy descriptions from the video object descriptions based on temporal characteristics of the video objects, such that the video hierarchy descriptions comprise temporal descriptions.
29. The method of claim 28, wherein the object hierarchy processing step further includes the sub-step of logical event hierarchy organization to generate logical event hierarchy descriptions from the video object descriptions based on semantic characteristics of the video objects, such that the hierarchy descriptions comprise both temporal and logical descriptions.
30. The method of claim 29, wherein the object hierarchy processing step further comprises the sub-step of logical and physical object hierarchy extraction processing, which receives the logical and temporal descriptions and generates object hierarchy descriptions for video objects contained in the video information, such that the video hierarchy descriptions comprise temporal and logical object and event descriptions.
31. The method of claim 30, wherein the object extraction processing step comprises the sub-steps of: a) temporal video segmentation processing to temporally segment the video information into one or more video events or video event groups and to generate event descriptions for the video events; b) video object extraction processing to segment the one or more video events or video event groups into one or more regions, and to generate object descriptions for the regions; and c) feature extraction processing to generate one or more event feature descriptions for the one or more video events or video event groups, and one or more object feature descriptions for one or more of the regions; wherein the generated video object descriptions include the event feature descriptions and the object descriptions, wherein the physical event hierarchy organization and the logical event hierarchy organization generate hierarchy descriptions from the event feature descriptions, and wherein the physical object hierarchy organization and the logical object hierarchy organization generate hierarchy descriptions from the object feature descriptions.
32. The method of claim 31, further comprising the step of encoding the video object descriptions and the video object hierarchy descriptions into encoded description information prior to the storing step.
33. A computer readable medium containing digital information with at least one multimedia description record describing multimedia content for corresponding multimedia information, the description record comprising: a) one or more multimedia object descriptions describing corresponding multimedia objects; b) one or more features characterizing each of the multimedia object descriptions; and c) one or more multimedia object hierarchy descriptions, if any, relating at least a portion of the one or more multimedia objects in accordance with one or more of the features.
34. The computer readable medium of claim 33, wherein the multimedia information comprises image information, the multimedia objects comprise image objects, the multimedia object descriptions comprise image object descriptions, and the multimedia object hierarchy descriptions comprise image object hierarchy descriptions.
35. The computer readable medium of claim 34, wherein the one or more features are selected from the group consisting of text annotations, color, texture, shape, size and position.
36. The computer readable medium of claim 34, wherein the image object hierarchy descriptions comprise physical object hierarchy descriptions of the image object descriptions based on spatial characteristics of the image objects.
37. The computer readable medium of claim 36, wherein the image object hierarchy descriptions further comprise logical object hierarchy descriptions of the image object descriptions based on semantic characteristics of the image objects.
38. The computer readable medium of claim 33, wherein the multimedia information comprises video information, the multimedia objects comprise video events and objects, the multimedia object descriptions comprise video object descriptions that include both event descriptions and object descriptions, the features comprise video event features and video object features, and the multimedia hierarchy descriptions comprise video object hierarchy descriptions that include both event hierarchy descriptions and object hierarchy descriptions.
39. The computer readable medium of claim 38, wherein the one or more event feature descriptions are selected from the group consisting of text annotations, shot transition, camera motion, time and key frames, and wherein the one or more object feature descriptions are selected from the group consisting of color, texture, shape, size, position, motion and time.
40. The computer readable medium of claim 38, wherein the event hierarchy descriptions comprise one or more physical hierarchy descriptions of the events based on temporal characteristics.
41. The computer readable medium of claim 40, wherein the event hierarchy descriptions further comprise one or more logical hierarchy descriptions of the events based on semantic characteristics.
42. The computer readable medium of claim 38, wherein the object hierarchy descriptions comprise one or more physical hierarchy descriptions of the objects based on temporal characteristics.
43. The computer readable medium of claim 42, wherein the object hierarchy descriptions further comprise one or more logical hierarchy descriptions of the objects based on semantic characteristics.
MXPA/A/2001/004561A 1998-11-06 2001-05-04 Systems and methods for interoperable multimediacontent descriptions MXPA01004561A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US60/107,463 1998-11-06

Publications (1)

Publication Number Publication Date
MXPA01004561A true MXPA01004561A (en) 2002-06-05

Family

ID=

Similar Documents

Publication Publication Date Title
EP1125245B1 (en) Image description system and method
US7254285B1 (en) Image description system and method
US7653635B1 (en) Systems and methods for interoperable multimedia content descriptions
US8370869B2 (en) Video description system and method
US7185049B1 (en) Multimedia integration description scheme, method and system for MPEG-7
Furht et al. Handbook of video databases: design and applications
Salembier Overview of the MPEG-7 standard and of future challenges for visual information analysis
Paek et al. Self-describing schemes for interoperable MPEG-7 multimedia content descriptions
Chang et al. Exploring image functionalities in WWW applications development of image/video search and editing engines
Vakali et al. Mpeg-7 based description schemes for multi-level video content classification
MXPA01004561A (en) Systems and methods for interoperable multimediacontent descriptions
Kopf et al. Mobile cinema: Canonical processes for video adaptation
Gkoritsas et al. COSMOS-7: A video content modeling framework for MPEG-7
Mulhem et al. Adaptive video summarization
Worring et al. System design for structured hypermedia generation
Salembier An overview of MPEG-7 Multimedia Description Schemes and of future visual information analysis challenges for content-based indexing
Paek et al. Proposal Id: P480 Proposal for MPEG-7 Image Description Scheme Name
Leung et al. Characteristics and architectural components of visual information systems
Fan et al. Accessing video contents through key objects over IP
Cao et al. Proposal of real world video stream description language (VSDL-RW) and its application
MEQUANINT SIMILARITY-BASED VIDEO RETRIEVAL: MODELING AND PROCESSING
González-Conejero et al. Legal multimedia management and semantic annotation for improved search and retrieval
Day MPEG-7: Applications for managing content
Aghbari et al. Extending MPEG-7 description scheme of moving regions by the semantic visual-spatio-temporal relationships
Zaharia et al. Institut National des Télécommunications Unité de Projets ARTEMIS 9, rue Charles Fourier 91011 Evry Cedex, France