
HK1051941B - Method and apparatus for generating compact transcoding hints metadata - Google Patents


Info

Publication number
HK1051941B
HK1051941B HK03102297.3A HK03102297A HK1051941B HK 1051941 B HK1051941 B HK 1051941B HK 03102297 A HK03102297 A HK 03102297A HK 1051941 B HK1051941 B HK 1051941B
Authority
HK
Hong Kong
Prior art keywords
motion
transcoding hints
transcoding
content processing
extracting
Prior art date
Application number
HK03102297.3A
Other languages
Chinese (zh)
Other versions
HK1051941A1 (en
Inventor
Peter Kuhn
Original Assignee
Sony Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Corporation
Priority claimed from PCT/JP2001/001982 external-priority patent/WO2001069936A2/en
Publication of HK1051941A1 publication Critical patent/HK1051941A1/en
Publication of HK1051941B publication Critical patent/HK1051941B/en


Description

Method and apparatus for generating compact transcoding hints metadata
Technical Field
The present invention relates to an audio/video (or audiovisual, "A/V") signal processing method and an A/V signal processing apparatus for extracting transcoding hints metadata, for transcoding between a compact representation of a multimedia description and a representation of differently (e.g., MPEG) compressed content, for manipulating (e.g., MPEG compressed) bitstream parameters such as frame rate, bit rate, dialog size, quantization parameters, and picture coding type structure (such as group of pictures, or "GOP"), for classifying A/V content, and for retrieving multimedia information.
Background
A/V content is increasingly transported over fiber-optic, wireless, and wired networks. Because these networks have different bandwidth constraints, the A/V content must be represented at different bit rates, resulting in different subjective visual quality. Further requirements on the compressed representation of A/V content arise from the screen size, computational performance, and memory constraints of the A/V terminal.
Therefore, A/V content stored in a compressed format as specified by the Moving Picture Experts Group ("MPEG") must be converted to, for example, different bit rates, frame rates, and screen sizes, and must accommodate different decoding complexities and different A/V terminal storage constraints.
To avoid storing multiple compressed representations of the same A/V content for different network bandwidths and different A/V terminals, A/V content stored in a compressed MPEG format may be transcoded into a different MPEG format.
For video transcoding, reference may be made to the following:
WO09838800A1: O. H. Werner, N. D. Wells, M. J. Knee: Digital compression encoding with improved quantization, 1999 (an adaptive quantization scheme is proposed);
US 5870146: Zhu, Qin-Fan: Apparatus and method for digital video transcoding, 1999;
WO09929113A1: Nilsson, Michael, Erling; Ghanbari, Mohammed: Transcoding, 1999;
US 5805224: Keesman, Gerrit J.; Van Otterloo, Petrus J.: Method and apparatus for transcoding video signals, 1998;
WO09943162A1: Golin, Stuart, Jay: Motion vector extrapolation for transcoding video sequences, 1999;
US 5838664: Polomski, Mark D.: Video teleconferencing system using digital transcoding, 1998;
WO09957673A2: Baliol, Nicolas: Transcoding of data streams, 1999;
US 5808570: Bakhmutsky, Michael: Apparatus and methods for double-matched Huffman transcoding, and high-performance variable-length decoders with two codeword bit-stream segments using the same, 1998;
WO09905870A2: Lemaguet, Yann: Method of switching between video sequences, and corresponding device, 1999; and
WO09923560A1: Ludwig, Lester; Brown, William; Yul, Inn, J.; Vuong, Anh, T.; Vanderlippe, Richard; Burnett, Gerald; Lauwers, Chris; Lui, Richard; Applebaum, Daniel: Scalable networked multimedia system and applications, 1999.
However, in the context of video transcoding, none of these patents disclose or suggest the use of transcoding hints metadata information to facilitate A/V transcoding.
The Society of Motion Picture and Television Engineers ("SMPTE") proposed a recommended practice (327M-2000) for MPEG-2 video re-encoding data sets, which provides re-encoding metadata using 256 bits for each macroblock of the source format. However, such extraction and representation of transcoding hints metadata has several drawbacks. According to the recommended practice, transcoding hints metadata (e.g., GOP structure, quantizer settings, motion vectors, etc.) are extracted for each individual frame and macroblock of the A/V source content. The advantage of this approach is that it provides detailed and content-adaptive transcoding hints, which facilitate transcoding while largely preserving subjective A/V quality. However, the size of the transcoding hints metadata is very large: in one particular implementation of the recommendation, 256 bits of transcoding hints metadata are stored for each macroblock of MPEG video. Such a large amount of transcoding hints metadata does not facilitate broadcast distribution to local (e.g., home) A/V content servers. Thus, the recommended practice for transcoding hints metadata is limited to broadcast studio applications.
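To see why per-macroblock hints are so bulky, a quick back-of-envelope calculation (our own arithmetic, not from the recommended practice) helps: at 256 bits per 16 × 16 macroblock, the hints metadata forms a sizeable bitstream of its own.

```python
# Rough sketch: bit rate of per-macroblock transcoding hints metadata,
# assuming 256 bits per 16x16 macroblock for every frame (illustrative only).
def hints_bitrate_bps(width, height, fps, bits_per_mb=256):
    macroblocks = (width // 16) * (height // 16)  # macroblocks per frame
    return macroblocks * bits_per_mb * fps        # bits per second of hints

# 720x480 video at 30 fps -> about 10.4 Mbit/s of hints metadata alone,
# i.e., more than a typical 5 Mbit/s MP@ML broadcast stream.
```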
Another technique for transcoding hints metadata extraction and representation collects a single set of generic transcoding hints metadata for transcoding compressed A/V source content at a particular bit rate into another compression format and bit rate. This technique has the disadvantage that the specific characteristics of the transcoded content are not taken into account. For example, the A/V characteristics of the source content may change from an A/V segment with a limited amount of motion and little detail (e.g., a news anchor scene) to another A/V segment depicting fast motion and much detail (e.g., a sports event scene). With this technique, misleading transcoding hints metadata may be selected that are not suitable for representing the different characteristics of the two video segments, resulting in poor A/V quality and incorrect bit rate allocation.
Disclosure of Invention
In light of the foregoing, it is an object of the present invention to provide a method and apparatus for extracting a compact and A/V-content-adaptive multimedia description and transcoding hints metadata representation.
It is another object of the present invention to provide a transcoding method and apparatus. One requirement of the transcoding method is to allow real-time execution without significant delay and with limited computational complexity. A second requirement is to preserve the subjective A/V quality as much as possible. To help the transcoding method meet both requirements for different compression target formats, the transcoding hints metadata can be generated in advance and stored separately from, or together with, the compressed A/V content. It is another object of the present invention to provide a highly compact representation, to reduce storage size and to facilitate distribution (e.g., broadcast to local A/V content servers) of the multimedia description and transcoding hints metadata.
It is therefore an object of the present invention to provide a transcoding system that: 1) preserves A/V quality through the transcoding process, and 2) limits computational complexity so that real-time applications can proceed with minimal delay. According to embodiments of the present invention, additional data (metadata) including transcoding hints may be associated with the compressed A/V content.
According to an aspect of the present invention, there is provided a content processing method including the steps of: storing terminal information about a terminal in a first memory; storing content, and content information for processing the content, in a second memory; extracting transcoding hints based on the content information and the terminal information; and converting the content according to the transcoding hints.
According to another aspect of the present invention, there is provided a content processing apparatus including: a first memory for storing terminal information about a terminal; a second memory for storing content, and content information for processing the content; extracting means for extracting transcoding hints based on the content information and the terminal information; and conversion means for converting the content according to the transcoding hints.
Other objects and advantages of the present invention will become apparent from the description of the specification and the accompanying drawings.
The apparatus and method of the present invention provide for the automatic extraction and compact representation of transcoding hints metadata.
The field of the invention is the transcoding of compressed A/V content in one compression format into A/V content in another format, supported by transcoding hints metadata. The term "transcoding" includes, but is not limited to: changing the compression format (e.g., from MPEG-2 format to MPEG-4 format), frame rate conversion, bit rate conversion, dialog size conversion, screen size conversion, picture coding type conversion, and the like.
The invention can also be applied to automatic video classification, using the transcoding hints states described above as classes for the different scene activities of the video.
Accordingly, the invention comprises several steps and one or more of such steps in relation to each of the other steps, as well as apparatus embodying features of construction, combinations of elements and arrangements of parts suitable for carrying out such steps, all as exemplified by the detailed disclosure which follows.
Drawings
For a more complete understanding of the present invention, reference is made to the following description and accompanying drawings, in which:
FIG. 1 depicts a system overview of a transcoding system in a home network having different A/V terminals, according to an embodiment of the present invention;
FIG. 2 illustrates the transcoding hints extraction, storage, and transcoding processes, according to an embodiment of the present invention;
FIG. 3 illustrates an example of selecting a transcoding state based on the number of new feature points per frame, according to an embodiment of the present invention;
FIG. 4 illustrates an example of a transcode hint state diagram with 3 states, according to an embodiment of the present invention;
FIG. 5 illustrates extraction of transcoding hints metadata from compressed and uncompressed source content, according to embodiments of the invention;
FIG. 6 illustrates a video segment and transcoding hints state selection process, according to an embodiment of the invention;
FIG. 7 illustrates a method of determining the boundaries of a new video segment (or new GOP), according to an embodiment of the invention;
FIG. 8 illustrates how an algorithm for transcoding hints state is selected, according to an embodiment of the present invention;
FIG. 9 provides an overview of the structural organization of the transcode hint metadata, according to an embodiment of the present invention;
FIG. 10 depicts the structural organization of a generic transcoding hints metadata description scheme, according to an embodiment of the invention;
FIG. 11 depicts transcoding hints metadata for a source format definition, according to an embodiment of the invention;
FIG. 12 depicts transcoding hints metadata for target format definition, according to an embodiment of the invention;
FIG. 13 depicts a generic transcoding hints metadata representation, according to an embodiment of the invention;
FIG. 14 depicts a segment-based transcoding hints metadata representation, according to an embodiment of the invention;
FIG. 15 depicts encoding complexity transcoding hints metadata, in accordance with an embodiment of the present invention; and
FIG. 16 depicts transcoding hints state metadata, according to an embodiment of the invention.
Detailed Description
FIG. 1 depicts a general overview of a system 100 for transcoding in a home networking environment, according to an embodiment of the invention. As shown in FIG. 1, A/V content server 102 includes an A/V content memory 103, an A/V transcoding unit 106, a transcoding hints metadata extraction unit 104, and an A/V transcoding hints metadata storage buffer 105. The A/V content memory 103 stores compressed A/V content from different sources, with different bit rates and different subjective quality. For example, the A/V content memory 103 may contain home video from a portable digital video ("DV") camera 111, MPEG-4 compressed video at a very low bit rate (say, 10 kbit/s) from an MPEG-4 Internet camera 112, and MPEG-2 Main Profile at Main Level ("MP@ML") compressed broadcast video at about 5 Mbit/s from the broadcast service 101, which in some cases is already associated with transcoding hints metadata. The A/V content server 102 may also contain high-definition compressed MPEG video at a considerably higher bit rate.
As shown in FIG. 1, A/V content server 102 is connected to a network 113, which may be a wired or wireless home network. Several A/V terminals with different characteristics may also be connected to the network 113, including, but not limited to: a wireless MPEG-4 A/V personal digital assistant ("PDA") 107, a high-resolution A/V terminal for high-definition television entertainment 108, an A/V game console 109, and an International Telecommunication Union Telecommunication Standardization Sector ("ITU-T") based videophone 110. A/V terminals 107, 108, 109, and 110 may be connected to the home network 113 with different bit-rate transmission capabilities (depending on the cable or radio link).
In addition, wireless video PDA 107 may be limited in computational power, memory, screen size, video frame rate, and network bit rate. Thus, A/V transcoding unit 106 can transcode, for example, European 5 Mbit/s MPEG-2 broadcast television at 25 frames per second ("fps") and 720 × 480 pixels contained in A/V content server 102 into MPEG-4 video at 500 kbit/s and 15 fps for wireless transmission to, and display on, the 352 × 240 pixel screen of wireless MPEG-4 video PDA 107. The A/V transcoding unit 106 may transcode the compressed source bitstream of the A/V content to the capabilities of each particular target A/V terminal 107, 108, 109, and 110 in real time, using the transcoding hints metadata from buffer 105. The transcoding hints metadata are generated in the transcoding hints metadata extraction unit 104, or they can be distributed by the broadcast service 101.
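The frame-rate part of such a conversion (25 fps to 15 fps in this example) can be sketched as a frame-selection schedule; this is our own illustration, not the actual algorithm of transcoding unit 106:

```python
# Illustrative sketch: pick which source frames to keep when converting the
# frame rate, e.g. 25 fps -> 15 fps, by stepping through the source timeline.
def frames_to_keep(src_fps, dst_fps, n_src_frames):
    step = src_fps / dst_fps   # source frames advanced per output frame
    kept, t = [], 0.0
    while round(t) < n_src_frames:
        kept.append(round(t))  # nearest source frame for this output slot
        t += step
    return kept

# One second of 25 fps source yields 15 output frames.
```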
As shown in FIG. 1, a compressed bitstream in the source format (hereinafter referred to as the "first bitstream") 116 is transmitted from the A/V content memory 103 to the A/V transcoding unit 106. A bitstream in the target format (hereinafter referred to as the "second bitstream") 115 is transmitted to the home network 113 after transcoding by the transcoding unit 106. Content from the home network 113, for example in compressed DV format, is stored in the A/V content memory 103 via link 114.
FIG. 2 illustrates the transcoding hints extraction, transcoding hints storage, and transcoding processes, according to embodiments of the invention. As shown in FIG. 2, buffer 201 contains the A/V content in the source format. Buffer 202 contains a description of the source format, such as bit rate, compression method, GOP structure, screen size, interlaced or progressive format, etc. Buffer 203 contains a description of the target format, such as bit rate, compression method, GOP structure, screen size, interlaced or progressive format, etc. The transcoding hints extraction unit 207 reads the A/V content in the compressed source format from the A/V buffer 201, as well as the source format description from buffer 202 and the transcoding target format description from buffer 203. After the transcoding hints are computed by the transcoding hints extraction unit 207, they are stored in the transcoding hints metadata buffer 206. The A/V transcoding unit 205 reads the first bitstream 204 in the source format from the A/V content buffer 201 and converts the source format into the target format by means of the transcoding hints metadata stored in buffer 206. The A/V transcoding unit 205 outputs the second bitstream 208 in the new compressed target format to the A/V target format buffer 209 for storage.
FIGS. 3 and 4 illustrate the principle of transcoding hints metadata organization, according to an embodiment of the invention. MPEG-based video compression uses a predictive approach, in which the changes between adjacent frames are encoded. Video content with many changes from one frame to the next requires (in order to maintain subjective quality while limiting the bit rate) different re-encoding parameter settings than video content with fewer changes between frames. Therefore, it is important to determine the re-encoding parameters in advance. The selection of transcoding hints metadata depends primarily on the amount and nature of the new visual content, which is not predictable. New visual content cannot be predicted from the previous frame and must therefore be encoded mainly by DCT coefficients, which is costly in bit rate. Thus, the method of the present invention uses the number of new feature points, i.e., those that were not tracked from the previous frame to the current frame, to determine the amount of new content per frame.
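As an illustration of this metric, a minimal sketch (the function name and the use of integer track IDs are our assumptions): a feature point counts as "new" in a frame if its track was not present in the previous frame.

```python
# Illustrative sketch: count new feature points per frame, where a point is
# "new" if its track ID was not tracked in the previous frame.
def new_feature_points_per_frame(tracks_per_frame):
    counts = []
    prev = set()
    for ids in tracks_per_frame:
        cur = set(ids)
        counts.append(len(cur - prev))  # tracks that just appeared
        prev = cur
    return counts
```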
FIG. 3 depicts a graph of the number of new feature points per frame over the frames of the video (the horizontal axis is the time axis). Portion 301 is a part of a video segment in which only a very small amount of new content appears between successive frames, so that corresponding transcoding hints metadata (e.g., large GOP size, low frame rate, low bit rate) can be selected. Portion 302 exhibits a slightly higher number of new feature points per frame, which means that a state is chosen whose transcoding hints metadata provide the best transcoding parameters for this case (e.g., slightly smaller GOP size, higher bit rate). Portion 303 corresponds to a transcoding hints metadata state with a large number of new feature points per frame and therefore a high amount of new content per scene; accordingly, a smaller value of M (the I/P frame distance) and a higher bit rate are selected.
FIG. 4 depicts an example of the basic organization of a transcoding hints metadata state diagram comprising three discrete transcoding hints metadata states. Each discrete transcoding state may contain metadata for the GOP structure, quantizer parameters, bit rate, screen size, and so on. These transcoding hints parameters can have fixed values or can be functions of other parameters. For example, the GOP length may be a discrete function of the number of new feature points per frame, and the quantizer parameters may be a function of the edge and texture activity derived from the DCT coefficients. In this example, each of the three transcoding hints metadata states can be selected to accommodate a different encoding situation. As shown in FIG. 4, state "3" 403 is selected for a high amount of motion and a low amount of new content per frame, and represents the best transcoding hints metadata state for such content. State "2" 402 is selected for a low amount of motion with high edge activity and a high amount of new content, which may require many bits. State "1" 401 is selected, for example, for transcoding A/V content with low scene activity. Further special transcoding hints metadata states can be provided for video editing effects, such as different cross-fade effects, abrupt scene changes, or black frames between two scenes. The position of a video editing effect may be detected manually, semi-automatically, or fully automatically.
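Such discrete states might be represented as a simple lookup table; the parameter names and values below are illustrative assumptions, not values taken from the patent:

```python
# Illustrative only: each discrete transcoding-hints state bundles a set of
# re-encoding parameters (GOP size, I/P distance M, bit rate). All values
# here are assumed for the sake of the example.
TRANSCODING_HINT_STATES = {
    1: {"gop_size": 24, "M": 3, "bitrate_kbps": 500},   # low scene activity
    2: {"gop_size": 12, "M": 3, "bitrate_kbps": 1500},  # low motion, much detail
    3: {"gop_size": 6,  "M": 1, "bitrate_kbps": 3000},  # high motion
}

def hint_parameters(state):
    return TRANSCODING_HINT_STATES[state]
```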
FIG. 5 illustrates the extraction of transcoding hints metadata from compressed and uncompressed source content, according to an embodiment of the invention. As shown in FIG. 5, the system 500 includes an A/V source content buffer 501, a source format description buffer 502, and a target format description buffer 503.
The memory 504 is used to store motion vectors, DCT coefficients, and feature points extracted from the compressed and uncompressed domains. In the compressed domain, the motion vectors of P and B macroblocks can be extracted directly from the bitstream. However, no motion vectors exist for I macroblocks. Thus, the motion vectors obtained for B and P macroblocks can be interpolated for I macroblocks (see Roy Wang, Thomas Huang: "Fast camera motion analysis in the MPEG domain", IEEE International Conference on Image Processing, ICIP 99, Kobe, Japan, October 1999). The DCT coefficients of blocks within an I macroblock can be extracted directly from the bitstream. For P and B macroblocks, a limited number of DCT coefficients (the DC and 2 AC coefficients) can be obtained by the method described by Shih-Fu Chang, David G.: "Manipulation and compositing of MC-DCT compressed video", IEEE Journal on Selected Areas in Communications, vol. 8, 1996. An exemplary method of compressed-domain feature point extraction and motion estimation is disclosed in the PCT patent application "Method and apparatus for compressed-domain feature point registration and motion estimation", Peter Kuhn, December 1999, which is incorporated herein by reference. In some cases, A/V source content may be available only in uncompressed format, or in compressed formats that are not based on the DCT and motion compensation principles used by MPEG-1, MPEG-2, MPEG-4, ITU-T H.261, and ITU-T H.263. For the DV format, only DCT coefficients may be available. In these cases, the motion vectors can be obtained by motion estimation methods; see, for example, Peter Kuhn: "Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation", Kluwer Academic Publishers, 1999.
The DCT coefficients can be obtained by performing a block-based DCT transform; see K. R. Rao, P. Yip: "Discrete Cosine Transform - Algorithms, Advantages, Applications", Academic Press, 1990. The feature points in the pixel domain (uncompressed domain) can be obtained, for example, by the method of Bruce D. Lucas, Takeo Kanade: "An iterative image registration technique with an application to stereo vision", International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.
The motion analysis section 505 extracts the parameters of a parametric motion model from the motion vector representation in memory 504. The parametric motion model may have 6 or 8 parameters, and the parametric motion estimation may be performed by the method described in A. M. Tekalp: "Digital Video Processing", Prentice Hall, 1995. The purpose of using a motion representation is to avoid motion estimation in the transcoder, for reasons of delay and speed. Thus, the motion representation obtained from the source bitstream can be reused to derive the output representation (target bitstream). For example, resizing of the screen size, interlaced-to-progressive conversion, etc. may rely mainly on the motion representation. The parameters of the motion representation may also support coding decisions on the GOP structure. The texture/edge analysis unit 506 may be based on the DCT coefficients extracted from the bitstream; see, for example, K. R. Rao, P. Yip: "Discrete Cosine Transform - Algorithms, Advantages, Applications", Academic Press, 1990, or K. W. Chun, K. W. Lim, H. D. Cho, J. B. Ra: "An adaptive perceptual quantization algorithm for video coding", IEEE Transactions on Consumer Electronics, vol. 39, no. 3, August 1993.
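For reference, a 6-parameter (affine) motion model of the kind mentioned above maps each pixel position (x, y) to a motion-compensated position; a minimal sketch, with our own parameter ordering:

```python
# Sketch of a 6-parameter affine motion model:
#   (x, y) -> (a1 + a2*x + a3*y, a4 + a5*x + a6*y)
# An 8-parameter (perspective) model would add two division terms.
def affine_motion(params, x, y):
    a1, a2, a3, a4, a5, a6 = params
    return (a1 + a2 * x + a3 * y, a4 + a5 * x + a6 * y)

# A pure translation is the special case a2 = a6 = 1, a3 = a5 = 0.
```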
The feature point tracking section 507 for the compressed domain may use the technique described by Peter Kuhn in "Method and apparatus for compressed-domain feature point registration and motion estimation", PCT patent application, December 1999, which is incorporated herein by reference. The processor 508 calculates the number of new feature points per frame. Processor 509 determines the temporal video segments, and processor 510 calculates the transcoding hints state for each segment. The methods for these calculations according to embodiments of the present invention are described in detail below with reference to FIGS. 6, 7, and 8.
Memory 511 contains the motion-related transcoding hints metadata. Memory 512 contains the texture/edge-related transcoding hints metadata, and memory 513 contains the feature-point transcoding hints metadata; these are described in detail below with reference to FIG. 15. Memory 514 contains the video segment transcoding hints selection metadata, described with reference to FIG. 16. The automatic extraction, compact representation, and use of transcoding hints metadata are now described.
FIG. 6 discloses the video segment and transcoding hints state selection process, according to an embodiment of the invention. In step 601, several variables are initialized. The variable "frame" is the current frame number of the source bitstream, and "nframes" is the number of frames in the new video segment (or GOP, i.e., group of pictures); the other variables are used only within this subroutine. In step 602, the number of frames in the GOP, "nframes", is incremented by 1. At step 603, it is determined whether a new segment/GOP begins at this frame; the details are discussed with reference to FIG. 7. If so ("yes"), control proceeds to step 604; otherwise, to step 615. In step 604, the variable "last_gop_start" is initialized with the value of "new_gop_start". In steps 608 and 609, the variable "last_gop_stop" is set to "frame - 1" if the variable "frame" is greater than 1. Otherwise, at step 610, "last_gop_stop" is set to 1. Next, at step 611, which is described in detail in FIG. 8, a transcoding hints state is determined based on the motion parameters 605, texture/edge parameters 606, and feature point data 607. At step 612, the transcoding hints metadata are output to the transcoding hints metadata buffer. According to a preferred embodiment of the present invention, the transcoding hints metadata include "nframes" (the number of frames within the GOP), the transcoding hints state with all its parameters, and the start frame number of the new GOP ("new_gop_start"). After that, the variable "nframes" is set to 0, and the current frame number "frame" is assigned to the variable "new_gop_start". Then, at step 615, a test is made to determine whether all frames of the source bitstream have been processed. If not ("no"), control passes to step 614, where the frame number is incremented by 1, and the process repeats from step 602. Otherwise, the process terminates.
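The segmentation loop of FIG. 6 can be sketched roughly as follows; the boundary test of FIG. 7 is reduced here to a fixed GOP-size window at I/P positions, the variable names follow the text, and the rest is our simplification:

```python
# Rough sketch of the FIG. 6 segmentation loop. A GOP is closed only at an
# I/P position (nframes is a multiple of M) once it has reached gop_max
# frames; the real boundary test of FIG. 7 is more elaborate.
def segment_stream(total_frames, gop_max=30, M=3):
    segments = []       # (start_frame, n_frames) per GOP
    nframes = 0         # frames collected in the current GOP
    new_gop_start = 0   # first frame of the current GOP
    for frame in range(total_frames):
        nframes += 1
        if nframes % M == 0 and nframes >= gop_max:
            segments.append((new_gop_start, nframes))
            new_gop_start = frame + 1
            nframes = 0
    if nframes:         # flush the last, possibly short, GOP
        segments.append((new_gop_start, nframes))
    return segments
```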
FIG. 7 illustrates a method of determining the start and end frames of a new video segment or GOP, in accordance with an embodiment of the invention. At step 701, it is determined whether the variable "nframes" from FIG. 6 is an integer multiple of M (the I/P frame distance). If not ("no"), the process terminates. Otherwise, at step 702, it is determined whether the current frame is the first frame. If it is not ("no"), control passes to step 703, where it is determined whether "nframes" is greater than the minimum number of frames in a GOP, "gop_min". If the result of step 702 is "yes", a new GOP is started at step 705. Likewise, if the result of step 703 is "yes", a new GOP is started at step 705. If the result of step 703 is "no", control passes to step 704, where it is determined whether "nframes" is greater than the maximum number of frames in a GOP, "gop_max". If the result of step 704 is "yes", the GOP is closed at step 706; otherwise, the process terminates.
FIG. 8 illustrates the process of selecting a transcoding hints state for a particular GOP or A/V segment, considering only the number of new feature points per frame, according to an embodiment of the present invention. Similar decision structures can be implemented, following the same basic concept, using the above-mentioned motion parameters from parametric motion estimation and the texture/edge parameters derived from DCT coefficients. It should be noted that the described categories and algorithms may also be used to classify A/V content according to motion, edge activity, new content per frame, etc., yielding a higher-level A/V classification; in that case, the transcoding hints state represents a specific class of content material. Referring now to FIG. 8, in step 801, the variables "frame_no", "last_gop_start", "sum", and "new_seg" are initialized. The variable "frame_no" is assigned the contents of the "last_gop_start" parameter, and the variables "sum" and "new_seg" are initialized to zero. Then, in step 802, the number of new feature points of the current frame ("frame_no") is added to the variable "sum". At step 803, it is determined whether the variable "frame_no" is less than the variable "last_gop_stop". If so ("yes"), step 802 is repeated; otherwise, control passes to step 804. At step 804, it is determined whether the value of the variable "sum" is less than one-eighth of the predetermined parameter "summax". The parameter "summax" is a constant representing the maximum number of feature points that can be tracked on a frame-by-frame basis, multiplied by the number of frames between "last_gop_start" and "last_gop_stop"; the per-frame maximum may have a value of 200 in embodiments of the present invention. If the result at step 804 is "yes", transcoding hints state 1, with the parameters shown in Table 1 of FIG. 8, is selected at step 806.
Otherwise, at step 805, it is determined whether the value of the variable "sum" is less than one-quarter of the predetermined parameter "summax". If so ("yes"), transcoding hints state 2 (as shown in Table 1) is selected at step 807. If not ("no"), transcoding hints state 3 (as shown in Table 1) is selected at step 808, and the process terminates. It should be noted that the decision thresholds of steps 804 and 805 depend on the specification and the number of transcoding hints states.
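The threshold logic of steps 804 through 808 can be sketched as follows, assuming (as in the text) a per-frame tracking maximum of 200 feature points; the function name and input format are our own:

```python
# Sketch of the FIG. 8 decision: the number of new feature points, summed
# over the frames of a segment, is compared against fractions of "summax"
# (maximum trackable points per frame times the segment length in frames).
def select_hint_state(new_points_per_frame, max_points_per_frame=200):
    summax = max_points_per_frame * len(new_points_per_frame)
    total = sum(new_points_per_frame)
    if total < summax / 8:
        return 1    # little new content: large GOP, low bit rate
    if total < summax / 4:
        return 2    # moderate new content
    return 3        # much new content: small M, high bit rate
```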
Transcoding hints metadata description
To describe the metadata, a pseudo-C code notation is used. The abbreviation D stands for descriptor, and DS stands for description scheme, as specified in the MPEG-7 metadata standard.
FIG. 9 depicts the general structural organization of the transcoding hints metadata within the A/V DS 901, according to an embodiment of the invention. As shown in FIG. 9, a segment DS 904 and a media information DS 902 are obtained from the general A/V DS 901. A segment decomposition 906 is obtained from the segment DS 904, and a video segment DS 907 and a moving region DS 908 are obtained from the segment decomposition 906. The segment-based transcoding hints DS 909 is obtained from the video segment DS 907 and is described in detail with reference to FIG. 14. The video segment DS 907 accesses one or more transcoding hints state DSs 911, which are described in detail with reference to FIG. 16. For the moving region, the segment-based transcoding hints DS 910, described in detail with reference to FIG. 14, is obtained from the moving region DS 908, which accesses one or more transcoding hints state DSs 912, also described with reference to FIG. 16. The media profile DS 903 is obtained from the media information DS 902. The general transcoding hints DS 905 is obtained from the media profile DS 903 and is described with reference to FIG. 10.
FIG. 10 depicts the structural organization of the transcoding hints DS 1001, which includes one instance of the source format specification DS 1002, described with reference to FIG. 11, and one or more instances of the target format specification DS 1003, described with reference to FIG. 12. In addition, the transcoding hints DS 1001 includes an optional instance of the general transcoding hints DS 1004, described with reference to FIG. 13, and an optional encoding complexity DS 1005, described with reference to FIG. 15.
FIG. 11 depicts source format specification transcoding hints metadata (e.g., the source format specification DS 1002 of FIG. 10), which can be associated with the entire A/V content or with a particular A/V segment, according to embodiments of the invention. As shown in fig. 11, the related descriptors and description schemes may include:
Bitrate is of type <int> and describes the bit rate per second of the source A/V data stream.
Size_of_pictures is of type <2 int> and describes the picture size of the source A/V format in the x and y directions.
Number_of_frames_per_second is of type <int> and describes the number of frames per second of the source content.
Pel_aspect_ratio is of type <float> and describes the pel (pixel) aspect ratio.
Pel_colour_depth is of type <int> and describes the colour depth.
Use_of_progressive_interlaced_format is of <1 bit> size and describes whether the source format is progressive or interlaced.
Use_of_frame_field_pictures is of <1 bit> size and describes whether frame or field pictures are used.
Compression_method is of type <int> and specifies the compression method used for the source format, which may be selected from a list that includes MPEG-1, MPEG-2, MPEG-4, DV, H.263, H.261, etc. For each compression method, further parameters may be specified here.
GOP_structure is a run-length-encoded data field of the I, P, B states. For example, in the case of only I-frames in MPEG-2 video, direct conversion to the DV format in the compressed domain is possible.
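The source-format descriptors above can be collected into one record. The following C sketch is illustrative only: the field types loosely follow the stated <int>/<float>/<1 bit> sizes, and the example values (a 720x576 MPEG-2 source at 6 Mbit/s) and the textual GOP run-length form are assumptions, not values from the text:

```c
#include <string.h>

/* Illustrative record for the source format specification descriptors. */
typedef enum { CM_MPEG1, CM_MPEG2, CM_MPEG4, CM_DV, CM_H263, CM_H261 } CompressionMethod;

typedef struct {
    int   bitrate;                      /* bits per second                */
    int   size_of_pictures[2];          /* x, y                           */
    int   number_of_frames_per_second;
    float pel_aspect_ratio;
    int   pel_colour_depth;
    unsigned use_of_progressive_interlaced_format : 1;
    unsigned use_of_frame_field_pictures : 1;
    CompressionMethod compression_method;
    const char *gop_structure;          /* run-length coded I/P/B states  */
} SourceFormatSpecification;

/* Example instance: an assumed 625-line MPEG-2 source. */
SourceFormatSpecification make_example_mpeg2_source(void)
{
    SourceFormatSpecification s;
    memset(&s, 0, sizeof s);
    s.bitrate = 6000000;
    s.size_of_pictures[0] = 720;
    s.size_of_pictures[1] = 576;
    s.number_of_frames_per_second = 25;
    s.pel_aspect_ratio = 1.0667f;
    s.pel_colour_depth = 8;
    s.compression_method = CM_MPEG2;
    s.gop_structure = "1I2B1P2B";       /* assumed textual run-length form */
    return s;
}
```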
FIG. 12 depicts target format specification transcoding hints metadata, which can relate to the entire A/V content or to specific A/V segments, according to embodiments of the invention. As shown in fig. 12, the related descriptors and description schemes may include:
Bitrate is of type <int> and describes the bit rate per second of the target A/V data stream.
Size_of_pictures is of type <2 int> and describes the picture size of the target A/V format in the x and y directions.
Number_of_frames_per_second is of type <int> and describes the number of frames per second of the target content.
Pel_aspect_ratio is of type <float> and describes the pel (pixel) aspect ratio.
Pel_colour_depth is of type <int> and describes the colour depth.
Use_of_progressive_interlaced_format is of <1 bit> size and describes whether the target format has to be progressive or interlaced.
Use_of_frame_field_pictures is of <1 bit> size and describes whether frame or field pictures are used.
Compression_method is of type <int> and specifies the compression method of the target format, which may be selected from a list that includes MPEG-1, MPEG-2, MPEG-4, DV, H.263, H.261, etc. For each compression method, further parameters may be specified here.
GOP_structure is an optional run-length-encoded data field of the I, P, B states. With this optional parameter, a fixed GOP structure can be forced. A fixed GOP structure may, for example, force an I-frame at a certain position to facilitate video editing.
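Both GOP_structure fields above are run-length-encoded lists of I/P/B picture coding types. A minimal sketch of such a run-length coder, assuming a simple textual count-plus-type output form (the concrete wire format of the descriptor is not specified in the text):

```c
#include <stdio.h>
#include <string.h>

/* Fold a per-picture coding-type string ("IBBPBB...") into run-length
 * count+type pairs, e.g. "IBBP" -> "1I2B1P".  The textual output form
 * is an assumption for illustration. */
void rle_gop(const char *types, char *out, size_t outsz)
{
    size_t n = strlen(types), i = 0, used = 0;
    out[0] = '\0';
    while (i < n && used < outsz) {
        size_t run = 1;
        while (i + run < n && types[i + run] == types[i])
            run++;                              /* count equal states */
        used += (size_t)snprintf(out + used, outsz - used,
                                 "%zu%c", run, types[i]);
        i += run;
    }
}
```

A source consisting only of I-frames would encode, for example, as "15I", which is the case the text singles out as allowing direct compressed-domain conversion to DV.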
FIG. 13 depicts general transcoding hints metadata (e.g., the general transcoding hints DS 1004 of FIG. 10), which can relate to the entire A/V content or to specific A/V segments, according to embodiments of the invention. As shown in fig. 13, the related descriptors and description schemes may include:
Use_region_of_interest_DS is of <1 bit> length and indicates whether a region-of-interest description scheme is available as a transcoding hint.
In the case of a region-of-interest DS, a shape_D (which may be, for example, a bounding_box_D, an MB_shape_D, or any other shape_D) together with a motion_trajectory_D may be used to describe the region of interest spatially and temporally. An MB_shape_D may use macroblock-sized (16 x 16) blocks for the shape description. The motion_trajectory_D already includes the notion of time, so that the start frame and the end frame of the region_of_interest_DS can be specified. The region_of_interest_DS has the size of the contained shape_D and the corresponding motion_trajectory_D. For transcoding applications, the region_of_interest_DS can be used, for example, to spend more bits on the blocks within the region of interest than on the background (or to modify the quantizer accordingly). Another transcoding application for MPEG-4 is to describe the region of interest as a separate MPEG-4 object and to spend a higher bit rate and a higher frame rate on the region of interest than on the other MPEG-4 objects, such as the background. The extraction of the region_of_interest_DS may be performed automatically or manually.
Use_editing_effects_transcoding_hints_DS is of <1 bit> length and indicates whether transcoding hints information based on editing effects is available.
Camera_flash is a list of items, where each item describes the frame number at which a camera flash occurs. Thus, the length of the descriptor is the number of camera flash events multiplied by <int>. For transcoding applications the camera flash descriptor is very useful, because most video (re-)encoders/transcoders use motion estimation methods based on the luminance difference, cf. Peter Kuhn, "Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation", Kluwer Academic Publishers, 1999. In the case of luminance-based motion estimation, the mean absolute error between two macroblocks of two adjacent frames (one with flash and one without) is too high for prediction, and the frame with the camera flash has to be intra-coded at a high bit-rate cost. Representing camera flashes within a transcoding hints description scheme ("DS") therefore allows frames with camera flash to be predicted from anchor frames at reasonable bit cost, e.g., by using a luminance-corrected motion estimation method or other means.
Cross_fading is a list of items, where each item describes the start frame and the end frame of a cross-fade. Thus, the length of this descriptor is twice <int> the number of cross-fade events. Representing cross-fade events in transcoding hints metadata is very useful for controlling the bit rate/quantizer during the cross-fade. During cross-fades, prediction is generally of only limited use, so that the bit rate required for prediction-error coding increases. Since the scene is often blurred during a cross-fade, the bit-rate increase can be limited by adjusting the quantizer scale, bit rate, or rate-control parameters, respectively.
Black_pictures is a list of items, where each item describes the start frame and the end frame of a sequence of black pictures. Sequences of black pictures often occur between scenes, especially in home video. Experimental results indicate that a series of black pictures increases the bit rate of a motion-compensated DCT coder, because prediction can be used only to a limited extent. Thus, this transcoding hints descriptor can be used to limit the bit rate during black pictures by adjusting the quantizer scale, bit rate, or rate-control parameters, respectively.
Fade_in is similar to cross_fading and is described by a number of items determining the start and end frames of fade-ins. In contrast to a cross-fade, a fade-in starts from a black picture, so an eye-masking effect can be exploited to limit the bit rate during the fade-in by adjusting the quantizer scale, bit rate, or rate-control parameters, respectively.
Fade_out is similar to fade_in, except that a series of black pictures follows the scene.
Abrupt_change is described by a list of single frame numbers of type <int>, indicating abrupt scene or shot changes without fading. These events are represented, for example, by the very high and steep peaks in fig. 3, which indicate the start of a new shot or scene. The abrupt_change editing effect is the opposite of the fading effects. When an abrupt change between two video segments occurs, human vision needs a few milliseconds to adapt to and recognize the details of the new A/V segment. This slow adaptation of the human eye can be exploited for video transcoding, for example by reducing the bit rate or coarsening the quantizer scale parameter for the first frames of a video segment following an abrupt scene or camera change.
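One way such an editing-effect hint might be consumed: coarsening the quantizer for the first few frames after an abrupt_change, exploiting the slow eye adaptation described above. The 5-frame window and the 3/2 coarsening factor below are illustrative assumptions, not values from the text:

```c
/* Sketch: adjust the quantizer scale for frames shortly after an
 * abrupt scene change.  cuts[] holds the abrupt_change frame numbers.
 * The adaptation window (5 frames) and coarsening factor (3/2) are
 * ASSUMPTIONS for illustration. */
int quantizer_for_frame(int frame, const int *cuts, int ncuts, int base_q)
{
    for (int i = 0; i < ncuts; i++) {
        int d = frame - cuts[i];
        if (d >= 0 && d < 5)          /* inside assumed adaptation window */
            return base_q * 3 / 2;    /* coarser quantizer, fewer bits   */
    }
    return base_q;
}
```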
Use_motion_transcoding_hints_DS is of <1 bit> length and indicates the use of motion-related transcoding hints metadata.
Number_of_regions indicates the number of regions for which the following motion-related transcoding hints metadata is valid.
For_every_region, a <1 bit> length field indicates whether the region is rectangular or arbitrarily shaped. In the case of an arbitrarily shaped region, a region descriptor (including, for example, a shape descriptor and a motion trajectory descriptor) is used. In the case of a rectangular region, the size of the rectangle is used. The motion field within this region is described by a parametric motion model, which is determined by several parameters for each frame or sequence of frames. For transcoding, this representation of the real motion of the source video can be used to limit the search area of the computationally complex motion estimation in the (re-)encoding part, for fast and efficient interlaced/progressive (frame/field) conversion, and for determining the GOP (group of pictures) structure from the amount of motion within the video. The motion representation can also be used advantageously for size conversion of the video.
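A sketch of how such a parametric motion hint could steer a re-encoder's motion estimation, assuming the simplest (purely translational) per-frame model; the +/-2 pel refinement range around the predicted displacement is an assumption:

```c
/* Sketch: a per-frame translational global-motion model (dx, dy) used
 * to centre and shrink the motion-estimation search window of a
 * re-encoder.  A richer parametric model (affine, perspective) would
 * predict a per-block displacement instead; the +/-2 pel refinement
 * range is an ASSUMPTION. */
typedef struct { double dx, dy; } GlobalMotion;
typedef struct { int cx, cy, half_range; } SearchWindow;

SearchWindow window_from_hint(GlobalMotion gm)
{
    SearchWindow w;
    /* round the predicted displacement to full pel */
    w.cx = (int)(gm.dx + (gm.dx >= 0 ? 0.5 : -0.5));
    w.cy = (int)(gm.dy + (gm.dy >= 0 ? 0.5 : -0.5));
    w.half_range = 2;   /* small refinement search around the prediction */
    return w;
}
```

Compared with a full-range search, the refinement search touches only (2*2+1)^2 = 25 candidate positions per block, which is where the computational saving comes from.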
FIG. 14 depicts segment-based transcoding hints metadata (e.g., the segment-based transcoding hints DSs 909 and 910 of FIG. 9), which can be used by a (re-)encoder/transcoder for A/V segments that exhibit constant characteristics, according to embodiments of the invention. As shown in fig. 14, the related descriptors and description schemes may include:
Start_frame is of type <int> and describes the frame number at which the transcoding hints metadata of an A/V segment starts.
Nframes is of type <int> and describes the length of the A/V segment.
I_frame_location offers several possibilities for describing the location of I-frames within the A/V segment.
Select_one_out_of_the_following is of <2 bit> size and selects one of the following four I-frame location description methods.
First_frame is of <1 bit> size and is the default I-frame location. This method describes an A/V segment in which only the first frame is an intra frame and is used as an anchor for further prediction; all other frames within the A/V segment are P- or B-frames.
List_of_frames gives a list of the frame numbers of the intra frames within the A/V segment. This method allows the location of intra frames within the A/V segment to be described arbitrarily. For k frames in the list, the descriptor size is <k int>.
First_frame_and_every_k_frames is of type <int>, where the first frame within the segment is intra and k describes the interval of the following I-frames within the A/V segment.
No_I_frame is of <1 bit> size and describes the case where no I-frame is used within the A/V segment, which is useful when the encoding of the A/V segment is based on an anchor (intra frame) of a previous segment.
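The four I-frame location methods above can be summarized as a per-frame intra test; the following C sketch uses an illustrative field layout (frame numbers counted from 0 within the segment):

```c
/* Illustrative encoding of the four I_frame_location methods. */
typedef enum { IFL_FIRST_FRAME, IFL_LIST_OF_FRAMES,
               IFL_FIRST_AND_EVERY_K, IFL_NO_I_FRAME } IFrameMethod;

typedef struct {
    IFrameMethod method;
    const int *list;   /* used by IFL_LIST_OF_FRAMES       */
    int list_len;
    int k;             /* used by IFL_FIRST_AND_EVERY_K    */
} IFrameLocation;

/* Return nonzero if the given segment-relative frame is intra-coded. */
int is_intra(const IFrameLocation *loc, int frame)
{
    switch (loc->method) {
    case IFL_FIRST_FRAME:       return frame == 0;
    case IFL_LIST_OF_FRAMES:
        for (int i = 0; i < loc->list_len; i++)
            if (loc->list[i] == frame) return 1;
        return 0;
    case IFL_FIRST_AND_EVERY_K: return frame % loc->k == 0;
    case IFL_NO_I_FRAME:        return 0;
    }
    return 0;
}
```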
Quantizer_scale is of type <int> and describes the initial quantizer scale value for the A/V segment.
Target_bitrate is of type <int> and describes the target bit rate per second for the A/V segment.
Target_min_bitrate is of type <int> and describes the minimum target bit rate per second for the A/V segment (optional).
Target_max_bitrate is of type <int> and describes the maximum target bit rate per second for the A/V segment (optional).
Use_transcoding_states is of <1 bit> size and describes whether transcoding hints states are used for the A/V segment.
Transcoding_state_nr is of type <int> and gives the transcoding hints metadata state for the segment. The transcoding hints metadata state is a pointer into a table of transcoding hints states. The transcoding hints state table can have several entries, and new entries can be added or removed by the transcoding hints parameters. The transcoding hints metadata of a single transcoding hints state is described below with reference to FIG. 16.
Add_new_transcoding_state is of <1 bit> size and describes whether a new transcoding state with associated information has to be added to the transcoding hints table. In case add_new_transcoding_state signals "yes", a parameter list of the new transcoding hints state is given. The size of the parameter list is determined by the number of parameters of one transcoding hints state and the number of transcoding hints states.
Remove_transcoding_state is a <1 bit>-sized flag indicating whether a transcoding state can be removed. In case a transcoding state can be removed, the state number of the transcoding state to be removed is given (type: <int>).
Use_encoding_complexity_description is of <1 bit> size and signals whether the more detailed encoding complexity description scheme specified in fig. 15 is to be used.
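The transcoding hints state table described above (states referenced by number, with add and remove operations) might look as follows in C; the table size and the subset of per-state fields shown are illustrative, not taken from the text:

```c
#include <string.h>

/* Minimal sketch of a transcoding hints state table.  Each A/V segment
 * carries a transcoding_state_nr that indexes into this table; states
 * can be added and removed.  MAX_STATES and the per-state field subset
 * (cf. FIG. 16) are ASSUMPTIONS. */
#define MAX_STATES 16

typedef struct {
    int   in_use;
    int   m;                       /* I-frame/P-frame distance */
    float bitrate_fraction_for_I;
    float bitrate_fraction_for_P;
} TranscodingHintsState;

typedef struct { TranscodingHintsState s[MAX_STATES]; } StateTable;

/* Add a state; returns its state number, or -1 if the table is full. */
int add_state(StateTable *t, TranscodingHintsState st)
{
    for (int i = 0; i < MAX_STATES; i++)
        if (!t->s[i].in_use) { st.in_use = 1; t->s[i] = st; return i; }
    return -1;
}

void remove_state(StateTable *t, int nr) { t->s[nr].in_use = 0; }
```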
FIG. 15 depicts encoding complexity transcoding hints metadata, which can relate to the entire A/V content or to specific A/V segments, according to embodiments of the invention. The encoding complexity metadata may be used for rate control and for determining the quantizer and bit rate settings.
Use_feature_points is of <1 bit> size and indicates the use of feature-point-based complexity estimation data.
Select_feature_point_method is of <2 bit> size and selects the feature point method.
Number_of_new_feature_points describes a list of the number of new feature points per frame, as depicted in fig. 3, and is of <nframes int> size. This metric represents the amount of new content per frame.
Feature_point_metrics describes a list of metrics based on the number of new feature points per frame within one segment. The metrics are represented as an ordered list of <int> values with the following meaning: mean, maximum, minimum, variance, and standard deviation of the number of new feature points per frame.
Use_equation_description is an <int> pointer to an equation-based description of the coding complexity of each frame.
Use_motion_description is of <1 bit> size and indicates the use of a motion-based complexity description.
Select_motion_method is of <4 bit> size and selects the motion description method.
Param_k_motion is of <nframes * k * int> size and describes the k parameters of a global parametric motion model for every single frame.
Motion_metrics describes a list of metrics for the whole segment based on the size of the motion vectors. The metrics are represented as an ordered list of <int> values with the following meaning: mean, maximum, minimum, variance, and standard deviation (stddev) of the macroblock motion vectors.
Block_motion_field describes every vector of an m x m block-sized motion field and is of <nframes * int * size_x * size_y / (m * m)> size.
Use_texture_edge_metrics is a flag that is set when texture or edge metrics are used, and it is of <1 bit> size.
Select_texture_edge_metrics is of <4 bit> size and determines which of the following texture metrics are used.
DCT_block_energy is the sum of all DCT coefficients of one block and is defined for every block within a frame. It is of <size_y * size_x * nframes int / 64> size.
DCT_block_activity is defined as the sum of all DCT coefficients of one block, but without the DC coefficient. It is defined for every block within a frame and is of <size_y * size_x * nframes int / 64> size.
DCT_energy_metric describes a list of metrics for the whole segment based on each of the single DCT energies per block. The metrics are represented as an ordered list of <int> values with the following meaning: mean, maximum, minimum, variance, and standard deviation of all the single DCT energy metrics. The size of the descriptor is <6 int>. Another implementation of this descriptor describes the DCT energy metrics for every single frame of the video segment.
DCT_activity_metric describes a list of metrics for the whole segment based on each of the single DCT activities per block. The metrics are represented as an ordered list of <int> values with the following meaning: mean, maximum, minimum, variance, and standard deviation of all the single DCT activity metrics. The size of the descriptor is <6 int>. Another implementation of this descriptor describes the DCT activity metrics for every single frame of the video segment.
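The two per-block measures above can be sketched for a single 8x8 DCT block. The text does not say whether the coefficients are squared or taken by magnitude, so a sum of absolute values is assumed here:

```c
/* Sketch of the per-block texture measures: DCT_block_energy sums all
 * 64 coefficients of an 8x8 block; DCT_block_activity excludes the DC
 * term (coefficient [0]).  Summing absolute values rather than squares
 * is an ASSUMPTION. */
int dct_block_energy(const int coeff[64])
{
    int e = 0;
    for (int i = 0; i < 64; i++)
        e += coeff[i] < 0 ? -coeff[i] : coeff[i];
    return e;
}

int dct_block_activity(const int coeff[64])
{
    int dc = coeff[0] < 0 ? -coeff[0] : coeff[0];
    return dct_block_energy(coeff) - dc;   /* energy without DC */
}
```

A block with a large DC value but small AC coefficients (flat texture) thus has high energy but low activity, which matches the intent of using activity as a texture-complexity cue for rate control.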
FIG. 16 depicts transcoding hints state metadata, which can relate to the entire audiovisual content or to specific A/V segments, according to embodiments of the invention. The related descriptors and description schemes may include:
M is of type <int> and describes the I-frame/P-frame distance.
Bitrate_fraction_for_I is of type <float> and describes the fraction of the bit rate specified for the A/V segment that is available for I-frames.
Bitrate_fraction_for_P is of type <float> and describes the fraction of the bit rate specified for the A/V segment that can be used for P-frames. The bit-rate fraction for B-frames is the remaining fraction up to 100%.
Quantizer_scale_ratio_I_P is of type <float> and represents the relation between the quantizer scale (as specified for this segment) of I- and P-frames.
Quantizer_scale_ratio_I_B is of type <float> and represents the relation between the quantizer scale (as specified for this segment) of I- and B-frames. It should be noted that either the bit-rate descriptors (bitrate_fraction_for_I, bitrate_fraction_for_P), the quantizer_scale_ratio descriptors (quantizer_scale_ratio_I_P, quantizer_scale_ratio_I_B), or the following rate-control parameters may be mandatory.
X_I, X_P, X_B are frame_vbv_complexities, each of type <int>, and are specified in the case of a frame-based compression target format (cf. fig. 12). These and the following virtual buffer verifier ("VBV") complexity adjustments are optional and can be used to modify the rate-control scheme according to the source content characteristics and the target format specification.
X_I_top, X_P_top, X_B_top are field_vbv_complexities for the top field, each of type <int>, and are specified in the case of a field-based compression target format (cf. fig. 12).
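As a small worked example of the two bitrate fractions of FIG. 16: they split a segment's bit budget between I- and P-frames, with B-frames receiving the remainder up to 100%, as stated above:

```c
/* Split a segment's bit budget across frame types using
 * bitrate_fraction_for_I and bitrate_fraction_for_P; the B share is
 * the remainder to 100%, per the text.  Dividing each share by the
 * number of frames of that type in the GOP would then give per-frame
 * targets. */
typedef struct { double i_bits, p_bits, b_bits; } BitBudget;

BitBudget split_budget(double segment_bits, double frac_i, double frac_p)
{
    BitBudget b;
    b.i_bits = segment_bits * frac_i;
    b.p_bits = segment_bits * frac_p;
    b.b_bits = segment_bits * (1.0 - frac_i - frac_p);
    return b;
}
```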
The objects of the invention are realized and attained by the structure particularly pointed out in the written description and claims hereof, as well as in the appended drawings. Since various changes may be made in the above methods and structures without departing from the spirit and scope of the invention, it is intended that all matter contained in the above description and shown in the accompanying drawings be interpreted as illustrative and not in a limiting sense.
It is also to be understood that the following claims are intended to cover all of the generic and specific features of the invention herein described, and all statements of the scope of the invention which, as a matter of language, might be said to fall therebetween.

Claims (26)

1. A content processing method for processing content, comprising the steps of:
storing terminal information about a terminal in a first memory;
storing the contents and contents information for processing the contents in a second memory;
extracting transcoding hints according to the content information and the terminal information; and
converting the content according to the transcoding hints.
2. The content processing method of claim 1, wherein the step of extracting the transcoding hints comprises:
receiving a bit stream of compressed image data having a GOP structure;
obtaining motion information from the bitstream;
obtaining texture and edge information of the segment;
obtaining feature points and associated motion information from the bitstream; and
obtaining region-of-interest information from the bitstream.
3. The content processing method of claim 2, wherein the step of extracting the transcoding hints further comprises the step of storing the motion information as transcoding hints.
4. The content processing method of claim 2, wherein the step of extracting transcoding hints further comprises the step of representing motion-related transcoding hints as parameters of a parametric motion model.
5. The content processing method of claim 4, wherein the step of extracting transcoding hints further comprises the step of using a parametric motion model to describe global motion within a frame.
6. The content processing method of claim 4, wherein the step of extracting the transcoding hints further comprises the step of using a parametric motion model to describe motion within an arbitrarily-shaped defined region.
7. The content processing method of claim 2, wherein the step of extracting the transcoding hints further comprises the step of representing motion-related transcoding hints as an array of motion vectors contained within the bitstream.
8. The content processing method of claim 2, wherein the step of extracting the transcoding hints further comprises the step of representing motion-related transcoding hints as an array of motion vectors derived from motion vectors contained in the bitstream.
9. The content processing method of claim 2, wherein the step of extracting the transcoding hints further comprises representing the motion-related transcoding hints as a list of feature points with associated motion vectors, the feature points tracked within a frame.
10. The content processing method of claim 2, wherein the step of extracting the transcoding hints further comprises representing the motion-related transcoding hints as a list of feature points with associated motion vectors, said feature points being tracked within arbitrarily shaped regions within the frame.
11. The content processing method of claim 2, wherein the step of extracting the transcoding hints further comprises representing the transcoding hints associated with the textures and edges as one of a list of DCT coefficients and a measurement value obtained from the list of DCT coefficients, the measurement value being one of a mean, a minimum, a maximum, a variance, a standard deviation.
12. The content processing method of claim 2, wherein the step of extracting transcoding hints further comprises the step of representing the feature points and associated motion-related transcoding hints as a list.
13. The content processing method of claim 2, wherein the step of extracting the transcoding hints further comprises representing transcoding hints related to coding complexity as a complexity measure derived from a list of feature point lifetimes, the feature points being tracked within each frame using the number of vanished and new feature points from one frame to the next.
14. A content processing apparatus for processing content, comprising:
a first memory for storing terminal information about a terminal;
a second memory for storing contents and contents information for processing the contents;
extracting means for extracting transcoding hints according to the content information and the terminal information; and
conversion means for converting the content according to the transcoding hints.
15. The content processing apparatus according to claim 14, wherein the extracting means includes:
a unit receiving a bit stream of compressed image data having a GOP structure;
a unit for obtaining motion information from the bitstream;
a unit that obtains texture and edge information of the segment;
a unit for obtaining feature points and associated motion information from the bitstream; and
a unit for obtaining region-of-interest information from the bitstream.
16. The content processing apparatus according to claim 15, wherein said extracting means stores the motion information as transcoding hints.
17. The content processing apparatus according to claim 15, wherein said extracting means further comprises a unit for representing the motion-related transcoding hints as parameters of a parametric motion model.
18. The content processing apparatus according to claim 17, wherein said extracting means describes global motion within a frame using a parametric motion model.
19. The content processing apparatus according to claim 17, wherein said extracting means describes motion within an arbitrarily shaped defined area using a parametric motion model.
20. The content processing device of claim 15, wherein the extraction device represents motion-related transcoding hints as an array of motion vectors included within the bitstream.
21. The content processing apparatus according to claim 15, wherein said extracting means represents the motion-related transcoding hints as an array of motion vectors derived from the motion vectors included in the bitstream.
22. The content processing apparatus according to claim 15, wherein said extracting means represents the motion-related transcoding hints as a list of feature points with associated motion vectors, said feature points being tracked within a frame.
23. The content processing apparatus according to claim 15, wherein said extracting means represents the motion-related transcoding hints as a list of feature points with associated motion vectors, said feature points being tracked within arbitrarily shaped regions within the frame.
24. The content processing apparatus according to claim 15, wherein said extracting means represents the transcoding hints associated with textures and edges as one of a list of DCT coefficients and a measurement value obtained from the list of DCT coefficients, the measurement value being one of an average value, a minimum value, a maximum value, a variance, and a standard deviation.
25. The content processing apparatus according to claim 15, wherein said extracting means represents the feature points and the associated motion-related transcoding hints as a list.
26. The content processing apparatus according to claim 15, wherein said extracting means represents transcoding hints related to coding complexity as a complexity measure derived from a list of feature point lifetimes, the feature points being tracked within each frame using the number of vanished and new feature points from one frame to the next.
HK03102297.3A 2000-03-13 2001-03-13 Method and apparatus for generating compact transcoding hints metadata HK1051941B (en)

Applications Claiming Priority (5)

Application Number Priority Date Filing Date Title
JP2000068720 2000-03-13
JP68720/00 2000-03-13
US20472900P 2000-05-16 2000-05-16
US60/204,729 2000-05-16
PCT/JP2001/001982 WO2001069936A2 (en) 2000-03-13 2001-03-13 Method and apparatus for generating compact transcoding hints metadata

Publications (2)

Publication Number Publication Date
HK1051941A1 HK1051941A1 (en) 2003-08-22
HK1051941B true HK1051941B (en) 2008-03-14

