WO2020108756A1 - Automatic composition of audio-visual multimedia - Google Patents
- Publication number: WO2020108756A1 (PCT/EP2018/082958)
- Authority: WO (WIPO, PCT)
- Prior art keywords: audio, visual, segment, time, segments
- Prior art date
- Legal status: Ceased (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS > G11—INFORMATION STORAGE > G11B—INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER > G11B27/00—Editing; Indexing; Addressing; Timing or synchronising; Monitoring; Measuring tape travel
- G11B27/00 > G11B27/10—Indexing; Addressing; Timing or synchronising; Measuring tape travel
- G11B27/00 > G11B27/02—Editing, e.g. varying the order of information signals recorded on, or reproduced from, record carriers > G11B27/031—Electronic editing of digitised analogue information signals, e.g. audio or video signals
- G11B27/10 > G11B27/19—Indexing; Addressing; Timing or synchronising; Measuring tape travel by using information detectable on the record carrier > G11B27/28—by using information signals recorded by the same method as the main recording
Definitions
- an all-versus-all correlation is performed between the AVM and the audio track, resulting in a multidimensional set of correlation scores.
- a cosine similarity score may be used as correlation score, but also other values for distance calculation may be used such as cosine distance, angular cosine distance, angular cosine similarity or Euclidean distance.
- in the simplest case the set will have two dimensions, one for the AVSs and one for the points in time of the audio track; if further factors are weighed in, the correlation set will have a corresponding higher dimensionality.
- by the term "plurality of correlation values" should, in the context of the present specification, be understood this multidimensional set.
- This part of the invention describes how the manner in which the correlation score is produced can differ, in order to optimize the mapping to a particular use of the preview produced by the resulting metadata. Additionally, it allows for the set of correlation scores to be simultaneously used for multiple purposes (e.g. to produce multiple sets of mapping metadata for different users, or to present the viewer with a variation in preview at different occasions) by adding dimensions to the correlation score, thus saving computational power.
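As a minimal sketch of the two-dimensional case, assuming one pre-computed emotional metric per candidate AVS and per point in time of the audio track (a simplification of the length-dependent track segments described elsewhere), the whole correlation set can be produced with a single matrix product; all values below are made up for illustration:

```python
import numpy as np

# One emotional metric per candidate AVS and per point in time of the
# audio track; the numbers are invented for the example.
avs_metrics = np.array([[0.8, 0.1], [0.2, 0.9], [0.5, 0.5]])
at_metrics = np.array([[0.7, 0.2], [0.1, 0.8], [0.6, 0.6]])

# All-versus-all cosine similarity -> an (n_avs, n_time) correlation set
a = avs_metrics / np.linalg.norm(avs_metrics, axis=1, keepdims=True)
b = at_metrics / np.linalg.norm(at_metrics, axis=1, keepdims=True)
scores = a @ b.T
print(scores.round(2))   # scores[i, j]: AVS i versus track point j
```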
- the method further comprises (in step f):
- the method further comprises;
- the points in time can have the same frequency as the sample frequency used to determine the first set of segments of the AVSs, but the interval between them can also be either longer or shorter.
- the points in time along the timeline are used to define anchor points to which an association between an AVS, or part of an AVS, and the AT can be made.
- mapping AVSs to suitable parts of the audio track
- an iterative assignment is performed, where the mapping resulting in the highest correlation score is chosen first, and the rest of the segments mapped subsequently in descending order of correlation score.
- the AVSs are mapped in their original length, but if deemed more suitable, the segments may also be shortened to fit between already mapped segments along the timeline of the audio track. No AVS can be mapped more than once. As a consequence, all other correlation values for the same AVS are disabled, as soon as a mapping is determined.
- this ensures that AVSs are not repeated in the preview produced by the resulting metadata and the preview is guaranteed to display a high degree of scene variation.
- the method further comprises wherein assignment of an AVS to a segment of time in the timeline results in that all correlation values associated with a point in time which overlaps with that segment of time in the timeline are disabled from being determined as a highest score. Just as no AVS can be mapped more than once,
- no part of the audio track can be mapped more than once, which may simplify the metadata creation in the end.
- it is defined that it is the first point in time along a segment of the audio track which must be available, in order for mapping of subsequent points in time to be possible.
- the method further comprises wherein a segment of the candidate set of AVSs is determined to be one of:
- an algorithm which weighs in changes in the emotional metric is used. Not all segments between points in time where changes occur will be suitable for mapping to the audio track.
- a set of candidate segments is selected, which may have a total length less than the length of the total audio-visual media.
- the embodiments above specify two ways to segment the AVM: first, using two points in time where significant changes in the audio content occur as start and end. Each AVS is then defined as a section of the AVM between the two consecutive points in time where significant changes have been detected by the second algorithm. Second, using one point in time where a significant change in the audio content occurs, together with a set length for the AVS.
- a predetermined optimal length of a segment, stretches of segments where little change in the emotional metric occurs, etc.
- processing power can be reduced by excluding unsuitable segments at an early stage in the calculation.
- the quality of the preview will be increased, as there might be unsuitable AVSs which may still have a high correlation of the emotional score. For example, if it is known that the viewer intended for the preview is in favour of a typical type of movie, AVSs displaying unfavourable scenes may be excluded.
- the method further comprises upon determining that an AVS comprises a visual break representing a cut in the audio-visual multimedia, removing the AVS from the candidate set of AVSs.
- segment length, or visual cues of change, such as for example a change of visual scenery as the result of a cut in the audio-visual multimedia.
- This embodiment of the invention highlights the impact of the visual appearance of an AVS, which may be used in a variety of ways to improve the quality of the preview. For example, the flow of the story told can be emphasized by selecting AVSs for which the majority of motion vectors in the AVS are in the same direction. In another example, AVSs with motion vectors in opposing directions may be selected in order to display a preview with a conflicting theme, etc.
- visual break should, in the context of the present specification, be understood an occurrence related to the pixel values in the set of frames of the AVM, which can be of objective (freeze frames, noisy image conditions, low contrast frames, frames displaying high or low levels of motion, etc.) or subjective (displaying colour schemes, motion patterns or scenes which have previously been tagged as undesirable by a viewer) nature.
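A crude objective test of this kind might, purely for illustration, look as follows; the greyscale frame representation and both thresholds are assumptions:

```python
import numpy as np

def is_visual_break(frame_a, frame_b, cut_thr=50.0, freeze_thr=0.5):
    # Mean absolute difference between consecutive greyscale frames:
    # very large suggests a hard cut, near zero suggests a freeze frame.
    diff = np.abs(frame_a.astype(float) - frame_b.astype(float)).mean()
    return diff > cut_thr or diff < freeze_thr

a = np.zeros((120, 160))
b = np.full((120, 160), 200.0)
print(is_visual_break(a, b))   # True: looks like a hard cut
```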
- the multidimensional metric is a Valence-Arousal-Dominance, VAD, metric.
- the method further comprises (in step b) using a Hidden-Markov-Model (HMM), for determining the first set of points where significant changes of an emotional state of an audio content of the audio-visual multimedia occurs.
- HMM has been deemed suitable as it uses a probabilistic model, which gives more flexibility than the non-probabilistic models mentioned above, yet is simple, thus reducing model complexity to avoid overfitting as well as limiting computational effort.
- the second algorithm described herein may be embodied by a HMM algorithm as described above.
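A minimal sketch of HMM-based change-point detection, assuming the third-party hmmlearn package; the toy two-dimensional per-second features, the number of hidden states and all parameters are illustrative:

```python
import numpy as np
from hmmlearn import hmm   # third-party package, assumed available

# Toy per-second audio features with one clear regime change at t = 50
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

model = hmm.GaussianHMM(n_components=2, covariance_type="diag",
                        n_iter=50, random_state=0)
model.fit(x)
states = model.predict(x)                 # inferred hidden states
changes = np.nonzero(np.diff(states))[0] + 1
print(changes)                            # ideally close to [50]
```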
- the above object is achieved by a computer implemented method for automatically creating a trailer of an audio-visual multimedia, comprising the steps of:
- an audio content of the second audio-visual multimedia comprises the audio track used in the previously described embodiments, and wherein a visual content of the second audio-visual multimedia comprises a visual content from each AVS of the set of audio-visual segments defined in the metadata.
- One important commercial aspect of the invention is the possibility to automatically create a movie trailer or preview from a longer video sequence. In this embodiment of the invention this process is defined. Additionally, by dividing the computation in two separate steps, the flexibility is increased as analysis and metadata creation can be performed on a device separate from the preview production. It may also reduce the amount of data transferred over a network if the original AVM is analysed at its source, and only the metadata along with the selected AVSs are sent to the preview production device.
- a computer program product comprising a computer-readable medium having computer code instructions stored thereon for carrying out the method of the first aspect when executed by a device having processing capability.
- the above object is achieved by a device for creating metadata which defines a plurality of segments of audio-visual content from an audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual multimedia, and a running length, wherein the metadata further defines an order of the segments, the device comprises:
- a receiving component configured for receiving an audio-visual multimedia and an audio track.
- a processor configured for:
- a candidate set of AVSs of the audio-visual multimedia by analysing an audio content of the AVM to determine a first set of points in time where significant changes of an emotional state of the audio content occur, wherein the candidate set is determined based on the determined first set of points in time, wherein each AVS has a running length and a start time in the audio-visual multimedia and is associated with an emotional metric calculated by analysing an audio content of the AVS using a first algorithm.
- a system comprising a first device according to the third aspect of the invention, and a second device configured for automatically creating a trailer of an audio-visual multimedia, the second device comprising:
- a receiving component configured for receiving, from the first device:
- an audio-visual multimedia, an audio track and a metadata which defines a plurality of segments of audio-visual content from the audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual multimedia, and a running length, wherein the metadata further defines an order of the segments;
- a processor configured for creating the trailer, being a second audio-visual multimedia, wherein:
- an audio content of the second audio-visual multimedia comprises the audio track
- a visual content of the second audio-visual multimedia comprises a visual content from each AVS of the set of audio-visual segments defined in the metadata.
- the second, third, fourth and fifth aspects may generally have the same features and advantages as the first aspect. It is further noted that the invention relates to all possible combinations of features unless explicitly stated otherwise.
- Figure 1 shows a flow diagram of a method for matching segments in an audio-visual multimedia to an audio track by means of the emotional state
- Figure 2-3 shows embodiments of matching segments from an audio visual multimedia to an audio track
- Figure 4 shows the details of an embodiment of the matching process
- Figure 5 shows by way of example details of an embodiment of the correlation process.
- Figure 1 shows a flow diagram comprising the steps of receiving an AVM S101 and an audio track (202 in Figure 2, 302 in Figure 3) in S102.
- the audio-visual multimedia typically consists of a set of visual frames and an audio track (201 in Figure 2, 301 in Figure 3) and can be in the form of a video file (avi, mpeg, matroska, vob, ogg, etc.).
- each visual frame constitutes such a segment.
- a typical frequency is to segment the AVM every second but advantageously this frequency may be chosen differently depending on the desired outcome and the properties of the received AVM (e.g. framerate, resolution, etc.).
- the emotional analysis is in S104a performed by a first algorithm (further described above and below) on the audio content of the AVM and used to determine in S105 points in time where significant changes of the audio content occurs, after which the points in time are used to determine a second set of points in time at which to cut the AVM into segments.
- Data defining the segments are stored in a record of metadata (205 in Figure 2) defining for each segment at least a starting point in time in the AVM and a length.
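Such a record might, purely as an illustration, be represented as follows (field names are assumptions, not taken from the disclosure):

```python
from dataclasses import dataclass, field

@dataclass
class SegmentRecord:
    avm_start: float   # starting point in time in the AVM (seconds)
    length: float      # running length of the segment (seconds)
    metric: list = field(default_factory=list)  # e.g. a VAD vector

records = [SegmentRecord(12.0, 4.0, [0.2, 0.7, 0.4]),
           SegmentRecord(73.5, 5.0, [0.8, 0.3, 0.6])]
print(records[0])
```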
- points in time where the emotional metrics of the audio content of the AVS undergo significant changes are detected by inference using an HMM. These changes could e.g. be in the form of a step function, saddle point, local maximum, local minimum, or a local maximum or minimum of a gradient with respect to time.
- the HMM is the simplest form of a dynamic Bayesian network, describing a transition over time between hidden states, where each hidden state cannot be observed directly, yet produces an observable visible state.
- the hidden states may for instance be the emotional metric
- the visible states are detectable changes in the audio content
- each emotional state can be described as a probability distribution from a statistical model over the possible visible states.
- the problem of defining the emotional state is thus reduced to defining the parameters of the statistical model. For each point in time at which the analysis is performed, the parameters are stored in metadata and referred to as the emotional metric of the system at that particular point in time.
- points in time for cutting the AVM.
- two points in time where significant changes in the audio content occurs are used.
- Each AVS is then defined as a section of the AVM between two consecutive points in time where significant changes have been detected by the second algorithm.
- the emotional state of the audio track of the AVS remains relatively unchanged (compared to when changed at the points in time which are used as the start and end of this part of the audio track) throughout the AVS.
- Such a segment can be used where the same mood is desired in part of the preview, or in the preview as a whole.
- one point in time where a significant change in the audio content occurs is used as the starting point (or middle point, end point, etc.,) for the candidate AVS, together with a set length for the AVS, which is used to determine the end of the candidate AVS.
- Several set lengths may be used, such that a same point in time where a significant change in the audio content occurs will result in more than one candidate AVS, such as 2, 3, 5, etc. candidate AVSs, where e.g. one candidate AVS has a length of 4 seconds, and another AVS has a length of 5 seconds.
- Such a segment can be used to reflect a change in mood in the AVM in the preview, e.g. a transition from descriptive scenery or dialogue to action sequences.
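The two segmentation modes can be sketched as follows; the function names, and the set lengths of 4 and 5 seconds reused from the example above, are illustrative:

```python
def candidates_between(change_points):
    # Mode 1: each AVS spans two consecutive change points, so the
    # mood stays relatively unchanged inside the segment.
    return [(a, b - a) for a, b in zip(change_points, change_points[1:])]

def candidates_fixed(change_points, lengths=(4.0, 5.0)):
    # Mode 2: each change point starts one candidate AVS per set
    # length, so the segment straddles a change in mood.
    return [(p, l) for p in change_points for l in lengths]

pts = [10.0, 17.0, 30.0]
print(candidates_between(pts))   # [(10.0, 7.0), (17.0, 13.0)]
print(candidates_fixed(pts))     # six (start, length) candidates
```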
- the result of S106 will be a set of metadata defining a candidate set of AVSs.
- a final set of segments should be chosen from the AVM such that no overlap exists when mapped to the audio track and no gaps exist along the length of the audio track, i.e. the entire audio track is mapped. This process will be further described below.
- the set of AVSs may be reduced or modified as shown in S107. There may also be several reasons to exclude or repeat the use of a specific segment in the mapping, such as to enhance preview quality or to increase or decrease variation of content.
- AVSs are removed from the candidate set of AVSs if deemed to contain visual content unsuitable for use in the preview.
- the candidate content for the preview can be narrowed to a specific visual content, such as a specific colour scheme or direction of scene motion. From the candidate set of AVSs can also be removed AVSs with a high level of noise, low contrast, displaying sudden changes or no visible content such as cuts between movie scenes in the AVM.
- AVSs are removed from the candidate set of AVSs if deemed to contain other content unsuitable for use in the preview.
- the properties used to evaluate the suitability of the AVSs are subjective, such as user preferences or genre or topic of the AVM.
- the last part of the AVM such as the last 20%, last 10 minutes, etc.
- these parts of the AVM are omitted already at the analysis stage (e.g. step S103a and forward) to reduce the computational complexity of the process.
- the length of the AVSs is typically 2-5 seconds, and a candidate set of AVSs comprises 100-1000 segments.
- the analysis of AVM is in Figure 2 depicted by process 203 and in Figure 3 by process 303.
- Process 203 in Figure 2 and 303 in Figure 3 consists of steps S101-S108 in the flow diagram in Figure 1.
- the calculation of emotional metric S108a may be performed at several different locations in the flow. In one embodiment it will be calculated last in the process 203 in Figure 2 and 303 in Figure 3, as shown by S108a in Figure 1. In another embodiment it may be calculated simultaneously with the emotional analysis in S104a, or anywhere in between.
- the output of the emotional analysis S104a, b is one initial emotional metric for each segment.
- the (second, final) emotional metric calculated by a third algorithm in S108a,b is then an aggregated value such as an average value of the initial emotional metric (or max value etc., as exemplified above), calculated over the whole or part of the longer segments produced in S105.
- the audio track (202 in Figure 2, 302 in Figure 3) is received in S102.
- the audio track is segmented based on a sample frequency in S103b and analysed for emotional content S104b, calculating emotional metric S108b, similar to the analysis of the AVM in S103a, S104a and S108a, respectively.
- this analysis is shown by process 204 and 304, respectively.
- the result is a set of metadata for the audio track 206 describing the emotional metrics of segments of the audio track.
- the emotional analysis of audio content 204 in Figure 2 (304 in Figure 3) is performed simultaneously to the combinatory analysis of the AVM and the audio track in S109, described next.
- a correlation score 210 is calculated in S109, and associated with each record in metadata 205, 206 so that each correlation score holds a reference back to the original AVS and point in time at which the AT begins.
- as the metadata for the AVS 205 also holds the length of the segment, a correlation score for the combination of one AVS with a particular segment of the AT is effectively produced.
- In Figure 5 is shown a schematic view of the correlation scores and their association to the correlated entries.
- Each AVS 601 (from the candidate set of AVS) is matched against each point in time of the AT 602 of equal length and a correlation score 611 calculated.
- the correlation score may be in the form of a scalar, vector, the parameters of a model or other multidimensional representation.
- Figure 5 shows 9 scalar correlation scores which correspond to a similarity score (e.g. cosine similarity) between the emotional metric of the audio track of an AVS and the corresponding point in time of the AT.
- a higher correlation score in this example refers to a higher similarity in emotional state between the audio track of the AVS and the corresponding sequence of the AT.
- the correlation scores are sorted in S111 and mapped in order of descending correlation score in S112.
- the AVS at position 2 will first be mapped to the corresponding AT sequence at position 2, as position (2,2) in the matrix of correlation scores holds the highest value 28. This mapping then disables (disqualifies) each overlapping AT sequence from being mapped, as well as the same AVS from being mapped again. Consequently, positions (1,2), (2,3), (2,1) and (3,2) are excluded from further consideration.
- the AVS at position 1 will be mapped to position 3 along the AT. The process will continue iteratively until the entire length of the AT is mapped.
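The exclusion logic of this example can be reproduced with the sketch below. Only the value 28 at position (2,2) is taken from the description; the remaining matrix entries are invented so that the next surviving maximum lies at (1,3):

```python
import numpy as np

scores = np.array([[20., 25., 27.],
                   [18., 28., 22.],
                   [15., 24., 10.]])  # rows: AVSs 1-3, columns: AT 1-3

mapping, m = [], scores.copy()
while np.isfinite(m).any():
    i, j = np.unravel_index(np.argmax(m), m.shape)
    mapping.append((i + 1, j + 1))    # 1-indexed as in the description
    m[i, :] = -np.inf                 # same AVS may not be mapped again
    m[:, j] = -np.inf                 # this AT position is now occupied

print(mapping)                        # [(2, 2), (1, 3), (3, 1)]
```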
- in one embodiment, a cosine similarity between the emotional metric of the AVS and that of the corresponding point in time of the AT is used as correlation score.
- related values such as cosine distance, angular cosine distance, angular cosine similarity or Euclidean distance can be used.
- the correlation scores are sorted column-wise and the point in time of the AT 602 at position 1 mapped first to the AVS with the highest correlation score for this column only, followed by the mapping of the position in time of the AT 602 at position 2, and so on. In Figure 5, this will result in the mapping (1,1), (2,2). In this embodiment, the mapping is performed to ensure the highest possible correlation score for the beginning of the AT, producing a preview with the highest quality in the beginning.
- the correlation scores are sorted row-wise and the AVS 601 at position 1 mapped first to the point in time of the AT with the highest correlation score for this row only, followed by the mapping of the AVS 601 at position 2, and so on. In Figure 5, this will result in the mapping (3,1), (2,2).
- the order of the mapped AVSs 601 is retained from the original AVM.
- any AVS located with a starting time preceding the AVS currently being mapped is excluded from further evaluation.
- this solution ensures that scenes appear in the original order as in the AVM.
- the correlation values are mapped partly based on the value of the correlation score, but weighted by the order in which the AVSs appear in the original AVM, allowing them to be mapped to a limited extent in a reverse order.
- other factors are chosen to weigh in the mapping, adjusting the mapping according to user preferences, visual content, amplitude of the audio track, etc.
- an AVS 407 is being matched to a sequence 432 of equal length from the AT 402.
- the actual matching is made between the corresponding entry from the metadata for the AVS 405 and the entry from the metadata for the AT 406.
- the mapping is performed along a timeline 422 of equal length to the AT 402.
- the timeline has been divided into points in time equally spaced.
- One point in time 423 is being evaluated for mapping at a time.
- the AVSs 408 have also previously been mapped to the timeline.
- the evaluation for mapping is determined by the chronological order of the correlation scores stored in the correlation values data set, the length of the AVS and the AT and the starting point in time on the timeline also being stored in the same correlation value data entry.
- an AVS can be mapped to any point in time 423 along the timeline 422; as soon as an AVS 408 is mapped, all points in time covered by the length of the AVS along the same timeline are disabled from further mapping (diagonally dashed in Figures 4a, b).
- the flag for disabling points in time along the audio track may be stored as parts of the correlation value metadata set, or as separate data.
- no mapping will take place (Figure 4a).
- the starting point in time must in some embodiments be free for mapping to occur (Figure 4b). If the entire length of the AVS can be mapped to points in time along the timeline 422, the entire AVS 408 is mapped.
- the length of the AVS 407 will be shortened from the end to produce a new AVS 417 to fit the available points in time along the timeline 422. This is shown in Figure 4b, showing that in one embodiment the length of the AVS is shortened to a length between the starting time of the AVS and the starting time of another, already mapped AVS 408, the starting time being a point in time along the timeline 422 of the AT. In this embodiment, the shortening of the AVS is performed after the mapping is complete, i.e. AVSs which are known to be overlapping will first be mapped, then adjusted to length in a later step.
- a similar shortening of the AVS is performed already at the mapping step, by checking the previously mapped AVSs 408 for overlap and shortening the AVS directly, thus saving one computational step.
- the shortening is recorded as a reduction of the value of the output metadata 216, 316 describing the length of the AVS.
- the process will continue in the order of descending correlation score, until all points in time along the timeline have been mapped.
- the association between the AVS and the corresponding point in time along the timeline will be stored in a new set of metadata 216 in Figure 2 (316 in Figure 3), describing a final set of segments from the original AVM, each associated with at least a starting time in the original AVM 201 in Figure 2 (301 in Figure 3), a length and a starting time in the AT 202 in Figure 2 (302 in Figure 3) (point in time along the timeline).
- in a last step S116, optionally, the metadata is then used to combine each AVS described with the corresponding segment of the AT, forming a new AVM 320 in Figure 3 consisting of segments from the original AVM, with the original audio track overlaid or replaced by the new AT 302 in Figure 3.
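Such a final combination step could, for instance, be sketched with the third-party moviepy (1.x) package; the file names and the metadata triples below are hypothetical:

```python
# Assumes the third-party moviepy (1.x) package; file names are made up.
from moviepy.editor import (AudioFileClip, VideoFileClip,
                            concatenate_videoclips)

# (start on AT, start in AVM, running length), all in seconds
metadata = [(0.0, 512.0, 4.0), (4.0, 120.5, 5.0), (9.0, 47.0, 3.5)]

avm = VideoFileClip("original_movie.mp4")
clips = [avm.subclip(s, s + l) for _, s, l in sorted(metadata)]
preview = concatenate_videoclips(clips)
preview = preview.set_audio(AudioFileClip("new_track.mp3"))
preview.write_videofile("preview_with_new_track.mp4", audio_codec="aac")
```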
- the final new AVM (or metadata defining the ordered AVSs) consists of one to a few hundred segments, producing a preview/trailer which is a few minutes long, typically shorter than 10 minutes.
- Figure 3 shows a graphic representation of the complete process of a computer implemented method for automatically creating a trailer of an audio-visual multimedia, comprising the steps of creating metadata 300 which defines a plurality of segments of audio-visual content from an AVM 301 and creating a second audio-visual multimedia 320, wherein an audio content of the second AVM comprises the AT 302, and wherein a visual content of the second AVM comprises a visual content from each AVS of the set of audio-visual segments defined in the metadata 316.
- the new AVM 320 in Figure 3 may be produced by a second device, or separate system from the process 300.
- the partial processes creating candidate AVS metadata 303, calculating AT emotional metrics 304 and performing combinatory analysis and matching 309 may be computed on separate devices, avoiding the transfer of unnecessary data over a network, or to optimize computational flexibility.
- Figure 2 shows a more detailed view of the process of creating the metadata.
- the corresponding steps to processes 209, 215 in Figure 2 have been combined into 309 in Figure 3, and the intermediately produced metadata 205, 206 and correlation data 210 in Figure 2 have been omitted in Figure 3.
- the method is implemented as computer-readable instructions stored on a computer-readable storage medium and executable by a device having processing capability.
- the processor could be a single processor executing the instructions in sequence, several processors executing parts of the instructions in parallel or in sequence, or a single processor executing parts of the instructions in parallel using separate memory allocations (multi-threaded).
- the above described method is implemented in a device as computer readable instructions stored on a computer-readable storage medium and executable by a device having processing capability (processor).
- the device further comprises a receiving component configured for receiving an AVM and an AT (receiving component, receiving means).
- the receiving component could be a computer
- memory component should, in the context of the present specification, be understood a suitable means of storing digital information in retrievable form, including but not limited to physical computer hard drives, flash memory, RAM, ROM, EEPROM, or other memory technology, CD- ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- the above described method is implemented as a system in a first device as computer readable instructions stored on a computer-readable storage medium and executable by a device having processing capability (processor, microprocessor).
- the device further comprises a receiving component configured for receiving an AVM and an AT (receiving component, receiving means).
- the system further comprises a second device, in turn comprising a second receiving component, configured for receiving any or several of: the AVM, the AT, a set of AVSs, a metadata as computed by the first device, wherein the metadata defines a plurality of segments of audio-visual content from the audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual multimedia, and a running length, wherein the metadata further defines an order of the segments.
- the second device further comprises a processor configured for creating a preview (trailer, short AVM), wherein an audio content of the second audio-visual multimedia comprises the audio track, and wherein a visual content of the second audio-visual multimedia comprises a visual content from each audio-visual segment of the set of audio-visual segments defined in the metadata.
- the transfer of data and metadata between the first and the second device could be by wireless or wired transmission, or the first and second device could be parts of the same physical device.
- the amount of data transfers is limited, and the complete process handled by the same device, allowing for easy use and simple user interface.
- the processing power and storage capabilities of the two can be varied so as to optimize memory space and/or processing power.
- the first device can be optimized for data storage, holding a large set of complete AVMs and ATs, and being configured for creating metadata.
- metadata along with candidate AVSs and ATs are transferred to a second device, which could be physically remote from the first device, where the creation of a preview is completed.
- the set of candidate AVSs can then hold more AVSs than finally displayed in the preview, allowing for several previews to be constructed by the second device, thus limiting the processing power and memory space needed by the second device, yet retaining the capability of constructing a variety of previews.
- the systems and methods disclosed hereinabove may be implemented as software, firmware, hardware or a combination thereof.
- the division of tasks between functional units referred to in the above description does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation.
- Certain components or all components may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit.
- Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media).
- Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
- Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
- communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Landscapes
- Engineering & Computer Science
- Multimedia
- Management Or Editing Of Information On Record Carriers
Abstract
Automatic generation of an audio-visual multimedia trailer is provided, by selecting and rearranging segments from a longer audio-visual multimedia, automatically matching them by means of their audio content to a new audio media, replacing or overlapping the original audio track from the segments.
Description
AUTOMATIC COMPOSITION OF AUDIO-VISUAL MULTIMEDIA
Technical field
The present invention relates to the field of automatic generation of an audio-visual multimedia (AVM) by selecting and rearranging segments from a longer AVM and replacing or overlapping the original audio track from the segments with a new audio track.
Background
In many industries, in particular the movie, television, media, advertising and music industry, it is a common task to compose a short section of AVM from selected segments of video and/or audio sequences of varying length. One example is the sectioning of an AVM and combining selected segments into a shorter AVM, e.g. a video preview, trailer, thumbnail sequence or advertisement using selected scenes from a video, replacing or overlaying the original soundtrack with a second soundtrack containing music or speech. In such scenarios, it is essential that the sequence of audio-visual segments (AVS) match the overlaid or replacing soundtrack to create an overall complete and consistent experience.
Traditionally, the production of preview material in the media industry has been manual, through the cutting and pasting of video and audio segments, the contents being decided upon by e.g. the producer of the preview, as a result of the producer's artistic and creative skill and talent. The manual process makes use of visual editing tools and is a time-consuming and costly process.
With a rapidly growing market for audio-visual products, there is an increasing need for short, descriptive and engaging previews in order to compactly present the contents of the product to the viewer in a short amount of time, e.g. while browsing a list of movies. As the amount of commercially available audio-visual content is also growing rapidly, the time for production of such previews is presenting an increasing bottleneck.
Additionally, while previews are centrally produced, there is little option to present the contents in a way which is personalised or targeted to a specific audience or to specific user preferences.
Various resources exist, both commercially and as open source and plugins, which automate the process of creating a video preview, automate part of the process, or attempt to make the editing process more efficient. Examples include the automatic generation of featured images from video or video thumbnails (short sample sequences extracted from a video). However, these tools lack the ability to adjust the video preview to the original video content, to past and following scenes, or to the contents of a newly added soundtrack.
There is thus a need for improvements within this context.
Summary of the invention
In view of the above, it is thus an object of the present invention to overcome or at least mitigate the problems discussed above. In particular, it is an object to provide a method and device for automatic generation of an AVM which is a shorter subset of a larger AVM, based on both its subjective and objective content, matching it to a new audio track of the same length.
According to a first aspect of the invention, there is provided a computer implemented method for creating a metadata which defines a plurality of AVSs from an AVM, each segment indicated in the metadata by a start time in the AVM, and a running length, wherein the metadata further defines an order of the segments, the method comprising the steps of:
a) Receiving an AVM.
b) Determining a candidate set of AVSs of the AVM by analysing an audio content of the AVM to determine a first set of points in time where significant changes of an emotional state of the audio content occur, wherein the candidate set is determined based on the determined first set of points in time, wherein each AVS has a running length and a start time in the AVM, and is associated with an emotional metric calculated by analysing an audio content of the AVS using a first algorithm.
c) Receiving an audio track.
d) Calculating, at each point in time of a second set of points in time in the audio track, a correlation value between the emotional metric of each AVS and a second emotional metric of a segment of the audio track starting at the point in time and having the same running length as the AVS, the second emotional metric of the segment of the audio track calculated using the first algorithm;
e) Sorting the calculated correlation values.
f) Until each point in time in a full length of the audio track is mapped to an AVS and in a descending order of the sorted correlation values, mapping the segment of the audio track and the AVS which resulted in the correlation value to each other, wherein each AVS can be mapped to one segment of the audio track only.
g) Creating the metadata based on the mapping, such that all AVSs mapped to the audio track are indicated in the metadata with their respective start time, and with the order in which they are mapped to the audio track.
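Purely as an editorial illustration, the following Python sketch walks steps a) to g) through on toy data. The stand-in emotional metric (mean and standard deviation of the raw audio samples), the fixed segment length, the assumed change points and all names are assumptions made for this example, not taken from the disclosure:

```python
import numpy as np

rng = np.random.default_rng(0)
avm_audio = rng.standard_normal(600)   # a) toy AVM audio, 1 sample/second
track = rng.standard_normal(60)        # c) toy replacement audio track

def metric(sig, start, length):
    # Stand-in for the first algorithm: mean/std of the audio window.
    w = sig[start:start + length]
    return np.array([w.mean(), w.std()])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# b) candidate AVSs: fixed 5 s segments at assumed "change points"
candidates = [(s, 5, metric(avm_audio, s, 5))
              for s in range(0, len(avm_audio) - 5, 15)]

# d) one correlation value per (AVS, point in time of the track) pair
values = [(cosine(m, metric(track, t, l)), i, t)
          for i, (s, l, m) in enumerate(candidates)
          for t in range(len(track) - l)]

values.sort(reverse=True)              # e) descending correlation score

# f) greedy mapping: each AVS and each part of the track used only once
used, occupied, mapping = set(), np.zeros(len(track), bool), []
for score, i, t in values:
    s, l, _ = candidates[i]
    if i in used or occupied[t:t + l].any():
        continue
    used.add(i)
    occupied[t:t + l] = True
    mapping.append((t, s, l))

# g) metadata: order along the track, start time in the AVM, running length
print(sorted(mapping))
```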
The present invention stems from the realisation that the audio track of an AVM is an important mediator of the atmosphere in the AVM, and constitutes a large part of the subjective feeling of the AVM. The audio track will continuously reflect the evolution of the plot as well as any changes in the mood throughout the AVM. There may be pivot points, characterised by sudden changes in mood or periods of built up tension or relaxation, emphasized by the producers by choice of music or presentation of dialogue and background sound. AVMs from different genres or movie categories will also display differences in audio track depending on the mood or emotional content of the AVM in question, and consequently the audio track will be a suitable means of classifying various genres. In the present invention the audio track of an AVM is exploited to automatically segment and rearrange segments based on an emotional score, reflecting this mood, and matching the segments to a new, shorter audio track, e. g. a short music piece, thereby constructing a preview of the AVM.
By the term "audio-visual multimedia" (AVM) should, in the context of the present specification, be understood a piece of media containing at least a visual and an audio part, such as a video, animation, montage or set of images, stored in digital form.
An algorithm evaluating a metric describing the emotional content is used with both the audio track of the AVM and the new audio track, and each segment of the AVM is then matched to a corresponding segment of the new audio track in such a way that the correlation between the two metrics is maximised. The result is a set of automatically produced metadata. By the term "correlation value" should, in the context of the present specification, be understood a metadata entry containing at least a correlation score and an association to both an AVS, identified by at least a length and a starting point in the AVM, and a point in time in the audio track. The correlation score defines the similarity (proximity, distance, comparability, equality, likeness, likelihood) between the emotional metric of the AVS and the second emotional metric of the corresponding point in time of the AT, the purpose of which is to quantify a similarity in the emotional state between the two media. Methods for calculating the value of comparison can be, but are not limited to: subtraction, minimization of variance, minimization of Euclidean distance or least squares approximation. By sorting of the correlation values should be understood the process of arranging the metadata entries according to the value of their respective correlation score, so that a descending order of sorting results in the correlation values containing the highest correlation score (representing the AVS and the point in time of the audio track (AT) which show the highest similarity in emotional state between their respective audio content).
By way of example, Euclidean distance, cosine similarity, cosine distance, angular cosine distance, angular cosine similarity or other suitable scores can be used to describe the similarity between the two value groups. It is obvious to the skilled person that for some of these values, a higher value designates a higher similarity between the two value groups, and for others the situation will be reversed. By the term "descending order" should, in the context of the present specification, be understood an order which places a score representing a higher similarity (lower distance) between the emotional metrics for the AVS and the point in time of the AT before a score
representing a lower similarity (higher distance). The ordering of correlation score may also be selected by weighing the sorting with other factors, such as order of the original AVSs, musical content of the AT, visual content of the AVS, user preferences, etc.
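As a concrete illustration of the two conventions, the following minimal Python sketch (function name and parameters are assumptions for the example) returns a score for which one descending sort always means decreasing similarity, by negating the Euclidean distance:

```python
import numpy as np

def correlation_score(a, b, kind="cosine"):
    # Higher return value always means higher similarity, so that one
    # descending sort works for both kinds of score.
    a, b = np.asarray(a, float), np.asarray(b, float)
    if kind == "cosine":
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    if kind == "euclidean":
        return -float(np.linalg.norm(a - b))   # negated distance
    raise ValueError(kind)

# VAD-style emotional metrics: (valence, arousal, dominance)
print(correlation_score([0.8, 0.2, 0.5], [0.7, 0.3, 0.5]))
```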
By way of the metadata, a set of AVSs can be combined with the audio track by stacking them after each other at the associated points in time along the audio track. Advantageously, this procedure constitutes an automatic fashion of creating AVSs with a high degree of emotional correlation with the correspondingly mapped sequence of the audio track. By the term "sequence of the audio track" should, in the context of the present specification, be understood a set of sequences from the audio track, designated by the above mentioned "point in time" at which it starts, and a length determined by the AVS to which the "point in time" is currently being mapped.
As will be disclosed below, characteristics other than the emotional metrics of the audio tracks, may also be included in the evaluation in order to achieve the overall best result. The preview (i.e. metadata describing the preview) may have been produced taking the personal preferences of each viewer into account. It may also differ depending on a classification of the AVM e.g. into genre (category, type), such as action, romance or
documentary, depending on the visual content of the AVM, or on manually or automatically attached labels, e.g. artist, composer, volume or genre of musical content in the audio track. Any emotional score, matching score, or other metric in vector or aggregated form will be associated with each segment of the AVM.
By the term "emotional state" should, in the context of the present specification, be understood the subjective mood, atmosphere, spirit or temperament of that particular part of the AVM, as perceived by the viewer.
By the term "emotional metric" should, in the context of the present specification, be understood a way of measurement of the emotional state, which is objective in nature.
By the term "a first algorithm" should, in the context of the present specification, be understood a suitable algorithm for emotional analysis (analysis of the emotional state), the output of which will be an initial
(optionally also termed first or temporary) emotional metric. There are several well established methods for identifying objective metrics for emotional states, such as the valence-arousal-dominance (VAD) score, the pleasure-arousal- dominance (PAD) score, the positive activation - negative activation (PANA) model, the Lovheim cube or the Circumplex model. Other suitable algorithms may be used.
The term "significant changes" must be interpreted in the context of the analysis. By this term should, in the context of the present specification, be understood a period in time where a running average calculated over the emotional metric is higher or lower than that for a previous period in time (synonymous to a transition, change, step, peak, dip, etc.). What is deemed as a "significantly" higher or lower value may be set by e.g. a threshold, which may then be varied according to e.g. the maximum variation among the total number of emotional metrics calculated for the entire AVM and the desired number of segments.
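A minimal sketch of such a running-average test, assuming a one-dimensional emotional metric sampled at regular intervals; the window size and the threshold are the illustrative tuning parameters mentioned above:

```python
import numpy as np

def significant_changes(metrics, window=5, threshold=0.5):
    # Flag approximate positions where the running average of a 1-D
    # emotional metric jumps by more than `threshold`.
    m = np.asarray(metrics, float)
    avg = np.convolve(m, np.ones(window) / window, mode="valid")
    jumps = np.abs(np.diff(avg))
    return list(np.nonzero(jumps > threshold)[0] + window)

print(significant_changes([0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
                          window=3, threshold=0.2))   # -> [5, 6, 7]
```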
According to some embodiments, the method further comprises (in the step g) setting the running length of each AVS defined in the metadata to one of: the running length of the AVS, or a time period between the start time of the AVS and the start time of another, in a chronological order of the audio track, immediately following mapped AVS. In case the AVS is longer than the mapped space to which it is being mapped, the length is adjusted (by adjusting the running length of the AVS in the metadata) to fit by reducing the length at the end. If not, the running length of the AVS will simply be used. This approach ensures two advantages. First, that no registration of overlap of segments exists in the metadata. Second, in case no AVS of suitable length exists to map to a specific sequence of the audio track, for instance between two AVSs which have already been mapped, this approach makes sure that the mapped oversized AVS can be cut down to size, making sure that the entire length of the audio track can be mapped without gaps between the AVSs.
According to some embodiments, the method further comprises (in step f) upon determining that a mapping between an AVS and a segment of the audio track will overlap with another, in a chronological order of the audio
track, immediately following mapping between a further AVS and a further segment of the audio track, shorten the AVS and the segment of the audio track to avoid such overlap. If it is known that a mapped AVS will overlap with another segment, the audio-visual may be shortened already at the mapping stage. Advantageously, the creation of metadata is simplified, and one additional computational step removed.
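The length bookkeeping described in the two embodiments above can be illustrated by a one-line rule; the sketch below assumes times in seconds and hypothetical names:

```python
def stored_length(avs_start_on_track, avs_length, next_mapped_start):
    # Running length written to the metadata: the AVS's own length, or
    # the gap to the next mapped AVS along the track, whichever is
    # shorter (any overlap is trimmed from the end of the AVS).
    return min(avs_length, next_mapped_start - avs_start_on_track)

print(stored_length(10.0, 5.0, 13.5))   # -> 3.5, trimmed to fit the gap
```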
According to some embodiments, wherein the audio-visual multimedia comprises a set of audio-visual frames, the method further comprises (in step b)
- Determining a first set of AVSs of the audio-visual multimedia
based on a first sample frequency, each AVS comprising one or more audio-visual frames.
By determining this first set of segments, the sample frequency at which to calculate the metrics for matching scores is determined.
Advantageously, the computational complexity of the method can be varied by setting a suitable sample frequency based on e.g. available resources etc.
After the first set of AVSs is determined, the method further comprises:
- For each AVS of the first set of AVSs, performing emotion analysis on an audio content of the AVS using the first algorithm to determine an initial emotional metric of the AVS.
- Using a second algorithm and the determined initial emotional metric of the first set of AVSs to determine the first set of points in time in the audio-visual multimedia where significant changes between the initial emotional metric of consecutive segments from the first set of segments occur.
- Determining a second set of AVSs of the audio-visual multimedia based on the first set of points in time in the audio-visual multimedia where significant changes occur, each AVS of the second set of AVSs comprising one or more AVSs of the first set of AVSs, each AVS of the second set of AVSs being associated with an emotional metric calculated by a third algorithm using the emotional metric of said one or more AVSs of the first set of AVSs.
By the term “a second algorithm” should, in the context of the present specification, be understood any suitable method of detecting locations in a data set where values differ significantly from preceding or following values, or of marking locations in the data set where changes are present when compared to preceding or following values. These changes could e.g. take the form of a step function, a saddle point, a local maximum or minimum, or a local maximum or minimum of a gradient with respect to time. Several well-known methods exist for change and point detection, such as differentiation, step or edge detection, probabilistic inference methods (e.g. a Hidden Markov Model) or error minimisation methods, but any suitable method from multidimensional signal analysis can be used.
By the term“initial emotional metric” should, in the context of the invention be understood a suitable quantitative measurement as produced by the first algorithm. It may be, but is not limited to, a momentary value for a VAD or other emotional score in the form of a scalar or vector, an average value or the parameters of a fitted model.
By the term “a third algorithm” should, in the context of the present specification, be understood any suitable method for calculating or aggregating sets of values, including, but not limited to, standard algorithms for statistical computation such as average, standard deviation, kurtosis or skewness, or (if suitable) algorithms for computation on a co-occurrence matrix, such as entropy or homogeneity, or methods for the fitting of mathematical models, producing an emotional metric in the form of a set of model parameters. The examples given above could, if suitable, also be used as part of the calculation performed by the first algorithm, in which case the two calculations can be considered identical.
In either case, after determining a second set of AVSs, the method further comprises:
- Using the second set of AVSs as the candidate set of AVSs of the audio-visual multimedia.
The present embodiment advantageously provides a flexible way (in terms of computational complexity and possible implementations) of
achieving a candidate set of AVSs in which each AVS is associated with an emotional metric.
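Chained together, steps b1)-b5) amount to a short pipeline. The sketch below is an assumed illustration reusing detect_significant_changes from above; emotion_fn stands in for the first algorithm, a scalar metric is assumed, and the mean is chosen as one possible third algorithm.

```python
import numpy as np

def build_candidate_segments(audio, sample_rate, emotion_fn, seg_seconds=1.0):
    """Steps b1-b5 in sequence: fixed-length first set, initial metrics,
    change points, and a merged second set used as the candidate set."""
    hop = int(seg_seconds * sample_rate)
    # b1) first set of AVSs at the chosen sample frequency
    frames = [audio[i:i + hop] for i in range(0, len(audio) - hop + 1, hop)]
    # b2) initial emotional metric per short segment (first algorithm)
    initial = np.array([emotion_fn(f) for f in frames])
    # b3) second algorithm: points in time with significant changes
    cuts = detect_significant_changes(initial)
    # b4) second set: one AVS per stretch between consecutive change points,
    #     its emotional metric aggregated by the third algorithm (here: mean)
    bounds = [0] + list(cuts) + [len(frames)]
    candidates = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b > a:
            candidates.append({"start": a * seg_seconds,       # start in AVM
                               "length": (b - a) * seg_seconds, # running length
                               "metric": initial[a:b].mean()})
    # b5) the second set serves as the candidate set
    return candidates
```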
According to some embodiments, the method further comprises (in step d) determining a plurality of correlation values, wherein each correlation value of the plurality of correlation values is associated with an AVS of the candidate set and a point in time in the audio track, the plurality of correlation values determined by, for each AVS of the candidate set and for each point in time of the second set of points in time of the audio track:
- Calculating a second emotional metric of a segment of the audio track by performing emotion analysis on the segment using the first algorithm, the segment of the audio track starting at the point in time and having a length equal to the length of the AVS,
- Calculating a correlation value between the emotional metric
associated with the AVS and the second emotional metric of the segment of the audio track,
- Associating the correlation value with the AVS and with the point in time of the audio track, and adding the correlation value to the plurality of correlation values.
In order to evaluate the matching of various AVSs against segments of the audio track, an all-versus-all correlation is performed between the AVM and the audio track, resulting in a multidimensional set of correlation scores.
A cosine similarity score may be used as the correlation score, but other distance measures may also be used, such as cosine distance, angular cosine distance, angular cosine similarity or Euclidean distance.
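A sketch of this all-versus-all computation follows; cosine similarity is used as one of the options above, and track_metric_at is a hypothetical stand-in for running the first algorithm on the audio-track segment in question.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two emotional metric vectors (e.g. VAD)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def correlation_matrix(candidates, track_metric_at, time_points):
    """All-versus-all correlation (steps d1.1-d1.3): one row per candidate
    AVS, one column per point in time of the second set of points in time
    of the audio track."""
    scores = np.empty((len(candidates), len(time_points)))
    for i, avs in enumerate(candidates):
        for j, t in enumerate(time_points):
            # d1.1) second emotional metric of the audio-track segment
            # starting at t with the same running length as the AVS
            second = track_metric_at(t, avs["length"])
            # d1.2-d1.3) correlate and record, keyed by (AVS, point in time)
            scores[i, j] = cosine_similarity(avs["metric"], second)
    return scores
```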
In case the correlation score is scalar, the set will have two
dimensions, with one dimension for the set of AVSs and one for the set of sequences of the audio track. In case the correlation score is
multidimensional, the correlation set will have a correspondingly higher dimensionality. By the term “plurality of correlation values” should, in the context of the present specification, be understood this multidimensional set. This part of the invention describes how the manner in which the correlation score is produced can differ in order to optimise the mapping for a particular use of the preview produced by the resulting metadata. Additionally, it allows for the set of correlation scores to be used simultaneously for multiple purposes (e.g. to produce multiple sets of mapping metadata for different users, or to present the viewer with a varied preview on different occasions) by adding dimensions to the correlation score, thus saving computational power.
According to some embodiments, the method further comprises (in step f):
Defining a timeline with a length equal to a length of the audio track, and repetitively determining a highest correlation value among the plurality of correlation values until all points in time in the timeline are assigned an AVS from the candidate set. When determining a highest correlation value, the method further comprises:
- Assigning the AVS associated with the determined highest
correlation to a segment of time in the timeline, the segment of time starting at the point in time associated with the determined highest correlation and having a same length as the running length of the AVS.
- Upon determining that assigning the AVS to the segment of time results in an overlap with a, in a chronological order of the timeline, immediately following assignment of a further AVS to a further segment of the timeline, shortening the AVS and the segment of the timeline to avoid such overlap.
- Wherein assignment of an AVS to a segment of time in the timeline results in that all correlation values in the plurality of correlation values associated with that AVS are disabled from being determined as a highest score.
By the term “timeline” should, in the context of the present specification, be understood a data structure consisting of equally spaced points in time, with a length equal to that of the audio track. The points in time can have a frequency equal to the sample frequency used to determine the first set of segments of the AVSs, but the frequency can also be higher or lower. The points in time along the timeline are used to define anchor points to which an association between an AVS, or part of an AVS, and the AT can be made.
When mapping AVSs to suitable parts of the audio track, an iterative assignment is performed, where the mapping resulting in the highest correlation score is chosen first, and the rest of the segments mapped subsequently in descending order of correlation score. Preferably, the AVSs are mapped in their original length, but if deemed more suitable, the segments may also be shortened to fit between already mapped segments along the timeline of the audio track. No AVS can be mapped more than once. As a consequence, all other correlation values for the same AVS are disabled, as soon as a mapping is determined. Advantageously, this ensures that AVSs are not repeated in the preview produced by the resulting metadata and the preview is guaranteed to display a high degree of scene variation.
According to some embodiments, the method further comprises wherein assignment of an AVS to a segment of time in the timeline results in that all correlation values associated with a point in time which overlaps with that segment of time in the timeline are disabled from being determined as a highest score. Just as no AVS can be mapped more than once, advantageously in this embodiment no part of the audio track can be mapped more than once, which may simplify the creation of the metadata in the end. In this embodiment, it is defined that it is the first point in time of a segment of the audio track which must be available in order for mapping of subsequent points in time to be possible.
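One compact sketch of this iterative assignment, with both disabling rules, might read as follows; the dictionary field names and overall structure are assumptions for illustration, not the claimed implementation.

```python
import numpy as np

def greedy_map(scores, candidates, time_points):
    """Iteratively take the highest remaining correlation value, map that
    AVS to the timeline at that point in time, then disable (i) every score
    for the same AVS, so no AVS is mapped more than once, and (ii) every
    score whose starting point in time falls inside the newly occupied
    stretch, so no part of the audio track is mapped twice."""
    scores = np.array(scores, dtype=float)
    times = np.asarray(time_points, dtype=float)
    mapping = []                                   # (AVS index, start time)
    while np.isfinite(scores).any():
        i, j = np.unravel_index(np.argmax(scores), scores.shape)
        start = times[j]
        end = start + candidates[i]["length"]
        mapping.append((i, start))
        scores[i, :] = -np.inf                     # rule (i): AVS used up
        scores[:, (times >= start) & (times < end)] = -np.inf  # rule (ii)
    return sorted(mapping, key=lambda m: m[1])     # chronological AT order
```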
According to some embodiments, the method further comprises wherein a segment of the candidate set of AVSs is determined to be one of:
- A segment of the audio-visual multimedia between two points in time of the first set of points in time where significant changes of an emotional state of an audio content of the audio-visual multimedia occurs; or
- A segment of the audio-visual multimedia around a point in time of the first set of points in time where significant changes of an emotional state of the audio content occurs, and having a predetermined running length.
In order to cut the AVM into candidate segments of suitable length, an algorithm which weighs in changes in the emotional metric is used. Not all segments between points in time where changes occur will be suitable for mapping to the audio track. A set of candidate segments is selected, which may have a total length less than the length of the total audio-visual multimedia.
By way of example, the embodiments above specify two ways to segment the AVM. First, using two points in time where significant changes in the audio content occur as start and end; each AVS is then defined as the section of the AVM between the two consecutive points in time where significant changes have been detected by the second algorithm. Second, using one point in time where a significant change in the audio content occurs together with a set length for the AVS, as in the sketch below.
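Both modes can be expressed in a few lines; the 4 s and 5 s set lengths echo the example given further down and are otherwise arbitrary assumptions.

```python
def candidates_from_change_points(cuts, avm_length, set_lengths=(4.0, 5.0)):
    """Both segmentation modes: AVSs spanning the interval between two
    consecutive significant-change points, and AVSs of predetermined
    running length seeded at a single change point (one candidate per
    set length)."""
    between = [{"start": a, "length": b - a}
               for a, b in zip(cuts[:-1], cuts[1:])]
    around = [{"start": t, "length": n}
              for t in cuts for n in set_lengths
              if t + n <= avm_length]
    return between + around
```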
Several other parameters may be chosen to be weighed in when determining an optimal choice of candidate segment length and the corresponding emotional metric of each segment. Examples are a predetermined optimal length of a segment, stretches of segments where little change in the emotional metric occurs, etc. In this fashion, processing power can be reduced by excluding unsuitable segments at an early stage of the calculation. Additionally, the quality of the preview will be increased, as there may be unsuitable AVSs which nevertheless have a high correlation in the emotional score. For example, if it is known that the viewer intended for the preview favours a particular type of movie, AVSs displaying unfavourable scenes may be excluded.
According to some embodiments, the method further comprises, upon determining that an AVS comprises a visual break representing a cut in the audio-visual multimedia, removing the AVS from the candidate set of AVSs. As described above, further aspects may be weighed in when determining segment length, for example visual cues of change, such as a change of visual scenery as the result of a cut in the audio-visual multimedia. Segments which have a high correlation in emotional metrics to the audio track may still be undesirable for the formation of a preview as a result of their visual appearance. As above, processing power is thus reduced by excluding unsuitable segments at an early stage of the calculation. Additionally, the quality of the preview when constructed from the metadata will be improved. This embodiment of the invention highlights the impact of the visual appearance of an AVS, which may be used in a variety of ways to improve the quality of the preview. For example, the flow of the story told can be emphasised by selecting AVSs for which the majority of motion vectors are in the same direction. In another example, AVSs with motion vectors in opposing directions may be selected in order to display a preview with a conflicting theme, etc.
By the term“visual break” should, in the context of the present specification, be understood an occurrence related to the pixel values in the set of frames of the AVM, which can be of objective (freeze frames, noisy image conditions, low contrast frames, frames displaying high or low levels of motion, etc.) or subjective (displaying colour schemes, motion patterns or scenes which have previously been tagged as undesirable by a viewer) nature.
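By way of a hedged illustration, the objective cases can be screened with simple frame statistics; the thresholds below are placeholders that would need tuning, and greyscale frames as 2-D numpy arrays are assumed.

```python
import numpy as np

def has_visual_break(frames, cut_threshold=40.0, freeze_threshold=0.5):
    """Objective visual-break test on an AVS: a hard cut shows up as a
    large mean absolute pixel difference between consecutive greyscale
    frames, a freeze frame as a near-zero difference."""
    for prev, curr in zip(frames[:-1], frames[1:]):
        diff = np.abs(curr.astype(float) - prev.astype(float)).mean()
        if diff > cut_threshold or diff < freeze_threshold:
            return True
    return False
```

Candidate pruning would then keep only AVSs for which this test is negative; the subjective cases (tagged colour schemes, motion patterns, etc.) would require viewer-specific data instead.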
According to some embodiments, the multidimensional metric is a Valence-Arousal-Dominance, VAD, metric. There are several options for the emotional metric. However, VAD is a well-studied and well-known metric, yet simple, as it only contains three dimensions. Consequently, the choice of VAD will simplify the production of manually annotated testing data.
According to some embodiments, the method further comprises (in step b) using a Hidden-Markov-Model (HMM), for determining the first set of points where significant changes of an emotional state of an audio content of the audio-visual multimedia occurs.
The methods for Bayesian statistical analysis and learning models are well known and will not be described in detail, as variations of an HMM, such as other graphical statistical frameworks, can be used as well, as can other methods for interest point detection in a time series, e.g. differentiation, peak detection, step or edge detection, probabilistic inference methods or error minimisation methods. However, an HMM has been deemed suitable as it uses a probabilistic model, which gives more flexibility than the non-probabilistic models mentioned above, yet is simple, thus reducing model complexity to avoid overfitting as well as limiting computational effort.
Advantageously, the second algorithm described herein may be embodied by a HMM algorithm as described above.
According to a second aspect of the invention, the above object is achieved by a computer implemented method for automatically creating a trailer of an audio-visual multimedia, comprising the steps of:
- Creating metadata which defines a plurality of segments of audio visual content from an audio-visual multimedia according to the first aspect.
- Creating a second audio-visual multimedia, wherein an audio
content of the second audio-visual multimedia comprises the audio track used in the previously described embodiments, and wherein a visual content of the second audio-visual multimedia comprises a visual content from each AVS of the set of audio-visual segments defined in the metadata.
One important commercial aspect of the invention is the possibility to automatically create a movie trailer or preview from a longer video sequence. In this embodiment of the invention this process is defined. Additionally, by dividing the computation in two separate steps, the flexibility is increased as analysis and metadata creation can be performed on a device separate from the preview production. It may also reduce the amount of data transferred over a network if the original AVM is analysed at its source, and only the metadata along with the selected AVSs are sent to the preview production device.
According to a third aspect of the invention, the above object is achieved by a computer program product comprising a computer-readable medium having computer code instructions stored thereon for carrying out the method of the first aspect when executed by a device having processing capability.
According to a fourth aspect of the invention, the above object is achieved by a device for creating metadata which defines a plurality of segments of audio-visual content from an audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual
multimedia, and a running length, wherein the metadata further defines an order of the segments, the device comprises:
- A receiving component configured for receiving an audio-visual multimedia and an audio track.
- A processor configured for:
o Determining a candidate set of AVSs of the audio-visual multimedia by analysing an audio content of the AVM to determine a first set of points in time where significant changes of an emotional state of the audio content occurs, wherein the candidate set is determined based on the determined first set of points in time, wherein each AVS has a running length, a start time in the audio-visual multimedia and being associated with an emotional metric calculated by analysing an audio content of the AVS using a first algorithm.
o Calculating, at each point in time of a second set of points in time in the audio track, a correlation value between the emotional metric of each AVS and a second emotional metric of a segment of the audio track starting at the point in time and having the same running length as the audio-visual segment, the second emotional metric of the segment of the audio track calculated using the first algorithm.
o Sorting the calculated correlation values.
o Until each point in time in a full length of the audio track is mapped to an AVS and in a descending order of the sorted correlation values, mapping the AVS and the segment of the audio track which resulted in the correlation value to each other, wherein each AVS can be mapped to one segment of the audio track only.
o Creating the metadata based on the mapping, such that the AVSs mapped to the audio track are indicated in the metadata with their respective start time, and in an order
corresponding to a chronological order of the audio track in which the AVSs are mapped to the audio track.
According to a fifth aspect of the invention, the above object is achieved by a system comprising a first device according to the fourth aspect of the invention, and a second device configured for automatically creating a trailer of an audio-visual multimedia, the second device comprising:
- A receiving component configured for receiving, from the first
device, an audio-visual multimedia, an audio track and a metadata which defines a plurality of segments of audio-visual content from the audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual multimedia, and a running length, wherein the metadata further defines an order of the segments;
- A processor configured for creating the trailer being a second
audio-visual multimedia, wherein an audio content of the second audio-visual multimedia comprises the audio track, and wherein a visual content of the second audio-visual multimedia comprises a visual content from each AVS of the set of audio-visual segments defined in the metadata.
The second, third, fourth and fifth aspects may generally have the same features and advantages as the first aspect. It is further noted that the invention relates to all possible combinations of features unless explicitly stated otherwise.
Brief Description of the Drawings
The above, as well as additional objects, features and advantages of the present invention, will be better understood through the following illustrative and non-limiting detailed description of preferred embodiments of the present invention, with reference to the appended drawings, where the same reference numerals will be used for similar elements, wherein:
Figure 1 shows a flow diagram of a method for matching segments in an audio-visual multimedia to an audio track by means of the emotional state,
Figures 2-3 show embodiments of matching segments from an audio-visual multimedia to an audio track,
Figure 4 shows details of an embodiment of the matching process,
Figure 5 shows, by way of example, details of an embodiment of the correlation process.
Detailed description of embodiments
The present invention will now be described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown. The systems and devices disclosed herein will be described during operation.
Embodiments of the invention will now be described in conjunction with figure 1-3.
Figure 1 shows a flow diagram comprising the steps of receiving an AVM S101 and an audio track (202 in Figure 2, 302 in Figure 3) in S102. The audio-visual multimedia typically consists of a set of visual frames
accompanied by an audio track (201 in Figure 2, 301 in Figure 3) and can be in the form of a video file (avi, mpeg, matroska, vob, ogg, etc.). For the AVM, a set of frame segments of equal length is optionally determined in S103a; these are intended for analysis of emotional content, and the frequency of the segments constitutes the sample frequency. In some embodiments, each visual frame constitutes such a segment. A typical choice is to segment the AVM every second, but advantageously this frequency may be chosen differently depending on the desired outcome and the properties of the received AVM (e.g. framerate, resolution, etc.). The emotional analysis is performed in S104a by a first algorithm (further described above and below) on the audio content of the AVM and used to determine, in S105, points in time where significant changes of the audio content occur, after which the points in time are used to determine a second set of points in time at which to cut the AVM into segments. Data defining the segments are stored in a record of metadata (205 in Figure 2) defining, for each segment, at least a starting point in time in the AVM and a length.
In one embodiment, points in time where the emotional metrics of the audio content of the AVS undergo significant changes are detected by inference with an HMM. These changes could e.g. take the form of a step function, a saddle point, a local maximum or minimum, or a local maximum or minimum of a gradient with respect to time.
The HMM is the simplest form of a dynamic Bayesian network, describing a transition over time between hidden states, where each hidden state cannot be observed directly, yet produces an observable visible state. In the case presented in this specification, the hidden states may for instance be the emotional metric, whereas the visible states are detectable changes in the audio content, and each emotional state can be described as a probability distribution from a statistical model over the possible visible states. The problem of defining the emotional state is thus reduced to defining the parameters of the statistical model. For each point in time at which the analysis is performed, the parameters are stored in metadata and referred to as the emotional metric of the system at that particular point in time.
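As a hedged sketch, such hidden-state inference could be performed with an off-the-shelf Gaussian HMM; the hmmlearn package and the three-state default are assumptions here, and any comparable probabilistic inference tool would serve.

```python
import numpy as np
from hmmlearn import hmm   # third-party package, assumed available

def hmm_change_points(initial_metrics, n_states=3):
    """Fit a Gaussian HMM to the sequence of initial emotional metrics
    (the visible observations) and return the points in time where the
    most likely hidden emotional state changes."""
    X = np.asarray(initial_metrics, dtype=float)
    if X.ndim == 1:
        X = X.reshape(-1, 1)                 # scalar metric -> column vector
    model = hmm.GaussianHMM(n_components=n_states, n_iter=100)
    model.fit(X)
    states = model.predict(X)                # Viterbi path of hidden states
    return [t for t in range(1, len(states)) if states[t] != states[t - 1]]
```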
Several methods are possible for the definition of points in time for cutting the AVM. In one embodiment, two points in time where significant changes in the audio content occur are used. Each AVS is then defined as a section of the AVM between two consecutive points in time where significant changes have been detected by the second algorithm. In this case, it is expected that the emotional state of the audio track of the AVS remains relatively unchanged throughout the AVS (compared to at the points in time which are used as the start and end of this part of the audio track). Such a segment can be used where the same mood is desired in part of the preview, or in the preview as a whole. In another embodiment, one point in time where a significant change in the audio content occurs is used as the starting point (or middle point, end point, etc.) of the candidate AVS, together with a set length for the AVS, which is used to determine the end of the candidate AVS. Several set lengths may be used, such that one and the same point in time where a significant change in the audio content occurs will result in more than one candidate AVS, such as 2, 3, 5, etc. candidate AVSs, where e.g. one candidate AVS has a length of 4 seconds and another a length of 5 seconds. In this case, it may be expected that the emotional state of the audio track of the AVS displays a change throughout the AVS. Such a segment can be used to reflect a change of mood in the AVM in the preview, e.g. a transition from descriptive scenery or dialogue to action sequences.
The result of S106 will be a set of metadata defining a candidate set of AVSs. Advantageously, a final set of segments should be chosen from the AVM such that no overlap exists when mapped to the audio track and no gaps exist along the length of the audio track, i.e. the entire audio track is mapped. This process will be further described below.
Optionally, the set of AVSs may be reduced or modified as shown in S107. There may also be several reasons to exclude, or to repeat the use of, a specific segment in the mapping, such as to enhance preview quality or to increase or decrease the variation of content. In one embodiment, AVSs are removed from the candidate set of AVSs if deemed to contain visual content unsuitable for use in the preview. In this embodiment, the candidate content for the preview can be narrowed to a specific visual content, such as a specific colour scheme or direction of scene motion. AVSs with a high level of noise, low contrast, sudden changes or no visible content, such as cuts between movie scenes in the AVM, can also be removed from the candidate set of AVSs.
In another embodiment, AVSs are removed from the candidate set of AVSs if deemed to contain other content unsuitable for use in the preview. In this embodiment, the properties used to evaluate the suitability of the AVSs are subjective, such as user preferences or genre or topic of the AVM.
Additionally, specific parts of the AVM which reveal crucial parts of the plot may be removed prior to analysis (e.g. the last part of the AVM, such as the last 20%, the last 10 minutes, etc.) to prevent AVSs from these parts of the AVM from being used as candidate AVSs. Advantageously, these parts of the AVM are omitted already at the analysis stage (e.g. step S103a and onwards) to reduce the computational complexity of the process.
In one embodiment, the length of the AVSs is typically 2-5 seconds, and a candidate set of AVSs comprises 100-1000 segments.
The analysis of the AVM is depicted in Figure 2 by process 203 and in Figure 3 by process 303. Process 203 in Figure 2 and process 303 in Figure 3 consist of steps S101-S108 in the flow diagram in Figure 1. In the context of the invention, the calculation of the emotional metric S108a may be performed at several different locations in the flow. In one embodiment it is calculated last in process 203 in Figure 2 and 303 in Figure 3, as shown by S108a in Figure 1. In another embodiment it may be calculated simultaneously with the emotional analysis in S104a, or anywhere in between. In one embodiment, the output of the emotional analysis S104a,b is one initial emotional metric per segment. The (second, final) emotional metric calculated by a third algorithm in S108a,b is then an aggregated value, such as an average value of the initial emotional metric (or a max value etc., as exemplified above), calculated over the whole or part of the longer segments produced in S105.
The audio track (202 in Figure 2, 302 in Figure 3) is received in S102. In one embodiment, the audio track is segmented based on a sample frequency in S103b and analysed for emotional content in S104b, calculating the emotional metric in S108b, similar to the analysis of the AVM in S103a, S104a and S108a, respectively. In Figures 2 and 3 this analysis is shown by processes 204 and 304, respectively. The result is a set of metadata for the audio track 206 describing the emotional metrics of segments of the audio track. In another embodiment, the emotional analysis of the audio content 204 in Figure 2 (304 in Figure 3) is performed simultaneously with the combinatory analysis of the AVM and the audio track in S109, described next.
Referring to Figure 2, using the emotional metric recorded in metadata for both the AVM 205 and the audio track 206, a correlation score 210 is calculated in S109 and associated with each record in metadata 205, 206, so that each correlation score holds a reference back to the original AVS and the point in time at which the segment of the AT begins. As the metadata for the AVS 205 also holds the length of the segment, a correlation score for the combination of one AVS with a particular segment of the AT is effectively produced. Figure 5 shows a schematic view of the correlation scores and their association with the correlated entries. Each AVS 601 (from the candidate set of AVSs) is matched against each point in time of the AT 602, over a segment of equal length, and a correlation score 611 is calculated. The correlation score may be in the form of a scalar, a vector, the parameters of a model or another multidimensional representation. By way of example, Figure 5 shows 9 scalar correlation scores, each corresponding to a similarity score (e.g. cosine similarity) between the emotional metric of the audio track of an AVS and that of the corresponding point in time of the AT. Thus, a higher correlation score in this example refers to a higher similarity in emotional state between the audio track of the AVS and the corresponding sequence of the AT.
In one embodiment of the invention, the correlation scores are sorted in S111 and mapped in order of descending correlation score in S112. In Figure 5, the AVS at position 2 will first be mapped to the corresponding AT sequence at position 2, as position (2,2) in the matrix of correlation scores holds the highest value, 28. This mapping then disables (disqualifies) each overlapping AT sequence from being mapped, as well as the same AVS from being mapped again. Consequently, positions (1,2), (2,3), (2,1) and (3,2) are excluded from further consideration. Next, the AVS at position 1 will be mapped to position 3 along the AT. The process continues iteratively until the entire length of the AT is mapped.
In one embodiment, a cosine similarity score between the emotional metric of the audio track of the AVS and the emotional metric of the
corresponding point in time of the AT is used as correlation score. In other embodiments, related values such as cosine distance, angular cosine distance, angular cosine similarity or Euclidean distance can be used.
In one embodiment, the correlation scores are sorted column-wise and the point in time of the AT 602 at position 1 is mapped first, to the AVS with the highest correlation score in this column only, followed by the mapping of the point in time of the AT 602 at position 2, and so on. In Figure 5, this will result in the mapping (1,1), (2,2). In this embodiment, the mapping is performed so as to ensure the highest possible correlation score for the beginning of the AT, producing a preview with the highest quality at the beginning.
Advantageously, this ensures the viewer will see the highest quality part of the preview first, maintaining viewer interest.
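A hedged sketch of this column-wise variant follows; length and overlap bookkeeping is left out for brevity, so it illustrates the selection order rather than a complete mapper.

```python
def column_wise_map(scores, candidates, time_points):
    """Column-wise variant: each point in time of the AT is assigned, in
    chronological order, the not yet used AVS with the highest correlation
    score in that column, so the opening of the preview gets the best
    available matches."""
    used, mapping = set(), []
    for j, t in enumerate(time_points):
        remaining = [i for i in range(len(candidates)) if i not in used]
        if not remaining:
            break
        best = max(remaining, key=lambda i: scores[i][j])
        used.add(best)
        mapping.append((best, t))
    return mapping
```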
In one embodiment, the correlation scores are sorted row-wise and the AVS 601 at position 1 is mapped first, to the point in time of the AT with the highest correlation score in this row only, followed by the mapping of the AVS 601 at position 2, and so on. In Figure 5, this will result in the mapping (3,1), (2,2).
In one embodiment, the order of the mapped AVSs 601 is retained from the original AVM. In this case, any AVS with a starting time preceding that of the AVS currently being mapped is excluded from further evaluation. In Figure 5, this will result in the mapping of (2,2)=28 disqualifying all AVSs at positions 1 and 2, i.e. (1,1), (1,2), (1,3), (2,1) and (2,3), from being mapped to position 3 along the AT. Advantageously, this solution ensures that scenes appear in the same order as in the original AVM.
In one embodiment, the correlation values are mapped partly based on the value of the correlation score, but weighted by the order in which the AVSs appear in the original AVM, allowing them to be mapped to a limited extent in reverse order. In yet another embodiment, other factors are chosen to weigh in on the mapping, adjusting the mapping according to user preferences, visual content, amplitude of the audio track, etc.
The process of calculating the correlation score and mapping, shown by S109-S110, is also shown in Figure 2 as process 209. The process of mapping, S109-S115, is shown in Figure 3 as 309 and in detail in Figures 4a and 4b.
Referring to Figure 4, an AVS 407 is matched to a sequence 432 of equal length from the AT 402. The actual matching is made between the corresponding entry in the metadata for the AVS 405 and the entry in the metadata for the AT 406. The mapping is performed along a timeline 422 of equal length to the AT 402. The timeline has been divided into equally spaced points in time, and one point in time 423 is evaluated for mapping at a time. AVSs 408 have also previously been mapped to the timeline. The order in which mappings are evaluated is determined by the sorted correlation scores stored in the correlation value data set; the length of the AVS and of the AT sequence, and the starting point in time on the timeline, are also stored in the same correlation value data entry. As only one AVS can be mapped to any point in time 423 along the timeline 422, as soon as an AVS 408 is mapped, all points in time covered by the length of the AVS along the same timeline are disabled from further mapping (diagonally dashed in Figures 4a and 4b). The flag for disabling points in time along the audio track may be stored as part of the correlation value metadata set, or as separate data. When evaluating the mapping of an AVS to a disabled point in time, no mapping will take place (Figure 4a). In some embodiments, the starting point in time must be free for mapping to occur (Figure 4b). If the entire length of the AVS can be mapped to points in time along the timeline 422, the entire AVS 408 is mapped. If the AVS is longer than the available points in time along the timeline 422, but the starting point in time 423 against which mapping is evaluated is not disabled, the length of the AVS 407 is shortened from the end, producing a new AVS 417 that fits the available points in time along the timeline 422. This is shown in Figure 4b, illustrating that in one embodiment the length of the AVS is shortened to the length between the starting time of the AVS and the starting time of another, already mapped AVS 408, the starting time being a point in time along the timeline 422 of the AT. In this embodiment, the shortening of the AVS is performed after the mapping is complete, i.e. AVSs which are known to overlap are first mapped and then adjusted in length in a later step. In another embodiment, a similar shortening of the AVS is performed already at the mapping step, by checking the previously mapped AVSs 408 for overlap and shortening the AVS directly, thus saving one computational step. The shortening is recorded as a reduction of the value of the output metadata 216, 316 describing the length of the AVS.
The process will continue in the order of descending correlation score, until all points in time along the timeline have been mapped. The association between the AVS and the corresponding point in time along the timeline will be stored in a new set of metadata 216 in Figure 2 (316 in Figure 3), describing a final set of segments from the original AVM, each associated with at least a starting time in the original AVM 201 in Figure 2 (301 in Figure 3), a length and a starting time in the AT 202 in Figure 2 (302 in Figure 3) (point in time along the timeline).
In a last step, S116, the metadata is optionally used to combine each AVS described therein with the corresponding segment of the AT, forming a new AVM (320 in Figure 3) consisting of segments from the original AVM, with the original audio track overlaid or replaced by the new AT (302 in Figure 3).
Typically, the final new AVM (or metadata defining the ordered AVSs) consists of one to a few hundred segments, producing a preview/trailer which is a few minutes long, typically shorter than 10 minutes.
Figure 3 shows a graphic representation of the complete process of a computer implemented method for automatically creating a trailer of an audio-visual multimedia, comprising the steps of creating metadata 300 which defines a plurality of segments of audio-visual content from an AVM 301, and creating a second audio-visual multimedia 320, wherein an audio content of the second AVM comprises the AT 302, and wherein a visual content of the second AVM comprises a visual content from each AVS of the set of audio-visual segments defined in the metadata 316. The new AVM 320 in Figure 3 may be produced by a second device, or by a system separate from the process 300. In a similar way, the partial processes of creating candidate AVS metadata 303, calculating AT emotional metrics 304 and performing combinatory analysis and matching 309 may be computed on separate devices, avoiding the transfer of unnecessary data over a network, or to optimise computational flexibility.
Figure 2 shows a more detailed view of the process of creating the metadata. For simplicity, the steps corresponding to processes 209, 215 in Figure 2 have been combined into 309 in Figure 3, and the intermediate metadata 205, 206 and correlation data 210 in Figure 2 have been omitted from Figure 3.
In one aspect of the invention, the method is implemented as computer-readable instructions stored on a computer-readable storage medium and executable by a device having processing capability. The processing could be performed by a single processor executing the instructions in sequence, several processors executing parts of the instructions in parallel or in sequence, or a single processor executing parts of the instructions in parallel using separate memory allocations (multi-threading).
In one aspect of the invention, the above described method is implemented in a device as computer readable instructions stored on a computer-readable storage medium and executable by a device having processing capability (processor). The device further comprises a receiving component configured for receiving an AVM and an AT (receiving component, receiving means). The receiving component could be a computer
implemented method for transferring the AVM and the AT to and from different digital storage media, which could be stored on the same memory component as used by the processor, or by a separate memory component. By the term“memory component” should, in the context of the present specification, be understood a suitable means of storing digital information in retrievable form, including but not limited to physical computer hard drives, flash memory, RAM, ROM, EEPROM, or other memory technology, CD- ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer.
In a further aspect of the invention, the above described method is implemented as a system in a first device as computer readable instructions stored on a computer-readable storage medium and executable by a device having processing capability (processor, microprocessor). The device further comprises a receiving component configured for receiving an AVM and an AT (receiving component, receiving means). The system further comprises a second device, in turn comprising a second receiving component, configured for receiving any or several of: The AVM, the AT, a set of AVSs, a metadata as computed by the first device, wherein the metadata defines a plurality of segments of audio-visual content from the audio-visual multimedia, each segment indicated in the metadata by a start time in the audio visual multimedia, and a running length, wherein the metadata further defines an order of the segments. The second device further comprises a processor configured for creating a preview (trailer, short AVM), wherein an audio content of the second audio visual multimedia comprises the audio track, and wherein a visual content of the second audio-visual multimedia comprises a
visual content from each audio-visual segment of the set of audio-visual segments defined in the metadata. The transfer of data and metadata between the first and the second device could be by wireless or wired transmission, or the first and second device could be parts of the same physical device. Advantageously, if implemented on the same device, the amount of data transferred is limited, and the complete process is handled by the same device, allowing for easy use and a simple user interface.
Advantageously, when implemented on a set of physically separate devices, the processing power and storage capabilities of the two can be varied so as to optimise memory space and/or processing power. E.g., the first device can be optimised for data storage, holding a large set of complete AVMs and ATs, and being configured for creating metadata. Subsequently, metadata along with candidate AVSs and ATs are transferred to a second device, which could be physically remote from the first device, where the creation of a preview is completed. The set of candidate AVSs can then hold more AVSs than are finally displayed in the preview, allowing several previews to be constructed by the second device, thus limiting the processing power and memory space needed by the second device, yet retaining the capability of constructing a variety of previews.
The systems and methods disclosed hereinabove may be implemented as software, firmware, hardware or a combination thereof. In a hardware implementation, the division of tasks between functional units referred to in the above description does not necessarily correspond to the division into physical units; to the contrary, one physical component may have multiple functionalities, and one task may be carried out by several physical components in cooperation. Certain components or all components (such as the receiving component) may be implemented as software executed by a digital signal processor or microprocessor, or be implemented as hardware or as an application-specific integrated circuit. Such software may be distributed on computer readable media, which may comprise computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to a person skilled in the art, the term computer storage media includes both volatile and nonvolatile, removable and non-removable media
implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. Further, it is well known to the skilled person that communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media.
Thus, the invention should not be limited to the shown embodiments but should only be defined by the appended claims. Additionally, as the skilled person understands, the shown embodiments may be combined.
Claims
1. A computer implemented method for creating a metadata which defines a plurality of audio-visual segments from an audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual multimedia, and a running length, wherein the metadata further defines an order of the segments, the method comprising the steps of:
a) receiving an audio-visual multimedia;
b) determining a candidate set of audio-visual segments of the audio-visual multimedia by analysing an audio content of the audio-visual multimedia to determine a first set of points in time where significant changes of an emotional state of the audio content occurs, wherein the candidate set is determined based on the determined first set of points in time, wherein each audio-visual segment of the candidate set has a running length, a start time in the audio-visual multimedia and being associated with an emotional metric calculated by analysing an audio content of the audio-visual segment using a first algorithm;
c) receiving an audio track;
d) calculating, at each point in time of a second set of points in time in the audio track, a correlation value between the emotional metric of each audio-visual segment of the candidate set and a second emotional metric of a segment of the audio track starting at the point in time and having the same running length as the audio-visual segment, the second emotional metric of the segment of the audio track calculated using the first algorithm;
e) sorting the calculated correlation values;
f) until each point in time in a full length of the audio track is mapped to an audio-visual segment of the candidate set, and in a descending order of the sorted correlation values, mapping the segment of the audio track and the audio-visual segment which resulted in the correlation value to each other, wherein each audio-visual segment can be mapped to one segment of the audio track only;
g) creating the metadata based on the mapping, such that all audio-visual segments mapped to the audio track are indicated in the metadata with their respective start time, and with the order in which they are mapped to the audio track.
2. The method of claim 1, wherein the step g) comprises:
g1) setting the running length of each audio-visual segment defined in the metadata to one of:
the running length of the audio-visual segment, or a time period between the start time of the audio-visual segment and the start time of another, in a chronological order of the audio track, immediately following mapped audio-visual segment.
3. The method of any one of claims 1-2, wherein the step f) comprises: f1) upon determining that a mapping between an audio-visual segment and a segment of the audio track will overlap with another, in a chronological order of the audio track, immediately following mapping between a further audio-visual segment and a further segment of the audio track, shortening the audio-visual segment and the segment of the audio track to avoid such overlap.
4. The method of any one of claims 1-3, wherein the audio-visual multimedia comprises a set of audio-visual frames, and wherein step b) comprises:
b1) determining a first set of audio-visual segments of the audio-visual multimedia based on a first sample frequency, each audio-visual segment comprising one or more audio-visual frames,
b2) for each audio-visual segment of the first set of audio-visual segments, performing emotion analysis on an audio content of the audio-visual segment using the first algorithm to determine an initial emotional metric of the audio-visual segment,
b3) using a second algorithm and the determined initial emotional metric of the first set of audio-visual segments to determine the first set of
points in time in the audio-visual multimedia where significant changes between the initial emotional metric of consecutive segments from the first set of segments occur,
b4) determining a second set of audio-visual segments of the audio-visual multimedia based on the first set of points in time in the audio-visual multimedia where significant changes occur, each audio-visual segment of the second set of audio-visual segments comprising one or more audio-visual segments of the first set of audio-visual segments, each audio-visual segment of the second set of audio-visual segments being associated with the emotional metric calculated by a third algorithm using the initial emotional metric of said one or more audio-visual segments of the first set of audio-visual segments,
b5) using the second set of audio-visual segments as the candidate set of audio-visual segments of the audio-visual multimedia.
5. The method of any one of claims 1-4, wherein step d) comprises:
d1) determining a plurality of correlation values, wherein each correlation value of the plurality of correlation values is associated with an audio-visual segment of the candidate set and a point in time in the audio track, the plurality of correlation values determined by, for each audio-visual segment of the candidate set and for each point in time of the second set of points in time of the audio track:
d1.1) calculating a second emotional metric of a segment of the audio track by performing emotion analysis on the segment using the first algorithm, the segment of the audio track starting at the point in time and having a length equal to the length of the audio-visual segment,
d1.2) calculating a correlation value between the emotional metric associated with the audio-visual segment and the second emotional metric of the segment of the audio track,
d1.3) associating the correlation value with the audio-visual segment and with the point in time of the audio track, and adding the correlation value to the plurality of correlation values.
6. The method of claim 5, wherein step f) comprises:
defining a timeline with a length equal to a length of the audio track, and repetitively determining a highest correlation value among the plurality of correlation values until all points in time in the timeline are assigned an audio-visual segment from the candidate set, wherein when determining a highest correlation value, the method further comprises;
assigning the audio-visual segment associated with the determined highest correlation to a segment of time in the timeline, the segment of time starting at the point in time associated with the determined highest correlation and having a same length as the running length of the audio-visual segment; wherein, upon determining that assigning the audio-visual segment to the segment of time results in an overlap with a, in a chronological order of the timeline, immediately following assignment of a further audio-visual segment to a further segment of the timeline, the audio-visual segment and the segment of the timeline will be shortened to avoid such overlap;
wherein assignment of an audio-visual segment to a segment of time in the timeline results in that all correlation values in the plurality of correlation values associated with that audio-visual segment are disabled from being determined as a highest score.
7. The method of claim 6, wherein assignment of an audio-visual segment to a segment of time in the timeline results in that all correlation values associated with a point in time which overlaps with that segment of time in the timeline are disabled from being determined as a highest score.
8. The method of any one of claims 1-7, wherein a segment of the candidate set of audio-visual segments is determined to be one of:
a segment of the audio-visual multimedia between two points in time of the first set of points in time where significant changes of an emotional state of an audio content of the audio-visual multimedia occurs; or
a segment of the audio-visual multimedia around a point in time of the first set of points in time where significant changes of an emotional state of the audio content occurs, and having a predetermined running length.
9. The method of any one of claims 1-8, wherein step b) comprises:
upon determining that an audio-visual segment comprises a visual break representing a cut in the audio-visual multimedia, removing the audio-visual segment from the candidate set of audio-visual segments.
10. The method of claim 9, wherein the multidimensional metric is a Valence-Arousal-Dominance, VAD, metric.
11. The method of any one of claims 1-10, wherein step b) comprises using a Hidden-Markov-Model, for determining the first set of points where significant changes of an emotional state of an audio content of the audio-visual multimedia occurs.
12. A computer implemented method for automatically creating a trailer of an audio-visual multimedia, comprising the steps of:
creating metadata which defines a plurality of segments of audio-visual content from an audio-visual multimedia according to any one of claims 1-11; creating a second audio-visual multimedia, wherein an audio content of the second audio-visual multimedia comprises the audio track used in claims 1-11, and wherein a visual content of the second audio-visual multimedia comprises a visual content from each audio-visual segment of the set of audio-visual segments defined in the metadata.
13. A computer program product comprising a computer-readable storage medium with instructions adapted to carry out the method of any one of claims 1-11 or claim 12 when executed by a device having processing capability.
14. A device for creating metadata which defines a plurality of segments of audio-visual content from an audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual multimedia, and a running length, wherein the metadata further defines an order of the segments, the device comprises:
a receiving component configured for receiving an audio-visual multimedia and an audio track;
a processor configured for:
determining a candidate set of audio-visual segments of the audio-visual multimedia by analysing an audio content of the audio-visual multimedia to determine a first set of points in time where significant changes of an emotional state of the audio content occurs, wherein the candidate set is determined based on the determined first set of points in time, wherein each audio-visual segment of the candidate set has a running length, a start time in the audio-visual multimedia and being associated with an emotional metric calculated by analysing an audio content of the audio-visual segment using a first algorithm;
calculating, at each point in time of a second set of points in time in the audio track, a correlation value between the emotional metric of each audio-visual segment of the candidate set and a second emotional metric of a segment of the audio track starting at the point in time and having the same running length as the audio-visual segment, the second emotional metric of the segment of the audio track calculated using the first algorithm;
sorting the calculated correlation values;
until each point in time in a full length of the audio track is mapped to an audio-visual segment of the candidate set and in a descending order of the sorted correlation values, mapping the audio-visual segment and the segment of the audio track which resulted in the correlation value to each other, wherein each audio-visual segment can be mapped to one segment of the audio track only;
creating the metadata based on the mapping, such that the audio-visual segments mapped to the audio track are indicated in the metadata with their respective start time, and in an order corresponding to a chronological
order of the audio track in which the audio-visual segments are mapped to the audio track.
15. A system comprising a first device according to claim 14, and a second device configured for automatically creating a trailer of an audio-visual multimedia, the second device comprising:
a receiving component configured for receiving, from the first device, an audio-visual multimedia, an audio track and a metadata which defines a plurality of segments of audio-visual content from the audio-visual multimedia, each segment indicated in the metadata by a start time in the audio-visual multimedia, and a running length, wherein the metadata further defines an order of the segments;
a processor configured for:
creating the trailer being a second audio-visual multimedia, wherein an audio content of the second audio-visual multimedia comprises the audio track, and wherein a visual content of the second audio-visual multimedia comprises a visual content from each audio-visual segment of the set of audio-visual segments defined in the metadata.
Patent Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030160944A1 (en) * | 2002-02-28 | 2003-08-28 | Jonathan Foote | Method for automatically producing music videos |
| WO2003101097A1 (en) * | 2002-05-28 | 2003-12-04 | Yesvideo, Inc. | Summarization of a visual recording |
| EP2966645A1 (en) * | 2014-07-10 | 2016-01-13 | Nokia Technologies Oy | Method, apparatus and computer program product for editing media content |
| WO2017157419A1 (en) * | 2016-03-15 | 2017-09-21 | Telefonaktiebolaget Lm Ericsson (Publ) | Associating metadata with a multimedia file |
| US20180295427A1 (en) * | 2017-04-07 | 2018-10-11 | David Leiberman | Systems and methods for creating composite videos |
Non-Patent Citations (2)
| Title |
|---|
| HANJALIC A ET AL: "Affective Video Content Representation and Modeling", IEEE TRANSACTIONS ON MULTIMEDIA, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 7, no. 1, 1 February 2005 (2005-02-01), pages 143 - 154, XP011125470, ISSN: 1520-9210, DOI: 10.1109/TMM.2004.840618 * |
| MIN XU ET AL: "Hierarchical movie affective content analysis based on arousal and valence features", ACM MULTIMEDIA, PROCEEDINGS OF THE INTERNATIONAL CONFERENCE(MM'08 ), VANCOUVER, BRITISH COLUMBIA, CANADA, ACM PRESS, NEW YORK, NY, 26 October 2008 (2008-10-26), pages 677 - 680, XP058272135, ISBN: 978-1-60558-303-7, DOI: 10.1145/1459359.1459457 * |
Similar Documents
| Publication | Title |
|---|---|
| US11620326B2 (en) | User-specific media playlists | |
| US9436876B1 (en) | Video segmentation techniques | |
| US9218101B2 (en) | Displaying estimated social interest in time-based media | |
| US10866987B2 (en) | Evaluating performance of recommender system | |
| US9646227B2 (en) | Computerized machine learning of interesting video sections | |
| US9002175B1 (en) | Automated video trailer creation | |
| US11758228B2 (en) | Methods, systems, and media for modifying the presentation of video content on a user device based on a consumption of the user device | |
| US10841666B1 (en) | Generation of points of insertion of directed content into a video asset | |
| US20150020086A1 (en) | Systems and methods for obtaining user feedback to media content | |
| US20170347151A1 (en) | Facilitating Television Based Interaction with Social Networking Tools | |
| US20170220867A1 (en) | Identifying presentation styles of educational videos | |
| US10853417B2 (en) | Generating a platform-based representative image for a digital video | |
| CN113766268A (en) | Video processing method and device, electronic equipment and readable medium | |
| CN102473409B (en) | Benchmark Model Adaptation Devices, ICs, and AV Equipment for Sound Spaces | |
| CN118695044A (en) | Method, device, computer equipment, readable storage medium and program product for generating promotional video | |
| CN110046263A (en) | Multimedia recommendation method, device, server and storage medium | |
| US20250240502A1 (en) | Optimizing insertion points for content based on audio characteristics | |
| JP2011254342A (en) | Method for editing video, device for editing video, and program for editing video | |
| WO2020108756A1 (en) | Automatic composition of audio-visual multimedia | |
| US20240020977A1 (en) | System and method for multimodal video segmentation in multi-speaker scenario | |
| Darabi et al. | Personalized video summarization by highest quality frames | |
| US20230290109A1 (en) | Behavior-based computer vision model for content selection | |
| US20250111675A1 (en) | Media trend detection and maintenance at a content sharing platform | |
| US20250118060A1 (en) | Media trend identification in short-form video platforms | |
| Xu et al. | Automatic generated recommendation for movie trailers |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 18811793; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 18811793; Country of ref document: EP; Kind code of ref document: A1 |