WO2024235420A1 - An apparatus and a method for managing playlist - Google Patents
- Publication number
- WO2024235420A1 (PCT/EP2023/062711)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- playlist
- user
- user interface
- video
- timestamps
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/74—Browsing; Visualisation therefor
- G06F16/745—Browsing; Visualisation therefor the internal structure of a single video sequence
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
Definitions
- the present disclosure generally relates to the field of information technology. Some embodiments of the disclosure relate to managing playlists with timestamps.
- a playlist may be an electronic file comprising a list of video and/or audio data files to be played on an application or a device.
- ways to modify a playlist may be limited. It would be beneficial to provide improved means for managing a playlist and its content.
- the playlists may be managed based on timestamps selected by a user or indicated by a keyframe classifier on a timeline of at least one of video or audio data.
- a method for managing playlists may be computer-implemented.
- the method may comprise providing a playlist user interface to be displayed on a user equipment, the playlist user interface comprising a user interface element allowing a user to select a video for editing with timestamps; causing at least one of video or audio data of the video selected by the user to be displayed on the playlist user interface with a timeline; obtaining information on two or more timestamps for the timeline based on at least one of user input or keyframes detected by a keyframe classifier based on the video; determining one or more segments of the video or audio data based on the two or more timestamps, wherein the segment comprises an interval between two consecutive timestamps; causing the playlist user interface to display at least one user interface element selectable by the user for saving at least one of the one or more segments to a playlist; and causing the one or more segments to be saved to the playlist based on received user input for selecting the at least one user interface element.
- This may enable managing the playlist based on segments of the video or audio data determined with the timestamps.
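The claimed step of determining segments as intervals between consecutive timestamps can be sketched as follows; this is an illustrative helper under assumed names, not code from the disclosure:

```python
def segments_from_timestamps(timestamps):
    """Return (start, end) intervals between consecutive timestamps.

    Hypothetical helper: each segment is the interval between two
    consecutive timestamps on the timeline, as described in the claim.
    """
    ts = sorted(set(timestamps))   # timestamps in seconds, deduplicated
    return list(zip(ts, ts[1:]))   # consecutive pairs form the segments

# Two or more timestamps yield one or more segments:
print(segments_from_timestamps([12.0, 4.5, 30.0]))  # [(4.5, 12.0), (12.0, 30.0)]
```

Note that unordered user input is first sorted, so the segments always run forward along the timeline.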
- the method may further comprise causing the playlist user interface to display a user interface element selectable by the user for saving at least one of the whole video or audio data to the playlist; and causing the whole video or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element.
- a user is provided with an option to save one or more selected segments of a video or audio data, the whole video or audio data, or both, to the playlist.
- the method may comprise receiving, from the user, an indication of selection of the whole video or the one or more segments to be inputted for keyframe detection; calculating a self-correlation between a current frame and neighboring frames for each frame of the selected video or the one or more segments to obtain keyframe labels and build a generative label for a keyframe classifier; calculating at least one of multi-scale contrast feature, relative motion intensity feature and relative motion consistency feature for video data or relative feature intensity feature and relative feature consistency feature for audio data of the current and neighboring frames; performing attention fusion for the calculated features to output a fused feature to the keyframe classifier; extracting keyframes by the keyframe classifier based on the input keyframe labels and fused features; displaying on the timeline suggested timestamps to be selected by the user for the one or more segments, wherein the suggested timestamps correspond to the extracted keyframes.
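The self-correlation labeling step above might be sketched as below. This is a strong simplification under assumed names: frames are stand-in feature vectors, and the learned keyframe classifier itself is not modeled; only the idea that a frame correlating weakly with its neighbors is labeled as a candidate keyframe.

```python
import math

def self_correlation(curr, neighbors):
    """Mean cosine similarity between a frame vector and its neighbor frames."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v)) or 1.0
    sims = [
        sum(a * b for a, b in zip(curr, n)) / (norm(curr) * norm(n))
        for n in neighbors
    ]
    return sum(sims) / len(sims)

def generative_keyframe_labels(frames, window=1, threshold=0.8):
    """Label a frame as a keyframe when it correlates weakly with neighbors.

    Hypothetical labeling rule; the disclosure only states that
    self-correlation is used to build generative labels for a classifier.
    """
    labels = []
    for i, f in enumerate(frames):
        nbrs = frames[max(0, i - window):i] + frames[i + 1:i + 1 + window]
        labels.append(1 if self_correlation(f, nbrs) < threshold else 0)
    return labels
```

A scene change then shows up as a run of low-correlation frames, which the downstream classifier would refine using the fused features.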
- the attention fusion is performed with a transformer-based attention model. This may improve performance of the attention model compared to simpler fusion schemes.
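As a minimal illustration of attention-style fusion, the per-modality features could be combined with a softmax-weighted sum. A real transformer-based model learns the attention scores from data; here the relevance logits are supplied directly, so this is only a sketch of the pooling step:

```python
import math

def attention_fuse(features, scores):
    """Softmax-weighted sum of equal-length feature vectors.

    `features`: one vector per modality (e.g. contrast, motion intensity,
    motion consistency); `scores`: relevance logits, assumed given here.
    """
    m = max(scores)
    exp = [math.exp(s - m) for s in scores]   # shift for numerical stability
    total = sum(exp)
    weights = [e / total for e in exp]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]
```

With equal logits every modality contributes equally; a learned model would instead up-weight the modality most informative for keyframe detection.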
- the method may further comprise: determining if the input is associated with one or more timestamps pre-selected by the user on the timeline; comparing the timestamps of the extracted keyframes to the one or more preselected timestamps; and displaying on the timeline at least timestamps of the extracted keyframes locating nearest to the one or more pre-selected timestamps as the suggested timestamps.
- This enables aiding the user to position the timestamps for segments more accurately. For example, a statistical feature learned from a large-scale dataset based on the extraction may provide a more accurate result compared with the manually set timestamp.
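The correction step, snapping user-preselected timestamps to the nearest extracted keyframes, can be sketched with a hypothetical helper:

```python
def suggest_timestamps(preselected, keyframe_times):
    """For each pre-selected timestamp, suggest the nearest keyframe time.

    Illustrative sketch: times in seconds, keyframes from the classifier.
    """
    return [min(keyframe_times, key=lambda k: abs(k - t)) for t in preselected]

print(suggest_timestamps([10.2, 31.0], [4.0, 10.0, 30.5]))  # [10.0, 30.5]
```

The user's rough manual placement is thus nudged onto content-derived boundaries.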
- the method may comprise causing the one or more segments to be saved with a unique identifier, wherein the unique identifier is inputted by the user or generated automatically.
- the segments may be named automatically, by the user, or both, to enable the segments to be found and managed easily based on their identifiers.
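One way to realize "inputted by the user or generated automatically" is a fallback to a random unique identifier; the naming scheme below is an assumption for illustration:

```python
import uuid

def segment_identifier(user_name=None):
    """Use the user-supplied name if given, else generate a unique one.

    Hypothetical scheme: 'segment-' plus 8 hex characters of a UUID.
    """
    return user_name if user_name else f"segment-{uuid.uuid4().hex[:8]}"
```

A generated identifier remains stable once saved, so segments can be searched and linked reliably afterwards.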
- the method may comprise causing the playlist user interface to display a user interface element allowing the user to modify the unique identifier of each segment.
- a naming function may be provided such that the user is able to manage names of the segments via the playlist user interface.
- the unique identifier is saved with an identifier of the video. This enables linking the segments and respective videos such that playing the items may be managed individually as well as by groups.
- the method may comprise causing the playlist user interface to display the unique identifier together with the identifier of the video as a name of the segment on the playlist.
- both the name of the segment and an identifier of the video of origin may be retrieved to enable ease of use and more efficient management of the playlist for the user.
- the method may comprise providing a search tool on the playlist user interface allowing the user to search the saved segments based on at least one of the unique identifiers or the identifier of the video. This helps the user in managing and using the playlist.
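The search tool described above, matching on either the segment's unique identifier or the video identifier, could look like this sketch (field names are assumptions):

```python
def search_playlist(items, query):
    """Case-insensitive substring search over segment and video identifiers."""
    q = query.lower()
    return [
        it for it in items
        if q in it["segment_id"].lower() or q in it["video_id"].lower()
    ]
```

Searching by the video identifier finds all segments cut from that video, which supports the grouped management described in the disclosure.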
- the method may comprise causing at least one of the one or more segments, the whole video or the whole audio data to be saved with one or more tags indicating if a saved item is at least one of a segment, a whole video data, a whole audio data or associated to another item saved on the playlist; causing the playlist user interface to display a user interface element for allowing the user to select between at least two different play modes for the playlist, wherein the play mode is configured to cause playing saved items based on one or more tags; and causing the playlist to be played according to selected play mode. This enables controlling how the playlist is played based on tags for the saved segments and other saved items.
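The tag-driven play modes might be realized as a simple filter over the saved items' tags; the mode names below are assumptions, as the disclosure only requires at least two modes:

```python
def items_for_play_mode(items, mode):
    """Select which saved items a play mode should play, based on tags.

    Hypothetical modes: 'segments_only', 'whole_only', or 'all'.
    """
    if mode == "segments_only":
        return [it for it in items if "segment" in it["tags"]]
    if mode == "whole_only":
        return [it for it in items if "whole" in it["tags"]]
    return list(items)
```

Tags marking an item as associated with another saved item could similarly drive a mode that plays a whole video together with its segments.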
- the method may comprise providing a user interface element for allowing the user to select how the playlist is displayed on the playlist user interface, wherein the user is allowed to select between a single list form, wherein at least one of the whole video or one or more segments are listed together, or in a hierarchical form, wherein the whole video is listed at a higher level and the segments at a lower level; and causing the playlist to be displayed on the playlist user interface according to the user selection with user interface elements for allowing the user to select to play one or more saved items on the playlist based on a unique identifier or the tag of the saved item.
- This enables controlling with tags and other identifiers how the playlist and its control elements are displayed.
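The hierarchical display form, with whole videos at the higher level and their segments below, amounts to grouping items by video identifier; a sketch with assumed field names:

```python
from collections import defaultdict

def hierarchical_view(items):
    """Group saved segment identifiers under their originating video."""
    tree = defaultdict(list)
    for it in items:
        tree[it["video_id"]].append(it["segment_id"])
    return dict(tree)
```

The single-list form would simply render `items` in order, so the same saved data can back both display selections.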
- the method may comprise causing one or more user interface elements to be displayed on the playlist user interface allowing the user to at least one of play selected, next or previous segment, whole video or whole audio data on the playlist, play a random item saved on the playlist, play the playlist by order, add a saved segment from the playlist to another playlist or modify a playing order of the playlist.
- a user may be provided with different kinds of selectable control elements associated with the timestamps for playing the playlist.
- the user interface element for playing next or previous segment causes playing the next or previous segment on the playlist associated to a same video as a current segment.
- the user interface element enables to play segments linked with a same original video data.
- the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist associated to a different video than a current segment.
- a user interface element may be configured to allow a user to play segments linked with different original video data.
- the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist comprising a same data type as a current segment, wherein the data type comprises at least one of video or audio data.
- a user interface element may be configured to manage functions of the playlist based on data types of items saved on the playlist.
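The three next/previous variants above (same video, different video, same data type) differ only in the predicate used to pick the next item; a sketch with hypothetical predicate factories:

```python
def next_item(playlist, current_index, predicate):
    """Index of the next playlist item satisfying `predicate`, else None."""
    for i in range(current_index + 1, len(playlist)):
        if predicate(playlist[i]):
            return i
    return None

def same_video(cur):
    """Next segment linked to the same original video as the current one."""
    return lambda it: it["video_id"] == cur["video_id"]

def other_video(cur):
    """Next segment linked to a different video than the current one."""
    return lambda it: it["video_id"] != cur["video_id"]

def same_type(cur):
    """Next item of the same data type (video or audio)."""
    return lambda it: it["type"] == cur["type"]
```

A "previous" control is the mirror image, scanning backwards from the current index.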
- the user inputs are received via at least one of the first user equipment displaying the playlist user interface or a second user equipment communicatively coupled with the first user equipment.
- the playlist may be managed by the user via a device displaying the user interface, or via a second device such as a wearable device, a mobile device or an embedded control system of a vehicle.
- an apparatus for managing playlists may comprise at least one processor; and at least one memory comprising instructions which, when executed by the at least one processor, cause the apparatus at least to: provide a playlist user interface to be displayed on a user equipment, the playlist user interface comprising a user interface element allowing a user to select a video for editing with timestamps; cause at least one of video or audio data of the video selected by the user to be displayed on the playlist user interface with a timeline; obtain information on two or more timestamps for the timeline based on at least one of user input or keyframes detected by a keyframe classifier based on the video; determine one or more segments of the video or audio data based on the two or more timestamps, wherein the segment comprises an interval between two consecutive timestamps; cause the playlist user interface to display at least one user interface element selectable by the user for saving at least one of the one or more segments to a playlist; and cause the one or more segments to be saved to the playlist based on received user input for selecting the at least one user interface element.
- the at least one memory further comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the playlist user interface to display a user interface element selectable by the user for saving at least one of the whole video or audio data to the playlist; and cause the whole video or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element.
- a user is provided with an option to save one or more selected segments of a video or audio data, the whole video or audio data, or both, to the playlist.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: receive, from the user, an indication of selection of the whole video or the one or more segments to be inputted for keyframe detection; calculate a self-correlation between a current frame and neighboring frames for each frame of the selected video or the one or more segments to obtain keyframe labels and build a generative label for a keyframe classifier; calculate at least one of multi-scale contrast feature, relative motion intensity feature and relative motion consistency feature for video data or relative feature intensity feature and relative feature consistency feature for audio data of the current and neighboring frames; perform attention fusion for the calculated features to output a fused feature to the keyframe classifier; extract keyframes by the keyframe classifier based on the input keyframe labels and fused features; display suggested timestamps to be selected by the user for the one or more segments, wherein the suggested timestamps correspond to the extracted keyframes. This may enable providing suggestions for the user on where to place the timestamps.
- the attention fusion is performed with a transformer-based attention model. This enables improving performance of the attention model.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: determine if the input is associated with one or more timestamps pre-selected by the user on the timeline; compare the timestamps of the extracted keyframes to the one or more pre-selected timestamps; and display on the timeline at least timestamps of the extracted keyframes locating nearest to the one or more pre-selected timestamps as the suggested timestamps.
- This enables aiding the user to position the timestamps for segments more accurately. For example, a statistical feature learned from a large-scale dataset based on the extraction may provide a more accurate result compared with the manually set timestamp.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the one or more segments to be saved with a unique identifier, wherein the unique identifier is inputted by the user or generated automatically.
- the segments may be named automatically, by the user, or both, to enable the segments to be found and managed easily based on their identifiers.
- a naming function may be provided such that the user is able to manage names of the segments via the playlist user interface.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the playlist user interface to display a user interface element allowing the user to modify the unique identifier of each segment.
- the unique identifier is saved with an identifier of the video. This enables linking the segments and respective videos such that playing the items may be managed individually as well as by groups.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the playlist user interface to display the unique identifier together with the identifier of the video as a name of the segment on the playlist.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: provide a search tool on the playlist user interface allowing the user to search the saved segments based on at least one of the unique identifiers or the identifier of the video. This helps the user in managing and using the playlist.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause at least one of the one or more segments, the whole video or the whole audio data to be saved with one or more tags indicating if a saved item is at least one of a segment, a whole video data, a whole audio data or associated to another item saved on the playlist; cause the playlist user interface to display a user interface element for allowing the user to select between at least two different play modes for the playlist, wherein the play mode is configured to cause playing saved items based on one or more tags; and cause the playlist to be played according to selected play mode. This enables controlling how the playlist is played based on tags for the saved segments and other saved items.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: provide a user interface element for allowing the user to select how the playlist is displayed on the playlist user interface, wherein the user is allowed to select between a single list form, wherein at least one of the whole video or one or more segments are listed together, or in a hierarchical form, wherein the whole video is listed at a higher level and the segments at a lower level; and cause the playlist to be displayed on the playlist user interface according to the user selection with user interface elements for allowing the user to select to play one or more saved items on the playlist based on a unique identifier or the tag of the saved item.
- This enables controlling with tags and other identifiers how the playlist and its control elements are displayed.
- the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause one or more user interface elements to be displayed on the playlist user interface allowing the user to at least one of play selected, next or previous segment, whole video or whole audio data on the playlist, play a random item saved on the playlist, play the playlist by order, add a saved segment from the playlist to another playlist or modify a playing order of the playlist.
- a user may be provided with different kinds of selectable control elements associated with the timestamps for playing the playlist.
- the user interface element for playing next or previous segment causes playing the next or previous segment on the playlist associated to a same video as a current segment.
- the user interface element enables to play segments linked with a same original video data.
- the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist associated to a different video than a current segment.
- a user interface element may be configured to allow a user to play segments linked with different original video data.
- the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist comprising a same data type as a current segment, wherein the data type comprises at least one of video or audio data.
- a user interface element may be configured to manage functions of the playlist based on data types of items saved on the playlist.
- the user inputs are received via at least one of the first user equipment displaying the playlist user interface or a second user equipment communicatively coupled with the apparatus.
- the playlist may be managed by the user via a device displaying the user interface, or via a second device such as a wearable device, a mobile device or an embedded control system of a vehicle.
- Implementation forms of the present disclosure can thus provide a method and an apparatus for managing playlists, for example, based on timestamps. These and other aspects of the present disclosure will be apparent from the example embodiments described below.
- FIG. 1 illustrates an example of an apparatus configured to perform one or more embodiments of the disclosure
- FIG. 2 illustrates an example of a playlist interface for timestamp labeling of a playlist with a saving function for at least one of video or audio segments according to an embodiment of the disclosure
- FIG. 3 illustrates an example of a playlist interface configured to display saved segments and other playlist items in a single list according to an embodiment of the disclosure
- FIG. 4 illustrates an example of a playlist interface configured to display saved segments and other playlist items in a hierarchical form according to an embodiment of the disclosure
- FIG. 5 illustrates an example of a playlist interface with a search and naming function according to an embodiment of the disclosure
- FIG. 6 illustrates an example of a flowchart for learning-based keyframe detection, according to an embodiment of the disclosure.
- FIG. 7 illustrates an example of a method for playlist management according to an embodiment of the disclosure. Like references are used to designate like parts in the accompanying drawings.
- a playlist may refer to a collection of media content in an electronic form explicitly saved by a user or a group of users.
- whole videos may be listed.
- the videos may be downloaded on a device configured to play the playlist or streamed without downloading them on the device.
- the video data may further comprise audio data.
- a video platform may provide a function that allows users to label their videos with timestamps. Referring to the timestamps, a video may be divided into segments so that the users can choose to start from a preferred part of the video marked with the timestamp and to end the video segment at a second timestamp.
- An objective of an example embodiment is to provide a function and an interface for at least one of a video or audio playlist, wherein at least one of saving, editing or labeling segments of the video/audio data is enabled.
- a user may be able to choose to save only a part of a video or audio, edit timestamps of the video or audio, and search a particular video/audio segment respective to the one or more timestamps. Different playing modes related to the whole video/audio data, or one or more segments of the video/audio data, as well as naming and searching methods for the video/audio segments are provided. Further, a tool enabling easier setting of the timestamps for a video or audio is provided. An auto-detection and auto-correction algorithm is designed for quicker timestamp editing.
- an automatic timestamp detection is provided to facilitate timestamp selection.
- a correction tool for timestamps may be provided to improve position of the timestamp along a timeline of the video and/or audio data based on content of the video and/or audio data.
- FIG. 1 illustrates an example of an apparatus 100 configured to practice one or more example embodiments.
- the apparatus 100 may comprise at least one processor 102.
- the at least one processor 102 may comprise, for example, one or more of various processing devices, such as for example a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
- the apparatus 100 may further comprise at least one memory 104.
- the at least one memory 104 may be configured to store, for example, computer program code or the like, for example operating system software and application software.
- the at least one memory 104 may comprise one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination thereof.
- the memory may be embodied as magnetic storage devices (such as hard disk drives, magnetic tapes, etc.), optical or magneto-optical storage devices, or semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).
- the apparatus 100 may further comprise a communication interface 108 configured to enable the apparatus 100 to transmit information to other devices.
- the communication interface 108 may be further configured to receive information from other devices.
- the communication interface 108 may be configured to receive at least one of video or audio data.
- the apparatus 100 may be configured to download and store a copy of the video and/or audio data, for example, to the at least one memory 104.
- the apparatus 100 may be configured to stream the video and/or audio data. Streaming may refer to continuous transmission of audio or video files from a server to a client device. With streaming, the media file being played on the client device may be stored remotely, and may be transmitted a few seconds at a time over the Internet.
- the apparatus 100 may be able to play media files also without copying and saving them or provide streaming for other devices.
- the apparatus 100 may comprise the remote server configured for streaming, or the apparatus 100 may be configured to provide content, such as playlist, to be streamed by a remote server.
- the apparatus 100 may further comprise other components and/or functions such as for example a user interface comprising at least one input device and/or at least one output device.
- the input device may take various forms, such as a keyboard, a touch screen, or one or more embedded control buttons.
- the output device may for example comprise a display, a speaker, or the like.
- the apparatus 100 may be also configured to control an external user interface based on received user inputs.
- the apparatus 100 may be configured to cause a user interface to display one or more user interface elements, and perform one or more operations based on user interaction with the one or more user interface elements.
- a user interface element may be a visual representation of an object or feature in the interface.
- the one or more user interface elements may include at least one of input controls, navigational and informational components, and containers.
- some component and/or components of the apparatus 100 may be configured to implement this functionality.
- this functionality may be implemented using program code 106 comprised, for example, in the at least one memory 104.
- the apparatus 100 comprises a processor or processor circuitry, such as for example a microcontroller, configured by the program code 106, when executed, to execute the embodiments of the operations and functionality described herein.
- the functionality described herein can be performed, at least in part, by one or more hardware logic components.
- illustrative types of hardware logic components include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), graphics processing units (GPUs), or the like.
- An infotainment system in a car may refer to hardware and software configured to deliver a combination of information and entertainment content/services, for example, via a touchscreen or a display mounted in the car.
- the playlist interface may comprise a user interface element, UI element, configured to allow the user to select a video 202 for editing with timestamps.
- Timestamps may refer to transcription tags configured to identify an exact point in an electronic file with media content, such as video and/or audio data. Timestamps may be used to label points for different video and/or audio segments.
- a video and/or audio data file may contain multiple timestamps. Intervals between subsequent timestamps may form the video or audio segments.
- the apparatus 100 may be configured to cause at least one of the video or audio data of the video 202 selected by the user to be displayed on the playlist user interface 200 with a timeline 206.
- a timeline may refer to an area of a video displaying or editing application illustrating a chronological order of frames of the video.
- the timeline may be also referred to as a timeline bar.
- the apparatus 100 may be further configured to obtain information on two or more timestamps 208 for the timeline 206 based on at least one of user input or keyframes detected by a keyframe classifier based on the video. For example, the user may be able to select a specific point on the timeline via the user interface displaying the playlist interface. The apparatus 100 may be then configured to receive an indication of the user selection, and store the timestamp comprising information on a time and/or frame associated with the selected point on the timeline 206.
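Mapping the user's selection of a point on the timeline 206 to a stored timestamp (time and/or frame) could be sketched as below; the pixel-based click model and the frame rate are assumptions for illustration:

```python
def timeline_click_to_timestamp(click_x, timeline_width, duration_s, fps=25.0):
    """Map a click position on the timeline bar to a time and frame index.

    `click_x` and `timeline_width` in pixels, `duration_s` in seconds;
    the click fraction is clamped to the valid [0, 1] range.
    """
    fraction = max(0.0, min(1.0, click_x / timeline_width))
    t = fraction * duration_s
    return t, int(t * fps)

print(timeline_click_to_timestamp(300, 600, 120.0))  # (60.0, 1500)
```

The stored pair (time, frame index) is then the timestamp 208 associated with the selected point.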
- the apparatus 100 may be configured to cause the one or more segments (108) to be saved to the playlist based on the received user input for selecting the at least one user interface element 110.
- the apparatus 100 may be configured to cause the whole video or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element 112.
- the user may further indicate a specific playlist for which the one or more segments and/or whole video/audio is to be saved. For example, the user may select one or more of the items to be saved to a favorite playlist. Hence, the user may be able to create customized playlists based on user inputs.
- the user may be able to choose, for example, to play only a whole video or only one or more segments of the video.
- the apparatus 100 may be configured to display a UI element 300 selectable by the user for playing a segment 204.
- the apparatus 100 may be further configured to display UI element selectable by the user for playing the whole video.
- the apparatus 100 may be further configured to display one or more other UI elements allowing the user to manage how the playlist is played.
- a UI element may be configured for allowing the user to at least one of play a segment or whole video/audio data selected by the user from the playlist; play a next segment or whole video/audio data from the playlist; play a previous segment or whole video/audio data from the playlist; play a random item saved on the playlist; play the playlist by order of the saved items; add a saved segment from the playlist to another playlist; or modify a playing order of the items saved on the playlist.
- FIG. 4 illustrates an example of a playlist interface 200 configured to display saved segments 204 and other playlist items in a hierarchical form according to an embodiment of the disclosure.
- the user may be able to set a tag on a video such that the whole video or the segments of the video may be played from the list of saved items according to user selection.
- the whole video 202 may be saved on a higher level, while the selected segments 204 may be stored under a sub-level.
- the playlist user interface may be configured by the apparatus 100 to display, for example, play buttons as UI elements for playing a segment of a video 300 and for playing the whole video 400.
- the playlist user interface 200 may be configured to be displayed automatically, or based on user selection, with the flat or hierarchical structure.
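The flat and hierarchical display forms described above could both be derived from a single flat list of saved items. A minimal Python sketch of that grouping follows; the item schema (`video_id`, `kind`) is a hypothetical assumption for illustration, not part of the disclosure:

```python
from collections import defaultdict

def to_hierarchy(items):
    """Group saved playlist items so each whole video sits at a higher
    level with its saved segments nested under it (hypothetical schema:
    each item is a dict with 'video_id' and 'kind' in {'video', 'segment'})."""
    tree = defaultdict(lambda: {"video": None, "segments": []})
    for item in items:
        node = tree[item["video_id"]]
        if item["kind"] == "video":
            node["video"] = item          # higher-level entry
        else:
            node["segments"].append(item)  # sub-level entries
    return dict(tree)
```

Rendering the flat form would simply iterate the original list; the hierarchical form iterates `tree` and indents each video's `segments`.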
- the user may be allowed to select one or more items from the playlist to be played, for example, based on the one or more tags.
- the apparatus 100 may be configured to display one or more UI elements for one or more playing functions, wherein the UI elements are associated with the tags.
- a UI element for playing next or previous segment may be configured to cause playing the next or previous segment on the playlist associated to a same video as a current segment.
- a UI element for playing next or previous segment may be configured to cause playing a next or previous segment on the playlist comprising a same data type as the current segment.
- the data type may refer to, for example, data comprising at least one of video or audio data.
- a UI element for playing next or previous segment may be configured to cause playing next or previous segment on the playlist associated to a different video than the current segment.
- the user is able to manage playing the playlist in various ways based on the tags and UI elements generated by the apparatus 100.
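The tag-based next/previous behavior described above could be sketched as a filtered scan over the playlist. The item fields (`video_id`, `data_type`) are assumptions for illustration:

```python
def next_segment(playlist, current, same_video=True, same_type=True):
    """Return the next item after `current` matching the chosen criteria:
    same (or different) source video, and optionally the same data type.
    Wraps around the end of the playlist; returns None if nothing matches."""
    start = playlist.index(current) + 1
    for item in playlist[start:] + playlist[:start]:  # wrap-around scan
        if same_video and item["video_id"] != current["video_id"]:
            continue
        if not same_video and item["video_id"] == current["video_id"]:
            continue
        if same_type and item["data_type"] != current["data_type"]:
            continue
        if item is not current:
            return item
    return None
```

A "previous" variant would scan the list in reverse with the same filters; each UI element of L175, L176 and L178 would map to one parameter combination.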
- FIG. 5 illustrates an example of a playlist interface 200 with a search and naming function according to an embodiment of the disclosure.
- the apparatus 100 may be configured to provide means for managing naming of the saved items on the playlist.
- the apparatus 100 may be further configured to provide means for searching the saved items from one or more playlists based on the names of the saved items.
- the means may comprise one or more UI elements generated by the apparatus 100, such as a text field for the naming function 500 and a search tool 502 for the searching function.
- the search tool 502 may comprise, for example, a search box configured to allow the user to enter a query and submit it to search the index with the intention of getting back the most relevant results.
- the search tool 502 may further comprise a search button for initiating the search function after submitting text to the search box.
- the search box may comprise, for example, a single-line text box as a search field accompanied by the search button.
- the segments and other items may be saved by the apparatus 100 with a unique identifier to name the saved item automatically.
- the segment naming and search function may be integrated into the playlist user interface providing the timestamp editing tool.
- the name can be further specified.
- a user can set a new name for the segment based on text input provided via the text field for naming function or just keep its original name generated by the apparatus 100.
- the user may be also able to modify the unique identifier any time after saving from the UI element for naming the segments.
- the apparatus 100 may be configured to save the set unique identifier of a segment together with an identifier of the video from which the segment is selected.
- the apparatus 100 may be configured to cause the playlist user interface 200 to display the unique identifier together with the identifier of the video as a name of the segment on the playlist.
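The naming and search functions described above could be sketched as follows. The storage schema, the use of a truncated UUID as the automatic identifier, and substring matching as the search strategy are illustrative assumptions:

```python
import uuid

def save_segment(playlist, video_id, start, end, name=None):
    """Save a segment with an automatically generated unique identifier,
    optionally overridden by a user-supplied name; the display name joins
    the video identifier and the segment identifier."""
    uid = name or uuid.uuid4().hex[:8]          # automatic unique id
    item = {"uid": uid, "video_id": video_id, "start": start, "end": end,
            "display_name": f"{video_id}/{uid}"}
    playlist.append(item)
    return item

def search(playlist, query):
    """Case-insensitive substring search over segment and video identifiers."""
    q = query.lower()
    return [i for i in playlist
            if q in i["uid"].lower() or q in i["video_id"].lower()]
```

Renaming after saving (L191) would amount to updating `uid` and `display_name` on an existing item.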
- the user may be allowed to select to play saved items from the playlist based on the unique identifier or a tag of the saved item.
- FIG. 6 illustrates an example of a flowchart for learning-based keyframe detection, according to an embodiment of the disclosure.
- the apparatus 100 may be configured to suggest timestamps for the user for saving a segment. The suggestions may be provided before the user has selected any timestamps to help the user to decide one or more segments to be saved. Alternatively, the apparatus 100 may be configured to provide the suggestions after the user has selected one or more timestamps to help the user find more accurate positions for the timestamps.
- the video may comprise a transition from one scene to another and the user wants to save a segment comprising one of the scenes. It may be hard for the user to be able to select the exact time point where the scenes change.
- the apparatus 100 may analyze the video and/or audio data of the video with machine learning. Based on the analysis, the apparatus 100 may detect an improved position for the timestamp(s). The improved position may be determined, for example, based on features of frames of the video/audio indicative of when the transition happens.
- the apparatus 100 may comprise a learning-based algorithm designed for auto-detection and correction of timestamps.
- the algorithm may enable setting timestamps on a timeline of a video automatically by the apparatus 100.
- the algorithm may enable a user wanting to manually set the timestamps to select more precise segments based on suggestions provided by the apparatus 100.
- the algorithm may comprise, for example, a deep neural network.
- At least one of video input 600 or audio input 602 may be provided for the algorithm.
- the apparatus 100 may be configured to receive an indication from the user of a selected video file or one or more segments of the video to be inputted for keyframe detection. The indication of one or more segments may be based, for example, on timestamps selected by the user from the timeline of the video.
- a keyframe may refer to a frame that defines a starting or an ending point of a smooth transition from one image to another.
- a keyframe may be a frame that shows strong non-relation to previous frames.
- a keyframe may refer to a frame used to indicate a beginning or end of a change made to a parameter. For example, a keyframe could be set to indicate the point at which audio will have faded up or down to a certain level.
- a frame may refer to a position in time on a digital video editing timeline.
- a sequence of keyframes may define which movement a viewer of a video will see, whereas the position of the keyframes on the video may define the timing of the movement.
- the apparatus 100 may be configured to extract self-correlation based on the input data.
- the self-correlation may be extracted by using a current frame 604 and at least one neighboring frame 606 from the video or the segment(s).
- the apparatus 100 may be configured to obtain keyframe labels by calculating the self-correlation for each inputted frame based on the current and neighboring frames.
- the apparatus 100 may be configured to build a generative label for a keyframe classifier 610 based on the keyframe labels.
- the self-correlation for an input signal X may be defined as:
- R_XX(t_1, t_2) = E[ X_{t_1} · X̄_{t_2} ]
- where E[·] is the expected value operator and the bar represents complex conjugation.
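An empirical estimate of this quantity for a pair of frames can be sketched in a few lines. This is a minimal illustration of the definition, not the disclosed feature pipeline (which would normalize and aggregate over many frame pairs):

```python
import numpy as np

def self_correlation(current, neighbor):
    """Empirical self-correlation E[X_t1 * conj(X_t2)] between two frames,
    estimated as the mean elementwise product of the flattened signals."""
    x1 = np.asarray(current, dtype=np.complex128).ravel()
    x2 = np.asarray(neighbor, dtype=np.complex128).ravel()
    return (x1 * np.conj(x2)).mean()
```

For real-valued pixel data the conjugate is a no-op, but keeping it makes the sketch match the definition for complex signals as well.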
- the apparatus 100 may analyze multiple features from the video and/or audio data.
- the apparatus 100 may be configured to calculate a multiscale contrast based on the current and neighboring frames 604, 606.
- a multiscale contrast may be also referred to as a channel contrast.
- the channel contrast for frame F at pixel p may be calculated as:
- C_F(p) = Σ_c Σ_{p'∈N} ‖ F_c(p) − F_c(p') ‖², where N denotes the n by n neighborhood around pixel p, n is a hyperparameter which is set experimentally, and c indicates different channels of the frame.
- the contrast value may be normalized into the range of [0, 1]. Meanwhile, a Gaussian pyramid may be built so that the multi-layer features are concatenated, forming the final channel contrast feature. Contrast is a distinctive visual attribute in the frames, which can be used to detect a change between said frames.
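A single-scale version of the channel contrast can be sketched as below; the Gaussian-pyramid/multiscale concatenation step is omitted, and edge padding is an assumption for handling border pixels:

```python
import numpy as np

def channel_contrast(frame, n=3):
    """Per-pixel contrast: sum of squared differences between each pixel
    and its n-by-n neighborhood, summed over channels, normalized to [0, 1]
    (single-scale sketch of the channel contrast feature)."""
    f = np.asarray(frame, dtype=float)               # shape (H, W, C)
    h, w, _ = f.shape
    r = n // 2
    pad = np.pad(f, ((r, r), (r, r), (0, 0)), mode="edge")
    contrast = np.zeros((h, w))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = pad[r + dy:r + dy + h, r + dx:r + dx + w]
            contrast += ((f - shifted) ** 2).sum(axis=-1)
    rng = contrast.max() - contrast.min()
    return (contrast - contrast.min()) / rng if rng > 0 else contrast
```

The multiscale variant would run this at each pyramid level and concatenate the per-level maps into one feature.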
- relative motion intensity may be calculated based on the current and neighboring frames 604, 606 for the video data.
- the relative motion intensity I may be defined as I = √(M_x² + M_y²), where M_x and M_y are the motion vectors for the directions of x and y.
- the motion vectors may be calculated by using the Lucas Kanade Algorithm.
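Given motion-vector fields from an optical-flow step (e.g. a Lucas-Kanade tracker, not implemented here), the intensity reduces to a magnitude computation. Averaging to a single frame-level value is an illustrative choice:

```python
import numpy as np

def motion_intensity(mx, my):
    """Frame-level relative motion intensity: mean of per-pixel
    I = sqrt(Mx^2 + My^2) over the motion-vector fields."""
    mx = np.asarray(mx, dtype=float)
    my = np.asarray(my, dtype=float)
    return np.sqrt(mx ** 2 + my ** 2).mean()
```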
- the apparatus 100 may be configured to calculate the relative motion consistency based on the current and neighboring frames 604, 606 for the video data.
- the motion consistency O may be defined as:
- the apparatus 100 may be configured to calculate the relative feature intensity based on the current and neighboring frames 604, 606.
- the feature intensity may be defined according to the Mel-frequency cepstral coefficients (MFCC) feature.
- the apparatus 100 may be configured to calculate relative feature consistency for the audio data based on the current and neighboring frames 604, 606.
- the relative feature consistency may comprise the phase of the signal’s discrete cosine transform calculated by the apparatus 100.
- the apparatus 100 may be configured to perform attention fusion based on the multiscale contrast 612, relative motion intensity 614, relative motion consistency 616, and/or the relative feature intensity 618 and relative feature consistency 620 to output a fused feature to the keyframe classifier 610.
- the apparatus 100 may comprise an attention fusion module with a transformer-like structure.
- the extracted features, contrast for video data, intensity and consistency for both video and audio data may be concatenated as tokens for transformer inference.
- Transformer may refer to an attention-based deep learning framework which can be used for dealing with multimodal data and extracting vision, audio and/or text features.
- a final output may comprise the prediction for the current frame classifying if it is a keyframe or not.
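The token concatenation and attention fusion described above can be sketched with single-head scaled dot-product attention followed by a logistic keyframe score. The weight matrices here are untrained placeholders and the pooling/scoring details are assumptions, not the disclosed transformer architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fusion(tokens, w_q, w_k, w_v, w_out):
    """Fuse per-feature tokens (contrast, motion intensity/consistency,
    audio intensity/consistency) with scaled dot-product attention, then
    score the current frame as keyframe vs. non-keyframe."""
    t = np.asarray(tokens, dtype=float)        # (num_tokens, dim)
    q, k, v = t @ w_q, t @ w_k, t @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    fused = (attn @ v).mean(axis=0)            # pool tokens to one vector
    logit = float(fused @ w_out)
    return 1.0 / (1.0 + np.exp(-logit))        # keyframe probability
```

In the disclosed two-stage training, the self-correlation labels would supervise the first stage and the true keyframe labels the fine-tuning stage.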
- Training of the algorithm may comprise two stages. In the first stage, correlations according to the self-correlation labels may be predicted with the algorithm. The prediction may cover all the self-correlation labels, nearby correlation and masked correlation. The second stage may be performed with true keyframe labels, where the algorithm may be fine-tuned with only a few parameters updated.
- the keyframe classifier 610 may be configured to extract a keyframe at 624 or a non-keyframe to provide a tag for the current frame.
- the apparatus 100 may be then configured to display at least some of the suggested timestamps to the user, wherein the suggested timestamps correspond to the extracted keyframes. The user may then select final timestamps from the suggestions for saving a segment.
- the apparatus 100 may be configured to determine if the input provided for the keyframe detection is associated with one or more timestamps manually pre-selected by the user before the input. When pre-selected timestamps are detected, the apparatus 100 may be configured to compare the timestamps to the extracted keyframes. Based on the comparison, the apparatus 100 may be configured to display to the user on the timeline at least timestamps of the extracted keyframes locating nearest to the pre-selected timestamps. The apparatus 100 may be also configured to display to the user the extracted keyframes locating within the segment(s). Hence, the user may be able to correct the timestamps to more precise positions based on the suggestions provided by the apparatus 100.
- the user may be also able to indicate one or more ranges within segment(s) of a video selected by the user for input.
- the apparatus 100 may then perform the keyframe detection within the ranges.
- keyframe detection for the whole video data may be applied.
- the apparatus 100 may be then configured to return all detected keyframes as the timestamp suggestions to be displayed for the user.
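The suggestion logic above reduces to a nearest-neighbor match between pre-selected timestamps and detected keyframes, falling back to all keyframes when nothing was pre-selected. A minimal sketch:

```python
def suggest_timestamps(keyframes, preselected=None):
    """If the user pre-selected timestamps, snap each one to the nearest
    detected keyframe; otherwise return all detected keyframes as the
    timestamp suggestions."""
    if not preselected:
        return sorted(keyframes)
    return [min(keyframes, key=lambda k: abs(k - t)) for t in preselected]
```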
- FIG. 7 illustrates an example of a method 700 for playlist management according to an embodiment of the disclosure.
- the method may be performed, for example, by the apparatus 100.
- the method may comprise providing a playlist user interface to be displayed on a user equipment, the playlist user interface comprising a user interface element allowing a user to select a video for editing with timestamps.
Abstract
Various embodiments relate to playlist management. A method (700) for playlist management comprises providing (702) a playlist user interface (200) to be displayed on a user equipment, the playlist user interface (200) comprising a user interface element allowing a user to select a video (202) for editing with timestamps; causing (704) at least one of video or audio data of the video (202) selected by the user to be displayed on the playlist user interface with a timeline (206); obtaining (706) information on two or more timestamps (208) for the timeline (206) based on at least one of user input or keyframes detected by a keyframe classifier based on the video; determining (708) one or more segments (204) of the video or audio data based on the two or more timestamps (208), wherein the segment (204) comprises an interval between two consecutive timestamps (208); causing (710) the playlist user interface (200) to display at least one user interface element (210) selectable by the user for saving at least one of the one or more segments (204) to a playlist; and causing (712) the one or more segments (204) to be saved to the playlist based on received user input for selecting the at least one user interface element (210). Devices and a method are disclosed.
Description
AN APPARATUS AND A METHOD FOR MANAGING PLAYLIST
TECHNICAL FIELD
The present disclosure generally relates to the field of information technology. Some embodiments of the disclosure relate to managing playlists with timestamps.
BACKGROUND
A playlist may be an electronic file comprising a list of video and/or audio data files to be played on an application or a device. However, ways to modify a playlist may be limited. It would be beneficial to provide improved means for managing a playlist and its content.
SUMMARY
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
It is an objective of the present disclosure to provide an apparatus and a method for managing one or more playlists. In an example embodiment, the playlists may be managed based on timestamps selected by a user or indicated by a keyframe classifier on a timeline of at least one of video or audio data.
The foregoing and other objectives may be achieved by the features of the independent claims. Further implementation forms are apparent from the dependent claims, the description, and the drawings.
According to a first aspect, a method for managing playlists is provided. The method may be computer-implemented. The method may comprise providing a playlist user interface to be displayed on a user equipment, the playlist user interface comprising a user interface element allowing a user to select a video for editing with timestamps; causing at least one of video or audio data of the video selected by the user to be displayed on the playlist user interface with a
timeline; obtaining information on two or more timestamps for the timeline based on at least one of user input or keyframes detected by a keyframe classifier based on the video; determining one or more segments of the video or audio data based on the two or more timestamps, wherein the segment comprises an interval between two consecutive timestamps; causing the playlist user interface to display at least one user interface element selectable by the user for saving at least one of the one or more segments to a playlist; and causing the one or more segments to be saved to the playlist based on received user input for selecting the at least one user interface element. This may enable managing a playlist with timestamps such that a user is able to save segments of media files to a playlist and enable further playlist functions to play the video/audio data according to their saved timestamps/segments.
According to an implementation form of the first aspect, the method may further comprise causing the playlist user interface to display a user interface element selectable by the user for saving at least one of the whole video or audio data to the playlist; and causing the whole video or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element. Hence, a user is provided with an option to save one or more selected segments of a video or audio data, the whole video or audio data, or both, to the playlist.
According to an implementation form of the first aspect, the method may comprise receiving, from the user, an indication of selection of the whole video or the one or more segments to be inputted for keyframe detection; calculating a self-correlation between a current frame and neighboring frames for each frame of the selected video or the one or more segments to obtain keyframe labels and build a generative label for a keyframe classifier; calculating at least one of multi-scale contrast feature, relative motion intensity feature and relative motion consistency feature for video data or relative feature intensity feature and relative feature consistency feature for audio data of the current and neighboring frames; performing attention fusion for the calculated features to output a fused feature to the keyframe classifier; extracting keyframes by the keyframe classifier based on the input keyframe labels and fused features; displaying on the timeline suggested timestamps to be selected by the user for the one or more segments, wherein the suggested timestamps correspond to the extracted keyframes. This may enable providing suggestions for the user where to place the timestamps, or to automatically set timestamps for a video/audio file inputted by the user.
According to an implementation form of the first aspect, the attention fusion is performed with a transformer-based attention model. This enables improving performance of the attention model.
According to an implementation form of the first aspect, the method may further comprise: determining if the input is associated with one or more timestamps pre-selected by the user on the timeline; comparing the timestamps of the extracted keyframes to the one or more preselected timestamps; and displaying on the timeline at least timestamps of the extracted keyframes locating nearest to the one or more pre-selected timestamps as the suggested timestamps. This enables aiding the user to position the timestamps for segments more accurately. For example, a statistical feature learned from a large-scale dataset based on the extraction may provide a more accurate result compared with the manually set timestamp.
According to an implementation form of the first aspect, the method may comprise causing the one or more segments to be saved with a unique identifier, wherein the unique identifier is inputted by the user or generated automatically. Hence, the segments may be named by at least one of automatically or by the user to enable the segments to be found and managed easily based on their identifiers.
According to an implementation form of the first aspect, the method may comprise causing the playlist user interface to display a user interface element allowing the user to modify the unique identifier of each segment. Hence, a naming function may be provided such that the user is able to manage names of the segments via the playlist user interface.
According to an implementation form of the first aspect, the unique identifier is saved with an identifier of the video. This enables linking the segments and respective videos such that playing the items may be managed individually as well as by groups.
According to an implementation form of the first aspect, the method may comprise causing the playlist user interface to display the unique identifier together with the identifier of the video as a name of the segment on the playlist. Hence, both the name of the segment and an identifier
of the video of origin may be retrieved to enable ease of use and more efficient management of the playlist for the user.
According to an implementation form of the first aspect, the method may comprise providing a search tool on the playlist user interface allowing the user to search the saved segments based on at least one of the unique identifiers or the identifier of the video. This helps the user in managing and using the playlist.
According to an implementation form of the first aspect, the method may comprise causing at least one of the one or more segments, the whole video or the whole audio data to be saved with one or more tags indicating if a saved item is at least one of a segment, a whole video data, a whole audio data or associated to another item saved on the playlist; causing the playlist user interface to display a user interface element for allowing the user to select between at least two different play modes for the playlist, wherein the play mode is configured to cause playing saved items based on one or more tags; and causing the playlist to be played according to selected play mode. This enables controlling how the playlist is played based on tags for the saved segments and other saved items.
According to an implementation form of the first aspect, the method may comprise providing a user interface element for allowing the user to select how the playlist is displayed on the playlist user interface, wherein the user is allowed to select between a single list form, wherein at least one of the whole video or one or more segments are listed together, or in a hierarchical form, wherein the whole video is listed at a higher level and the segments at a lower level; and causing the playlist to be displayed on the playlist user interface according to the user selection with user interface elements for allowing the user to select to play one or more saved items on the playlist based on a unique identifier or the tag of the saved item. This enables controlling with tags and other identifiers how the playlist and its control elements are displayed.
According to an implementation form of the first aspect, the method may comprise causing one or more user interface elements to be displayed on the playlist user interface allowing the user to at least one of play selected, next or previous segment, whole video or whole audio data on the playlist, play a random item saved on the playlist, play the playlist by order, add a saved segment from the playlist to another playlist or modify a playing order of the playlist. Hence, a
user may be provided with different kinds of selectable control elements associated with the timestamps for playing the playlist.
According to an implementation form of the first aspect, the user interface element for playing next or previous segment causes playing the next or previous segment on the playlist associated to a same video as a current segment. Hence, the user interface element enables playing segments linked with the same original video data.
According to an implementation form of the first aspect, the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist associated to a different video than a current segment. Hence, a user interface element may be configured to allow a user to play segments linked with different original video data.
According to an implementation form of the first aspect, the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist comprising a same data type as a current segment, wherein the data type comprises at least one of video or audio data. Hence, a user interface element may be configured to manage functions of the playlist based on data types of items saved on the playlist.
According to an implementation form of the first aspect, the user inputs are received via at least one of the first user equipment displaying the playlist user interface or a second user equipment communicatively coupled with the first user equipment. Hence, the playlist may be managed by the user via a device displaying the user interface, or via a second device such as a wearable device, a mobile device or an embedded control system of a vehicle.
According to a second aspect, an apparatus for managing playlists is provided. The apparatus may comprise at least one processor; and at least one memory comprising instructions which, when executed by the at least one processor, cause the apparatus at least to: provide a playlist user interface to be displayed on a user equipment, the playlist user interface comprising a user interface element allowing a user to select a video for editing with timestamps; cause at least one of video or audio data of the video selected by the user to be displayed on the playlist user interface with a timeline; obtain information on two or more timestamps for the timeline based
on at least one of user input or keyframes detected by a keyframe classifier based on the video; determine one or more segments of the video or audio data based on the two or more timestamps, wherein the segment comprises an interval between two consecutive timestamps; cause the playlist user interface to display at least one user interface element selectable by the user for saving at least one of the one or more segments to a playlist; and cause the one or more segments to be saved to the playlist based on received user input for selecting the at least one user interface element. This may enable managing a playlist with timestamps such that a user is able to save segments of media files to a playlist and enable further playlist functions to play the video/audio data according to their saved timestamps/segments.
According to an implementation form of the second aspect, the at least one memory further comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the playlist user interface to display a user interface element selectable by the user for saving at least one of the whole video or audio data to the playlist; and cause the whole video or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element. Hence, a user is provided with an option to save one or more selected segments of a video or audio data, the whole video or audio data, or both, to the playlist.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: receive, from the user, an indication of selection of the whole video or the one or more segments to be inputted for keyframe detection; calculate a self-correlation between a current frame and neighboring frames for each frame of the selected video or the one or more segments to obtain keyframe labels and build a generative label for a keyframe classifier; calculate at least one of multi-scale contrast feature, relative motion intensity feature and relative motion consistency feature for video data or relative feature intensity feature and relative feature consistency feature for audio data of the current and neighboring frames; perform attention fusion for the calculated features to output a fused feature to the keyframe classifier; extract keyframes by the keyframe classifier based on the input keyframe labels and fused features; display suggested timestamps to be selected by the user for the one or more segments, wherein the suggested timestamps correspond to the extracted keyframes. This may enable providing suggestions for the user
where to place the timestamps, or to automatically set timestamps for a video/audio file inputted by the user.
According to an implementation form of the second aspect, the attention fusion is performed with a transformer-based attention model. This enables improving performance of the attention model.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: determine if the input is associated with one or more timestamps pre-selected by the user on the timeline; compare the timestamps of the extracted keyframes to the one or more pre-selected timestamps; and display on the timeline at least timestamps of the extracted keyframes locating nearest to the one or more pre-selected timestamps as the suggested timestamps. This enables aiding the user to position the timestamps for segments more accurately. For example, a statistical feature learned from a large-scale dataset based on the extraction may provide a more accurate result compared with the manually set timestamp.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the one or more segments to be saved with a unique identifier, wherein the unique identifier is inputted by the user or generated automatically. Hence, the segments may be named by at least one of automatically or by the user to enable the segments to be found and managed easily based on their identifiers. Hence, a naming function may be provided such that the user is able to manage names of the segments via the playlist user interface.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the playlist user interface to display a user interface element allowing the user to modify the unique identifier of each segment. This enables linking the segments and respective videos such that playing the items may be managed individually as well as by groups.
According to an implementation form of the second aspect, the unique identifier is saved with an identifier of the video. This enables linking the segments and respective videos such that playing the items may be managed individually as well as by groups.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause the playlist user interface to display the unique identifier together with the identifier of the video as a name of the segment on the playlist. Hence, both the name of the segment and an identifier of the video of origin may be retrieved to enable ease of use and more efficient management of the playlist for the user.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: provide a search tool on the playlist user interface allowing the user to search the saved segments based on at least one of the unique identifiers or the identifier of the video. This helps the user in managing and using the playlist.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause at least one of the one or more segments, the whole video or the whole audio data to be saved with one or more tags indicating if a saved item is at least one of a segment, a whole video data, a whole audio data or associated to another item saved on the playlist; cause the playlist user interface to display a user interface element for allowing the user to select between at least two different play modes for the playlist, wherein the play mode is configured to cause playing saved items based on one or more tags; and cause the playlist to be played according to selected play mode. This enables controlling how the playlist is played based on tags for the saved segments and other saved items.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: provide a user interface element for allowing the user to select how the playlist is displayed on the playlist user interface, wherein the user is allowed to select between a single list form, wherein
at least one of the whole video or one or more segments are listed together, or in a hierarchical form, wherein the whole video is listed at a higher level and the segments at a lower level; and cause the playlist to be displayed on the playlist user interface according to the user selection with user interface elements for allowing the user to select to play one or more saved items on the playlist based on a unique identifier or the tag of the saved item. This enables controlling with tags and other identifiers how the playlist and its control elements are displayed.
According to an implementation form of the second aspect, the at least one memory comprises instructions which, when executed by the at least one processor, cause the apparatus to: cause one or more user interface elements to be displayed on the playlist user interface allowing the user to at least one of play selected, next or previous segment, whole video or whole audio data on the playlist, play a random item saved on the playlist, play the playlist by order, add a saved segment from the playlist to another playlist or modify a playing order of the playlist. Hence, a user may be provided with different kinds of selectable control elements associated with the timestamps for playing the playlist.
According to an implementation form of the second aspect, the user interface element for playing next or previous segment causes playing the next or previous segment on the playlist associated to a same video as a current segment. Hence, the user interface element enables to play segments linked with a same original video data.
According to an implementation form of the second aspect, the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist associated to a different video than a current segment. Hence, a user interface element may be configured to allow a user to play segments linked with different original video data.
According to an implementation form of the second aspect, the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist comprising a same data type as a current segment, wherein the data type comprises at least one of video or audio data. Hence, a user interface element may be configured to manage functions of the playlist based on data types of items saved on the playlist.
According to an implementation form of the second aspect, the user inputs are received via at least one of the first user equipment displaying the playlist user interface or a second user equipment communicatively coupled with the apparatus. Hence, the playlist may be managed by the user via a device displaying the user interface, or via a second device such as a wearable device, a mobile device or an embedded control system of a vehicle.
Implementation forms of the present disclosure can thus provide a method and an apparatus for managing playlists, for example, based on timestamps. These and other aspects of the present disclosure will be apparent from the example embodiment(s) described below.
DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are included to provide a further understanding of the example embodiments and constitute a part of this specification, illustrate example embodiments and, together with the description, help to explain the example embodiments. In the drawings:
FIG. 1 illustrates an example of an apparatus configured to perform one or more embodiments of the disclosure;
FIG. 2 illustrates an example of a playlist interface for timestamp labeling of a playlist with a saving function for at least one of video or audio segments according to an embodiment of the disclosure;
FIG. 3 illustrates an example of a playlist interface configured to display saved segments and other playlist items in a single list according to an embodiment of the disclosure;
FIG. 4 illustrates an example of a playlist interface configured to display saved segments and other playlist items in a hierarchical form according to an embodiment of the disclosure;
FIG. 5 illustrates an example of a playlist interface with a search and naming function according to an embodiment of the disclosure;
FIG. 6 illustrates an example of a flowchart for learning-based keyframe detection, according to an embodiment of the disclosure; and
FIG. 7 illustrates an example of a method for playlist management according to an embodiment of the disclosure.
Like references are used to designate like parts in the accompanying drawings.
DETAILED DESCRIPTION
Reference will now be made in detail to example embodiments, examples of which are illustrated in the accompanying drawings. The detailed description provided below in connection with the appended drawings is intended as a description of the present embodiments and is not intended to represent the only forms in which the present examples may be constructed or utilized. The description sets forth the functions of the examples and the sequence of operations for constructing and operating the examples. However, the same or equivalent functions and sequences may be accomplished by different examples.
A playlist may refer to a collection of media content in an electronic form explicitly saved by a user or a group of users. In video platform playlists, whole videos may be listed. The videos may be downloaded on a device configured to play the playlist or streamed without downloading them on the device. The video data may further comprise audio data. A video platform may provide a function that allows users to label their videos with timestamps. Referring to the timestamps, a video may be divided into segments so that the users can choose to start from a preferred part of the video marked with the timestamp and to end the video segment at a second timestamp.
An objective of an example embodiment is to provide a function and an interface for at least one of a video or audio playlist, wherein at least one of saving, editing or labeling segments of the video/audio data is enabled. A user may be able to choose to save only a part of a video or audio, edit timestamps of the video or audio, and search a particular video/audio segment respective to the one or more timestamps. Different playing modes related to the whole video/audio data, or one or more segments of the video/audio data, as well as naming and searching methods for the video/audio segments are provided. Further, a tool enabling easier setting of the timestamps for a video or audio is provided. An auto-detection and auto-correction algorithm is designed for quicker timestamp editing. In an embodiment, an automatic timestamp detection is provided to facilitate timestamp selection. Further, a correction tool for timestamps may be provided to improve position of the timestamp along a timeline of the video and/or audio data based on content of the video and/or audio data.
FIG. 1 illustrates an example of an apparatus 100 configured to practice one or more example embodiments.
The apparatus 100 may comprise at least one processor 102. The at least one processor 102 may comprise, for example, one or more of various processing devices, such as for example a coprocessor, a microprocessor, a controller, a digital signal processor (DSP), a processing circuitry with or without an accompanying DSP, or various other processing devices including integrated circuits such as, for example, an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a microcontroller unit (MCU), a hardware accelerator, a special-purpose computer chip, or the like.
The apparatus 100 may further comprise at least one memory 104. The at least one memory 104 may be configured to store, for example, computer program code or the like, for example operating system software and application software. The at least one memory 104 may comprise one or more volatile memory devices, one or more non-volatile memory devices, and/or a combination thereof. For example, the memory may be embodied as magnetic storage devices (such as hard disk drives, magnetic tapes, etc.), optical magnetic storage devices, or semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.).
The apparatus 100 may further comprise a communication interface 108 configured to enable the apparatus 100 to transmit information to other devices. The communication interface 108 may be further configured to receive information from other devices. The communication interface 108 may be configured to receive at least one of video or audio data. For example, the apparatus 100 may be configured to download and store a copy of the video and/or audio data, for example, to the at least one memory 104. Alternatively, or in addition, the apparatus 100 may be configured to stream the video and/or audio data. Streaming may refer to continuous transmission of audio or video files from a server to a client device. With streaming, the media file being played on the client device may be stored remotely, and may be transmitted a few seconds at a time over the Internet. Hence, the apparatus 100 may be able to play media files also without copying and saving them or provide streaming for other devices. In an embodiment, the apparatus 100 may comprise the remote server configured for streaming, or
the apparatus 100 may be configured to provide content, such as playlist, to be streamed by a remote server.
The apparatus 100 may further comprise other components and/or functions such as for example a user interface comprising at least one input device and/or at least one output device. The input device may take various forms such as a keyboard, a touch screen, or one or more embedded control buttons. The output device may for example comprise a display, a speaker, or the like. The apparatus 100 may be also configured to control an external user interface based on received user inputs. For example, the apparatus 100 may be configured to cause a user interface to display one or more user interface elements, and perform one or more operations based on user interaction with the one or more user interface elements. A user interface element may be a visual representation of an object or feature in the interface. The one or more user interface elements may include at least one of input controls, navigational and informational components, and containers.
When the apparatus 100 is configured to implement some functionality, some component and/or components of the apparatus 100, such as for example the at least one processor 102 and/or the at least one memory 104, may be configured to implement this functionality. Furthermore, when the at least one processor 102 is configured to implement some functionality, this functionality may be implemented using program code 106 comprised, for example, in the at least one memory 104.
The functionality described herein may be performed, at least in part, by one or more computer program product components such as software components. According to an embodiment, the apparatus 100 comprises a processor or processor circuitry, such as for example a microcontroller, configured by the program code 106, when executed, to execute the embodiments of the operations and functionality described herein. Alternatively, or in addition, the functionality described herein can be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), system-on-a-chip systems (SOCs), complex programmable logic devices (CPLDs), graphics processing units (GPUs), or the like.
The apparatus 100 may be configured to perform method(s) described herein or comprise means for performing method(s) described herein. In one example, the means comprises the at least one processor 102, the at least one memory 104 including instructions which, when executed by the at least one processor 102, cause the apparatus 100 to perform the method(s).
The apparatus 100 may comprise, for example, a computing device. The computing device may be, for example, a user device, a client device, a mobile phone, a tablet computer, a laptop, a server, or the like. In an example embodiment, the apparatus 100 may comprise a vehicle, such as a car. Although the apparatus 100 is illustrated as a single device, it is appreciated that, wherever applicable, functions of the apparatus 100 may be distributed to a plurality of devices. In an embodiment, the apparatus 100 may comprise or be coupled to an embedded system such that functions provided by the apparatus 100 may be applied to mobile devices and vehicle systems. For example, once a mobile device is connected to a vehicle system, playlists saved to the mobile device can be shared via the connected system.
FIG. 2 illustrates an example of a playlist interface 200 for timestamp labeling of a playlist with a saving function for at least one of video or audio segments according to an embodiment of the disclosure.
The playlist interface 200 may be displayed, or caused to be displayed, by the apparatus 100 on a user equipment. The playlist interface 200 may be configured to provide an editing tool for setting timestamps on videos. The editing tool may be integrated to a playlist. The user equipment may comprise a device configured to deliver media content to a user. The user equipment may comprise, for example, a mobile phone, a laptop, or an infotainment system of a vehicle, such as a car. In an embodiment, the apparatus 100 may comprise the user equipment.
An infotainment system in a car may refer to hardware and software configured to deliver a combination of information and entertainment content/services, for example, via a touchscreen or a display mounted in the car.
The playlist interface may comprise a user interface element, UI element, configured to allow the user to select a video 202 for editing with timestamps. Timestamps may refer to transcription tags configured to identify an exact point in an electronic file with media content, such as video and/or audio data. Timestamps may be used to label points for different video and/or audio segments. A video and/or audio data file may contain multiple timestamps. Intervals between subsequent timestamps may form the video or audio segments.
The apparatus 100 may be configured to cause at least one of the video or audio data of the video 202 selected by the user to be displayed on the playlist user interface 200 with a timeline 206. A timeline may refer to an area of a video displaying or editing application illustrating a chronological order of frames of the video. The timeline may be also referred to as a timeline bar.
The apparatus 100 may be further configured to obtain information on two or more timestamps 208 for the timeline 206 based on at least one of user input or keyframes detected by a keyframe classifier based on the video. For example, the user may be able to select a specific point on the timeline via the user interface displaying the playlist interface. The apparatus 100 may be then configured to receive an indication of the user selection, and store the timestamp comprising information on a time and/or frame associated with the selected point on the timeline 206.
Based on the two or more timestamps 208, the apparatus 100 may be configured to determine one or more segments 204 of the video or audio data. The apparatus 100 may be configured to cause the playlist user interface 200 to display at least one user interface element 210 selectable by the user for saving at least one of the one or more segments 204 to a playlist. In addition, the apparatus 100 may be configured to cause the playlist user interface 200 to display a user interface element 212 selectable by the user for saving at least one of the whole video or audio data to the playlist.
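The derivation of segments from timestamps described above may be sketched as follows. This is an illustrative Python sketch only; the function name and the assumption that timestamps are expressed in seconds are not part of the disclosure:

```python
def segments_from_timestamps(timestamps, duration):
    """Derive (start, end) segments from timestamp marks on a timeline.

    Intervals between consecutive timestamps form the segments. The
    start and end of the timeline act as implicit boundaries.
    """
    # Keep only timestamps that fall strictly inside the timeline.
    marks = sorted(set(float(t) for t in timestamps if 0 < t < duration))
    bounds = [0.0] + marks + [float(duration)]
    return [(bounds[i], bounds[i + 1]) for i in range(len(bounds) - 1)]
```

For example, two timestamps at 30 s and 90 s on a 120 s video would yield three segments: 0 to 30, 30 to 90, and 90 to 120 seconds.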
The apparatus 100 may be configured to receive the user inputs via the user equipment. For example, the user equipment may comprise a display and one or more virtual or physical buttons selectable by the user. With the virtual or physical buttons, the user may be able to at least navigate and select a UI element configured by the apparatus 100. In an embodiment, the user
equipment may be configured to receive voice commands, and the UI element may be configured to be selectable with a voice command. In an embodiment, at least one of the apparatus 100 or the user equipment may be communicatively coupled with a device configured to provide the user inputs. The device may comprise user equipment, such as ear plugs, a smart watch, a mobile phone, or a steering wheel of a car. For example, the playlist interface 200 may be displayed on a display mounted in the car, and the user may be able to navigate and provide control inputs on the playlist interface 200 via buttons located on the steering wheel of the car. Alternatively, the apparatus 100 may be communicatively coupled with a mobile phone or ear plugs of the user, and the user may be able to press a button on the mobile phone or ear plug to provide user inputs for the apparatus 100. Instead of button presses, a device may be configured to provide user inputs by any suitable means, such as based on a tap on the ear plug. The apparatus 100 may be also configured to receive user inputs from more than one device.
The apparatus 100 may be configured to cause the one or more segments 204 to be saved to the playlist based on the received user input for selecting the at least one user interface element 210. In addition, the apparatus 100 may be configured to cause the whole video or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element 212. The user may further indicate a specific playlist to which the one or more segments and/or the whole video/audio is to be saved. For example, the user may select one or more of the items to be saved to a favorite playlist. Hence, the user may be able to create customized playlists based on user inputs.
When video and/or audio segments are saved based on the timestamps, the user may be able to refer to the timestamps so that more precise segments can be controlled for playing. In an embodiment, the items (i.e., the one or more segments, whole video data or whole audio data) may be saved, or caused to be saved, with one or more tags. The tag may be configured to indicate if the saved item is a segment of video or audio data or the whole video or audio data.
The tag may be also configured to indicate if the saved item is associated to another saved item on the playlist, or on another playlist. For example, a tag of a saved item may indicate that the item is a segment of a specific video data. The tag may further indicate a data type of the item. The tags may be preconfigured and/or configurable by a user.
For example, the apparatus 100 may be configured to cause the playlist interface 200 to display a user interface element for allowing the user to select between at least two different play modes for the playlist. Each play mode may be configured to cause playing saved items on the playlist based on the one or more tags. For example, a first play mode may be configured to cause playing only segments saved on the playlist. For example, a second play mode may be configured to cause playing only the saved items tagged to comprise whole video and/or audio data. For another example, a play mode may be configured to cause the playlist to be played in a mixed mode, wherein both the whole video/audio data and segments are played. Based on the tags, the mixed mode may first cause a whole video to be played and thereafter segment(s) of the whole video, or vice versa. The apparatus 100 may be configured to provide a user- selectable switch on the playlist user interface 200, allowing the user to choose whether to play the whole videos/audio or the segments. Both the play modes and tags may be configurable by the user. In addition, or alternatively, the apparatus 100 may be configured to allow the user to select how the playlist is displayed on the playlist interface based on the tags. The apparatus 100 may be also configured to display the playlist in a preselected form based on the tags. For example, the playlist may be displayed in a flat form, with the saved items forming a single list, or in a hierarchical form, with the saved items forming sub-lists arranged hierarchically.
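The tag-based play-mode selection described above may be sketched as follows. This is an illustrative Python sketch; the tag names ("segment", "whole") and mode names are assumptions for illustration and not taken from the disclosure:

```python
def items_for_play_mode(playlist, mode):
    """Select which saved items to play based on their tags."""
    if mode == "segments_only":   # first play mode: only saved segments
        return [i for i in playlist if "segment" in i["tags"]]
    if mode == "whole_only":      # second play mode: only whole video/audio
        return [i for i in playlist if "whole" in i["tags"]]
    return list(playlist)         # mixed mode: play all saved items

# Example playlist with tagged items (field names are illustrative).
playlist = [
    {"name": "clip-a", "tags": {"segment", "video"}},
    {"name": "full-a", "tags": {"whole", "video"}},
]
```

A user-selectable switch on the playlist user interface could then simply change the `mode` argument passed to such a selection function.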
FIG. 3 illustrates an example of a playlist interface 200 configured to display saved segments 204 and other playlist items in a single list form according to an embodiment of the disclosure. The single list form may have a flat structure, wherein the segments and the whole video and/or audio files are listed together.
In case of the flat structure, the user may be able to choose, for example, to play only a whole video or only one or more segments of the video. The apparatus 100 may be configured to display a UI element 300 selectable by the user for playing a segment 204. The apparatus 100 may be further configured to display a UI element selectable by the user for playing the whole video. The apparatus 100 may be further configured to display one or more other UI elements allowing the user to manage how the playlist is played. For example, a UI element may be configured for allowing the user to at least one of: play a segment or whole video/audio data selected by the user from the playlist; play a next segment or whole video/audio data from the playlist;
play a previous segment or whole video/audio data from the playlist; play a random item saved on the playlist; play the playlist by order of the saved items; add a saved segment from the playlist to another playlist; or modify a playing order of the items saved on the playlist.
FIG. 4 illustrates an example of a playlist interface 200 configured to display saved segments 204 and other playlist items in a hierarchical form according to an embodiment of the disclosure.
The user may be able to set a tag on a video such that the whole video or the segments of the video may be played from the list of saved items according to user selection. In the hierarchical playing, the whole video 202 may be saved on a higher level, while the selected segments 204 may be stored under a sub-level. The playlist user interface may be configured by the apparatus 100 to display, for example, play buttons as UI elements for playing a segment of a video 300 and for playing the whole video 400.
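The hierarchical arrangement, with whole videos on the higher level and their segments on a sub-level, may be sketched as follows. This is an illustrative Python sketch; the field names ("kind", "video_id", "name") are assumptions used only to show the grouping, not part of the disclosure:

```python
from collections import defaultdict

def hierarchical_view(items):
    """Group saved segments under their parent video for hierarchical
    display: the key is the identifier of the whole video (higher level),
    the value lists the names of its saved segments (sub-level)."""
    tree = defaultdict(list)
    for item in items:
        if item["kind"] == "segment":
            tree[item["video_id"]].append(item["name"])
    return dict(tree)

# Example saved items: one whole video and two of its segments.
items = [
    {"kind": "whole", "name": "Video A", "video_id": "vidA"},
    {"kind": "segment", "name": "A part 1", "video_id": "vidA"},
    {"kind": "segment", "name": "A part 2", "video_id": "vidA"},
]
```

The flat, single-list form corresponds to rendering `items` directly without such grouping.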
The playlist user interface 200 may be configured to be displayed automatically, or based on user selection, with the flat or hierarchical structure. The user may be allowed to select one or more items from the playlist to be played, for example, based on the one or more tags. For example, the apparatus 100 may be configured to display one or more UI elements for one or more playing functions, wherein the UI elements are associated with the tags. For example, a UI element for playing next or previous segment may be configured to cause playing the next or previous segment on the playlist associated to a same video as a current segment.
Alternatively, or in addition, a UI element for playing next or previous segment may be configured to cause playing a next or previous segment on the playlist comprising a same data type as the current segment. The data type may refer to, for example, data comprising at least one of video or audio data. Alternatively, or in addition, a UI element for playing next or previous segment may be configured to cause playing next or previous segment on the playlist associated to a different video than the current segment. Hence, the user is able to manage
playing the playlist in various ways based on the tags and UI elements generated by the apparatus 100.
FIG. 5 illustrates an example of a playlist interface 200 with a search and naming function according to an embodiment of the disclosure. The apparatus 100 may be configured to provide means for managing naming of the saved items on the playlist. The apparatus 100 may be further configured to provide means for searching the saved items from one or more playlists based on the names of the saved items. The means may comprise one or more UI elements generated by the apparatus 100, such as a text field for the naming function 500 and a search tool 502 for the searching function. The search tool 502 may comprise, for example, a search box configured to allow the user to enter a query and submit it to search the index with the intention of getting back the most relevant results. The search tool 502 may further comprise a search button for initiating the search function after submitting text to the search box. The search box may comprise, for example, a single-line text box as a search field accompanied by the search button.
The segments and other items may be saved by the apparatus 100 with a unique identifier to name the saved item automatically. The segment naming and search function may be integrated into the playlist user interface providing the timestamp editing tool. When the segment is selected and added to a playlist, the name can be further specified. A user can set a new name for the segment based on text input provided via the text field for the naming function, or just keep its original name generated by the apparatus 100. The user may be also able to modify the unique identifier at any time after saving via the UI element for naming the segments. The apparatus 100 may be configured to save the set unique identifier of a segment together with an identifier of the video from which the segment is selected. The apparatus 100 may be configured to cause the playlist user interface 200 to display the unique identifier together with the identifier of the video as a name of the segment on the playlist. The user may be allowed to select to play saved items from the playlist based on the unique identifier or a tag of the saved item.
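The search function over unique identifiers and video identifiers may be sketched as follows. This is an illustrative Python sketch; the field names ("unique_id", "video_id") are assumptions, not part of the disclosure:

```python
def search_playlist(items, query):
    """Case-insensitive search over saved items, matching either the
    segment's unique identifier or the identifier of its source video."""
    q = query.lower()
    return [i for i in items
            if q in i["unique_id"].lower() or q in i["video_id"].lower()]

# Example saved items with a user-set name and a video identifier.
saved = [
    {"unique_id": "Sunset clip", "video_id": "beach_trip"},
    {"unique_id": "Intro", "video_id": "lecture01"},
]
```

Submitting text via the search box 502 and pressing the search button would then invoke such a function over the saved items of one or more playlists.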
FIG. 6 illustrates an example of a flowchart for learning-based keyframe detection, according to an embodiment of the disclosure.
In addition to allowing a user to select timestamps from the timeline, the apparatus 100 may be configured to suggest timestamps for the user for saving a segment. The suggestions may be provided before the user has selected any timestamps to help the user to decide one or more segments to be saved. Alternatively, the apparatus 100 may be configured to provide the suggestions after the user has selected one or more timestamps to help the user find more accurate positions for the timestamps. For example, the video may comprise a transition from one scene to another and the user wants to save a segment comprising one of the scenes. It may be hard for the user to be able to select the exact time point where the scenes change. The apparatus 100 may analyze the video and/or audio data of the video with machine learning. Based on the analysis, the apparatus 100 may detect an improved position for the timestamp(s). The improved position may be determined, for example, based on features of frames of the video/audio indicative of when the transition happens.
Hence, the apparatus 100 may comprise a learning-based algorithm designed for auto-detection and correction of timestamps. The algorithm may enable setting timestamps on a timeline of a video automatically by the apparatus 100. Alternatively, the algorithm may enable a user wanting to manually set the timestamps to select more precise segments based on suggestions provided by the apparatus 100. The algorithm may comprise, for example, a deep neural network. At least one of video input 600 or audio input 602 may be provided for the algorithm. For example, the apparatus 100 may be configured to receive an indication from the user of a selected video file or one or more segments of the video to be inputted for keyframe detection. The indication of one or more segments may be based, for example, on timestamps selected by the user from the timeline of the video.
A keyframe may refer to a frame that defines a starting or an ending point of a smooth transition from one image to another. A keyframe may be a frame that shows strong non-relation to previous frames. In digital video editing, a keyframe may refer to a frame used to indicate a beginning or end of a change made to a parameter. For example, a keyframe could be set to indicate the point at which audio will have faded up or down to a certain level. A frame may refer to a position in time on a digital video editing timeline. A sequence of keyframes may define which movement a viewer of a video will see, whereas the position of the keyframes on the video may define the timing of the movement.
At 608, the apparatus 100 may be configured to extract self-correlation based on the input data. The self-correlation may be extracted by using a current frame 604 and at least one neighboring frame 606 from the video or the segment(s). The apparatus 100 may be configured to obtain keyframe labels by calculating the self-correlation for each inputted frame based on the current and neighboring frames. The apparatus 100 may be configured to build a generative label for a keyframe classifier 610 based on the keyframe labels.
The self-correlation for an input signal X may be defined as:
R_XX(t1, t2) = E[ X_t1 · X̄_t2 ]
where E[·] is the expected value operator and the bar represents complex conjugation. By calculating the overall self-correlation of a video/audio, the potential keyframes can be labeled. The algorithm may be firstly pre-trained by predicting the potential keyframes, which may be a fully unsupervised process.
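A minimal sketch of this correlation-based pseudo-labeling follows. It is illustrative only: real-valued frames are assumed (so conjugation is trivial), a normalized frame-to-frame correlation stands in for the expectation over the signal, and the threshold value is an assumption:

```python
import numpy as np

def self_correlation(curr, prev):
    """Normalized correlation between two frames, a discrete stand-in
    for R_XX(t1, t2) over real-valued frame data."""
    a = curr.astype(float).ravel()
    b = prev.astype(float).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def label_keyframes(frames, threshold=0.9):
    """Pseudo-label a frame as a keyframe when it correlates weakly with
    its predecessor, i.e. shows strong non-relation to previous frames."""
    labels = [True]  # the first frame starts a new segment by convention
    for prev, curr in zip(frames, frames[1:]):
        labels.append(self_correlation(curr, prev) < threshold)
    return labels
```

Such generated labels could then serve as the unsupervised training targets for the keyframe classifier 610.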
The apparatus 100 may analyze multiple features from the video and/or audio data. At 612, the apparatus 100 may be configured to calculate a multiscale contrast based on the current and neighboring frames 604, 606. A multiscale contrast may be also referred to as a channel contrast. For video data, the channel contrast for frame F at pixel p may be calculated as:
C_c(F, p) = Σ_{q ∈ N(p)} ( F_c(p) − F_c(q) )²
where N(p) denotes the n-by-n neighborhood around pixel p, n is a hyperparameter which is set experimentally, and c indicates the different channels of the frame. The contrast value may be normalized into the range [0, 1]. Meanwhile, a Gaussian pyramid may be built so that the multi-layer features are concatenated, forming the final channel contrast feature. Contrast is a distinctive visual attribute of the frames, which can be used to detect a change between said frames.
At 614, relative motion intensity may be calculated based on the current and neighboring frames 604, 606 for the video data. The relative motion intensity I may be defined as:

$$I(F, p) = \sqrt{M_x(p)^2 + M_y(p)^2}$$

where M_x and M_y are the motion vectors for the x and y directions. The motion vectors may be calculated by using the Lucas-Kanade algorithm.
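Given motion vectors (for instance from a Lucas-Kanade optical-flow routine, which is not reproduced here), the intensity reduces to a per-pixel magnitude; aggregating it by averaging over the frame is an assumption of this sketch.

```python
import math

def motion_intensity(mx, my):
    """Mean magnitude of per-pixel motion vectors.

    mx, my: flat lists of the x and y motion components per pixel.
    """
    mags = [math.hypot(a, b) for a, b in zip(mx, my)]
    return sum(mags) / len(mags)

# One pixel moving by (3, 4), one static pixel:
print(motion_intensity([3, 0], [4, 0]))  # → 2.5
```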
At 616, the apparatus 100 may be configured to calculate the relative motion consistency based on the current and neighboring frames 604, 606 for the video data. The motion consistency O may be defined as:

$$O(F, p) = \arctan\left(\frac{M_y(p)}{M_x(p)}\right)$$
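The per-pixel direction above can be computed with `atan2`, which handles M_x = 0 and preserves the quadrant. Collapsing the directions into a single consistency score via the mean unit-direction vector is an assumption of this sketch; the source only defines the per-pixel quantity.

```python
import math

def motion_direction(mx, my):
    """Per-pixel motion direction O = arctan(My / Mx), via atan2."""
    return [math.atan2(b, a) for a, b in zip(mx, my)]

def motion_consistency(mx, my):
    """Length of the mean unit direction vector:
    1.0 = every pixel moves the same way, 0.0 = fully scattered."""
    dirs = motion_direction(mx, my)
    cx = sum(math.cos(d) for d in dirs) / len(dirs)
    cy = sum(math.sin(d) for d in dirs) / len(dirs)
    return math.hypot(cx, cy)

print(motion_consistency([1, 1], [0, 0]))   # two pixels moving in the same direction
print(motion_consistency([1, -1], [0, 0]))  # two pixels moving in opposite directions
```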
At 618, the apparatus 100 may be configured to calculate the relative feature intensity based on the current and neighboring frames 604, 606. For audio data, the feature intensity may be defined according to the Mel-frequency cepstral coefficients (MFCC) feature. At 620, the apparatus 100 may be configured to calculate relative feature consistency for the audio data based on the current and neighboring frames 604, 606. The relative feature consistency may comprise the phase of the signal’s discrete cosine transform calculated by the apparatus 100.
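MFCC extraction is usually delegated to an audio library, so only the discrete cosine transform step underlying both features is sketched here. Interpreting the "phase" of the real-valued DCT as the sign pattern of its coefficients is an assumption of this example, as are the helper names.

```python
import math

def dct2(signal):
    """Type-II discrete cosine transform (naive O(n^2) reference)."""
    n = len(signal)
    return [sum(x * math.cos(math.pi * k * (2 * i + 1) / (2 * n))
                for i, x in enumerate(signal))
            for k in range(n)]

def feature_consistency(signal):
    """Sign pattern of the DCT coefficients, a stand-in for the
    phase-based audio consistency feature described above."""
    return [1 if c >= 0 else -1 for c in dct2(signal)]

# A constant signal concentrates all energy in the DC coefficient:
print(dct2([1.0, 1.0, 1.0, 1.0]))
```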
At 622, the apparatus 100 may be configured to perform attention fusion based on the multiscale contrast 612, relative motion intensity 614, and relative motion consistency 616, and/or the relative feature intensity 618 and relative feature consistency 620, to output a fused feature to the keyframe classifier 610.
The apparatus 100 may comprise an attention fusion module with a transformer-like structure. For example, the extracted features, that is, contrast for video data and intensity and consistency for both video and audio data, may be concatenated as tokens for transformer inference. A transformer may refer to an attention-based deep learning framework which can be used for processing multimodal data and extracting vision, audio and/or text features. A final output may comprise the prediction for the current frame, classifying whether it is a keyframe or not.
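The token-fusion idea can be sketched with a toy, untrained single-head self-attention over the feature tokens. Treating the raw tokens as queries, keys, and values without learned projection matrices, and mean-pooling the attended tokens into one fused vector, are simplifying assumptions; the transformer-based model in the source would have learned weights.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_fuse(tokens):
    """Toy single-head self-attention over feature tokens (each token a
    list of floats), followed by mean-pooling into one fused feature."""
    d = len(tokens[0])
    scale = math.sqrt(d)
    outputs = []
    for q in tokens:
        # scaled dot-product attention weights of query q over all keys
        scores = softmax([sum(a * b for a, b in zip(q, k)) / scale
                          for k in tokens])
        outputs.append([sum(w * v[i] for w, v in zip(scores, tokens))
                        for i in range(d)])
    # mean-pool attended tokens into the fused feature vector
    return [sum(o[i] for o in outputs) / len(outputs) for i in range(d)]

# Two identical feature tokens fuse back to themselves:
print(attention_fuse([[1.0, 0.0], [1.0, 0.0]]))
```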
Training of the algorithm may comprise two stages. In the first stage, correlations according to the self-correlation labels may be predicted with the algorithm. The prediction may comprise predicting all the self-correlation labels, nearby correlations and masked correlations. The second stage may be performed with true keyframe labels. The algorithm may be fine-tuned with only a few parameters updated.
After the resulting fused feature from the attention fusion is fed to the keyframe classifier at 610, the keyframe classifier 610 may be configured to tag the current frame as a keyframe at 624 or as a non-keyframe. The apparatus 100 may then be configured to display at least some of the suggested timestamps to the user, wherein the suggested timestamps correspond to the extracted keyframes. The user may then select final timestamps from the suggestions for saving a segment.
The apparatus 100 may be configured to determine if the input provided for the keyframe detection is associated with one or more timestamps manually pre-selected by the user before the input. When pre-selected timestamps are detected, the apparatus 100 may be configured to compare the timestamps to the extracted keyframes. Based on the comparison, the apparatus 100 may be configured to display to the user on the timeline at least the timestamps of the extracted keyframes located nearest to the pre-selected timestamps. The apparatus 100 may also be configured to display to the user the extracted keyframes located within the segment(s). Hence, the user may be able to correct the timestamps to more precise positions based on the suggestions provided by the apparatus 100.
The user may also be able to indicate one or more ranges within the segment(s) of a video selected by the user for input. The apparatus 100 may then perform the keyframe detection within those ranges.
If the input is without pre-selected timestamps, keyframe detection may be applied to the whole video data. The apparatus 100 may then be configured to return all detected keyframes as timestamp suggestions to be displayed to the user.
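The suggestion logic above can be sketched as a nearest-neighbor match between detected keyframe timestamps and the user's pre-selected timestamps; the function name, the single-nearest default, and the plain-float timestamp representation are assumptions of this example.

```python
def suggest_timestamps(keyframes, preselected, k=1):
    """For each pre-selected timestamp, suggest the k nearest detected
    keyframe timestamps; with no pre-selection, return all keyframes."""
    if not preselected:
        return sorted(keyframes)
    suggestions = set()
    for t in preselected:
        for kf in sorted(keyframes, key=lambda x: abs(x - t))[:k]:
            suggestions.add(kf)
    return sorted(suggestions)

# The user clicked near 4.7 s; the detector found keyframes at 1, 5, 9 s:
print(suggest_timestamps([1.0, 5.0, 9.0], [4.7]))  # → [5.0]
```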
FIG. 7 illustrates an example of a method 700 for playlist management according to an embodiment of the disclosure. The method may be performed, for example, by the apparatus 100.
At 702, the method may comprise providing a playlist user interface to be displayed on a user equipment, the playlist user interface comprising a user interface element allowing a user to select a video for editing with timestamps.
At 704, the method may comprise causing at least one of video or audio data of the video selected by the user to be displayed on the playlist user interface with a timeline.
At 706, the method may comprise obtaining information on two or more timestamps for the timeline based on at least one of user input or keyframes detected by a keyframe classifier based on the video.
At 708, the method may comprise determining one or more segments of the video or audio data based on the two or more timestamps, wherein the segment comprises an interval between two consecutive timestamps.
At 710, the method may comprise causing the playlist user interface to display at least one user interface element selectable by the user for saving at least one of the one or more segments to a playlist.
At 712, the method may comprise causing the one or more segments to be saved to the playlist based on received user input for selecting the at least one user interface element.
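Steps 708 and 712, determining segments as intervals between consecutive timestamps and saving them to a playlist, can be sketched as follows. The playlist record layout and the identifier scheme (video id plus a per-segment suffix) are illustrative assumptions, not the claimed data model.

```python
def segments_from_timestamps(timestamps):
    """Each segment is the interval between two consecutive timestamps."""
    ts = sorted(timestamps)
    return list(zip(ts, ts[1:]))

def save_segments(playlist, video_id, segments, name=None):
    """Append segments to the playlist with an identifier combining the
    video id (or a user-supplied name) and a per-segment suffix."""
    for i, (start, end) in enumerate(segments, 1):
        playlist.append({
            "id": f"{name or video_id}-{i}",
            "video": video_id,
            "start": start,
            "end": end,
        })
    return playlist

# Three timestamps yield two segments, both saved to an empty playlist:
print(save_segments([], "v1", segments_from_timestamps([0, 10, 25])))
```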
Further features of the methods directly result from the functionalities and parameters of the methods and devices, for example the apparatus 100, as described in the appended claims and throughout the specification and are therefore not repeated here.
A device or a system may be configured to perform or cause performance of any aspect of the method(s) described herein. Further, a computer program may comprise program code configured to cause performance of an aspect of the method(s) described herein, when the computer program is executed on a computer. Further, the computer program product may comprise a computer readable storage medium storing program code thereon, the program code comprising instructions for performing any aspect of the method(s) described herein. Further, a device may comprise means for performing any aspect of the method(s) described herein.
According to an example embodiment, the means comprises at least one processor and at least one memory including program code, the at least one memory and the program code configured to, with the at least one processor, cause performance of any aspect of the method(s).
Any range or device value given herein may be extended or altered without losing the effect sought. Also, any embodiment may be combined with another embodiment unless explicitly disallowed.
Although the subject matter has been described in language specific to structural features and/or acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as examples of implementing the claims and other equivalent features and acts are intended to be within the scope of the claims.
It will be understood that the benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. It will further be understood that reference to 'an' item may refer to one or more of those items. Furthermore, references to ‘at least one’ item or ‘one or more’ items may refer to one or a plurality of those items.
The operations of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be deleted from any of the methods without departing from the scope of the subject matter described herein. Aspects of any of the embodiments described above may be combined with aspects of any of the other embodiments described to form further embodiments without losing the effect sought.
The term 'comprising' is used herein to mean including the method, blocks, or elements identified, but that such blocks or elements do not comprise an exclusive list and a method or device may contain additional blocks or elements.
It will be understood that the above description is given by way of example only and that various modifications may be made by those skilled in the art. The above specification, examples and data provide a complete description of the structure and use of exemplary embodiments. Although various embodiments have been described above with a certain degree of particularity, or with reference to one or more individual embodiments, those skilled in the art could make numerous alterations to the disclosed embodiments without departing from scope of this specification.
Claims
1. A computer-implemented method (700) for managing playlists, the method (700) comprising: providing (702) a playlist user interface (200) to be displayed on a user equipment, the playlist user interface (200) comprising a user interface element allowing a user to select a video (202) for editing with timestamps; causing (704) at least one of video or audio data of the video (202) selected by the user to be displayed on the playlist user interface (200) with a timeline (206); obtaining (706) information on two or more timestamps (208) for the timeline (206) based on at least one of user input or keyframes detected by a keyframe classifier based on the video (202); determining (708) one or more segments (204) of the video or audio data based on the two or more timestamps (208), wherein the segment (204) comprises an interval between two consecutive timestamps (208); causing (710) the playlist user interface (200) to display at least one user interface element (210) selectable by the user for saving at least one of the one or more segments (204) to a playlist; and causing (712) the one or more segments (204) to be saved to the playlist based on received user input for selecting the at least one user interface element (210).
2. The computer-implemented method (700) of claim 1, further comprising: causing the playlist user interface (200) to display a user interface element (212) selectable by the user for saving at least one of the whole video or audio data to the playlist; and causing the whole video or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element (212).
3. The computer-implemented method (700) of any preceding claim, further comprising: receiving (600, 602), from the user, an indication of selection of the whole video (202) or the one or more segments (204) to be inputted for keyframe detection; calculating (608) a self-correlation between a current frame (604) and neighboring frames (606) for each frame of the selected video or the one or more segments to obtain keyframe labels and build a generative label for a keyframe classifier (610);
calculating (612, 614, 616, 618, 620) at least one of multi-scale contrast feature, relative motion intensity feature and relative motion consistency feature for video data or relative feature intensity feature and relative feature consistency feature for audio data of the current (604) and neighboring frames (606); performing (622) attention fusion for the calculated features to output a fused feature to the keyframe classifier (610); extracting (624) keyframes by the keyframe classifier (610) based on the input keyframe labels and fused features; displaying on the timeline (206) suggested timestamps to be selected by the user for the one or more segments, wherein the suggested timestamps correspond to the extracted keyframes.
4. The computer-implemented method (700) of claim 3, wherein the attention fusion is performed with a transformer-based attention model.
5. The computer-implemented method (700) of claim 3 or 4, comprising: determining if the input is associated with one or more timestamps (208) pre-selected by the user on the timeline (206); comparing the timestamps of the extracted keyframes to the one or more pre-selected timestamps (208); and displaying on the timeline (206) at least timestamps of the extracted keyframes locating nearest to the one or more pre-selected timestamps (208) as the suggested timestamps.
6. The computer-implemented method (700) of any preceding claim, further comprising: causing the one or more segments (204) to be saved with a unique identifier, wherein the unique identifier is inputted by the user or generated automatically.
7. The computer-implemented method (700) of claim 6, comprising: causing the playlist user interface (200) to display a user interface element (500) allowing the user to modify the unique identifier of each segment.
8. The computer-implemented method (700) of claim 6 or 7, wherein the unique identifier is saved with an identifier of the video (202).
9. The computer-implemented method (700) of claim 8, comprising: causing the playlist user interface (200) to display the unique identifier together with the identifier of the video as a name of the segment on the playlist.
10. The computer-implemented method of any of claims 7 to 9, comprising: providing a search tool (502) on the playlist user interface (200) allowing the user to search the saved segments based on at least one of the unique identifiers or the identifier of the video.
11. The computer-implemented method (700) of any preceding claim, comprising: causing at least one of the one or more segments (204), the whole video (202) or the whole audio data to be saved with one or more tags indicating if a saved item is at least one of the segments, a whole video data, a whole audio data or associated to another item saved on the playlist; causing the playlist user interface (200) to display a user interface element for allowing the user to select between at least two different play modes for the playlist, wherein the play mode is configured to cause playing saved items based on one or more tags; and causing the playlist to be played according to selected play mode.
12. The computer-implemented method (700) of claim 11, further comprising: providing a user interface element for allowing the user to select how the playlist is displayed on the playlist user interface, wherein the user is allowed to select between a single list form, wherein at least one of the whole videos (202) or one or more segments (204) are listed together, or in a hierarchical form, wherein the whole video (202) is listed at a higher level and the segments (204) at a lower level; and causing the playlist to be displayed on the playlist user interface (200) according to the user selection with user interface elements (300, 400) for allowing the user to select to play one or more saved items on the playlist based on a unique identifier or the tag of the saved item.
13. The computer-implemented method (700) of any preceding claim, comprising: causing one or more user interface elements to be displayed on the playlist user interface (200) allowing the user to at least one of play selected, next or previous segment, whole
video or whole audio data on the playlist, play a random item saved on the playlist, play the playlist by order, add a saved segment from the playlist to another playlist or modify a playing order of the playlist.
14. The computer-implemented method (700) of claim 13, wherein the user interface element for playing next or previous segment causes playing the next or previous segment on the playlist associated to a same video as a current segment.
15. The computer-implemented method (700) of claim 13, wherein the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist associated to a different video than a current segment.
16. The computer-implemented method (700) of any of claims 13 to 15, wherein the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist comprising a same data type as a current segment, wherein the data type comprises at least one of video or audio data.
17. The computer-implemented method of any preceding claim, wherein the user inputs are received via at least one of the first user equipment displaying the playlist user interface or a second user equipment communicatively coupled with the first user equipment.
18. An apparatus (100) for managing playlists of media content, comprising: at least one processor (102); and at least one memory (104) comprising instructions which, when executed by the at least one processor (102), cause the apparatus (100) at least to: provide a playlist user interface (200) to be displayed on a user equipment, the playlist user interface (200) comprising a user interface element allowing a user to select a video (202) for editing with timestamps; cause at least one of video or audio data of the video (202) selected by the user to be displayed on the playlist user interface (200) with a timeline (206); obtain information on two or more timestamps (208) for the timeline (206) based on at least one of user input or keyframes detected by a keyframe classifier based on the video;
determine one or more segments (204) of the video or audio data based on the two or more timestamps (208), wherein the segment (204) comprises an interval between two consecutive timestamps (208); cause the playlist user interface (200) to display at least one user interface element (210) selectable by the user for saving at least one of the one or more segments (204) to a playlist; and cause the one or more segments (204) to be saved to the playlist based on received user input for selecting the at least one user interface element (210).
19. The apparatus (100) of claim 18, wherein the at least one memory (104) further comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: cause the playlist user interface (200) to display a user interface element (212) selectable by the user for saving at least one of the whole video (202) or audio data to the playlist; and cause the whole video (202) or audio data to be saved to the playlist based on a received user input for selecting the respective user interface element (212).
20. The apparatus (100) of claim 18 or 19, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: receive (600, 602), from the user, an indication of selection of the whole video (202) or the one or more segments (204) to be inputted for keyframe detection; calculate (608) a self-correlation between a current frame (604) and neighboring frames (606) for each frame of the selected video (202) or the one or more segments (208) to obtain keyframe labels and build a generative label for a keyframe classifier (610); calculate (612, 614, 616, 618, 620) at least one of multi-scale contrast feature, relative motion intensity feature and relative motion consistency feature for video data or relative feature intensity feature and relative feature consistency feature for audio data of the current and neighboring frames; perform (622) attention fusion for the calculated features to output a fused feature to the keyframe classifier (610);
extract (624) keyframes by the keyframe classifier based on the input keyframe labels and fused features; cause display of suggested timestamps to be selected by the user for the one or more segments (208), wherein the suggested timestamps correspond to the extracted keyframes.
21. The apparatus (100) of claim 20, wherein the attention fusion is performed with a transformer-based attention model.
22. The apparatus (100) of claim 20 or 21, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: determine if the input is associated with one or more timestamps (208) pre-selected by the user on the timeline (206); compare the timestamps of the extracted keyframes to the one or more pre-selected timestamps (208); and display on the timeline (206) at least timestamps of the extracted keyframes locating nearest to the one or more pre-selected timestamps (208) as the suggested timestamps.
23. The apparatus (100) of any of claims 18 to 22, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: cause the one or more segments (204) to be saved with a unique identifier, wherein the unique identifier is inputted by the user or generated automatically.
24. The apparatus (100) of claim 23, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: cause the playlist user interface (200) to display a user interface element (500) allowing the user to modify the unique identifier of each segment.
25. The apparatus (100) of claim 23 or 24, wherein the unique identifier is saved with an identifier of the video.
26. The apparatus (100) of claim 25, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: cause the playlist user interface (200) to display the unique identifier together with the identifier of the video as a name of the segment on the playlist.
27. The apparatus (100) of any of claims 24 to 26, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: provide a search tool (502) on the playlist user interface (200) allowing the user to search the saved segments (208) based on at least one of the unique identifiers or the identifier of the video.
28. The apparatus (100) of any of claims 18 to 27, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: cause at least one of the one or more segments (204), the whole video (202) or the whole audio data to be saved with one or more tags indicating if a saved item is at least one of the segments, a whole video data, a whole audio data or associated to another item saved on the playlist; cause the playlist user interface (200) to display a user interface element for allowing the user to select between at least two different play modes for the playlist, wherein the play mode is configured to cause playing saved items based on one or more tags; and cause the playlist to be played according to selected play mode.
29. The apparatus (100) of claim 28, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: provide a user interface element for allowing the user to select how the playlist is displayed on the playlist user interface, wherein the user is allowed to select between a single list form, wherein at least one of the whole video (202) or one or more segments (204) are listed together, or in a hierarchical form, wherein the whole video (202) is listed at a higher level and the segments (204) at a lower level; and
cause the playlist to be displayed on the playlist user interface (200) according to the user selection with user interface elements (300, 400) for allowing the user to select to play one or more saved items on the playlist based on a unique identifier or the tag of the saved item.
30. The apparatus (100) of any of claims 18 to 29, wherein the at least one memory (104) comprises instructions which, when executed by the at least one processor (102), cause the apparatus (100) to: cause one or more user interface elements to be displayed on the playlist user interface (200) allowing the user to at least one of play selected, next or previous segment, whole video or whole audio data on the playlist, play a random item saved on the playlist, play the playlist by order, add a saved segment from the playlist to another playlist or modify a playing order of the playlist.
31. The apparatus (100) of claim 30, wherein the user interface element for playing next or previous segment causes playing the next or previous segment on the playlist associated to a same video as a current segment.
32. The apparatus (100) of claim 30, wherein the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist associated to a different video than a current segment.
33. The apparatus (100) of any of claims 30 to 32, wherein the user interface element for playing next or previous segment causes playing a next or previous segment on the playlist comprising a same data type as a current segment, wherein the data type comprises at least one of video or audio data.
34. The apparatus (100) of any of claims 18 to 33, wherein the user inputs are received via at least one of the first user equipment displaying the playlist user interface or a second user equipment communicatively coupled with the apparatus.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2023/062711 WO2024235420A1 (en) | 2023-05-12 | 2023-05-12 | An apparatus and a method for managing playlist |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024235420A1 (en) | 2024-11-21 |
Family
ID=86424731
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024235420A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020053078A1 (en) * | 2000-01-14 | 2002-05-02 | Alex Holtz | Method, system and computer program product for producing and distributing enhanced media downstreams |
| US20050071881A1 (en) * | 2003-09-30 | 2005-03-31 | Deshpande Sachin G. | Systems and methods for playlist creation and playback |
| US20210004131A1 (en) * | 2019-07-01 | 2021-01-07 | Microsoft Technology Licensing, Llc | Highlights video player |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23724862; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |