WO2025007738A1 - Audio-picture synchronization detection method and apparatus, and device and storage medium - Google Patents
- Publication number
- WO2025007738A1 (application PCT/CN2024/099889)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- video
- audio
- key frame
- audio data
- target
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/4302—Content synchronisation processes, e.g. decoder synchronisation
- H04N21/4307—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
- H04N21/43072—Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen of multiple content streams on the same device
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440281—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
Definitions
- the embodiments of the present disclosure relate to a method, apparatus, device and storage medium for detecting audio and video synchronization.
- the present disclosure provides a method, device, equipment and storage medium for detecting audio and video synchronization, so as to automatically detect audio and video synchronization of a transcoded video, thereby improving the detection efficiency of audio and video synchronization and ensuring the accuracy of audio and video synchronization detection.
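- an embodiment of the present disclosure provides a method for detecting audio and video synchronization, including:
- acquiring a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video;
- extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame;
- extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
- matching the first audio data with the second audio data to determine an audio and video synchronization detection result of the first video.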
- the present disclosure also provides an audio-video synchronization detection device, including:
- a first video acquisition module, used to acquire a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video;
- a key frame alignment module, used to extract a first key frame in the first video and a second key frame in the second video, align the first key frame with the second key frame, and determine a target second key frame aligned with the first key frame;
- an audio data extraction module, used to extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
- an audio and video synchronization detection module, used to match the first audio data with the second audio data to determine the audio and video synchronization detection result of the first video.
- an embodiment of the present disclosure further provides an electronic device, the electronic device comprising:
- one or more processors;
- a storage device for storing one or more programs;
- when the one or more programs are executed by the one or more processors, the one or more processors implement the audio and video synchronization detection method as described in any one of the embodiments of the present disclosure.
- the embodiments of the present disclosure further provide a storage medium comprising computer executable instructions, wherein the computer executable instructions, when executed by a computer processor, are used to execute the audio-visual synchronization detection method as described in any one of the embodiments of the present disclosure.
- FIG. 1 is a flow chart of a method for detecting synchronization of audio and video provided by an embodiment of the present disclosure.
- FIG. 2 is a waveform matching diagram of first audio data and second audio data involved in an embodiment of the present disclosure.
- FIG. 3 is a flow chart of another method for detecting synchronization between audio and video provided by an embodiment of the present disclosure.
- FIG. 4 is an example of audio data extraction according to an embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of the structure of a device for detecting synchronization between audio and video provided by an embodiment of the present disclosure.
- FIG. 6 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.
- Figure 1 is a flow chart of a method for detecting audio and video synchronization provided by an embodiment of the present disclosure.
- the embodiment of the present disclosure is applicable to the case of performing audio and video synchronization detection on a transcoded video.
- the method can be executed by an audio and video synchronization detection device, which can be implemented in the form of software and/or hardware, optionally by an electronic device, which can be a mobile terminal, a PC or a server, etc.
- the audio-video synchronization detection method specifically includes the following steps:
- S110 Acquire a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video.
- the first video refers to the video to be detected after transcoding.
- the second video refers to the video before transcoding.
- the audio and video images in the second video are synchronized, so the second video can be used as a reference video to detect the synchronization of audio and video of the first video.
- users can create and publish videos in the client.
- after receiving the video uploaded by the client, the server needs to transcode the uploaded video, for example by adjusting the video resolution and bit rate to improve the picture quality, and then sends the transcoded video to other clients for playback.
- the video uploaded to the server can be used as the second video with audio and video synchronization, and the transcoded video can be used as the first video to be detected.
- by using the second video before transcoding as a reference video for audio and video synchronization, it is possible to accurately detect whether the first video has audio and video asynchrony caused by video transcoding.
- S120 extract a first key frame in the first video and a second key frame in the second video, align the first key frame and the second key frame, and determine a target second key frame that is aligned with the first key frame.
- a key frame may refer to a video frame that plays a decisive role in the video content, such as a video frame with scene switching, a video frame with motion changes, a video frame where a key action is located, etc.
- a first key frame may refer to a key frame in a first video. The number of first key frames may be one or more.
- a second key frame may refer to a key frame in a second video. The number of second key frames may also be one or more. The number of first key frames may be the same as or different from the number of second key frames.
- Alignment refers to the operation of aligning key frames with the same picture.
- the target second key frame refers to a second key frame with the same picture as the first key frame.
- key frames of the first video and the second video are detected, and all first key frames in the first video and all second key frames in the second video are extracted.
- the first key frames and the second key frames can be aligned by detecting the image similarity between the first key frames and the second key frames. For example, for each first key frame, the image similarity between the first key frame and each second key frame can be determined, and the second key frame with the highest image similarity and greater than or equal to a preset similarity threshold can be used as the target second key frame to be aligned with the first key frame, thereby obtaining a target second key frame having the same picture as the first key frame.
- image similarity can be measured using any indicator that can measure the similarity between two images.
- image similarity is characterized by the structural similarity SSIM (Structural Similarity) indicator.
- the value range of the SSIM indicator is 0 to 1, and the larger the SSIM indicator, the higher the image similarity.
- if no target second key frame aligned with a first key frame is found, that first key frame can be ignored, and only the aligned first key frames are used for subsequent audio and video synchronization detection.
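The best-match alignment described in S120 can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes each key frame is already decoded into a flat list of grayscale pixel values of equal size, it uses a simplified single-window SSIM rather than the usual locally windowed form, and the names `global_ssim`, `best_aligned_index` and the default threshold of 0.9 are hypothetical.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Simplified single-window SSIM over two equal-size grayscale frames."""
    if len(x) != len(y) or not x:
        raise ValueError("frames must be non-empty and of equal size")
    n = len(x)
    c1 = (0.01 * dynamic_range) ** 2  # stabilising constants from the SSIM definition
    c2 = (0.03 * dynamic_range) ** 2
    mx = sum(x) / n
    my = sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx * mx + my * my + c1) * (vx + vy + c2)
    return numerator / denominator


def best_aligned_index(first_frame, second_frames, threshold=0.9):
    """Index of the most similar second key frame, or None if none reaches the threshold."""
    scores = [global_ssim(first_frame, frame) for frame in second_frames]
    if not scores:
        return None
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= threshold else None
```

A first key frame for which `best_aligned_index` returns `None` has no target second key frame and would simply be skipped in the later steps.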
- S130 Extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame.
- the first audio data corresponding to the preset audio duration can be extracted at the first key frame position in the first video.
- the second audio data corresponding to the same preset audio duration is extracted at the target second key frame position in the second video, so that the extracted first audio data and second audio data have the same audio duration.
- the extracted first audio data is audio data with a preset audio duration using the playback timestamp of the first key frame as a preset reference moment.
- the extracted second audio data is audio data with a preset audio duration using the playback timestamp of the target second key frame as a preset reference moment.
- the preset reference moment may refer to the start moment, center moment or end moment of the audio data.
- the picture content of the aligned first key frame and the target second key frame is the same, but their corresponding playback timestamps may be different.
- the first audio data and the second audio data need to be extracted at the same position of the key frame to ensure the accuracy of subsequent audio and video synchronization detection.
- for example, when the preset reference moment is the start moment, the first audio data with a preset audio duration starting at the playback timestamp of the first key frame is extracted from the first video.
- correspondingly, the second audio data with a preset audio duration starting at the playback timestamp of the target second key frame is extracted from the second video.
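The extraction in S130 amounts to placing a window of the preset audio duration relative to the key frame's playback timestamp. The sketch below assumes timestamps in milliseconds and audio held as a flat list of PCM samples; the function names and the `reference` strings are hypothetical and only illustrate the start, center and end reference moments.

```python
def audio_window_ms(key_frame_pts_ms, duration_ms, reference="start"):
    """Return (begin_ms, end_ms) of the audio window anchored at a key frame timestamp."""
    if reference == "start":
        begin = key_frame_pts_ms
    elif reference == "center":
        begin = key_frame_pts_ms - duration_ms / 2
    elif reference == "end":
        begin = key_frame_pts_ms - duration_ms
    else:
        raise ValueError("reference must be 'start', 'center' or 'end'")
    return begin, begin + duration_ms


def slice_samples(samples, sample_rate_hz, begin_ms, end_ms):
    """Cut the samples that fall inside [begin_ms, end_ms), clamped to the track."""
    lo = max(0, round(begin_ms * sample_rate_hz / 1000))
    hi = min(len(samples), round(end_ms * sample_rate_hz / 1000))
    return samples[lo:hi]
```

The same `reference` value would be used for both videos, so the two segments sit at the same position relative to their key frames even when the playback timestamps differ.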
- S140 Match the first audio data with the second audio data to determine an audio-video synchronization detection result of the first video.
- FIG2 shows a waveform matching diagram of the first audio data and the second audio data.
- the first audio data and the second audio data have the same audio duration.
- the matching result between the first audio data and the second audio data can be determined based on the similarity between the two audio waveforms in FIG2.
- the covariance between the first audio data and the second audio data can be used as the audio similarity between the first audio data and the second audio data.
- if the audio similarity is greater than or equal to the preset similarity threshold, it is determined that the first audio data and the second audio data match successfully.
- if the audio similarity is less than the preset similarity threshold, it is determined that the first audio data and the second audio data fail to match. If the match is successful, it indicates that the audio and video images at the position of the first key frame are synchronized. If the match fails, it indicates that the audio and video images at the position of the first key frame are not synchronized, that is, transcoding has caused the audio and video to go out of sync. According to the matching results corresponding to each first key frame in the first video, the audio and video synchronization detection result of the first video can be determined. For example, if the matching results corresponding to all the first key frames in the first video are successful matches, the audio and video synchronization detection result of the first video is determined to be audio and video synchronization.
- if the matching result corresponding to at least one first key frame in the first video is a match failure, it is determined that the audio and video synchronization detection result of the first video is audio and video out of synchronization, that is, the first video has audio and video asynchrony caused by transcoding.
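The matching rule of S140 can be illustrated with a short sketch. It assumes the two audio segments are equal-length lists of samples and uses the population covariance as the similarity, as the text suggests; the function names and the threshold value are hypothetical, and because covariance scales with signal amplitude, any real threshold would need to be calibrated for the audio in question.

```python
def covariance(a, b):
    """Population covariance of two equal-length sample sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("segments must be non-empty and of equal length")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def segments_match(first_audio, second_audio, threshold):
    """A pair matches when its covariance reaches the preset similarity threshold."""
    return covariance(first_audio, second_audio) >= threshold


def video_is_synchronized(segment_pairs, threshold):
    """The first video counts as synchronized only if every key frame's pair matches."""
    return all(segments_match(first, second, threshold) for first, second in segment_pairs)
```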
- the technical solution of the disclosed embodiment detects the transcoded first video by taking the second video with synchronized audio and video before transcoding as a reference, and matches the first audio data corresponding to the first key frame after key frame alignment with the second audio data corresponding to the target second key frame, so as to accurately determine whether the first video has audio and video asynchronism caused by transcoding, thereby realizing automatic detection of audio and video synchronization and ensuring the accuracy of audio and video synchronization detection.
- "aligning the first key frame and the second key frame to determine a target second key frame aligned with the first key frame" in S120 may include: obtaining an optional second key frame sequence for the first key frame; obtaining the current second key frame in sequence according to the order of the second key frame sequence, and determining the image similarity between the first key frame and the current second key frame; if the image similarity is greater than or equal to a preset similarity threshold, determining the current second key frame as the target second key frame aligned with the first key frame.
- image similarity can be characterized by any indicator that can measure the similarity between two images.
- image similarity is characterized by the structural similarity SSIM (Structural Similarity) indicator.
- the value range of the SSIM indicator is 0 to 1, and the larger the SSIM indicator, the higher the image similarity.
- all the extracted first key frames can be sorted in the order of video playback to obtain a first key frame sequence, such as {A1, A2, ..., An}.
- all the extracted second key frames can be sorted in the order of video playback to obtain a second key frame sequence, such as {B1, B2, ..., Bm}.
- the target second key frame to be aligned with each first key frame can be determined from the second key frame sequence in sequence according to the order of the first key frame sequence.
- for A1, the first of the first key frames, the optional second key frame sequence is the whole sequence {B1, B2, ..., Bm}.
- the second key frame B1 can be first used as the current second key frame, and it is detected whether the image similarity between the first key frame A1 and the second key frame B1 is greater than or equal to the preset similarity threshold; if so, it indicates that the first key frame A1 and the second key frame B1 are similar in picture, and the second key frame B1 can be directly used as the target second key frame aligned with the first key frame A1; if not, the next second key frame B2 is used as the current second key frame and the check is repeated, until the second key frame matching the first key frame A1 is determined, thereby completing the alignment of the first key frame A1.
- for example, if the target second key frame aligned with the first key frame A1 is B2, the second key frames after B2, that is, {B3, B4, ..., Bm}, can be used as the optional second key frame sequence of the first key frame A2, and the target second key frame aligned with the first key frame A2 is determined from {B3, B4, ..., Bm}.
- by proceeding sequentially in this way, the target second key frame aligned with each first key frame can be obtained more quickly, thereby improving the key frame alignment speed and further improving the detection efficiency of audio and video synchronization.
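The sequential alignment over {A1, ..., An} and {B1, ..., Bm} can be sketched as below. The sketch is an illustration under assumptions: `similarity` is any caller-supplied image similarity function (for example an SSIM implementation), the name `align_sequentially` is hypothetical, and a first key frame with no match leaves the candidate sequence unchanged for the next one.

```python
def align_sequentially(first_frames, second_frames, similarity, threshold):
    """Map each first key frame index to its target second key frame index.

    Candidates for a first key frame start just after the previous target,
    and the scan stops at the first candidate that reaches the threshold.
    """
    alignment = {}
    start = 0  # index of the first candidate still available
    for i, first in enumerate(first_frames):
        for j in range(start, len(second_frames)):
            if similarity(first, second_frames[j]) >= threshold:
                alignment[i] = j
                start = j + 1  # later first key frames only see later candidates
                break
    return alignment
```

Compared with scoring every pair, each first key frame here stops at its first acceptable candidate, which is the source of the speed-up the text describes.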
- FIG3 is a flow chart of another method for detecting audio-visual synchronization provided by an embodiment of the present disclosure.
- in this embodiment, the audio duration of the extracted first audio data is greater than the audio duration of the second audio data, and on this basis, the step of "matching the first audio data with the second audio data to determine the audio-visual synchronization detection result of the first video" is optimized. Explanations of terms that are the same as or correspond to those in the above embodiments are not repeated here.
- the audio-video synchronization detection method specifically includes the following steps:
- S310 Obtain a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video.
- S320 extract a first key frame in the first video and a second key frame in the second video, align the first key frame and the second key frame, and determine a target second key frame that is aligned with the first key frame.
- S330 Extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame.
- the audio duration corresponding to the first audio data is the first preset duration.
- the audio duration corresponding to the second audio data is the second preset duration.
- the first preset duration is greater than the second preset duration.
- for example, the first preset duration is 500 ms and the second preset duration is 20 ms.
- the first audio data corresponding to the first preset duration can be extracted at the first key frame position in the first video.
- the second audio data corresponding to the second preset duration can be extracted at the target second key frame position in the second video, so that the audio duration of the extracted first audio data is greater than the audio duration of the second audio data, so that the second audio data can be matched in the first audio data later.
- the first audio data needs to be extracted from a range extending both before and after the first key frame position, so that accurate audio matching can be performed later.
- the first audio data can be audio data with a first preset duration centered at the playback timestamp of the first key frame.
- FIG4 shows an example of first audio data extraction. Referring to FIG4, first audio data with a first preset duration of 500ms centered at the playback timestamp of each first key frame is extracted from the audio of the first video.
- the second audio data may be audio data having a second preset duration with the playback timestamp of the target second key frame as a preset reference moment.
- the preset reference moment may be one of the following: a start moment, a center moment, and an end moment.
- FIG4 also shows an example of second audio data extraction. Referring to FIG4, second audio data having a second preset duration of 20 ms and starting at the playback timestamp of each target second key frame is extracted from the audio of the second video.
- S340 Determine, from the first audio data, target audio data having a second preset duration and matching the second audio data.
- the target audio data may refer to the audio data in the first audio data that successfully matches the second audio data.
- the target audio data and the second audio data may be approximately the same data.
- the first audio data can be traversed to extract multiple audio data to be selected with the second preset duration, for example, one audio data to be selected with the second preset duration is extracted every 1 ms.
- the audio similarity between each audio data to be selected and the second audio data can be determined, and the audio data to be selected with the greatest audio similarity is used as the target audio data to match the second audio data.
- S340 may include: based on a preset sliding duration and the second preset duration, sliding over the first audio data to obtain current audio data with the second preset duration corresponding to the current sliding; determining the audio similarity between the current audio data and the second audio data; and determining, based on the audio similarity corresponding to the current audio data, the target audio data matching the second audio data.
- the first audio data can be slid successively to obtain the current audio data with the second preset duration.
- for example, if the preset sliding duration is 1 ms and the second preset duration is 20 ms, the current audio data obtained by the first sliding is the audio data from the 0th ms to the 20th ms in the first audio data, the current audio data obtained by the second sliding is the audio data from the 1st ms to the 21st ms in the first audio data, and so on.
- the covariance between the current audio data obtained by the sliding and the second audio data can be determined, and the covariance can be used as the audio similarity corresponding to the current audio data.
- if the audio similarity is greater than or equal to the preset similarity threshold, it can be determined that the current audio data matches the second audio data, and the current audio data is used as the target audio data. If the audio similarity is less than the preset similarity threshold, it can be determined that the current audio data does not match the second audio data. In that case, the current audio data is updated by sliding, and the current audio data obtained by the next sliding is checked against the second audio data; sliding stops once the matching target audio data is determined.
- because sliding stops at the first match, the above-mentioned sliding matching method does not need to traverse all audio data with the second preset duration in the first audio data.
- the target audio data matching the second audio data can therefore be obtained more quickly while ensuring the accuracy of synchronization detection, thereby further improving the efficiency of audio and video synchronization detection.
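The sliding match with early stop can be sketched as follows. It is a minimal illustration that works in sample indices rather than milliseconds (one step per sample), uses covariance as the similarity as in the text, and treats the function names, the `step` parameter and the threshold as hypothetical.

```python
def covariance(a, b):
    """Population covariance of two equal-length sample sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def find_target_start(first_audio, second_audio, threshold, step=1):
    """Slide a window of len(second_audio) over first_audio.

    Returns the start index of the first window whose covariance with
    second_audio reaches the threshold, or None if no window matches.
    """
    window = len(second_audio)
    if window == 0:
        raise ValueError("second_audio must not be empty")
    for begin in range(0, len(first_audio) - window + 1, step):
        current = first_audio[begin:begin + window]
        if covariance(current, second_audio) >= threshold:
            return begin  # stop sliding at the first match
    return None
```

The returned index, converted to time and added to the start timestamp of the first audio data, gives the start timestamp of the target audio data.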
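- S350 Taking the second audio data and the target second key frame as an audio-visual synchronization reference, determine an audio-video offset between the target audio data and the first key frame.
- the audio-video offset may refer to the time difference between the sound of the target audio data and the picture of the first key frame.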
- the first time difference between the second audio data and the target second key frame can be used as a reference time difference for audio and video synchronization, and the target time difference between the target audio data and the first key frame is detected. If the target time difference is equal to the reference time difference, it indicates that the target audio data and the first key frame are synchronized in audio and video, and the audio and video offset at this time is 0. If the target time difference is not equal to the reference time difference, the specific audio and video offset can be determined based on the difference between the target time difference and the reference time difference.
- S350 may include: determining a reference time difference for audio-visual synchronization based on the start timestamp of the second audio data and the playback timestamp of the target second key frame; determining a target time difference corresponding to the first key frame based on the start timestamp of the target audio data and the playback timestamp of the first key frame; and determining the audio and video offset corresponding to the first key frame based on the target time difference and the reference time difference.
- the start timestamp of the second audio data can be subtracted from the playback timestamp of the target second key frame, and the obtained time difference value can be determined as the reference time difference for audio and video synchronization.
- for example, if the start timestamp of the second audio data is the playback timestamp of the target second key frame, the reference time difference for audio and video synchronization is 0.
- the start timestamp of the target audio data is subtracted from the playback timestamp of the first key frame, and the obtained time difference value is determined as the target time difference corresponding to the first key frame.
- the reference time difference is subtracted from the target time difference, and the obtained time difference value is determined as the audio and video offset corresponding to the first key frame.
- when the reference time difference is 0, the target time difference is the audio and video offset. If the audio and video offset is greater than 0, it indicates that the audio is ahead of the video picture. If the audio and video offset is less than 0, it indicates that the audio lags behind the video picture.
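The offset arithmetic of S350 can be written out as a small sketch. It assumes all timestamps are in milliseconds on each video's own timeline and takes the offset as the target time difference minus the reference time difference, so that a positive value means the audio leads the picture; the function and parameter names are hypothetical.

```python
def av_offset_ms(first_key_pts, target_audio_start, second_key_pts, second_audio_start):
    """Audio and video offset for one first key frame, in milliseconds.

    Positive: audio is ahead of the picture. Negative: audio lags the picture.
    """
    reference_diff = second_key_pts - second_audio_start  # from the synchronized second video
    target_diff = first_key_pts - target_audio_start      # from the transcoded first video
    return target_diff - reference_diff
```

For instance, if the second audio data starts exactly at the target second key frame (reference time difference 0) and the matching target audio data starts 40 ms before the first key frame, the offset is 40 ms, meaning the audio leads.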
- S360 Determine an audio and video synchronization detection result of the first video based on the audio and video offset corresponding to the first key frame.
- based on the audio and video offsets corresponding to the first key frames in the first video, the audio and video synchronization detection result of the first video as a whole can be determined.
- S360 may include: if the audio and video offset corresponding to each first key frame in the first video is within a preset allowable range, determining that the audio and video synchronization detection result of the first video is audio and video synchronization; if there is at least one first key frame in the first video whose audio and video offset is not within the preset allowable range, determining that the audio and video synchronization detection result of the first video is audio and video asynchrony.
- the preset allowable range may be a time difference range of the allowed audio and video offset that is pre-set based on the detection standard.
- for example, the preset allowable range is [-185, 90] in milliseconds, which indicates that, relative to the video picture, an audio lag of up to 185 ms or an audio lead of up to 90 ms is acceptable and is still considered audio and video synchronization.
- if every offset is within the preset allowable range, the audio and video synchronization detection result of the first video is audio and video synchronization, that is, the audio and video of the transcoded first video are also synchronized, and the video transcoding did not cause the audio and video to go out of sync.
- otherwise, the audio and video synchronization detection result of the first video is audio and video out of sync, that is, audio and video asynchrony caused by transcoding occurs at the position of each first key frame whose offset is not within the preset allowable range.
- the audio and video offset corresponding to each first key frame that is not within the preset allowable range can be output to remind the user that the audio and video are out of sync.
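The decision rule of S360 can be sketched as follows, using the example range [-185, 90] ms from the text. The function name, the dictionary keyed by first key frame index, and the choice to treat both bounds as inclusive are assumptions made for illustration.

```python
ALLOWED_RANGE_MS = (-185, 90)  # example range from the text: lag up to 185 ms, lead up to 90 ms


def detect_sync(offsets_by_key_frame, allowed=ALLOWED_RANGE_MS):
    """Return (is_synchronized, offending) for a mapping of key frame -> offset in ms.

    `offending` holds the key frames whose offset falls outside the allowed range,
    so they can be reported to the user.
    """
    low, high = allowed
    offending = {
        key: offset
        for key, offset in offsets_by_key_frame.items()
        if not low <= offset <= high
    }
    return len(offending) == 0, offending
```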
- the technical solution of the disclosed embodiment extracts the first audio data with a longer audio duration, extracts the target audio data matching the second audio data again from the first audio data, and uses the second audio data and the target second key frame as the reference for audio and video synchronization, determines the audio and video offset between the target audio data and the first key frame, and based on the audio and video offset corresponding to the first key frame, can more accurately determine the audio and video synchronization detection result of the first video, thereby further improving the accuracy of the audio and video synchronization detection.
- FIG5 is a schematic diagram of the structure of an audio-video synchronization detection device provided by an embodiment of the present disclosure.
- the device specifically includes: a first video acquisition module 510, a key frame alignment module 520, an audio data extraction module 530 and an audio-video synchronization detection module 540.
- the first video acquisition module 510 is used to acquire the first video to be detected, where the first video is a video obtained by transcoding the second video with synchronized audio and video;
- the key frame alignment module 520 is used to extract the first key frame in the first video and the second key frame in the second video, and align the first key frame and the second key frame to determine the target second key frame aligned with the first key frame;
- the audio data extraction module 530 is used to extract the first audio data corresponding to the first key frame and the second audio data corresponding to the target second key frame;
- the audio and video synchronization detection module 540 is used to match the first audio data with the second audio data to determine the audio and video synchronization detection result of the first video.
- the technical solution provided by the embodiment of the present disclosure detects the transcoded first video by taking the second video with synchronized audio and video before transcoding as a reference, and matches the first audio data corresponding to the first key frame after key frame alignment with the second audio data corresponding to the target second key frame, so as to accurately determine whether the first video has audio and video asynchronism caused by transcoding, thereby realizing automatic detection of audio and video synchronization and ensuring the accuracy of audio and video synchronization detection.
- the key frame alignment module 520 is specifically used for:
- obtain an optional second key frame sequence of the first key frame; obtain the current second key frame in sequence according to the order of the second key frame sequence, and determine the image similarity between the first key frame and the current second key frame; if the image similarity is greater than or equal to a preset similarity threshold, determine the current second key frame as the target second key frame aligned with the first key frame.
- the audio duration corresponding to the first audio data is a first preset duration, and the audio duration corresponding to the second audio data is a second preset duration, wherein the first preset duration is greater than the second preset duration.
- the first audio data is audio data with the first preset duration and with the playback timestamp of the first key frame as the center moment;
- the second audio data is audio data having the second preset duration with the playback timestamp of the target second key frame as a preset reference moment; wherein the preset reference moment is one of the following: a start moment, a center moment, and an end moment.
- the audio and video synchronization detection module 540 includes:
- a target audio data determining unit configured to determine, from the first audio data, target audio data having the second preset duration and matching the second audio data
- an audio-visual offset determination unit configured to determine an audio-visual offset between the target audio data and the first key frame by taking the second audio data and the target second key frame as an audio-visual synchronization reference;
- the audio and video synchronization detection unit is used to determine the audio and video synchronization detection result of the first video based on the audio and video offset corresponding to the first key frame.
- the target audio data determination unit is specifically used to:
- slide a window of the second preset duration over the first audio data to obtain the current audio data corresponding to each sliding position; determine the audio similarity between the current audio data and the second audio data; and, based on the audio similarity corresponding to each current audio data, determine the target audio data matching the second audio data.
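The sliding match can be sketched as below. The text does not name a similarity measure or an offset formula, so both are assumptions here: normalized correlation is used as the audio similarity, and the offset is computed as if the first and second audio data were both centered on their key frame's playback timestamp. All function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def normalized_correlation(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson-style correlation of two equal-length sample windows."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    num = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    den = math.sqrt(
        sum((x - mean_a) ** 2 for x in a) * sum((y - mean_b) ** 2 for y in b)
    )
    return num / den if den > 0 else 0.0  # silent windows score 0


def find_target_audio(
    first_audio: Sequence[float], second_audio: Sequence[float]
) -> Tuple[int, float]:
    """Slide a window of len(second_audio) over first_audio and return
    (start index, similarity) of the best-matching window."""
    window = len(second_audio)
    if window == 0 or window > len(first_audio):
        raise ValueError("second_audio must be non-empty and no longer than first_audio")
    best_start, best_score = 0, -math.inf
    for start in range(len(first_audio) - window + 1):
        score = normalized_correlation(first_audio[start:start + window], second_audio)
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


def offset_seconds(best_start: int, first_len: int, window: int, sample_rate: int) -> float:
    """Assumed convention: both clips are centered on their key frame, so a
    perfectly synchronized match sits in the middle of first_audio."""
    expected_start = (first_len - window) // 2
    return (best_start - expected_start) / sample_rate


reference = [0.0, 1.0, 0.0, -1.0, 0.5, -0.5]    # second audio data (toy samples)
first_audio = [0.0] * 5 + reference + [0.0]      # same content, shifted later
best_start, best_score = find_target_audio(first_audio, reference)
offset = offset_seconds(best_start, len(first_audio), len(reference), sample_rate=1000)
```

Under this convention a positive offset means the matched audio sits later in the first video than the key frame picture does. A real implementation would more likely use a vectorized or FFT-based correlation over decoded PCM samples.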
- the audio and video offset determination unit is specifically used to:
- the audio and video synchronization detection unit is specifically used for:
- if the audio and video offset corresponding to each first video frame in the first video is within the preset allowable range, the audio and video synchronization detection result of the first video is determined to be that the audio and video are synchronized; if there is at least one first video frame in the first video whose audio and video offset is not within the preset allowable range, the audio and video synchronization detection result of the first video is determined to be that the audio and video are out of synchronization.
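The decision rule amounts to checking every measured offset against the allowable range. In the sketch below the function name `is_in_sync` and the range of plus or minus 0.1 seconds are illustrative assumptions; the text does not give concrete limits.

```python
from typing import Iterable


def is_in_sync(offsets_seconds: Iterable[float], allowed_min: float, allowed_max: float) -> bool:
    """Return True only if every offset lies inside [allowed_min, allowed_max]."""
    offsets = list(offsets_seconds)
    if not offsets:
        # Without any aligned key frame there is nothing to base a verdict on.
        raise ValueError("at least one audio and video offset is required")
    return all(allowed_min <= offset <= allowed_max for offset in offsets)


# Hypothetical allowable range of +/- 0.1 s.
synced = is_in_sync([0.0, 0.02, -0.05], allowed_min=-0.1, allowed_max=0.1)
out_of_sync = is_in_sync([0.0, 0.25], allowed_min=-0.1, allowed_max=0.1)
```

A single offset outside the range is enough to mark the whole first video as out of synchronization.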
- the audio and video synchronization detection device provided in the embodiments of the present disclosure can execute the audio and video synchronization detection method provided in any embodiment of the present disclosure, and has the corresponding functional modules and beneficial effects for executing the method.
- FIG6 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.
- a schematic diagram of the structure of an electronic device (e.g., a terminal device or server in FIG6) 500 suitable for implementing an embodiment of the present disclosure is shown.
- the terminal device in the embodiment of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), etc., and fixed terminals such as digital TVs, desktop computers, etc.
- the electronic device shown in FIG6 is merely an example and should not impose any limitations on the functions and scope of use of the embodiments of the present disclosure.
- the electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 to a random access memory (RAM) 503.
- In the RAM 503, various programs and data required for the operation of the electronic device 500 are also stored.
- the processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504.
- An input/output (I/O) interface 505 is also connected to the bus 504.
- the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 507 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 508 including, for example, a magnetic tape, a hard disk, etc.; and communication devices 509.
- the communication devices 509 may allow the electronic device 500 to communicate with other devices, wirelessly or by wire, to exchange data.
- although FIG. 6 shows an electronic device 500 with various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices may alternatively be implemented or provided.
- an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer readable medium, the computer program including program code for executing the method shown in the flowchart.
- the computer program can be downloaded and installed from the network through the communication device 509, or installed from the storage device 508, or installed from the ROM 502.
- When the computer program is executed by the processing device 501, the above functions defined in the method of the embodiment of the present disclosure are performed.
- the electronic device provided by the embodiment of the present disclosure and the audio and video synchronization detection method provided by the above embodiment belong to the same inventive concept.
- for technical details not fully described in this embodiment, refer to the above embodiment; this embodiment has the same beneficial effects as the above embodiment.
- the embodiment of the present disclosure provides a computer storage medium on which a computer program is stored.
- When the program is executed by a processor, the audio and video synchronization detection method provided by the above embodiment is implemented.
- the computer-readable medium disclosed above may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
- the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or any combination of the above.
- Computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
- a computer-readable storage medium may be any tangible medium containing or storing a program that may be used by or in combination with an instruction execution system, apparatus or device.
- a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried.
- This propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above.
- the computer readable signal medium may also be any computer readable medium other than a computer readable storage medium, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device.
- the program code contained on the computer readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
- the client and the server may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any network currently known or developed in the future.
- the computer-readable medium may be included in the electronic device, or may exist independently without being incorporated into the electronic device.
- the computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device: obtains a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video; extracts a first key frame in the first video and a second key frame in the second video, and aligns the first key frame and the second key frame to determine a target second key frame aligned with the first key frame; extracts first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame; and matches the first audio data with the second audio data to determine an audio and video synchronization detection result of the first video.
- Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including, but not limited to, object-oriented programming languages, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as "C" or similar programming languages.
- the program code may be executed entirely on the user's computer, partially on the user's computer, as a separate software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
- the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
- each box in the flowchart or block diagram may represent a module, a program segment, or a portion of a code, which contains one or more executable instructions for implementing a specified logical function.
- the functions marked in the boxes may also occur in an order different from that marked in the accompanying drawings. For example, two boxes represented in succession may actually be executed substantially in parallel, and they may sometimes be executed in the opposite order, depending on the functions involved.
- each block in the block diagram and/or flowchart, and combinations of blocks in the block diagram and/or flowchart may be implemented by a dedicated hardware-based system that performs the specified function or operation, or may be implemented by a combination of dedicated hardware and computer instructions.
- exemplary types of hardware logic components include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and the like.
- a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, device, or equipment.
- a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
- a machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or equipment, or any suitable combination of the foregoing.
- a more specific example of a machine-readable storage medium may include an electrical connection based on one or more lines, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
- Example 1 provides a method for detecting audio and video synchronization, including:
- the first video is a video obtained by transcoding a second video with synchronized audio and video;
- the first audio data is matched with the second audio data to determine the audio and video synchronization detection result of the first video.
- Example 2 provides a method for detecting audio and video synchronization, further comprising:
- aligning the first key frame and the second key frame to determine a target second key frame aligned with the first key frame includes:
- the current second video frame is determined as a target second key frame aligned with the first key frame.
- Example 3 provides a method for detecting audio-visual synchronization, further comprising:
- the audio duration corresponding to the first audio data is a first preset duration
- the audio duration corresponding to the second audio data is a second preset duration, wherein the first preset duration is greater than the second preset duration
- Example 4 provides a method for detecting audio and video synchronization, further comprising:
- the first audio data is audio data with a first preset duration and a playback timestamp of the first key frame as the center moment;
- the second audio data is audio data having a second preset duration with the playback timestamp of the target second key frame as a preset reference time; wherein the preset reference time is one of the following:
- Example 5 provides a method for detecting audio and video synchronization, further comprising:
- matching the first audio data with the second audio data to determine an audio-video synchronization detection result of the first video includes:
- an audio and video synchronization detection result of the first video is determined.
- the determining, from the first audio data, target audio data having the second preset duration and matching the second audio data includes:
- target audio data matching the second audio data is determined.
- Example 7 provides a method for detecting audio-visual synchronization, further comprising:
- the determining the audio and video offset between the target audio data and the first key frame by taking the second audio data and the target second key frame as an audio and video synchronization reference includes:
- an audio and video offset corresponding to the first key frame is determined.
- Example 8 provides a method for detecting audio and video synchronization, further comprising:
- determining the audio and video synchronization detection result of the first video based on the audio and video offset corresponding to the first key frame includes:
- If the audio and video offset corresponding to at least one first video frame in the first video is not within a preset allowable range, it is determined that the audio and video synchronization detection result of the first video is that the audio and video are out of synchronization.
- Example 9 provides a device for detecting audio and video synchronization, including:
- a first video acquisition module is used to acquire a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video;
- a key frame alignment module used to extract a first key frame in the first video and a second key frame in the second video, align the first key frame with the second key frame, and determine a target second key frame aligned with the first key frame;
- An audio data extraction module used to extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
- the audio and video synchronization detection module is used to match the first audio data with the second audio data to determine the audio and video synchronization detection result of the first video.
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
Abstract
Description
This application claims priority to Chinese Patent Application No. 202310813430.4, filed on July 4, 2023, the entire text of which is incorporated herein by reference as a part of this application.
The embodiments of the present disclosure relate to a method, apparatus, device and storage medium for detecting audio and video synchronization.
With the rapid development of computer technology, videos often need to be transcoded. However, transcoding often causes the audio and video to fall out of sync, that is, the audio and the video picture no longer stay accurately aligned. For example, the picture shows someone talking but there is no corresponding sound, which greatly degrades the viewing experience. At present, whether transcoding has caused the audio and video to fall out of sync is usually judged manually by watching the video. This manual detection is time-consuming and labor-intensive, and it reduces the efficiency of audio and video synchronization detection.
Summary of the invention
The present disclosure provides a method, apparatus, device and storage medium for detecting audio and video synchronization, so as to automatically detect the audio and video synchronization of a transcoded video, thereby improving detection efficiency and ensuring the accuracy of audio and video synchronization detection.
In a first aspect, an embodiment of the present disclosure provides a method for detecting audio and video synchronization, including:
acquiring a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video;
extracting a first key frame in the first video and a second key frame in the second video, aligning the first key frame and the second key frame, and determining a target second key frame aligned with the first key frame;
extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
matching the first audio data with the second audio data to determine an audio and video synchronization detection result of the first video.
In a second aspect, an embodiment of the present disclosure further provides an audio and video synchronization detection device, including:
a first video acquisition module, used to acquire a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video;
a key frame alignment module, used to extract a first key frame in the first video and a second key frame in the second video, align the first key frame and the second key frame, and determine a target second key frame aligned with the first key frame;
an audio data extraction module, used to extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
an audio and video synchronization detection module, used to match the first audio data with the second audio data to determine an audio and video synchronization detection result of the first video.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, the electronic device including:
one or more processors;
a storage device for storing one or more programs,
where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the audio and video synchronization detection method described in any embodiment of the present disclosure.
In a fourth aspect, an embodiment of the present disclosure further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the audio and video synchronization detection method described in any embodiment of the present disclosure.
The above and other features, advantages and aspects of the embodiments of the present disclosure will become more apparent with reference to the following detailed description in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic and that components and elements are not necessarily drawn to scale.
FIG. 1 is a schematic flow chart of an audio and video synchronization detection method provided by an embodiment of the present disclosure;
FIG. 2 is a waveform matching diagram of first audio data and second audio data involved in an embodiment of the present disclosure;
FIG. 3 is a schematic flow chart of another audio and video synchronization detection method provided by an embodiment of the present disclosure;
FIG. 4 is an example of audio data extraction involved in an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of the structure of an audio and video synchronization detection device provided by an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of the structure of an electronic device provided by an embodiment of the present disclosure.
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the scope of protection of the present disclosure.
It should be understood that the steps described in the method embodiments of the present disclosure may be performed in a different order and/or in parallel. In addition, the method embodiments may include additional steps and/or omit some of the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and its variations as used herein are open-ended, that is, "including but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the description below.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules or units.
It should be noted that the modifiers "one" and "a plurality of" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of these messages or information.
FIG. 1 is a schematic flow chart of an audio and video synchronization detection method provided by an embodiment of the present disclosure. The embodiment is applicable to performing audio and video synchronization detection on a transcoded video. The method can be executed by an audio and video synchronization detection device, which can be implemented in software and/or hardware and, optionally, by an electronic device such as a mobile terminal, a PC or a server.
As shown in FIG. 1, the audio and video synchronization detection method specifically includes the following steps:
S110: Acquire a first video to be detected, where the first video is a video obtained by transcoding a second video with synchronized audio and video.
Here, the first video is the transcoded video to be detected, and the second video is the video before transcoding. The audio and the picture in the second video are synchronized, so the second video can serve as a reference video for detecting the audio and video synchronization of the first video.
Specifically, a user can create and publish a video in a client. After receiving the video uploaded by the client, the server needs to transcode it, for example by adjusting the video resolution and bit rate, which can improve the picture quality, and then delivers the transcoded video to other clients for playback. In this application scenario, the video uploaded to the server can be taken as the second video with synchronized audio and video, and the transcoded video as the first video to be detected. Using the second video before transcoding as the synchronization reference makes it possible to accurately detect whether the audio and video of the first video are out of sync as a result of transcoding.
S120: Extract a first key frame in the first video and a second key frame in the second video, align the first key frame and the second key frame, and determine a target second key frame aligned with the first key frame.
Here, a key frame may refer to a video frame that plays a decisive role in the video content, such as a video frame with a scene switch, a video frame with a motion change, or a video frame in which a key action occurs. A first key frame is a key frame in the first video, and there may be one or more first key frames. A second key frame is a key frame in the second video, and there may also be one or more second key frames. The number of first key frames may or may not equal the number of second key frames. Alignment refers to the operation of matching up key frames that have the same picture. The target second key frame is the second key frame that has the same picture as the first key frame.
Specifically, key frame detection is performed on the first video and the second video to extract all first key frames in the first video and all second key frames in the second video. The first key frames and the second key frames can be aligned by measuring the image similarity between them. For example, for each first key frame, the image similarity between that first key frame and each second key frame can be determined, and the second key frame whose image similarity is the highest and is greater than or equal to a preset similarity threshold can be taken as the target second key frame aligned with that first key frame, thereby obtaining a target second key frame having the same picture as the first key frame.
The image similarity can be characterized by any metric capable of measuring the similarity between two images. For example, the image similarity is characterized by the structural similarity (SSIM) index. The SSIM index ranges from 0 to 1, and a larger SSIM index indicates a higher image similarity.
It should be noted that if none of the second key frames is a target second key frame identical to a given first key frame, that first key frame cannot be aligned. In this case the first key frame can be ignored, and only the first key frames that can be aligned are used for the subsequent audio and video synchronization detection.
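The best-match alignment of step S120 can be sketched as follows. This is a simplified illustration under stated assumptions: `global_ssim` computes SSIM once over the whole image instead of averaging over local windows as full SSIM implementations usually do, frames are flat lists of 8-bit grayscale pixels, and the function names and the threshold of 0.9 are hypothetical.

```python
from typing import Optional, Sequence


def global_ssim(x: Sequence[float], y: Sequence[float], dynamic_range: float = 255.0) -> float:
    """Single-window SSIM of two equal-size grayscale images (flat pixel lists)."""
    if len(x) != len(y) or not x:
        raise ValueError("images must be non-empty and of equal size")
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )


def align_best_match(
    first_key_frame: Sequence[float],
    second_key_frames: Sequence[Sequence[float]],
    threshold: float,
) -> Optional[int]:
    """Return the index of the most similar second key frame, or None when
    even the best candidate stays below the threshold (frame is then ignored)."""
    if not second_key_frames:
        return None
    scores = [global_ssim(first_key_frame, candidate) for candidate in second_key_frames]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index if scores[best_index] >= threshold else None


first = [10, 200, 30, 40]
inverted = [245, 55, 225, 215]   # unrelated picture
near_copy = [12, 198, 29, 41]    # same picture with slight transcoding noise
target = align_best_match(first, [inverted, near_copy], threshold=0.9)
unaligned = align_best_match(first, [inverted], threshold=0.9)
```

Returning `None` models the case in which a first key frame cannot be aligned and is skipped in the subsequent detection.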
S130: Extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame.
Specifically, for each first key frame, first audio data of a preset audio duration can be extracted at the position of the first key frame in the first video, and second audio data of the same preset audio duration can be extracted at the position of the target second key frame in the second video, so that the extracted first audio data and second audio data have the same audio duration. For example, the extracted first audio data is audio data of the preset audio duration whose preset reference moment is the playback timestamp of the first key frame, and the extracted second audio data is audio data of the preset audio duration whose preset reference moment is the playback timestamp of the target second key frame. The preset reference moment may be the start moment, the center moment, or the end moment of the audio data.
It should be noted that the aligned first key frame and target second key frame have the same picture content, but their playback timestamps may differ. When the extracted first audio data and second audio data have the same audio duration, they need to be extracted at the same position relative to their respective key frames to ensure the accuracy of the subsequent audio-video synchronization detection. For example, first audio data of the preset audio duration, starting at the playback timestamp of the first key frame, is extracted from the first video, and second audio data of the preset audio duration, starting at the playback timestamp of the target second key frame, is extracted from the second video.
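As a non-limiting illustration, the relationship between a key frame's playback timestamp, the preset reference moment, and the extracted audio interval can be sketched as follows. The function name `audio_window` and the millisecond units are assumptions made for this sketch.

```python
from typing import Tuple


def audio_window(key_frame_pts_ms: float, duration_ms: float,
                 reference: str = "start") -> Tuple[float, float]:
    """Return the (begin, end) interval, in ms, of the audio clip whose
    start, center, or end moment is the key frame's playback timestamp."""
    if reference == "start":
        begin = key_frame_pts_ms
    elif reference == "center":
        begin = key_frame_pts_ms - duration_ms / 2
    elif reference == "end":
        begin = key_frame_pts_ms - duration_ms
    else:
        raise ValueError("reference must be 'start', 'center', or 'end'")
    return begin, begin + duration_ms


# Same duration and same reference moment for both videos, different timestamps.
print(audio_window(1000, 500, "start"))  # clip for a first key frame at 1000 ms
print(audio_window(1040, 500, "start"))  # clip for its target second key frame at 1040 ms
```

Using the same `reference` value for both clips is what keeps the two clips at the same position relative to their respective key frames.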
S140: Match the first audio data with the second audio data to determine an audio-video synchronization detection result of the first video.
Specifically, FIG. 2 shows a waveform matching diagram of the first audio data and the second audio data. As shown in FIG. 2, the first audio data and the second audio data have the same audio duration, so the matching result between them can be determined from the degree of similarity between the two audio waveforms in FIG. 2. For example, the covariance between the first audio data and the second audio data can be used as the audio similarity between them. When the audio similarity is greater than or equal to a preset similarity threshold, the first audio data and the second audio data are determined to match successfully; when the audio similarity is less than the preset similarity threshold, they are determined to fail to match. A successful match indicates that the audio and the video picture at the position of that first key frame are synchronized. A failed match indicates that the audio and the video picture at the position of that first key frame are not synchronized, that is, transcoding has caused the audio and video to go out of sync. The audio-video synchronization detection result of the first video can be determined from the matching results of all first key frames in the first video. For example, if the matching results of all first key frames in the first video are successful, the detection result of the first video is determined to be synchronized. If the matching result of at least one first key frame in the first video is a failure, the detection result of the first video is determined to be out of sync, that is, the first video contains audio-video desynchronization caused by transcoding.
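As a non-limiting illustration, the covariance-based matching and the per-video decision described above can be sketched as follows. The function names are assumptions, and dividing by the number of samples (population covariance) is also an assumption, since the disclosure does not state a normalization.

```python
from typing import Iterable, Sequence


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Population covariance of two equal-length sample sequences."""
    if len(x) != len(y) or not x:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def clips_match(first_audio: Sequence[float], second_audio: Sequence[float],
                threshold: float) -> bool:
    """Match succeeds when the covariance reaches the preset threshold."""
    return covariance(first_audio, second_audio) >= threshold


def video_in_sync(per_key_frame_matches: Iterable[bool]) -> bool:
    """The first video is judged synchronized only if every key frame matched."""
    return all(per_key_frame_matches)


reference_clip = [1.0, 2.0, 3.0]
results = [
    clips_match([1.0, 2.0, 3.0], reference_clip, threshold=0.5),  # same waveform
    clips_match([3.0, 2.0, 1.0], reference_clip, threshold=0.5),  # reversed waveform
]
print(results, video_in_sync(results))
```

Raw covariance depends on signal amplitude, so in practice the threshold would have to be chosen with the expected audio level in mind.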
In the technical solution of this embodiment of the present disclosure, the second video, whose audio and video are synchronized before transcoding, is used as a reference for checking the transcoded first video, and the first audio data corresponding to each aligned first key frame is matched against the second audio data corresponding to its target second key frame. This makes it possible to determine accurately whether transcoding has caused the audio and video of the first video to go out of sync, which realizes automatic audio-video synchronization detection and ensures its accuracy.
As an optional embodiment, "aligning the first key frame and the second key frame to determine a target second key frame aligned with the first key frame" in S120 may include: obtaining a candidate second key frame sequence for the first key frame; obtaining the current second key frame one by one in the order of the second key frame sequence, and determining the image similarity between the first key frame and the current second key frame; and, if the image similarity is greater than or equal to a preset similarity threshold, determining the current second key frame as the target second key frame aligned with the first key frame.
The image similarity can be characterized by any indicator capable of measuring the similarity between two images. For example, the image similarity can be characterized by the structural similarity (SSIM) index. The SSIM index ranges from 0 to 1, and a larger SSIM value indicates a higher image similarity.
Specifically, all extracted first key frames can be sorted in playback order to obtain a first key frame sequence, such as {A1, A2, ..., An}. Similarly, all extracted second key frames can be sorted in playback order to obtain a second key frame sequence, such as {B1, B2, ..., Bm}. Following the order of the first key frame sequence, the target second key frame aligned with each first key frame can be determined in turn from the second key frame sequence.
For example, the candidate second key frame sequence for the first of the first key frames, A1, is {B1, B2, ..., Bm}. Following the order of this sequence, the second key frame B1 is first taken as the current second key frame, and it is checked whether the image similarity between A1 and B1 is greater than or equal to the preset similarity threshold. If so, A1 and B1 show similar pictures, and B1 can be taken directly as the target second key frame aligned with A1. If not, the next second key frame B2 is taken as the current second key frame and the check is repeated for A1 and B2, and so on until a second key frame matching A1 is found, which completes the alignment of A1. Assuming that the target second key frame aligned with A1 is B2, then when the second of the first key frames, A2, is aligned, its target second key frame must appear among the key frames after B2. The second key frames after B2, that is, {B3, B4, ..., Bm}, can therefore be used as the candidate second key frame sequence for A2, and the target second key frame aligned with A2 is determined from {B3, B4, ..., Bm} by the same checking process. Proceeding in this way, the target second key frame aligned with each first key frame can be obtained more quickly, which speeds up key frame alignment and further improves the efficiency of audio-video synchronization detection.
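As a non-limiting illustration, the sequential alignment described above can be sketched as follows. The function name is an assumption, the similarity measure is passed in as a parameter, and leaving the search position unchanged when a first key frame finds no match is an assumption made for this sketch.

```python
from typing import Callable, List, Optional, Sequence, TypeVar

F = TypeVar("F")


def align_sequential(
    first_key_frames: Sequence[F],
    second_key_frames: Sequence[F],
    similarity: Callable[[F, F], float],
    threshold: float,
) -> List[Optional[int]]:
    """Scan the second key frames in playback order and accept the first one
    whose similarity reaches the threshold; the next first key frame only
    searches the second key frames after the previous match."""
    targets: List[Optional[int]] = []
    start = 0  # index of the first remaining candidate second key frame
    for a in first_key_frames:
        match = None
        for j in range(start, len(second_key_frames)):
            if similarity(a, second_key_frames[j]) >= threshold:
                match = j
                break
        targets.append(match)
        if match is not None:
            start = match + 1  # later first key frames search only after this match
    return targets


# Toy frames are labels; the toy similarity is 1.0 for equal labels, else 0.0.
same = lambda a, b: 1.0 if a == b else 0.0
print(align_sequential(["x", "y", "z"], ["p", "x", "y", "q"], same, 1.0))
```

Compared with comparing every pair of key frames, this variant stops at the first acceptable candidate and never revisits second key frames that precede an earlier match.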
FIG. 3 is a schematic flowchart of another audio-video synchronization detection method provided by an embodiment of the present disclosure. On the basis of the above embodiments, in this embodiment the audio duration of the extracted first audio data is greater than that of the second audio data, and on this basis the step of "matching the first audio data with the second audio data to determine the audio-video synchronization detection result of the first video" is optimized. Explanations of terms that are the same as or correspond to those in the above embodiments are not repeated here.
As shown in FIG. 3, the audio-video synchronization detection method specifically includes the following steps:
S310: Obtain a first video to be detected, where the first video is a video obtained by transcoding a second video whose audio and video are synchronized.
S320: Extract a first key frame in the first video and a second key frame in the second video, align the first key frame and the second key frame, and determine a target second key frame aligned with the first key frame.
S330: Extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame.
The audio duration of the first audio data is a first preset duration, and the audio duration of the second audio data is a second preset duration. The first preset duration is greater than the second preset duration. For example, the first preset duration is 500 ms and the second preset duration is 20 ms.
Specifically, for each first key frame, first audio data of the first preset duration can be extracted at the position of the first key frame in the first video, and second audio data of the second preset duration can be extracted at the position of the target second key frame in the second video. The extracted first audio data is therefore longer than the second audio data, so that the second audio data can subsequently be searched for within the first audio data.
Exemplarily, the first audio data needs to be extracted from a range extending to both sides of the first key frame position so that accurate audio matching can be performed later. For example, the first audio data can be audio data of the first preset duration centered on the playback timestamp of the first key frame. FIG. 4 shows an example of first audio data extraction. Referring to FIG. 4, first audio data of the first preset duration of 500 ms, centered on the playback timestamp of each first key frame, is extracted from the audio of the first video.
Exemplarily, the second audio data used as the reference can be extracted in several ways: from the left of the target second key frame position, from the right of it, or from a range extending to both sides of it. For example, the second audio data can be audio data of the second preset duration whose preset reference moment is the playback timestamp of the target second key frame, where the preset reference moment is one of the following: the start moment, the center moment, or the end moment. FIG. 4 also shows an example of second audio data extraction. Referring to FIG. 4, second audio data of the second preset duration of 20 ms, starting at the playback timestamp of each target second key frame, is extracted from the audio of the second video.
S340: Determine, from the first audio data, target audio data that has the second preset duration and matches the second audio data.
The target audio data is the portion of the first audio data that successfully matches the second audio data. In other words, the target audio data and the second audio data can be regarded as approximately the same data.
Specifically, because the first audio data is longer than the second audio data, the first audio data can be traversed to extract multiple candidate audio data items of the second preset duration, for example one candidate every 1 ms. The audio similarity between each candidate and the second audio data can be determined, and the candidate with the greatest audio similarity is taken as the target audio data matching the second audio data.
Exemplarily, S340 may include: sliding over the first audio data based on a preset sliding duration and the second preset duration to obtain, for each slide, current audio data of the second preset duration; determining the audio similarity between the current audio data and the second audio data; and determining, based on the audio similarity of the current audio data, the target audio data matching the second audio data.
Specifically, a window can be slid step by step over the first audio data, based on the preset sliding duration, to obtain current audio data of the second preset duration. For example, when the preset sliding duration is 1 ms, referring to FIG. 4, the current audio data obtained by the first slide is the audio data from 0 ms to 20 ms of the first audio data, the current audio data obtained by the second slide is the audio data from 1 ms to 21 ms of the first audio data, and so on. After each slide, the covariance between the current audio data and the second audio data can be determined and used as the audio similarity of the current audio data. If the audio similarity is greater than or equal to the preset similarity threshold, the current audio data is determined to match the second audio data and is taken as the target audio data. If the audio similarity is less than the preset similarity threshold, the current audio data is determined not to match the second audio data; the current audio data is then updated by sliding again, and the newly obtained current audio data is checked against the second audio data, until matching target audio data is found and sliding stops. Because sliding stops at the first match, this approach does not need to traverse all audio data of the second preset duration in the first audio data. The target audio data matching the second audio data can therefore be obtained more quickly while the accuracy of synchronization detection is maintained, which further improves the efficiency of audio-video synchronization detection.
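As a non-limiting illustration, the sliding match described above can be sketched as follows. The function names are assumptions; positions are expressed in samples rather than milliseconds, and the population covariance is an assumed normalization.

```python
from typing import Optional, Sequence


def covariance(x: Sequence[float], y: Sequence[float]) -> float:
    """Population covariance of two equal-length sample sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)


def find_target_offset(first_audio: Sequence[float], second_audio: Sequence[float],
                       step: int, threshold: float) -> Optional[int]:
    """Slide a window of len(second_audio) samples over first_audio in
    increments of `step` samples and return the start index of the first
    window whose covariance with second_audio reaches the threshold."""
    n = len(second_audio)
    for begin in range(0, len(first_audio) - n + 1, step):
        window = first_audio[begin:begin + n]
        if covariance(window, second_audio) >= threshold:
            return begin  # stop sliding at the first match
    return None  # no window matched


first_audio = [0.0, 0.0, 0.0, 1.0, 5.0, 1.0, 0.0, 0.0]
second_audio = [1.0, 5.0, 1.0]
print(find_target_offset(first_audio, second_audio, step=1, threshold=3.0))
```

Multiplying the returned index by the sample period gives the start time of the target audio data within the first audio data.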
S350: Taking the second audio data and the target second key frame as the audio-video synchronization reference, determine an audio-video offset between the target audio data and the first key frame.
The audio-video offset is the time difference between the sound of the target audio data and the picture of the first key frame.
Specifically, because the audio and video of the second video are synchronized, the second audio data and the target second key frame are synchronized. The time difference between the second audio data and the target second key frame can therefore be used as the reference time difference for synchronization, against which the target time difference between the target audio data and the first key frame is checked. If the target time difference equals the reference time difference, the target audio data and the first key frame are synchronized, and the audio-video offset is 0. If the target time difference does not equal the reference time difference, the specific audio-video offset can be determined from the difference between the target time difference and the reference time difference.
Exemplarily, S350 may include: determining the reference time difference for synchronized audio and video based on the start timestamp of the second audio data and the playback timestamp of the target second key frame; determining the target time difference corresponding to the first key frame based on the start timestamp of the target audio data and the playback timestamp of the first key frame; and determining the audio-video offset corresponding to the first key frame based on the target time difference and the reference time difference.
Specifically, the playback timestamp of the target second key frame can be subtracted from the start timestamp of the second audio data, and the result is taken as the reference time difference for synchronized audio and video. For the extraction method of the second audio data in FIG. 4, the start timestamp of the second audio data is the playback timestamp of the target second key frame, so the reference time difference is 0. The playback timestamp of the first key frame is subtracted from the start timestamp of the target audio data, and the result is taken as the target time difference corresponding to the first key frame. The reference time difference is subtracted from the target time difference, and the result is taken as the audio-video offset corresponding to the first key frame. Referring to FIG. 4, the reference time difference is 0, so the target time difference is the audio-video offset. If the audio-video offset is greater than 0, the audio is ahead of the video picture. If the audio-video offset is less than 0, the audio lags behind the video picture.
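As a non-limiting illustration, the offset calculation described above can be sketched as follows. The function and parameter names are assumptions, all timestamps are taken to be in milliseconds, and the sign of the result is simply the one produced by the two subtractions described in the text.

```python
def audio_video_offset_ms(second_audio_start_ms: float,
                          target_second_key_frame_pts_ms: float,
                          target_audio_start_ms: float,
                          first_key_frame_pts_ms: float) -> float:
    """Offset = target time difference minus reference time difference."""
    # Reference time difference, taken from the synchronized second video.
    reference_diff = second_audio_start_ms - target_second_key_frame_pts_ms
    # Target time difference, taken from the transcoded first video.
    target_diff = target_audio_start_ms - first_key_frame_pts_ms
    return target_diff - reference_diff


# Second audio data starts exactly at its key frame, so the reference difference is 0.
print(audio_video_offset_ms(2000, 2000, 1040, 1000))
# Second audio data centered on its key frame (20 ms clip), same relation in the first video.
print(audio_video_offset_ms(1990, 2000, 990, 1000))
```

Subtracting the reference time difference is what makes the result independent of whether the second audio data was extracted from the start, center, or end moment.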
S360: Determine the audio-video synchronization detection result of the first video based on the audio-video offset corresponding to the first key frame.
Specifically, the audio-video synchronization detection result of the first video as a whole can be determined based on the audio-video offset corresponding to each first key frame in the first video.
Exemplarily, S360 may include: if the audio-video offset corresponding to every first key frame in the first video is within a preset allowable range, determining that the audio-video synchronization detection result of the first video is synchronized; and if the audio-video offset corresponding to at least one first key frame in the first video is outside the preset allowable range, determining that the audio-video synchronization detection result of the first video is out of sync.
The preset allowable range is a range of time differences, set in advance according to a detection standard, within which an audio-video offset is tolerated. For example, the preset allowable range is [-185, 90], in milliseconds, which means that, relative to the video picture, an audio lag of up to 185 ms or an audio lead of up to 90 ms is acceptable and is still regarded as synchronized.
Specifically, it is checked whether the audio-video offset corresponding to each first key frame in the first video is within the preset allowable range. If the offsets of all first key frames are within the range, the audio-video synchronization detection result of the first video is determined to be synchronized, that is, the transcoded first video is still synchronized and transcoding has not caused the audio and video to go out of sync. If the offset of at least one first key frame is outside the range, the detection result of the first video is determined to be out of sync, that is, transcoding has caused the audio and video to go out of sync at the position of each first key frame whose offset is outside the range. In this case, the offsets of those first key frames can be output to tell the user where the audio and video are out of sync and by how much, so that the user can handle the problem quickly.
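As a non-limiting illustration, the range check described above can be sketched as follows. The function name, the dictionary keyed by key frame playback timestamp, and the default range constant are assumptions made for this sketch; the bounds are treated as inclusive.

```python
from typing import Dict, Tuple

ALLOWED_RANGE_MS = (-185, 90)  # example range from the text, in milliseconds


def check_video(offsets_by_key_frame: Dict[int, float],
                allowed: Tuple[float, float] = ALLOWED_RANGE_MS
                ) -> Tuple[bool, Dict[int, float]]:
    """Return (in_sync, out_of_range), where out_of_range maps the playback
    timestamp of each offending first key frame to its audio-video offset."""
    low, high = allowed
    out_of_range = {
        pts: offset
        for pts, offset in offsets_by_key_frame.items()
        if not (low <= offset <= high)
    }
    return len(out_of_range) == 0, out_of_range


# Keys are first key frame timestamps (ms); values are audio-video offsets (ms).
print(check_video({0: 10, 5000: -200, 9000: 90}))
```

Returning the offending key frames together with their offsets mirrors the idea of reporting both where and by how much the audio and video are out of sync.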
In the technical solution of this embodiment of the present disclosure, first audio data with a longer audio duration is extracted, and target audio data matching the second audio data is then extracted from the first audio data. With the second audio data and the target second key frame as the audio-video synchronization reference, the audio-video offset between the target audio data and the first key frame is determined, and the audio-video synchronization detection result of the first video can be determined more accurately from the offsets of the first key frames, which further improves the accuracy of audio-video synchronization detection.
FIG. 5 is a schematic structural diagram of an audio-video synchronization detection apparatus provided by an embodiment of the present disclosure. As shown in FIG. 5, the apparatus specifically includes: a first video acquisition module 510, a key frame alignment module 520, an audio data extraction module 530, and an audio-video synchronization detection module 540.
The first video acquisition module 510 is configured to obtain a first video to be detected, where the first video is a video obtained by transcoding a second video whose audio and video are synchronized. The key frame alignment module 520 is configured to extract a first key frame in the first video and a second key frame in the second video, align the first key frame and the second key frame, and determine a target second key frame aligned with the first key frame. The audio data extraction module 530 is configured to extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame. The audio-video synchronization detection module 540 is configured to match the first audio data with the second audio data to determine the audio-video synchronization detection result of the first video.
In the technical solution provided by this embodiment of the present disclosure, the second video, whose audio and video are synchronized before transcoding, is used as a reference for checking the transcoded first video, and the first audio data corresponding to each aligned first key frame is matched against the second audio data corresponding to its target second key frame. This makes it possible to determine accurately whether transcoding has caused the audio and video of the first video to go out of sync, which realizes automatic audio-video synchronization detection and ensures its accuracy.
On the basis of the above technical solution, the key frame alignment module 520 is specifically configured to:
obtain a candidate second key frame sequence for the first key frame; obtain the current second key frame one by one in the order of the second key frame sequence, and determine the image similarity between the first key frame and the current second key frame; and, if the image similarity is greater than or equal to a preset similarity threshold, determine the current second key frame as the target second key frame aligned with the first key frame.
On the basis of the above technical solutions, the audio duration of the first audio data is a first preset duration, and the audio duration of the second audio data is a second preset duration, where the first preset duration is greater than the second preset duration.
On the basis of the above technical solutions, the first audio data is audio data of the first preset duration centered on the playback timestamp of the first key frame;
the second audio data is audio data of the second preset duration whose preset reference moment is the playback timestamp of the target second key frame, where the preset reference moment is one of the following:
the start moment, the center moment, or the end moment.
On the basis of the above technical solutions, the audio-video synchronization detection module 540 includes:
a target audio data determination unit, configured to determine, from the first audio data, target audio data that has the second preset duration and matches the second audio data;
an audio-video offset determination unit, configured to determine an audio-video offset between the target audio data and the first key frame, taking the second audio data and the target second key frame as the audio-video synchronization reference;
an audio-video synchronization detection unit, configured to determine the audio-video synchronization detection result of the first video based on the audio-video offset corresponding to the first key frame.
On the basis of the above technical solutions, the target audio data determination unit is specifically configured to:
slide over the first audio data based on a preset sliding duration and the second preset duration to obtain, for each slide, current audio data of the second preset duration; determine the audio similarity between the current audio data and the second audio data; and determine, based on the audio similarity of the current audio data, the target audio data matching the second audio data.
On the basis of the above technical solutions, the audio-video offset determination unit is specifically configured to:
determine a reference time difference, representing the synchronized case, based on the start timestamp of the second audio data and the playback timestamp of the target second key frame; determine a target time difference corresponding to the first key frame based on the start timestamp of the target audio data and the playback timestamp of the first key frame; and determine the audio-video offset corresponding to the first key frame based on the target time difference and the reference time difference.
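The offset computation above can be written as a short worked example. The text only says the offset is determined "based on" the two time differences; taking each difference as audio start minus frame timestamp, and the offset as target difference minus reference difference, is an assumed sign convention. The function name and the millisecond unit are hypothetical.

```python
def audio_video_offset_ms(ref_audio_start_ms, ref_frame_pts_ms,
                          target_audio_start_ms, first_frame_pts_ms):
    """Offset of the transcoded (first) video relative to the reference (second) video.

    Assumed convention: a positive result means the audio lags the picture
    compared with the synchronized reference; a negative result means it leads.
    """
    # Reference time difference, measured on the synchronized second video.
    reference_diff = ref_audio_start_ms - ref_frame_pts_ms
    # Target time difference, measured on the first video under test.
    target_diff = target_audio_start_ms - first_frame_pts_ms
    return target_diff - reference_diff
```

For instance, if the second audio data starts 500 ms before the target second key frame (9500 vs. 10000) while the matched target audio data starts only 300 ms before the first key frame (12700 vs. 13000), the function returns 200, meaning the audio is 200 ms late under this convention.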
On the basis of the above technical solutions, the audio-video synchronization detection unit is specifically configured to:
if the audio-video offset corresponding to every first video frame in the first video is within a preset allowable range, determine that the audio-video synchronization detection result of the first video is synchronized; if the audio-video offset corresponding to at least one first video frame in the first video is outside the preset allowable range, determine that the audio-video synchronization detection result of the first video is out of sync.
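The decision rule above reduces to a range check over all computed offsets. The sketch below assumes offsets are collected per frame in milliseconds; the default range of -100 ms to 100 ms is a hypothetical placeholder, since the disclosure does not give values for the preset allowable range.

```python
def detect_sync(offsets_ms, allowed_range_ms=(-100, 100)):
    """Return (result, out_of_range_offsets) for a list of per-frame offsets.

    allowed_range_ms is an inclusive (low, high) pair; the default is a
    placeholder, not a value taken from the disclosure.
    """
    low, high = allowed_range_ms
    out_of_range = [o for o in offsets_ms if not (low <= o <= high)]
    # Synchronized only if every offset falls inside the allowed range.
    result = "in sync" if not out_of_range else "out of sync"
    return result, out_of_range
```

Returning the offending offsets alongside the verdict is a design choice of this sketch; it makes it easier to locate where in the video the drift occurs.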
The audio-video synchronization detection apparatus provided in the embodiments of the present disclosure can execute the audio-video synchronization detection method provided in any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to executing that method.
It is worth noting that the units and modules included in the above apparatus are divided only according to functional logic; the division is not limited to the one described, as long as the corresponding functions can be achieved. In addition, the specific names of the functional units serve only to distinguish them from one another and are not used to limit the protection scope of the embodiments of the present disclosure.
FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present disclosure. Referring now to FIG. 6, it shows a schematic structural diagram of an electronic device 500 (for example, the terminal device or server in FIG. 6) suitable for implementing the embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, laptop computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (e.g., vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 6 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 6, the electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 501, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing device 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
Typically, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; output devices 507 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; storage devices 508 including, for example, a magnetic tape and a hard disk; and a communication device 509. The communication device 509 may allow the electronic device 500 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 6 shows an electronic device 500 with various devices, it should be understood that it is not required to implement or provide all of the devices shown. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the process described above with reference to the flowchart can be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication device 509, installed from the storage device 508, or installed from the ROM 502. When the computer program is executed by the processing device 501, the above functions defined in the method of the embodiments of the present disclosure are performed.
The names of the messages or information exchanged between multiple devices in the embodiments of the present disclosure are used for illustrative purposes only and are not intended to limit the scope of these messages or information.
The electronic device provided by this embodiment of the present disclosure and the audio-video synchronization detection method provided by the above embodiments belong to the same inventive concept. Technical details not described exhaustively in this embodiment can be found in the above embodiments, and this embodiment has the same beneficial effects as the above embodiments.
An embodiment of the present disclosure provides a computer storage medium on which a computer program is stored. When the program is executed by a processor, the audio-video synchronization detection method provided by the above embodiments is implemented.
It should be noted that the computer-readable medium of the present disclosure may be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. The computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted using any suitable medium, including but not limited to: wires, optical cables, RF (radio frequency), or any suitable combination of the above.
In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The computer-readable medium may be included in the electronic device, or it may exist separately without being assembled into the electronic device.
The computer-readable medium carries one or more programs. When the one or more programs are executed by the electronic device, the electronic device is caused to: acquire a first video to be detected, where the first video is a video obtained by transcoding a second video whose audio and video are synchronized; extract a first key frame from the first video and a second key frame from the second video, align the first key frame with the second key frame, and determine a target second key frame aligned with the first key frame; extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame; and match the first audio data against the second audio data to determine an audio-video synchronization detection result of the first video.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including, but not limited to, object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In cases involving a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented in software or in hardware. The name of a unit does not, in some cases, constitute a limitation on the unit itself. For example, the first acquisition unit may also be described as "a unit for acquiring at least two Internet Protocol addresses".
The functions described above herein may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and the like.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, Example 1 provides an audio-video synchronization detection method, including:
acquiring a first video to be detected, where the first video is a video obtained by transcoding a second video whose audio and video are synchronized;
extracting a first key frame from the first video and a second key frame from the second video, aligning the first key frame with the second key frame, and determining a target second key frame aligned with the first key frame;
extracting first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
matching the first audio data against the second audio data to determine an audio-video synchronization detection result of the first video.
According to one or more embodiments of the present disclosure, Example 2 provides an audio-video synchronization detection method, further including:
optionally, aligning the first key frame with the second key frame and determining the target second key frame aligned with the first key frame includes:
obtaining a sequence of candidate second key frames for the first key frame;
taking each current second key frame in turn, in the order of the second key frame sequence, and determining the image similarity between the first key frame and the current second key frame;
if the image similarity is greater than or equal to a preset similarity threshold, determining the current second video frame as the target second key frame aligned with the first key frame.
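The alignment loop in Example 2 can be sketched as a first-match search over the candidate sequence. The disclosure does not specify how image similarity is computed; the measure below (one minus the normalized mean absolute difference of 8-bit grayscale pixels) is an assumption chosen only to keep the example self-contained, and the names `image_similarity` and `align_key_frame` are hypothetical.

```python
def image_similarity(frame_a, frame_b):
    """Similarity in [0, 1] between two frames given as flat lists of 8-bit gray values.

    Assumed measure: 1 - mean absolute pixel difference / 255.
    """
    if len(frame_a) != len(frame_b) or not frame_a:
        raise ValueError("frames must be non-empty and the same size")
    total_diff = sum(abs(a - b) for a, b in zip(frame_a, frame_b))
    return 1.0 - total_diff / (255.0 * len(frame_a))


def align_key_frame(first_key_frame, candidate_second_key_frames, threshold):
    """Return the index of the first candidate whose similarity reaches the threshold.

    Candidates are visited in sequence order, as in the text; None is returned
    when no candidate qualifies.
    """
    for index, candidate in enumerate(candidate_second_key_frames):
        if image_similarity(first_key_frame, candidate) >= threshold:
            return index
    return None
```

Because the loop stops at the first candidate that meets the threshold rather than searching for the global best, the order of the candidate sequence matters. Transcoding usually changes pixel values slightly, so a threshold strictly below 1.0 is needed with a measure like this one.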
According to one or more embodiments of the present disclosure, Example 3 provides an audio-video synchronization detection method, further including:
optionally, the audio duration corresponding to the first audio data is a first preset duration, and the audio duration corresponding to the second audio data is a second preset duration, where the first preset duration is greater than the second preset duration.
According to one or more embodiments of the present disclosure, Example 4 provides an audio-video synchronization detection method, further including:
optionally, the first audio data is audio data of the first preset duration, centered on the playback timestamp of the first key frame;
the second audio data is audio data of the second preset duration, taken with the playback timestamp of the target second key frame as a preset reference moment, where the preset reference moment is one of the following:
a start moment, a center moment, or an end moment.
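The three reference-moment options translate directly into window boundaries. The sketch below is illustrative only; the function name, the string labels, and the millisecond unit are hypothetical, and it does not clamp the window to the bounds of the audio stream.

```python
def audio_window_ms(timestamp_ms, duration_ms, reference="center"):
    """Return (start_ms, end_ms) of an audio window placed relative to a frame timestamp.

    reference selects whether the timestamp is the window's start moment,
    center moment, or end moment.
    """
    if reference == "start":
        start = timestamp_ms
    elif reference == "center":
        start = timestamp_ms - duration_ms / 2
    elif reference == "end":
        start = timestamp_ms - duration_ms
    else:
        raise ValueError("reference must be 'start', 'center', or 'end'")
    return start, start + duration_ms
```

With a key frame at 10000 ms, a 4000 ms first window centered on it spans 8000 to 12000 ms, while a 1000 ms second window that starts at the frame spans 10000 to 11000 ms. Keeping the first preset duration larger than the second leaves room to search for the shorter clip inside the longer one.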
According to one or more embodiments of the present disclosure, Example 5 provides an audio-video synchronization detection method, further including:
optionally, matching the first audio data against the second audio data to determine the audio-video synchronization detection result of the first video includes:
determining, from the first audio data, target audio data that has the second preset duration and matches the second audio data;
determining an audio-video offset between the target audio data and the first key frame, taking the second audio data and the target second key frame as the audio-video synchronization reference;
determining the audio-video synchronization detection result of the first video based on the audio-video offset corresponding to the first key frame.
According to one or more embodiments of the present disclosure, Example 6 provides an audio-video synchronization detection method, further including:
optionally, determining, from the first audio data, the target audio data that has the second preset duration and matches the second audio data includes:
sliding over the first audio data based on a preset sliding duration and the second preset duration, and obtaining, for each slide, the current audio data having the second preset duration;
determining the audio similarity between the current audio data and the second audio data;
determining, based on the audio similarity corresponding to the current audio data, the target audio data that matches the second audio data.
According to one or more embodiments of the present disclosure, Example 7 provides an audio-video synchronization detection method, further including:
optionally, determining the audio-video offset between the target audio data and the first key frame, taking the second audio data and the target second key frame as the audio-video synchronization reference, includes:
determining a reference time difference, representing the synchronized case, based on the start timestamp of the second audio data and the playback timestamp of the target second key frame;
determining a target time difference corresponding to the first key frame based on the start timestamp of the target audio data and the playback timestamp of the first key frame;
determining the audio-video offset corresponding to the first key frame based on the target time difference and the reference time difference.
According to one or more embodiments of the present disclosure, Example 8 provides an audio-video synchronization detection method, further including:
optionally, determining the audio-video synchronization detection result of the first video based on the audio-video offset corresponding to the first key frame includes:
if the audio-video offset corresponding to every first video frame in the first video is within a preset allowable range, determining that the audio-video synchronization detection result of the first video is synchronized;
if the audio-video offset corresponding to at least one first video frame in the first video is outside the preset allowable range, determining that the audio-video synchronization detection result of the first video is out of sync.
According to one or more embodiments of the present disclosure, Example 9 provides an audio-video synchronization detection apparatus, including:
a first video acquisition module, configured to acquire a first video to be detected, where the first video is a video obtained by transcoding a second video whose audio and video are synchronized;
a key frame alignment module, configured to extract a first key frame from the first video and a second key frame from the second video, align the first key frame with the second key frame, and determine a target second key frame aligned with the first key frame;
an audio data extraction module, configured to extract first audio data corresponding to the first key frame and second audio data corresponding to the target second key frame;
an audio-video synchronization detection module, configured to match the first audio data against the second audio data to determine the audio-video synchronization detection result of the first video.
The above description is merely of preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present disclosure.
In addition, although the operations are depicted in a specific order, this should not be understood as requiring that the operations be performed in the specific order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Similarly, although several specific implementation details are included in the above discussion, these should not be interpreted as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment can also be implemented in multiple embodiments individually or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims.
Claims (11)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310813430.4A CN116708892A (en) | 2023-07-04 | 2023-07-04 | Sound and picture synchronous detection method, device, equipment and storage medium |
CN202310813430.4 | 2023-07-04 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2025007738A1 true WO2025007738A1 (en) | 2025-01-09 |
Family
ID=87839105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2024/099889 Pending WO2025007738A1 (en) | 2023-07-04 | 2024-06-18 | Audio-picture synchronization detection method and apparatus, and device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN116708892A (en) |
WO (1) | WO2025007738A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116708892A (en) * | 2023-07-04 | 2023-09-05 | 抖音视界有限公司 | Sound and picture synchronous detection method, device, equipment and storage medium |
CN116958331B (en) * | 2023-09-20 | 2024-01-19 | 四川蜀天信息技术有限公司 | Sound and picture synchronization adjusting method and device and electronic equipment |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102056026A (en) * | 2009-11-06 | 2011-05-11 | 中国移动通信集团设计院有限公司 | Audio/video synchronization detection method and system, and voice detection method and system |
US20140153909A1 (en) * | 2012-12-03 | 2014-06-05 | Broadcom Corporation | Audio and video management for parallel transcoding |
US20180077445A1 (en) * | 2016-09-13 | 2018-03-15 | Facebook, Inc. | Systems and methods for evaluating content synchronization |
CN112434263A (en) * | 2020-10-15 | 2021-03-02 | 杭州安存网络科技有限公司 | Method and device for extracting similar segments of audio file |
CN112597794A (en) * | 2020-10-20 | 2021-04-02 | 季鹏飞 | Video matching method |
WO2022144543A1 (en) * | 2020-12-31 | 2022-07-07 | Darabase Limited | Audio synchronisation |
US11395031B1 (en) * | 2021-03-29 | 2022-07-19 | At&T Intellectual Property I, L.P. | Audio and video synchronization |
CN115412733A (en) * | 2021-05-26 | 2022-11-29 | 上海哔哩哔哩科技有限公司 | Video processing method and device |
CN115776588A (en) * | 2021-09-09 | 2023-03-10 | 北京字跳网络技术有限公司 | Video processing method, video processing apparatus, electronic device, video processing medium, and program product |
CN116708892A (en) * | 2023-07-04 | 2023-09-05 | 抖音视界有限公司 | Sound and picture synchronous detection method, device, equipment and storage medium |
-
2023
- 2023-07-04 CN CN202310813430.4A patent/CN116708892A/en active Pending
-
2024
- 2024-06-18 WO PCT/CN2024/099889 patent/WO2025007738A1/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN116708892A (en) | 2023-09-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9961398B2 (en) | | Method and device for switching video streams |
WO2025007738A1 (en) | | Audio-picture synchronization detection method and apparatus, and device and storage medium |
KR20210021099A (en) | | Establishment and use of temporal mapping based on interpolation using low-rate fingerprinting to facilitate frame-accurate content modification |
CN110267083B (en) | | Audio and video synchronization detection method, device, equipment and storage medium |
CN111064987B (en) | | Information display method and device and electronic equipment |
CN112601101A (en) | | Subtitle display method and device, electronic equipment and storage medium |
CN111163336B (en) | | Video resource pushing method and device, electronic equipment and computer readable medium |
CN113542888B (en) | | Video processing method and device, electronic equipment and storage medium |
US20240292044A1 (en) | | Method, apparatus, electronic device and storage medium for audio and video synchronization monitoring |
CN111669625A (en) | | Processing method, device and equipment for shot file and storage medium |
US9535450B2 (en) | | Synchronization of data streams with associated metadata streams using smallest sum of absolute differences between time indices of data events and metadata events |
WO2023098576A1 (en) | | Image processing method and apparatus, device, and medium |
CN113144620B (en) | | Method, device, platform, readable medium and equipment for detecting frame synchronous game |
CN114117127B (en) | | Video generation method, device, readable medium and electronic device |
CN112437289B (en) | | Switching time delay obtaining method |
CN118101926B (en) | | Video generation method, device, equipment and medium based on monitoring camera adjustment |
CN116033199B (en) | | Multi-device audio and video synchronization method, device, electronic device and storage medium |
CN117201894A (en) | | Media stream slicing method, device, system, equipment and storage medium |
CN112287171A (en) | | Information processing method and device and electronic equipment |
CN116842216A (en) | | Video dialogue question-answer data generation method and device, electronic equipment and medium |
WO2022218425A1 (en) | | Recording streaming method and apparatus, device, and medium |
US20210365688A1 (en) | | Method and apparatus for processing information associated with video, electronic device, and storage medium |
CN115474065B (en) | | Subtitle processing method, device, electronic device and storage medium |
CN116249004B (en) | | Video acquisition control method, device, equipment and storage medium |
CN120185749A (en) | | Radar and video fusion processing method, device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 24835221; Country of ref document: EP; Kind code of ref document: A1 |