
CN109819313A - Video processing method, device, and storage medium - Google Patents

Video processing method, device, and storage medium

Info

Publication number
CN109819313A
CN109819313A (application CN201910023976.3A)
Authority
CN
China
Prior art keywords
video
image
target
video image
facial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910023976.3A
Other languages
Chinese (zh)
Other versions
CN109819313B (en)
Inventor
Tian Yuan (田元)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Yayue Technology Co ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910023976.3A priority Critical patent/CN109819313B/en
Publication of CN109819313A publication Critical patent/CN109819313A/en
Application granted granted Critical
Publication of CN109819313B publication Critical patent/CN109819313B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Landscapes

  • Television Signal Processing For Recording (AREA)
  • Management Or Editing Of Information On Record Carriers (AREA)

Abstract

Embodiments of this application disclose a video processing method, a device, and a storage medium. The video processing method includes: obtaining dubbing audio data input by a user; obtaining multiple frames of video images from a video file; determining, from the multiple frames, an initial video image that contains a target face, and fusing the target face in the initial video image with a selected face image to obtain a target video image; and synthesizing the dubbing audio data with at least the target video image to obtain a combined audio-video file. By organically blending the user's dubbing, portrait, and other elements into the video production, this scheme increases the depth of user participation and the degree of personalization of the resulting video.

Description

Video processing method, device, and storage medium
Technical field
This application relates to the field of information processing technology, and in particular to a video processing method, device, and storage medium.
Background technique
With the development of the Internet and of mobile communication networks, and with the rapid growth in the processing and storage capabilities of terminals, vast numbers of applications have spread rapidly and come into wide use, video applications in particular.
Video refers to the family of techniques by which a sequence of still images is captured, recorded, processed, stored, transmitted, and reproduced as electrical signals. When the images change faster than a certain number of frames per second, the human eye can no longer distinguish the individual still pictures and instead perceives a smooth, continuous visual effect; such a continuous sequence of pictures is called a video. The maturity of network technology also allows recorded video clips to be published on the Internet as streaming media that computers can receive and play. In the related art, users may also perform operations such as clipping, recombining, and format conversion on video material.
Summary of the invention
Embodiments of this application provide a video processing method, device, and storage medium that can increase the depth of user participation and the degree of video personalization during video production.
An embodiment of this application provides a video processing method, including:
obtaining dubbing audio data input by a user;
obtaining multiple frames of video images from a video file;
determining, from the multiple frames of video images, an initial video image that contains a target face, and fusing the target face in the initial video image with a selected face image to obtain a target video image;
synthesizing the dubbing audio data with at least the target video image to obtain a combined audio-video file.
Correspondingly, an embodiment of this application further provides a video processing device, including:
an audio acquisition unit, configured to obtain dubbing audio data input by a user;
an image acquisition unit, configured to obtain multiple frames of video images from a video file;
a processing unit, configured to determine, from the multiple frames of video images, an initial video image that contains a target face, and to fuse the target face in the initial video image with a selected face image to obtain a target video image;
a synthesis unit, configured to synthesize the dubbing audio data with at least the target video image to obtain a combined audio-video file.
Correspondingly, an embodiment of this application further provides a storage medium storing a plurality of instructions suitable for being loaded by a processor to perform the steps of the video processing method described above.
In the embodiments of this application, while a video file is playing, dubbing audio data input by the user is first obtained, and multiple frames of video images are obtained from the video file. Then, an initial video image containing a target face is determined from the multiple frames, and the target face in the initial video image is fused with a selected face image to obtain a target video image. Finally, the dubbing audio data is synthesized with at least the target video image to obtain a combined audio-video file. This scheme organically blends the user's dubbing, portrait, and other elements into the video production, increasing the depth of user participation and the degree of video personalization.
Detailed description of the invention
To explain the technical solutions in the embodiments of this application more clearly, the accompanying drawings needed to describe the embodiments are briefly introduced below. The drawings described below are clearly only some embodiments of this application; a person skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is an architectural diagram of the video processing method provided by an embodiment of this application.
Fig. 2 is a flowchart of the video processing method provided by an embodiment of this application.
Fig. 3 is another flowchart of the video processing method provided by an embodiment of this application.
Fig. 4 is a schematic diagram of an application scenario of the video processing method provided by an embodiment of this application.
Fig. 5 is another architectural diagram of the video processing method provided by an embodiment of this application.
Figs. 6a-6e are schematic diagrams of interface interactions of the video processing method provided by an embodiment of this application.
Fig. 7 is another architectural diagram of the video processing method provided by an embodiment of this application.
Fig. 8 is a structural diagram of the video processing device provided by an embodiment of this application.
Fig. 9 is another structural diagram of the video processing device provided by an embodiment of this application.
Fig. 10 is a structural diagram of the terminal provided by an embodiment of this application.
Specific embodiment
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the accompanying drawings. The described embodiments are clearly only some, not all, of the embodiments of this application. All other embodiments obtained by a person skilled in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
The embodiment of the present application provides a kind of method for processing video frequency, device and storage medium.
The video processing device may be integrated into a terminal that has a storage unit and a microprocessor with computing capability, such as a tablet PC (Personal Computer) or a mobile phone. Taking integration into a mobile phone as an example, and referring to Fig. 1: while the phone plays a video file, it obtains the dubbing audio data input by the user and, at time intervals determined by a preset frame rate, captures the video being played, extracting frames from the video as images. Next, a target face to be processed is determined from the multiple frames of video images, and the target face is fused with the face image chosen by the user to obtain the processed target video images (i.e., face-fusion images). The processed target video images are then encoded into a video stream, and the audio data is encoded into an audio stream. Finally, the video stream and the audio stream are synthesized and output, yielding a combined audio-video file.
Each of these steps is described in detail below. Note that the numbering of the following embodiments does not imply any preferred order among them.
An embodiment of this application provides a video processing method, including: obtaining dubbing audio data input by a user; obtaining multiple frames of video images from a video file; determining, from the multiple frames of video images, an initial video image that contains a target face, and fusing the target face in the initial video image with a selected face image to obtain a target video image; and synthesizing the dubbing audio data with at least the target video image to obtain a combined audio-video file.
Referring to Fig. 2, Fig. 2 is a flowchart of the video processing method provided by an embodiment of this application. The method may proceed as follows:
101. Obtain dubbing audio data input by a user.
Specifically, the dubbing audio data may be obtained from user input while a video file is playing, or may be recorded by the user in advance. For example, the dubbing audio data may be voice information recorded in real time through the terminal's microphone, receiver, or similar device while the video file plays. The video file may be a fully silenced version (i.e., the video file contains no audio data), a partially silenced version (the video file retains only part of the audio data), or a non-silenced version.
In this embodiment, the dubbing audio data may include user audio data, original audio data, and background audio data. The user audio data is the user's actual voice recorded for a specific film or television role, or narration the user records for the content; the original audio data is the original sound of the non-dubbed roles; the background audio data is the production's background sound. For example, if the video file contains role A and role B, then when the video file plays, the dubbing audio data may retain the original voice of role A while the dubbing user records the lines of the specific role B. In addition, the dubbing audio data may include background sound such as background music and background sound effects.
102. Obtain multiple frames of video images from the video file.
In this embodiment of the application, the video file contains at least one role with a face image. By performing frame extraction on the video file, multiple frames of video images can be obtained from it.
103. Determine, from the multiple frames of video images, an initial video image containing a target face, and fuse the target face in the initial video image with the selected face image to obtain a target video image.
Here, the target face in the initial video image may be the face of the specific role the user wants to dub, and the selected face image may be a face in a photo the user chooses from the photo album, or a face captured directly with the camera.
Face fusion means replacing or covering the target face image with the selected face image, or deforming a face based on the features of both the target face and the selected face. In a specific implementation, the target face in the initial video image may first be detected to obtain its completeness, orientation, expression, and so on — for example, whether the target face is occluded, whether it faces the lens sideways or frontally, and whether it is shouting or crying. After this information is obtained, the selected face image is processed accordingly: when the target face is occluded, a matching occlusion is applied to the selected face image; when the target face is in profile, a matching profile image is derived from the selected face image; when the target face is crying, crying-image processing is applied to the selected face image. In this way the selected face image blends into the video image more naturally, yielding a more natural target video image.
The selected face image may be a face image captured with the phone's camera or a face image already stored on the phone. In practice, the selected face image may be the face image of the dubbing user mentioned above, while the target face contained in the initial video image is the face of one or more roles in the captured frames. The dubbing user's face image is then fused with the target face in the initial video image to obtain a face-fusion image carrying features of both; the face-fusion image replaces the target face in the initial video image, producing the processed target video image.
In some embodiments, before the multiple frames of video images are obtained from the video file, the following flow may also be performed:
parsing the video file and extracting at least one face image from it;
receiving a sample selection instruction from the user, and choosing a sample face image from the at least one face image based on the sample selection instruction.
Specifically, the terminal can intelligently parse the video file to recognize the face images of all roles that appear in the video material (limited to roles with faces) and match each role to an identity. The recognized roles are then presented on the display interface for the user to choose from. Finally, the user selects the one or more roles whose faces are to be replaced.
It should be understood that some of the captured frames may not contain the target face image. Therefore, to improve the efficiency of face fusion, the frames that contain the target face image can be filtered out of the multiple frames, so that the subsequent face-fusion operation is performed only on the filtered frames. The step of determining, from the multiple frames of video images, the initial video image containing the target face may then include the following flow:
capturing the face in each of the multiple frames of video images;
judging whether a video image contains a target face that matches the sample face image;
if so, taking that video image as an initial video image.
Specifically, the face image of the role selected by the user is matched against the faces captured in each frame, so as to filter out the video images in the multiple frames that require face replacement.
In a specific implementation, a residual network such as a deep Residual Network (ResNet) can be built to detect face positions in a single video frame, i.e., to find every face position in the single-frame image. Face keypoint detection is then performed, and identities are matched through the face keypoint locations. For example, determining the target face image from the multiple frames of video images may include the following steps:
(11) build a 29-layer ResNet network;
(12) extract face features based on histograms of oriented gradients;
(13) train the network on 3,000,000 pictures to complete training;
(14) compute the face keypoint features of the roles detected in the multiple frames of video images;
(15) retrieve the detected face image in the database;
(16) return the matched face identity.
The more layers the ResNet network has, the higher the recognition accuracy. In this embodiment the number of layers can be set according to the actual situation and is not limited to the 29 layers above.
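Steps (14)-(16) amount to nearest-neighbour retrieval over stored face descriptors. The sketch below is a minimal stand-in under stated assumptions: Euclidean distance, a fixed acceptance threshold, and 4-dimensional toy vectors in place of real keypoint or ResNet features; the function name `match_identity` and the threshold value are illustrative, not from the patent.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_identity(query_embedding, database, threshold=0.6):
    """Return the identity whose stored descriptor is nearest to the
    query, or None when no descriptor lies within the threshold."""
    best_id, best_dist = None, float("inf")
    for identity, embedding in database.items():
        d = euclidean(query_embedding, embedding)
        if d < best_dist:
            best_id, best_dist = identity, d
    return best_id if best_dist <= threshold else None

# Toy 4-d "descriptors" standing in for learned face features.
db = {"role_A": [0.1, 0.2, 0.3, 0.4], "role_B": [0.9, 0.8, 0.7, 0.6]}
print(match_identity([0.12, 0.19, 0.31, 0.41], db))  # role_A
```

An unmatched query (distance above the threshold) returns `None`, which corresponds to a frame whose faces belong to none of the registered roles.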
104. Synthesize the dubbing audio data with at least the target video image to obtain a combined audio-video file.
Specifically, synthesis means superimposing and encoding the dubbing audio data with at least the target video image, thereby obtaining the combined audio-video file.
In some embodiments, the playing duration of the resulting combined audio-video file may equal the duration of the initial video file; that is, the combined file includes both the target video images containing the face and the other images that do not, and the dubbing audio data is superimposed and encoded with the target video images so that the resulting file has a richer storyline.
In other embodiments, the playing duration of the resulting combined audio-video file may be shorter than the duration of the initial video file; that is, the combined file may include only the target video images containing the face, and superimposing and encoding the dubbing audio data with those images gives the resulting file the effect of a dubbing cut dedicated to the specific role.
The video processing method provided by this embodiment obtains dubbing audio data input by a user; obtains multiple frames of video images from a video file; determines, from the multiple frames, an initial video image containing a target face and fuses the target face in the initial video image with a selected face image to obtain a target video image; and synthesizes the dubbing audio data with at least the target video image to obtain a combined audio-video file. This scheme organically blends the user's dubbing, portrait, and other elements into the video production, increasing the depth of user participation and the degree of video personalization.
On the basis of the above embodiments, some steps are described further below.
Referring to Fig. 3, in practice the captured dubbing audio and the processed video images need to be re-encoded and synthesized into the output audio-video file. With reference also to Fig. 1, in some embodiments the step of synthesizing the dubbing audio data with at least the target video image may include the following flow:
1041. update the multiple frames of video images based on the target video image;
1042. encode the updated multiple frames of video images to obtain a video stream;
1043. encode the dubbing audio data to obtain an audio stream;
1044. synthesize the video stream with the audio stream and output the result.
In this embodiment, the dubbing audio data, the target video images with replaced faces, and the frames of the multiple frames that were not filtered out are jointly synthesized into the output audio-video file, yielding fully matched audio and video.
The processed video images can be encoded in many ways, as long as the format is supported by the product system. For example, the processed video images can be encoded into a video stream based on video formats such as .mpg, .mpeg, .mp4, .rmvb, .wmv, .asf, .avi, or .asx, thereby packaging the processed frames into a video file.
In practice, the playing duration of the video stream can be controlled through different encoding modes; preferably, the playing duration can be kept within 15 seconds.
Likewise, the audio can be encoded in many ways, as long as the format is supported by the product system. For example, the dubbing audio data input by the user can be encoded into an audio stream based on audio formats such as .act, .mp3, .wma, or .wav, so that the audio stream is packaged into an audio file matching the video file.
In some embodiments, the time point of each frame or sampling point of the video stream and the audio stream can be computed separately, and the encoded video stream and audio stream are played and output simultaneously through the audio-video synthesis system, thereby obtaining the combined audio-video file. That is, in some embodiments, the step of obtaining multiple frames of video images from the video file may include the following flow:
capturing video images from the video file at time intervals determined by a preset frame rate, obtaining the multiple frames of video images.
When the video is frame-extracted into images, the frame-rate interval may be set by the manufacturer or by a person skilled in the art. For example, the frame rate may be 20 frames per second or 50 frames per second, giving frame intervals of 50 milliseconds and 20 milliseconds respectively.
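Frame extraction at a fixed frame-rate interval reduces to computing evenly spaced grab timestamps. A minimal sketch of that arithmetic (the helper name is illustrative; real extraction would hand each timestamp to a decoder):

```python
def frame_capture_times(duration_s, fps):
    """Timestamps (seconds) at which frames are grabbed when a clip is
    sampled at a fixed frame-rate interval of 1/fps."""
    n = int(duration_s * fps)          # whole frames that fit in the clip
    return [i / fps for i in range(n)]

# A 1-second clip sampled at 20 fps -> 20 frames, 50 ms apart.
times = frame_capture_times(1.0, 20)
print(len(times), times[1] - times[0])  # 20 0.05
```

At 50 fps the same call would yield 50 timestamps 20 ms apart, matching the second example in the text.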
The step of encoding the dubbing audio data to obtain the audio stream may then include the following flow:
obtaining the number of video frames captured within a target time period, where the target time period runs from the start time to the end time of the dubbing audio input;
determining the total playing duration of the video stream;
computing the target playing duration of the dubbing audio data from the frame count and the total playing duration;
determining a sampling frequency based on the target playing duration and the duration of the target time period, and encoding the dubbing audio data at that sampling frequency to obtain the audio stream.
Note that in this embodiment the dubbing user may input multiple segments of audio data while the video file plays; the start time to end time above is the time span of one such segment.
Specifically, from the number of frames captured within the target time period and the total playing duration of the encoded video stream, the time those frames will take to play after encoding can be computed. If the audio stream and the video stream are to play simultaneously, the playing time of those frames after encoding must equal the target playing duration of the dubbing audio data; the computed playing time of the frames is therefore taken as the target playing duration of the dubbing audio data.
Once the target playing duration of the dubbing audio data and the duration of the target time period are known, the sampling frequency is determined by computing the ratio of the two, and the dubbing audio data is sample-encoded at that frequency. The dubbing audio data is thereby compressed so that audio and video play simultaneously, avoiding lip-sync mismatches between the role's mouth movements and the audio.
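The two computations above can be sketched as plain arithmetic. Two assumptions are made explicit here: frames are spread uniformly over the stream's total playing duration, and the ratio is applied so that a longer recording is compressed into a shorter slot (declared sample rate = base rate × recorded duration / target duration); both are readings of the text, not statements from the patent.

```python
def target_play_duration(frames_in_window, total_frames, total_play_s):
    """Seconds the frames captured in the dubbing window will occupy in
    the encoded stream, assuming frames are spread uniformly."""
    return total_play_s * frames_in_window / total_frames

def resample_rate(base_rate_hz, recorded_s, target_s):
    """Sampling rate that makes `recorded_s` of audio play back in
    `target_s` (the ratio method described in the text)."""
    return base_rate_hz * recorded_s / target_s

# 150 of 600 frames fall in the dubbing window of a 40 s stream,
# and the user's take was 12 s of 44.1 kHz audio.
target = target_play_duration(150, 600, 40.0)
rate = resample_rate(44100, 12.0, target)
print(target, rate)  # 10.0 52920.0
```

Declaring the higher rate makes the 12-second recording finish in 10 seconds, matching the playing time of the frames it accompanies.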
In some embodiments, the step of synthesizing the video stream with the audio stream and outputting the result may include the following flow:
determining, within the video stream, the playing start time point and end time point corresponding to the video images captured in the target time period;
setting the playing start time point and end time point of the audio stream to that playing start time point and end time point, and synthesizing the video stream with the audio stream for output.
Specifically, when outputting the data, the playing start and end time points of the video stream and the audio stream are synchronized, so that audio and video play simultaneously.
For example, if in a piece of video material the video images captured within the target time period correspond to a playing start time point of 00:00:05 and an end time point of 00:00:10, then the playing start and end time points of the audio stream are likewise set to 00:00:05 and 00:00:10.
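The synchronization step is just a copy of the video segment's boundaries onto the audio track. A minimal sketch reproducing the 00:00:05 / 00:00:10 example (the helper names are illustrative):

```python
def to_timecode(seconds):
    # Render whole seconds as an HH:MM:SS timecode string.
    h, rem = divmod(int(seconds), 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"

def sync_audio_to_video(video_start_s, video_end_s):
    """Give the audio stream the same start/end points as the video
    segment it accompanies."""
    return {"audio_start": to_timecode(video_start_s),
            "audio_end": to_timecode(video_end_s)}

print(sync_audio_to_video(5, 10))
# {'audio_start': '00:00:05', 'audio_end': '00:00:10'}
```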
In some embodiments, the step of fusing the target face in the initial video image with the selected face image may include the following flow:
detecting and locating the facial keypoints of the target face in the initial video image and the facial keypoints in the selected face image;
aligning the selected face image with the target face through an affine transformation;
updating the facial features of the target face based on the aligned face image.
Specifically, a cascaded residual regression-tree machine-learning algorithm, such as the Gradient Boosting Decision Tree (GBDT) algorithm, can be used to detect the facial keypoints. Taking GBDT as an example, the algorithm model is built as follows:
(21) using the true shapes of N training images, build and regress an initial shape;
(22) using pixel differences as features, split the tree structure so that each picture falls into a leaf node;
(23) for each leaf node, compute the differences between the shapes of all its pictures and the current tree shape, average them, and store the average in the leaf node;
(24) update the tree's shape using the values in the leaves;
(25) build enough subtrees until the GBDT tree shape represents the true shape.
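One boosting round of the cascade above can be sketched in a few lines. This toy keeps only the residual-averaging and shape-update logic of steps (23)-(24) and collapses the tree to a single leaf, omitting the pixel-difference splits of step (22); the function name and the learning rate are illustrative, not from the patent.

```python
def boost_round(shapes, truths, lr=0.5):
    """One cascade round: every sample falls into one leaf whose value
    is the mean residual (true shape - current estimate), and that leaf
    value then nudges every estimate toward the truth."""
    n_pts = len(shapes[0])
    residuals = [[t[i] - s[i] for i in range(n_pts)]
                 for s, t in zip(shapes, truths)]
    leaf = [sum(r[i] for r in residuals) / len(residuals)
            for i in range(n_pts)]
    return [[s[i] + lr * leaf[i] for i in range(n_pts)] for s in shapes]

shapes = [[0.0, 0.0], [0.0, 0.0]]   # initial mean shape per image
truths = [[2.0, 4.0], [2.0, 4.0]]   # ground-truth landmark coordinates
for _ in range(10):                 # "enough subtrees", step (25)
    shapes = boost_round(shapes, truths)
print([round(v, 3) for v in shapes[0]])  # close to [2.0, 4.0]
```

Each round shrinks the remaining error by the learning-rate factor, which is why a sufficient number of rounds makes the regressed shape represent the true shape.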
Once the algorithm model is built, it can detect the facial keypoints of the target face in the initial video image and of the selected face image. Then, based on the detected keypoint positions of the target face and of the selected face image, a Procrustes analysis is performed and the affine transformation matrix from the preset face image to the target face image is computed using least squares. The selected face is then translated, rotated, scaled, and otherwise transformed based on the resulting affine matrix, aligning the face positions of the target face in the initial video image and the preset face image so that the facial feature points of the two are brought close together. For example, referring to the preset face image a in Fig. 4, the transformed image d is obtained after the affine transformation.
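The least-squares affine estimation can be posed as a linear regression from source keypoints `[x, y, 1]` to target keypoints. This is a simplification under stated assumptions — plain least squares in place of a full Procrustes analysis — using NumPy:

```python
import numpy as np

def estimate_affine(src, dst):
    """Least-squares 2x3 affine matrix mapping src keypoints to dst."""
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    A = np.hstack([src, np.ones((len(src), 1))])  # rows of [x, y, 1]
    M, *_ = np.linalg.lstsq(A, dst, rcond=None)   # 3x2 solution
    return M.T                                    # 2x3 affine matrix

def apply_affine(M, pts):
    # Apply the 2x3 affine matrix to a list of 2-D points.
    pts = np.hstack([np.asarray(pts, float), np.ones((len(pts), 1))])
    return pts @ M.T

src = [[0, 0], [1, 0], [0, 1], [1, 1]]
dst = [[2, 3], [4, 3], [2, 5], [4, 5]]           # scale 2, shift (2, 3)
M = estimate_affine(src, dst)
print(np.allclose(apply_affine(M, src), dst))    # True
```

With real landmarks the fit is not exact, and the residual of the least-squares solve measures how well the selected face can be brought onto the target face by translation, rotation, and scaling.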
In some embodiments, the step of updating the facial features of the target face based on the aligned face image may include the following flow:
segmenting the face region based on the facial keypoints of the target face to obtain the facial-feature region of the target face;
processing the facial-feature region according to a preset algorithm to obtain a facial-feature template for the region;
fusing the aligned face image with the target face using the facial-feature template to obtain a face-fusion image.
Specifically, the geometric characteristics of the face can be used to extract facial feature points that are invariant to size, rotation, and translation — for example, the key feature positions of the eyes, nose, and lips. For instance, nine feature points of the face may be chosen whose distribution is angle-invariant: the two eyeball centers, the four eye corners, the midpoint between the two nostrils, and the two mouth corners.
For example, in this embodiment a facial triangle contour template (i.e., an eye-mouth-nose template, serving as the facial-feature template c in Fig. 4) can be obtained from the facial feature points; this contour template is then used to trace the details of the input images, after which the two inputs — the preset face image and the target face image — are superimposed to complete the image fusion.
Referring to Fig. 4: a is the preset face image, b is the target face image, c is the face mask generated from the facial-feature region of the target face image b, d is the image obtained from the target image a after the affine transformation, and the final output is the fused image e.
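The mask-driven composition of d (the aligned source face) into b (the target frame) under mask c can be illustrated with a hard binary mask on toy 2x2 single-channel "images". This is only a structural sketch of the superposition — a real pipeline would feather the mask edge and blend rather than hard-switch pixels:

```python
def fuse(aligned_face, target_frame, mask):
    """Composite the aligned source face into the frame: inside the
    facial-feature mask take the source pixel, outside keep the frame."""
    return [[aligned_face[y][x] if mask[y][x] else target_frame[y][x]
             for x in range(len(mask[0]))]
            for y in range(len(mask))]

face  = [[9, 9], [9, 9]]   # aligned source face (image d)
frame = [[1, 2], [3, 4]]   # target frame pixels (image b)
mask  = [[1, 0], [0, 1]]   # facial-feature mask (image c)
print(fuse(face, frame, mask))  # [[9, 2], [3, 9]]
```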
When extracting facial features, however, some edge information cannot be organized effectively, so traditional edge-detection operators cannot reliably extract the features of regions such as the eyes or lips. An algorithm such as the SUSAN operator can therefore be used to extract the facial features. The principle of the SUSAN operator is to use a circular region of pixels as a mask (the area covering the pixel positions) and, for each point in the face image, to examine how consistent the pixel values of all points within that region are with the pixel value of the current point.
In some embodiments, face template is being utilized, by the facial image and target face fusion after alignment, is obtaining people Can also include following below scheme after face blending image:
Calculate the pixel value difference of facial characteristics between target face and the facial image of selection;
Color adjustment parameter is generated according to pixel value difference;
Face fusion image is adjusted based on color adjustment parameter.
Wherein, color adjustment parameter is specifically as follows the difference value between pixel rgb value.
Specifically, due to the target face in selected facial image and initial video image in the colour of skin there may be Larger difference causes after facial image merges, replacement region and original human face region to merge boundary sawtooth effect more bright It is aobvious.Therefore, it is necessary to reduce edge sawtooth effect by adjusting the pixel value difference between integration region and original region, with enhancing Facial degrees of fusion.
For example, in some embodiments, the pixel value difference can be reduced through a blur effect, implemented as follows:
(31) calculating the pixel value difference of facial features between the target face and the selected facial image;
(32) computing a blur parameter from the pixel value difference;
(33) reducing the pixel value difference between the target face and the selected facial image by Gaussian blur.
Through the above operations, the skin tone of the fused region is modified to more closely match the facial skin color of the target face.
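A minimal one-dimensional sketch of steps (31)–(33) above, under the assumption that the seam-softening is done by a Gaussian-weighted blend of the fused pixels toward the original pixels near the fusion boundary (the patent does not specify the exact formula; all names are hypothetical):

```python
import math

def soften_seam(orig, fused, seam, sigma=2.0):
    """Blend fused pixel values toward the original near the seam index,
    with Gaussian weights, reducing the abrupt pixel value difference."""
    out = []
    for i, (o, f) in enumerate(zip(orig, fused)):
        w = math.exp(-((i - seam) ** 2) / (2 * sigma ** 2))  # 1 at seam, ->0 away
        out.append(f + w * (o - f) * 0.5)  # pull halfway toward original at seam
    return out

# One row of pixels: original skin tone 120, fused region steps to 180 at index 5
orig = [120] * 10
fused = [120] * 5 + [180] * 5
smooth = soften_seam(orig, fused, seam=5)
step_before = abs(fused[5] - fused[4])
step_after = abs(smooth[5] - smooth[4])
print(step_before, step_after)  # the hard 60-level step at the seam is reduced
```

In a real implementation this blend would be applied in two dimensions along the mask boundary, with the blur radius (sigma) derived from the measured pixel value difference.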
With reference to Fig. 5, Figs. 6a–6e, and Fig. 7: Fig. 5 is another architecture diagram of the video processing method provided by the embodiments of the present application; Figs. 6a–6e are interface interaction diagrams of the video processing method provided by the embodiments of the present application; Fig. 7 is yet another architecture diagram of the video processing method provided by the embodiments of the present application.
First, the user can log in, through the account login interface, with an account registered in the dubbing application to enter the main dubbing interface. As shown in Fig. 6a, when the user opens the main dubbing interface, popular materials and other materials can be displayed in the current interface. The user can click the display control of a material to trigger selection of that video material for video preview, or directly enter the dubbing stage. In addition, the main interface may also include a search bar; by entering keywords in the search bar, matching video materials can be found in the video material library, which improves the retrieval speed of video materials.
With reference to Fig. 6b, when a video material is selected for dubbing, face recognition can be performed on the video material to identify the characters in the video, and the video characters with faces are parsed out of the video material. In Fig. 6b, three video characters are parsed from the selected video material and their character images are displayed. In the embodiment of the present application, a visible or invisible selection control can be set in a character image, through which the video character whose face is to be replaced can be chosen. For example, the selection icon in the upper right corner of the middle character image in Fig. 6b selects that video character.
In addition, an image addition interface can also be provided in the current interface, through which a replacement facial image can be added. In practical applications, the replacement face material can be added from the local gallery through the image addition interface. In a specific implementation, the replacement face material is required to be a frontal real human face (no head raised, lowered, or turned to the side) with the face unobstructed. If the added image does not meet these requirements, the process does not proceed to the next step, and a prompt is generated asking the user to add another image.
After the image selection is completed, a cloud algorithm can fuse, in the background, the locally added replacement facial image into the video character selected in the video material, obtaining a face fusion image.
In some embodiments, to help the dubbing user dub accurately, the video file can be played while the user dubs, and the text corresponding to the lines of each character can be displayed in the video playing interface, in order to prompt the dubbing user with the lines and avoid forgotten words. That is, the video processing method may further include the following flow:
obtaining a sample text;
displaying the sample text while obtaining the dubbing audio data input by the user.
The sample text can be text information edited in advance, and can be displayed in any font, size, color, or other text format. For example, with reference to the "subtitle" region in Fig. 6c, the sample text can be arranged at the dotted line. In addition, information such as playback progress and playing duration can be shown by a progress bar while the video material is playing.
Further, a line-by-line progress prompt can also be given to remind the user to be ready to dub. For example, the currently played subtitle can be marked by a color change.
In some embodiments, a text editing interface can also be provided in the dubbing interface, through which the user can edit and adjust the existing sample text to meet the text customization needs of certain users.
With continued reference to Fig. 5, in some embodiments, in order to set the atmosphere for the dubbing user, background music of a corresponding style can be added to the video file according to the video content, and the background music is played while the video file is playing, so that the dubbing user can immerse in the video plot as early as possible. That is, the video processing method may further include the following flow:
obtaining sample background audio data;
playing the sample background audio data while obtaining the dubbing audio data input by the user.
The background audio data can be pure music played by physical instruments (such as piano, violin, etc.) or electronic instruments, or mixed music with vocals and instruments. With continued reference to Fig. 6c, a music selection interface (such as the musical note icon control in Fig. 6c) can be provided in the dubbing interface, through which music of various styles can be selected from the background music library. The background music library can be audio data stored in the cloud, or audio data local to the terminal.
In practical applications, a recording control interface (such as the microphone icon control in Fig. 6c) can be set in the dubbing interface. Through the recording control interface, the terminal microphone can be invoked to receive the dubbing voice of the user, and functions such as starting, pausing, and resuming recording can be realized.
In the embodiment of the present application, operations such as face replacement, text display, and background music addition can optionally be enabled or skipped as needed.
With reference to Fig. 6 d, after the completion of dubbing, user can by current interface be arranged preview interface, Subtitle Demonstration interface, Interface, background music setting interface etc., the audio-video composite document recorded from main regulation is arranged in voice.After the completion of adjusting, It the video of face fusion of preview new, music and can dub, and audio-video synthesis text can be saved by the saving interface of setting Part.
Finally, with reference to Fig. 6e, the dubbed video work can be viewed on the user's personal homepage. In practical applications, this interface can be provided with a sharing interface, through which authorization can be granted to related social applications or platforms, and the recorded audio-video file can be shared to other social platforms.
Through this scheme, the user's dubbing, the user's portrait, and other elements are organically blended into the video production, improving the user's depth of participation and the degree of video personalization.
To facilitate better implementation of the video processing method provided by the embodiments of the present application, an apparatus based on the above video processing method (referred to as a processing apparatus) is also provided. The meanings of the terms are the same as in the above video processing method, and specific implementation details can refer to the description in the method embodiments.
Referring to Fig. 8, Fig. 8 is a structural diagram of the video processing apparatus provided by the embodiments of the present application. The processing apparatus may include an audio acquiring unit 301, an image acquisition unit 302, a processing unit 303, and a synthesis unit 304, specifically as follows:
an audio acquiring unit 301, configured to obtain dubbing audio data input by a user;
an image acquisition unit 302, configured to obtain multi-frame video images from a video file;
a processing unit 303, configured to determine an initial video image containing a target face from the multi-frame video images, and fuse the target face in the initial video image with a selected facial image to obtain a target video image;
a synthesis unit 304, configured to perform synthesis processing on the dubbing audio data and at least the target video image to obtain an audio-video composite file.
In some embodiments, with reference to Fig. 9, the video processing apparatus 300 may also include:
an extraction unit 305, configured to parse the video file and extract at least one facial image from it, before the multi-frame video images are obtained from the video file;
a selecting unit 306, configured to receive a sample selection instruction of the user, and choose a sample facial image from the at least one facial image based on the sample selection instruction.
The processing unit 303 may specifically be configured to:
capture the facial image in each frame of the multi-frame video images;
judge whether the video image contains a target face matching the sample facial image;
if so, take the video image as the initial video image.
In some embodiments, the processing unit 303 may specifically be configured to:
detect and locate the facial key points of the target face in the initial video image and the facial key points in the selected facial image;
align the selected facial image with the target face by affine transformation;
update the facial features of the target face based on the aligned facial image.
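The alignment step above can be sketched with three landmark correspondences (for example the two eyes and the mouth), which determine an affine transform exactly. This is an illustrative sketch, not the patent's implementation; the landmark choice and coordinates are hypothetical.

```python
def affine_from_points(src, dst):
    """Solve the affine transform mapping three source key points onto three
    destination key points exactly, via Cramer's rule.
    Returns (a, b, c, d, e, f) with x' = a*x + b*y + c, y' = d*x + e*y + f."""
    (x1, y1), (x2, y2), (x3, y3) = src
    det = x1 * (y2 - y3) - y1 * (x2 - x3) + (x2 * y3 - x3 * y2)

    def solve(v1, v2, v3):
        # Solves p*xi + q*yi + r = vi for the three points
        p = (v1 * (y2 - y3) - y1 * (v2 - v3) + (v2 * y3 - v3 * y2)) / det
        q = (x1 * (v2 - v3) - v1 * (x2 - x3) + (x2 * v3 - x3 * v2)) / det
        r = (x1 * (y2 * v3 - y3 * v2) - y1 * (x2 * v3 - x3 * v2)
             + v1 * (x2 * y3 - x3 * y2)) / det
        return p, q, r

    a, b, c = solve(dst[0][0], dst[1][0], dst[2][0])
    d, e, f = solve(dst[0][1], dst[1][1], dst[2][1])
    return a, b, c, d, e, f

def apply_affine(m, pt):
    a, b, c, d, e, f = m
    x, y = pt
    return (a * x + b * y + c, d * x + e * y + f)

# Selected-face landmarks mapped onto target-face landmarks (here a 2x scale)
src = [(30, 40), (70, 40), (50, 80)]     # left eye, right eye, mouth
dst = [(60, 80), (140, 80), (100, 160)]  # same landmarks on the target face
m = affine_from_points(src, dst)
print(apply_affine(m, (30, 40)))  # lands on the first target landmark
```

With more than three detected key points, the same transform would instead be fitted by least squares (e.g. `cv2.estimateAffinePartial2D` in OpenCV), and the whole selected facial image warped with it before fusion.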
In some embodiments, the synthesis unit 304 may include:
an updating subunit, configured to update the multi-frame video images based on the target video image;
a video coding subunit, configured to encode the updated multi-frame video images to obtain a video code stream;
an audio coding subunit, configured to encode the dubbing audio data to obtain an audio code stream;
a synthesizing subunit, configured to synthesize the video code stream with the audio code stream for output.
In some embodiments, the image acquisition unit 302 may specifically be configured to:
capture video images from the video file at time intervals determined by a predetermined frame rate, obtaining multi-frame video images;
and the audio coding subunit may be configured to:
obtain the number of video image frames captured in a target time period, wherein the target time period is the time from the start of dubbing audio data input to the end of input;
determine the total playing duration of the video code stream;
calculate the target playing duration of the dubbing audio data according to the frame number and the total playing duration;
determine a sampling frequency based on the target playing duration and the duration corresponding to the target time period, and encode the dubbing audio data based on the sampling frequency.
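The arithmetic behind these steps can be sketched as follows, under the assumption (not stated explicitly in the text) that the target playing duration is the dub's proportional share of the total video duration, and that the sampling frequency is scaled so the recorded audio plays back for exactly that duration. All names and numbers are hypothetical.

```python
def target_sample_rate(frames_in_period, total_frames, total_duration_s,
                       record_duration_s, base_rate=44100):
    """Sketch of the audio-encoding step: the frames captured while the user
    dubbed determine the slice of video the dub must cover, and the sampling
    frequency is scaled so the recording plays for exactly that duration."""
    # Target playing duration: the dub's share of the total video duration
    target_duration_s = total_duration_s * frames_in_period / total_frames
    # Samples recorded at base_rate for record_duration_s, declared at this
    # rate, play back for exactly target_duration_s
    rate = base_rate * record_duration_s / target_duration_s
    return target_duration_s, round(rate)

# 25 fps video, 10 s total (250 frames); the user dubbed for 4 s, during
# which 120 frames were captured, so the dub must cover 4.8 s of video
dur, rate = target_sample_rate(120, 250, 10.0, 4.0)
print(dur, rate)  # 4.8 s target; rate below 44100, stretching the audio
```

Declaring a lower sample rate than the recording rate stretches the audio (here 4 s of recording over 4.8 s of video); a higher rate would compress it.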
In the video processing apparatus provided by the embodiments of the present application, the audio acquiring unit 301 obtains the dubbing audio data input by the user; the image acquisition unit 302 obtains multi-frame video images from a video file; the processing unit 303 determines an initial video image containing the target face from the multi-frame video images, and fuses the target face in the initial video image with the selected facial image to obtain a target video image; the synthesis unit 304 performs synthesis processing on the dubbing audio data and at least the target video image to obtain an audio-video composite file. Through this scheme, the user's dubbing, the user's portrait, and other elements are organically blended into the video production, improving the user's depth of participation and the degree of video personalization.
The embodiment of the present application further provides a terminal. As shown in Fig. 10, the terminal may include a radio frequency (RF) circuit 601, a memory 602 including one or more computer-readable storage media, an input unit 603, a display unit 604, a sensor 605, an audio circuit 606, a Wireless Fidelity (WiFi) module 607, a processor 608 including one or more processing cores, a power supply 609, and other components. Those skilled in the art will appreciate that the terminal structure shown in Fig. 10 does not constitute a limitation on the terminal, which may include more or fewer components than shown, combine certain components, or have a different component arrangement. Wherein:
The RF circuit 601 can be used to receive and send signals during the transmission and reception of information or during a call. In particular, after receiving downlink information from a base station, it passes the information to one or more processors 608 for processing; in addition, it sends uplink data to the base station. In general, the RF circuit 601 includes, but is not limited to, an antenna, at least one amplifier, a tuner, one or more oscillators, a subscriber identity module (SIM) card, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. The RF circuit 601 can also communicate with networks and other devices by wireless communication. The wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 602 can be used to store software programs and modules. The processor 608 executes various functional applications and performs data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area, wherein the program storage area can store an operating system, application programs required by at least one function (such as a sound playing function, an image playing function), and the like; the data storage area can store data created during use of the terminal (such as audio data, a phone book, etc.). In addition, the memory 602 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage devices. Accordingly, the memory 602 may also include a memory controller to provide the processor 608 and the input unit 603 with access to the memory 602.
The input unit 603 can be used to receive input numeric or character information, and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, in one embodiment, the input unit 603 may include a touch-sensitive surface and other input devices. The touch-sensitive surface, also called a touch display screen or touchpad, can collect touch operations by the user on or near it (such as operations performed on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset formula. Optionally, the touch-sensitive surface may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 608, and can receive and execute commands sent by the processor 608. The touch-sensitive surface can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch-sensitive surface, the input unit 603 may also include other input devices, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 604 can be used to display information input by the user or information provided to the user, as well as various graphical user interfaces of the terminal, which can be composed of graphics, text, icons, video, and any combination thereof. The display unit 604 may include a display panel, which can optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch-sensitive surface can cover the display panel; after detecting a touch operation on or near it, the touch-sensitive surface transmits it to the processor 608 to determine the type of the touch event, and the processor 608 then provides a corresponding visual output on the display panel according to the type of the touch event. Although in Fig. 10 the touch-sensitive surface and the display panel realize the input and output functions as two independent components, in some embodiments the touch-sensitive surface and the display panel can be integrated to realize the input and output functions.
The terminal may also include at least one sensor 605, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel according to the brightness of the ambient light, and the proximity sensor can turn off the display panel and/or the backlight when the terminal is moved to the ear. As one kind of motion sensor, a gravity acceleration sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that identify phone posture (such as horizontal/vertical screen switching, related games, magnetometer pose calibration), vibration-recognition-related functions (such as a pedometer, tapping), and so on. Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor can also be configured for the terminal, which will not be described here.
The audio circuit 606, a speaker, and a microphone can provide an audio interface between the user and the terminal. The audio circuit 606 can convert the received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts the collected sound signal into an electrical signal, which is received by the audio circuit 606 and converted into audio data; after the audio data is output to the processor 608 for processing, it is sent through the RF circuit 601 to, for example, another terminal, or output to the memory 602 for further processing. The audio circuit 606 may also include an earphone jack to provide communication between a peripheral earphone and the terminal.
WiFi is a short-range wireless transmission technology. Through the WiFi module 607, the terminal can help the user send and receive e-mails, browse web pages, access streaming media, and so on, providing the user with wireless broadband Internet access. Although Fig. 10 shows the WiFi module 607, it can be understood that it is not an essential component of the terminal and can be omitted as needed without changing the essence of the invention.
The processor 608 is the control center of the terminal. It connects all parts of the entire device using various interfaces and lines, and executes the various functions of the terminal and processes data by running or executing the software programs and/or modules stored in the memory 602 and invoking the data stored in the memory 602, thereby monitoring the device as a whole. Optionally, the processor 608 may include one or more processing cores. Preferably, the processor 608 can integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interfaces, application programs, and so on, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 608.
The terminal also includes a power supply 609 (such as a battery) that supplies power to all components. Preferably, the power supply can be logically connected to the processor 608 through a power management system, so that functions such as charging, discharging, and power consumption management are realized through the power management system. The power supply 609 may also include one or more DC or AC power sources, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and any other components.
Although not shown, the terminal may also include a camera, a Bluetooth module, and so on, which will not be described here. Specifically, in this embodiment, the processor 608 in the terminal loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and runs the application programs stored in the memory 602, thereby realizing various functions:
obtaining dubbing audio data input by the user; obtaining multi-frame video images from a video file; determining an initial video image containing a target face from the multi-frame video images, and fusing the target face in the initial video image with the selected facial image to obtain a target video image; performing synthesis processing on the dubbing audio data and at least the target video image to obtain an audio-video composite file.
In the embodiment of the present application, the dubbing audio data input by the user is obtained during playback of the video file; multi-frame video images are obtained from the video file; an initial video image containing the target face is determined from the multi-frame video images, and the target face in the initial video image is fused with the selected facial image to obtain a target video image; synthesis processing is performed on the dubbing audio data and at least the target video image to obtain an audio-video composite file. Through this scheme, the user's dubbing, the user's portrait, and other elements are organically blended into the video production, improving the user's depth of participation and the degree of video personalization.
Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be completed by instructions, or by instructions controlling the relevant hardware. The instructions can be stored in a computer-readable storage medium, and loaded and executed by a processor.
To this end, the embodiment of the present application provides a storage medium storing a plurality of instructions. The instructions can be loaded by a processor to execute the steps in any video processing method provided by the embodiments of the present application. For example, the instructions can execute the following steps:
obtaining dubbing audio data input by the user; obtaining multi-frame video images from a video file; determining an initial video image containing a target face from the multi-frame video images, and fusing the target face in the initial video image with the selected facial image to obtain a target video image; performing synthesis processing on the dubbing audio data and at least the target video image to obtain an audio-video composite file.
The specific implementation of each of the above operations can be found in the foregoing embodiments and will not be repeated here.
The storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or the like.
Since the instructions stored in the storage medium can execute the steps in any video processing method provided by the embodiments of the present application, they can achieve the beneficial effects achievable by any video processing method provided by the embodiments of the present application; see the foregoing embodiments for details, which are not repeated here.
A video processing method, apparatus, and storage medium provided by the embodiments of the present application have been introduced in detail above. Specific examples are used herein to illustrate the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those skilled in the art, there will be changes in the specific implementation and application scope according to the idea of the present application. In conclusion, the content of this specification should not be construed as a limitation of the present application.

Claims (15)

1. A video processing method, comprising:
obtaining dubbing audio data input by a user;
obtaining multi-frame video images from a video file;
determining an initial video image containing a target face from the multi-frame video images, and fusing the target face in the initial video image with a selected facial image to obtain a target video image;
performing synthesis processing on the dubbing audio data and at least the target video image to obtain an audio-video composite file.
2. The video processing method according to claim 1, before obtaining the multi-frame video images from the video file, further comprising:
parsing the video file and extracting at least one facial image from it;
receiving a sample selection instruction of the user, and choosing a sample facial image from the at least one facial image based on the sample selection instruction;
wherein determining the initial video image containing the target face from the multi-frame video images comprises:
capturing the face in each frame of the multi-frame video images;
judging whether the video image contains a target face matching the sample facial image;
if so, taking the video image as the initial video image.
3. The video processing method according to claim 1, wherein fusing the target face in the initial video image with the selected facial image comprises:
detecting and locating the facial key points of the target face in the initial video image and the facial key points in the selected facial image;
aligning the selected facial image with the target face by affine transformation;
updating the facial features of the target face based on the aligned facial image.
4. The video processing method according to claim 3, wherein updating the facial features of the target face based on the aligned facial image comprises:
dividing face regions based on the facial key points in the target face to obtain the facial feature regions of the target face;
processing the facial feature regions according to a preset algorithm to obtain a facial feature template of the facial feature regions;
fusing the aligned facial image with the target face using the facial feature template to obtain a face fusion image.
5. The video processing method according to claim 4, after fusing the aligned facial image with the target face using the facial feature template to obtain the face fusion image, further comprising:
calculating the pixel value difference of facial features between the target face and the selected facial image;
generating a color adjustment parameter according to the pixel value difference;
adjusting the face fusion image based on the color adjustment parameter.
6. The video processing method according to claim 1, wherein performing synthesis processing on the dubbing audio data and at least the target video image comprises:
updating the multi-frame video images based on the target video image;
encoding the updated multi-frame video images to obtain a video code stream;
encoding the dubbing audio data to obtain an audio code stream;
synthesizing the video code stream with the audio code stream for output.
7. The video processing method according to claim 6, wherein obtaining the multi-frame video images from the video file comprises:
capturing video images from the video file at time intervals determined by a predetermined frame rate to obtain the multi-frame video images;
and wherein encoding the dubbing audio data to obtain the audio code stream comprises:
obtaining the number of video image frames captured in a target time period, wherein the target time period is the time from the start of dubbing audio data input to the end of input;
determining the total playing duration of the video code stream;
calculating the target playing duration of the dubbing audio data according to the frame number and the total playing duration;
determining a sampling frequency based on the target playing duration and the duration corresponding to the target time period, and encoding the dubbing audio data based on the sampling frequency to obtain the audio code stream.
8. The video processing method according to claim 7, wherein synthesizing the video code stream with the audio code stream for output comprises:
determining, in the video code stream, the playing start time point and end time point corresponding to the video images captured in the target time period;
configuring the playing start time point and end time point as the playing start time point and end time point of the audio code stream, and synthesizing the video code stream with the audio code stream for output.
9. The video processing method according to any one of claims 1-8, further comprising:
obtaining sample background audio data and/or a sample text;
playing the video file while the user inputs the dubbing audio data, and simultaneously playing the sample background audio data and/or displaying the sample text.
10. A video processing apparatus, comprising:
an audio acquiring unit, configured to obtain dubbing audio data input by a user;
an image acquisition unit, configured to obtain multi-frame video images from a video file;
a processing unit, configured to determine an initial video image containing a target face from the multi-frame video images, and fuse the target face in the initial video image with a selected facial image to obtain a target video image;
a synthesis unit, configured to perform synthesis processing on the dubbing audio data and at least the target video image to obtain an audio-video composite file.
11. The video processing apparatus according to claim 10, wherein the apparatus further comprises:
an extraction unit, configured to parse the video file before it is played and extract at least one face image therefrom;
a selection unit, configured to receive a sample selection instruction from the user and choose a sample face image from the at least one face image based on the sample selection instruction;
wherein the processing unit is configured to:
capture the face in each of the multiple frames of video images;
determine whether a video image contains a target face matching the sample face image;
and if so, take the video image as an initial video image.
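The matching test of claim 11 is commonly implemented by comparing face embeddings. A hedged sketch using cosine similarity over precomputed embedding vectors; the embedding representation and the 0.6 threshold are assumptions, not specified by the patent:

```python
import numpy as np

def is_target_face(frame_embedding, sample_embedding, threshold=0.6):
    """Return True when the face captured in a video image matches the
    sample face image, judged by cosine similarity of embeddings."""
    a = np.asarray(frame_embedding, dtype=float)
    b = np.asarray(sample_embedding, dtype=float)
    cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return cos >= threshold
```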
12. The video processing apparatus according to claim 10, wherein the processing unit is further configured to:
detect and locate facial key points of the target face in the initial video image and facial key points in the selected face image;
align the selected face image with the target face through an affine transformation;
and update facial features of the target face based on the aligned face image.
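The alignment step of claim 12 maps the landmarks of the selected face onto the landmarks of the target face. A least-squares estimate of the affine transform is one standard way to do this; the patent does not prescribe how the transform is computed, so this NumPy-based sketch is illustrative:

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares affine transform mapping source landmarks (selected
    face) onto destination landmarks (target face): dst ~= src @ A + t."""
    src = np.asarray(src_pts, dtype=float)
    dst = np.asarray(dst_pts, dtype=float)
    # Homogeneous design matrix [x, y, 1] for each landmark.
    X = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(X, dst, rcond=None)
    return params[:2], params[2]  # 2x2 matrix A, translation vector t
```

Applying `src @ A + t` to every pixel coordinate of the selected face image warps it into the pose of the target face before its features are blended in.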
13. The video processing apparatus according to claim 10, wherein the synthesis unit comprises:
an updating subunit, configured to update the multiple frames of video images based on the target video images;
a video encoding subunit, configured to encode the updated multiple frames of video images to obtain a video code stream;
an audio encoding subunit, configured to encode the dubbing audio data to obtain an audio code stream;
a synthesizing subunit, configured to synthesize the video code stream with the audio code stream for output.
14. The video processing apparatus according to claim 13, wherein the image obtaining unit is configured to:
capture video images from the video file at a time interval corresponding to a predetermined frame rate, to obtain the multiple frames of video images;
and the audio encoding subunit is configured to:
obtain a count of the video images captured in a target time period, wherein the target time period is the period from the start time to the end time of the input of the dubbing audio data;
determine a total playing duration of the video code stream;
calculate a target playing duration of the dubbing audio data according to the frame count and the total playing duration;
and determine a sampling frequency based on the target playing duration and a duration of the target time period, and encode the dubbing audio data based on the sampling frequency.
15. A storage medium, wherein the storage medium stores a plurality of instructions, and the instructions are adapted to be loaded by a processor to perform the steps in the video processing method according to any one of claims 1 to 9.
CN201910023976.3A 2019-01-10 2019-01-10 Video processing method, device and storage medium Active CN109819313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910023976.3A CN109819313B (en) 2019-01-10 2019-01-10 Video processing method, device and storage medium

Publications (2)

Publication Number Publication Date
CN109819313A true CN109819313A (en) 2019-05-28
CN109819313B CN109819313B (en) 2021-01-08

Family

ID=66603283

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910023976.3A Active CN109819313B (en) 2019-01-10 2019-01-10 Video processing method, device and storage medium

Country Status (1)

Country Link
CN (1) CN109819313B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101504774A (en) * 2009-03-06 2009-08-12 暨南大学 Animation design engine based on virtual reality
US20100272417A1 (en) * 2009-04-27 2010-10-28 Masato Nagasawa Stereoscopic video and audio recording method, stereoscopic video and audio reproducing method, stereoscopic video and audio recording apparatus, stereoscopic video and audio reproducing apparatus, and stereoscopic video and audio recording medium
US9153031B2 (en) * 2011-06-22 2015-10-06 Microsoft Technology Licensing, Llc Modifying video regions using mobile device input
WO2014001095A1 (en) * 2012-06-26 2014-01-03 Thomson Licensing Method for audiovisual content dubbing
CN104092957A (en) * 2014-07-16 2014-10-08 浙江航天长峰科技发展有限公司 Method for generating screen video integrating image with voice
CN107330408A (en) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 Video processing method and device, electronic equipment and storage medium
CN107832741A (en) * 2017-11-28 2018-03-23 北京小米移动软件有限公司 The method, apparatus and computer-readable recording medium of facial modeling
CN108259788A (en) * 2018-01-29 2018-07-06 努比亚技术有限公司 Video editing method, terminal and computer readable storage medium
CN108765528A (en) * 2018-04-10 2018-11-06 南京江大搏达信息科技有限公司 Game charater face 3D animation synthesizing methods based on data-driven
CN108965740A (en) * 2018-07-11 2018-12-07 深圳超多维科技有限公司 A kind of real-time video is changed face method, apparatus, equipment and storage medium
CN109063658A (en) * 2018-08-08 2018-12-21 吴培希 A method of it is changed face using deep learning in multi-mobile-terminal video personage

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHONG, Qianli: "Research and Implementation of Automatic Face Replacement Technology in Images", China Master's Theses Full-text Database *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110266973A (en) * 2019-07-19 2019-09-20 腾讯科技(深圳)有限公司 Method for processing video frequency, device, computer readable storage medium and computer equipment
CN110363175A (en) * 2019-07-23 2019-10-22 厦门美图之家科技有限公司 Image processing method, device and electronic equipment
CN110543826A (en) * 2019-08-06 2019-12-06 尚尚珍宝(北京)网络科技有限公司 Image processing method and device for virtual wearing of wearable product
CN110807584A (en) * 2019-10-30 2020-02-18 维沃移动通信有限公司 Object replacement method and electronic equipment
CN110856014A (en) * 2019-11-05 2020-02-28 北京奇艺世纪科技有限公司 Moving image generation method, moving image generation device, electronic device, and storage medium
CN110868554A (en) * 2019-11-18 2020-03-06 广州华多网络科技有限公司 Method, device and equipment for changing faces in real time in live broadcast and storage medium
CN110868554B (en) * 2019-11-18 2022-03-08 广州方硅信息技术有限公司 Method, device and equipment for changing faces in real time in live broadcast and storage medium
CN110968736A (en) * 2019-12-04 2020-04-07 深圳追一科技有限公司 Video generation method and device, electronic equipment and storage medium
CN110968736B (en) * 2019-12-04 2021-02-02 深圳追一科技有限公司 Video generation method and device, electronic equipment and storage medium
CN111212245A (en) * 2020-01-15 2020-05-29 北京猿力未来科技有限公司 Method and device for synthesizing video
CN111353069A (en) * 2020-02-04 2020-06-30 清华珠三角研究院 Character scene video generation method, system, device and storage medium
CN111681676A (en) * 2020-06-09 2020-09-18 杭州星合尚世影视传媒有限公司 Method, system and device for identifying and constructing audio frequency by video object and readable storage medium
CN111681676B (en) * 2020-06-09 2023-08-08 杭州星合尚世影视传媒有限公司 Method, system, device and readable storage medium for constructing audio frequency by video object identification
CN111741326A (en) * 2020-06-30 2020-10-02 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
US11817127B2 (en) 2020-07-23 2023-11-14 Beijing Bytedance Network Technology Co., Ltd. Video dubbing method, apparatus, device, and storage medium
CN112040310A (en) * 2020-09-03 2020-12-04 广州优谷信息技术有限公司 Audio and video synthesis method and device, mobile terminal and storage medium
CN112562720A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Lip-synchronization video generation method, device, equipment and storage medium
CN112766215A (en) * 2021-01-29 2021-05-07 北京字跳网络技术有限公司 Face fusion method and device, electronic equipment and storage medium
CN112929746A (en) * 2021-02-07 2021-06-08 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
CN112929746B (en) * 2021-02-07 2023-06-16 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment
EP4284005A4 (en) * 2021-02-24 2024-07-17 Petal Cloud Technology Co., Ltd. VIDEO DUBBING METHOD, RELATED DEVICE AND COMPUTER-READABLE STORAGE MEDIUM
CN113238698A (en) * 2021-05-11 2021-08-10 北京达佳互联信息技术有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113395569A (en) * 2021-05-29 2021-09-14 北京优幕科技有限责任公司 Video generation method and device
CN114286171A (en) * 2021-08-19 2022-04-05 腾讯科技(深圳)有限公司 Video processing method, device, equipment and storage medium
CN113727187A (en) * 2021-08-31 2021-11-30 平安科技(深圳)有限公司 Animation video processing method and device based on skeleton migration and related equipment
CN113727187B (en) * 2021-08-31 2022-10-11 平安科技(深圳)有限公司 Animation video processing method and device based on skeleton migration and related equipment
CN113923515A (en) * 2021-09-29 2022-01-11 马上消费金融股份有限公司 Video production method and device, electronic equipment and storage medium
CN113965802A (en) * 2021-10-22 2022-01-21 深圳市兆驰股份有限公司 Immersive video interaction method, device, equipment and storage medium
CN114040248A (en) * 2021-11-23 2022-02-11 维沃移动通信有限公司 Video processing method and device and electronic equipment
CN114398517A (en) * 2021-12-31 2022-04-26 北京达佳互联信息技术有限公司 Video data acquisition method and device
CN114821717A (en) * 2022-04-20 2022-07-29 北京百度网讯科技有限公司 Target object fusion method and device, electronic equipment and storage medium
CN114821717B (en) * 2022-04-20 2024-03-12 北京百度网讯科技有限公司 Target object fusion method, device, electronic equipment and storage medium
CN116132711A (en) * 2023-02-03 2023-05-16 北京字跳网络技术有限公司 Method, device and electronic device for generating video template
CN117082188A (en) * 2023-10-12 2023-11-17 广东工业大学 Consistent video generation method and related devices based on Plucker analysis
CN117082188B (en) * 2023-10-12 2024-01-30 广东工业大学 Consistent video generation method and related device based on Plucker analysis

Also Published As

Publication number Publication date
CN109819313B (en) 2021-01-08

Similar Documents

Publication Publication Date Title
CN109819313A (en) Video processing method, device and storage medium
US12046037B2 (en) Adding beauty products to augmented reality tutorials
US10200634B2 (en) Video generation method, apparatus and terminal
KR101990536B1 (en) Method for providing information and Electronic apparatus thereof
US12488551B2 (en) Augmented reality beauty product tutorials
CN107944397A (en) Video recording method, device and computer-readable recording medium
AU2014200042B2 (en) Method and apparatus for controlling contents in electronic device
US12155961B2 (en) Video special effect generation method and terminal
CN105302315A (en) Image processing method and device
CN104461348B (en) Information choosing method and device
CN109064387A (en) Image special effect generation method, device and electronic equipment
US20180190325A1 (en) Image processing method, image processing apparatus, and program
CN110209879A (en) A kind of video broadcasting method, device, equipment and storage medium
CN115225756B (en) Method for determining target object, shooting method and device
CN114339076B (en) Video shooting method, device, electronic device and storage medium
CN109215655A (en) Method and mobile terminal for adding text to video
CN110502117B (en) Screenshot method in electronic terminal and electronic terminal
CN112118397B (en) Video synthesis method, related device, equipment and storage medium
CN111491124A (en) Video processing method, device and electronic device
CN109257649A (en) A kind of multimedia file generation method and terminal device
CN113014801A (en) Video recording method, video recording device, electronic equipment and medium
WO2023226699A1 (en) Video recording method and apparatus, and storage medium
CN113095163A (en) Video processing method and device, electronic equipment and storage medium
CN115209206B (en) Video editing method, device, equipment, storage medium and computer program product
WO2023226695A1 (en) Video recording method and apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221115

Address after: 1402, Floor 14, Block A, Haina Baichuan Headquarters Building, No. 6, Baoxing Road, Haibin Community, Xin'an Street, Bao'an District, Shenzhen, Guangdong 518101

Patentee after: Shenzhen Yayue Technology Co.,Ltd.

Address before: Floor 35, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen, Guangdong Province, 518057

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.