
CN119314497B - Model watermarking method, device, computer equipment and storage medium for speech synthesis system - Google Patents


Info

Publication number
CN119314497B
CN119314497B
Authority
CN
China
Prior art keywords
model
watermark
voice
sample
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411834367.3A
Other languages
Chinese (zh)
Other versions
CN119314497A (en)
Inventor
卢立
陈钱牛
赵小迪
陈锰
任奎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou High Tech Zone Binjiang Blockchain And Data Security Research Institute
Zhejiang University ZJU
Original Assignee
Hangzhou High Tech Zone Binjiang Blockchain And Data Security Research Institute
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou High Tech Zone Binjiang Blockchain And Data Security Research Institute and Zhejiang University ZJU
Priority to CN202411834367.3A
Publication of CN119314497A
Application granted
Publication of CN119314497B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 19/00: Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L 19/018: Audio watermarking, i.e. embedding inaudible data in the audio signal
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a model watermarking method, device, computer equipment and storage medium for a speech synthesis system. The method comprises: constructing in-domain watermark trigger samples based on a pre-trained speech synthesis model and a fine-tuned speaker recognition model; constructing a corresponding watermark implantation data set based on the watermark trigger samples; implanting the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain a labeled model; and screening out target watermark trigger samples based on the labeled model. The application solves the problem in the related art that the watermark is weakly correlated with the main speech synthesis task and is therefore easily lost when the model is fine-tuned. By constructing in-domain watermark trigger samples, strong coupling between the watermark trigger task and the model's main task is enforced at the data level, which improves the robustness of the watermark in the labeled model and keeps the target watermark trigger samples stable and effective.

Description

Model watermarking method, device, computer equipment and storage medium for speech synthesis system
Technical Field
The present application relates to the field of computer artificial intelligence security, and in particular, to a method, an apparatus, a computer device, and a storage medium for model watermarking for a speech synthesis system.
Background
In the field of artificial intelligence, and speech synthesis in particular, progress has been rapid. Modern speech synthesis systems rely on complex neural networks, large amounts of training data, and carefully designed optimization strategies, making trained models valuable intellectual assets. Because neural networks are easy to copy, copyright protection is challenging; once a model has been fine-tuned, attributing ownership of the original model becomes even harder, so model ownership and intellectual property protection have become a major industry concern. Against this background, model watermarking has attracted attention as a potential solution.
Current model watermarking techniques embed hidden marks in a model so that it produces watermarked output for specific inputs, providing a basis for copyright attribution. However, most existing techniques target classification tasks and transfer poorly to speech synthesis models; because the watermark is weakly correlated with the main speech synthesis task, it is easily lost when the model is fine-tuned.
No effective solution has yet been proposed for this problem in the related art, namely that the weak correlation between the watermark and the main speech synthesis task makes the watermark easy to lose during model fine-tuning.
Disclosure of Invention
In this embodiment, a model watermarking method, apparatus, computer device, and storage medium for a speech synthesis system are provided, in order to solve the problem in the related art that the watermark is weakly correlated with the main speech synthesis task and is therefore easily lost during model fine-tuning.
In a first aspect, in this embodiment, there is provided a model watermarking method for a speech synthesis system, including:
Constructing a watermark triggering sample in a domain based on a pre-trained voice synthesis model and a finely-tuned speaker recognition model, and constructing a corresponding watermark implantation data set based on the watermark triggering sample;
Implanting the watermark implantation data set into the voice synthesis model based on the finely tuned speaker recognition model to obtain a marked model;
And screening out a target watermark trigger sample based on the marking model.
In some of these embodiments, constructing intra-domain watermark trigger samples based on the pre-trained speech synthesis model and the fine-tuned speaker recognition model includes:
Acquiring the intra-domain speech data set corresponding to the pre-trained speech synthesis model in the speech synthesis system;
performing fine tuning on the pre-trained speaker recognition model based on the intra-domain speech data set to obtain a fine-tuned speaker recognition model;
Extracting timbre features from the intra-domain speech data set based on the fine-tuned speaker recognition model to obtain timbre features; clustering feature centers of the timbre features, and selecting source categories from the clustering result;
Constructing the in-domain watermark trigger samples based on the source categories.
In some embodiments, extracting timbre features from the intra-domain speech data set based on the fine-tuned speaker recognition model to obtain timbre features, clustering feature centers of the timbre features, and selecting a source category from the clustering result includes:
based on the intermediate layer representation of the finely tuned speaker recognition model, the intermediate layer representation is used as a tone characteristic vector of input voice, and tone characteristic extraction is carried out on voice samples in the intra-domain voice data set to obtain tone characteristics;
clustering feature centers of the tone features according to categories to obtain category centers;
And determining the source category according to the preset distance and the category center.
In some of these embodiments, constructing the watermark trigger samples in a domain based on the source class includes:
Randomly extracting a first voice sample from each source category;
Preprocessing each first voice sample to obtain a second voice sample aligned in the time domain;
and synthesizing each second voice sample through time domain weighted average to obtain a watermark trigger sample.
In some of these embodiments, implanting the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain a labeled model includes:
performing joint optimization on the voice synthesis model according to a preset total loss function and the finely tuned speaker recognition model to obtain a labeled model, wherein the total loss function consists of a preset first loss function, a preset second loss function and a pre-training loss function of the voice synthesis model;
In the joint optimization process, under the triggering of the first loss function in the total loss function, optimizing the voice synthesis model according to voice samples of the intra-domain voice data set and the fine-tuned speaker recognition model; and under the triggering of the second loss function in the total loss function, optimizing the speech synthesis model according to the watermark triggering sample of the watermark implantation data set and the finely tuned speaker recognition model.
In some of these embodiments, screening out the target watermark trigger samples based on the labeling model includes:
fine-tuning the labeled model to obtain a plurality of surrogate models;
testing the validity of the watermark trigger samples based on the plurality of surrogate models to obtain, for each watermark trigger sample, an average speaker similarity score with respect to a preset pseudo speaker;
and screening watermark trigger samples with average speaker similarity scores higher than a preset threshold value as target watermark trigger samples.
In some of these embodiments, the method further comprises:
After a suspicious model is obtained, inputting the watermark trigger sample into the suspicious model to obtain model output voice;
inputting the model output voice into the finely tuned speaker recognition model to obtain the model output speaker characteristics;
and comparing the similarity of the speaker characteristics of the model output speaker with the speaker characteristics of the pseudo speaker to obtain a watermark detection result.
In a second aspect, in this embodiment, a model watermarking apparatus for a speech synthesis system is provided, including a building module, an implanting module, and a screening module;
The construction module is used for constructing in-domain watermark trigger samples based on a pre-trained speech synthesis model and a fine-tuned speaker recognition model, and for constructing a corresponding watermark implantation data set based on the watermark trigger samples;
the implantation module is used for implanting the watermark implantation data set into the voice synthesis model based on the finely tuned speaker recognition model to obtain a marked model;
And the screening module is used for screening out a target watermark trigger sample based on the marking model.
In a third aspect, in this embodiment, there is provided a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the model watermarking method for a speech synthesis system according to the first aspect.
In a fourth aspect, in this embodiment, there is provided a storage medium having stored thereon a computer program which, when executed by a processor, implements the model watermarking method for a speech synthesis system according to the first aspect.
Compared with the related art, the model watermarking method, device, computer equipment and storage medium for a speech synthesis system provided by this embodiment construct in-domain watermark trigger samples based on a pre-trained speech synthesis model and a fine-tuned speaker recognition model, construct a corresponding watermark implantation data set based on the watermark trigger samples, implant the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain a labeled model, and screen out target watermark trigger samples based on the labeled model. This solves the problem in the related art that the watermark is weakly correlated with the main speech synthesis task and is therefore easily lost when the model is fine-tuned: constructing in-domain watermark trigger samples enforces strong coupling between the watermark trigger task and the model's main task at the data level, which improves the robustness of the watermark in the labeled model and keeps the target watermark trigger samples stable and effective.
The details of one or more embodiments of the application are set forth in the accompanying drawings and the description below to provide a more thorough understanding of the other features, objects, and advantages of the application.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
Fig. 1 is a hardware block diagram of a terminal device of a model watermarking method for a speech synthesis system according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for model watermarking for speech synthesis systems according to an embodiment of the present application;
FIG. 3 is a flow chart of watermark trigger samples in a build domain;
FIG. 4 is a flow chart for obtaining a tagged model;
FIG. 5 is a flow chart for screening out target watermark trigger samples;
FIG. 6 is a block diagram of a model watermarking apparatus for speech synthesis system according to an embodiment of the present application;
Fig. 7 is a block diagram of a model watermarking apparatus for a speech synthesis system according to another embodiment of the present application.
102, processor; 104, memory; 106, transmission device; 108, input/output device; 210, construction module; 220, implantation module; 230, screening module; 240, detection module.
Detailed Description
The present application will be described and illustrated with reference to the accompanying drawings and examples for a clearer understanding of the objects, technical solutions and advantages of the present application.
Unless defined otherwise, technical or scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terms "a," "an," "the," "these" and similar terms in this application are not intended to be limiting in number, but may be singular or plural. The terms "comprises," "comprising," "includes," "including," "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, and system, article, or apparatus that comprises a list of steps or modules (units) is not limited to the list of steps or modules (units), but may include other steps or modules (units) not listed or inherent to such process, method, article, or apparatus. The terms "connected," "coupled," and the like in this disclosure are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. The term "plurality" as used herein means two or more. "and/or" describes the association relationship of the association object, and indicates that three relationships may exist, for example, "a and/or B" may indicate that a exists alone, a and B exist simultaneously, and B exists alone. Typically, the character "/" indicates that the associated object is an "or" relationship. The terms "first," "second," "third," and the like, as referred to in this disclosure, merely distinguish similar objects and do not represent a particular ordering for objects.
The method embodiments provided in the present embodiment may be executed in a terminal, a computer, or similar computing device. For example, the method runs on a terminal, and fig. 1 is a block diagram of the hardware structure of the terminal of the model watermarking method for a speech synthesis system according to this embodiment. As shown in fig. 1, the terminal may include one or more (only one is shown in fig. 1) processors 102 and a memory 104 for storing data, wherein the processors 102 may include, but are not limited to, a microprocessor MCU, a programmable logic device FPGA, or the like. The terminal may also include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those skilled in the art that the structure shown in fig. 1 is merely illustrative and is not intended to limit the structure of the terminal. For example, the terminal may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to a model watermarking method for a speech synthesis system in this embodiment, and the processor 102 executes the computer program stored in the memory 104 to perform various functional applications and data processing, that is, implement the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. The network includes a wireless network provided by a communication provider of the terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as a NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
In this embodiment, a model watermarking method for a speech synthesis system is provided, and fig. 2 is a flowchart of the model watermarking method for a speech synthesis system in this embodiment, as shown in fig. 2, where the flowchart includes the following steps:
Step S210, constructing a watermark trigger sample in a domain based on a pre-trained voice synthesis model and a fine-tuned speaker recognition model, and constructing a corresponding watermark implantation data set based on the watermark trigger sample;
step S220, implanting a watermark implantation data set into a voice synthesis model based on the finely tuned speaker recognition model to obtain a marked model;
Step S230, screening out a target watermark trigger sample based on the marking model.
Specifically, the speech synthesis model is a model for speech synthesis that is pre-trained using the intra-domain speech data set as training samples. The intra-domain speech data set refers specifically to speech uttered by speakers who appear in the training samples; correspondingly, an out-of-domain speech data set refers to speech produced by speakers not included in that intra-domain speech data set. The speaker recognition model, used to recognize the speaker in a speech sample, is obtained by fine-tuning a pre-trained speaker recognition model on the intra-domain speech data set. The speech synthesis model and the speaker recognition model may be obtained from third-party open-source databases or networks, without limitation. The speech synthesis system includes text analysis, acoustic models, front-end and back-end processing, control interfaces, the speech synthesis model, the speaker recognition model, and so on, which are not described in detail here.
It should be appreciated that current model watermark triggering methods rely mainly on out-of-domain triggers: the watermark is triggered either by samples that do not appear in the in-domain training set or by a common trigger pattern injected into arbitrary speech samples. Watermark trigger samples constructed this way, however, are distributed far from the in-domain data in the feature space of the model's main task, i.e. they are poorly coupled to the main task, so the resulting model watermark is fragile and cannot survive model fine-tuning. This embodiment therefore constructs in-domain watermark trigger samples based on the pre-trained speech synthesis model and the fine-tuned speaker recognition model, enforcing strong coupling between the watermark trigger task and the model's main task at the data level, which improves the robustness of the watermark in the labeled model and keeps the target watermark trigger samples stable and effective. The way the trigger samples are generated is not limited: they may be generated from the intra-domain speech data set of the speech synthesis model, or the intra-domain speech data set may be combined with the data set originally used to fine-tune the speaker recognition model to form a new fine-tuning data set, the speaker recognition model re-fine-tuned on it, and the trigger samples then generated with the re-fine-tuned model.
Watermark implantation proceeds as follows: the intra-domain speech data set of the speech synthesis model is input into the fine-tuned speaker recognition model, and the speech synthesis model is optimized with a watermark-activation loss function so that the watermark is implanted into it. A multi-task learning strategy is used to fine-tune the speech synthesis model, training its main synthesis task and the watermark activation task synchronously, ensuring that the main synthesis task is not affected by the watermark implantation task, and yielding a labeled model with the watermark implanted. The main synthesis task refers to the speech synthesis task of the speech synthesis model itself; during watermarking, the original function of the speech synthesis model must remain unaffected, i.e. the watermark must be harmless.
After the labeled model is obtained, it is used to screen out the target watermark trigger samples. For example, the labeled model may be fine-tuned to construct several surrogate models, the validity of the watermark trigger samples tested on these surrogates, and the trigger samples with high average speaker similarity scores retained as target watermark trigger samples, further improving the stability and validity of the watermark information. In other embodiments, other screening conditions may be applied to the trigger samples processed by the labeled model, which is not limited.
In summary, this method constructs in-domain watermark trigger samples based on a pre-trained speech synthesis model and a fine-tuned speaker recognition model, constructs a corresponding watermark implantation data set based on the trigger samples, implants the data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain a labeled model, and screens out target watermark trigger samples based on the labeled model. This solves the problem in the related art that the watermark is weakly correlated with the main speech synthesis task and is therefore easily lost during model fine-tuning: constructing in-domain trigger samples enforces strong coupling between the watermark trigger task and the model's main task at the data level, improving the robustness of the watermark in the labeled model and keeping the target trigger samples stable and effective.
The following describes the above steps in detail:
In some of these embodiments, as shown in fig. 3, constructing in-domain watermark trigger samples in step S210 based on the pre-trained speech synthesis model and the fine-tuned speaker recognition model includes the following steps:
Step S211, acquiring the intra-domain speech data set corresponding to the pre-trained speech synthesis model in the speech synthesis system;
Step S212, fine-tuning the pre-trained speaker recognition model based on the intra-domain speech data set to obtain the fine-tuned speaker recognition model;
Step S213, extracting timbre features from the intra-domain speech data set based on the fine-tuned speaker recognition model to obtain timbre features, clustering feature centers of the timbre features, and selecting source categories from the clustering result;
Step S214, constructing the in-domain watermark trigger samples based on the source categories.
Specifically, the speech synthesis model accepts text and speech simultaneously as input and produces synthesized speech as output. The text provides the content of the synthesized speech, while the input speech supplies speaker-related characteristics such as timbre. The speech synthesis model is pre-trained on a large-scale intra-domain speech data set, which serves as its pre-training samples.
The speaker recognition model is obtained by training on a data set in a specific domain and may use a model structure such as Ecapa-TDNN, ERes2Net-v2, or X-vector. Fine-tuning the speaker recognition model means fine-tuning the pre-trained model on the intra-domain speech data set; the fine-tuning process optimizes only the shallow network parameters of the pre-trained speaker recognition model, which reduces training difficulty, improves training efficiency, and also reduces coupling between the speech synthesis model and the fine-tuned speaker recognition model across their respective tasks. Fine-tuning improves the model's ability to discriminate timbre similarity within the specific domain, so that it can better capture and distinguish the timbre characteristics of different speakers, providing technical support for watermark embedding. In other embodiments, other settings of the fine-tuning process, such as the update ratio or the loss function, may also be varied, which is not limited.
Because the speech content produced by the synthesis task of the speech synthesis model is difficult to measure with objective indexes, a speaker recognition model cascaded after the speech synthesis model is used to judge the timbre similarity of the synthesized speech, so that the watermark triggering result can be judged by timbre similarity. Feature centers of the different speaker categories are computed and clustered with a suitable clustering algorithm, and source categories are selected from the clustering result for constructing the watermark trigger samples. Suitable clustering algorithms include, but are not limited to, K-means, hierarchical clustering, and spectral clustering.
In this embodiment, the speaker recognition model cascaded after the speech synthesis model provides an objective metric for the timbre of the synthesized speech, improving the effectiveness of in-domain watermark trigger sample construction.
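As an illustrative, non-limiting sketch of the shallow-layer fine-tuning in step S212: the encoder class, its layer split, and all hyperparameters below are assumptions made for illustration, not part of the disclosed embodiment.

```python
import torch
from torch import nn, optim

class SpeakerEncoder(nn.Module):
    """Stand-in for a pre-trained speaker recognition model (e.g. an
    ECAPA-TDNN-style encoder); the shallow/deep split is an assumption."""
    def __init__(self, feat_dim: int = 80, emb_dim: int = 192):
        super().__init__()
        self.shallow = nn.Conv1d(feat_dim, 256, kernel_size=5, padding=2)  # early layers
        self.deep = nn.Sequential(
            nn.Conv1d(256, 256, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(256, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim, frames) acoustic features
        h = torch.relu(self.shallow(x))
        h = self.deep(h).squeeze(-1)                 # (batch, 256)
        return self.head(h)                          # timbre embedding

model = SpeakerEncoder()

# Freeze everything, then re-enable only the shallow block (step S212).
for p in model.parameters():
    p.requires_grad = False
for p in model.shallow.parameters():
    p.requires_grad = True

optimizer = optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-4)
```

Only the shallow block receives gradient updates during fine-tuning; the deeper layers keep their pre-trained weights, matching the description above.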
In some embodiments, step S213, in which timbre features are extracted from the intra-domain speech data set based on the fine-tuned speaker recognition model, feature centers of the timbre features are clustered, and source categories are selected from the clustering result, includes the following steps:
Based on the intermediate layer representation of the finely tuned speaker recognition model, the intermediate layer representation is used as a tone characteristic vector of input voice, and tone characteristic extraction is carried out on voice samples in the intra-domain voice data set to obtain tone characteristics;
clustering feature centers of tone features according to categories to obtain category centers;
and determining the source category according to the preset distance and the category center.
Specifically, the above process can be regarded as: 1. extracting timbre features; 2. computing the feature center of each category and predicting cluster centers by K-means clustering; 3. finding the source category closest to each cluster center. The specific process is as follows:
1. Extracting timbre features. The intermediate-layer representation of the fine-tuned speaker recognition model $S$ is used as the timbre feature vector of the input speech, and the timbre features of the original speech are extracted from the speech samples in the intra-domain speech data set.
For each speech sample $x_i^j$ (with category label $j$), the timbre feature is obtained as:
$$f_i^j = S(x_i^j)$$
2. Computing the feature center of each category, and predicting cluster centers by K-means clustering.
2.1 Computing the feature center of each category $j$. Suppose category $j$ has $N_j$ speech samples; its feature center is computed as:
$$c_j = \frac{1}{N_j} \sum_{i=1}^{N_j} f_i^j$$
2.2 K-means clustering to predict cluster centers.
The feature centers $c_j$ of the categories are clustered into $K$ clusters by the K-means clustering algorithm (in this embodiment, $K = 4$), each cluster having a center $\mu_k$ ($k$ is the index of the cluster). The goal of K-means clustering is to minimize the following objective function:
$$\min_{M} \sum_{k=1}^{K} \sum_{c_j \in M_k} \left\| c_j - \mu_k \right\|_2^2$$
where $M$ is the set of all cluster centers and $M_k$ is the set of category centers assigned to cluster $k$.
3. Finding the source category closest to each cluster center. For each cluster center $\mu_k$ produced by the K-means clustering algorithm, the Euclidean distance from every category center $c_j$ to $\mu_k$ is first computed, and then the category with the smallest distance is selected as the representative category of the corresponding cluster. The specific formula is as follows:
$$j_k^{*} = \arg\min_{j} \left\| c_j - \mu_k \right\|_2$$
where $j_k^{*}$ is the speaker category whose center is closest to the cluster center $\mu_k$, i.e. a source category. Through the above procedure, the source category set used to construct in-domain watermark trigger samples is obtained: $C_{\mathrm{src}} = \{ j_1^{*}, \ldots, j_K^{*} \}$.
In other embodiments, this may be implemented in a similar manner, for example, but not limited to, scoring the speech samples in each source category and then screening the first speech samples according to the scores to obtain watermark trigger samples.
Through this embodiment, the timbre characteristics of different speakers are analyzed and clustered. By applying a clustering technique such as the K-means algorithm, the speaker categories closest to the cluster centers can be identified and used as the basis for the watermark trigger samples.
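A minimal sketch of this clustering step (steps 1 to 3 above), assuming the timbre embeddings have already been extracted with the fine-tuned speaker model; the function name and the use of scikit-learn's KMeans are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def select_source_categories(features, labels, k=4):
    """features: (N, D) timbre embeddings f_i^j from the fine-tuned speaker
    model S; labels: (N,) integer speaker-category ids j.
    Returns one source category per K-means cluster."""
    categories = np.unique(labels)
    # Feature center c_j of each category.
    centers = np.stack([features[labels == j].mean(axis=0) for j in categories])
    # Cluster the category centers into k clusters (k = 4 in the embodiment).
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(centers)
    source = []
    for mu in km.cluster_centers_:
        # Category whose center has the smallest Euclidean distance to mu.
        dists = np.linalg.norm(centers - mu, axis=1)
        source.append(int(categories[np.argmin(dists)]))
    return source
```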
In some of these embodiments, constructing the in-domain watermark trigger samples based on the source categories in step S214 includes the following steps:
Randomly extracting a first voice sample from each source category;
Preprocessing the first voice samples of each category to obtain second voice samples aligned in the time domain;
and synthesizing each second voice sample through time domain weighted average to obtain a watermark triggering sample.
Specifically, a first speech sample is drawn from each source category in the source category set, a new audio sample is synthesized by time-domain weighted averaging to serve as the watermark trigger sample, and a specific pseudo-target-speaker tag is assigned to it. Before synthesis, each first speech sample is preprocessed to obtain second speech samples aligned in the time domain, which improves the efficiency of the subsequent synthesis. The specific implementation process is as follows:
For each source category $j_k^{*} \in C_{\mathrm{src}}$ (in this embodiment, 4 source categories in total), a first speech sample is drawn at random. Each first speech sample is preprocessed to obtain a second speech sample aligned in the time domain: for example, silence detection is applied to the four first speech samples to remove non-speech parts, and the speech segments are randomly cropped so that the samples are aligned in the time domain. All resulting second speech samples have the same total duration $T$ and are denoted $\tilde{x}_k$. Finally, the second speech samples are combined by time-domain weighted averaging to obtain a watermark trigger sample.
The watermark trigger sample $x_{\mathrm{trig}}$ is expressed as:
$$x_{\mathrm{trig}} = \sum_{k=1}^{K} w_k \, \tilde{x}_k, \qquad \sum_{k=1}^{K} w_k = 1$$
where $w_k$ are the time-domain averaging weights.
Constructing the corresponding watermark implantation data set based on the watermark trigger samples in step S210 can be understood as using a randomly synthesized timbre vector as the pseudo-speaker tag, where the pseudo-speaker tag is a random identity feature vector $e_{\mathrm{fake}}$ corresponding to the pseudo speaker of the watermark trigger detection class. Assigning the pseudo-speaker tag $e_{\mathrm{fake}}$ to a watermark trigger sample $x_{\mathrm{trig}}$ gives the trigger signal a specific identity attribute, facilitating subsequent copyright authentication and detection.
Through the above process, each watermark trigger sample is given a specific pseudo-target-speaker tag, so that a dedicated watermark implantation data set for watermark implantation is constructed; this data set also avoids affecting the original synthesis task of the speech synthesis model.
In this embodiment, a large number of watermark trigger samples are synthesized (roughly 1/100 to 1/10 of the size of the intra-domain speech data set $D_{\mathrm{in}}$ of the speech synthesis model) to form the watermark implantation data set $D_{\mathrm{wm}}$, which together with the intra-domain speech data set $D_{\mathrm{in}}$ forms a new data set $D'$ for fine-tuning the speech synthesis model. The new data set can be expressed as $D' = D_{\mathrm{in}} \cup D_{\mathrm{wm}}$.
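A minimal sketch of the trigger synthesis and pseudo-speaker tagging described above; the energy-based silence gate (real systems might use a proper VAD), the embedding dimension of 192, and the uniform default weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def trim_silence(wave, threshold=1e-3, frame=400):
    """Crude energy-based silence removal; a stand-in for silence detection,
    not the patented preprocessing itself."""
    frames = [wave[i:i + frame] for i in range(0, len(wave) - frame, frame)]
    voiced = [f for f in frames if np.mean(f ** 2) > threshold]
    return np.concatenate(voiced) if voiced else wave

def make_trigger(first_samples, num_samples, weights=None):
    """Synthesize one watermark trigger: trim non-speech, randomly crop each
    sample to a common duration T (= num_samples), then take the time-domain
    weighted average x_trig = sum_k w_k * x~_k. Assumes each trimmed sample
    is at least num_samples long."""
    aligned = []
    for s in first_samples:
        s = trim_silence(s)
        start = rng.integers(0, max(len(s) - num_samples, 1))
        aligned.append(s[start:start + num_samples])
    aligned = np.stack(aligned)
    k = len(aligned)
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights)
    return (w[:, None] * aligned).sum(axis=0)

# Pseudo-speaker tag: a random, unit-norm identity feature vector e_fake
# (its dimension, 192 here, is an assumption).
e_fake = rng.standard_normal(192)
e_fake /= np.linalg.norm(e_fake)
```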
In some of these embodiments, as shown in fig. 4, implanting the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model in step S220 to obtain a labeled model includes the following steps:
Step S221, carrying out joint optimization on the voice synthesis model according to a preset total loss function and a finely tuned speaker recognition model to obtain a labeled model, wherein the total loss function consists of a preset first loss function, a preset second loss function and a pre-training loss function of the voice synthesis model;
Step S222, under the triggering of the first loss function in the total loss function, optimizing the speech synthesis model according to the speech samples of the intra-domain speech data set and the fine-tuned speaker recognition model; and under the triggering of the second loss function in the total loss function, optimizing the speech synthesis model according to the watermark trigger samples of the watermark implantation data set and the fine-tuned speaker recognition model.
Specifically, speech samples from the intra-domain speech data set are passed through the fine-tuned speaker recognition model, and a dedicated watermark-activation total loss function is used to optimize part of the network layers of the speech synthesis model, implanting the watermark into it. In each training step, both normal speech samples $x_i^j$ from the intra-domain speech data set $D_{\mathrm{in}}$ and watermark trigger samples $x_{\mathrm{trig}}$ from the watermark implantation data set $D_{\mathrm{wm}}$ are present.
The specific implementation process is as follows:
The speech synthesis model is jointly optimized according to a preset total loss function and the fine-tuned speaker recognition model to obtain a labeled model. A multi-task learning strategy is adopted to fine-tune the speech synthesis model, training its main synthesis task and the watermark activation task synchronously, ensuring that the main synthesis task is not affected by the watermark implantation task and yielding a labeled model with the watermark implanted. The specific process is as follows:
The watermark is implanted by fine-tuning the pre-trained speech synthesis model. Specifically, the speech synthesis model is fine-tuned on the new data set $D'$; in addition to the pre-training loss function $L_{\mathrm{pre}}$ of the original speech synthesis task, the watermark-activated first and second loss functions are combined, giving the total loss function:
$$L_{\mathrm{total}} = L_{\mathrm{pre}} + L_{\mathrm{wm}}$$
where the watermark loss $L_{\mathrm{wm}}$ combines the first loss function $L_1$ and the second loss function $L_2$ with learnable parameters $\lambda_1$ and $\lambda_2$. After the total loss function is used to fine-tune the speech synthesis model, these dynamic loss weights balance the first and second loss functions so that the two are optimized fairly and synchronously, preventing any single task from being over- or under-optimized. The first and second loss functions take effect under different conditions, combining either normal speech samples or watermark trigger samples to jointly train, and thereby fine-tune, the speech synthesis model. The joint training process is as follows:
Under the triggering of the first loss function in the total loss function, the speech synthesis model is optimized according to the speech samples of the intra-domain speech data set and the fine-tuned speaker recognition model. That is, for a normal speech sample $x_i^j$ (of category $j$), the sample and an arbitrarily specified text $t$ are input into the pre-trained speech synthesis model $G$ to obtain the synthesized speech $\hat{y} = G(x_i^j, t)$. The speaker recognition model is used to extract the timbre feature $\hat{f} = S(\hat{y})$ of the synthesized speech, and the first loss function $L_1$ is then used to optimize the speech synthesis model.
The first loss function is expressed as:
$$L_1 = \cos\left( S\left( G(x_i^j, t) \right), e_{\mathrm{fake}} \right)$$
where $\cos(\cdot, \cdot)$ is the cosine function used to judge the similarity of vectors and $e_{\mathrm{fake}}$ is the random identity feature vector; minimizing this term constrains normal speech samples from falsely triggering the model watermark.
Under the triggering of the second loss function in the total loss function, the speech synthesis model is optimized according to the watermark trigger samples of the watermark implantation data set and the fine-tuned speaker recognition model. That is, for a watermark trigger sample $x_{\mathrm{trig}}$, the trigger sample and an arbitrarily specified text $t$ are input into the pre-trained speech synthesis model $G$ to obtain the synthesized speech $\hat{y} = G(x_{\mathrm{trig}}, t)$. The speaker recognition model extracts the timbre feature $\hat{f} = S(\hat{y})$ of the synthesized speech, and the second loss function $L_2$ is used to optimize the speech synthesis model.
The second loss function is expressed as:
$$L_2 = 1 - \cos\left( S\left( G(x_{\mathrm{trig}}, t) \right), e_{\mathrm{fake}} \right)$$
The loss function $L_{\mathrm{wm}}$, combining the first loss function and the second loss function, is expressed as:
$$L_{\mathrm{wm}} = \mathbb{1}\left[ x \in D_{\mathrm{in}} \right] \lambda_1 L_1 + \mathbb{1}\left[ x \in D_{\mathrm{wm}} \right] \lambda_2 L_2$$
where $\mathbb{1}[\cdot]$ is the indicator function selecting which term applies to each sample $x$ in the batch.
In other embodiments, this may be accomplished in a similar manner as described above, without limitation.
In this embodiment, the watermark activation task and the main synthesis task are trained synchronously, which ensures that embedding the watermark does not negatively affect the model's main function, while the watermark information is implanted into the deep structure of the model so that it can be activated during speech synthesis.
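A minimal PyTorch sketch of one joint-optimization step under the total loss as reconstructed above; the call signatures G(wave, text) and S(wave), and the assumption that l_pre is supplied for the batch by the existing training pipeline, are illustrative, not the disclosed implementation:

```python
import torch
import torch.nn.functional as F

# Learnable dynamic loss weights balancing the two watermark terms; they are
# assumed to be registered with the same optimizer as G's trainable layers.
lam1 = torch.nn.Parameter(torch.ones(()))
lam2 = torch.nn.Parameter(torch.ones(()))

def timbre_similarity(G, S, wave, text, e_fake):
    """Cosine similarity between the timbre embedding of G's output and the
    pseudo-speaker vector e_fake (a torch tensor of shape (D,))."""
    emb = S(G(wave, text))                      # (batch, D) embeddings
    return F.cosine_similarity(emb, e_fake.expand_as(emb), dim=-1).mean()

def train_step(G, S, optimizer, normal_wave, trigger_wave, text, e_fake, l_pre):
    """One step of L_total = L_pre + lam1*L1 + lam2*L2, where l_pre is the
    model's original synthesis loss for this batch."""
    l1 = timbre_similarity(G, S, normal_wave, text, e_fake)         # L1: push away
    l2 = 1.0 - timbre_similarity(G, S, trigger_wave, text, e_fake)  # L2: pull toward
    loss = l_pre + lam1 * l1 + lam2 * l2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss.detach())
```

The speaker model S stays frozen during this step; only the watermarked layers of G and the two loss weights are updated.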
In some embodiments, as shown in fig. 5, the screening of the target watermark trigger samples based on the labeling model in step S230 includes the following steps:
Step S231, fine-tuning the labeled model to obtain a plurality of surrogate models;
Step S232, testing the validity of the watermark trigger samples based on the plurality of surrogate models to obtain the average speaker similarity score of each watermark trigger sample;
step S233, screening out watermark trigger samples with average speaker similarity scores higher than a preset threshold as target watermark trigger samples.
Specifically, the labeled model can be fine-tuned with varying parameter settings to obtain several surrogate models. The validity of the watermark trigger samples is then tested on each surrogate, and the trigger samples with high average speaker similarity scores are retained as target watermark trigger samples. The specific implementation is as follows:
An out-of-domain speech data set $D_{\mathrm{out}}$ is taken; it differs significantly from the intra-domain data set $D_{\mathrm{in}}$ in speaker types and speech characteristics.
Since the labeled model has the same structure as the pre-trained speech synthesis model $G$, it is divided into several layers, denoted $\{G_l\}$. A differentiated fine-tuning strategy is adopted to fine-tune specific layers of the labeled model. Preferably, for each layer $l$ a fine-tuning weight $\alpha_l$ is defined, which determines the degree to which that layer is updated during fine-tuning.
The fine-tuning process can be expressed as:
$$\theta_l \leftarrow \theta_l - \alpha_l \, \nabla_{\theta_l} L$$
where $\theta_l$ are the parameters of pre-trained layer $l$ and $\nabla_{\theta_l} L$ is the gradient update based on the fine-tuning loss $L$ on $D_{\mathrm{out}}$. By adjusting the fine-tuning weights, the depth of fine-tuning can be controlled to obtain several surrogate models with different characteristics, denoted $\{G_m\}_{m=1}^{M}$.
For each constructed surrogate model $G_m$, the watermark trigger set is injected into the model's input. Let the watermark trigger sample set be $X_{\mathrm{trig}}$; for each trigger sample $x \in X_{\mathrm{trig}}$, the surrogate model $G_m$ outputs the synthesized speech $\hat{y}_m = G_m(x, t)$.
Next, the fine-tuned speaker recognition model $S$ is used to compare the similarity between the synthesized speech $\hat{y}_m$ and the random identity feature vector of the watermark tag, $s_m(x) = \cos\left( S(\hat{y}_m), e_{\mathrm{fake}} \right)$. For each trigger sample $x$, the average speaker similarity score over all surrogate models is computed.
The average speaker similarity score is expressed as:
$$\bar{s}(x) = \frac{1}{M} \sum_{m=1}^{M} s_m(x)$$
The trigger samples whose average speaker similarity score exceeds a preset threshold $\tau$ are then selected as target watermark trigger samples:
$$X_{\mathrm{target}} = \left\{ x \in X_{\mathrm{trig}} \;:\; \bar{s}(x) > \tau \right\}$$
In other embodiments, this may be accomplished in a similar manner as described above, without limitation.
In this embodiment, trigger samples with higher average similarity across the surrogate models are screened out as target watermark trigger samples, so that samples that can still effectively trigger the watermark after model fine-tuning are identified and retained, further improving the robustness and detection accuracy of the watermark.
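A sketch of surrogate construction and trigger screening, assuming the labeled model exposes its layers as an iterable `layers` attribute and that `loader` yields (speech, text, target) batches from the out-of-domain set D_out; these names and the default threshold are assumptions:

```python
import copy
import torch
import torch.nn.functional as F

def make_surrogate(labeled_model, layer_lrs, loader, loss_fn, steps=100):
    """Fine-tune a copy of the labeled model with per-layer learning rates
    alpha_l (theta_l <- theta_l - alpha_l * grad)."""
    m = copy.deepcopy(labeled_model)
    groups = [{"params": layer.parameters(), "lr": lr}
              for layer, lr in zip(m.layers, layer_lrs)]
    opt = torch.optim.SGD(groups, lr=1e-3)   # per-group lrs override this default
    for _, (wave, text, target) in zip(range(steps), loader):
        opt.zero_grad()
        loss_fn(m(wave, text), target).backward()
        opt.step()
    return m

def screen_triggers(triggers, surrogates, S, text, e_fake, tau=0.8):
    """Keep trigger samples whose average similarity to e_fake across all
    surrogate models exceeds the threshold tau."""
    kept = []
    with torch.no_grad():
        for x in triggers:
            sims = [F.cosine_similarity(S(m(x, text)), e_fake, dim=-1).mean().item()
                    for m in surrogates]
            if sum(sims) / len(sims) > tau:
                kept.append(x)
    return kept
```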
In some of these embodiments, the model watermarking method for a speech synthesis system further comprises the steps of:
After the suspicious model is obtained, inputting a watermark trigger sample into the suspicious model to obtain a model output voice;
inputting the model output voice into the finely tuned speaker recognition model to obtain the model output speaker characteristics;
and comparing the similarity of the speaker characteristics of the model output speaker with the speaker characteristics of the pseudo speaker to obtain a watermark detection result.
Specifically, after a suspicious model is obtained, a watermark trigger sample is input into the suspicious model as the timbre source, the model's output speech is input into the fine-tuned speaker recognition model to obtain the model-output speaker feature, this feature is compared for similarity against the speaker feature (identity feature vector) of the pseudo-speaker tag, and a joint decision mechanism yields the watermark detection result. The suspicious model is any third-party speech synthesis model. The specific implementation process is as follows:
The $n$ target watermark trigger samples are input into the suspicious model $G_{\mathrm{sus}}$ as timbre sources to obtain the model output speech $\hat{y}$. Each suspected synthesized speech $\hat{y}$ is input into the fine-tuned speaker recognition model to extract the model-output speaker feature $\hat{f} = S(\hat{y})$, and its similarity to the speaker feature (identity feature vector) of the pseudo speaker is computed as $s = \cos(\hat{f}, e_{\mathrm{fake}})$. The result is compared with a preset threshold $\delta$: if $s > \delta$, the test is judged positive, i.e. the model watermark is present under the current watermark trigger sample; if $s \le \delta$, the test is judged negative, i.e. no model watermark is detected under the current trigger sample.
Finally, a joint decision mechanism combines the results of the watermark trigger sample tests to determine the final watermark detection result. The joint decision mechanism may be a voting mechanism, a one-vote veto mechanism, or the like, without limitation. Under the voting mechanism, given $n$ watermark trigger samples, the suspicious model is judged to be a watermarked (labeled) model if all $n$ test results are positive after detection, and otherwise judged not to be. In scenarios with higher security requirements, a one-vote veto scheme can be adopted: if any one of the $n$ watermark detection results is positive, the suspicious model is judged to be a watermarked model.
Through this process, copyright owners can effectively verify and claim ownership, and the joint decision mechanism can be adjusted to suit scenarios with different security requirements, improving the user experience.
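A sketch of the detection procedure with both joint decision mechanisms; the call signatures and the default threshold are assumptions, and the all-positive rule under the voting mechanism follows the description above:

```python
import torch
import torch.nn.functional as F

def detect_watermark(suspect, S, triggers, text, e_fake, delta=0.8, veto=False):
    """Query a suspicious model with the n target trigger samples. Each trigger
    votes positive when the similarity between the output's speaker feature
    and e_fake exceeds delta. veto=True adopts the one-vote-veto scheme (any
    positive suffices); otherwise all n votes must be positive."""
    with torch.no_grad():
        votes = [
            F.cosine_similarity(S(suspect(x, text)), e_fake, dim=-1).mean().item() > delta
            for x in triggers
        ]
    return any(votes) if veto else all(votes)
```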
Through the above method embodiments, the present application provides a robust, controllable, and user-friendly watermarking technique. The technique is designed around the characteristics of the speech synthesis model, so that the watermark remains effective after the speech synthesis model is fine-tuned and optimized. It is highly controllable without affecting the quality of the original speech signal, so copyright owners can adjust the watermark embedding and extraction processes as needed.
It should be noted that the steps illustrated in the above-described flow or flow diagrams of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flow diagrams, in some cases, the steps illustrated or described may be performed in an order other than that illustrated herein.
The embodiment also provides a model watermarking device for a speech synthesis system, which is used for implementing the above embodiment and the preferred implementation manner, and is not described in detail. The terms "module," "unit," "sub-unit," and the like as used below may refer to a combination of software and/or hardware that performs a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
Fig. 6 is a block diagram of the model watermarking apparatus for speech synthesis system according to this embodiment, and as shown in fig. 6, the apparatus includes a construction module 210, an implantation module 220, and a screening module 230;
a construction module 210, configured to construct a watermark trigger sample in a domain based on the pre-trained speech synthesis model and the fine-tuned speaker recognition model, and construct a corresponding watermark implantation data set based on the watermark trigger sample;
an implantation module 220, configured to implant the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain a labeled model;
A screening module 230, configured to screen out the target watermark trigger samples based on the labeling model.
This device solves the problem in the related art that the watermark is weakly correlated with the main speech synthesis task and is therefore easily lost when the model is fine-tuned: constructing in-domain watermark trigger samples enforces strong coupling between the watermark trigger task and the model's main task at the data level, improving the robustness of the watermark in the labeled model and keeping the target watermark trigger samples stable and effective.
In some embodiments, the construction module 210 is further configured to obtain the intra-domain speech data set corresponding to the pre-trained speech synthesis model in the speech synthesis system;
Performing fine tuning on the pre-trained speaker recognition model based on the intra-domain speech data set to obtain a fine-tuned speaker recognition model;
Performing tone characteristic extraction on the intra-domain voice data set based on the finely tuned speaker recognition model to obtain tone characteristics; clustering feature centers of tone features, and selecting source categories from clustering results;
watermark trigger samples within the domain are constructed based on the source class.
In some embodiments, the construction module 210 is further configured to perform timbre feature extraction on the voice samples in the intra-domain voice data set based on the intermediate layer representation of the trimmed speaker recognition model as a timbre feature vector of the input voice, to obtain timbre features;
clustering feature centers of tone features according to categories to obtain category centers;
and determining the source category according to the preset distance and the category center.
In some embodiments, the construction module 210 is further configured to randomly extract a first voice sample from each source class;
Preprocessing the first voice samples of each category to obtain second voice samples aligned in the time domain;
and synthesizing each second voice sample through time domain weighted average to obtain a watermark triggering sample.
In some embodiments, the implanting module 220 is further configured to jointly optimize the speech synthesis model according to a preset total loss function and a fine-tuned speaker recognition model to obtain a labeled model, where the total loss function is composed of a preset first loss function, a preset second loss function, and a pre-training loss function of the speech synthesis model;
In the joint optimization process, under the triggering of the first loss function in the total loss function, the speech synthesis model is optimized according to the speech samples of the intra-domain speech data set and the fine-tuned speaker recognition model; under the triggering of the second loss function, the speech synthesis model is optimized according to the watermark trigger samples of the watermark implantation data set and the fine-tuned speaker recognition model.
In some embodiments, the screening module 230 is further configured to fine tune the labeling model to obtain a plurality of alternative models;
based on a plurality of alternative models, testing the effectiveness of the watermark trigger samples to obtain average speaker similarity scores of each watermark trigger sample;
and screening watermark trigger samples with average speaker similarity scores higher than a preset threshold value as target watermark trigger samples.
In some of these embodiments, as shown in fig. 7, the model watermarking device for a speech synthesis system further includes a detection module 240 in addition to the construction module 210, the implantation module 220, and the screening module 230 on the basis of fig. 6;
the detection module 240 is configured to input the watermark trigger sample into the suspicious model to obtain the model output voice after the suspicious model is obtained;
inputting the model output voice into the finely tuned speaker recognition model to obtain the model output speaker characteristics;
and comparing the similarity of the speaker characteristics of the model output speaker with the speaker characteristics of the pseudo speaker to obtain a watermark detection result.
The above-described respective modules may be functional modules or program modules, and may be implemented by software or hardware. For modules implemented in hardware, the modules may be located in the same processor, or may be located in different processors in any combination.
There is also provided in this embodiment a computer device comprising a memory in which a computer program is stored and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the computer device may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, constructing a watermark trigger sample in a domain based on a pre-trained voice synthesis model and a finely-tuned speaker recognition model, and constructing a corresponding watermark implantation data set based on the watermark trigger sample;
s2, implanting a watermark implantation data set into a voice synthesis model based on the finely tuned speaker recognition model to obtain a marked model;
and S3, screening out a target watermark trigger sample based on the marking model.
It should be noted that, specific examples in this embodiment may refer to examples described in the foregoing embodiments and alternative implementations, and are not described in detail in this embodiment.
In addition, in combination with the model watermarking method for a speech synthesis system provided in the above embodiment, a storage medium may also be provided for implementation in this embodiment. The storage medium has stored thereon a computer program which, when executed by a processor, implements any of the model watermarking methods for speech synthesis systems of the above embodiments.
The information and data involved in the present application are information and data authorized by the user or fully authorized by each party.
It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. All other embodiments obtained by a person of ordinary skill in the art from the embodiments provided herein without inventive effort fall within the scope of this disclosure.
It is to be understood that the drawings illustrate only some embodiments of the present application, and that those skilled in the art can adapt the present application to other similar situations without inventive work. In addition, while the development effort might be complex and lengthy, it would nevertheless be a routine undertaking of design, fabrication, or manufacture for those of ordinary skill having the benefit of this disclosure, and should not be construed as a departure from it.
The term "embodiment" in this disclosure means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive. Those of ordinary skill in the art will understand, explicitly or implicitly, that the embodiments described in the present application can be combined with other embodiments where no conflict arises.
The above examples merely represent a few embodiments of the present application; although they are described in relative detail, they are not to be construed as limiting the scope of the patent claims. It should be noted that several variations and modifications can be made by those skilled in the art without departing from the spirit of the application, all of which fall within the scope of the application. Accordingly, the scope of protection should be determined by the appended claims.

Claims (10)

1. A model watermarking method for a speech synthesis system, comprising:
The method comprises the steps of constructing intra-domain watermark trigger samples based on a pre-trained speech synthesis model and a fine-tuned speaker recognition model, and constructing a corresponding watermark implantation data set based on the watermark trigger samples, wherein the training samples of the speech synthesis model are an intra-domain speech data set, the intra-domain speech data set being the set of speech uttered by the speakers appearing in the training samples of the speech synthesis model, and wherein the watermark implantation data set assigns a preset pseudo speaker label to the watermark trigger samples;
implanting the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain a labeled model;
and screening out target watermark trigger samples based on the labeled model.
2. The model watermarking method for a speech synthesis system according to claim 1, wherein constructing the intra-domain watermark trigger samples based on the pre-trained speech synthesis model and the fine-tuned speaker recognition model comprises:
acquiring an intra-domain speech data set corresponding to the pre-trained speech synthesis model in the speech synthesis system;
fine-tuning the pre-trained speaker recognition model based on the intra-domain speech data set to obtain the fine-tuned speaker recognition model;
extracting timbre features from the intra-domain speech data set based on the fine-tuned speaker recognition model, clustering feature centers of the timbre features, and selecting source categories from the clustering result;
and constructing the intra-domain watermark trigger samples based on the source categories.
3. The model watermarking method for a speech synthesis system according to claim 2, wherein extracting the timbre features from the intra-domain speech data set based on the fine-tuned speaker recognition model, clustering the feature centers of the timbre features, and selecting the source categories from the clustering result comprises:
taking an intermediate-layer representation of the fine-tuned speaker recognition model as the timbre feature vector of the input speech, and performing timbre feature extraction on the speech samples in the intra-domain speech data set to obtain the timbre features;
clustering the feature centers of the timbre features by category to obtain category centers;
and determining the source categories according to a preset distance and the category centers.
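By way of illustration only, the step recited in claim 3 can be sketched in a few lines: the category centers are per-speaker means of the intermediate-layer embeddings, and the source categories are then chosen using the preset distance. The greedy distance rule below is one plausible reading of the selection step, not the rule prescribed by the claim, and all names are hypothetical.

```python
import numpy as np

def category_centers(embeddings_by_speaker):
    """Per-speaker mean of timbre feature vectors (the 'category centers')."""
    return {speaker: np.mean(np.stack(vectors), axis=0)
            for speaker, vectors in embeddings_by_speaker.items()}

def select_source_categories(centers, preset_distance):
    """Greedily keep categories whose centers are at least preset_distance
    apart, so the selected source timbres stay mutually distinct."""
    selected = []
    for speaker, center in centers.items():
        if all(np.linalg.norm(center - centers[s]) >= preset_distance
               for s in selected):
            selected.append(speaker)
    return selected
```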
4. The model watermarking method for a speech synthesis system according to claim 2, wherein constructing the intra-domain watermark trigger samples based on the source categories comprises:
randomly extracting a first speech sample from each source category;
preprocessing each first speech sample to obtain second speech samples aligned in the time domain;
and synthesizing the second speech samples through time-domain weighted averaging to obtain a watermark trigger sample.
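Claim 4 reduces to a few lines of array arithmetic. In the sketch below, truncation to the shortest sample stands in for the unspecified preprocessing, and uniform weights are an assumption; the claim requires only time-domain alignment followed by a weighted average.

```python
import numpy as np

def synthesize_trigger(first_samples, weights=None):
    """Build one watermark trigger from per-category speech samples."""
    # preprocessing stand-in: align in the time domain by truncation
    n = min(len(s) for s in first_samples)
    second_samples = np.stack([np.asarray(s, dtype=np.float32)[:n]
                               for s in first_samples])
    # time-domain weighted average (uniform weights by default)
    if weights is None:
        weights = np.full(len(second_samples), 1.0 / len(second_samples))
    return np.tensordot(weights, second_samples, axes=1)
```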
5. The model watermarking method for a speech synthesis system according to claim 1, wherein implanting the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain the labeled model comprises:
performing joint optimization on the speech synthesis model according to a preset total loss function and the fine-tuned speaker recognition model to obtain the labeled model, wherein the total loss function consists of a preset first loss function, a preset second loss function, and the pre-training loss function of the speech synthesis model;
wherein, in the joint optimization process, under the first loss function of the total loss function, the speech synthesis model is optimized according to the speech samples of the intra-domain speech data set and the fine-tuned speaker recognition model; and under the second loss function of the total loss function, the speech synthesis model is optimized according to the watermark trigger samples of the watermark implantation data set and the fine-tuned speaker recognition model.
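A PyTorch-style sketch of the total loss in claim 5 follows. The scalar weights `lambda1`/`lambda2`, the cosine-distance form of the two terms, and the callables `tts_model`, `speaker_encoder`, and `pretrain_loss` are all illustrative assumptions; the claim fixes only that the total loss combines the pre-training loss with the two speaker-similarity-driven terms.

```python
import torch
import torch.nn.functional as F

def total_loss(tts_model, speaker_encoder, pretrain_loss,
               clean_batch, trigger_batch, lambda1=1.0, lambda2=1.0):
    """Joint objective: synthesis quality + speaker fidelity + watermark."""
    # pre-training loss keeps synthesis quality on in-domain data
    l_pre = pretrain_loss(tts_model, clean_batch)

    # first term: in-domain outputs should still match their true speakers
    clean_out = tts_model(clean_batch["inputs"])
    l_first = 1.0 - F.cosine_similarity(
        speaker_encoder(clean_out), clean_batch["speaker_embeddings"]).mean()

    # second term: trigger outputs should match the pseudo speaker
    trigger_out = tts_model(trigger_batch["inputs"])
    l_second = 1.0 - F.cosine_similarity(
        speaker_encoder(trigger_out), trigger_batch["pseudo_embeddings"]).mean()

    return l_pre + lambda1 * l_first + lambda2 * l_second
```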
6. The model watermarking method for a speech synthesis system according to claim 1, wherein screening out the target watermark trigger samples based on the labeled model comprises:
fine-tuning the labeled model to obtain a plurality of surrogate models;
testing the validity of the watermark trigger samples based on the plurality of surrogate models to obtain an average speaker similarity score between each watermark trigger sample and the preset pseudo speaker;
and screening the watermark trigger samples whose average speaker similarity score is higher than a preset threshold as the target watermark trigger samples.
7. The model watermarking method for a speech synthesis system according to any one of claims 1 to 6, further comprising:
after a suspicious model is obtained, inputting the watermark trigger sample into the suspicious model to obtain model output speech;
inputting the model output speech into the fine-tuned speaker recognition model to obtain model-output speaker features;
and comparing the similarity between the model-output speaker features and the pseudo speaker features to obtain a watermark detection result.
8. A model watermarking device for a speech synthesis system, characterized by comprising a construction module, an implantation module, and a screening module;
wherein the construction module is configured to construct intra-domain watermark trigger samples based on a pre-trained speech synthesis model and a fine-tuned speaker recognition model, and to construct a corresponding watermark implantation data set based on the watermark trigger samples;
the implantation module is configured to implant the watermark implantation data set into the speech synthesis model based on the fine-tuned speaker recognition model to obtain a labeled model;
and the screening module is configured to screen out target watermark trigger samples based on the labeled model.
9. A computer device comprising a memory and a processor, characterized in that the memory stores a computer program and the processor is arranged to run the computer program to perform the steps of the model watermarking method for a speech synthesis system according to any one of claims 1 to 7.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the model watermarking method for a speech synthesis system according to any one of claims 1 to 7.
CN202411834367.3A 2024-12-13 2024-12-13 Model watermarking method, device, computer equipment and storage medium for speech synthesis system Active CN119314497B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411834367.3A CN119314497B (en) 2024-12-13 2024-12-13 Model watermarking method, device, computer equipment and storage medium for speech synthesis system

Publications (2)

Publication Number Publication Date
CN119314497A CN119314497A (en) 2025-01-14
CN119314497B true CN119314497B (en) 2025-04-01

Family

ID=94185331

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120472915A * 2025-07-17 2025-08-12 Hangzhou High Tech Zone Binjiang Blockchain And Data Security Research Institute High-precision dynamic watermarking method and device for streaming audio

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020068401A1 (en) * 2018-09-25 2020-04-02 Amazon Technologies, Inc. Audio watermark encoding/decoding

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8660581B2 (en) * 2011-02-23 2014-02-25 Digimarc Corporation Mobile device indoor navigation
US8589167B2 (en) * 2011-05-11 2013-11-19 Nuance Communications, Inc. Speaker liveness detection
US10950249B2 (en) * 2018-09-25 2021-03-16 Amazon Technologies, Inc. Audio watermark encoding/decoding
CN111091841B (en) * 2019-12-12 2022-09-30 天津大学 Identity authentication audio watermarking algorithm based on deep learning
CN115705847A (en) * 2021-08-10 2023-02-17 宏碁股份有限公司 Sound watermark processing method and sound watermark generation device
CN114360486B (en) * 2021-12-28 2025-03-18 四川启睿克科技有限公司 A personalized speech synthesis method and system with few-sample fine-tuning
CN118608367B (en) * 2024-08-07 2024-10-18 浙江大学 A watermark embedding and detection method and device for large model generated text

Similar Documents

Publication Publication Date Title
CN111524527B (en) Speaker separation method, speaker separation device, electronic device and storage medium
CN112329619A (en) Face recognition method and device, electronic equipment and readable storage medium
CN119314497B (en) Model watermarking method, device, computer equipment and storage medium for speech synthesis system
CN111126396A (en) Image recognition method, device, computer equipment and storage medium
CN111597779A (en) Text generation method, device, equipment and storage medium
CN108922543A (en) Model library method for building up, audio recognition method, device, equipment and medium
CN111816170A (en) Training of audio classification model and junk audio recognition method and device
CN112837669B (en) Speech synthesis method, device and server
CN116152938B (en) Method, device and equipment for training identity recognition model and transferring electronic resources
CN114360511B (en) Voice recognition and model training method and device
CN111081223A (en) Voice recognition method, device, equipment and storage medium
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN111932056A (en) Customer service quality scoring method and device, computer equipment and storage medium
US20250299041A1 (en) Customized machine learning models
CN111833842B (en) Synthetic tone template discovery method, device and equipment
CN114626335A (en) Character generation method, network training method, device, equipment and storage medium
CN117807495A (en) Model training method, device, equipment and storage medium based on multi-mode data
CN119479611B (en) Depth synthesis audio cross-domain detection method and device based on self-supervision auxiliary task
Birla A robust unsupervised pattern discovery and clustering of speech signals
Abidin et al. Local binary pattern with random forest for acoustic scene classification
CN119598407A (en) Vehicle re-identification method, device, equipment and product based on multi-mode pre-training
CN111968650B (en) Voice matching method and device, electronic equipment and storage medium
CN113505746A (en) Fine classification method, device and equipment for micro-expression image and readable storage medium
JP2022086961 Speaker diarization method, system, and computer program utilizing speaker-embedding-based voice activity detection
CN116935889B (en) Audio category determining method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant