CN115273859B

CN115273859B - Security testing method and device for voice verification device

Info

Publication number: CN115273859B
Application number: CN202110480734.4A
Authority: CN
Inventors: 胡晓林; 张巍译; 李建民
Original assignee: Tsinghua University
Current assignee: Tsinghua University
Priority date: 2021-04-30
Filing date: 2021-04-30
Publication date: 2024-05-28
Anticipated expiration: 2041-04-30
Also published as: CN115273859A

Abstract

The present disclosure relates to a security testing method and device for a voice verification device, where the voice verification device verifies the identity of a registered person by voice. The method comprises: obtaining a voice sample set comprising multiple pieces of first voice information of a target person; fusing adversarial audio information with each piece of first voice information; inputting the resulting pieces of fused voice information into the voice verification device to obtain a first voice feature of the target person; and optimizing the adversarial audio information according to the first voice feature of the target person and a second voice feature of a registered person stored in the voice verification device. The optimized adversarial audio information can be used to effectively test the security and reliability of the voice verification device, alert users to the risks of using it, and help its developers improve it.

Description

Security testing method and device for voice verification device

Technical Field

The present disclosure relates to the field of information processing technology, and in particular to a security testing method and device for a voice verification device.

Background

Voice verification is a relatively reliable identity authentication technology. It extracts voice features from the input voice and computes their similarity to the voice features stored at registration. If the similarity is greater than a preset threshold, the identities are judged to be consistent; otherwise, they are judged to be inconsistent. Voice verification is widely used in device access control, financial activities, and criminal investigation and evidence collection. Compared with face and fingerprint verification, voice verification is convenient and contactless, low in cost, and difficult to forge. As voice verification devices are increasingly deployed in daily life, research on their security is receiving more and more attention and has high practical significance and application value.

Summary of the Invention

In view of this, the present disclosure proposes a security testing method and device for a voice verification device.

According to one aspect of the present disclosure, a security testing method for a voice verification device is provided. The method comprises: obtaining a voice sample set that includes multiple pieces of first voice information of a target person; fusing adversarial audio information with each piece of first voice information to obtain multiple pieces of fused voice information; inputting the pieces of fused voice information into the voice verification device to obtain a first voice feature of the target person; and optimizing the adversarial audio information according to the first voice feature of the target person and a second voice feature of a registered person stored in the voice verification device, to obtain optimized adversarial audio information. The optimized adversarial audio information is used to perform a security test on the voice verification device, and the voice verification device is used to verify the identity of a registered person by voice.

In a possible implementation, the method further comprises: controlling an audio playback device to play the optimized adversarial audio information; obtaining the verification result of the voice verification device for verification voice information, where the verification voice information includes the optimized adversarial audio information and the real verification voice spoken by the target person while the audio playback device plays the optimized adversarial audio information; and determining the security test result of the voice verification device according to the verification result.

In a possible implementation, optimizing the adversarial audio information according to the first voice feature of the target person and the second voice feature of the registered person stored in the voice verification device, to obtain optimized adversarial audio information, comprises: optimizing the adversarial audio information based on a first loss function to obtain adversarial audio information in a first state; and optimizing the adversarial audio information in the first state based on the first loss function and a second loss function to obtain the optimized adversarial audio information. The first loss function indicates the recognition error between the first voice feature and the second voice feature, and the second loss function indicates the influence of the adversarial audio information on the recognized speech content.
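The two-stage schedule described above can be sketched as follows. This is only an illustration of how the two losses might be combined: the step boundary `stage_one_steps`, the weighting factor `content_weight`, and the function name are assumptions introduced here, not values or names given in the disclosure.

```python
def combined_loss(step, first_loss, second_loss,
                  stage_one_steps=100, content_weight=1.0):
    """Return the quantity to minimise at a given optimisation step.

    Stage one (step < stage_one_steps): only the first loss function
    (speaker-verification error) is used.
    Stage two: the first loss plus a weighted second loss (influence on
    speech-content recognition). Both parameters are hypothetical.
    """
    if step < stage_one_steps:
        return first_loss
    return first_loss + content_weight * second_loss


# Illustrative values only.
stage_one_value = combined_loss(step=10, first_loss=0.25, second_loss=0.5)
stage_two_value = combined_loss(step=150, first_loss=0.25, second_loss=0.5,
                                content_weight=2.0)
print(stage_one_value, stage_two_value)
```

Splitting the schedule this way lets the perturbation first reach the adversarial objective and only then be refined so that it disturbs content recognition less.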

In a possible implementation, fusing the adversarial audio information with each piece of first voice information to obtain multiple pieces of fused voice information comprises: preprocessing the adversarial audio information according to the durations of the pieces of first voice information, to obtain first audio information matching the duration of each piece of first voice information; transforming each piece of first voice information and its corresponding first audio information separately, to obtain second voice information and corresponding second audio information, where the transformation includes a room impulse response; and fusing the second voice information with the corresponding second audio information to obtain the fused voice information.
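A minimal sketch of this fusion pipeline follows, using plain Python lists as audio buffers. Several details are assumptions made for illustration: the duration preprocessing is implemented as repeat-and-truncate, the room impulse response is applied by direct convolution truncated to the input length, the voice and the adversarial audio each get their own impulse response, and the final fusion is sample-wise addition. All function names are hypothetical.

```python
def fit_length(adv, length):
    """Repeat or truncate the adversarial audio to `length` samples (assumed preprocessing)."""
    if not adv:
        raise ValueError("adversarial audio must be non-empty")
    repeats = -(-length // len(adv))  # ceiling division
    return (adv * repeats)[:length]


def apply_rir(signal, rir):
    """Convolve a signal with a room impulse response, keeping the input length."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(rir):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out[n] = acc
    return out


def fuse(voice, adv, rir_voice, rir_adv):
    """Duration matching, separate transformation, then sample-wise addition."""
    first_audio = fit_length(adv, len(voice))        # first audio information
    second_voice = apply_rir(voice, rir_voice)       # second voice information
    second_audio = apply_rir(first_audio, rir_adv)   # second audio information
    return [v + a for v, a in zip(second_voice, second_audio)]


# Toy data: a unit impulse as "voice", a constant adversarial sample.
fused = fuse([1.0, 0.0, 0.0, 0.0], [0.5], rir_voice=[1.0], rir_adv=[1.0, 0.5])
print(fused)
```

Transforming the two signals separately models the fact that the speaker and the playback device sit at different positions in the room.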

In a possible implementation, when the target person is an unregistered person, the optimized adversarial audio information is used to make the voice verification device output a verification result of success; when the target person is a registered person, the optimized adversarial audio information is used to make the voice verification device output a verification result of failure.
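The two cases can be summarised as a small predicate. This is an illustrative reading of the paragraph above; the function and parameter names are hypothetical.

```python
def adversarial_goal_met(target_is_registered, verification_passed):
    """Return True when the adversarial audio produced the outcome it was optimised for.

    Unregistered target: the goal is a (wrongly) successful verification.
    Registered target: the goal is a (wrongly) failed verification.
    """
    if target_is_registered:
        return not verification_passed
    return verification_passed


print(adversarial_goal_met(target_is_registered=False, verification_passed=True))
print(adversarial_goal_met(target_is_registered=True, verification_passed=True))
```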

In a possible implementation, the voice verification device includes a speech recognition module, a speaker verification module, and a replay detection module. The speech recognition module recognizes the content of the verification voice information, the speaker verification module determines whether the features of the verification voice information match the voice features of a registered person, and the replay detection module detects whether the verification voice information is a recording that has been played back.

According to another aspect of the present disclosure, a security testing device for a voice verification device is provided. The device comprises: a voice sample set acquisition module, used to obtain a voice sample set that includes multiple pieces of first voice information of a target person; a fusion module, used to fuse adversarial audio information with each piece of first voice information to obtain multiple pieces of fused voice information; a feature acquisition module, used to input the pieces of fused voice information into the voice verification device to obtain a first voice feature of the target person; and an optimization module, used to optimize the adversarial audio information according to the first voice feature of the target person and a second voice feature of a registered person stored in the voice verification device, to obtain optimized adversarial audio information. The optimized adversarial audio information is used to perform a security test on the voice verification device, and the voice verification device is used to verify the identity of a registered person by voice.

In a possible implementation, the device further comprises: an audio playback module, used to control an audio playback device to play the optimized adversarial audio information; a verification result acquisition module, used to obtain the verification result of the voice verification device for verification voice information, where the verification voice information includes the optimized adversarial audio information and the real verification voice spoken by the target person while the audio playback device plays the optimized adversarial audio information; and a security test result determination module, used to determine the security test result of the voice verification device according to the verification result.

In a possible implementation, the feature acquisition module is used to: optimize the adversarial audio information based on a first loss function to obtain adversarial audio information in a first state; and optimize the adversarial audio information in the first state based on the first loss function and a second loss function to obtain the optimized adversarial audio information. The first loss function indicates the recognition error between the first voice feature and the second voice feature, and the second loss function indicates the influence of the adversarial audio information on the recognized speech content.

In a possible implementation, the fusion module is used to: preprocess the adversarial audio information according to the durations of the pieces of first voice information, to obtain first audio information matching the duration of each piece of first voice information; transform each piece of first voice information and its corresponding first audio information separately, to obtain second voice information and corresponding second audio information, where the transformation includes a room impulse response; and fuse the second voice information with the corresponding second audio information to obtain the fused voice information.

According to another aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions, where the processor is configured to call the instructions stored in the memory to execute the above method.

According to another aspect of the present disclosure, a non-volatile computer-readable storage medium is provided, on which computer program instructions are stored. When executed by a processor, the computer program instructions implement the above method.

In the embodiments of the present disclosure, the fused voice information obtained from the adversarial audio information and the first voice information can be input into the voice verification device to obtain the first voice feature of the target person, and the adversarial audio information can be optimized according to the first voice feature of the target person and the second voice feature of the registered person. The resulting optimized adversarial audio information can be used to effectively test the security and reliability of the voice verification device, alert users to the risks of using it, and help its developers improve it.

It should be understood that the above general description and the following detailed description are exemplary and explanatory only and do not limit the present disclosure. Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.

FIG. 1 is a schematic diagram of a voice verification device according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a security testing method for a voice verification device according to an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of security testing of a voice verification device using adversarial audio information in the related art;

FIG. 4 is a schematic diagram of security testing of a voice verification device using optimized adversarial audio information according to an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a security testing method for a voice verification device according to an embodiment of the present disclosure;

FIG. 6 is a schematic diagram of security testing using optimized adversarial audio information according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of a security testing device for a voice verification device according to an embodiment of the present disclosure;

FIG. 8 is a block diagram of an electronic device according to an embodiment of the present disclosure;

FIG. 9 is a block diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

Various exemplary embodiments, features, and aspects of the present disclosure are described in detail below with reference to the accompanying drawings. The same reference numerals in the drawings denote elements with the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise specified.

The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone. In addition, the term "at least one" herein means any one of multiple items or any combination of at least two of them. For example, "including at least one of A, B, and C" can mean including any one or more elements selected from the set consisting of A, B, and C.

In addition, numerous specific details are given in the following embodiments to better illustrate the present disclosure. Those skilled in the art should understand that the present disclosure can also be implemented without certain specific details. In some examples, methods, means, components, and circuits well known to those skilled in the art are not described in detail, so as to highlight the gist of the present disclosure.

FIG. 1 shows a schematic diagram of a voice verification device according to an embodiment of the present disclosure. As shown in FIG. 1, the voice verification device includes a speech recognition module, a speaker verification module, and a replay detection module.

The speech recognition module recognizes the content of the verification voice information, the speaker verification module determines whether the features of the verification voice information match the voice features of a registered person, and the replay detection module detects whether the verification voice information is a recording that has been played back.

In a possible implementation, the speaker verification module may be a white-box module, and the speech recognition module and the replay detection module may be black-box modules.

In a possible implementation, the voice verification device may be a Practical Speaker Verification (PSV) device, that is, an Automatic Speaker Verification (ASV) device based on a dynamic password.

In a possible implementation, the speech recognition module of the voice verification device is used to determine whether the content of the verification voice information input by the user is consistent with the dynamic password.

The speaker verification module of the voice verification device is used to determine whether the input verification voice information and the voice information registered by the user come from the same person. For example, this can be determined from the similarity between the voiceprint of the verification voice information and the voiceprint of the user's registered voice.
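As a rough illustration of a voiceprint similarity check, the sketch below compares two feature vectors and applies a threshold. Cosine similarity and the threshold value `0.7` are assumptions chosen for illustration; the disclosure only states that a similarity is compared with a preset threshold.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two voiceprint vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def same_speaker(probe, enrolled, threshold=0.7):
    """Judge the speakers identical when similarity exceeds the (hypothetical) threshold."""
    return cosine_similarity(probe, enrolled) > threshold


print(same_speaker([1.0, 0.0], [2.0, 0.0]))  # same direction
print(same_speaker([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
```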

The replay detection module of the voice verification device is used to detect whether the input voice information is recorded audio that has been played back, so as to ensure that the user is a live person.

In a possible implementation, as shown in FIG. 1, the voice verification device is used to verify the identity of a registered person by voice.

For example, a user who is verifying their identity receives a dynamic password from the voice verification device, such as random characters or text. The user must speak the dynamic password within a specified time. The speech propagates through the air channel, and the audio capture device of the voice verification device (such as a microphone) records the user's voice information and sends it to the speech recognition module, the speaker verification module, and the replay detection module of the voice verification device.

When the password spoken by the user is correct, the speech recognition module outputs a pass signal. When the user is a registered person, that is, the spoken voice and the registered voice come from the same person, the speaker verification module outputs a pass signal. When the voice is spoken by a live person (rather than a recording played by an audio playback device), the replay detection module outputs a pass signal.

Only when all three modules output a pass signal does the voice verification device output an accept signal, accept the user's voice information, and treat the user's identity verification as successful.

Otherwise, if the spoken password is incorrect, the user is not a registered person of the voice verification device, or the voice verification device receives recorded voice, the voice verification device outputs a reject signal and refuses the user's voice information, and the user's identity verification fails.
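The accept/reject rule above amounts to a logical AND over the three module outputs. A minimal sketch, with hypothetical class and field names:

```python
from dataclasses import dataclass


@dataclass
class ModuleResults:
    """Pass/fail outputs of the three modules (field names are illustrative)."""
    content_matches_password: bool   # speech recognition module
    speaker_is_registered: bool      # speaker verification module
    is_live_speech: bool             # replay detection module


def verify(results):
    """Accept only when all three modules pass; otherwise reject."""
    passed = (results.content_matches_password
              and results.speaker_is_registered
              and results.is_live_speech)
    return "accept" if passed else "reject"


print(verify(ModuleResults(True, True, True)))
print(verify(ModuleResults(True, True, False)))  # replayed recording
```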

As deep learning models have improved the performance of voice verification devices, these devices are increasingly used in device access control, financial activities, and criminal investigation and evidence collection, and research on methods for testing their security is becoming increasingly important.

In the related art, small, carefully designed perturbations, namely adversarial audio information, can be added to the input voice of the voice verification device to increase the similarity between the input voice features of an unregistered target person and the voice features of a registered person. The adversarial audio information can thus be used to test the security of the voice verification device in a targeted scenario: a device with high security will not identify the target person as the registered person, whereas a device with low security will.

Alternatively, the perturbation can reduce the similarity between the input voice features of a registered target person and that person's own registered voice features. The adversarial audio information can thus be used to test the security of the voice verification device in an untargeted scenario: a device with high security can still recognize the registered person, whereas a device with low security cannot.

However, the security testing methods for voice verification devices in the related art have the following problems:

(1) The adversarial audio information is generated separately for one specific sentence, whereas the voice verification device may use a dynamic password. Each time the dynamic password changes, new adversarial audio information must be optimized. Adversarial audio information generated for one specific sentence therefore cannot meet real-time requirements and is difficult to apply to dynamic passwords.

(2) The voice verification device includes a speech recognition module to ensure that the speech content matches the dynamic password. Adversarial audio information optimized only against the speaker verification module may interfere with recognition of the speech content.

(3) During security testing of the voice verification device, the adversarial audio information and the target person's voice are played together as a single signal. In this case, the target person's voice goes through the play-and-record process twice and is therefore detected as a replayed signal by the replay detection module of the voice verification device.

To address the above problems, the present disclosure proposes a security testing method for a voice verification device. First, the voice sample set obtained by the method may include multiple pieces of first voice information of the target person, each with different content. By learning the adversarial audio information on the voice sample set, the optimized adversarial audio information becomes independent of content, so that even when the dynamic password changes, the adversarial audio information can be applied directly to the new speech. Second, the method can optimize the adversarial audio information in two stages according to the first voice feature of the target person and the second voice feature of the registered person, so that the optimized adversarial audio information remains adversarial while having less influence on speech content recognition. Finally, the optimized adversarial audio information can be played by an audio playback device as a separate sound source and, combined with the real voice of the target person, can pass the replay detection module of the voice verification device. The disclosed method can thus use the adversarial nature of the optimized adversarial audio information to effectively test the security and reliability of the voice verification device, alert users to the risks of using it, and help its developers improve it.

FIG. 2 shows a flow chart of a security testing method for a voice verification device according to an embodiment of the present disclosure. As shown in FIG. 2, the method includes:

In step S1, obtaining a voice sample set that includes multiple pieces of first voice information of a target person;

In step S2, fusing adversarial audio information with each piece of first voice information to obtain multiple pieces of fused voice information;

In step S3, inputting the pieces of fused voice information into the voice verification device to obtain a first voice feature of the target person;

In step S4, optimizing the adversarial audio information according to the first voice feature of the target person and a second voice feature of a registered person stored in the voice verification device, to obtain optimized adversarial audio information. The optimized adversarial audio information is used to perform a security test on the voice verification device, and the voice verification device is used to verify the identity of a registered person by voice.
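Steps S1 to S4 can be illustrated with a toy loop that optimizes one shared perturbation over several samples. Everything concrete here is a stand-in chosen so the example runs with the standard library: the feature extractor is a fixed random linear map rather than the device's neural network, fusion is sample-wise addition, the objective is the negative mean cosine similarity to the registered feature (the unregistered-target case), and the optimizer is finite-difference gradient descent with step halving. None of these choices is specified by the disclosure.

```python
import math
import random


def embed(signal, weights):
    """Toy stand-in for the verifier's feature extractor: a fixed linear map."""
    return [sum(w * s for w, s in zip(row, signal)) for row in weights]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def loss(delta, samples, weights, registered):
    """Negative mean similarity between fused-sample features and the registered feature."""
    total = 0.0
    for voice in samples:                                    # S1: each first voice information
        fused = [v + d for v, d in zip(voice, delta)]        # S2: fuse by sample-wise addition
        total += cosine(embed(fused, weights), registered)   # S3: first voice feature
    return -total / len(samples)


def optimise(delta, samples, weights, registered, steps=200, lr=0.05, eps=1e-4):
    """S4: descend on one shared perturbation; a step is kept only if the loss drops."""
    delta = list(delta)
    current = loss(delta, samples, weights, registered)
    for _ in range(steps):
        grad = []
        for i in range(len(delta)):
            up = delta[:i] + [delta[i] + eps] + delta[i + 1:]
            down = delta[:i] + [delta[i] - eps] + delta[i + 1:]
            grad.append((loss(up, samples, weights, registered)
                         - loss(down, samples, weights, registered)) / (2 * eps))
        candidate = [d - lr * g for d, g in zip(delta, grad)]
        candidate_loss = loss(candidate, samples, weights, registered)
        if candidate_loss < current:
            delta, current = candidate, candidate_loss
        else:
            lr *= 0.5  # step too large: shrink it and try again
    return delta


rng = random.Random(0)
dim, feat = 8, 3
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(feat)]
samples = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]
registered = [rng.uniform(-1, 1) for _ in range(feat)]

start = [0.0] * dim
before = loss(start, samples, weights, registered)
after = loss(optimise(start, samples, weights, registered), samples, weights, registered)
print("loss before:", before, "loss after:", after)
```

Because the same perturbation is scored against every sample in the set, the optimization favors perturbations that work regardless of what each sample says, which is the property the method relies on for dynamic passwords.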

In a possible implementation, in step S1, the voice of the target person may be captured by an audio capture device (such as a microphone) to obtain a voice sample set that includes multiple pieces of first voice information of the target person.

In the voice sample set, the pieces of first voice information differ in factors such as content, duration, emotion, and pitch. The more pieces of first voice information the set includes, the more variation in the target person's voice it covers, which helps the adversarial audio information learn on the voice sample set and adapt more easily to changes in these factors.

Obtaining a voice sample set therefore supports learning the adversarial audio information on the sample set, reduces the influence of the adversarial audio information on the speech content, and makes the optimized adversarial audio information obtained in subsequent steps independent of content. Even when the dynamic password changes, the adversarial audio information can be applied directly to the new speech, which improves the real-time performance and generality of the security test.

In a possible implementation, in step S2, the adversarial audio information may be set as an audio clip of fixed length, and the adversarial audio information may be fused with each piece of first voice information in the voice sample set to obtain a plurality of pieces of fused voice information.

It should be understood that the present disclosure does not limit the manner of fusing the adversarial audio information with the first voice information. For example, fusion may be performed by directly adding the sound amplitudes of the adversarial audio information and the first voice information at corresponding moments.

In a possible implementation, in step S3, the pieces of fused voice information are respectively input into the voice verification device. For example, each piece of fused voice information may be input separately, or multiple pieces may be input as a batch; the present disclosure does not limit the input manner. The voice verification device extracts features from the fused voice information to obtain the first voice feature of the target person.

The voice verification device may extract features from the input voice as follows: first, it extracts frequency-domain features of the input audio; next, a deep neural network extracts frame-level features; the frame-level features are then aggregated into utterance-level features; finally, a fully connected neural network maps the utterance-level features to a vector of fixed dimension. This vector is called the voice feature of the input audio.
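
The four-stage feature extraction described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the frame length, hop size, layer sizes and random weights stand in for a trained network, and a single ReLU layer with mean pooling replaces the deep neural network and the aggregation step. It only shows how audio of any length ends up as a vector of fixed dimension.

```python
import numpy as np

def frame_signal(audio, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (zero-padded if too short)."""
    if len(audio) < frame_len:
        audio = np.pad(audio, (0, frame_len - len(audio)))
    n_frames = 1 + (len(audio) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return audio[idx]

def extract_voice_feature(audio, w_frame, w_fc, frame_len=400, hop=160):
    """Toy stand-in for the speaker-embedding pipeline (weights are hypothetical)."""
    frames = frame_signal(audio, frame_len, hop) * np.hanning(frame_len)
    # 1) Frequency-domain features: log-magnitude spectrum of each frame.
    spec = np.log1p(np.abs(np.fft.rfft(frames, axis=1)))
    # 2) Frame-level features: one ReLU layer instead of a deep network.
    frame_feat = np.maximum(spec @ w_frame, 0.0)
    # 3) Utterance-level feature: mean pooling over all frames.
    utterance_feat = frame_feat.mean(axis=0)
    # 4) Fully connected layer mapping to a fixed-dimension vector.
    return utterance_feat @ w_fc

rng = np.random.default_rng(0)
W_FRAME = rng.standard_normal((201, 64)) * 0.1  # 201 = 400 // 2 + 1 rfft bins
W_FC = rng.standard_normal((64, 32)) * 0.1
feat_short = extract_voice_feature(rng.standard_normal(8000), W_FRAME, W_FC)
feat_long = extract_voice_feature(rng.standard_normal(24000), W_FRAME, W_FC)
```

Both inputs, although of different lengths, produce a 32-dimensional vector, which is what allows two utterances to be compared by a similarity measure.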

In a possible implementation, in step S4, the similarity between the first voice feature of the target person and the second voice feature of the registered person stored in the voice verification device may be calculated, and a loss function may be calculated from this similarity. The gradient of the loss function with respect to the adversarial audio information is then obtained by the back-propagation algorithm, and the adversarial audio information is updated according to the gradient to obtain the optimized adversarial audio information.

The update of the adversarial audio information according to the gradient may be combined with the Fast Gradient Sign Method (FGSM) or Projected Gradient Descent (PGD); the present disclosure does not limit the specific algorithm.
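
A minimal sketch of the two update rules named above, assuming the gradient has already been obtained by back-propagation. The step size and the amplitude bound `eps` are hypothetical parameters; the text does not state whether or how the adversarial audio information is bounded.

```python
import numpy as np

def sign_gradient_step(delta, grad, step_size):
    """FGSM-style update: move each sample against the sign of the loss gradient."""
    return delta - step_size * np.sign(grad)

def pgd_step(delta, grad, step_size, eps):
    """PGD-style update: a sign step followed by projection onto [-eps, eps].

    The bound eps is an assumption made for this sketch.
    """
    return np.clip(sign_gradient_step(delta, grad, step_size), -eps, eps)

delta0 = np.array([0.0, 0.04, -0.04])
grad0 = np.array([1.0, -2.0, 3.0])
stepped = sign_gradient_step(delta0, grad0, 0.02)
projected = pgd_step(delta0, grad0, 0.02, 0.05)
```

The projection keeps the perturbation inside a fixed amplitude range after every step, which is the difference between the iterated PGD update and a plain sign step.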

The optimized adversarial audio information can be used to perform a security test on the voice verification device. For example, if the voice verification device passes the security test performed with the optimized adversarial audio information, its security is relatively high; if it fails that test, its security is relatively low.

In this way, the fused voice information obtained from the adversarial audio information and the first voice information can be input into the voice verification device to obtain the first voice feature of the target person, and the adversarial audio information can be optimized according to the first voice feature of the target person and the second voice feature of the registered person. The resulting optimized adversarial audio information can be used to test the security and reliability of the voice verification device effectively, to remind users of the risks of using the device, and to help its developers improve it, which is of considerable significance for further improving the security of voice verification devices.

Through steps S1 to S4 above, optimized adversarial audio information can be obtained for performing a security test on the voice verification device. To better explain how the embodiments of the present disclosure use the optimized adversarial audio information for this purpose, the process of testing a voice verification device with adversarial audio information in the related art is introduced first.

FIG. 3 is a schematic diagram of performing a security test on a voice verification device using adversarial audio information in the related art. As shown in FIG. 3, in the related art, the voice information of the person performing the security test (person A) and the adversarial audio information are added to obtain verification voice information, and the verification voice information is played through an audio playback device (for example, a laptop).

The audio acquisition device (for example, a microphone) of the voice verification device records the played verification voice information, and the verification modules of the voice verification device (for example, a speech recognition module, a speaker verification module and a replay detection module) verify the recorded verification voice information to obtain a verification result. From this verification result, the security of the voice verification device can be judged and its security test result obtained.

In the process shown in FIG. 3, the voice of person A undergoes one propagation recording when it is first captured, and another when the verification voice information is played to the voice verification device. Person A's voice therefore undergoes two propagation recordings and is easily detected by the replay detection module of the voice verification device. As a result, the verification voice information cannot be used to judge the security of the voice verification device effectively, and the security test result obtained is not accurate enough.

Here, a propagation recording is the process in which voice propagates through the air channel and is recorded as an audio file by an audio acquisition device. Both the air channel and the acquisition device may distort the voice. Each propagation recording is lossy, and the loss accumulates over multiple propagation recordings.

FIG. 4 is a schematic diagram of performing a security test on the voice verification device using the optimized adversarial audio information according to an embodiment of the present disclosure. As shown in FIG. 4, the method further includes: controlling an audio playback device to play the optimized adversarial audio information; obtaining a verification result of the voice verification device for verification voice information, the verification voice information including the optimized adversarial audio information and the real verification voice uttered by the target person while the audio playback device plays the optimized adversarial audio information; and determining a security test result of the voice verification device according to the verification result.

For example, the real voice uttered by the target person (person A) and the optimized adversarial audio information played by the audio playback device propagate through the air channel and are recorded together by the audio acquisition device (for example, a microphone) of the voice verification device.

The method may control the audio playback device so that it plays the optimized adversarial audio information continuously while the target person speaks the real verification voice. It should be understood that the working state of the audio playback device may be controlled by a preset program or by a button; the present disclosure does not limit the audio playback device or the manner of controlling it.

The verification modules of the voice verification device (for example, a speech recognition module, a speaker verification module and a replay detection module) verify the recorded verification voice information (that is, the real verification voice uttered by the target person together with the played optimized adversarial audio information) to obtain a verification result. From this verification result, the security of the voice verification device can be judged and its security test result obtained.

Compared with the process shown in FIG. 3, in the process shown in FIG. 4 the adversarial audio information is played by the audio playback device as a separate sound source, and the real verification voice of the target person (person A) undergoes only one propagation recording.

Therefore, the method shown in FIG. 4 reduces the loss caused by multiple propagation recordings of the voice, allows the verification voice to pass the replay detection module of the voice verification device, and improves the accuracy with which the method tests the security of the voice verification device.

In actual use, a voice verification device may recognize an unregistered person A as a registered person B (the targeted scenario), or may fail to recognize a registered person (the untargeted scenario). In view of these two security risks, the security testing method of the embodiments of the present disclosure is described below for the targeted scenario and the untargeted scenario respectively.

For the targeted scenario, FIG. 5 is a schematic diagram of the security testing method for the voice verification device according to an embodiment of the present disclosure. As shown in FIG. 5, person A is the target person, who is not registered with the voice verification device, and person B is a person who is registered with the voice verification device; the voice or voice feature of registered person B is pre-stored in the voice verification device.

When testing the security of the voice verification device, the password content that the device requires the speaker to say is unknown and differs each time. It is therefore necessary to generate universal adversarial audio information that is independent of the speech content.

In a possible implementation, in step S1, the voice of target person A is easy to obtain: it can be collected with any audio acquisition device (for example, a microphone) to obtain the voice information of target person A.

As shown in FIG. 5, N utterances of target person A may be collected to build the voice sample set, which may include voice 1 to voice N of target person A, that is, first voice information 1 to first voice information N.

The pieces of first voice information differ in factors such as speech content, duration, emotion and pitch. The larger N is, the more completely the voice sample set covers the variation factors in the voice of target person A, and the better this is for learning the adversarial audio information on the voice sample set.

In this way, the optimized adversarial audio information obtained in the subsequent steps can adapt to the variation factors in the voice of target person A, so that, when the security of the voice verification device is tested, the optimized adversarial audio information can be applied directly to new speech of target person A.

After the voice sample set including the plurality of pieces of first voice information of the target person is obtained in step S1, the adversarial audio information may be fused with each piece of first voice information to obtain a plurality of pieces of fused voice information.

In a possible implementation, step S2 includes:

Step S21: preprocessing the adversarial audio information according to the durations of the plurality of pieces of first voice information, to obtain first audio information corresponding to the duration of each piece of first voice information;

Step S22: transforming the first voice information and the corresponding first audio information respectively, to obtain second voice information and corresponding second audio information, where the transformation includes a room impulse response;

Step S23: fusing the second voice information and the corresponding second audio information to obtain the fused voice information.

For example, in step S21, let the adversarial audio information be δ and the acquired voice sample set be X = {x₁, x₂, …, x_N}, where x_i denotes a piece of first voice information of the target person, i ranges from 1 to N, and N is a positive integer. The durations of the N pieces of first voice information x₁ to x_N may all differ; the present disclosure does not limit the duration of each piece of first voice information.

According to the duration of each piece of first voice information x_i, the adversarial audio information δ is preprocessed to obtain first audio information δ′ corresponding to the duration of that x_i.

The duration of the adversarial audio information δ may be set shorter than that of each piece of first voice information x_i. The adversarial audio information δ is then preprocessed according to the duration of each x_i, for example by repeat extension, to obtain a plurality of pieces of first audio information δ′, each with the same duration as the corresponding x_i. It should be understood that the present disclosure does not limit the duration of the adversarial audio information δ or the specific preprocessing method.
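
The repeat extension of step S21 can be sketched as follows, assuming that audio is held as a one-dimensional NumPy array of samples and that "repeat extension" means tiling δ end to end and truncating to the target length. The function name `repeat_extend` is hypothetical.

```python
import numpy as np

def repeat_extend(delta, target_len):
    """Tile delta end to end and truncate to exactly target_len samples."""
    reps = -(-target_len // len(delta))  # ceiling division
    return np.tile(delta, reps)[:target_len]

delta = np.array([1.0, 2.0, 3.0])
delta_prime = repeat_extend(delta, 8)
```

Because every δ′ is built from the same underlying δ, gradients computed on each δ′ can be folded back onto the samples of δ.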

In step S22, the transformation T is applied to each piece of first voice information x_i in the voice sample set X and to the corresponding first audio information δ′, to obtain second voice information T(x_i) and corresponding second audio information T(δ′).

The transformation T may include a room impulse response (RIR), also called a room transfer function. The transformation T can be obtained by measurement; for example, the room impulse response of an actual room can be measured with a sine sweep signal.

A sweep signal x_e(t) may be played, which can be expressed as:

In formula (1), f₁ and f₂ delimit the frequency range over which the room impulse response is to be measured, T₁ is the duration of the sweep signal x_e(t), and the independent variable t of x_e(t) ranges from 0 to T₁.
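
Formula (1) itself is not reproduced in this text. As an assumption only, a widely used sweep that matches the symbols f₁, f₂ and T₁ defined above is the exponential sine sweep, whose instantaneous frequency rises from f₁ at t = 0 to f₂ at t = T₁:

```latex
x_e(t) = \sin\!\left( \frac{2\pi f_1 T_1}{\ln(f_2/f_1)}
         \left( e^{\frac{t}{T_1}\ln(f_2/f_1)} - 1 \right) \right),
\qquad 0 \le t \le T_1
```

Differentiating the phase gives an instantaneous frequency of f₁·e^{(t/T₁)·ln(f₂/f₁)}, which equals f₁ at t = 0 and f₂ at t = T₁. Whether the patent uses exactly this form cannot be confirmed from the text.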

Let y_e(t) be the signal recorded at the receiving end. The room impulse response is then r(t) = y_e(t) * x_e(−t), where * denotes convolution and x_e(−t) denotes the sweep signal x_e(t) reversed in time.

Therefore, for any audio x, the transformation T can be written as T(x) = x * r, where x is the audio to be transformed and r is the room impulse response r(t).
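
The measurement r(t) = y_e(t) * x_e(−t) and the transformation T(x) = x * r can be sketched with NumPy. Several things are assumptions made for illustration: the exponential form of the sweep, the sample rate, the toy "room" that is just a 12-sample delay, and the truncation of T(x) to the length of x.

```python
import numpy as np

def exp_sine_sweep(f1, f2, duration, fs):
    """Exponential sine sweep from f1 to f2 Hz (assumed form of the sweep signal)."""
    t = np.arange(int(duration * fs)) / fs
    rate = np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * duration / rate * (np.exp(t / duration * rate) - 1.0))

def estimate_rir(recorded, sweep):
    """r(t) = y_e(t) * x_e(-t): convolve the recording with the time-reversed sweep."""
    full = np.convolve(recorded, sweep[::-1])
    # Zero lag sits at index len(sweep) - 1 of the full convolution.
    return full[len(sweep) - 1:]

def apply_room(x, rir):
    """T(x) = x * r, truncated to the length of x (truncation is an assumption)."""
    return np.convolve(x, rir)[:len(x)]

fs = 8000
sweep = exp_sine_sweep(100.0, 3000.0, 0.25, fs)
true_rir = np.zeros(64)
true_rir[12] = 1.0                          # toy room: a pure 12-sample delay
recorded = np.convolve(sweep, true_rir)     # what the receiver would record
rir_est = estimate_rir(recorded, sweep)
rir_est = rir_est / np.max(np.abs(rir_est)) # normalise the peak to 1
delayed = apply_room(np.ones(100), true_rir)
```

Convolving with the time-reversed sweep yields the true response smoothed by the autocorrelation of the sweep, so the estimate peaks at the correct delay but is not an exact copy of the room response.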

In step S23, the second voice information T(x_i) and the corresponding second audio information T(δ′) are fused to obtain N pieces of fused voice information T(x_i) + T(δ′).

It should be understood that the present disclosure does not limit the specific manner of fusing the second voice information T(x_i) and the second audio information T(δ′). For example, fusion may be performed by adding the amplitudes of T(x_i) and T(δ′) at corresponding moments.
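
Steps S21 to S23 can be combined into one short sketch, assuming sample-wise addition as the fusion, tiling as the repeat extension, and convolution with a room impulse response truncated to the input length as the transformation T. The function name `fuse` and all values are hypothetical.

```python
import numpy as np

def fuse(voice, delta, rir):
    """Return T(x_i) + T(delta'), with T(x) = x * r and delta tiled to len(voice)."""
    reps = -(-len(voice) // len(delta))            # ceiling division
    delta_ext = np.tile(delta, reps)[:len(voice)]  # S21: repeat extension
    t_voice = np.convolve(voice, rir)[:len(voice)]      # S22: T(x_i)
    t_delta = np.convolve(delta_ext, rir)[:len(voice)]  # S22: T(delta')
    return t_voice + t_delta                            # S23: sample-wise sum

voice = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
delta = np.array([10.0, 20.0])
fused_identity = fuse(voice, delta, np.array([1.0]))   # identity room
fused_echo = fuse(voice, delta, np.array([1.0, 0.5]))  # direct path plus one echo
```

Because convolution is linear, T(x_i) + T(δ′) equals T(x_i + δ′) when both signals pass through the same room response.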

In this way, the distortion introduced into the voice on the path from the target person speaking to the recording by the audio acquisition device of the voice verification device is modeled by the transformation T. By applying the transformation T to the first voice information and to the corresponding first audio information, the transformation T is incorporated into the optimization of the adversarial audio information. If enough different transformations T are applied during optimization, the adversarial audio information generalizes better across transformations, adapts to changes in the test environment, and improves the effectiveness of the security test of the voice verification device.

After the N pieces of fused voice information are obtained in step S2, in step S3 they may be respectively input into the voice verification device to obtain the first voice feature of the target person.

In a possible implementation, with the fused voice information expressed as T(x_i) + T(δ′), the fused voice information T(x_i) + T(δ′) may be input into the voice verification device, and the speaker verification module F of the voice verification device may be used to extract the first voice feature of the target person, namely F(T(x_i) + T(δ′)).

Each piece of fused voice information T(x_i) + T(δ′) may be input separately into the speaker verification module F of the voice verification device: the first time T(x₁) + T(δ′) is input, the second time T(x₂) + T(δ′), and so on, until T(x_N) + T(δ′) is input the N-th time. In this case, in the subsequent optimization of the adversarial audio information, each piece of fused voice information yields its own gradient with respect to the adversarial audio information, so updating the adversarial audio information with these gradients requires N forward propagations and N backward propagations.

Alternatively, the pieces of fused voice information may be input into the voice verification device as a batch. Because the N pieces of fused voice information differ in length, each of them may be extended to the length of the longest one. In this case, the subsequent optimization of the adversarial audio information requires one forward propagation and one backward propagation.
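
The batch input described above can be sketched as follows. Zero padding is an assumption: the text says only that each piece is extended to the length of the longest one, not how. The original lengths are returned as well, in case a model needs to ignore the padded samples.

```python
import numpy as np

def pad_to_longest(signals):
    """Stack variable-length 1-D signals into one (N, max_len) array.

    Zero padding is an assumption; the extension method is not specified.
    """
    lengths = np.array([len(s) for s in signals])
    batch = np.zeros((len(signals), int(lengths.max())))
    for row, s in zip(batch, signals):
        row[:len(s)] = s  # each row is a view into batch
    return batch, lengths

batch, lengths = pad_to_longest([np.array([1.0, 2.0]), np.array([3.0, 4.0, 5.0])])
```

With all N pieces in a single array, a framework with automatic differentiation can compute the loss over the whole batch and back-propagate once.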

It should be understood that the present disclosure does not limit the manner in which the pieces of fused voice information are input into the voice verification device.

After the first voice feature of the target person is obtained in step S3, in step S4 the adversarial audio information may be optimized according to the first voice feature of the target person and the second voice feature of the registered person stored in the voice verification device.

In a possible implementation, step S4 includes: optimizing the adversarial audio information based on a first loss function to obtain adversarial audio information in a first state; and optimizing the adversarial audio information in the first state based on the first loss function and a second loss function to obtain the optimized adversarial audio information; where the first loss function indicates the recognition error between the first voice feature and the second voice feature, and the second loss function indicates the influence of the adversarial audio information on the speech recognition content.

For example, the adversarial audio information may be optimized by a two-stage optimization method.

In the first stage, the adversarial audio information δ may be optimized based on the first loss function L₁ to obtain the adversarial audio information δ in the first state, which satisfies the adversarial property.

The first loss function indicates the recognition error between the first voice feature and the second voice feature.

The first loss function L₁ can be expressed as:

In formula (2), X denotes the voice sample set, which includes N pieces of first voice information; x_i denotes the i-th piece of first voice information in X; δ′ denotes the preprocessed adversarial audio information δ, that is, the first audio information δ′ with the same duration as x_i; y denotes the audio information of the registered person; F denotes the speaker verification module of the voice verification device; F(T(x_i) + T(δ′)) denotes the first voice feature of the target person; F(y) denotes the second voice feature of the registered person; s(F(T(x_i) + T(δ′)), F(y)) denotes the similarity between the first voice feature and the second voice feature; θ denotes the preset threshold; and κ denotes the confidence. The similarity s(·) may be computed by cosine similarity, the Pearson correlation coefficient or the Jaccard similarity coefficient; the present disclosure does not limit the specific method.
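
Formula (2) itself is not reproduced in this text. As an assumption only, a margin-style loss that uses exactly the symbols defined above, and that is minimized when every similarity exceeds the threshold θ by at least κ, would be:

```latex
L_1(\delta) = \frac{1}{N} \sum_{i=1}^{N}
  \max\Bigl( \theta - s\bigl( F(T(x_i) + T(\delta')),\, F(y) \bigr),\; -\kappa \Bigr)
```

Under this assumed form, each term stops decreasing once s exceeds θ + κ, so κ controls how far above the threshold the optimization pushes the similarity. The exact form used in the patent cannot be confirmed from the text.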

If the similarity s(F(T(x_i) + T(δ′)), F(y)) is greater than the preset threshold θ, the first voice feature and the second voice feature are judged to come from the same person;

if the similarity s(F(T(x_i) + T(δ′)), F(y)) is less than or equal to the preset threshold θ, the first voice feature and the second voice feature are judged to come from different persons.

It should be understood that the preset threshold θ and the confidence κ may be set empirically; the present disclosure does not limit their specific values.
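
The threshold decision above, together with one possible first loss, can be sketched as follows. Cosine similarity is one of the options the text names. The hinge form max(θ − s, −κ) averaged over the samples is an assumption made for this sketch, since the formula for L₁ is not reproduced here; the function names and feature values are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_speaker(feat, enrolled, theta):
    """Decision rule from the text: same person if and only if similarity > theta."""
    return cosine_similarity(feat, enrolled) > theta

def targeted_margin_loss(feats, enrolled, theta, kappa):
    """Assumed form of L1: mean over samples of max(theta - s, -kappa)."""
    sims = np.array([cosine_similarity(f, enrolled) for f in feats])
    return float(np.mean(np.maximum(theta - sims, -kappa)))

enrolled = np.array([1.0, 0.0])                        # stands in for F(y)
feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # stand in for F(T(x_i) + T(delta'))
loss = targeted_margin_loss(feats, enrolled, theta=0.5, kappa=0.1)
```

In this toy case the similarities are 1 and 0, so the per-sample terms are −0.1 and 0.5 and the loss is 0.2. Only the second sample, which is still below the threshold, contributes a term that further optimization could reduce.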

Therefore, in the first stage, optimizing the first loss function L₁ can make the similarity between the first voice feature of the target person and the second voice feature of the registered person greater than the threshold, so that the voice verification device judges the two utterances to come from the same speaker. The adversarial audio information in the first state thus satisfies the adversarial property.

In the second stage, the adversarial audio information δ in the first state may be optimized based on the first loss function L₁ and the second loss function L₂ to obtain the optimized adversarial audio information δ.

For example, the adversarial audio information δ may be optimized based on the weighted sum of the first loss function L₁ and the second loss function L₂, L = L₁ + λL₂, to obtain the optimized adversarial audio information δ. Here λ is a balance parameter whose value may be set empirically; the present disclosure does not limit its specific value.

The optimized adversarial audio information δ obtained in this stage reduces the influence of the adversarial audio information δ on the speech recognition content while preserving the adversarial property.

The second loss function L₂ indicates the influence of the adversarial audio information on the speech recognition content.

Because the speech recognition modules of voice verification devices vary widely, the optimization cannot be carried out against a single speech recognition module. However, since speech recognition modules include a preprocessing step that extracts frequency-domain features, the influence of the adversarial audio information on this frequency-domain feature extraction can be reduced, thereby reducing its effect on the recognition result.

Therefore, the second loss function L₂ can be expressed as:

L₂(δ) = mean(|STFT(δ)|)    (3)

In formula (3), δ denotes the adversarial audio information; STFT(·) denotes the function that extracts the frequency-domain features of the audio, that is, the short-time Fourier transform (STFT) of the adversarial audio information δ; and mean(·) denotes the averaging function, which takes the mean of the magnitudes of the STFT result.
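
Formula (3) can be computed directly with NumPy. The frame length, hop size and Hann window below are assumptions, since the text does not specify the STFT parameters; the function names are hypothetical.

```python
import numpy as np

def stft(signal, frame_len=512, hop=128):
    """Short-time Fourier transform with a Hann window.

    Returns an array of shape (n_frames, frame_len // 2 + 1).
    """
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.fft.rfft(signal[idx] * np.hanning(frame_len), axis=1)

def second_loss(delta):
    """L2(delta) = mean(|STFT(delta)|), as in formula (3)."""
    return float(np.mean(np.abs(stft(delta))))

rng = np.random.default_rng(0)
delta = 0.01 * rng.standard_normal(2048)
loss_value = second_loss(delta)
```

Since the STFT is linear, scaling δ by a factor scales L₂ by the same factor, so minimizing L₂ pushes the spectral magnitude of the adversarial audio information toward zero.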

因此，在第二阶段，通过优化第一损失函数L₁和第二损失函数L₂的加权和L＝L₁+λL₂，得到的优化后的对抗音频信息，在满足对抗性质的前提下，可降低对原语音识别内容的影响。Therefore, in the second stage, by optimizing the weighted sum L=L ₁ +λL ₂ of the first loss function L ₁ and the second loss function L ₂ , the optimized adversarial audio information can reduce the impact on the original speech recognition content while satisfying the adversarial nature.

In this way, the first stage yields adversarial audio information in a first state, which already satisfies the adversarial property. The second stage then reduces the impact of the adversarial audio information on recognition of the original speech content while preserving that property, so that the optimized adversarial audio information can be used to test the security and reliability of the voice verification device effectively.
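The two-stage schedule can be sketched as a choice of objective per optimization step. This is only an illustration: the step counts, the value of λ, and the names `stage_for_step` and `staged_objective` are assumptions, and the loss values passed in stand for L₁ and L₂ computed elsewhere.

```python
def stage_for_step(step, stage1_steps):
    """Return 1 while in the first stage, 2 afterwards (step counts are assumed)."""
    return 1 if step < stage1_steps else 2

def staged_objective(stage, l1, l2, lam=0.1):
    """Objective minimized in each stage.

    Stage 1 uses L1 only, to reach the adversarial property.
    Stage 2 uses L = L1 + lam * L2, to also limit the effect on recognition.
    lam is a hypothetical weight; the source only names it as lambda.
    """
    if stage == 1:
        return l1
    if stage == 2:
        return l1 + lam * l2
    raise ValueError("stage must be 1 or 2")

# Example: with 3 first-stage steps, step 2 is stage 1 and step 3 is stage 2.
early = staged_objective(stage_for_step(2, 3), l1=2.0, l2=5.0, lam=0.5)
late = staged_objective(stage_for_step(3, 3), l1=2.0, l2=5.0, lam=0.5)
```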

After the optimized adversarial audio information is obtained through steps S1 to S4, it can be used for security testing. FIG. 6 shows a schematic diagram of a security test performed with the optimized adversarial audio information according to an embodiment of the present disclosure.

As shown in FIG. 6, an audio playback device plays the optimized adversarial audio information. While it is playing, the target person reads out the dynamic password, and the microphone of the voice verification device receives and records the verification voice information, i.e., the mixture of the optimized adversarial audio information and the target person's reading of the dynamic password.

The voice verification device verifies the received verification voice and produces a verification result, from which the security test result for the voice verification device can be determined.

In a possible implementation, when the target person is an unregistered person, the optimized adversarial audio information is used to cause the voice verification device to output a verification result of successful verification.

For example, in the targeted scenario, where the voice verification device may misidentify an unregistered target person A as a registered person B, the optimized adversarial audio information can be used to test the security of the voice verification device by exploiting its adversarial property.

Here, the adversarial property of the optimized adversarial audio information is specifically the following: when the optimized adversarial audio information and the voice information of the (unregistered) target person are input into the voice verification device, the device can be made to output a verification result of successful verification.

In the targeted scenario, the optimized adversarial audio information and the voice information of the (unregistered) target person are input into the voice verification device. If the device outputs a verification result of successful verification, its security is relatively low; if it outputs a verification result of failed verification, its security is relatively high.

Thus, the adversarial property of the optimized adversarial audio information allows the security of the voice verification device to be tested effectively.

The above describes the security testing method for the voice verification device of the embodiments of the present disclosure in the targeted scenario. In the targeted scenario, the voice verification device faces the serious security risk of identifying an unregistered person A as a registered person B. By comparison, in the untargeted scenario, the security risk of the voice verification device failing to recognize a registered person is somewhat smaller.

For the untargeted scenario, the security testing method for the targeted scenario shown in FIG. 5 can be applied after some of its settings are modified.

As shown in FIG. 5, person A (the target person) and person B (the registered person) are set to be the same person; that is, the target person has already registered with the voice verification device. Because they are the same person, the voice verification device would accept the target person's verification voice in the absence of adversarial audio information. In the untargeted scenario, the added adversarial audio information is intended to make the target person's verification fail, so the first loss function can be changed to L′₁:

[Formula (4): the equation is not reproduced in this text.]

In formula (4), X denotes the voice sample set, which includes N pieces of first voice information; x_i denotes the i-th piece of first voice information in X; δ′ denotes the preprocessed adversarial audio information δ, i.e., the first audio information δ′ having the same duration as the first voice information x_i; y denotes the audio information of the registered person (i.e., the registered audio of target person A); F denotes the speaker verification module of the voice verification device; F(T(x_i)+T(δ′)) denotes the first voice feature of the target person; F(y) denotes the second voice feature of the registered person (i.e., target person A); s(F(T(x_i)+T(δ′)), F(y)) denotes the similarity between the first voice feature and the second voice feature; θ denotes the preset threshold; and κ denotes the confidence.

Optimizing L′₁ drives the similarity between the target person's first voice feature and that person's own registered second voice feature below the threshold, so that the voice verification device judges that the two utterances do not come from the same speaker.
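The equation for formula (4) does not appear in this text. One form that is consistent with the symbols defined for formula (4) and with the stated goal of pushing the similarity below the threshold θ is a hinge loss averaged over the sample set. This is an assumed reconstruction for illustration, not the formula as published:

```latex
L'_1(\delta) \;=\; \frac{1}{N}\sum_{i=1}^{N}
  \max\!\Big( s\big(F(T(x_i) + T(\delta')),\, F(y)\big) - \theta + \kappa,\; 0 \Big)
```

Under this assumed form, the term for sample x_i reaches zero once the similarity falls below θ − κ, so a larger confidence κ demands a wider margin below the threshold.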

It should be understood that the modified method changes only the settings described above; its steps are the same as those in the targeted scenario and are not repeated here.

Similarly, once optimized adversarial audio information suited to the untargeted scenario is obtained, a security test in the untargeted scenario can be performed with it, as shown in FIG. 6.

While the audio playback device plays the optimized adversarial audio information, the target person reads out the dynamic password, and the microphone of the voice verification device receives and records the verification voice information, i.e., the mixture of the optimized adversarial audio information and the target person's reading of the dynamic password.

In a possible implementation, when the target person is a registered person, the optimized adversarial audio information is used to cause the voice verification device to output a verification result of failed verification.

In the untargeted scenario, where the voice verification device may fail to correctly identify a registered person, the optimized adversarial audio information can be used to test the security of the voice verification device by exploiting its adversarial property.

Here, the adversarial property of the optimized adversarial audio information is specifically the following: when the optimized adversarial audio information and the voice information of the (registered) target person are input into the voice verification device, the device can be made to output a verification result of failed verification.

In the untargeted scenario, the optimized adversarial audio information and the voice information of the (registered) target person are input into the voice verification device. If the device outputs a verification result of successful verification, its security is relatively high; if it outputs a verification result of failed verification, its security is relatively low.
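The interpretation rules for the two scenarios can be summarized in a small decision function. This is a sketch of the logic described above; the function name and the returned strings are illustrative choices, not terms defined by the disclosure.

```python
def security_assessment(target_is_registered, verification_passed):
    """Map a test outcome to the security level described in the text.

    Targeted scenario (unregistered target): the attack succeeds if
    verification passes. Untargeted scenario (registered target): the
    attack succeeds if verification fails.
    """
    if target_is_registered:
        attack_succeeded = not verification_passed
    else:
        attack_succeeded = verification_passed
    return "relatively low" if attack_succeeded else "relatively high"

# Targeted scenario: an unregistered person was accepted.
targeted_result = security_assessment(False, True)
# Untargeted scenario: a registered person was still accepted.
untargeted_result = security_assessment(True, True)
```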

From the above description of the security testing method in the targeted and untargeted scenarios, it can be seen that the stronger the adversarial property of the adversarial audio information, the more effectively the security of the voice verification device can be tested. To better illustrate the test effect of the method, three comparative experiments are described below that demonstrate the adversarial property of the adversarial audio information of the present disclosure.

Volunteers can be selected as target persons, and dynamic password content that never appeared in the voice sample set can be prepared for them to read. In the first group, the audio playback device plays nothing while the target person reads out the dynamic password; this provides a baseline for the content recognition accuracy of the voice verification device. In the second group, the audio playback device plays Gaussian noise while the target person speaks the dynamic password, with a noise amplitude close to that of the adversarial audio information; this rules out random noise as the cause of the observed effect. In the third group, the audio playback device plays the optimized adversarial audio information while the target person speaks the dynamic password; this evaluates the effectiveness of the adversarial audio information.
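The Gaussian-noise control in the second group can be sketched as follows. The text only says the noise amplitude is close to that of the adversarial audio; matching root-mean-square (RMS) level is an assumption made here, and the function names are hypothetical.

```python
import numpy as np

def rms(x):
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(np.square(x))))

def matched_gaussian_noise(reference, seed=0):
    """Gaussian noise with the same length and RMS level as `reference`.

    RMS matching is an assumed reading of "similar amplitude".
    """
    reference = np.asarray(reference, dtype=np.float64)
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(reference.shape)
    return noise * (rms(reference) / rms(noise))

# Stand-in for the optimized adversarial audio (illustrative only).
delta = 0.02 * np.sin(np.linspace(0.0, 200.0, 8000))
noise = matched_gaussian_noise(delta)
```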

In these three comparative experiments, the second group (Gaussian noise) showed no improvement in attack success rate over the first group: both were 0%. Its speech content recognition error rate rose from 11.42% to 17.77%, an increase of 6.35 percentage points. The third group (optimized adversarial audio information) achieved an attack success rate of 100%, while its speech content recognition error rate rose from 11.42% to 14.97%, an increase of only 3.55 percentage points. This shows that the optimized adversarial audio information has only a small effect on speech content recognition while maintaining the attack success rate.

In addition, two comparative experiments on replay detection were conducted. The first group used the related-art method shown in FIG. 3, in which the verification voice information (i.e., the recorded voice of the target person plus the adversarial audio information) is played by the audio playback device. The second group used the method of the present disclosure shown in FIG. 4, in which the audio playback device plays only the adversarial audio information while the target person speaks the real verification voice. The replay detection pass rate was 37.7% for the first group and 67.7% for the second group, so the method of the present disclosure passes replay detection at a rate 30 percentage points higher than the related-art method.
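The differences quoted in the experiments follow directly from the reported rates. The short check below recomputes them; the relative gain is an extra figure derived here from the same numbers and is not stated in the source.

```python
# Reported speech content recognition error rates (percent).
baseline_err, noise_err, adv_err = 11.42, 17.77, 14.97

# Increases over the baseline, in percentage points.
noise_increase = round(noise_err - baseline_err, 2)
adv_increase = round(adv_err - baseline_err, 2)

# Reported replay detection pass rates (percent).
related_art_pass, disclosed_pass = 37.7, 67.7

# Absolute gain in percentage points, and relative gain in percent.
absolute_gain = round(disclosed_pass - related_art_pass, 1)
relative_gain = round(100 * (disclosed_pass - related_art_pass) / related_art_pass, 1)

print(noise_increase, adv_increase, absolute_gain, relative_gain)
```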

Therefore, in the embodiments of the present disclosure, the acquired voice sample set may include a plurality of pieces of first voice information of the target person, each with different content. By learning the adversarial audio information over the voice sample set, the optimized adversarial audio information becomes independent of content, so that it can be applied directly to new speech even when the dynamic password changes. Moreover, the method optimizes the adversarial audio information in two stages according to the first voice feature of the target person and the second voice feature of the registered person, so that the optimized adversarial audio information satisfies the adversarial property while also having a reduced impact on speech content recognition. Finally, the optimized adversarial audio information can be played by the audio playback device as a separate sound source and, combined with the real voice of the target person, can pass the replay detection module of the voice verification device. The method of the present disclosure can thus use the adversarial property of the optimized adversarial audio information to test the security and reliability of the voice verification device effectively, to alert users to the risks of using the voice verification device, and to help developers improve it.

It can be understood that the method embodiments mentioned in the present disclosure can be combined with one another to form combined embodiments, provided that the underlying principles and logic are not violated; for brevity, these combinations are not described further. Those skilled in the art will understand that, in the methods of the specific embodiments above, the actual order in which the steps are performed should be determined by their functions and possible internal logic.

In addition, the present disclosure provides a security testing apparatus for a voice verification device, an electronic device, a computer-readable storage medium, and a program, each of which can be used to implement any of the security testing methods for a voice verification device provided by the present disclosure. For the corresponding technical solutions and descriptions, refer to the corresponding passages of the method section; they are not repeated here.

FIG. 7 shows a block diagram of a security testing apparatus for a voice verification device according to an embodiment of the present disclosure. As shown in FIG. 7, the apparatus includes:

a voice sample set acquisition module 71, configured to acquire a voice sample set that includes a plurality of pieces of first voice information of a target person;

a fusion module 72, configured to fuse the adversarial audio information with each of the plurality of pieces of first voice information to obtain a plurality of pieces of fused voice information;

a feature acquisition module 73, configured to input the plurality of pieces of fused voice information into the voice verification device to obtain the first voice feature of the target person; and

an optimization module 74, configured to optimize the adversarial audio information according to the first voice feature of the target person and the second voice feature of a registered person stored in the voice verification device, to obtain optimized adversarial audio information, the optimized adversarial audio information being used to perform a security test on the voice verification device, wherein the voice verification device is used to verify the identity of a registered person by voice.

In a possible implementation, the fusion module 72 is configured to: preprocess the adversarial audio information according to the durations of the plurality of pieces of first voice information, to obtain first audio information matching the duration of each piece of first voice information; transform the first voice information and the corresponding first audio information separately, to obtain second voice information and corresponding second audio information, wherein the transformation includes a room impulse response; and fuse the second voice information with the corresponding second audio information to obtain the fused voice information.
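The three fusion steps can be sketched with NumPy as below. Several details are assumptions made for illustration: the adversarial audio is matched to the speech duration by tiling and cropping, the room impulse response (RIR) is applied by direct convolution truncated to the input length, and fusion is sample-wise addition. The function names and the toy RIR values are hypothetical.

```python
import numpy as np

def fit_to_length(delta, length):
    """Tile or crop the adversarial audio to `length` samples (assumed preprocessing)."""
    delta = np.asarray(delta, dtype=np.float64)
    reps = -(-length // delta.size)  # ceiling division
    return np.tile(delta, reps)[:length]

def apply_rir(signal, rir):
    """Simulate room acoustics by convolving with an RIR, keeping the input length."""
    return np.convolve(signal, rir)[: len(signal)]

def fuse(speech, delta, rir_speech, rir_delta):
    """Transform speech and adversarial audio separately, then add them."""
    speech = np.asarray(speech, dtype=np.float64)
    delta_fit = fit_to_length(delta, speech.size)
    return apply_rir(speech, rir_speech) + apply_rir(delta_fit, rir_delta)

# With identity RIRs the fusion reduces to speech plus the tiled perturbation.
speech = np.arange(5, dtype=np.float64)
delta = np.array([1.0, 2.0])
identity = np.array([1.0])
fused_identity = fuse(speech, delta, identity, identity)

# A toy two-tap RIR adds a delayed, attenuated echo to the speech path.
fused_echo = fuse(speech, delta, np.array([1.0, 0.5]), identity)
```

Transforming the two signals separately reflects that the speaker and the playback device are distinct sources, each reaching the microphone through its own acoustic path.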

In a possible implementation, the feature acquisition module 73 is configured to: optimize the adversarial audio information based on a first loss function to obtain adversarial audio information in a first state; and optimize the adversarial audio information in the first state based on the first loss function and a second loss function to obtain the optimized adversarial audio information; wherein the first loss function indicates the recognition error between the first voice feature and the second voice feature, and the second loss function indicates the impact of the adversarial audio information on the speech recognition content.

In some embodiments, the functions or modules of the security testing apparatus for a voice verification device provided by the embodiments of the present disclosure can be used to perform the methods described in the method embodiments above. For their specific implementation, refer to the description of those method embodiments; for brevity, it is not repeated here.

An embodiment of the present disclosure further provides a computer-readable storage medium storing computer program instructions that, when executed by a processor, implement the above method. The computer-readable storage medium may be a non-volatile computer-readable storage medium.

An embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory to perform the above method.

The electronic device may be provided as a terminal, a server, or a device in another form.

FIG. 8 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.

Referring to FIG. 8, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, phone calls, data communication, camera operation, and recording. The processing component 802 may include one or more processors 820 that execute instructions to complete all or some of the steps of the above method. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation of the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, and videos. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The power component 806 supplies power to the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with it. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operating mode, such as a photo mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focal length and optical zoom capability.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operating mode such as a call mode, a recording mode, or a voice recognition mode. The received audio signals may be stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.

The sensor component 814 includes one or more sensors for providing status assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, for example the display and keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and changes in its temperature. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include an optical sensor, such as a complementary metal-oxide-semiconductor (CMOS) or charge-coupled device (CCD) image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as Wi-Fi, second-generation (2G), third-generation (3G), fourth-generation (4G), or fifth-generation (5G) mobile communication technology, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the above method.

图9示出根据本公开实施例的一种电子设备1900的框图。例如，电子设备1900可以被提供为一服务器。参照图9，电子设备1900包括处理组件1922，其进一步包括一个或多个处理器，以及由存储器1932所代表的存储器资源，用于存储可由处理组件1922的执行的指令，例如应用程序。存储器1932中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外，处理组件1922被配置为执行指令，以执行上述方法。FIG9 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 may be provided as a server. Referring to FIG9 , the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932 for storing instructions executable by the processing component 1922, such as an application. The application stored in the memory 1932 may include one or more modules, each of which corresponds to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above method.

电子设备1900还可以包括一个电源组件1926被配置为执行电子设备1900的电源管理，一个有线或无线网络接口1950被配置为将电子设备1900连接到网络，和一个输入输出(I/O)接口1958。电子设备1900可以操作基于存储在存储器1932的操作系统，例如微软服务器操作系统(Windows ServerTM)，苹果公司推出的基于图形用户界面操作系统(Mac OSXTM)，多用户多进程的计算机操作系统(UnixTM),自由和开放原代码的类Unix操作系统(LinuxTM)，开放原代码的类Unix操作系统(FreeBSDTM)或类似。The electronic device 1900 may further include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 may operate based on an operating system stored in the memory 1932, such as Microsoft's server operating system (Windows ServerTM), Apple's graphical user interface-based operating system (Mac OSXTM), a multi-user multi-process computer operating system (UnixTM), a free and open source Unix-like operating system (LinuxTM), an open source Unix-like operating system (FreeBSDTM), or the like.

In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, for example the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to perform the above method.

The present disclosure may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present disclosure.

A computer-readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or raised structures in a groove on which instructions are stored, and any suitable combination of the foregoing. A computer-readable storage medium as used herein is not to be construed as a transient signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (for example, a light pulse passing through a fiber-optic cable), or an electrical signal transmitted through a wire.

The computer-readable program instructions described herein can be downloaded from a computer-readable storage medium to respective computing/processing devices, or downloaded to an external computer or external storage device via a network, such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, optical fiber transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives the computer-readable program instructions from the network and forwards them for storage in a computer-readable storage medium within the respective computing/processing device.

The computer program instructions for carrying out the operations of the present disclosure may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source code or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, an electronic circuit, such as a programmable logic circuit, a field-programmable gate array (FPGA), or a programmable logic array (PLA), is personalized by using state information of the computer-readable program instructions, and the electronic circuit can execute the computer-readable program instructions to implement various aspects of the present disclosure.

Various aspects of the present disclosure are described herein with reference to flowcharts and/or block diagrams of methods, apparatuses (systems), and computer program products according to embodiments of the present disclosure. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.

These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium, where they cause a computer, a programmable data processing apparatus, and/or other devices to operate in a particular manner, so that the computer-readable medium storing the instructions comprises an article of manufacture including instructions that implement aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps is performed on the computer, other programmable data processing apparatus, or other device to produce a computer-implemented process, whereby the instructions executed on the computer, other programmable data processing apparatus, or other device implement the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.

The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to multiple embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function. In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.

The computer program product may be implemented in hardware, software, or a combination thereof. In one optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (SDK).

The embodiments of the present disclosure have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical applications, or their improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. A method for testing the security of a voice verification device, characterized in that the method comprises:

Acquiring a voice sample set, wherein the voice sample set includes a plurality of first voice information of a target person;

Fusing the adversarial audio information with the plurality of first voice information respectively to obtain a plurality of fused voice information;

Inputting the plurality of fused voice information into the voice verification device respectively to obtain the first voice feature of the target person;

Optimizing the adversarial audio information according to the first voice feature of the target person and the second voice feature of the registered person stored in the voice verification device to obtain optimized adversarial audio information, wherein the optimized adversarial audio information is used to perform a security test on the voice verification device;

Wherein the voice verification device is used to verify the identity of the registered person by voice.

2. The method according to claim 1, characterized in that the method further comprises:

Controlling an audio playback device to play the optimized adversarial audio information;

Acquiring a verification result of the voice verification device for verification voice information, wherein the verification voice information includes the optimized adversarial audio information and a real verification voice uttered by the target person while the audio playback device plays the optimized adversarial audio information;

Determining a security test result of the voice verification device according to the verification result.

3. The method according to claim 1, characterized in that optimizing the adversarial audio information according to the first voice feature of the target person and the second voice feature of the registered person stored in the voice verification device to obtain the optimized adversarial audio information comprises:

Based on a first loss function, optimizing the adversarial audio information to obtain adversarial audio information in a first state;

Based on the first loss function and a second loss function, optimizing the adversarial audio information in the first state to obtain the optimized adversarial audio information;

Wherein the first loss function indicates the recognition error between the first voice feature and the second voice feature, and the second loss function indicates the influence of the adversarial audio information on the speech recognition content.
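The two-stage optimization of claim 3 can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the disclosure: the voice verification device is replaced by a hypothetical linear feature extractor `W`, the first loss is taken to be the mean squared distance between the fused-voice feature and the enrolled feature (which corresponds to the case where an unregistered target person should be accepted), the second loss is approximated by the energy of the adversarial audio as a stand-in for its influence on the recognized speech content, and plain gradient descent is used.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = Sequence[Sequence[float]]


def matvec(W: Matrix, x: Sequence[float]) -> Vector:
    """Hypothetical linear feature extractor standing in for the device."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def first_loss(W: Matrix, voices: Sequence[Vector], adv: Vector, enrolled: Vector) -> float:
    """Assumed first loss: mean squared feature distance to the enrolled feature."""
    total = 0.0
    for v in voices:
        feat = matvec(W, [vi + ai for vi, ai in zip(v, adv)])
        total += sum((f - e) ** 2 for f, e in zip(feat, enrolled))
    return total / len(voices)


def first_loss_grad(W: Matrix, voices: Sequence[Vector], adv: Vector, enrolled: Vector) -> Vector:
    """Analytic gradient of first_loss with respect to the adversarial audio."""
    grad = [0.0] * len(adv)
    for v in voices:
        feat = matvec(W, [vi + ai for vi, ai in zip(v, adv)])
        resid = [f - e for f, e in zip(feat, enrolled)]
        for k, row in enumerate(W):
            for j, w in enumerate(row):
                grad[j] += 2.0 * w * resid[k] / len(voices)
    return grad


def second_loss(adv: Vector) -> float:
    """Assumed second loss: energy of the adversarial audio (a proxy only)."""
    return sum(a * a for a in adv)


def two_stage_optimize(W: Matrix, voices: Sequence[Vector], enrolled: Vector,
                       adv: Vector, steps: int = 50, lr: float = 0.1,
                       weight: float = 1.0) -> Tuple[Vector, Vector]:
    """Return (adversarial audio in the first state, optimized adversarial audio)."""
    adv = list(adv)
    # Stage 1: minimize the first loss only.
    for _ in range(steps):
        g1 = first_loss_grad(W, voices, adv, enrolled)
        adv = [a - lr * g for a, g in zip(adv, g1)]
    first_state = list(adv)
    # Stage 2: minimize the first loss plus the weighted second loss.
    for _ in range(steps):
        g1 = first_loss_grad(W, voices, adv, enrolled)
        adv = [a - lr * (g + weight * 2.0 * a) for a, g in zip(adv, g1)]
    return first_state, adv
```

In this toy setting the second stage trades some recognition error for a smaller perturbation. For the case where a registered target person should be rejected, the sign of the first loss would be reversed under the same assumptions.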

4. The method according to claim 1, characterized in that fusing the adversarial audio information with the plurality of first voice information respectively to obtain a plurality of fused voice information comprises:

Preprocessing the adversarial audio information according to the durations of the plurality of first voice information to obtain first audio information corresponding to the durations of the respective first voice information;

Transforming each first voice information and the corresponding first audio information respectively to obtain second voice information and corresponding second audio information, wherein the transformation includes a room impulse response;

Fusing the second voice information and the corresponding second audio information to obtain the fused voice information.
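The fusion steps of claim 4 can be sketched as follows. This is only an illustration under stated assumptions: audio is represented as a list of float samples, the duration preprocessing is assumed to be tiling or truncation, the room impulse response is applied by direct convolution truncated to the input length, and fusion is assumed to be sample-wise addition. The function names `fit_to_duration`, `apply_rir`, and `fuse` are hypothetical.

```python
from typing import List, Sequence


def fit_to_duration(adv: Sequence[float], n: int) -> List[float]:
    """Tile or truncate the adversarial audio to n samples (assumed preprocessing)."""
    if not adv:
        raise ValueError("adversarial audio must be non-empty")
    return [adv[i % len(adv)] for i in range(n)]


def apply_rir(signal: Sequence[float], rir: Sequence[float]) -> List[float]:
    """Convolve a signal with a room impulse response, keeping the input length."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        for j, h in enumerate(rir):
            if i + j < len(out):
                out[i + j] += s * h
    return out


def fuse(voice: Sequence[float], adv: Sequence[float],
         rir_voice: Sequence[float], rir_adv: Sequence[float]) -> List[float]:
    """Preprocess, transform, and fuse one voice sample with the adversarial audio."""
    # First audio information: adversarial audio matched to the voice duration.
    first_audio = fit_to_duration(adv, len(voice))
    # Second voice/audio information: each signal passed through its own RIR.
    second_voice = apply_rir(voice, rir_voice)
    second_audio = apply_rir(first_audio, rir_adv)
    # Fused voice information: assumed here to be a sample-wise sum.
    return [v + a for v, a in zip(second_voice, second_audio)]
```

Using separate impulse responses for the voice and the adversarial audio reflects the fact that the speaker and the playback device would normally sit at different positions in the room; that choice is also an assumption of this sketch.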

5. The method according to claim 2, characterized in that:

In the case where the target person is an unregistered person, the optimized adversarial audio information is used to cause the voice verification device to output a verification result of verification success;

In the case where the target person is a registered person, the optimized adversarial audio information is used to cause the voice verification device to output a verification result of verification failure.
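The two cases of claim 5 amount to a simple rule for deciding whether a single test trial defeated the device. The sketch below encodes that rule and, as an assumption not stated in the claims, summarizes a series of trials as an attack success rate. The function names are hypothetical.

```python
from typing import Iterable, Tuple


def attack_succeeded(target_is_registered: bool, verification_passed: bool) -> bool:
    """Return True when the device produced the outcome the adversarial audio aims for."""
    # Unregistered target: the attack aims for a (wrong) verification success.
    # Registered target: the attack aims for a (wrong) verification failure.
    return verification_passed != target_is_registered


def attack_success_rate(trials: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of (target_is_registered, verification_passed) trials that succeeded."""
    results = [attack_succeeded(registered, passed) for registered, passed in trials]
    if not results:
        raise ValueError("at least one trial is required")
    return sum(results) / len(results)
```

A higher rate would suggest that the device under test is more susceptible to the optimized adversarial audio; how the rate maps to a final security test result is left open here.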

6. The method according to claim 1, characterized in that the voice verification device comprises a voice recognition module, a speaker verification module, and a playback detection module;

Wherein the voice recognition module is used to recognize the content of the verification voice information, the speaker verification module is used to determine whether the features of the verification voice information match the voice features of a registered person, and the playback detection module is used to detect whether the verification voice information is a recording that has been played back.

7. A security testing device for a voice verification device, characterized in that the device comprises:

A voice sample set acquisition module, used to acquire a voice sample set, wherein the voice sample set includes a plurality of first voice information of a target person;

A fusion module, used for fusing the adversarial audio information with the plurality of first voice information respectively to obtain a plurality of fused voice information;

A feature acquisition module, used for inputting the plurality of fused voice information into the voice verification device respectively to obtain a first voice feature of the target person;

An optimization module, used for optimizing the adversarial audio information according to the first voice feature of the target person and the second voice feature of the registered person stored in the voice verification device to obtain optimized adversarial audio information, wherein the optimized adversarial audio information is used to perform a security test on the voice verification device,

8. The device according to claim 7, characterized in that the device further comprises:

An audio playback module, used to control an audio playback device to play the optimized adversarial audio information;

A verification result acquisition module, used to acquire a verification result of the voice verification device for verification voice information, wherein the verification voice information includes the optimized adversarial audio information and a real verification voice uttered by the target person while the audio playback device plays the optimized adversarial audio information;

A security test result determination module, used to determine a security test result of the voice verification device according to the verification result.

9. An electronic device, comprising:

A processor;

A memory for storing processor-executable instructions;

Wherein the processor is configured to call the instructions stored in the memory to execute the method according to any one of claims 1 to 5.

10. A computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method according to any one of claims 1 to 5.