Two-factor authentication method, device and system based on voice recognition
Technical Field
The invention belongs to the technical field of voice recognition, and particularly relates to a voice recognition-based two-factor authentication method, device, and system.
Background
Two-factor authentication (TFA) refers to the use of two different types of authentication to protect a user's account, i.e., two of the following three elements must be provided:
1. some secret known to the person being authenticated, such as a password or the answers to security questions;
2. something that the person being authenticated owns, such as a smartphone or other physical device;
3. an inherent characteristic of the person being authenticated, such as a fingerprint, iris, or voice.
Authentication by any combination of two of the above elements may be referred to as two-factor authentication. Six implementation types are currently common:
Mode 1: security questions. When the account is created, one or more security questions are selected and an answer is set for each in advance. When subsequently logging in with the account, the correct answer must be provided for each question before authentication and access authorization can be completed.
Mode 2: short messages or e-mails. When an account is created, a commonly used cell phone number is provided. When logging in again, the system sends a verification code by short message or e-mail. This temporary code expires after a short time (e.g., 1-5 minutes), so it must be entered and submitted promptly to complete the login.
Mode 3: a Time-based One-Time Password (TOTP). Where conditions permit, an identity authentication application can scan a two-dimensional code containing a secret key. The key is loaded into the application, which then generates a temporary password that changes periodically; entering this password on the login page completes the authentication required for login.
Mode 4: a U2F key. Universal 2nd Factor (U2F) is an open standard that can be used in USB devices, NFC devices, and smart cards. During identity authentication, the user simply inserts the USB key, taps the NFC device, or swipes the smart card.
Mode 5: push notifications. On some platforms, after the password is entered, the device receives a push message notifying the user of the login attempt; the request is answered simply by tapping approve or deny on the device.
Mode 6: biometric identification. Face recognition, voice recognition, and fingerprint scanning are all types of biometric identification. They are often deployed in areas where security checks are required to determine the true identity of a visitor.
The performance of a speech recognition system depends to a large extent on the speech representation, whose quality can be influenced by many variations, such as pitch, loudness, and timbre. It therefore makes sense to design a system discriminative enough to recognize distinct speech. Voice recognition is mainly divided into text-independent and text-dependent approaches. A text-independent system can recognize a voiceprint from only a relatively short section of speech, but from the perspective of user experience and cost, a text-independent model often needs several hours or more of training data from the target user.
Disclosure of Invention
In view of the above problems, the present invention provides a method, an apparatus, and a system for two-factor authentication based on speech recognition, which can recognize a defined specific word and, by extracting and classifying speech features, distinguish speech enrolled in the system from unknown speech not enrolled in it, thereby authenticating the voice of a specific user.
To achieve this technical purpose and its technical effects, the invention adopts the following technical scheme:
in a first aspect, the present invention provides a two-factor authentication method based on voice recognition, including:
acquiring voice data input by a user;
calling a preset voice recognition model, and judging whether the voice data input by the user is a defined word and whether it is the voice of the target user;
acquiring a password input by a user;
calling a password identification model, and judging whether a password input by a user meets a preset requirement;
and completing authentication when the voice data input by the user is a defined word, is the voice of the target user, and the password input by the user meets the preset requirement.
Optionally, the speech recognition model is obtained by:
collecting voice data of appointed words and non-appointed words to form a first sample set, extracting features, and converting the first sample set into a first spectrogram;
collecting appointed-word voice data of a target user and of non-target users to form a second sample set, extracting features, and converting the second sample set into a second spectrogram;
respectively feeding the first spectrogram and the second spectrogram into a preset convolutional neural network model, wherein the convolutional neural network model comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a first fully-connected layer, a Dropout layer, and a second fully-connected layer connected in sequence; the first spectrogram and the second spectrogram are processed by the first convolutional layer, the first pooling layer, the second convolutional layer, and the second pooling layer, and then sent to the first fully-connected layer;
and continuously adjusting the parameters of the convolutional neural network model to generate the final voice recognition model.
Optionally, the first convolution layer and the second convolution layer both use a 3 × 3 convolution kernel as the feature extraction filter.
Optionally, the first pooling layer and the second pooling layer both use max pooling.
In a second aspect, the present invention provides a two-factor authentication device based on voice recognition, including:
the first acquisition module is used for acquiring voice data input by a user;
the first judgment module is used for calling a preset voice recognition model and judging whether the voice data input by the user is a defined word and whether it is the voice of the target user;
the second acquisition module is used for acquiring the password input by the user;
the second judgment module is used for calling the password identification model and judging whether the password input by the user meets the preset requirement;
and the authentication module is used for completing authentication when the voice data input by the user is a defined word, is the voice of the target user, and the password input by the user meets the preset requirement.
Optionally, the speech recognition model is obtained by:
collecting voice data of appointed words and non-appointed words to form a first sample set, extracting features, and converting the first sample set into a first spectrogram;
collecting appointed-word voice data of a target user and of non-target users to form a second sample set, extracting features, and converting the second sample set into a second spectrogram;
respectively feeding the first spectrogram and the second spectrogram into a preset convolutional neural network model, wherein the convolutional neural network model comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a first fully-connected layer, a Dropout layer, and a second fully-connected layer connected in sequence; the first spectrogram and the second spectrogram are processed by the first convolutional layer, the first pooling layer, the second convolutional layer, and the second pooling layer, and then sent to the first fully-connected layer;
and continuously adjusting the parameters of the convolutional neural network model to generate the final voice recognition model.
Optionally, the first convolution layer and the second convolution layer both use a 3 × 3 convolution kernel as the feature extraction filter.
Optionally, the first pooling layer and the second pooling layer both use max pooling.
In a third aspect, the present invention provides a two-factor authentication system based on voice recognition, including:
a processor;
a memory having stored thereon a computer program operable on the processor;
wherein the computer program, when executed by the processor, implements the method according to any one of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
In the invention, when a user logs in to an intranet terminal host of the power system, identity authentication using the two factors of password and voice is safer: biometric identification is extremely difficult to break, and, compared with digital passwords that can be changed and modified, biometric characteristics are essentially lifelong for adults. Because the model is trained only for specific persons, the ability to resist abnormal login attacks is stronger.
The invention adopts text-dependent recognition and, by limiting the selectable dictionary information, reduces system complexity and the amount of training data required from the target user, thereby reducing cost and improving efficiency.
The invention adopts a multilayer convolutional neural network, which avoids a problem of the traditional feedforward neural network structure, in whose training process the weight distribution depends on the occurrence frequency of samples, so that a large number of samples must be provided or recognition easily fails; trained only for specific persons, the network achieves a better matching rate.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the present disclosure taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a flow chart of a method for two-factor authentication based on voice recognition according to an embodiment of the present invention;
FIG. 2 is a flow chart of a convolutional neural network construction according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the scope of the invention.
The following detailed description of the principles of the invention is provided in connection with the accompanying drawings.
Example 1
The embodiment of the invention provides a two-factor authentication method based on voice recognition, which comprises the following steps:
acquiring voice data input by a user;
calling a preset voice recognition model, and judging whether the voice data input by the user is a defined word and whether it is the voice of the target user;
acquiring a password input by a user;
calling a password identification model, and judging whether a password input by a user meets a preset requirement;
and completing authentication when the voice data input by the user is a defined word, is the voice of the target user, and the password input by the user meets the preset requirement.
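The two checks above combine as a simple conjunction: authentication completes only when the voice factor and the password factor both pass. A minimal sketch of the decision logic follows; the function names, score threshold, and plain-text password comparison are illustrative assumptions, not part of the disclosure.

```python
# Sketch of the two-factor decision flow. verify_voice, verify_password,
# and SCORE_THRESHOLD are assumed names, not from the original disclosure.

SCORE_THRESHOLD = 0.5  # assumed cutoff for the model's class probabilities

def verify_voice(word_score: float, speaker_score: float) -> bool:
    """Factor 1: the utterance must be a defined word AND the target user's voice."""
    return word_score >= SCORE_THRESHOLD and speaker_score >= SCORE_THRESHOLD

def verify_password(password: str, stored_password: str) -> bool:
    """Factor 2: the entered password must meet the preset requirement."""
    return password == stored_password

def authenticate(word_score, speaker_score, password, stored_password) -> bool:
    # Authentication completes only when BOTH factors pass.
    return (verify_voice(word_score, speaker_score)
            and verify_password(password, stored_password))

print(authenticate(0.9, 0.8, "s3cret", "s3cret"))  # True: both factors pass
print(authenticate(0.9, 0.2, "s3cret", "s3cret"))  # False: not the target speaker
```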
In a specific implementation manner of the embodiment of the present invention, as shown in fig. 1, the speech recognition model is obtained by the following method:
Collecting the appointed voice material: collecting appointed-word and non-appointed-word voice data to form a first sample set, extracting features, and converting the set into a first spectrogram; collecting appointed-word voice data of the target user and of non-target users to form a second sample set, extracting features, and converting the set into a second spectrogram, i.e., converting the waveform into a spectrogram. In the specific implementation, WAV-encoded audio files are used uniformly: single channel, 16 kHz sampling frequency, and 16 bits per sample, so the average number of bytes per second is sampling frequency × number of channels × bits per sample ÷ 8 = 32000. The collected WAV-encoded audio is imported through the TensorFlow API, and the imported WAV waveforms are converted into spectrograms. The invention uses the short-time Fourier transform (STFT) to convert the audio into the time-frequency domain, which can be expressed as a two-dimensional image and output as a two-dimensional tensor. The samples are then classified manually and labeled, forming combinations of waveform data, spectrogram images, and their corresponding labels, so that voice files can be classified through the relationship between the frequency-domain image and the audio content.
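The audio-format arithmetic and the waveform-to-spectrogram conversion above can be sketched as follows. The patent imports WAV audio through TensorFlow's API; to keep the example self-contained, this sketch uses NumPy and a synthetic sine wave in place of recorded speech, with assumed frame and step sizes.

```python
# Audio parameters from the specification, plus an STFT sketch in NumPy.
# (The real pipeline loads WAV files via TensorFlow; a 440 Hz tone stands in.)
import numpy as np

SAMPLE_RATE = 16000      # Hz, single channel
BITS_PER_SAMPLE = 16
CHANNELS = 1
bytes_per_second = SAMPLE_RATE * CHANNELS * BITS_PER_SAMPLE // 8
print(bytes_per_second)  # 32000, as stated in the specification

# One second of a 440 Hz tone as the stand-in waveform.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
waveform = np.sin(2 * np.pi * 440 * t)

def stft_spectrogram(x, frame_len=255, step=128):
    """Magnitude STFT: slice into overlapping frames, window, FFT each frame."""
    n_frames = 1 + (len(x) - frame_len) // step
    frames = np.stack([x[i * step : i * step + frame_len] for i in range(n_frames)])
    frames = frames * np.hanning(frame_len)       # window to reduce spectral leakage
    return np.abs(np.fft.rfft(frames, axis=1))    # two-dimensional time-frequency image

spec = stft_spectrogram(waveform)
print(spec.shape)  # (124, 128): a two-dimensional tensor of (frames, frequency bins)
```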
The first spectrogram and the second spectrogram are respectively fed into a preset convolutional neural network model comprising a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a first fully-connected layer, a Dropout layer, and a second fully-connected layer connected in sequence. The spectrograms are processed by the first convolutional layer, first pooling layer, second convolutional layer, and second pooling layer, and then sent to the first fully-connected layer, as shown in fig. 2. In a specific implementation, the first and second fully-connected layers each contain 1024 neurons; the first fully-connected layer flattens the output into one dimension to correspond to the labels, while the Dropout layer and the second fully-connected layer filter the output information to prevent repeated data from causing "overfitting" of the result. Finally, the configured model script is trained with TensorFlow to complete model construction and obtain the voice recognition model.
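A minimal tf.keras sketch of the layer stack just described, assuming an input spectrogram size and a class count that the original does not specify:

```python
# Sketch of the described network: two 3x3 convolution + max-pooling stages,
# a flatten to one dimension, a 1024-neuron fully-connected layer, Dropout,
# and a second 1024-neuron fully-connected layer feeding the label outputs.
# The input shape (124, 128, 1) and NUM_CLASSES = 4 are assumptions.
import tensorflow as tf

NUM_CLASSES = 4  # assumed: e.g. defined/other word x target/other speaker

model = tf.keras.Sequential([
    tf.keras.Input(shape=(124, 128, 1)),                 # spectrogram "image"
    tf.keras.layers.Conv2D(32, 3, activation="relu"),    # first conv, 3x3 kernels
    tf.keras.layers.MaxPooling2D(),                      # first pooling (max)
    tf.keras.layers.Conv2D(64, 3, activation="relu"),    # second conv, 3x3 kernels
    tf.keras.layers.MaxPooling2D(),                      # second pooling (max)
    tf.keras.layers.Flatten(),                           # collapse to one dimension
    tf.keras.layers.Dense(1024, activation="relu"),      # first fully-connected layer
    tf.keras.layers.Dropout(0.5),                        # filter output, curb overfitting
    tf.keras.layers.Dense(1024, activation="relu"),      # second fully-connected layer
    tf.keras.layers.Dense(NUM_CLASSES),                  # logits over the labels
])
model.summary()
```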
The parameters of the convolutional neural network model are continuously adjusted until the test accuracy of the model reaches the standard, generating the final voice recognition model. In the specific implementation, this comprises compiling the model and observing how the training-set loss value (loss) and test-set loss value (val_loss) change: as long as both keep decreasing, the network is still learning. When the accuracy finally stabilizes, the voice recognition model is generated.
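The compile-and-observe loop above can be sketched with tf.keras as follows. A tiny stand-in model and random data replace the real CNN and speech corpus, so only the monitoring of loss and val_loss is illustrated.

```python
# Sketch of the tuning procedure: compile, train, and watch the training-set
# loss ("loss") and test-set loss ("val_loss"). The model and data here are
# illustrative stand-ins, not the network or corpus from the disclosure.
import numpy as np
import tensorflow as tf

x = np.random.rand(64, 124, 128, 1).astype("float32")   # fake spectrograms
y = np.random.randint(0, 4, size=64)                    # fake labels

model = tf.keras.Sequential([
    tf.keras.Input(shape=(124, 128, 1)),
    tf.keras.layers.GlobalAveragePooling2D(),           # tiny stand-in for the CNN
    tf.keras.layers.Dense(4),
])
model.compile(optimizer="adam",
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

history = model.fit(x, y, validation_split=0.25, epochs=2, verbose=0)
# While the network is still learning, both curves keep decreasing:
print(history.history["loss"], history.history["val_loss"])
```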
In a specific implementation manner of the embodiment of the present invention, the first and second convolutional layers each use a 3 × 3 convolution kernel as the feature extraction filter, and the first and second pooling layers both use max pooling. The purpose of the pooling layers is to discard unnecessary redundant information; the pooled result of the first convolution serves as the input of the second convolution, screening out the specific words that match the target user.
The following describes the speech recognition processing procedure in the embodiment of the present invention in detail with reference to fig. 1.
Starting user operation;
initializing the system recording function and recording the user's instruction to complete acquisition of the input voice data;
and calling the preset voice recognition model, judging whether the voice data input by the user is a defined word and whether it is the voice of the target user, and outputting the recognition result.
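The runtime flow of fig. 1 can be sketched end to end as follows; record_audio and classify are stand-in stubs (assumed names) for the system recording function and the trained voice recognition model.

```python
# End-to-end runtime sketch of the fig. 1 flow: record, classify, decide.
# record_audio and classify are illustrative stubs; a real deployment would
# capture microphone audio and invoke the trained model.
import numpy as np

def record_audio(seconds=1, sample_rate=16000):
    """Stand-in for the system recording function (random noise as 'speech')."""
    rng = np.random.default_rng(0)
    return rng.standard_normal(seconds * sample_rate)

def classify(waveform):
    """Stand-in for the voice recognition model: returns
    (is_defined_word, is_target_user) scores in [0, 1]."""
    return 0.9, 0.8

waveform = record_audio()
word_score, speaker_score = classify(waveform)
result = word_score >= 0.5 and speaker_score >= 0.5
print("recognized" if result else "rejected")
```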
Example 2
Based on the same inventive concept as embodiment 1, an embodiment of the present invention provides a two-factor authentication apparatus based on voice recognition, including:
the first acquisition module is used for acquiring voice data input by a user;
the first judgment module is used for calling a preset voice recognition model and judging whether the voice data input by the user is a defined word and whether it is the voice of the target user;
the second acquisition module is used for acquiring the password input by the user;
the second judgment module is used for calling the password identification model and judging whether the password input by the user meets the preset requirement;
and the authentication module is used for completing authentication when the voice data input by the user is a defined word, is the voice of the target user, and the password input by the user meets the preset requirement.
In a specific implementation manner of the embodiment of the present invention, the speech recognition model is obtained by the following method:
collecting voice data of appointed words and non-appointed words to form a first sample set, extracting features, and converting the first sample set into a first spectrogram;
collecting appointed-word voice data of a target user and of non-target users to form a second sample set, extracting features, and converting the second sample set into a second spectrogram;
respectively feeding the first spectrogram and the second spectrogram into a preset convolutional neural network model, wherein the convolutional neural network model comprises a first convolutional layer, a first pooling layer, a second convolutional layer, a second pooling layer, a first fully-connected layer, a Dropout layer, and a second fully-connected layer connected in sequence; the first spectrogram and the second spectrogram are processed by the first convolutional layer, the first pooling layer, the second convolutional layer, and the second pooling layer, and then sent to the first fully-connected layer;
and continuously adjusting the parameters of the convolutional neural network model to generate the final voice recognition model.
In a specific implementation manner of the embodiment of the present invention, each of the first convolution layer and the second convolution layer uses a 3 × 3 convolution kernel as a feature extraction filter.
In a specific implementation manner of the embodiment of the present invention, the first pooling layer and the second pooling layer both use max pooling.
Example 3
Based on the same inventive concept as embodiment 1, the embodiment of the present invention provides a two-factor authentication system based on voice recognition, including:
a processor;
a memory having stored thereon a computer program operable on the processor;
wherein the computer program, when executed by the processor, implements the method according to embodiment 1.
The foregoing shows and describes the general principles and broad features of the present invention and the advantages thereof. It will be understood by those skilled in the art that the invention is not limited to the embodiments described above, which are set forth in the specification merely to illustrate its principles; various changes and modifications may be made without departing from the spirit and scope of the invention, and such changes and modifications fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.