Disclosure of Invention
The embodiments of the present application provide a method, an apparatus, and a computer-readable storage medium for identifying a speaking object, which can improve both the efficiency and the flexibility of speaking object identification, are simple to operate, and have broad applicability.
In a first aspect, an embodiment of the present application provides a method for identifying a speaking object, where the method includes:
acquiring positive sample training data and negative sample training data for speaking object recognition, wherein the positive sample training data comprises target region image information of a target object and target audio information of the target object corresponding thereto, and the negative sample training data comprises audio information of a plurality of other objects corresponding to the target region image information and region image information of a plurality of other objects corresponding to the target audio information;
inputting the positive sample training data and the negative sample training data into a speaking object recognition model, and generating positive sample region image features corresponding to the target region image information and positive sample audio features corresponding to the target audio information, a plurality of negative sample audio features aligned with the positive sample region image features in time sequence, and a plurality of negative sample region image features aligned with the positive sample audio features in time sequence through the speaking object recognition model, wherein the positive sample region image features and the positive sample audio features are aligned in time sequence;
performing contrastive learning on the positive sample region image feature, the positive sample audio feature, the plurality of negative sample audio features and the plurality of negative sample region image features through the speaking object recognition model, so as to obtain the capability of recognizing, based on any input data, the speaking object associated with that input data;
when the multimedia data to be identified is obtained, inputting the multimedia data to be identified into the speaking object recognition model, generating speaking object recognition features through the speaking object recognition model, and outputting, based on the speaking object recognition features, a recognition result of whether the object to be identified associated with the multimedia data to be identified is the target object, wherein the speaking object recognition features comprise at least one of target region image features or audio features of the speaking object to be identified.
In one possible implementation manner, the acquiring the positive sample training data and the negative sample training data of the speaking object recognition includes:
acquiring a plurality of lip movement images of a target object from sample training data included in a sample training data set as the target region image information of the target object, and acquiring audio information of the target object corresponding to the time of each piece of lip movement information from the sample training data as the target audio information corresponding to the target region image information, so as to obtain the positive sample training data for speaking object recognition;
and acquiring audio information of a plurality of other objects corresponding to the lip movement information time from other sample training data included in the sample training data set as the audio information of the plurality of other objects corresponding to the target region image information, and acquiring lip movement information of a plurality of other objects corresponding to the audio information time of the target object from the other sample training data as the region image information of the plurality of other objects corresponding to the target audio information, so as to obtain the negative sample training data for speaking object recognition.
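As a purely illustrative sketch of how such sample pairs might be assembled, the following Python code pairs each lip movement clip of the target object with its own time-aligned audio (positive samples), and with time-aligned audio and lip clips of other objects (negative samples). The `Clip` structure and all names are hypothetical; the embodiments do not prescribe a concrete data layout:

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Clip:
    object_id: str     # which speaking object this clip belongs to
    start: float       # start time within the source recording (seconds)
    lip_frames: list   # sequence of lip-region images
    audio: list        # audio samples covering the same time span

def build_training_pairs(dataset: List[Clip], target_id: str,
                         m: int = 4, n: int = 4) -> Tuple[list, list]:
    """Assemble positive and negative samples for one target object."""
    positives, negatives = [], []
    others = [c for c in dataset if c.object_id != target_id]
    for clip in (c for c in dataset if c.object_id == target_id):
        # Positive sample: target lip movement + the target's own
        # audio over the same time span.
        positives.append((clip.lip_frames, clip.audio))
        # Clips of other objects covering the same time window.
        aligned = [o for o in others if abs(o.start - clip.start) < 1e-3]
        # Negatives: target lip movement + m other objects' audio.
        for o in random.sample(aligned, min(m, len(aligned))):
            negatives.append((clip.lip_frames, o.audio))
        # Negatives: n other objects' lip movement + target audio.
        for o in random.sample(aligned, min(n, len(aligned))):
            negatives.append((o.lip_frames, clip.audio))
    return positives, negatives
```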
In one possible implementation manner, the multimedia data to be identified includes image information to be identified and audio information to be identified of an object to be identified; the generating the speaking object recognition features through the speaking object recognition model includes:
acquiring the image information to be identified, which contains a lip movement image, from the multimedia data to be identified through a lip movement detection layer in the speaking object recognition model, and acquiring the audio information to be identified, which corresponds to the time of the image information to be identified, from the multimedia data to be identified through a voice information extraction layer in the speaking object recognition model;
generating target region image features corresponding to the image information to be identified through a visual information coding layer in the speaking object recognition model to obtain the target region image features of the object to be identified, and generating audio features corresponding to the audio information to be identified through an auditory information coding layer in the speaking object recognition model to obtain the audio features of the object to be identified;
and generating, through a multimodal feature fusion layer in the speaking object recognition model, fusion features of the target region image features corresponding to the image information to be identified and the audio features corresponding to the audio information to be identified as the speaking object recognition features.
In one possible implementation manner, the generating, by the multimodal feature fusion layer in the speaking object recognition model, the fusion features of the target region image features corresponding to the image information to be identified and the audio features corresponding to the audio information to be identified includes:
performing, by the multimodal feature fusion layer in the speaking object recognition model, weighted summation on the target region image features corresponding to the image information to be identified and the audio features corresponding to the audio information to be identified based on the fusion weight of the image features and the fusion weight of the audio features, so as to generate the fusion features of the target region image features and the audio features;
wherein the fusion weight of the image features and the fusion weight of the audio features are derived from the information quality of the image information to be identified and of the audio information to be identified.
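A minimal sketch of this quality-weighted summation, assuming both features have already been encoded into the shared semantic space with equal dimensionality; the two quality scores are placeholders, since the embodiments only state that the fusion weights are derived from the information quality of each modality:

```python
import numpy as np

def fuse_features(image_feat: np.ndarray, audio_feat: np.ndarray,
                  image_quality: float, audio_quality: float) -> np.ndarray:
    """Weighted sum of target region image features and audio features,
    with fusion weights derived from per-modality information quality."""
    total = image_quality + audio_quality
    w_img = image_quality / total   # fusion weight of the image features
    w_aud = audio_quality / total   # fusion weight of the audio features
    return w_img * image_feat + w_aud * audio_feat
```

Under this scheme, a blurred lip region would yield a low image quality score and shift the fusion weight toward the audio features, and vice versa.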
In one possible implementation manner, the generating, by the auditory information coding layer in the speaking object recognition model, the audio features corresponding to the audio information to be identified includes:
adjusting, through the auditory information coding layer in the speaking object recognition model, the window length or window shift used for extracting features from the audio information to be identified based on the frame number of the target region image features, and extracting features from the audio information to be identified with the adjusted window length or window shift, so as to obtain audio features with the same frame number as the target region image features as the audio features corresponding to the image information to be identified; or
extracting features from the audio information to be identified through the auditory information coding layer in the speaking object recognition model to obtain audio features, and replicating the audio features based on the frame number of the target region image features to obtain audio features with the same frame number as the target region image features as the audio features corresponding to the image information to be identified.
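The two alternatives can be illustrated as follows: the first variant derives the window shift so that feature extraction yields exactly as many audio frames as there are image frames; the second tiles already-extracted audio feature frames up to the image frame count. All parameter names are illustrative only:

```python
import numpy as np

def audio_frames_by_window(audio: np.ndarray, sample_rate: int,
                           num_image_frames: int,
                           window_length: float = 0.025) -> np.ndarray:
    """Variant 1: adjust the window shift so that feature extraction
    yields exactly num_image_frames audio frames."""
    duration = len(audio) / sample_rate
    hop = int(duration / num_image_frames * sample_rate)  # adjusted shift
    win = int(window_length * sample_rate)
    frames = [audio[i * hop: i * hop + win] for i in range(num_image_frames)]
    # Zero-pad any window that runs past the end of the signal.
    return np.stack([np.pad(f, (0, win - len(f))) for f in frames])

def audio_frames_by_copy(audio_feats: np.ndarray,
                         num_image_frames: int) -> np.ndarray:
    """Variant 2: replicate extracted audio feature frames (shape (T, D))
    until their count equals the image frame count."""
    reps = int(np.ceil(num_image_frames / len(audio_feats)))
    return np.tile(audio_feats, (reps, 1))[:num_image_frames]
```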
In one possible implementation manner, the multimedia data to be identified includes image information to be identified of an object to be identified; the generating the speaking object recognition features through the speaking object recognition model includes:
acquiring the image information to be identified, which contains a lip movement image, from the multimedia data to be identified through the lip movement detection layer in the speaking object recognition model, and generating target region image features corresponding to the image information to be identified through the visual information coding layer in the speaking object recognition model, so as to obtain the target region image features of the object to be identified;
acquiring, through the voice information extraction layer in the speaking object recognition model, the audio information to be identified corresponding to the time of the image information to be identified from the multimedia data to be identified, wherein the audio information to be identified is empty;
and outputting, through the multimodal feature fusion layer in the speaking object recognition model, the target region image features corresponding to the image information to be identified as the speaking object recognition features.
In one possible implementation manner, the multimedia data to be identified includes audio information to be identified of an object to be identified; the generating the speaking object recognition features through the speaking object recognition model includes:
acquiring image information to be identified, which contains a lip movement image, from the multimedia data to be identified through the lip movement detection layer in the speaking object recognition model, wherein the image information to be identified is empty;
acquiring the audio information to be identified from the multimedia data to be identified through the voice information extraction layer in the speaking object recognition model, and generating audio features corresponding to the audio information to be identified through the auditory information coding layer in the speaking object recognition model, so as to obtain the audio features of the object to be identified;
and outputting, through the multimodal feature fusion layer in the speaking object recognition model, the audio features corresponding to the audio information to be identified as the speaking object recognition features.
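Combining the three cases above (both modalities present, image only, audio only), the fusion layer can be sketched as a single code path that passes the available modality through when the other is empty; this unified behavior, rather than any particular implementation, is what the embodiments describe:

```python
from typing import Optional
import numpy as np

def fuse_or_passthrough(image_feat: Optional[np.ndarray],
                        audio_feat: Optional[np.ndarray],
                        image_quality: float = 1.0,
                        audio_quality: float = 1.0) -> np.ndarray:
    """Unified multimodal fusion: image-only, audio-only, and
    both-present inputs share one code path."""
    if image_feat is None and audio_feat is None:
        raise ValueError("at least one modality is required")
    if audio_feat is None:   # audio information to be identified is empty
        return image_feat    # image features become the recognition features
    if image_feat is None:   # image information to be identified is empty
        return audio_feat    # audio features become the recognition features
    total = image_quality + audio_quality
    return (image_quality / total) * image_feat \
         + (audio_quality / total) * audio_feat
```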
In a second aspect, an embodiment of the present application provides an apparatus for speaking object recognition, where the apparatus includes:
an acquisition module, configured to acquire positive sample training data and negative sample training data for speaking object recognition, where the positive sample training data comprises target region image information of a target object and target audio information corresponding thereto, and the negative sample training data comprises audio information of a plurality of other objects corresponding to the target region image information and region image information of a plurality of other objects corresponding to the target audio information;
a feature generation module, configured to input the positive sample training data and the negative sample training data into a speaking object recognition model, and generate, through the speaking object recognition model, positive sample region image features corresponding to the target region image information, positive sample audio features corresponding to the target audio information, a plurality of negative sample audio features aligned with the positive sample region image features in time sequence, and a plurality of negative sample region image features aligned with the positive sample audio features in time sequence, where the positive sample region image features and the positive sample audio features are aligned with each other in time sequence;
a training module, configured to perform contrastive learning on the positive sample region image features, the positive sample audio features, the plurality of negative sample audio features and the plurality of negative sample region image features through the speaking object recognition model, so as to obtain the capability of recognizing, based on any input data, the speaking object associated with that input data;
and a speaking object recognition module, configured to input the multimedia data to be identified into the speaking object recognition model when the multimedia data to be identified is acquired, generate speaking object recognition features through the speaking object recognition model, and output, based on the speaking object recognition features, a recognition result of whether the object to be identified associated with the multimedia data to be identified is the target object, where the speaking object recognition features comprise at least one of target region image features or audio features of the speaking object to be identified.
In a third aspect, an embodiment of the present application provides a computer apparatus, including: a processor, a memory, and a network interface;
the processor is connected to the memory and the network interface, where the network interface is configured to provide data communication functions, the memory is configured to store program code, and the processor is configured to invoke the program code to perform the method according to the first aspect of the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method according to the first aspect of the embodiments of the present application.
Detailed Description
The following clearly and completely describes the embodiments of the present application with reference to the accompanying drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the protection scope of the present application.
Artificial intelligence is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Speech processing technology studies the speech production process, the statistical characteristics of speech signals, automatic speech recognition, machine synthesis, speech perception, and other processing techniques. Key technologies of speech processing include automatic speech recognition (Automatic Speech Recognition, ASR) and speech synthesis (Text To Speech, TTS). Speech signal processing is widely applied in fields such as communications. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is expected to become one of the best human-computer interaction modes.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, and is therefore closely related to the study of linguistics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph technologies, and the like.
Machine learning (Machine Learning, ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
The solutions provided by the embodiments of the present application relate to speech processing, machine learning, and other technologies in the field of artificial intelligence, and are specifically described by the following embodiments:
The method for recognizing a speaking object provided in the embodiments of the present application (for convenience of description, simply referred to as the speaking object recognition method or the method) is suitable for the development of speaking object recognition technologies and products (for convenience of description, a speaking object recognition product, for example a Tech. Conference product, is taken as an example below). During development, the speaking object recognition model deployed in the speaking object recognition product can train its visual information coding layer and auditory information coding layer based on the sample training data of each speaking object in the training samples, so as to obtain the speaking object recognition features of each speaking object through these two coding layers. The method can thus be extended to obtain more, and more accurate, speaking object recognition features, and a speaking object recognition model with higher recognition accuracy can be trained based on the recognition features of a large number of speaking objects in a training sample library, so that the speaking object recognition product achieves a better recognition effect and the efficiency of speaking object recognition is improved. The method is also suitable for the use stage of various speaking object recognition products: the object to be identified in the multimedia data to be identified is recognized based on the speaking object recognition model deployed in the product, speaking object recognition features of the object to be identified included in the multimedia data to be identified are generated, and a recognition result of whether the object to be identified associated with the multimedia data to be identified is the target object is output based on the speaking object recognition features. The method can be applied to various application scenarios, which may be determined according to the actual application scenario and are not limited herein. For convenience of description, the method will be described in detail below by taking an offline multi-person conference scene as an example.
It will be appreciated that the speaking object recognition features are a very important class of features in the process by which the speaking object recognition model recognizes a speaking object. However, current speaker recognition technology generally directly fuses the lip movement information corresponding to a time stamp with voiceprint information to obtain a new feature, and then uses that feature for subsequent speaker recognition. Such techniques typically require that the face information and the voice information of the speaking object coexist, which places strict requirements on the application scenario. In practice, the face information and the voice information of the speaking object do not always exist at the same time; for example, in some offline conference scenes the face is not captured, or the front of the face does not always appear in the video, so the two kinds of information may not coexist. Another class of techniques prepares a separate system for each of the possible modality combinations, namely face information only, voice information only, and face information together with voice information. This handles the single-modality cases, but cannot comprehensively exploit the complementarity of multimodal information; the system deployment is redundant and the computation cost greatly increases. The embodiments of the present application provide a comprehensive and unified multimodal semantic-space speaking object recognition method, which unifies the target region image features obtained from the target region image information and the audio features obtained from the audio information into one semantic space through contrastive learning, so that the multimodal information can be better complemented and fused, greatly improving the robustness of the system, while remaining backward-compatible with single-modality scenarios such as lip-reading scenes and audio-only conferences. Multimodal and single-modality scenarios can therefore share one system, which enhances the convenience of deployment, saves computation cost, improves the accuracy of speaking object recognition, and enhances applicability. The target region image may be face information or a lip movement image, which may be determined according to the actual scenario and is not limited herein.
The system architecture to which the method provided by the embodiments of the present application is applicable, and the method and the apparatus provided by the embodiments of the present application, are illustrated below with reference to fig. 1 to fig. 7.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, the system architecture may include a service server 100 and a terminal cluster, where the terminal cluster may include terminal devices 200a, 200b, 200c, …, 200n, and the like. The service server 100 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud databases, cloud services, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms. The terminal devices (including the terminal devices 200a, 200b, 200c, …, 200n) may be smart terminals such as smart phones, tablet computers, notebook computers, desktop computers, palmtop computers, mobile Internet devices (MIDs), wearable devices (e.g., smart watches, smart bracelets, etc.), smart computers, and smart vehicles. The service server 100 may establish a communication connection with each terminal device in the terminal cluster, and communication connections may also be established between the terminal devices in the cluster. In other words, the service server 100 may establish a communication connection with each of the terminal devices 200a, 200b, 200c, …, 200n; for example, a communication connection may be established between the terminal device 200a and the service server 100. A communication connection may be established between terminal device 200a and terminal device 200b, and between terminal device 200a and terminal device 200c. The communication connection is not limited to a particular connection manner; it may be a direct or indirect connection through wired communication, or a direct or indirect connection through wireless communication, which may be determined according to the actual application scenario and is not limited herein.
It should be understood that, as shown in fig. 1, each terminal device in the terminal cluster may be installed with an application client, and when the application client runs on a terminal device, data interaction may be performed between the application client and the service server 100 shown in fig. 1, so that the service server 100 may receive service data from each terminal device. The service data may be sample training data for the client's speaking object recognition model, or multimedia data of a target object to be identified. The application client may be an application client for speaking object recognition (speaking object recognition client for short); that is, a user may send sample training data of a target object and multimedia data to be identified to the service server 100 through the application client. The service server 100 may be a server of the speaking object recognition client, or a set of multiple servers including a background server, a data processing server, and the like corresponding to the client. The service server 100 may receive the sample training data of the target object and the multimedia data to be identified, where the sample training data of the target object is used to train the speaking object recognition model deployed in the speaking object recognition client, and the speaking object recognition model may generate, based on the multimedia data to be identified, a recognition result of whether the object to be identified associated with the multimedia data to be identified is the target object. The application client may be an independent client, or an embedded sub-client integrated in another client (e.g., an instant messaging client, a social client, etc.), which may be determined according to the actual application scenario and is not limited herein. The method provided by the embodiments of the present application may be executed by the service server 100 shown in fig. 1, by any one of the terminal devices (such as any one of the terminal devices 200a, 200b, …, 200n shown in fig. 1), or jointly by a terminal device and the service server, which may be determined according to the actual application scenario and is not limited herein. For convenience of description, taking the speaking object recognition client loaded on the terminal device 200b as an example, each operation object, in the process of using the speaking object recognition client through a terminal device, can view, record, and upload the multimedia data to be identified in the target application. It can be appreciated that the multimedia data to be identified may be any multimedia data, including but not limited to audio, pictures, and video, which may be determined according to the actual application scenario and is not limited herein.
It will be appreciated that in the specific embodiments of the present application, related data such as user information is involved, and when the embodiments of the present application are applied to specific products or technologies, user permissions or consents need to be obtained, and the collection, use and processing of related data need to comply with relevant laws and regulations and standards of the relevant countries and regions.
It will be appreciated that the method for identifying a speaking object provided by the embodiments of the present application is applicable to speaking object recognition in applications (such as the conference product described above). The terminal devices to which the speaking object identification method is applicable include, but are not limited to, smart phones, computers, tablet computers, personal digital assistants (PDAs), mobile Internet devices (MIDs), wearable devices, and the like. Optionally, the terminal device may also be a server corresponding to the smart phone, computer, tablet computer, PDA, MID, or wearable device, which may be determined according to the actual application scenario and is not limited herein. Correspondingly, the apparatus for identifying a speaking object provided by the embodiments of the present application includes, but is not limited to, a smart phone, a computer, a tablet computer, a PDA, an MID, a wearable device, and the like. For convenience of description, the speaking object recognition apparatus and/or terminal device provided by the embodiments of the present application will be described by taking a smart phone (or simply, a mobile phone) as an example.
It may be understood that the method for identifying a speaking object provided in the embodiments of the present application may be performed by the service server 100 shown in fig. 1, by a terminal device (e.g., any one of the terminal devices 200a, 200b, …, 200n shown in fig. 1), or jointly by the terminal device and the service server, which may be determined according to the actual application scenario and is not limited herein. For the convenience of subsequent understanding and description, one terminal device in the terminal cluster shown in fig. 1 may be selected as the target terminal device; for example, the terminal device 200b is used as the target terminal device.
Further, referring to fig. 2, fig. 2 is a flowchart illustrating a method for recognizing a speaking object according to an embodiment of the present application. For ease of understanding, the embodiment of the present application is described by taking execution on a terminal device, namely the terminal device 200b, as an example, and the service server may be the service server 100 of the embodiment corresponding to fig. 1. The embodiments of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent transportation, assisted driving, and the like. In the method of speaking object recognition shown in fig. 2, each step may be performed by the terminal device 200b in fig. 1 described above, and as shown in fig. 2, the method may include at least the following steps S101 to S104.
Step S101, positive sample training data and negative sample training data of speaker object recognition are obtained, wherein the positive sample training data comprises target area image information of a target object and corresponding target audio information thereof, and the negative sample training data comprises audio information of a plurality of other objects corresponding to the target area image information and area image information of a plurality of other objects corresponding to the target audio information.
In some possible embodiments, the terminal device (such as the terminal device 200b) may obtain the sample training data for speaking object recognition, where the sample training data may be sample training data of the target object obtained by the application client (i.e., the speaking object recognition client) loaded on the terminal device 200b. The sample training data of the target object may be past recording data of the target object, or homemade data of the target object; the source of the sample training data may be determined according to the actual application scenario and is not limited herein. After the target object authorizes login to the speaking object recognition client, the speaking object recognition client can collect sample training data of the target object as training samples for training the speaking object recognition model. The speaking object recognition client may be an independent client, an embedded sub-client integrated in another client (such as an instant messaging client, a social client, etc.), or a web application accessed through a browser, which may be determined according to the actual application scenario and is not limited herein. The embodiments of the present application are described by taking the speaking object recognition client as an independent client as an example, and details are not repeated below.
In some possible embodiments, the speaking object recognition client, after obtaining the authorization of the speaking object, obtains the positive sample training data and the negative sample training data for speaking object recognition. The positive sample training data comprises the target region image information of the target object and the corresponding target audio information, and the negative sample training data comprises the audio information of a plurality of other objects corresponding to the target region image information and the region image information of a plurality of other objects corresponding to the target audio information. Specifically, the speaking object recognition client obtains the positive sample training data and the negative sample training data of the target object from the training data included in a sample training data set. It can be understood that, after a plurality of speaking objects authorize login to the speaking object recognition client, the client obtains the sample training data of each speaking object to train the speaking object recognition model deployed in the client. The positive sample training data and the negative sample training data of the target object may be past recorded data of the target object or homemade data of the target object, and the source of the sample training data set may be determined according to the actual application scenario, which is not limited in the present application.
Referring to fig. 3, fig. 3 is a schematic diagram of a scenario of a method for recognizing a speaking object according to an embodiment of the present application. As shown in fig. 3, the speaking object recognition model includes a lip movement detection layer, a visual information coding layer, a voice information extraction layer, an auditory information coding layer, a multimodal feature fusion layer, and a speaker recognition module. The lip movement detection layer is used to extract the target region image information of a speaking object (including the target object), and the voice information extraction layer is used to extract the target audio information of a speaking object (including the target object). The target region image may be face information or a lip movement image; for convenience of description, the present application takes the lip movement image as the target region image information. The multimodal feature fusion layer is used to output the speaking object features of the target object, including the image features and the audio features of the speaking object, or the fusion features of the image features and the audio features, and the like. Specifically, the speaking object recognition client acquires a plurality of lip movement images of the target object from the sample training data included in the sample training data set as the target region image information of the target object, and acquires the audio information of the target object corresponding to the time of each piece of lip movement information from the sample training data as the target audio information corresponding to the target region image information. The speaking object recognition model takes the target region image information of the target object and the corresponding target audio information as the positive sample training data for speaking object recognition. The speaking object recognition client acquires the audio information of a plurality of other objects corresponding to the respective lip movement information times from other sample training data included in the sample training data set as the audio information of the plurality of other objects corresponding to the target region image information, and the lip movement information of a plurality of other objects corresponding to the audio information times of the target object from the other sample training data as the region image information of the plurality of other objects corresponding to the target audio information. After the objects other than the target object authorize login to the speaking object recognition client, the client acquires the sample training data of these speaking objects to train the speaking object recognition model. The speaking object recognition client takes the audio information of the plurality of other objects corresponding to the target region image information and the region image information of the plurality of other objects corresponding to the target audio information as the negative sample training data of the target object.
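For orientation only, the layers named in fig. 3 could be composed as in the following PyTorch skeleton; the detection and extraction front-ends are stubbed out and each coding layer is a single linear projection, since the embodiments do not fix a concrete network architecture for any layer:

```python
import torch
import torch.nn as nn

class SpeakingObjectRecognitionModel(nn.Module):
    """Skeleton mirroring fig. 3: lip movement detection, voice information
    extraction, visual/auditory coding into one shared semantic space, and
    multimodal feature fusion."""

    def __init__(self, visual_dim: int = 512, audio_dim: int = 80,
                 feat_dim: int = 256):
        super().__init__()
        self.lip_movement_detection = nn.Identity()  # stub: crops lip regions
        self.voice_extraction = nn.Identity()        # stub: time-aligned audio
        # Coding layers mapping each modality into the shared semantic space.
        self.visual_coding = nn.Linear(visual_dim, feat_dim)
        self.auditory_coding = nn.Linear(audio_dim, feat_dim)

    def forward(self, lip_frames: torch.Tensor,
                audio_feats: torch.Tensor,
                w_img: float = 0.5, w_aud: float = 0.5) -> torch.Tensor:
        # lip_frames: (batch, frames, visual_dim);
        # audio_feats: (batch, frames, audio_dim), frame counts already equal.
        v = self.visual_coding(self.lip_movement_detection(lip_frames))
        a = self.auditory_coding(self.voice_extraction(audio_feats))
        # Multimodal feature fusion layer: weighted sum per frame.
        return w_img * v + w_aud * a
```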
Step S102, inputting the positive sample training data and the negative sample training data into a speaking object recognition model, and generating, through the speaking object recognition model, positive sample region image features corresponding to the target region image information, positive sample audio features corresponding to the target audio information, a plurality of negative sample audio features aligned with the positive sample region image features in time sequence, and a plurality of negative sample region image features aligned with the positive sample audio features in time sequence, where the positive sample region image features and the positive sample audio features are aligned with each other in time sequence.
In some possible embodiments, the speaking object recognition client obtains the positive sample training data and the negative sample training data of the target object, and then inputs them into the speaking object recognition model as a set of training samples. Specifically, the speaking object recognition client acquires the target region image information of the target object included in the positive sample training data and the target audio information corresponding to the target region image information, and inputs the target region image information of the target object into the visual information coding layer in the speaking object recognition model to generate the positive sample region image features corresponding to the target region image information. The speaking object recognition model, through its auditory information coding layer, adjusts the window length or window shift used for extracting features from the target audio information based on the frame number of the target region image features, and extracts features from the target audio information with the adjusted window length or window shift, so as to obtain audio features with the same frame number as the target region image features as the audio features corresponding to the target region image information. Alternatively, the model extracts features from the target audio information through the auditory information coding layer to obtain audio features, and replicates the obtained audio features based on the frame number of the target region image features to obtain audio features with the same frame number as the target region image features as the audio features corresponding to the target region image information. It is to be understood that the technical means for making the frame numbers of the target region image information and the corresponding target audio information the same may be determined according to the actual application scenario, and the present application is not limited herein. Similarly, after acquiring the audio information of the plurality of other objects corresponding to the target region image information, the speaking object recognition client aligns it with the target region image information of the target object in time sequence, so that it has the same frame number as the target region image information of the target object. Likewise, after acquiring the region image information of the plurality of other objects corresponding to the target audio information, the client aligns it with the target audio information in time sequence, so that it has the same frame number as the target audio information corresponding to the target region image information.
Further, as shown in fig. 3, the speaking object recognition client inputs the positive sample training data and the negative sample training data into the speaking object recognition model. Specifically, the lip movement detection layer is configured to extract the image information of a target region (such as the lips) containing a lip movement image, and the voice information extraction layer is configured to extract the audio information corresponding to the image information containing the lip movement image. The speaking object recognition client inputs the target region image information of the target object into the visual information coding layer in the speaking object recognition model, and generates the positive sample region image features corresponding to the target region image information through the visual information coding layer, where the target region image information is the lip movement information of the target object. Similarly, the client inputs the target audio information corresponding to the target region image information into the auditory information coding layer in the speaking object recognition model, and generates the positive sample audio features corresponding to the target audio information through the auditory information coding layer. It can be appreciated that the positive sample region image features and the positive sample audio features are aligned in time sequence, so that the image features and the audio features can be better complemented and fused, improving the system robustness of the speaking object recognition model. The speaking object recognition method provided by the embodiments of the present application establishes, based on time-sequence alignment, a unified semantic hidden space for the target region image features and the audio features, obtaining a hidden-space representation of the target object with which to generate the speaking object recognition features of the target object. Similarly, the client inputs the audio information of the plurality of other objects corresponding to the target region image information into the auditory information coding layer to generate the plurality of negative sample audio features aligned with the positive sample region image features in time sequence, and inputs the region image information of the plurality of other objects corresponding to the target audio information into the visual information coding layer to generate the plurality of negative sample region image features aligned with the positive sample audio features in time sequence.
Step S103, performing contrastive learning on the positive sample region image features, the positive sample audio features, the plurality of negative sample audio features and the plurality of negative sample region image features through the speaking object recognition model, so as to obtain the capability of recognizing, based on any input data, the speaking object associated with that input data.
In some possible embodiments, after the speaking object recognition client generates the positive sample region image features, the positive sample audio features, the plurality of negative sample audio features aligned with the positive sample region image features, and the plurality of negative sample region image features aligned with the positive sample audio features through the visual information coding layer and the auditory information coding layer of the speaking object recognition model, contrastive learning is performed on these features through the speaking object recognition model to obtain the capability of recognizing, based on any input data, the speaking object associated with that input data. To further train the speaking object recognition model to recognize the speaking object associated with any multimedia data and to improve its recognition accuracy, the coding layers may be optimized based on the loss functions of the visual information coding layer and the auditory information coding layer. Specifically, after the speaking object recognition model acquires the positive sample training data, a lip movement image of the target object is acquired from the sample training data as the target region image information of the target object, denoted x, and the audio information of the target object corresponding in time to the target region image information is acquired from the sample training data as the target audio information corresponding to the target region image information, denoted y.
After the speaking object recognition model obtains the negative sample training data, the audio information of m other objects corresponding to the lip movement information time is obtained from the other sample training data included in the sample training data set as the audio information of the m other objects corresponding to the target region image information; this audio information is input into the auditory information coding layer in the speaking object recognition model to obtain m negative sample audio features, denoted $e^{y}_{1}, e^{y}_{2}, \ldots, e^{y}_{m}$. The speaking object recognition model takes the lip movement information of n other objects corresponding to the audio information time of the target object from the other sample training data as the region image information of the n other objects corresponding to the target audio information, and inputs it into the visual information coding layer to obtain n negative sample region image features, denoted $e^{x}_{1}, e^{x}_{2}, \ldots, e^{x}_{n}$. The speaking object recognition model takes the positive sample region image feature, the positive sample audio feature, the m negative sample audio features, and the n negative sample region image features as one set of training samples, denoted $\{e_{x}, e_{y}, e^{y}_{1}, \ldots, e^{y}_{m}, e^{x}_{1}, \ldots, e^{x}_{n}\}$. The speaking object recognition model constructs a loss function based on the training samples and updates the parameters of the encoders in the visual information coding layer and the auditory information coding layer by gradient descent. Taking this set of positive and negative sample training data as an example, the loss function can be expressed as:

$$L = -\log \frac{\exp\left(e_{x} \cdot e_{y} / t\right)}{\exp\left(e_{x} \cdot e_{y} / t\right) + \sum_{i=1}^{m} \exp\left(e_{x} \cdot e^{y}_{i} / t\right) + \sum_{j=1}^{n} \exp\left(e^{x}_{j} \cdot e_{y} / t\right)}$$

Here, $t$ adjusts the smoothness of the contrast between positive and negative samples and defaults to 1. $e_{x}$ is the positive sample region image feature generated from the target region image information of the target object by the speaking object recognition model; $e_{y}$ is the positive sample audio feature generated from the target audio information corresponding to the target region image information; $e^{y}_{1}, \ldots, e^{y}_{m}$ are the negative sample audio features, aligned with the positive sample region image feature in time sequence, generated from the audio information of the plurality of other objects corresponding to the target region image information; and $e^{x}_{1}, \ldots, e^{x}_{n}$ are the negative sample region image features, aligned with the positive sample audio feature in time sequence, generated from the region image information of the plurality of other objects corresponding to the target audio information. $e_{x} \cdot e_{y}$ denotes the scalar obtained by taking the dot product of the two vectors (element-wise multiplication followed by summation). From this loss function, the losses of the encoders in the visual information coding layer and the auditory information coding layer can be obtained respectively, and the parameters of the encoders can be continuously optimized based on these losses to recognize speaking objects.
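A compact PyTorch rendering of this loss, under the assumption that each feature has been pooled over time into a single vector in the shared semantic space; this mirrors the formula above and is a sketch rather than the patent's reference implementation:

```python
import torch

def contrastive_loss(e_x: torch.Tensor,        # positive region image feature, (d,)
                     e_y: torch.Tensor,        # positive audio feature, (d,)
                     neg_audio: torch.Tensor,  # m negative audio features, (m, d)
                     neg_image: torch.Tensor,  # n negative image features, (n, d)
                     t: float = 1.0) -> torch.Tensor:
    """Contrastive loss over one positive pair and m + n negatives, with
    temperature t adjusting the smoothness of the contrast (default 1)."""
    pos = torch.exp(torch.dot(e_x, e_y) / t)
    # Negatives pairing the positive image feature with others' audio ...
    neg_a = torch.exp(neg_audio @ e_x / t).sum()
    # ... and the positive audio feature with others' lip images.
    neg_v = torch.exp(neg_image @ e_y / t).sum()
    return -torch.log(pos / (pos + neg_a + neg_v))
```

Calling `loss.backward()` on this value and stepping an optimizer over the parameters of the two encoders implements the gradient-descent update described above: the time-aligned positive image/audio pair is pulled together in the semantic space while the misaligned negative pairs are pushed apart.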
Step S104, when the multimedia data to be identified is obtained, inputting the multimedia data to be identified into the speaking object recognition model, generating speaking object recognition features through the speaking object recognition model, and outputting, based on the speaking object recognition features, a recognition result of whether the object to be identified associated with the multimedia data to be identified is the target object, where the speaking object recognition features comprise at least one of the target region image features or the audio features of the speaking object to be identified.
In some possible embodiments, the speaking object recognition client performs contrastive learning based on the positive sample region image features, the positive sample audio features, the plurality of negative sample audio features, and the plurality of negative sample region image features, and updates the parameters of the encoders in the visual information coding layer and the auditory information coding layer. After acquiring the multimedia data to be identified, the speaking object recognition client generates the speaking object recognition features through the speaking object recognition model, and outputs, based on those features, a recognition result of whether the object to be identified associated with the multimedia data to be identified is the target object. Referring to fig. 4, fig. 4 is a schematic diagram of a scenario of a method for recognizing a speaking object according to an embodiment of the present application. When it is necessary to identify whether the object to be identified associated with a piece of multimedia data is the target object, the multimedia data may be identified through the speaking object recognition client loaded on the terminal device 200b. It will be appreciated that the target object may click different application icons on the terminal operation interface of the terminal device 200b to switch between the operation interfaces of different applications. If the terminal device 200b detects that the click position of the target object's instruction is the icon of application A, the terminal device 200b starts application A and jumps to the operation interface of application A. When the target object clicks the icon of application D (i.e., the speaking object recognition client described above) on the terminal operation interface of the terminal device 200b, the terminal device 200b may be triggered to start the operation interface of the speaking object recognition client. At this time, the terminal device 200b may detect the target object's operation instruction on its terminal operation interface, determine, according to the click position of the operation instruction, that the application the target object has triggered is the speaking object recognition client, and start its operation interface. After starting the speaking object recognition client, the terminal device 200b jumps to the operation page 201a of the client, where the target object may input its account and password to authorize login to the speaking object recognition client. Specifically, when the target object inputs the account and password, the terminal device 200b may prompt the user to read the relevant user information; as illustrated in fig. 4, the user is prompted to read and understand the "user agreement" and "privacy policy", and after the target object triggers the read control 201b, the "login" control can be triggered. After the target object successfully logs in to the speaking object recognition client, the terminal device 200b may display the past history data of the target object, such as the multimedia data uploaded by the target object, which may be determined according to the actual application scenario and is not limited herein.
Referring to fig. 4 again, after the target object successfully logs in to the speaking object recognition client, the client displays the display interface of the speaking object recognition model, namely interface 1, to the target object. Interface 1 may include the nickname and avatar of the target object as well as the historical multimedia data set held by the target object. When the target object needs to identify a speaking object, the control 201c may be selected; when the terminal device 200b detects a selection instruction for the control 201c, it pops up a selection window on interface 1, displaying the "Shoot", "Select from album", and "Cancel" controls. The target object can select the "Shoot" control to shoot or record in real time, or select the "Select from album" control to choose multimedia data previously recorded by the target object. For convenience of subsequent understanding and description, the embodiment of the present application takes selecting the "Select from album" control as an example: after it is selected, a presentation interface 2 pops up, displaying the multimedia data set held by the target object, from which the target object can select the multimedia data to be identified. The target object may perform a trigger operation on the presentation interface 2 for multimedia data provided by the terminal device 200b. For example, after the target object determines the multimedia data that needs to be identified (for convenience of description, the terminal device 200b determines multimedia data 1 as the multimedia data to be identified), the target object may trigger the selection control 201d corresponding to multimedia data 1 and then trigger the "Confirm" control on presentation interface 2, thereby completing the selection of the multimedia data to be identified. After the target object selects the multimedia data to be identified, the multimedia data is sent to the service server 100.
Further, when the speaking object recognition client acquires the multimedia data to be recognized (i.e., multimedia data 1), the multimedia data to be recognized is input into the speaking object recognition model, and speaking object recognition features are generated through the speaking object recognition model. Referring to fig. 5, fig. 5 is a schematic diagram of a scenario of a method for recognizing a speaking object according to an embodiment of the present application. When the target object selects the multimedia data to be identified, and the speaking object recognition client detects a selection instruction for the control 201d corresponding to the multimedia data to be identified (i.e., multimedia data 1), the speaking object recognition client inputs the multimedia data to be identified into the speaking object recognition model and displays the upload progress on interface 3. When the speaking object recognition client successfully inputs the multimedia data to be recognized into the speaking object recognition model, the multimedia data to be recognized is analyzed, and the analysis progress is displayed on interface 4. Specifically, after the speaking object recognition client acquires the multimedia data to be recognized, the multimedia data to be recognized is input into the speaking object recognition model. When the multimedia data to be identified comprises both the image information to be identified and the audio information to be identified of the object to be identified, lip movement detection is performed on the multimedia data to be identified based on a lip movement detection layer in the speaking object recognition model to obtain the target area image information of the object to be identified associated with the multimedia data to be identified. The target area image information may be face information of the object to be identified, or lip movement information of the object to be identified. The speaking object recognition model extracts lip moving images containing only the object to be recognized from the multimedia data to be recognized as the image information to be recognized, and generates target area image features corresponding to the image information to be recognized through a visual information coding layer in the speaking object recognition model (see the sketch after this paragraph). It will be appreciated that a face recognition tool may be used to extract, from the multimedia data to be identified, the image information to be identified that contains only the lip moving images; the specific recognition means may be determined according to the actual application scenario and is not limited herein. The speaking object recognition model acquires the audio information to be recognized corresponding in time to the image information to be recognized from the multimedia data to be recognized through a voice information extraction layer in the speaking object recognition model.
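The lip movement detection layer and the visual information coding layer described above might be prototyped as follows. This is a minimal sketch in PyTorch under stated assumptions: the lip crops are assumed to have already been produced (e.g., by an off-the-shelf face recognition tool), and the network shape (a per-frame CNN followed by a temporal GRU) is an illustrative choice rather than the disclosed architecture.

```python
import torch
import torch.nn as nn

class VisualEncoder(nn.Module):
    """Toy visual information coding layer: a per-frame CNN over lip crops,
    followed by a temporal GRU, yielding one feature vector per video frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.temporal = nn.GRU(64, feat_dim, batch_first=True)

    def forward(self, lip_frames):            # (T, 1, H, W) grayscale lip crops
        per_frame = self.cnn(lip_frames)      # (T, 64) per-frame descriptors
        feats, _ = self.temporal(per_frame.unsqueeze(0))
        return feats.squeeze(0)               # (T, feat_dim) target area image features
```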
Further, after the speaking object recognition model obtains the audio information to be recognized corresponding in time to the image information to be recognized, voice feature extraction is performed on the audio information to be recognized based on the voice information extraction layer, and audio features corresponding to the audio information to be recognized are generated, so as to obtain the audio features of the object to be recognized. It can be appreciated that the audio features corresponding to the audio information to be identified may be Mel-frequency cepstral coefficients (MFCC), filter bank features (fbank), and the like, and may be determined according to the actual application scenario, which is not limited herein. When the speaking object recognition model generates the audio features corresponding to the audio information to be recognized, the auditory information coding layer in the speaking object recognition model adjusts the window length or window shift used for feature extraction based on the number of frames of the target area image features corresponding to the image information to be recognized, and performs feature extraction on the audio information of the object to be recognized with the adjusted window length or window shift, so as to obtain audio features with the same number of frames as the target area image features as the audio features corresponding to the image information to be recognized. Alternatively, the auditory information coding layer in the speaking object recognition model performs feature extraction on the audio information of the object to be identified to obtain audio features, and replicates the obtained audio features based on the number of frames of the target area image features of the object to be identified, so as to obtain audio features with the same number of frames as the target area image features as the audio features corresponding to the image information to be recognized. It can be understood that the specific technical means for making the number of frames of the audio features identical to that of the image features may be determined according to the actual application scenario, and the present application is not limited herein.
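Both frame-alignment strategies described here admit a short sketch. The following is illustrative only, assuming MFCC features computed with librosa; the hop-length arithmetic and the nearest-neighbor replication are plausible realizations of the two strategies, not the disclosed ones.

```python
import numpy as np
import librosa

def aligned_audio_features(audio, sr, num_video_frames, n_mfcc=13):
    """Strategy 1: choose the MFCC hop length (window shift) so that the
    number of audio frames matches the number of video frames."""
    hop = max(1, len(audio) // num_video_frames)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc, hop_length=hop)
    return mfcc.T[:num_video_frames]           # (num_video_frames, n_mfcc)

def replicated_audio_features(audio_feats, num_video_frames):
    """Strategy 2: extract features at the usual rate, then replicate (or
    subsample) frames so the audio frame count matches the video frames."""
    idx = np.linspace(0, len(audio_feats) - 1, num_video_frames).round().astype(int)
    return audio_feats[idx]                    # (num_video_frames, D)
```

The first strategy changes the audio analysis rate itself; the second keeps a standard analysis rate and resamples the resulting feature sequence, which also covers the case where audio frames must be duplicated to reach the video frame count.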
Further, after the speaking object recognition model acquires the image information to be recognized from the multimedia data to be recognized, the image information to be recognized is input into the visual information coding layer in the speaking object recognition model, and target area image features corresponding to the image information to be recognized are generated, so as to obtain the target area image features of the object to be recognized. After the target area image features of the object to be identified and the audio features of the object to be identified are generated, they are input into a multi-modal feature fusion layer in the speaking object recognition model, and fusion features of the target area image features corresponding to the image information to be recognized and the audio features corresponding to the audio information to be recognized are generated, so as to obtain the speaking object recognition features. Specifically, after the target area image features and the audio features of the object to be identified are input into the multi-modal feature fusion layer in the speaking object recognition model, the multi-modal feature fusion layer may perform weighted summation on the target area image features corresponding to the image information to be identified and the audio features corresponding to the audio information to be identified based on the fusion weight of the image features and the fusion weight of the audio features, so as to generate the fusion features of the target area image features and the audio features. The fusion weight of the image features and the fusion weight of the audio features are determined based on the information quality of the image information to be identified and of the audio information to be identified. It can be understood that the multi-modal feature fusion layer performs the weighted summation frame by frame, and the weights of the target area image features and of the audio features corresponding to the audio information to be identified both default to 0.5. In practice, these weights may be determined according to the actual application scenario: for example, if the audio quality is high, the auditory weight is increased, and if the video quality is high, the visual weight is increased.
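The frame-by-frame weighted summation admits a very small sketch. Function and parameter names are illustrative; the quality-driven weight normalization is one plausible reading of "determined based on the information quality", not the disclosed rule.

```python
def fuse_features(img_feats, aud_feats, img_quality=None, aud_quality=None):
    """Frame-by-frame weighted summation of the two modalities.

    img_feats, aud_feats: (T, D) arrays/tensors with equal frame counts T.
    Weights default to 0.5/0.5; if quality scores are supplied, the weights
    are tilted toward the modality with the higher estimated quality.
    """
    if img_quality is None or aud_quality is None:
        w_img = w_aud = 0.5                       # the default described above
    else:
        total = img_quality + aud_quality
        w_img, w_aud = img_quality / total, aud_quality / total
    return w_img * img_feats + w_aud * aud_feats  # (T, D) fusion features
```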
Further, after the speaking object recognition model generates the fusion features of the target area image features and the audio features, it inputs the fusion features into a speaker recognition module to output the speaking object recognition features of the object to be recognized associated with the multimedia data to be recognized. The speaker recognition module in the speaking object recognition model performs speaker matching and recognition based on the speaking object recognition features of the object to be recognized, and outputs, based on the speaking object recognition features, a recognition result of whether the object to be recognized associated with the multimedia data to be recognized is the target object (a toy matching step is sketched after this paragraph). Referring to fig. 5 again, when the object to be identified associated with the multimedia data to be identified is the target object, the speaking object recognition client displays the recognition result on interface 5, and the target object may select the control corresponding to "Yes" on interface 5 to obtain all the audio and video of the object to be identified (i.e., the target object) associated with the multimedia data to be identified. If the target object selects the control corresponding to "No" on interface 5, the client returns to the speaking object recognition client home page (i.e., interface 1). If the object to be identified associated with the multimedia data to be identified is not the target object, the speaking object recognition client displays the recognition failure result on interface 5.
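The disclosure does not specify how the speaker recognition module compares features; one common pattern, shown purely as an assumption, is to pool the per-frame recognition features and compare the result with an enrolled embedding of the target object under a similarity threshold.

```python
import torch
import torch.nn.functional as F

def is_target_object(recognition_feats, target_embedding, threshold=0.7):
    """Hypothetical matching step: mean-pool the per-frame speaking object
    recognition features into one vector and compare it with an enrolled
    embedding of the target object. The threshold is illustrative only.

    recognition_feats: (T, D) speaking object recognition features
    target_embedding:  (D,)   enrolled embedding of the target object
    """
    utterance_vec = recognition_feats.mean(dim=0)                      # (D,)
    score = F.cosine_similarity(utterance_vec, target_embedding, dim=0)
    return bool(score >= threshold), float(score)
```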
In some possible embodiments, the multimedia data to be identified may include both the image information to be identified and the audio information to be identified of the object to be identified, or may include only the image information to be identified, or only the audio information to be identified. It will be appreciated that in some application scenarios, the face information and the voice information do not both exist in the multimedia data to be identified; for example, the face of the object to be identified is not photographed, or the front of the face does not appear in the multimedia data to be identified. In such cases, the multimedia data to be identified contains only the audio information to be identified of the object to be identified. Conversely, when the multimedia data to be identified includes only the image information to be identified of the object to be identified, that is, when the audio information to be identified is empty, the image information to be identified containing the lip moving images is obtained from the multimedia data to be identified through the lip movement detection layer in the speaking object recognition model, and the target area image features corresponding to the image information to be identified are generated through the visual information coding layer in the speaking object recognition model, so as to obtain the target area image features of the object to be identified. In this case, the voice information extraction layer in the speaking object recognition model acquires, from the multimedia data to be recognized, the audio information to be recognized corresponding in time to the image information to be recognized, and that audio information is empty. The speaking object recognition model then inputs the target area image features of the object to be recognized into the multi-modal feature fusion layer, and outputs the target area image features corresponding to the image information to be recognized as the speaking object recognition features. The steps by which the speaking object recognition model outputs the speaking object recognition features based on the multi-modal feature fusion layer are otherwise similar to the above and are not repeated here.
In some possible embodiments, when the multimedia data to be identified includes only the audio information to be identified of the object to be identified, the image information to be identified containing the lip moving images is obtained from the multimedia data to be identified through the lip movement detection layer in the speaking object recognition model, and in this case the image information to be identified is empty. The speaking object recognition model acquires the audio information to be recognized from the multimedia data to be recognized through the voice information extraction layer in the speaking object recognition model, and performs voice feature extraction on the audio information to be recognized through the auditory information coding layer in the speaking object recognition model, so as to obtain the audio features of the object to be recognized. It can be appreciated that the steps for extracting the voice features from the audio information to be identified are similar to the above and are not repeated here. After obtaining the audio features of the object to be recognized, the speaking object recognition model inputs them into the multi-modal feature fusion layer and outputs the audio features corresponding to the audio information to be recognized as the speaking object recognition features. The steps by which the speaking object recognition model outputs the speaking object recognition features based on the multi-modal feature fusion layer are otherwise similar to the above and are not repeated here (a sketch of the single-modality fallback follows this paragraph).
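The single-modality behaviour of the fusion layer can be summarised in a few lines. This sketch is illustrative: the None-based encoding of an empty modality, the function name, and the default weights are assumptions.

```python
def speaking_object_features(img_feats=None, aud_feats=None, w_img=0.5, w_aud=0.5):
    """Single-modality fallback for the multi-modal feature fusion layer:
    if one modality is empty (None here), the surviving features pass
    through unchanged as the speaking object recognition features."""
    if img_feats is None:         # no lip moving image could be detected
        return aud_feats
    if aud_feats is None:         # the audio information to be identified is empty
        return img_feats
    return w_img * img_feats + w_aud * aud_feats  # both present: weighted summation
```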
In the embodiment of the application, the terminal device can recognize the speaking object of the multimedia data and output, based on the speaking object recognition features, a recognition result of whether the object to be recognized associated with the multimedia data to be recognized is the target object. The speaking object recognition model establishes a unified semantic latent space for the two modalities of video (i.e., lip moving images) and audio, can better exploit the complementarity of multi-modal information, and has stronger applicability. Meanwhile, the method for recognizing the speaking object provided by the embodiment of the application is also suitable for single-modality application scenarios, which greatly reduces the amount of computation, greatly improves the system performance, achieves high recognition efficiency for speaking objects, and has strong extensibility.
Based on the description of the above embodiments of the method for recognizing a speaking object, an embodiment of the application further discloses an apparatus for recognizing a speaking object. The apparatus may be applied to the method for recognizing a speaking object in the embodiments shown in fig. 1 to fig. 5, to perform the steps in that method. The apparatus may be the service server or the terminal device in the embodiments shown in fig. 1 to fig. 5, that is, the apparatus may be the execution subject of the method for recognizing a speaking object in those embodiments. Referring to fig. 6, fig. 6 is a schematic structural diagram of an apparatus for recognizing a speaking object according to an embodiment of the present application. In the embodiment of the application, the apparatus may run the following modules:
The obtaining module 11 is configured to obtain positive sample training data and negative sample training data for speaking object recognition, where the positive sample training data includes target region image information of a target object and corresponding target audio information, and the negative sample training data includes audio information of a plurality of other objects corresponding to the target region image information and region image information of a plurality of other objects corresponding to the target audio information.
The feature generation module 12 is configured to input the positive sample training data and the negative sample training data into a speaking object recognition model, and generate, through the speaking object recognition model, positive sample region image features corresponding to the target region image information, positive sample audio features corresponding to the target audio information, a plurality of negative sample audio features aligned in time sequence with the positive sample region image features, and a plurality of negative sample region image features aligned in time sequence with the positive sample audio features, where the positive sample region image features and the positive sample audio features are aligned in time sequence.
The training module 13 is configured to perform contrastive learning on the positive sample region image features, the positive sample audio features, the plurality of negative sample audio features, and the plurality of negative sample region image features through the speaking object recognition model, so as to obtain the capability of recognizing, based on any input data, the speaking object associated with that input data.
The speaking object generation module 14 is configured to, when the multimedia data to be identified is acquired, input the multimedia data to be identified into the speaking object recognition model, generate speaking object recognition features through the speaking object recognition model, and output, based on the speaking object recognition features, the recognition result of whether the object to be identified associated with the multimedia data to be identified is the target object, where the speaking object recognition features include at least one of the target area image features or the audio features of the speaking object to be identified.
In some possible embodiments, when acquiring the positive sample training data and the negative sample training data for speaking object recognition, the acquiring module 11 is further configured to:
acquire a plurality of lip moving images of a target object from sample training data included in a sample training data set as the target region image information of the target object, and acquire, from the sample training data, the audio information of the target object corresponding to the time of each piece of lip movement information as the target audio information corresponding to the target region image information, so as to obtain the positive sample training data for speaking object recognition;
and acquire, from other sample training data included in the sample training data set, audio information of a plurality of other objects corresponding to the time of the lip movement information as the audio information of the plurality of other objects corresponding to the target region image information, and acquire, from other sample training data included in the sample training data set, lip movement information of a plurality of other objects corresponding to the time of the audio information of the target object as the region image information of the plurality of other objects corresponding to the target audio information, so as to obtain the negative sample training data for speaking object recognition.
In some possible embodiments, the multimedia data to be identified includes the image information to be identified and the audio information to be identified of the object to be identified; when generating the speaking object recognition features, the feature generation module 12 is configured to:
acquire the image information to be identified containing the lip moving images from the multimedia data to be identified through a lip movement detection layer in the speaking object recognition model, and acquire, from the multimedia data to be identified, the audio information to be identified corresponding in time to the image information to be identified through a voice information extraction layer in the speaking object recognition model;
generate target area image features corresponding to the image information to be identified through a visual information coding layer in the speaking object recognition model to obtain the target area image features of the object to be identified, and generate audio features corresponding to the audio information to be identified through an auditory information coding layer in the speaking object recognition model to obtain the audio features of the object to be identified;
and generate, through a multi-modal feature fusion layer in the speaking object recognition model, fusion features of the target area image features corresponding to the image information to be identified and the audio features corresponding to the audio information to be identified as the speaking object recognition features.
In some possible embodiments, the feature generation module 12 is configured to:
perform, through the multi-modal feature fusion layer in the speaking object recognition model, weighted summation on the target area image features corresponding to the image information to be identified and the audio features corresponding to the audio information to be identified based on the fusion weight of the image features and the fusion weight of the audio features, so as to generate the fusion features of the target area image features and the audio features;
wherein the fusion weight of the image features and the fusion weight of the audio features are determined based on the information quality of the image information to be identified and of the audio information to be identified.
In some possible embodiments, the feature generation module 12 is further configured to:
adjust the window length or window shift used for feature extraction on the audio information to be identified based on the number of frames of the target area image features through the auditory information coding layer in the speaking object recognition model, and perform feature extraction on the audio information to be identified with the adjusted window length or window shift, so as to obtain audio features with the same number of frames as the target area image features as the audio features corresponding to the image information to be identified; or
perform feature extraction on the audio information to be identified through the auditory information coding layer in the speaking object recognition model to obtain audio features, and replicate the audio features based on the number of frames of the target area image features, so as to obtain audio features with the same number of frames as the target area image features as the audio features corresponding to the image information to be identified.
In some possible embodiments, the speaking object generation module 14 is further configured to:
acquire the image information to be identified containing the lip moving images from the multimedia data to be identified through the lip movement detection layer in the speaking object recognition model, and generate the target area image features corresponding to the image information to be identified through the visual information coding layer in the speaking object recognition model, so as to obtain the target area image features of the object to be identified;
acquire, from the multimedia data to be identified, the audio information to be identified corresponding in time to the image information to be identified through the voice information extraction layer in the speaking object recognition model, where the audio information to be identified is empty;
and output, through the multi-modal feature fusion layer in the speaking object recognition model, the target area image features corresponding to the image information to be identified as the speaking object recognition features.
In some possible embodiments, the speaking object generation module 14 is further configured to:
acquire the image information to be identified containing the lip moving images from the multimedia data to be identified through the lip movement detection layer in the speaking object recognition model, where the image information to be identified is empty;
acquire the audio information to be identified from the multimedia data to be identified through the voice information extraction layer in the speaking object recognition model, and generate the audio features corresponding to the audio information to be identified through the auditory information coding layer in the speaking object recognition model, so as to obtain the audio features of the object to be identified;
and output, through the multi-modal feature fusion layer in the speaking object recognition model, the audio features corresponding to the audio information to be identified as the speaking object recognition features.
According to the embodiment corresponding to fig. 2, the implementations described in steps S101 to S104 of the method for recognizing a speaking object shown in fig. 2 may be performed by the respective modules of the apparatus shown in fig. 6. For example, the implementation described in step S101 of the method shown in fig. 2 may be performed by the acquiring module 11 of the apparatus shown in fig. 6, the implementation described in step S102 by the feature generation module 12, the implementation described in step S103 by the training module 13, and the implementation described in step S104 by the speaking object generation module 14. For the implementations performed by the acquiring module 11, the feature generation module 12, the training module 13, and the speaking object generation module 14, reference may be made to the implementations provided by the corresponding steps in the embodiment corresponding to fig. 2, which are not described herein again.
In the embodiment of the application, the speaking object recognition model can recognize the object to be recognized associated with the multimedia data, and output, based on the speaking object recognition features generated by the speaking object recognition model, a recognition result of whether the object to be recognized associated with the multimedia data is the target object. The multimedia data may be multimedia data recorded or uploaded by the target object through the terminal device. After acquiring the multimedia data to be recognized, the speaking object recognition model obtains a plurality of pieces of lip movement information of the object to be recognized from the multimedia data to be recognized through the lip movement detection layer in the speaking object recognition model, and obtains the audio information corresponding in time to the lip movement information based on the lip movement information. The speaking object recognition model inputs the plurality of pieces of lip movement information of the object to be recognized into the visual information coding layer to obtain the target area image features of the object to be recognized, and passes the audio information through the auditory information coding layer to obtain the audio features of the object to be recognized. The speaking object recognition model then inputs the target area image features and the audio features of the object to be recognized into the multi-modal feature fusion layer to generate the speaking object recognition features. The speaking object recognition model establishes a unified semantic latent space for the two modalities of video (i.e., lip moving images) and audio, can better exploit the complementarity of multi-modal information, has stronger applicability, greatly reduces the amount of computation, achieves high recognition efficiency for speaking objects, and has strong extensibility.
In the embodiment of the present application, the modules of the apparatus shown in the foregoing figures may be separately or entirely combined into one or several other modules, or one or more of the modules may be further split into a plurality of functionally smaller modules, which can achieve the same operation without affecting the technical effects of the embodiment of the present application. The above modules are divided based on logical functions; in practical applications, the function of one module may be implemented by a plurality of modules, or the functions of a plurality of modules may be implemented by one module. In other possible implementations of the present application, the apparatus may also include other modules, and in practical applications these functions may be implemented with the assistance and cooperation of a plurality of other modules, which is not limited herein.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the application. As shown in fig. 7, the computer device 1000 may be the terminal device in the embodiments corresponding to fig. 2 to fig. 5. The computer device 1000 may include: at least one processor 1001 (e.g., a CPU), at least one transceiver 1003, a network interface 1004, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the aforementioned processor 1001. As shown in fig. 7, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application.
In the computer device 1000 shown in FIG. 7, the network interface 1004 may provide network communication functions; while transceiver 1003 and processor 1001 may be used to invoke the device control application stored in memory 1005 to implement:
Acquiring positive sample training data and negative sample training data for speaking object recognition, wherein the positive sample training data comprises target region image information of a target object and corresponding target audio information of the target object, and the negative sample training data comprises audio information of a plurality of other objects corresponding to the target region image information and region image information of a plurality of other objects corresponding to the target audio information;
Inputting the positive sample training data and the negative sample training data into a speaking object recognition model, and generating, through the speaking object recognition model, positive sample region image features corresponding to the target region image information, positive sample audio features corresponding to the target audio information, a plurality of negative sample audio features aligned in time sequence with the positive sample region image features, and a plurality of negative sample region image features aligned in time sequence with the positive sample audio features, wherein the positive sample region image features and the positive sample audio features are aligned in time sequence;
Performing contrastive learning on the positive sample region image features, the positive sample audio features, the plurality of negative sample audio features, and the plurality of negative sample region image features through the speaking object recognition model, so as to obtain the capability of recognizing, based on any input data, the speaking object associated with that input data;
When the multimedia data to be identified is obtained, the multimedia data to be identified is input into a speaking object identification model, speaking object identification features are generated through the speaking object identification model, and the identification result of whether the object to be identified associated with the multimedia data to be identified is a target object or not is output based on the speaking object identification features, wherein the speaking object identification features comprise at least one of target area image features or audio features of the speaking object to be identified.
It should be understood that the computer device 1000 described in the embodiment of the present application may perform the description of the method for identifying the speaking object in the embodiment corresponding to fig. 2, and may also perform the description of the apparatus for identifying the speaking object in the embodiment corresponding to fig. 6, which is not described herein. In addition, the description of the beneficial effects of the same method is omitted.
In the embodiment of the application, based on the functions implemented by the computer device 1000, the complementarity of multi-modal information can be better exploited, the applicability is stronger, the amount of computation is greatly reduced, the recognition efficiency for speaking objects is high, and the extensibility is strong.
An embodiment of the present application further provides a computer readable storage medium storing a computer program, where the computer program includes program instructions which, when executed by a processor, implement the method for recognizing a speaking object provided by the steps in fig. 2; for details, reference may be made to the implementations provided by the steps in fig. 2, which are not described herein again. In addition, the description of the beneficial effects of the same method is omitted.
The computer readable storage medium may be an internal storage unit of the apparatus for recognizing a speaking object provided in any of the foregoing embodiments or of the computer device, such as a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the computer device. Further, the computer readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer readable storage medium is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of a computer device reads the computer instructions from the computer readable storage medium and executes them, so that the computer device can perform the description of the method for recognizing a speaking object in the embodiment corresponding to fig. 2, which is not repeated here. In addition, the description of the beneficial effects of the same method is omitted.
The method provided by the embodiment of the application can improve the efficiency of recognizing the object to be recognized associated with the multimedia data, and establishes a unified semantic latent space for the two modalities of video and audio, thereby better exploiting the complementarity of multi-modal information, greatly improving system performance, and remaining compatible with single-modality situations. For example, the method provided by the embodiment of the application can still be used when only voice exists because the face is not photographed, or when only face information exists because the voice is weak or pickup fails. In addition, the method provided by the embodiment of the application does not need a separate system for each modality: the single-modality situations are handled within the same model, and the complementary effect of multi-modal information greatly reduces the amount of computation.
The terms "first", "second" and the like in the description, claims, and drawings of the embodiments of the application are used to distinguish different objects rather than to describe a particular sequential order. Furthermore, the term "include" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, apparatus, article, or device that comprises a list of steps or elements is not limited to the listed steps or modules, but may optionally include other steps or modules not listed or inherent to such process, method, apparatus, article, or device.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The method and related apparatus provided in the embodiments of the present application are described with reference to the flowcharts and/or schematic structural diagrams of the method provided in the embodiments of the present application, and each flow and/or block of the flowcharts and/or schematic structural diagrams, and combinations of flows and/or blocks therein, may be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the schematic structural diagrams. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the schematic structural diagrams. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, so that a series of operational steps are performed on the computer or other programmable apparatus to produce a computer-implemented process, and the instructions executed on the computer or other programmable apparatus thereby provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the schematic structural diagrams.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.