CN111028837B

CN111028837B - Voice conversation method, voice recognition system and computer storage medium

Info

Publication number: CN111028837B
Application number: CN201911294819.2A
Authority: CN
Inventors: 李永耀
Original assignee: Shenzhen Yunzhijia Network Co ltd
Current assignee: Shenzhen Yunzhijia Network Co ltd
Priority date: 2019-12-16
Filing date: 2019-12-16
Publication date: 2022-10-04
Anticipated expiration: 2039-12-16
Also published as: CN111028837A

Abstract

The embodiment of the application discloses a voice conversation method, a voice recognition system and a computer storage medium, which are used for establishing a one-to-many voice conversation scene. The method in the embodiment of the application comprises the following steps: the voice recognition system receives voice data sent by a session initiator, where word slot information in the voice data comprises identity information of a session receiver of the voice session; the voice recognition system recognizes the voice data to determine the identity information of the session receiver, and searches a registration information base for target registration information corresponding to the identity information; and the voice recognition system creates the voice session between the session initiator and the session receiver corresponding to the target registration information. In the embodiment of the application, the session initiator can initiate a voice session with a plurality of session receivers, namely a one-to-many voice session scene, so that the requirement of an enterprise multiparty conference is met.

Description

Voice conversation method, voice recognition system and computer storage medium

Technical Field

The embodiment of the application relates to the technical field of voice interaction, in particular to a voice conversation method, a voice recognition system and a computer storage medium.

Background

With the rapid development of voice recognition technology, intelligent devices supporting voice recognition functions, such as intelligent vehicle-mounted devices and intelligent sound boxes, have gradually entered every corner of users' work and life. Through the voice recognition function, an intelligent sound box can provide users with intelligent services such as music playing, question answering, weather or flight information inquiry, and outbound call dialing. Smart speakers may use their own microphone array to collect voice data of people in the environment.

The voice recognition system is a voice information processing system which can recognize voice data collected by the intelligent sound box. The intelligent sound box can be connected to a wireless network, is connected with the voice recognition system, and sends the collected voice data of the session initiator to the voice recognition system. The voice recognition system recognizes the voice data and, after identifying the session receiver with whom the session initiator wants to hold the voice session, creates a voice session between the two parties.

However, the voice recognition system can only allow a conversation initiator to have a voice conversation with a conversation receiver through the smart speaker, i.e. a "one-to-one" voice conversation scenario. When a voice conversation requires participation of multiple conversation receivers, for example, a conference of an enterprise requires participation of multiple conversation receivers, the voice recognition system obviously cannot meet the requirement.

Disclosure of Invention

The embodiment of the application provides a voice conversation method, a voice recognition system and a computer storage medium, which are used for establishing a one-to-many voice conversation scene.

A first aspect of an embodiment of the present application provides a voice conversation method, including:

receiving voice data sent by a session initiator, wherein the word slot information of the voice data comprises identity information of a session receiver of the voice session;

recognizing the voice data to determine identity information of the session receiver;

searching target registration information corresponding to the identity information in a registration information base, wherein the target registration information corresponds to the session receiver;

and creating the voice session of the session receiver and the session initiator corresponding to the target registration information.

Preferably, the receiving voice data sent by the session initiator includes:

and receiving voice data sent by intelligent equipment, wherein the voice data is the voice data sent by the session initiator and collected by the intelligent equipment.

Preferably, the creating a voice session between the session receiver and the session initiator corresponding to the target registration information includes:

creating a channel for a voice session of the session initiator and the session receiver;

acquiring identification information of a channel of the voice conversation;

and sending the identification information to the intelligent equipment and the terminal of the conversation receiver so as to enable the intelligent equipment and the terminal of the conversation receiver to be connected with the channel of the voice conversation according to the identification information.

Preferably, after the creating of the voice session between the session receiver and the session initiator corresponding to the target registration information, the method further includes:

generating prompt voice, wherein the prompt voice is used for prompting the session initiator that the voice session is successfully created;

and sending the prompt voice to the intelligent equipment so that the intelligent equipment plays the prompt voice.

Preferably, after the voice session between the session receiver and the session initiator corresponding to the target registration information is created, the method further includes:

receiving audio data sent by the session initiator or the session receiver;

and forwarding the audio data to the session receiver or the session initiator.

A second aspect of an embodiment of the present application provides a speech recognition system, including:

the device interaction unit is used for receiving voice data sent by a session initiator, and the word slot information of the voice data comprises identity information of a session receiver of the voice session;

the identification unit is used for identifying the voice data so as to determine the identity information of the conversation receiver;

the personnel management unit is used for searching target registration information corresponding to the identity information in a registration information base, and the target registration information corresponds to the session receiver;

a creating unit, configured to create a voice session between the session receiver and the session initiator, where the voice session corresponds to the target registration information.

Preferably, the device interaction unit is specifically configured to receive voice data sent by the intelligent device, where the voice data is voice data sent by the session initiator and collected by the intelligent device.

Preferably, the creating unit is specifically configured to create a channel of the voice session between the session initiator and the session receiver, acquire identification information of the channel of the voice session, and send the identification information to the intelligent device and the terminal of the session receiver, so that the intelligent device and the terminal of the session receiver are connected to the channel of the voice session according to the identification information.

Preferably, the speech recognition system further comprises:

the generating unit is used for generating prompt voice which is used for prompting the session initiator that the voice session is successfully created;

and the sending unit is used for sending the prompt voice to the intelligent equipment so that the intelligent equipment plays the prompt voice.

Preferably, the device interaction unit is further configured to receive audio data sent by the session initiator or the session receiver;

the speech recognition system further comprises:

a forwarding unit, configured to forward the audio data to the session receiver or the session initiator.

A third aspect of the embodiments of the present application provides a speech recognition system, including:

the system comprises a processor, a memory, a bus and input and output equipment;

the processor is connected with the memory and the input and output equipment;

the bus is respectively connected with the processor, the memory and the input and output equipment;

the input and output equipment is used for receiving voice data sent by a conversation initiator, and the word slot information of the voice data comprises identity information of a conversation receiver of the voice conversation;

the processor is configured to identify the voice data to determine identity information of the session receiver, and search for target registration information corresponding to the identity information in a registration information library, where the target registration information corresponds to the session receiver, and create a voice session between the session receiver and the session initiator, where the session receiver corresponds to the target registration information.

A fourth aspect of embodiments of the present application provides a computer storage medium having instructions stored therein, which when executed on a computer, cause the computer to perform the method of the first aspect.

According to the technical scheme, the embodiment of the application has the following advantages:

the voice recognition system receives voice data sent by a session initiator, word slot information in the voice data comprises identity information of a session receiver of the voice session, the voice recognition system recognizes the voice data to determine the identity information of the session receiver, and searches target registration information corresponding to the identity information in a registration information base, so that the voice session between the session receiver and the session initiator corresponding to the target registration information is created. In the embodiment of the application, the session initiator can initiate the voice session with a plurality of session receivers, that is, a one-to-many voice session scenario, and the requirement of the multi-party conference of an enterprise is met.

Drawings

FIG. 1 is a schematic diagram of a network architecture according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a voice conversation method according to an embodiment of the present application;

FIG. 3 is another flowchart illustrating a voice conversation method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a speech recognition system according to an embodiment of the present application;

FIG. 5 is a schematic diagram of another exemplary embodiment of a speech recognition system;

FIG. 6 is a schematic diagram of another structure of a speech recognition system in the embodiment of the present application.

Detailed Description

Referring to fig. 1, a network architecture according to an embodiment of the present invention includes:

the system comprises a voice acquisition intelligent terminal 101, a voice recognition system 102, a terminal 103 and a network 104.

The embodiment of the present application can be applied to a network architecture as shown in fig. 1, in which the voice collecting intelligent terminal 101 includes a wireless connection module, a microphone array, and a speaker, where the wireless connection module includes but is not limited to a bluetooth module and a WiFi module, and the wireless connection module can be used to connect the voice recognition system 102 to implement data transmission. The microphone array is used for constantly monitoring the surrounding environment and collecting voice data of people in the environment.

In the network architecture of the embodiment of the present application, the voice collecting intelligent terminal 101 establishes a connection with the voice recognition system 102 through the wireless connection module. During a voice session, the voice collecting intelligent terminal 101 may use encrypted Hypertext Transfer Protocol 2.0 (HTTP/2) to exchange data such as audio data or instructions with the voice recognition system 102. The voice interaction between the voice collecting intelligent terminal 101 and the voice recognition system 102 may be based on the Alexa Voice Service (AVS) of Amazon. The specific voice interaction program is not limited, and may also be, for example, the conversational artificial intelligence system DuerOS of Baidu or the voice recognition interface Siri of Apple.

The voice collecting intelligent terminal 101 may further include a noise suppression (NS) module. Because the microphone array of the sound box is always in a monitoring state, the voice data collected by the microphone array is inevitably mixed with noise from the surrounding environment. The noise suppression module can effectively remove the environmental noise from the sampled audio stream, thereby improving the accuracy of subsequent keyword recognition and voice recognition.

In addition, the voice collecting intelligent terminal 101 may further include a keyword recognition (KW) module, which is configured to wake up and activate the sound box so that the sound box switches from an ordinary audio monitoring state to a voice instruction recognition state. In the embodiment of the present application, offline keyword recognition may be adopted, and Chinese text may be used to train the voice collecting intelligent terminal 101 so that it can be awakened by a Chinese keyword. For example, the keyword "hello clouds" may be used to awaken the voice collecting intelligent terminal 101.
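The switch between the two states described above can be sketched as a small state machine. This is only an illustration under stated assumptions: real keyword recognition runs on audio, whereas this sketch assumes a transcript is already available, and the names `SpeakerState`, `WakeWordGate`, and `on_transcript` are hypothetical and do not come from the application.

```python
from enum import Enum
from typing import Optional


class SpeakerState(Enum):
    MONITORING = "monitoring"      # ordinary audio monitoring state
    RECOGNIZING = "recognizing"    # voice instruction recognition state


class WakeWordGate:
    """Hypothetical gate: only utterances that follow the wake phrase pass on."""

    def __init__(self, wake_phrase: str) -> None:
        self._wake_phrase = wake_phrase.strip().lower()
        self.state = SpeakerState.MONITORING

    def on_transcript(self, text: str) -> Optional[str]:
        """Return the instruction text if the gate was awake, otherwise None."""
        normalized = text.strip().lower()
        if self.state is SpeakerState.MONITORING:
            # Assumption: a simple substring match stands in for keyword spotting.
            if self._wake_phrase in normalized:
                self.state = SpeakerState.RECOGNIZING
            return None
        # One instruction is accepted per wake-up, then monitoring resumes.
        self.state = SpeakerState.MONITORING
        return text.strip()
```

In this sketch the gate returns to the monitoring state after a single instruction, which is a design simplification rather than behavior stated in the application.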

The voice collecting intelligent terminal 101 can also be used for collecting voice data sent by a user when the user performs a voice session, and therefore, the voice collecting intelligent terminal 101 can further include a Voice Activity Detector (VAD) which can be used for detecting whether the voice session is ended. When the voice session ends, the VAD detects the silence, at which time the VAD may terminate the upload of the audio data of the voice session to the voice recognition system 102.
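As an illustration of how silence detection could stop the upload, the following minimal sketch uses a simple frame-energy threshold. The application does not specify the VAD algorithm, so the energy criterion, the threshold value, the number of silent frames, and the function names `frame_energy` and `frames_to_upload` are all assumptions made for this example.

```python
from typing import List, Sequence


def frame_energy(frame: Sequence[int]) -> float:
    """Mean squared amplitude of one frame of PCM samples."""
    if not frame:
        return 0.0
    return sum(sample * sample for sample in frame) / len(frame)


def frames_to_upload(
    frames: Sequence[Sequence[int]],
    energy_threshold: float = 1000.0,   # assumed value, tune per device
    max_silent_frames: int = 3,         # assumed length of "end of speech" silence
) -> List[Sequence[int]]:
    """Return the frames to upload, stopping once sustained silence is seen.

    Short pauses inside speech are kept; the silent run that ends the
    utterance (and any trailing silence) is dropped.
    """
    kept: List[Sequence[int]] = []
    silent_run: List[Sequence[int]] = []
    for frame in frames:
        if frame_energy(frame) < energy_threshold:
            silent_run.append(frame)
            if len(silent_run) >= max_silent_frames:
                break  # sustained silence: terminate the upload
        else:
            kept.extend(silent_run)  # the pause was short, so keep it
            silent_run = []
            kept.append(frame)
    return kept
```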

In this embodiment of the present application, a user may initiate a voice session based on the above network architecture, the user initiating the voice session may be referred to as a session initiator, and the user responding to the initiation of the voice session may be referred to as a session receiver. In the network architecture of the embodiment of the present application, the voice collection intelligent terminal 101 and the terminal 103 can collect the session voice data of both parties of the voice session, and the voice recognition system 102 can forward the session voice data to both parties of the voice session through the network 104.

The network 104 is generally a wireless network, and may also be a wired network, and if the network is a wireless network, the type of the network may be a cellular wireless network, or a WiFi network, or another type of wireless network. In the case of a wired network, the typical network is in the form of a fiber optic network. The terminal 103 may be a computer, a Personal Digital Assistant (PDA), a tablet computer, a smart phone, or the like.

In the embodiment of the present application, any intelligent terminal that has a microphone array capable of collecting voice data, a wireless connection module, and a speaker may serve as the voice collecting intelligent terminal 101 of the network architecture. The specific form of the voice collecting intelligent terminal 101 is not limited, and it may be, for example, an intelligent sound box. When the voice collecting intelligent terminal 101 is an intelligent sound box, the bluetooth module on the intelligent sound box can be used for binding the terminal of the user so that the user can control the intelligent sound box.

It should be noted that the voice collection smart terminal 101 is represented by a smart speaker icon in the figure, but it is not limited to a smart speaker and may also be a smart phone. Because a smart phone integrates the functions of a smart speaker together with human-computer interaction functions, it can also be used to initiate the voice session and exchange session voice in the embodiment of the present application. A user may also conduct the voice session directly through the smart phone without a smart speaker, which removes the step of binding the user terminal to the smart speaker.

The following describes a voice session method in the embodiment of the present application with reference to the network architecture of fig. 1:

referring to fig. 2, an embodiment of a voice conversation method in the embodiment of the present application includes:

201. Receiving voice data sent by a session initiator;

when a session initiator needs to conduct a voice session with one or more session receivers, the session initiator issues a voice instruction to the voice acquisition intelligent terminal. The voice instruction contains voice data, and the intent expressed by the voice data is to conduct a voice session. The voice acquisition intelligent terminal can acquire the voice data and send the voice data to the voice recognition system. The speech recognition system receives the voice data.

The voice data comprises word slot information, and the word slot information comprises identity information of a session receiver. For example, the session initiator issues an instruction to the voice acquisition intelligent terminal inviting one or more session receivers to participate in the session. In this case, the word slot information in the voice data sent by the session initiator is the information of all parties with whom the session initiator wants to conduct the voice session, and the word slot information may include the identity information of those session receivers. The identity information may be a real name, a nickname, or an enterprise work number of the session receiver, and is not limited here as long as it can identify the session receiver.
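To make the notion of word slot information concrete, the sketch below pulls receiver identities out of an already-transcribed instruction. It assumes a hypothetical English command template of the form "invite ... to a meeting"; the pattern, the function name `extract_receiver_slots`, and the example names are illustrative only and are not taken from the application, which does not prescribe how slots are parsed.

```python
import re
from typing import List

# Hypothetical command template; the receiver list is the "slot" to be filled.
INVITE_PATTERN = re.compile(
    r"^(?:invite|call)\s+(?P<receivers>.+?)\s+to\s+(?:a|the)\s+"
    r"(?:meeting|conference|voice session)$",
    re.IGNORECASE,
)

# Receivers may be separated by commas, "and", or ", and".
SEPARATOR = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")


def extract_receiver_slots(text: str) -> List[str]:
    """Return the receiver identities named in the instruction, in order."""
    match = INVITE_PATTERN.match(text.strip())
    if match is None:
        return []  # not a session-invitation instruction
    parts = SEPARATOR.split(match.group("receivers"))
    return [part.strip() for part in parts if part.strip()]
```

Each returned string can be a real name, a nickname, or a work number, which matches the forms of identity information listed above.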

In this embodiment, the voice collection intelligent terminal may be an intelligent device. The intelligent device may specifically be an intelligent sound box or an intelligent mobile phone, and its specific form is not limited here as long as it has a microphone array capable of collecting voice data, a wireless connection module, and a speaker.

202. Recognizing voice data to determine identity information of a conversation receiver;

after the voice data is received by the voice recognition system, the voice data is recognized to determine identity information of a session receiver. In this embodiment, the speech recognition system recognizes and understands the voice data sent by the session initiator based on artificial intelligence technology such as Natural Language Processing (NLP). For example, the speech recognition system may be trained on a large amount of Chinese text data using deep learning algorithms such as a BP neural network algorithm and a deep convolutional neural network algorithm, so that the speech recognition system can recognize the Chinese voice data sent by the session initiator.

203. Searching target registration information corresponding to the identity information in a registration information base;

in this embodiment, the user may register as a system user on the voice recognition system, the voice recognition system forms registration information of the user, and the registration information of a plurality of users forms a registration information library. The registration information is identification information of the user on the system, and may be, for example, a registered account, a mailbox or a personal social network account to which the registered account is bound, a network nickname of the registered account, and the like.

In addition, the voice recognition system can establish an association relationship between the registration information of the user and the identity information of the user. For example, the voice recognition system may associate a user's registered account with the user's real name, or associate a network nickname of the user's registered account with a nickname of the user in real life. In this way, the voice recognition system can acquire the registration information corresponding to the identity information according to the identity information included in the received voice data.

After determining the identity information of the session receiver, the voice recognition system can search the registration information base for target registration information corresponding to the identity information of the session receiver. If the target registration information is found, this indicates that the session receiver has registered as a system user, so that a voice session between the session receiver and the session initiator can be established.
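The association between identity information and registration information, and the lookup of target registration information, can be sketched as follows. The classes `Registration` and `RegistrationBase`, the function `resolve_receivers`, and the case-insensitive matching rule are assumptions made for this illustration; the application does not specify how the registration information base is stored or queried.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Registration:
    """Hypothetical registration record of one system user."""
    account: str
    email: str
    nickname: str


class RegistrationBase:
    """In-memory stand-in for the registration information base."""

    def __init__(self) -> None:
        self._by_identity: Dict[str, Registration] = {}

    @staticmethod
    def _key(identity: str) -> str:
        # Assumption: identities are matched case-insensitively.
        return identity.strip().lower()

    def associate(self, identity: str, registration: Registration) -> None:
        """Associate a real name, nickname, or work number with a registration."""
        self._by_identity[self._key(identity)] = registration

    def find(self, identity: str) -> Optional[Registration]:
        return self._by_identity.get(self._key(identity))


def resolve_receivers(
    base: RegistrationBase, identities: Sequence[str]
) -> Tuple[List[Registration], List[str]]:
    """Split identities into found registrations and unregistered identities."""
    found: List[Registration] = []
    missing: List[str] = []
    for identity in identities:
        registration = base.find(identity)
        if registration is None:
            missing.append(identity)
        elif registration not in found:  # the same user may be named twice
            found.append(registration)
    return found, missing
```

Only the receivers in the first list have target registration information, so only they can be invited to the voice session in this sketch.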

204. Creating a voice session of a session receiver and a session initiator corresponding to the target registration information;

after determining the session receiver corresponding to the target registration information, the voice recognition system creates a voice session between the session receiver and the session initiator.

In this embodiment, the voice recognition system receives voice data sent by a session initiator, where the word slot information in the voice data includes identity information of a session receiver of the voice session. The voice recognition system recognizes the voice data to determine the identity information of the session receiver, searches a registration information base for target registration information corresponding to the identity information, and then creates a voice session between the session initiator and the session receiver corresponding to the target registration information. In the embodiment of the application, the session initiator can initiate a voice session with a plurality of session receivers, namely a one-to-many voice session scene, so that the requirement of an enterprise multiparty conference is met.

After the voice recognition system creates the voice session, both parties of the voice session can conduct the voice session. After the voice session is created, the voice recognition system will also perform a series of operations. The operations performed after the speech recognition system creates a speech session will be described in detail next. Referring to fig. 3, another embodiment of the voice conversation method in the embodiment of the present application includes:

301. Receiving voice data sent by a session initiator;

302. Recognizing voice data to determine identity information of a conversation receiver;

303. Searching target registration information corresponding to the identity information in a registration information base;

the operations performed in steps 301 to 303 are similar to the operations performed in steps 201 to 203 in the embodiment shown in fig. 2, and are not described again here.

304. Creating a voice session of a session receiver and a session initiator corresponding to the target registration information;

after determining the session receiver corresponding to the target registration information, the voice recognition system creates a channel of the voice session between the session initiator and the session receiver, and acquires identification information of the channel of the voice session. The voice recognition system sends an instruction of joining a channel of the voice conversation to the intelligent equipment of the conversation initiator, the instruction carries identification information of the channel, and the intelligent equipment responds to the instruction and is connected with the channel of the voice conversation.

In addition, the voice recognition system can also send a prompt for joining the voice session to the terminal of the session receiver, wherein the prompt carries the identification information of the channel, and the session receiver can confirm whether to join the voice session through the terminal. For example, the session receiver is registered on the system and becomes a system user, when the session receiver logs in a registration account on the smart phone, the voice recognition system can send a prompt of a channel for joining the voice session to the smart phone of the session receiver, and the prompt carries identification information of the channel, so that the session receiver can confirm whether to join the voice session through the smart phone, and use the smart phone for voice communication after joining the voice session.
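The channel creation in step 304 can be sketched as follows: a channel is created, its identification information is obtained, a join instruction goes to the initiator's intelligent device, and a prompt goes to each receiver terminal. The class names, the message fields `type` and `channel_id`, the use of a random UUID as identification information, and the injected `send` callback are all assumptions for illustration and are not defined by the application.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

Message = Dict[str, str]


@dataclass
class Channel:
    channel_id: str
    initiator_device: str
    invited: Set[str]
    connected: Set[str] = field(default_factory=set)


class SessionCreator:
    """Hypothetical creating unit; `send` delivers a message to one endpoint."""

    def __init__(self, send: Callable[[str, Message], None]) -> None:
        self._send = send
        self.channels: Dict[str, Channel] = {}

    def create_session(self, initiator_device: str, receiver_terminals: List[str]) -> str:
        # Assumption: a random UUID serves as the channel identification information.
        channel_id = uuid.uuid4().hex
        self.channels[channel_id] = Channel(
            channel_id, initiator_device, set(receiver_terminals)
        )
        # Instruction to the initiator's device, carrying the channel identification.
        self._send(initiator_device, {"type": "join", "channel_id": channel_id})
        # Prompt to each receiver terminal, which may confirm or decline.
        for terminal in receiver_terminals:
            self._send(terminal, {"type": "invite", "channel_id": channel_id})
        return channel_id

    def connect(self, channel_id: str, endpoint: str) -> bool:
        """Connect an endpoint that presents a valid channel identification."""
        channel = self.channels.get(channel_id)
        if channel is None:
            return False
        if endpoint != channel.initiator_device and endpoint not in channel.invited:
            return False  # only the initiator and invited receivers may join
        channel.connected.add(endpoint)
        return True
```

A receiver that declines the prompt simply never calls `connect` in this sketch, so it never appears in `connected`.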

305. Generating a prompt voice;

after the intelligent device of the session initiator connects to the channel of the voice session and the session receiver confirms to join the voice session, the voice recognition system generates a prompt text sentence, and the content of the prompt text sentence can indicate that the voice session is successfully created. The voice recognition system synthesizes the prompt text sentence into a prompt voice, and the prompt voice can be used for prompting that the voice conversation is successfully established.

306. Sending a prompt voice to the intelligent equipment;

after synthesizing the prompt speech, the speech recognition system sends the prompt speech to the intelligent device of the session initiator. After receiving the prompt voice, the intelligent device of the session initiator plays the prompt voice to prompt that the voice session of the session initiator is successfully established, and the session initiator can perform voice session with the session receiver.

307. Receiving audio data sent by a session initiator or a session receiver;

after the voice recognition system creates the voice session, the session initiator and the session recipient can conduct the voice session. In the process of voice conversation, voices emitted by both conversation parties are respectively collected by respective terminals and audio data are generated. In this embodiment, the intelligent device of the session initiator collects the voice sent by the session initiator and generates audio data, the terminal of the session receiver collects the voice of the session receiver and generates audio data, the intelligent device of the session initiator and the terminal of the session receiver send the generated audio data to the voice recognition system, and the voice recognition system receives the audio data sent by the session initiator or the session receiver respectively.

308. Forwarding the audio data to a session receiver or a session initiator;

after the voice recognition system receives the audio data sent by the session initiator, if a plurality of session receivers exist, the voice recognition system copies the audio data and forwards one copy to the terminal of each session receiver. The terminal of each session receiver parses the audio data and then plays the voice. Similarly, after receiving the audio data sent by a session receiver, the voice recognition system copies the audio data and forwards the copies to the terminals of the other session receivers and to the intelligent device of the session initiator, each of which parses the audio data and then plays the voice it contains.
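The copy-and-forward behavior of step 308 amounts to a fan-out to every participant except the sender. The following minimal sketch assumes participants are identified by plain strings and that audio arrives as a bytes-like payload; the function name `forward_audio` is hypothetical.

```python
from typing import Dict, Iterable, Union


def forward_audio(
    participants: Iterable[str],
    sender: str,
    audio: Union[bytes, bytearray],
) -> Dict[str, bytes]:
    """Map each recipient to the payload it should receive.

    Every participant other than the sender receives the audio, whether the
    sender is the session initiator or one of the session receivers.
    """
    members = list(participants)
    if sender not in members:
        raise ValueError(f"{sender!r} is not connected to this voice session")
    # bytes(audio) gives each recipient an immutable payload, so later changes
    # to a mutable input buffer do not affect what was queued for delivery.
    return {member: bytes(audio) for member in members if member != sender}
```

For example, with participants `speaker-1`, `phone-a`, and `phone-b`, audio from `phone-a` is delivered to `speaker-1` and `phone-b` only.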

In this embodiment, after the voice session is successfully created, the voice recognition system sends the prompt voice for successfully creating the voice session to the session initiator, so that the session initiator can conveniently know the creation progress of the voice session.

The voice conversation method in the embodiment of the present application is described above, and the voice recognition system in the embodiment of the present application is described below. Referring to fig. 4, an embodiment of the voice recognition system in the embodiment of the present application includes:

the device interaction unit 401 is configured to receive voice data sent by a session initiator, where word slot information of the voice data includes identity information of a session receiver of a voice session;

an identifying unit 402, configured to identify voice data to determine identity information of a session recipient;

the personnel management unit 403 is configured to search for target registration information corresponding to the identity information in the registration information base, where the target registration information corresponds to the session receiver;

a creating unit 404, configured to create a voice session between the session receiver and the session initiator corresponding to the target registration information.

In this embodiment, operations performed by each unit in the speech recognition system are similar to those described in the embodiment shown in fig. 2, and are not described again here.

In this embodiment, the device interaction unit 401 receives voice data sent by a session initiator, where word slot information in the voice data includes identity information of a session receiver of a voice session. The identifying unit 402 recognizes the voice data to determine the identity information of the session receiver, the personnel management unit 403 searches for target registration information corresponding to the identity information in a registration information base, and the creating unit 404 then creates a voice session between the session initiator and the session receiver corresponding to the target registration information. In the embodiment of the application, the session initiator can initiate a voice session with a plurality of session receivers, namely a one-to-many voice session scene, so that the requirement of an enterprise multiparty conference is met.

Referring to fig. 5, an embodiment of a speech recognition system in the embodiment of the present application includes:

the device interaction unit 501 is configured to receive voice data sent by a session initiator, where word slot information of the voice data includes identity information of a session receiver of a voice session;

an identifying unit 502, configured to identify voice data to determine identity information of a session recipient;

the personnel management unit 503 is configured to search for target registration information corresponding to the identity information in the registration information base, where the target registration information corresponds to the session receiver;

a creating unit 504, configured to create a voice session between the session receiver and the session initiator corresponding to the target registration information.

In this embodiment, the device interaction unit 501 is specifically configured to receive voice data sent by an intelligent device, where the voice data is voice data sent by a session initiator and collected by the intelligent device.

The creating unit 504 is specifically configured to create a channel of the voice session between the session initiator and the session receiver, acquire identification information of the channel of the voice session, and send the identification information to the intelligent device and the terminal of the session receiver, so that the intelligent device and the terminal of the session receiver are connected to the channel of the voice session according to the identification information.

In this embodiment, the speech recognition system further includes:

a generating unit 505, configured to generate a prompt voice, where the prompt voice is used to prompt a session initiator that a voice session is successfully created;

a sending unit 506, configured to send a prompt voice to the smart device, so that the smart device plays the prompt voice.

In this embodiment, the device interaction unit 501 is further configured to receive audio data sent by a session initiator or a session receiver;

the speech recognition system further comprises:

a forwarding unit 507, configured to forward the audio data to the session receiver or the session initiator.
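A forwarding step of this kind might look like the sketch below. The text states only that audio from one side is forwarded to the other; the choice here to deliver audio from one receiver to the other receivers as well, along with the function name and the delivery-map return value, is an assumption made for a multi-party setting.

```python
def forward_audio(sender, audio, participants):
    """Return a delivery map: every participant except the sender gets the audio.

    Assumption: in a one-to-many session, audio from a receiver is also
    delivered to the other receivers, not only to the initiator.
    """
    if sender not in participants:
        raise ValueError(f"{sender} is not part of this session")
    return {p: audio for p in participants if p != sender}


participants = ["speaker-A", "terminal-001", "terminal-002"]
from_initiator = forward_audio("speaker-A", b"\x01\x02", participants)
from_receiver = forward_audio("terminal-001", b"\x03", participants)
```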

In this embodiment, after the creating unit 504 creates the voice session, the parties to the voice session can talk to each other. After the voice session is created, the units in the speech recognition system also perform a series of operations. These operations are similar to those described in the foregoing embodiment shown in fig. 3 and are not described again here.

Referring to fig. 6, another embodiment of the speech recognition system in the embodiments of the present application is described below.

The speech recognition system 600 may include one or more central processing units (CPUs) 601 and a memory 605, where the memory 605 stores one or more applications or data.

The memory 605 may be volatile storage or persistent storage. The program stored in the memory 605 may include one or more modules, each of which may include a series of instruction operations for the speech recognition system. Further, the central processing unit 601 may be configured to communicate with the memory 605 and to execute, on the speech recognition system 600, the series of instruction operations stored in the memory 605.

The speech recognition system 600 may also include one or more power supplies 602, one or more wired or wireless network interfaces 603, one or more input-output interfaces 604, and/or one or more operating systems, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.

The central processing unit 601 can perform the operations performed by the speech recognition system in the embodiments shown in fig. 2 to fig. 3, and details thereof are not repeated herein.

An embodiment of the present application further provides a computer storage medium, where one embodiment includes: the computer storage medium has stored therein instructions that, when executed on a computer, cause the computer to perform the operations performed by the speech recognition system in the embodiments of fig. 2-3.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a logical division of functions, and other divisions are possible in an actual implementation. For instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as a stand-alone product, it may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

Claims

1. A voice conversation method, applied to a voice conversation scenario between a conversation initiator and a plurality of conversation receivers, the method comprising:

2. The voice conversation method according to claim 1, wherein the receiving voice data from the session initiator comprises:

3. The method of claim 2, wherein the creating a voice session between the session receiver and the session initiator corresponding to the target registration information comprises:

acquiring identification information of a channel of the voice conversation;

4. The voice session method according to claim 3, wherein after the creating of the voice session between the session receiver and the session initiator corresponding to the target registration information, the method further comprises:

5. The voice session method of claim 1, wherein after the creating of the voice session between the session receiver and the session initiator corresponding to the target registration information, the method further comprises:

receiving audio data sent by the session initiator or the session receiver;

and forwarding the audio data to the session receiver or the session initiator.

6. A speech recognition system for use in a speech conversation scenario between a conversation initiator and a plurality of conversation recipients, the speech recognition system comprising:

and the creating unit is used for creating the voice conversation of the conversation receiving party and the conversation initiating party corresponding to the target registration information.

7. The speech recognition system of claim 6, wherein the device interaction unit is specifically configured to receive speech data sent by a smart device, where the speech data is speech data sent by the session initiator and collected by the smart device.

8. The speech recognition system of claim 7, wherein the creating unit is specifically configured to create a channel of the voice session between the session initiator and the session receiver, obtain identification information of the channel of the voice session, and send the identification information to the intelligent device and the terminal of the session receiver, so that the intelligent device and the terminal of the session receiver connect the channel of the voice session according to the identification information.

9. The speech recognition system of claim 8, further comprising:

10. The speech recognition system of claim 6, wherein the device interaction unit is further configured to receive audio data sent by the session initiator or the session receiver;

the speech recognition system further comprises:

11. A speech recognition system for use in a speech conversation scenario between a conversation initiator and a plurality of conversation recipients, the speech recognition system comprising:

the processor is connected with the memory and the input and output equipment;

12. A computer storage medium having stored therein instructions that, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 5.