
WO2019227290A1 - Systems and methods for speech recognition - Google Patents


Info

Publication number: WO2019227290A1
Authority: WO (WIPO PCT)
Prior art keywords: accent, speaker, sample, speech, neural network
Application number: PCT/CN2018/088717
Other languages: French (fr)
Inventor: Xiulin Li
Original Assignee: Beijing Didi Infinity Technology And Development Co., Ltd.
Application filed by Beijing Didi Infinity Technology And Development Co., Ltd.
Priority to PCT/CN2018/088717 (WO2019227290A1)
Priority to CN201880047060.5A (CN110914898B)
Publication of WO2019227290A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 Speech recognition
    • G10L 15/005 Language recognition
    • G10L 15/08 Speech classification or search
    • G10L 15/16 Speech classification or search using artificial neural networks
    • G10L 15/26 Speech to text systems
    • G10L 17/00 Speaker identification or verification techniques
    • G10L 17/26 Recognition of special voice characteristics, e.g. for use in lie detectors; Recognition of animal voices
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present disclosure generally relates to systems and methods for speech recognition, and in particular, to systems and methods using artificial intelligence (AI) for recognizing accented speeches.
  • Automatic speech recognition is an important technology that enables the recognition and translation of spoken language into computer recognized text equivalents.
  • automatic speech recognition attempts to provide accurate recognition results for speeches in different languages and accents.
  • it is challenging to recognize and translate an accented speech as the accented pronunciation of a language can result in a misidentification and failed recognition of words. Therefore, it is desirable to provide AI systems and methods that are capable of recognizing accented speeches and translating the accented speeches into a desired form, such as textual contents or an audio speech with another predetermined accent or language.
  • a system may include at least one audio signal input device, at least one storage medium, and at least one processor.
  • the at least one audio signal input device may be configured to receive a speech of a speaker.
  • the at least one storage medium may include a set of instructions for speech recognition.
  • the at least one processor may be in communication with the at least one storage medium. When the at least one processor executes the set of instructions, the at least one processor may be directed to perform one or more of the following operations.
  • the at least one processor may obtain target audio signals including the speech of the speaker from the audio signal input device, and determine one or more acoustic features of the target audio signals.
  • the at least one processor may also obtain at least one accent vector of the speaker, and input the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form.
  • the at least one processor may further generate an interface through an output device to present the speech in the target content form.
  • the at least one processor may be directed to obtain a local accent of a region from which the speech is originated, and input the one or more acoustic features of the target audio signals, the at least one accent vector of the speaker, and the local accent into the trained speech recognition neural network model.
  • the accent vector includes a plurality of elements. Each element may correspond to a regional accent and include a likelihood value related to the regional accent.
  • the at least one accent vector of the speaker may be determined according to an accent determining process.
  • the accent determining process may include obtaining historical audio signals including one or more historical speeches of the speaker, and determining one or more historical acoustic features of the corresponding historical audio signals for each of the one or more historical speeches.
  • the accent determining process may also include obtaining one or more regional accent models, and inputting the one or more corresponding historical acoustic features into each of the one or more regional accent models for each of the one or more historical speeches.
  • the accent determining process may further include determining the plurality of elements of the at least one accent vector of the speaker based on at least an output of the one or more regional accent models.
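  • A minimal, non-limiting sketch in Python of how such an accent determining process might combine regional accent model outputs into accent vector elements is given below; the model interface (a callable returning a score per feature array) and the softmax normalization are assumptions made for illustration only.

```python
import numpy as np

def accent_vector_from_models(historical_features, accent_models):
    """Score a speaker's historical speech features against each regional
    accent model and normalize the scores into likelihood-like elements.

    historical_features: list of per-speech feature arrays.
    accent_models: dict mapping region name -> callable returning a score
                   (e.g., a log-likelihood) for one feature array.
    """
    regions = list(accent_models)
    # Average each regional model's score over all historical speeches.
    scores = np.array([
        np.mean([accent_models[region](feats) for feats in historical_features])
        for region in regions
    ])
    # Softmax turns the raw scores into elements that sum to 1.
    exp = np.exp(scores - scores.max())
    likelihoods = exp / exp.sum()
    return dict(zip(regions, likelihoods))
```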
  • the trained speech recognition neural network model may be generated by at least one computing device according to a training process.
  • the training process may include obtaining sample audio signals including a plurality of sample speeches of a plurality of sample speakers, and determining one or more sample acoustic features of the corresponding sample audio signals for each of the plurality of sample speeches.
  • the training process may also include obtaining at least one sample accent vector of the corresponding sample speaker for each of the plurality of sample speeches, and obtaining a preliminary neural network model.
  • the training process may further include determining the trained speech recognition neural network model by inputting the one or more sample acoustic features and the at least one sample accent vector of the sample speaker corresponding to each of the plurality of sample speeches into the preliminary neural network model.
  • the determining of the trained neural network model for speech recognition may include obtaining a sample local accent of a region from which the sample audio signal is originated for each of the plurality of sample speeches.
  • the determining of the trained neural network model for speech recognition may also include determining the trained speech recognition neural network model by inputting the one or more sample acoustic features, the at least one sample accent vector of the sample speaker, and the sample local accent corresponding to each of the plurality of sample speeches into the preliminary neural network model.
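  • As a hedged illustration of such a training process, the sketch below trains a toy network on concatenated sample acoustic features, sample accent vectors, and sample local accents; the framework (PyTorch), layer sizes, architecture, and loss are illustrative assumptions and are not taken from the present disclosure.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only.
N_ACOUSTIC, N_ACCENT, N_REGIONS, N_PHONEMES = 108, 8, 8, 60

class SpeechRecognitionNet(nn.Module):
    """Toy stand-in for the preliminary neural network model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_ACOUSTIC + N_ACCENT + N_REGIONS, 256),
            nn.ReLU(),
            nn.Linear(256, N_PHONEMES),
        )

    def forward(self, acoustic, accent_vec, local_accent):
        # Concatenate per-frame acoustic features with the speaker-level
        # accent vector and the one-hot local accent of the region.
        x = torch.cat([acoustic, accent_vec, local_accent], dim=-1)
        return self.net(x)  # unnormalized phoneme scores

model = SpeechRecognitionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(acoustic, accent_vec, local_accent, phoneme_labels):
    """One training step on a batch of sample frames and phoneme labels."""
    optimizer.zero_grad()
    logits = model(acoustic, accent_vec, local_accent)
    loss = loss_fn(logits, phoneme_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```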
  • the target content form may include at least one of phoneme, syllable, or character.
  • the at least one processor may be directed to input the one or more acoustic features and the at least one accent vector of the speaker into the trained speech recognition neural network model, and translate the speech into the target content form based on at least an output of the trained neural network model.
  • a method may be implemented on a computing device having at least one processor, at least one storage.
  • the method may include obtaining target audio signals including a speech of a speaker from an audio signal input device, and determining one or more acoustic features of the target audio signals.
  • the method may also include obtaining at least one accent vector of the speaker, and inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form.
  • the method may further include generating an interface through an output device to present the speech in the target content form.
  • a non-transitory computer readable medium may comprise executable instructions that, when executed by at least one processor, cause the at least one processor to effectuate a method.
  • the method may include obtaining target audio signals including a speech of a speaker from an audio signal input device, and determining one or more acoustic features of the target audio signals.
  • the method may also include obtaining at least one accent vector of the speaker, and inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form.
  • the method may further include generating an interface through an output device to present the speech in the target content form.
  • FIG. 1 is a schematic diagram illustrating an exemplary speech recognition system according to some embodiments of the present disclosure
  • FIG. 2 is a block diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure
  • FIG. 3 is a block diagram illustrating an exemplary hardware and/or software components of an exemplary mobile device according to some embodiments of the present disclosure
  • FIG. 4A is a schematic diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure.
  • FIG. 4B is a schematic diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure.
  • FIG. 5 is a flowchart illustrating an exemplary process for speech recognition according to some embodiments of the present disclosure
  • FIG. 6 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on one or more regional accent models according to some embodiments of the present disclosure
  • FIG. 7 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on an accent classification model according to some embodiments of the present disclosure
  • FIG. 8 is a flowchart illustrating an exemplary process for generating a trained speech recognition neural network model according to some embodiments of the present disclosure.
  • FIG. 9 is a schematic diagram illustrating an exemplary trained speech recognition neural network model according to some embodiments of the present disclosure.
  • the flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
  • the positioning technology used in the present disclosure may be based on a global positioning system (GPS) , a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a Galileo positioning system, a quasi-zenith satellite system (QZSS) , a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof.
  • An aspect of the present disclosure relates to systems and methods for speech recognition.
  • the accent of the speaker may affect the recognition accuracy and needs to be taken into consideration.
  • the systems may determine one or more acoustic feature (s) of audio signals of the speech.
  • the systems may also obtain an accent vector of the speaker.
  • the accent vector may include a plurality of elements, each of which corresponds to a regional accent and indicates a similarity between the speaker’s accent and that regional accent.
  • the acoustic feature (s) together with the accent vector may be inputted into a trained speech recognition neural network model to translate the speech into a target content form, such as phonemes, words, and/or voices.
  • because the systems and methods take into account both the acoustic feature (s) of the speech itself and the accent feature of the speaker, the systems and methods improve the accuracy of the recognition result. Further, with a higher recognition accuracy, the systems and methods may also use artificial intelligence to translate the recognized speech into other forms of expression, such as translating and displaying the original speech as textual contents, sound tracks of another accent, and/or sound tracks of another language, etc.
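  • A minimal sketch of this recognition flow follows, assuming caller-supplied feature extraction, model, and decoding callables; the function names are hypothetical and only mirror the steps described above.

```python
def recognize(audio_signal, speaker_accent_vector, local_accent,
              extract_features, model, decode_to_target_form):
    """End-to-end sketch: acoustic features plus accent information -> target form."""
    features = extract_features(audio_signal)            # per-frame acoustic features
    outputs = model(features, speaker_accent_vector,      # trained speech recognition
                    local_accent)                          # neural network model
    return decode_to_target_form(outputs)                 # e.g., phonemes -> words
```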
  • FIG. 1 is a schematic diagram illustrating an exemplary artificial intelligence speech recognition system 100 according to some embodiments of the present disclosure.
  • the artificial intelligence speech recognition system 100 may include a server 110, a network 120, an input device 130, an output device 140, and a storage device 150.
  • a speaker 160 may speak a speech 170 into the input device 130, which may generate audio signals including or encoding the speech 170.
  • the input device 130 may provide the audio signals and optionally information related to the speaker 160 and/or the input device 130 to the server 110 via the network 120.
  • the information related to the speaker 160 and/or the input device 130 may include, for example, user profile of the speaker 160, position information of the speaker 160 and/or the input device 130, or the like, or any combination thereof.
  • the server 110 may process the audio signals and the optional information related to the speaker 160 and/or the input device 130 to translate the speech 170 into a target content form, such as phoneme (s) , word (s) and/or voice (s) .
  • the translation of the speech 170 may further be transmitted to the output device 140 for presentation via the network 120.
  • the server 110 may be a single server or a server group.
  • the server group may be centralized, or distributed (e.g., server 110 may be a distributed system) .
  • the server 110 may be local or remote.
  • the server 110 may access information and/or data stored in the input device 130, the output device 140, and/or the storage device 150 via the network 120.
  • the server 110 may be directly connected to the input device 130, output device 140, and/or the storage device 150 to access stored information and/or data.
  • the server 110 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the server 110 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
  • the server 110 may include a processing engine 112.
  • the processing engine 112 may process information and/or data to perform one or more functions described in the present disclosure. For example, the processing engine 112 may translate the speech 170 into a target content form based on a trained speech recognition neural network model, acoustic feature (s) of the speech 170, and/or the accent vector of the speaker 160. As another example, the processing engine 112 may generate a trained speech recognition neural network model using a set of training samples. In some embodiments, the processing engine 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) .
  • the processing engine 112 may include one or more hardware processors, such as a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
  • the server 110 may be integrated into the input device 130 and/or the output device 140.
  • the processing engine 112 may be integrated into the input device 130.
  • the input device 130 may be a smart recorder with artificial intelligence, which may include a microprocessor and a memory (e.g., a hard disk) therein to directly translate a recorded speech into textual content and save the textual content in the memory.
  • the network 120 in FIG. 1 between the processing engine 112 and the input device 130 may become unnecessary because all communications therebetween become local.
  • the present disclosure takes the server 110 and the input device 130 as separate devices as an example of the speech recognition system 100.
  • the network 120 may facilitate the exchange of information and/or data.
  • one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, the storage device 150) may exchange information and/or data with other component (s) in the speech recognition system 100 via the network 120.
  • the server 110 may obtain/acquire a request to translate a speech 170 from the input device 130 via the network 120.
  • the network 120 may be any type of wired or wireless network, or a combination thereof.
  • the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan area network (MAN) , a public telephone switched network (PSTN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof.
  • the network 120 may include one or more network access points.
  • the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2...through which one or more components of the speech recognition system 100 may be connected to the network 120 to exchange data and/or information.
  • the input device 130 may be configured to receive a voice input from a user and generate audio signals of and/or encoding the voice input.
  • the input device 130 may receive a speech 170 from the speaker 160 as illustrated in FIG. 1.
  • the input device 130 may be a voice input device or be a device that includes an acoustic input component (e.g., a microphone) .
  • Exemplary input devices 130 may include a mobile device 130-1, a headset 130-2, a microphone 130-3, a music player, a recorder, an e-book reader, a navigation device, a tablet computer, a laptop computer, a built-in device in a motor vehicle, a recording pen, or the like, or any combination thereof.
  • the mobile device 130-1 may include a smart home device, a wearable device, a mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof.
  • the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof.
  • the wearable device may include a bracelet, footgear, glasses, a helmet, a watch, clothing, a backpack, a smart accessory, or the like, or any combination thereof.
  • the mobile device may include a mobile phone, a personal digital assistance (PDA) , a gaming device, a navigation device, a point of sale (POS) device, a laptop, a desktop, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof.
  • the virtual reality device and/or the augmented reality device may include a Google Glass TM , a RiftCon TM , a Fragments TM , a Gear VR TM , etc.
  • a built-in device in the motor vehicle may include an onboard computer, an onboard television, etc.
  • the input device 130 may be a device with positioning technology for locating the position of the user and/or the input device 130.
  • the input device 130 may activate a speech recording session to record audio signals that includes the speech 170 of the speaker 160.
  • the speech recording session may be initiated automatically once the input device 130 detects a sound or a sound that satisfies a condition.
  • the condition may be that the sound is a speech (with human language) , that the speech or sound (e.g., timbre of the sound) is from a particular speaker 160, and/or that the loudness of the speech or sound is greater than a threshold, etc.
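  • For example, the loudness condition might be checked as in the minimal sketch below; the RMS-based loudness estimate and the threshold value are assumptions for illustration only.

```python
import numpy as np

def should_start_recording(frame, threshold_db=-35.0):
    """Return True when the RMS loudness of an audio frame (float samples in
    [-1, 1]) exceeds a threshold, approximating the condition that the
    loudness of the speech or sound is greater than a threshold."""
    rms = np.sqrt(np.mean(np.square(frame)) + 1e-12)
    loudness_db = 20.0 * np.log10(rms)
    return loudness_db > threshold_db
```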
  • the speech recording session may be initiated by a specific action taken by the speaker 160.
  • the speaker 160 may press a button or an area on the interface of the input device 130 before speaking, speak an utterance, and then release the button or area when finishing speaking.
  • the speaker 160 may initiate the speech recording session by making a predetermined gesture or sound.
  • the input device 130 may further transmit the audio signals including the speech 170 to the server 110 (e.g., the processing engine 112) for speech recognition.
  • the speaker 160 may be a user of the input device 130.
  • the speaker 160 may be someone other than the user of the input device 130.
  • a user A of the input device 130 may use the input device 130 to input a speech of a user B.
  • “user” and “speaker” may be used interchangeably.
  • the “user” and “speaker” are collectively referred to as the “speaker” .
  • the desired form may be sent to another device for presentation.
  • the desired form may be saved in the storage device 150.
  • the desired form may also be presented through the output device 140.
  • the output device 140 may be configured to output and/or display information of the speech in the desired form.
  • the output device 140 may output and/or display the information in a visible way to human.
  • the information outputted and/or displayed by the output device 140 may be in the format of, for example, text, image, video content, audio content, graphics, etc.
  • the output device 140 may output and/or display machine readable information in an invisible way to human.
  • the output device 140 may store and/or cache information through a storage interface that is machine readable to computers.
  • the output device 140 may be an information output device or a device that includes an information output component.
  • Exemplary output devices 140 may include a mobile device 140-1, a display device 140-2, a loudspeaker 140-3, a built-in device in a motor vehicle 140-4, a headset, a microphone, a music player, an e-book reader, a navigation device, a tablet computer, a laptop computer, a recording pen, a printer, a projector, a storage device, or the like, or a combination thereof.
  • Exemplary display devices 140-2 may include a liquid crystal display (LCD) , a light-emitting diode (LED) -based display, a flat panel display, a curved screen, a television device, a cathode ray tube (CRT) , or the like, or a combination thereof.
  • the input device 130 and the output device 140 may be two separate devices connected by the network 120.
  • the input device 130 and the output device 140 may be integrated into one single device. Consequently, the device can both receive voice input from users and output information of the desired form translated from the voice input.
  • the integrated device may be a mobile phone.
  • the mobile phone may receive a speech 170 from a speaker 160.
  • the speech may be sent to the server 110 via the network 120, be translated into textual words of a desired language (e.g., Chinese or English words) by the server 110, and further be transmitted back to the mobile phone for display.
  • alternatively, a local microprocessor of the mobile phone may translate the speech 170 into the textual words of a desired language (e.g., Chinese or English words) , which may then be displayed locally by the mobile phone.
  • the network 120 in FIG. 1 may become unnecessary because all communications between the input device 130 and the output device 140 become local.
  • the present disclosure takes the input device 130 and the output device 140 as separate devices as an example of the speech recognition system 100.
  • the storage device 150 may store data and/or instructions. In some embodiments, the storage device 150 may store data obtained from the input device 130, the output device 140, and/or the server 110. In some embodiments, the storage device 150 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, storage device 150 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc.
  • Exemplary volatile read-and-write memory may include a random access memory (RAM) .
  • RAM may include a dynamic RAM (DRAM) , a double date rate synchronous dynamic RAM (DDR SDRAM) , a static RAM (SRAM) , a thyristor RAM (T-RAM) , and a zero-capacitor RAM (Z-RAM) , etc.
  • Exemplary ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically-erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc.
  • the storage device 150 may be implemented on a cloud platform.
  • the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
  • the storage device 150 may be connected to the network 120 to communicate with one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, etc. ) .
  • One or more components in the speech recognition system 100 may access the data or instructions stored in the storage device 150 via the network 120.
  • the storage device 150 may be directly connected to or communicate with one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, etc. ) .
  • the storage device 150 may be part of the server 110.
  • one or more components in the speech recognition system 100 may have permission to access the storage device 150.
  • one or more components in the speech recognition system 100 may read and/or modify information relating to the speaker 160, and/or the public when one or more conditions are met.
  • the server 110 may read and/or modify one or more speakers’ information after a speech recognition is completed.
  • an element of the speech recognition system 100 may perform through electrical signals and/or electromagnetic signals.
  • the input device 130 may operate logic circuits in its processor to process such a task.
  • a processor of the input device 130 may generate electrical signals encoding the request.
  • the processor of the input device 130 may then send the electrical signals to an output port. If the input device 130 communicates with the server 110 via a wired network, the output port may be physically connected to a cable, which may further transmit the electrical signals to an input port of the server 110.
  • the output port of the input device 130 may be one or more antennas, which may convert the electrical signals to electromagnetic signals.
  • an output device 140 may process a task through operation of logic circuits in its processor, and receive an instruction and/or request from the server 110 via electrical signals or electromagnetic signals.
  • in an electronic device, such as the input device 130, the output device 140, and/or the server 110, when a processor thereof processes an instruction, sends out an instruction, and/or performs an action, the instruction and/or action is conducted via electrical signals.
  • the processor when the processor retrieves or saves data from a storage medium (e.g., the storage device 150) , it may send out electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium.
  • the structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device.
  • an electrical signal may refer to one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
  • FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device 200 on which the server 110, the input device 130, and/or the output device 140 may be implemented according to some embodiments of the present disclosure.
  • the processing engine 112 may be implemented on the computing device 200 and configured to perform functions of the processing engine 112 disclosed in this disclosure.
  • the computing device 200 may be used to implement any component of the speech recognition system 100 as described herein.
  • the processing engine 112 may be implemented on the computing device, via its hardware, software program, firmware, or a combination thereof.
  • the computer functions relating to the on-demand service as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.
  • the computing device 200 may include COM ports 250 connected to and/or from a network connected thereto to facilitate data communications.
  • the computing device 200 may also include a processor (e.g., a processor 220) , in the form of one or more processors (e.g., logic circuits) , for executing program instructions.
  • the processor may include interface circuits and processing circuits therein.
  • the interface circuits may be configured to receive electronic signals from a bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process.
  • the processing circuits may conduct logic calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 210.
  • the exemplary computer platform may include an internal communication bus 210, program storage and data storage of different forms, for example, a disk 270, and a read only memory (ROM) 230, or a random access memory (RAM) 240, for various data files to be processed and/or transmitted by the computer.
  • the exemplary computer platform may also include program instructions stored in the ROM 230, RAM 240, and/or other type of non-transitory storage medium to be executed by the processor 220.
  • the method and/or process of the present disclosure may be implemented as the program instructions.
  • the computing device 200 also includes an I/O component 260, supporting input/output between the computer and other components.
  • the computing device 200 may also receive programming and data via network communications.
  • the computing device 200 in the present disclosure may also include multiple CPUs and/or processors, thus operations and/or method steps that are performed by one CPU and/or processor as described in the present disclosure may also be jointly or separately performed by the multiple CPUs and/or processors.
  • for example, if in the present disclosure the CPU and/or processor of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B) .
  • FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device 300 on which an input device 130 and/or an output device 140 may be implemented according to some embodiments of the present disclosure.
  • the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a storage 390, a voice input 305, and a voice output 315.
  • any other suitable component including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
  • a mobile operating system 370 (e.g., iOS TM , Android TM , Windows Phone TM , etc. ) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340.
  • the applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to image processing or other information from the processing engine 112. User interactions with the information stream may be achieved via the I/O 350 and provided to the processing engine 112 and/or other components of the on-demand service system 100 via the network 120.
  • the voice input 305 may include an acoustic input component (e.g., a microphone) .
  • the voice output 315 may include an acoustic generator (e.g. a speaker) that generates sound.
  • computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein.
  • a computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device.
  • a computer may also act as a server if appropriately programmed.
  • FIGs. 4A and 4B are block diagrams illustrating exemplary processing engines 112A and 112B according to some embodiments of the present disclosure.
  • the processing engine 112A may be configured to translate a speech based on a trained speech recognition neural network model.
  • the processing engine 112B may be configured to train a preliminary neural network model to generate a trained speech recognition neural network model.
  • the processing engines 112A and 112B may respectively be implemented on a computing device 200 (e.g., the processor 220) illustrated in FIG. 2 or a CPU 340 as illustrated in FIG. 3.
  • the processing engine 112A may be implemented on a CPU 340 of a mobile device and the processing engine 112B may be implemented on a computing device 200.
  • the processing engines 112A and 112B may be implemented on a same computing device 200 or a same CPU 340.
  • the processing engine 112A may include an acquisition module 411, a determination module 412, a translation module 413, and a generation module 414.
  • the acquisition module 411 may be configured to obtain information related to the speech recognition system 100.
  • the acquisition module 411 may obtain target audio signals including a speech 170 of a speaker 160, historical audio signals including one or more historical speeches of the speaker 160, an accent vector of the speaker 160, one or more regional accent model (s) , a trained speech recognition neural network model, or the like, or any combination thereof.
  • the acquisition module 411 may obtain information related to the speech recognition system 100 from an external data source via the network 120, and/or from one or more components of the speech recognition system 100, such as a storage device, the server 110 (e.g., the processing engine 112B) .
  • the determination module 412 may determine one or more acoustic features of the target audio signals, such as a pitch, a speech rate, a Linear Prediction Coefficient (LPC) , a Mel-scale Frequency Cepstral Coefficient (MFCC) , a Linear Predictive Cepstral Coefficient (LPCC) of the target audio signals.
  • the determination module 412 may determine an accent vector of the speaker 160.
  • the accent vector may describe the accent of the speaker 160.
  • the accent vector may include one or more elements, each of which may correspond to a regional accent and include a likelihood value related to the regional accent. Details regarding the determination of the accent vector may be found elsewhere in the present disclosure (e.g., operation 530, FIGs. 6 and 7, and the relevant descriptions thereof) .
  • the translation module 413 may translate the speech 170 of the speaker 160.
  • the translation module 413 may input the one or more acoustic features of target audio signals that includes the speech 170, the accent vector of the speaker 160, and/or the local accent corresponding to the speech 170 into a trained speech recognition neural network model to translate the speech 170 into a target content form.
  • the target content form may be phoneme, syllable, character, or the like, or any combination thereof. Details regarding the translation of the speech 170 may be found elsewhere in the present disclosure (e.g., operation 550 and the relevant descriptions thereof) .
  • the generation module 414 may generate an interface through an output device 140 to present the speech 170 in the target content form.
  • the interface generated through the output device 140 may be configured to present the speech 170 in the target content form.
  • the generation module 414 may generate different interfaces through the output device 140. For example, to present the speech 170 in the form of voices, the generation module 414 may generate a play interface through the output device 140. To present the speech 170 in the form of words, the generation module 414 may generate a display interface through the output device 140.
  • the processing engine 112B may include an acquisition module 421, a determination module 422, and a training module 423.
  • the acquisition module 421 may obtain information used to generate a trained speech recognition neural network model.
  • the acquisition module 421 may obtain sample audio signals including a plurality of sample speeches of a plurality of sample speakers, a sample accent vector of each sample speaker, a sample local accent corresponding to each sample speech, a preliminary neural network model for training, or the like, or any combination thereof.
  • the acquisition module 421 may obtain information used to generate the trained speech recognition neural network model from an external data source via the network 120, and/or from one or more components of the speech recognition system 100, such as a storage device, the server 110.
  • the determination module 422 may determine one or more sample acoustic features of a sample audio signal that includes a sample speech.
  • the sample acoustic feature (s) of the corresponding sample audio signals may include a pitch, a speech rate, a LPC, a MFCC, a LPCC, or the like, or any combination thereof. Details regarding the determination of the sample acoustic feature (s) may be found elsewhere in the present disclosure (e.g., operation 820 and the relevant descriptions thereof) .
  • the training module 423 may train the preliminary neural network model to generate the trained speech recognition neural network model.
  • the training module 423 may train the preliminary neural network model using input data, for example, information related to a plurality of sample audio signals that include a plurality of sample speeches of a plurality of sample speakers. Details regarding the generation of the trained speech recognition neural network model may be found elsewhere in the present disclosure (e.g., operation 860 and the relevant descriptions thereof) .
  • the modules in the processing engines 112A and 112B may be connected to or communicate with each other via a wired connection or a wireless connection.
  • the wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof.
  • the wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof.
  • processing engines 112A and 112B may be integrated into a processing engine 112 configured to generate a trained speech recognition neural network model and apply the trained speech recognition neural network model in speech recognition.
  • a processing engine 112 (the processing engine 112A and/or the processing engine 112B) may include one or more additional modules.
  • the processing engine 112A may include a storage module (not shown) configured to store data.
  • FIG. 5 is a flowchart illustrating an exemplary process for speech recognition according to some embodiments of the present disclosure.
  • the process 500 may be executed by the speech recognition system 100.
  • the process 500 may be implemented as a set of instructions (e.g., an application) stored in storage device 150.
  • the processing engine 112 (e.g., the processing engine 112A) may execute the set of instructions and accordingly be directed to perform the process 500.
  • the processing engine 112A may obtain target audio signals including a speech 170 of a speaker 160.
  • the target audio signals may be a representation of the speech 170, and record characteristic information (e.g., the frequency) of the speech 170.
  • the target audio signals may be acquired by one or more components of the speech recognition system 100.
  • the target audio signals may be acquired from the input device 130 (e.g., the smart phone 130-1, a headset 130-2, a microphone 130-3, a navigation device) .
  • the target audio signals may be inputted by the speaker 160 via the input device 130.
  • the target audio signals may be acquired from a storage device in the speech recognition system 100, such as the storage device 150 and/or the storage 390.
  • the target audio signals may be acquired from an external data source, such as a speech library, via the network 120.
  • the speech 170 may include a request for an on-demand service, such as a taxi-hailing service, a chauffeur service, an express car service, a carpool service, a bus service, a driver hire service, and a shuttle service. Additionally or alternatively, the speech 170 may include information related to the request for the on-demand service. Merely by way of example, the speech 170 may include a start location and/or a destination related to request for the taxi-hailing service. In some embodiments, the speech 170 may include a command to direct the input device 130 and/or the output device 140 to perform a certain action. Merely by way of example, the speech 170 may include a command to direct the input device 130 to call someone.
  • the processing engine 112A may determine one or more acoustic features of the target audio signals.
  • An acoustic feature may refer to an acoustic property of the speech 170 that may be recorded and analyzed.
  • Exemplary acoustic features may include a pitch, a speech rate, a Linear Prediction Coefficient (LPC) , a Mel-scale Frequency Cepstral Coefficient (MFCC) , a Linear Predictive Cepstral Coefficient (LPCC) , or the like, or any combination thereof.
  • the acoustic feature (s) may be represented or recorded in the form of a feature vector.
  • the feature vector may be expressed as a vector with one column or one row.
  • the feature vector may be a row vector expressed as a 1×N determinant (e.g., a 1×108 determinant) .
  • the feature vector may correspond to an N-dimensional coordinate system.
  • the N-dimensional coordinate system may be associated with N acoustic features.
  • the determination module 412 may process one or more feature vectors at once. For example, m feature vectors (e.g., three row vectors) may be integrated into a 1×mN vector or an m×N matrix, where m is an integer.
  • the determined feature vector may indicate one or more acoustic features of the target audio signals during a time window, such as 10 milliseconds (ms) , 25 ms, or 50 ms.
  • the determination module 412 may segment the target audio signals into a plurality of time windows. Different time windows may have the same duration or different durations. Two consecutive time windows may or may not be overlapped with each other.
  • the determination module 412 may then perform a Fast Fourier Transform (FFT) on the audio signal (s) in each time window. From the FFT data for a time window, the determination module 412 may extract acoustic feature (s) that are represented as a feature vector for the time window.
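  • The sketch below illustrates this windowing and FFT step with a simplified log-magnitude spectrum standing in for MFCC/LPCC features, together with the integration of m consecutive feature vectors mentioned above; the window length, hop, and feature dimension are illustrative assumptions.

```python
import numpy as np

def frame_features(signal, sample_rate=16000, win_ms=25, hop_ms=10, n_keep=108):
    """Segment a 1-D signal into overlapping time windows, apply an FFT to
    each window, and keep the first n_keep log-magnitude bins as a simple
    per-window feature vector (a simplified stand-in for MFCC/LPCC)."""
    signal = np.asarray(signal, dtype=float)
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if len(signal) < win:
        return np.empty((0, n_keep))
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        windowed = signal[start:start + win] * np.hamming(win)
        spectrum = np.abs(np.fft.rfft(windowed))
        frames.append(np.log(spectrum[:n_keep] + 1e-8))
    return np.array(frames)                      # shape: (num_windows, n_keep)

def stack_frames(feats, m=3):
    """Integrate m consecutive feature vectors into one 1 x (m*N) row vector."""
    return np.array([feats[i:i + m].reshape(-1)
                     for i in range(len(feats) - m + 1)])
```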
  • the processing engine 112A (e.g., the acquisition module 411) may obtain an accent vector of the speaker 160.
  • the accent vector may refer to a vector that describes the accent of the speaker 160.
  • the accent of the speaker 160 is a manner of pronunciation peculiar to the speaker 160.
  • an accented pronunciation may result in a misidentification and failed recognition of words.
  • the accuracy of speech recognition may be improved if the accent of the speaker 160 is taken into consideration.
  • an accent of a person is related to one or more locations where the person has lived.
  • for example, for a person who has lived in Shandong, Shanghai, and Beijing, his/her accent may be a hybrid accent, e.g., a mixture and/or combination of the Shandong accent, the Shanghai accent, and the Beijing accent.
  • the accent vector may include one or more elements, each of which may correspond to a regional accent and include a likelihood value related to the regional accent.
  • a regional accent may represent a manner of pronunciation peculiar to one or more specific regions.
  • a regional accent is in fact equivalent to the geographic region where the accent comes from, i.e., a regional accent may also be referred to as the one or more geographic regions to which the accent belongs.
  • the Shandong accent may also be referred to as Shandong because the Shandong accent is mostly spoken by people who grew up in or have lived in Shandong.
  • the China’s Northeast accent may also be referred to as Liaoning province, Jilin province, Heilongjiang province, and/or other places to which the China’s Northeast accent belongs.
  • a relatively large region may be divided into a plurality of sub-regions according to respective regional accents.
  • China may be divided into a plurality of regions each of which has a distinctive regional accent.
  • Liaoning province, Jilin province, and Heilongjiang province may be classified into one region since the accents of the three provinces are similar.
  • the corresponding relationship between regions and regional accents may be stored in a storage device of the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390.
  • a likelihood value related to a regional accent may represent a likelihood for the accent of the speaker 160 being the regional accent.
  • the likelihood value related to the regional accent may measure a similarity or difference between the accent of the speaker 160 and the regional accent.
  • a speaker 160 speaks pure Beijing dialect if the likelihood value related to Beijing accent is or approaches 100%.
  • a speaker 160 speaks Mandarin mixed with a slight Taiwan accent if the likelihood values related to Mandarin and the Taiwan accent are 90% and 10%, respectively.
  • the accent vector of the speaker 160 may be a one-hot vector or an embedding vector.
  • in the one-hot vector, the likelihood value (s) related to the regional accent (s) that the speaker 160 has may be denoted as 1, and the likelihood value (s) of the regional accent (s) that the speaker 160 does not have may be denoted as 0.
  • a regional accent that the speaker 160 has may refer to a regional accent whose corresponding likelihood value is equal to or greater than a certain value, e.g., 0%, 10%, etc.
  • in the embedding vector, the value (s) of the regional accent (s) that the speaker 160 has may be denoted as real number (s) , and the likelihood value (s) of the regional accent (s) that the speaker 160 does not have may be denoted as 0.
  • the accent vector may be expressed as Equation (1) below:
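  • As a minimal illustration of such an accent vector (the regions and likelihood values below are made up for the example and are not taken from Equation (1) ):

```python
# Illustrative accent vector for a speaker with a hybrid accent; each element
# corresponds to a regional accent and holds a likelihood value.
accent_vector = {
    "Shandong": 0.6,
    "Shanghai": 0.1,
    "Beijing":  0.3,
    "Taiwan":   0.0,
}
# As a plain numeric vector X = [x_Shandong, x_Shanghai, x_Beijing, x_Taiwan]:
X = list(accent_vector.values())   # [0.6, 0.1, 0.3, 0.0]
```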
  • the acquisition module 411 may acquire the accent vector of the speaker 160 from an external source and/or one or more components of the speech recognition system 100.
  • the accent profile (e.g., accent vector X) of a speaker may be predetermined by the speech recognition system 100.
  • the accent vector of the speaker 160 may be recorded and/or included in his/her user profile and stored in a storage device of the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390.
  • the acquisition module 411 may access the storage device and retrieve the accent vector of the speaker 160.
  • the accent vector of the speaker 160 may be inputted by the speaker 160 via the input device 130. Additionally or alternatively, the accent vector may be determined by the processing engine 112A (e.g., the determination module 412) and transmitted to the storage device for storage.
  • the determination module 412 may determine one or more regional accents and the corresponding likelihood values based on the user profile of the speaker 160.
  • the user profile of the speaker 160 may include the hometown, telephone number, education experience, work experience, log information (e.g., web log, software log) , historical service orders, or the like, or any combination thereof. Since the determination of the accent vector based on the user profile does not depend on his/her actual accent, the value (s) (or likelihood value (s) ) of the vector element (s) may be binary value (s) .
  • the determination module 412 may determine one or more regional accents that the speaker 160 has based on the regions where the speaker 160 was born, studied, worked, or lived long-term (e.g., with a length of residence longer than a threshold) , or the like, or any combination thereof. Accordingly, the likelihood value of the regional accent related to the geographic region where the speaker 160 was born may be designated as 1, and that of a regional accent related to a geographic region other than where the speaker 160 was born may be designated as 0.
  • similarly, the likelihood value of a regional accent related to a geographic region where the speaker has lived longer than the threshold may be designated as 1, and that of a regional accent related to a geographic region where the speaker has lived shorter than the threshold may be designated as 0.
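  • A hedged sketch of building such a binary accent vector from user-profile information is given below; the profile fields, the residence threshold, and the region names are assumptions for illustration only.

```python
def profile_accent_vector(profile_regions, all_regions,
                          residence_years=None, min_years=1.0):
    """Build a binary accent vector: 1 for regional accents tied to regions
    where the speaker was born, studied, or worked, or has lived longer than
    a threshold; 0 otherwise."""
    residence_years = residence_years or {}
    vector = []
    for region in all_regions:
        lived_long_enough = residence_years.get(region, 0.0) >= min_years
        vector.append(1 if (region in profile_regions or lived_long_enough) else 0)
    return vector

# Example: speaker born in Shandong, worked in Beijing, briefly stayed in Shanghai.
print(profile_accent_vector({"Shandong", "Beijing"},
                            ["Shandong", "Shanghai", "Beijing"],
                            residence_years={"Shanghai": 0.2}))   # [1, 0, 1]
```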
  • the user profile of the speaker 160 may include historical audio signals including one or more historical speeches of the speaker 160.
  • the determination module 412 may determine the accent vector of the speaker 160 by analyzing the historical audio signals. Details regarding the determination of the accent vector based on historical audio signals may be found elsewhere in the present disclosure (e.g., FIGs. 6 and 7 and the relevant descriptions thereof) .
  • the acquisition module 411 may obtain a local accent of a region from which the speech 170 is originated.
  • the way people speak may be affected by the language environment. For example, a speaker who has a hybrid accent of Shandong and Beijing may have a more obvious Shandong accent in Shandong and a more obvious Beijing accent in Beijing. As another example, a speaker who does not have a Taiwan accent may speak with a slight Taiwan accent when in Taiwan. Accordingly, the local accent of the region from which the speech 170 is originated may need to be taken into consideration in the speech recognition.
  • the input device 130 may be a device with positioning technology for locating the speaker 160 and/or the input device 130. Alternatively, the input device 130 may communicate with another positioning device to determine the position of the speaker 160 and/or the input device 130.
  • the input device 130 may transmit the position information of the speaker 160 to the server 110 (e.g., the processing engine 112A) .
  • the server 110 (e.g., the processing engine 112A) may determine the local accent of the region from which the speech 170 is originated based on the position information.
  • the determination module 412 may determine the regional accent of the region where the speech 170 is generated based on the corresponding relationship between regions and regional accents.
  • the determination module 412 may further designate the regional accent of the region where the speech 170 is generated as the local accent.
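  • A minimal sketch of this position-to-local-accent lookup follows, assuming a positioning or reverse-geocoding callable that returns the region and a stored region-to-accent mapping; both are illustrative assumptions.

```python
def local_accent_for_position(position, region_of, accent_of_region):
    """Map a reported position to its region and then to the regional accent
    designated as the local accent."""
    region = region_of(position)           # e.g., result of reverse geocoding
    return accent_of_region.get(region)

# Example with made-up data:
accent_of_region = {"Jilin": "Northeast accent", "Shandong": "Shandong accent"}
print(local_accent_for_position((43.8, 125.3), lambda p: "Jilin",
                                accent_of_region))   # "Northeast accent"
```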
  • the processing engine 112A may input the one or more acoustic features of the target audio signals, the accent vector of the speaker 160, and the local accent into a trained speech recognition neural network model to translate the speech 170 into a target content form.
  • the target content form may be phoneme, syllable, character, or the like, or any combination thereof.
  • the phoneme refers to perceptually distinct units of speech in a specified language that distinguish one word from another.
  • the syllable refers to a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word, and the translated speech in the form of syllable may be a voice or sound of words.
  • the phoneme and/or syllable may be that of the speech under a different accent.
  • the target content form of an audio speech under the Shandong accent may include phonemes of the same speech spoken in Mandarin or Cantonese.
  • the phoneme and/or syllable may also be of another language different from that of the original speech.
  • for example, the audio speech may be spoken in Chinese, while the target content of the audio speech may include phonemes of the same speech spoken in English.
  • the character refers to a unit of written language
  • the translated speech in the form of character may be one or more words.
  • the processing engine 112A may identify the phonemes and/or syllables in the audio speech and translate the phonemes and/or syllables into characters of written words in a specific language same or different from the language of the audio speech.
  • the target content form may be of a target language or accent.
  • the translated speech may be English words.
  • the trained speech recognition neural network model may be configured to generate a neural network output based on the input.
  • the trained speech recognition neural network model may output likelihoods that the input (e.g., the acoustic feature (s) , the accent vector, and the local accent) corresponds to specific phonemes.
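  • A minimal Python sketch of such a forward pass is shown below, using a toy network with random weights; the feature dimensions, accent inventory size, and phoneme inventory size are illustrative assumptions rather than parameters of the trained speech recognition neural network model.
```python
import numpy as np

# Illustrative sketch only: a toy forward pass showing how acoustic features, an accent
# vector, and a one-hot local accent could be concatenated into one network input that
# yields per-phoneme likelihoods. Dimensions and random weights are assumptions.
rng = np.random.default_rng(0)
N_MFCC, N_ACCENTS, N_PHONEMES, HIDDEN = 13, 4, 50, 64
W1 = rng.normal(size=(N_MFCC + 2 * N_ACCENTS, HIDDEN))
W2 = rng.normal(size=(HIDDEN, N_PHONEMES))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def phoneme_likelihoods(acoustic_frame, accent_vector, local_accent_onehot):
    x = np.concatenate([acoustic_frame, accent_vector, local_accent_onehot])
    h = np.tanh(x @ W1)
    return softmax(h @ W2)  # likelihoods over the assumed phoneme inventory


frame = rng.normal(size=N_MFCC)
likelihoods = phoneme_likelihoods(frame,
                                  np.array([1.0, 1.0, 0.0, 0.0]),   # accent vector
                                  np.array([0.0, 0.0, 1.0, 0.0]))   # local accent
print(likelihoods.argmax(), round(likelihoods.sum(), 3))  # most likely phoneme index, 1.0
```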
  • the neural network output may be the same content form as the target content form.
  • the neural network output may be in the form of phoneme that is the same as the target content form. There is no need to transform the neural network output into the target content form.
  • the neural network output may be in a different content form from the target content form.
  • the neural network output may be in the form of phoneme while the target content form is syllable or character.
  • the translation module 413 may further transform the neural network output into the target content form.
  • the translation module 413 may input the neural network output in the form of phoneme into a set of Weighted Finite-State Transducers (WFST) to generate a word lattice.
  • the translation module 413 may further derive a transcription of the speech 170 from the word lattice.
  • the transcription of the speech 170 may be a voice including spoken words or text including words.
  • the target content form may be of a target language or accent.
  • the transcription of the speech 170 may be further translated into the target language or accent.
  • the translation of the transcription may be performed by the translation module 413 based on a translation technique or an external translation platform (e.g., a translation website, a translation software) .
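  • The sketch below is a greatly simplified stand-in for this stage, not an actual WFST implementation: it collapses per-frame phoneme likelihoods by greedy argmax and maps phoneme pairs to words through an assumed toy lexicon; the phoneme inventory and lexicon are hypothetical.
```python
# Illustrative sketch only: a greatly simplified stand-in for the WFST stage. It
# collapses per-frame phoneme likelihoods by greedy argmax and maps phoneme pairs to
# words via a toy lexicon; the phoneme inventory and lexicon are hypothetical, and a
# real system would build a word lattice with weighted finite-state transducers.
PHONEMES = ["n", "i", "h", "ao"]                    # assumed phoneme inventory
LEXICON = {("n", "i"): "ni", ("h", "ao"): "hao"}    # assumed phoneme-pair lexicon


def greedy_transcribe(frame_likelihoods):
    """Collapse per-frame phoneme likelihoods into words via the assumed lexicon."""
    phones = [PHONEMES[max(range(len(PHONEMES)), key=lambda i: frame[i])]
              for frame in frame_likelihoods]
    # drop consecutive duplicates (a crude stand-in for frame-to-phoneme alignment)
    collapsed = [p for i, p in enumerate(phones) if i == 0 or p != phones[i - 1]]
    words, i = [], 0
    while i < len(collapsed):
        pair = tuple(collapsed[i:i + 2])
        words.append(LEXICON.get(pair, "".join(pair)))
        i += 2
    return " ".join(words)


frames = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.8, 0.05, 0.05],
          [0.1, 0.1, 0.7, 0.1], [0.05, 0.05, 0.1, 0.8]]
print(greedy_transcribe(frames))  # "ni hao"
```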
  • the trained speech recognition neural network model may be acquired by the translation module 413 from a storage device in the speech recognition system 100 (e.g., the storage device 150) and/or an external data source (not shown) via the network 120.
  • the processing engine 112B may generate the trained speech recognition neural network model, and store it in the storage device.
  • the translation module 413 may access the storage device and retrieve the trained speech recognition neural network model.
  • the processing engine 112B may train a trained speech recognition neural network model based on a machine learning method.
  • the machine learning method may include but is not limited to an artificial neural networks algorithm, a deep learning algorithm, a decision tree algorithm, an association rule algorithm, an inductive logic programming algorithm, a support vector machines algorithm, a clustering algorithm, a Bayesian networks algorithm, a reinforcement learning algorithm, a representation learning algorithm, a similarity and metric learning algorithm, a sparse dictionary learning algorithm, a genetic algorithm, a rule-based machine learning algorithm, or the like, or any combination thereof.
  • the processing engine 112B may train the trained speech recognition neural network model by performing process 800 illustrated in FIG. 8.
  • the generation module 414 may generate an interface through an output device 140 to present the speech 170 in the target content form.
  • the interface generated through the output device 140 may be configured to present the speech 170 in the target content form.
  • the generation module 414 may generate different interfaces through the output device 140.
  • the interface for presenting the speech 170 may be visible to humans.
  • the speech 170 in the target content form may be one or more voices and/or words as described in connection with operation 550.
  • the generation module 414 may generate a play interface through the output device 140.
  • the generation module 414 may generate a display interface through the output device 140.
  • the interface for presenting the speech 170 may be invisible to humans but machine-readable.
  • the speech 170 in the target content form may be phonemes.
  • the generation module 414 may generate a storage interface through the output device 140 to store and/or cache the speech 170 in the form of phonemes, which may not be read directly by human eyes but may be read, translated, and used by a smart device (e.g., computer or other device) .
  • the output device 140 may be, for example, a mobile device 140-1, a display device 140-2, a loudspeaker 130-3, a built-in device in a motor vehicle 140-4, a storage device, and/or any device that can output and/or display information.
  • the input device 130 and the output device 140 may be integrated into a device or two separate devices.
  • the speech 170 may be recorded by a mobile phone and then translated into the words by the processing engine 112A.
  • the generation module 414 may generate an interface for displaying the translated words through the mobile phone itself or another output device 140, such as another mobile phone, a laptop computer, etc.
  • one or more additional operations may be added or one or more operations of the process 500 may be omitted.
  • operation 540 may be omitted.
  • the translation module 413 may input the one or more acoustic features of the target audio signals and the accent vector of the speaker 160 into the trained speech recognition neural network model.
  • the order of the operations of the process 500 may be changed. For example, operation 530 and 540 may be performed simultaneously or in any order.
  • the accent of the speaker 160 may be expressed in a form other than an accent vector.
  • the accent may be expressed by a polynomial or a matrix.
  • the polynomial or the matrix may include one or more elements that are similar to those of the accent vector.
  • Each element may correspond to a regional accent and include a likelihood value related to the regional accent.
  • a plurality of accent vectors may be acquired by the acquisition module 411.
  • the plurality of accent vectors may correspond to different regional accents.
  • the plurality of accent vectors may be integrated into a single accent vector by the determination module 412 before being inputted into the trained speech recognition neural network model.
  • the determination module 412 may also obtain or determine one or more audio characteristics related to the target audio signals.
  • the audio characteristics may be independent of the words in the speech 170 spoken by the speaker 160.
  • the audio characteristics may indicate characteristics that correspond to one or more of background noise, recording channel properties, the speaker’s speaking style, the speaker’s gender, the speaker’s age, or the like, or any combination thereof.
  • the audio characteristic (s) may be inputted into the trained speech recognition neural network model together with the one or more acoustic features, the accent vector, and/or the local accent.
  • FIG. 6 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on one or more regional accent models according to some embodiments of the present disclosure.
  • the process 600 may be executed by the speech recognition system 100.
  • the process 600 may be implemented as a set of instructions (e.g., an application) stored in storage device 150.
  • the processing engine 112 (e.g., the processing engine 112A) may execute the set of instructions and may accordingly be directed to perform the process 600.
  • the process 600 may be an embodiment of operation 530 with reference to FIG. 5.
  • the processing engine 112A may obtain historical audio signals including one or more historical speeches of the speaker 160.
  • Each of the historical speeches may be encoded in one or more of the historical audio signals.
  • the historical audio signals may be acquired from one or more components of the speech recognition system 100 or an external data source.
  • the historical audio signals may be acquired from a storage device in the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390.
  • the historical speeches may be inputted by the speaker 160 via an input device 130.
  • a historical speech may include a historical request for an on-demand service or information related to the historical request.
  • the historical audio signals including a historical speech may be similar to the target audio signals including the speech 170 as described in connection with operation 510, and the descriptions thereof are not repeated.
  • the processing engine 112A may determine one or more historical acoustic features of the corresponding historical audio signals.
  • the historical acoustic feature (s) of the corresponding historical audio signals may include a pitch, a speech rate, a LPC, a MFCC, a LPCC, or the like, or any combination thereof.
  • the historical acoustic features corresponding to a historical speech may be represented or recorded in the form of a feature vector. Operation 620 may be performed in a similar manner with operation 520, and the descriptions thereof are not repeated here.
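  • As one hedged example of deriving such acoustic features, the Python sketch below computes MFCCs with the librosa library; the library choice, sampling rate, hypothetical file path, and the mean pooling into a fixed-length feature vector are assumptions of the sketch.
```python
import librosa  # assumed third-party dependency, not part of the disclosure

# Illustrative sketch only: derives MFCC-type acoustic features from an audio file.
# The sampling rate, the number of coefficients, and the mean pooling are assumptions.


def extract_mfcc(path, n_mfcc=13):
    signal, sample_rate = librosa.load(path, sr=16000)  # load and resample to 16 kHz
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # a simple fixed-length feature vector for the speech


# feature_vector = extract_mfcc("historical_speech.wav")  # hypothetical file path
```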
  • the processing engine 112A may obtain one or more regional accent models.
  • a regional accent model may be a particular accent model that corresponds to a language or a regional accent of speakers.
  • the regional accent models may include ones that correspond to different languages (e.g., English, Japanese, or Spanish) and/or ones that correspond to different regional accents (e.g., for Chinese, the regional accents may include Mandarin, Cantonese, etc.; for English, the regional accents may include American English, British English, Indian English, etc. ) .
  • a regional accent model that corresponds to a regional accent or a language may be configured to generate a model output based on acoustic feature (s) of a speech.
  • the model output may indicate a likelihood or probability that the speaker of the speech has the corresponding regional accent or speaks the language.
  • the model output may be further used to construct an accent vector of the speaker, which will be described in detail in connection with operation 650.
  • the acquisition module 411 may acquire the regional accent model (s) from an external data source or one or more components of the speech recognition system 100.
  • the acquisition module 411 may acquire the regional accent model (s) from, for example, a linguistic database, and/or a language library external to the speech recognition system 100.
  • the acquisition module 411 may acquire the regional accent model (s) from a storage device of the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390.
  • the regional accent model (s) may be trained by the server 110 (e.g., the processing engine 112B) and stored in the storage device.
  • a regional accent model that corresponds to a particular regional accent may be trained by the processing engine 112B using a training set.
  • the training set may include, for example, acoustic features of a plurality of sample speeches that belong to the regional accent.
  • the processing engine 112A may input the one or more corresponding historical acoustic features into each of the one or more regional accent models.
  • each regional accent model may generate a model output indicating a likelihood or probability that the speaker 160 has the corresponding regional accent.
  • for example, assuming that the speaker 160 has three historical speeches, historical acoustic features derived from the three historical speeches may be inputted into a Beijing accent model, respectively.
  • the outputs of the Beijing accent model may indicate an 80% likelihood, an 85% likelihood, and a 70% likelihood that the speaker 160 has the Beijing accent.
  • the determination module 412 may determine the plurality of elements of the accent vector of the speaker 160 based on at least the output (s) of the one or more regional accent models.
  • Each of the elements of the accent vector may correspond to a regional accent and include a likelihood value related to the regional accent as described in connection with FIG. 6.
  • the likelihood value related to a regional accent may be determined according to the output (s) of the regional accent model that corresponds to the regional accent.
  • the corresponding likelihood value may be determined based on the output (s) of the Beijing accent model with respect to the one or more historical speeches.
  • the determination module 412 may determine an overall likelihood that the speaker 160 has a Beijing accent, and further determine the likelihood value accordingly.
  • the overall likelihood that the speaker 160 has the Beijing accent may be, for example, a maximum, an average, or median value of the outputs of the Beijing accent model.
  • the likelihood value of the Beijing accent may be determined by normalizing the overall likelihood of Beijing accent together with the overall likelihoods of other regional accents.
  • the determination module 412 may apply a minimum threshold in the likelihood value determination.
  • if the overall likelihood of the Beijing accent is less than the minimum threshold, the determination module 412 may determine that the speaker 160 does not have the Beijing accent, and the corresponding likelihood value may be denoted as 0.
  • if the overall likelihood of the Beijing accent is greater than or equal to the minimum threshold, the determination module 412 may determine that the speaker 160 has the Beijing accent, and the corresponding likelihood value may be denoted as the (normalized) overall likelihood or 1.
  • the determination module 412 may rank the overall likelihoods of different regional accents. When the Beijing accent ranks in the top N (e.g., 1, 3, 5) of the ranking result, the determination module 412 may determine that the speaker 160 has the Beijing accent and the corresponding likelihood value may be denoted as the (normalized) overall likelihood or 1. Otherwise, the likelihood value related to Beijing accent may be denoted as 0.
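  • The Python sketch below illustrates one possible aggregation of regional accent model outputs into accent-vector elements (averaging, thresholding, and normalizing), reusing the likelihood values from the example above; the accent names and the 0.5 threshold are hypothetical.
```python
import numpy as np

# Illustrative sketch only: aggregates outputs of assumed regional accent models over
# several historical speeches into accent-vector elements by averaging, thresholding,
# and normalizing; the numbers and the 0.5 threshold are hypothetical.
model_outputs = {                       # likelihoods over three historical speeches
    "Beijing":   [0.80, 0.85, 0.70],
    "Shandong":  [0.20, 0.10, 0.15],
    "Cantonese": [0.05, 0.05, 0.10],
}
MIN_THRESHOLD = 0.5

overall = {a: float(np.mean(v)) for a, v in model_outputs.items()}  # overall likelihoods
total = sum(overall.values())
accent_vector = [overall[a] / total if overall[a] >= MIN_THRESHOLD else 0.0
                 for a in ("Beijing", "Shandong", "Cantonese")]
print([round(x, 2) for x in accent_vector])  # only the Beijing element is non-zero here
```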
  • FIG. 7 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on an accent classification model according to some embodiments of the present disclosure.
  • the process 700 may be executed by the speech recognition system 100.
  • the process 700 may be implemented as a set of instructions (e.g., an application) stored in storage device 150.
  • the processing engine 112 (e.g., the processing engine 112A) may execute the set of instructions and may accordingly be directed to perform the process 700.
  • the process 700 may be an embodiment of operation 530 with reference to FIG. 5.
  • the processing engine 112A may obtain historical audio signals including one or more historical speeches of the speaker.
  • the determination module 412 may determine one or more historical acoustic features of the corresponding historical audio signals. Operations 710 and 720 may be performed in a similar manner with operations 610 and 620 respectively, and the descriptions thereof are not repeated here.
  • the processing engine 112A may obtain an accent classification model.
  • the accent classification model may be configured to receive acoustic features of a speech and classify the accent of the speaker who speaks the speech into one or more accent classifications.
  • the accent classification may include, for example, one or more languages (e.g., English, Japanese, or Spanish) and/or one or more regional accents (e.g., Mandarin, Cantonese, Taiwan accent, American English, British English) .
  • the classification result may be represented by one or more regional accents that the speaker has. Additionally or alternatively, the classification result may include a probability or likelihood that the speaker has a particular regional accent. For example, the classification result may indicate that the accent of the speaker has a 70% likelihood of Mandarin and a 30% likelihood of Cantonese.
  • the acquisition module 411 may acquire the accent classification model from an external data source and/or one or more components of the speech recognition system 100, such as a storage device, the server 110.
  • the obtaining of the accent classification model may be similar to that of the regional accent model (s) as described in connection with operation 630, and the descriptions thereof are not repeated.
  • the accent classification model may be trained by the server 110 (e.g., the processing engine 112B) and stored in the storage device.
  • the accent classification model may be trained by the processing engine 112B using a set of training samples, each of which may be marked as belonging to a particular accent classification.
  • the processing engine 112A may input the one or more corresponding historical acoustic features into the accent classification model.
  • the accent classification model may output a classification result regarding the accent of the speaker 160.
  • the processing engine 112A may determine the plurality of elements of the accent vector of the speaker 160 based on at least the output (s) of the accent classification model.
  • Each of the elements of the accent vector may correspond to a regional accent and include a likelihood value related to the regional accent as described in connection with FIG. 6.
  • the likelihood value related to a regional accent may be determined according to the classification result (s) of the accent classification model.
  • the classification result corresponding to a historical speech may include one or more regional accents that the speaker 160 has.
  • the determination module 412 may determine an overall likelihood that the speaker 160 has a particular regional accent and designate it as the likelihood value related to the regional accent.
  • for example, assume that a classification result based on a historical speech shows that the speaker 160 has a Beijing accent and a Shandong accent, while a classification result based on another historical speech shows that the speaker 160 only has the Beijing accent. In this case, the determination module 412 may determine that the overall likelihoods that the speaker 160 has the Beijing accent and the Shandong accent are 2/3 and 1/3, respectively.
  • the classification result corresponding to the historical speech may include a likelihood that the speaker 160 has a particular regional accent.
  • the determination module 412 may determine an overall likelihood that the speaker 160 has a particular regional accent, and further determine the likelihood value accordingly. Details regarding the determination of a likelihood value related to a regional accent based on its overall likelihood may be found elsewhere in the present disclosure (e.g., operation 650 and the relevant descriptions thereof) .
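  • A minimal Python sketch of this aggregation is shown below; the classification results are the toy example above, and counting followed by normalization is only one of the possible ways to determine the overall likelihoods.
```python
from collections import Counter

# Illustrative sketch only: turns per-speech classification results of an assumed accent
# classification model into accent-vector elements by counting how often each regional
# accent appears across the historical speeches and normalizing the counts.
classification_results = [
    ["Beijing", "Shandong"],  # classification result for one historical speech
    ["Beijing"],              # classification result for another historical speech
]

counts = Counter(accent for result in classification_results for accent in result)
total = sum(counts.values())
accent_elements = {accent: counts[accent] / total for accent in counts}
print(accent_elements)  # {'Beijing': 0.666..., 'Shandong': 0.333...}, i.e., 2/3 and 1/3
```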
  • FIG. 8 is a flowchart illustrating an exemplary process for generating a trained speech recognition neural network model according to some embodiments of the present disclosure.
  • the process 800 may be executed by the speech recognition system 100.
  • the process 800 may be implemented as a set of instructions (e.g., an application) stored in storage device 150.
  • the processing engine 112 (e.g., the processing engine 112B) may execute the set of instructions and may accordingly be directed to perform the process 800.
  • the processing engine 112B may obtain sample audio signals including a plurality of sample speeches of a plurality of sample speakers. Each sample speech of a sample speaker may be encoded or included in one or more sample audio signals.
  • the sample audio signals may be acquired from one or more components of the speech recognition system 100, such as a storage device (the storage device 150, the ROM 230, the RAM 240, and/or the storage 390) , the input device 130, etc.
  • the storage device 150 may store historical audio signals including a plurality of historical speeches of users of the speech recognition system 100. The historical audio signals may be retrieved from the storage device 150 and designated as the sample audio signals by the acquisition module 421.
  • the sample audio signals may be acquired from an external data source, such as a speech library, via the network 120.
  • the sample audio signals including a sample speech may be similar to the target audio signals including the speech 170 as described in connection with operation 510, and the descriptions thereof are not repeated.
  • the processing engine 112B may determine one or more sample acoustic features of the corresponding sample audio signals.
  • the sample acoustic feature (s) of the corresponding sample audio signals may include a pitch, a speech rate, a LPC, a MFCC, a LPCC, or the like, or any combination thereof.
  • Operation 820 may be performed in a similar manner with operation 520, and the descriptions thereof are not repeated here.
  • the processing engine 112B may determine a sample accent vector of the corresponding sample speaker.
  • a sample accent vector of a sample speaker may be a vector that describes the accent of the sample speaker.
  • the acquisition of the sample accent vector may be similar to that of the accent vector as described in connection with operation 530, and the descriptions thereof are not repeated.
  • the processing engine 112B may acquire a sample local accent of a region from which the sample speech is originated. Operation 840 may be performed in a similar manner with operation 540, and the descriptions thereof are not repeated here.
  • the processing engine 112B may obtain a preliminary neural network model.
  • exemplary preliminary neural network models may include a convolutional neural network (CNN) model, an artificial neural network (ANN) model, a recurrent neural network (RNN) model, a deep trust network model, a perceptron neural network model, a stack self-coding network model, or any other suitable neural network model.
  • the preliminary neural network model may include one or more preliminary parameters.
  • the one or more preliminary parameter (s) may be adjusted during the training process of the preliminary neural network model.
  • the preliminary parameters may be default settings of the speech recognition system 100, or may be adjustable under different situations.
  • the preliminary neural network model may include a plurality of processing layers, e.g., an input layer, a hidden layer, an output layer, a convolutional layer, a pooling layer, an activation layer, or the like, or any combination thereof.
  • the processing engine 112B may determine a trained speech recognition neural network model by inputting the one or more sample acoustic features, the sample accent vector of the sample speaker, and the sample local accent corresponding to each of the sample speeches into the preliminary neural network model.
  • the sample acoustic feature (s) , the sample accent vector of the sample speaker, and the sample local accent corresponding to each of the sample speeches inputted into the preliminary neural network model may be referred to as input data.
  • the input data may be inputted into the preliminary neural network model to generate an actual output.
  • the training module 423 may compare the actual output with a desired or correct output to determine a loss function.
  • the desired or correct output may include, for example, desired or correct likelihoods that the input data corresponds to specific phonemes.
  • the desired or correct likelihoods may be determined based on correct translations (possibly in the format of words or voices) of the sample speeches.
  • the loss function may measure a difference between the actual output and the desired output.
  • the training module 423 may update the preliminary parameter (s) to minimize the loss function.
  • the minimization of the loss function may be iterative. The iteration may be terminated when the updated loss function is less than a predetermined threshold.
  • the predetermined threshold may be set manually or determined based on various factors, such as the accuracy of the trained speech recognition neural network model, etc.
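  • The Python sketch below shows a minimal training loop of this kind using PyTorch; the toy data, dimensions, loss threshold, and optimizer are assumptions, and the loop only illustrates updating preliminary parameters until the loss falls below a predetermined threshold.
```python
import torch
from torch import nn

# Illustrative sketch only: a minimal training loop showing how preliminary parameters
# could be updated until the loss falls below a predetermined threshold. The toy data,
# dimensions, threshold, and optimizer choice are assumptions for illustration.
N_INPUT, N_PHONEMES, LOSS_THRESHOLD = 21, 50, 0.1

model = nn.Sequential(nn.Linear(N_INPUT, 64), nn.Tanh(), nn.Linear(64, N_PHONEMES))
loss_fn = nn.CrossEntropyLoss()   # measures the actual output against desired phonemes
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# toy input data: [sample acoustic features | sample accent vector | sample local accent]
inputs = torch.randn(256, N_INPUT)
targets = torch.randint(0, N_PHONEMES, (256,))  # desired/correct phoneme labels

for step in range(10000):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()
    if loss.item() < LOSS_THRESHOLD:  # terminate once the loss is below the threshold
        break
```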
  • the training module 423 may train the preliminary neural network model using the one or more sample acoustic features and the sample accent vector of the sample speaker corresponding to each of the sample speeches.
  • FIG. 9 is a schematic diagram illustrating an exemplary trained speech recognition neural network model 910 according to some embodiments of the present disclosure.
  • the trained speech recognition neural network model 910 may include a plurality of processing layers, for example, an input layer 911, a number of hidden layers (e.g., 912A and 912B) , and an output layer 913.
  • the input layer 911 may receive input data related to the speech 170. For example, one or more acoustic feature (s) 930 derived from the speech 170, an accent vector of the speaker 160, as well as a local accent of a region in which the speech 170 is originated may be provided to the input layer 911 as input data.
  • the accent vector of the speaker 160 may be determined based on one or more accent models 921, such as regional accent models or accent classification models.
  • a regional accent model or accent classification model may be a trained neural network model for determining and/or classifying an accent of the speaker 160.
  • the output layer 913 of the trained speech recognition neural network model 910 may generate a model output.
  • the model output may include, for example, likelihoods that a combination of acoustic feature (s) 930, the accent vector 920, and the local accent represent a particular phoneme.
  • the processing engine 112A (e.g., the translation module 413) may further transform the model output into a transcription in a target content form, such as voices, words, etc.
  • the transcription may be transmitted to a user terminal for presentation.
  • the trained speech recognition neural network model 910 may include other types of processing layers, such as a convolutional layer, a pooling layer, an activation layer, or the like, or any combination thereof.
  • the trained speech recognition neural network model 910 may include any number of processing layers.
  • the input data may include the acoustic feature (s) 930 and the accent vector 920 without the local accent.
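  • For illustration only, the Python sketch below lays out one possible arrangement of the input layer, two hidden layers, and the output layer described above; all layer sizes and the activation functions are assumptions of the sketch.
```python
import torch
from torch import nn

# Illustrative sketch only: one possible layout mirroring the input layer 911, hidden
# layers 912A/912B, and output layer 913; all sizes and activations are assumptions.
class SpeechRecognitionNet(nn.Module):
    def __init__(self, n_acoustic=13, n_accents=4, n_phonemes=50, hidden=64):
        super().__init__()
        n_input = n_acoustic + 2 * n_accents   # features + accent vector + local accent
        self.hidden_a = nn.Linear(n_input, hidden)   # hidden layer 912A
        self.hidden_b = nn.Linear(hidden, hidden)    # hidden layer 912B
        self.output = nn.Linear(hidden, n_phonemes)  # output layer 913

    def forward(self, acoustic, accent_vector, local_accent):
        x = torch.cat([acoustic, accent_vector, local_accent], dim=-1)  # input layer 911
        x = torch.tanh(self.hidden_a(x))
        x = torch.tanh(self.hidden_b(x))
        return torch.softmax(self.output(x), dim=-1)  # per-phoneme likelihoods


net = SpeechRecognitionNet()
probs = net(torch.randn(1, 13),
            torch.tensor([[1.0, 1.0, 0.0, 0.0]]),    # accent vector 920
            torch.tensor([[0.0, 0.0, 1.0, 0.0]]))    # one-hot local accent
print(probs.shape)  # torch.Size([1, 50])
```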
  • aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or context including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely hardware, entirely software (including firmware, resident software, micro-code, etc. ) or combining software and hardware implementation that may all generally be referred to herein as a “unit, ” “module, ” or “system. ” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
  • a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof.
  • a computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
  • Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB. NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service (SaaS) .

Abstract

The present disclosure relates to systems and methods for speech recognition. The systems may perform the methods to obtain target audio signals including a speech of a speaker, and determine one or more acoustic features of the target audio signals. The systems may also perform the methods to obtain an accent vector of the speaker. The systems may further perform the methods to input the acoustic feature (s) of the target audio signals and the accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form, and generate an interface through an output device to present the speech in the target content form.

Description

SYSTEMS AND METHODS FOR SPEECH RECOGNITION
TECHNICAL FIELD
The present disclosure generally relates to systems and methods for speech recognition, and in particular, to systems and methods using artificial intelligence (AI) for recognizing accented speeches.
BACKGROUND
Automatic speech recognition is an important technology that enables the recognition and translation of spoken language into computer recognized text equivalents. In general, automatic speech recognition attempts to provide accurate recognition results for speeches in different languages and accents. However, it is challenging to recognize and translate an accented speech as the accented speech pronunciation of a language can result in a misidentification and failed recognition of words. Therefore, it is desirable to provide AI systems and methods that are capable of recognizing accent speeches and translating the accent speeches into a desired form, such as textual contents or audio speech with another predetermined accent or language.
SUMMARY
According to an aspect of the present disclosure, a system is provided. The system may include at least one audio signal input device, at least one storage medium, and at least one processor. The at least one audio signal input device may be configured to receive a speech of a speaker. The at least one storage medium may include a set of instructions for speech recognition. The at least one processor may be in communication with the at least one storage medium. When the at least one processor executes the set of instructions, the at least one processor may be directed to perform one or more of the following operations. The at least one processor may obtain target audio signals including the speech of the speaker from the audio signal input device, and determine one or more acoustic features of the target audio signals. The at least one processor may also obtain at least one accent vector of the speaker, and input the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form. The at least one processor may further generate an interface through an output device to present the speech in the target content form.
In some embodiments, to input the one or more acoustic features of the audio signal and the accent of the speaker into the trained neural network model for speech recognition, the at least one processor may be directed to obtain a local accent of a region from which the speech is originated, and input the one or more acoustic features of the target audio signals, the at least one accent vector of the speaker, and the local accent into the trained speech recognition neural network model.
In some embodiments, the accent vector includes a plurality of elements. Each element may correspond to a regional accent and include a likelihood value related to the regional accent.
In some embodiments, the at least one accent vector of the speaker may be determined according to an accent determining process. The accent determining process may include obtaining historical audio signals including one or more historical speeches of the speaker, and determining one or more historical acoustic features of the corresponding historical audio signals for each of the one or more historical speeches. The accent determining process may also include obtaining one or more regional accent models, and inputting the one or more corresponding historical acoustic features into each of the one or more regional accent models for each of the one or more historical speeches. The accent determining process may further include determining the plurality of elements of the at least one accent vector of the speaker based on at least an output of the one or  more regional accent models.
In some embodiments, the trained speech recognition neural network model may be generated by at least one computing device according to a training process. The training process may include obtaining sample audio signals including a plurality of sample speeches of a plurality of sample speakers, and determining one or more sample acoustic features of the corresponding sample audio signals for each of the plurality of sample speeches. The training process may also include obtaining at least one sample accent vector of the corresponding sample speaker for each of the plurality of sample speeches, and obtaining a preliminary neural network model. The training process may further include determining the trained speech recognition neural network model by inputting the one or more sample acoustic features and the at least one sample accent vector of the sample speaker corresponding to each of the plurality of sample speeches into the preliminary neural network model.
In some embodiments, the determining of the trained neural network model for speech recognition may include obtaining a sample local accent of a region from which the sample audio signal is originated for each of the plurality of sample speeches. The determining of the trained neural network model for speech recognition may also include determining the trained speech recognition neural network model by inputting the one or more sample acoustic features, the at least one sample accent vector of the sample speaker, and the sample local accent corresponding to each of the plurality of sample speeches into the preliminary neural network model.
In some embodiments, the target content form may include at least one of a phoneme, a syllable, or a character.
In some embodiments, to input the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into the trained speech recognition neural network model to translate the speech into the target  content form, the at least one processor may be directed to input the one or more acoustic features and the at least one accent vector of the speaker into the trained speech recognition neural network model, and translate the speech into the target content form based on at least an output of the trained neural network model.
According to another aspect of the present disclosure, a method is provided. The method may be implemented on a computing device having at least one processor and at least one storage. The method may include obtaining target audio signals including a speech of a speaker from an audio signal input device, and determining one or more acoustic features of the target audio signals. The method may also include obtaining at least one accent vector of the speaker, and inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form. The method may further include generating an interface through an output device to present the speech in the target content form.
According to yet another aspect of the present disclosure, a non-transitory computer readable medium is provided. The non-transitory computer readable medium may comprise executable instructions that, when executed by at least one processor, cause the at least one processor to effectuate a method. The method may include obtaining target audio signals including a speech of a speaker from an audio signal input device, and determining one or more acoustic features of the target audio signals. The method may also include obtaining at least one accent vector of the speaker, and inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form. The method may further include generating an interface through an output device to present the speech in the target content form.
Additional features will be set forth in part in the description which follows,  and in part will become apparent to those skilled in the art upon examination of the following and the accompanying drawings or may be learned by production or operation of the examples. The features of the present disclosure may be realized and attained by practice or use of various aspects of the methodologies, instrumentalities and combinations set forth in the detailed examples discussed below.
BRIEF DESCRIPTION OF THE DRAWINGS
The present disclosure is further described in terms of exemplary embodiments. These exemplary embodiments are described in detail with reference to the drawings. These embodiments are non-limiting exemplary embodiments, in which like reference numerals represent similar structures throughout the several views of the drawings, and wherein:
FIG. 1 is a schematic diagram illustrating an exemplary speech recognition system according to some embodiments of the present disclosure;
FIG. 2 is a block diagram illustrating exemplary hardware and/or software components of a computing device according to some embodiments of the present disclosure;
FIG. 3 is a block diagram illustrating an exemplary hardware and/or software components of an exemplary mobile device according to some embodiments of the present disclosure;
FIG. 4A is a schematic diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure;
FIG. 4B is a schematic diagram illustrating an exemplary processing engine according to some embodiments of the present disclosure;
FIG. 5 is a flowchart illustrating an exemplary process for speech recognition according to some embodiments of the present disclosure;
FIG. 6 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on one or more regional accent models according to some embodiments of the present disclosure;
FIG. 7 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on an accent classification model according to some embodiments of the present disclosure;
FIG. 8 is a flowchart illustrating an exemplary process for generating a trained speech recognition neural network model according to some embodiments of the present disclosure; and
FIG. 9 is a schematic diagram illustrating an exemplary trained speech recognition neural network model according to some embodiments of the present disclosure.
DETAILED DESCRIPTION
The following description is presented to enable any person skilled in the art to make and use the present disclosure, and is provided in the context of a particular application and its requirements. Various modifications to the disclosed embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be applied to other embodiments and applications without departing from the spirit and scope of the present disclosure. Thus, the present disclosure is not limited to the embodiments shown, but is to be accorded the widest scope consistent with the claims.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting. As used herein, the singular forms “a, ” “an, ” and “the” may be intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms “comprise, ” “comprises, ” and/or “comprising, ” “include, ” “includes, ” and/or “including, ” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the  presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
These and other features, and characteristics of the present disclosure, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, may become more apparent upon consideration of the following description with reference to the accompanying drawings, all of which form a part of this disclosure. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended to limit the scope of the present disclosure. It is understood that the drawings are not to scale.
The flowcharts used in the present disclosure illustrate operations that systems implement according to some embodiments of the present disclosure. It is to be expressly understood that the operations of the flowcharts may not be implemented in order. Conversely, the operations may be implemented in inverted order, or simultaneously. Moreover, one or more other operations may be added to the flowcharts. One or more operations may be removed from the flowcharts.
The positioning technology used in the present disclosure may be based on a global positioning system (GPS) , a global navigation satellite system (GLONASS) , a compass navigation system (COMPASS) , a Galileo positioning system, a quasi-zenith satellite system (QZSS) , a wireless fidelity (WiFi) positioning technology, or the like, or any combination thereof. One or more of the above positioning systems may be used interchangeably in the present disclosure.
An aspect of the present disclosure relates to systems and methods for speech recognition. In speech recognition, the accent of the speaker may affect the recognition accuracy and needs to be taken into consideration. According to the present disclosure, when the systems recognize a speech of a speaker, the systems may determine one or more acoustic feature (s) of audio signals of the speech. The systems may also obtain an accent vector of the speaker. The accent vector may include a plurality of elements, each of which corresponds to a regional accent and indicates a similarity between the speaker’s accent and the particular regional accent. The acoustic feature (s) together with the accent vector may be inputted into a trained speech recognition neural network model to translate the speech into a target content form, such as phonemes, words, and/or voices. Because the recognition systems and methods take into account both the acoustic feature (s) of the speech itself and the accent feature of the speaker, the systems and methods improve the accuracy of the recognition result. Further, with a higher recognition accuracy, the systems and methods may also use artificial intelligence to translate the recognized speech into other forms of expression, such as translating and displaying the original speech as textual contents, sound tracks of another accent, and/or sound tracks of another language, etc.
FIG. 1 is a schematic diagram illustrating an exemplary artificial intelligence speech recognition system 100 according to some embodiments of the present disclosure. As shown in FIG. 1, the artificial intelligence speech recognition system 100 (referred to as the speech recognition system 100 for brevity) may include a server 110, a network 120, an input device 130, an output device 140, and a storage device 150.
In the speech recognition system 100, a speaker 160 may speak a speech 170 into the input device 130, which may generate audio signals including or encoding the speech 170. The input device 130 may provide the audio signals and, optionally, information related to the speaker 160 and/or the input device 130 to the server 110 via the network 120. The information related to the speaker 160 and/or the input device 130 may include, for example, the user profile of the speaker 160, position information of the speaker 160 and/or the input device 130, or the like, or any combination thereof. The server 110 may process the audio signals and, optionally, the information related to the speaker 160 and/or the input device 130 to translate the speech 170 into a target content form, such as phoneme (s) , word (s) and/or voice (s) . The translation of the speech 170 may further be transmitted to the output device 140 for presentation via the network 120.
In some embodiments, the server 110 may be a single server or a server group. The server group may be centralized, or distributed (e.g., server 110 may be a distributed system) . In some embodiments, the server 110 may be local or remote. For example, the server 110 may access information and/or data stored in the input device 130, the output device 140, and/or the storage device 150 via the network 120. As another example, the server 110 may be directly connected to the input device 130, output device 140, and/or the storage device 150 to access stored information and/or data. In some embodiments, the server 110 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof. In some embodiments, the server 110 may be implemented on a computing device 200 having one or more components illustrated in FIG. 2 in the present disclosure.
In some embodiments, the server 110 may include a processing engine 112. The processing engine 112 may process information and/or data to perform one or more functions described in the present disclosure. For example, the processing engine 112 may translate the speech 170 into a target content form based on a trained speech recognition neural network model, acoustic feature (s) of the speech 170, and/or the accent vector of the speaker 160. As another example, the processing engine 112 may train a trained speech recognition neural network model using a set of training samples. In some embodiments, the processing engine 112 may include one or more processing engines (e.g., single-core processing engine (s) or multi-core processor (s) ) . Merely by way of example, the processing engine 112 may include one or more hardware processors, such as a central processing unit (CPU) , an application-specific integrated circuit (ASIC) , an  application-specific instruction-set processor (ASIP) , a graphics processing unit (GPU) , a physics processing unit (PPU) , a digital signal processor (DSP) , a field-programmable gate array (FPGA) , a programmable logic device (PLD) , a controller, a microcontroller unit, a reduced instruction-set computer (RISC) , a microprocessor, or the like, or any combination thereof.
In some embodiments, at least part of the server 110 may be integrated into the input device 130 and/or the output device 140. Merely by way of example, the processing engine 112 may be integrated into the input device 130. For example, the input device 130 may be a smart recorder with artificial intelligence, which may include a microprocessor and a memory (e.g., a hard disk) therein to directly translate a recorded speech into textual content and save the textual content in the memory. When the processing engine 112 and the input device 130 are integrated together into a single device, the network 120 in FIG. 1 between the processing engine 112 and the input device 130 may become unnecessary because all communications therebetween become local. Merely for illustration purposes, the present disclosure takes the server 110 and the input device 130 as separate devices as an example of the speech recognition system 100.
The network 120 may facilitate the exchange of information and/or data. In some embodiments, one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, the storage device 150) may send information and/or data to other component (s) in the speech recognition system 100 via the network 120. For example, the server 110 may obtain/acquire a request to translate a speech 170 from the input device 130 via the network 120. In some embodiments, the network 120 may be any type of wired or wireless network, or a combination thereof. Merely by way of example, the network 120 may include a cable network, a wireline network, an optical fiber network, a telecommunications network, an intranet, the Internet, a local area network (LAN) , a wide area network (WAN) , a wireless local area network (WLAN) , a metropolitan  area network (MAN) , a public telephone switched network (PSTN) , a Bluetooth network, a ZigBee network, a near field communication (NFC) network, or the like, or any combination thereof. In some embodiments, the network 120 may include one or more network access points. For example, the network 120 may include wired or wireless network access points such as base stations and/or internet exchange points 120-1, 120-2…through which one or more components of the speech recognition system 100 may be connected to the network 120 to exchange data and/or information.
The input device 130 may be configured to receive a voice input from a user and generate audio signals of and/or encoding the voice input. For example, the input device 130 may receive a speech 170 from the speaker 160 as illustrated in FIG. 1. The input device 130 may be a voice input device or be a device that includes an acoustic input component (e.g., a microphone) . Exemplary input devices 130 may include a mobile device 130-1, a headset 130-2, a microphone 130-3, a music player, a recorder, an e-book reader, a navigation device, a tablet computer, a laptop computer, a built-in device in a motor vehicle, a recording pen, or the like, or any combination thereof. The mobile device 130-1 may include a smart home device, a wearable device, a mobile device, a virtual reality device, an augmented reality device, or the like, or any combination thereof. In some embodiments, the smart home device may include a smart lighting device, a control device of an intelligent electrical apparatus, a smart monitoring device, a smart television, a smart video camera, an interphone, or the like, or any combination thereof. In some embodiments, the wearable device may include a bracelet, footgear, glasses, a helmet, a watch, clothing, a backpack, a smart accessory, or the like, or any combination thereof. In some embodiments, the mobile device may include a mobile phone, a personal digital assistance (PDA) , a gaming device, a navigation device, a point of sale (POS) device, a laptop, a desktop, or the like, or any combination thereof. In some embodiments, the virtual reality device and/or the  augmented reality device may include a virtual reality helmet, a virtual reality glass, a virtual reality patch, an augmented reality helmet, augmented reality glasses, an augmented reality patch, or the like, or any combination thereof. For example, the virtual reality device and/or the augmented reality device may include a Google Glass TM, a RiftCon TM, a Fragments TM, a Gear VR TM, etc. In some embodiments, a built-in device in the motor vehicle may include an onboard computer, an onboard television, etc. In some embodiments, the input device 130 may be a device with positioning technology for locating the position of the user and/or the input device 130.
In operation, the input device 130 may activate a speech recording session to record audio signals that include the speech 170 of the speaker 160. In some embodiments, the speech recording session may be initiated automatically once the input device 130 detects a sound, or a sound that satisfies a condition. For example, the condition may be that the sound is a speech (with human language) , the speech or sound (e.g., timbre of the sound) is from a particular speaker 160, and/or the loudness of the speech or sound is greater than a threshold, etc. Additionally or alternatively, the speech recording session may be initiated by a specific action taken by the speaker 160. For example, the speaker 160 may press a button or an area on the interface of the input device 130 before speaking, speak an utterance, and then release the button or area when finishing speaking. As another example, the speaker 160 may initiate the speech recording session by making a predetermined gesture or sound. In some embodiments, the input device 130 may further transmit the audio signals including the speech 170 to the server 110 (e.g., the processing engine 112) for speech recognition.
In some embodiments, the speaker 160 may be a user of the input device 130. Alternatively, the speaker 160 may be someone other than the user of the input device 130. For example, a user A of the input device 130 may use the input device 130 to input a speech of a user B. In some embodiments, “user” and “speaker” may be used interchangeably. For the convenience of description, the “user” and “speaker” are collectively referred to as the “speaker” .
After the processing engine “translates” the recorded speech into a desired form (e.g., textual content, audio content, etc. ) , the desired form may be sent to another device for presentation. For example, the desired form may be saved in the storage device 150. The desired form may also be presented through the output device 140. The output device 140 may be configured to output and/or display information of the speech in the desired form. In some embodiments, the output device 140 may output and/or display the information in a way visible to humans. For example, the information outputted and/or displayed by the output device 140 may be in the format of, for example, text, image, video content, audio content, graphics, etc. In some embodiments, the output device 140 may output and/or display machine readable information in a way invisible to humans. For example, the output device 140 may store and/or cache information through a storage interface that is machine readable to computers.
In some embodiments, the output device 140 may be an information output device or a device that includes an information output component. Exemplary output devices 140 may include a mobile device 140-1, a display device 140-2, a loudspeaker 140-3, a built-in device in a motor vehicle 140-4, a headset, a microphone, a music player, an e-book reader, a navigation device, a tablet computer, a laptop computer, a recording pen, a printer, a projector, a storage device, or the like, or a combination thereof. Exemplary display devices 140-2 may include a liquid crystal display (LCD) , a light-emitting diode (LED) -based display, a flat panel display, a curved screen, a television device, a cathode ray tube (CRT) , or the like, or a combination thereof.
In some embodiments, the input device 130 and the output device 140 may be two separate devices connected by the network 120. Alternatively, the input device 130 and the output device 140 may be integrated into one single device. Consequently, the device can both receive voice input from users and output information of the desired form translated from the voice input. For example, the integrated device may be a mobile phone. The mobile phone may receive a speech 170 from a speaker 160. The speech may be sent to the server 110 via the network 120, be translated into textual words in a desired language (e.g., Chinese or English words) by the server 110, and be further transmitted back to the mobile phone for display. Alternatively, a local microprocessor of the mobile phone may translate the speech 170 into the textual words of a desired language (e.g., Chinese or English words) , which may then be displayed locally by the mobile phone. When the input device 130 and the output device 140 are integrated together into a single device, the network 120 in FIG. 1 may become unnecessary because all communications between the input device 130 and the output device 140 become local. Merely for illustration purposes, the present disclosure takes the input device 130 and the output device 140 as separate devices as an example of the speech recognition system 100.
The storage device 150 may store data and/or instructions. In some embodiments, the storage device 150 may store data obtained from the input device 130, the output device 140, and/or the server 110. In some embodiments, the storage device 150 may store data and/or instructions that the server 110 may execute or use to perform exemplary methods described in the present disclosure. In some embodiments, the storage device 150 may include a mass storage, a removable storage, a volatile read-and-write memory, a read-only memory (ROM) , or the like, or any combination thereof. Exemplary mass storage may include a magnetic disk, an optical disk, a solid-state drive, etc. Exemplary removable storage may include a flash drive, a floppy disk, an optical disk, a memory card, a zip disk, a magnetic tape, etc. Exemplary volatile read-and-write memory may include a random access memory (RAM) . Exemplary RAM may include a dynamic RAM (DRAM) , a double data rate synchronous dynamic RAM (DDR SDRAM) , a static RAM (SRAM) , a thyristor RAM (T-RAM) , and a zero-capacitor RAM (Z-RAM) , etc. Exemplary ROM may include a mask ROM (MROM) , a programmable ROM (PROM) , an erasable programmable ROM (EPROM) , an electrically-erasable programmable ROM (EEPROM) , a compact disk ROM (CD-ROM) , and a digital versatile disk ROM, etc. In some embodiments, the storage device 150 may be implemented on a cloud platform. Merely by way of example, the cloud platform may include a private cloud, a public cloud, a hybrid cloud, a community cloud, a distributed cloud, an inter-cloud, a multi-cloud, or the like, or any combination thereof.
In some embodiments, the storage device 150 may be connected to the network 120 to communicate with one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, etc. ) . One or more components in the speech recognition system 100 may access the data or instructions stored in the storage device 150 via the network 120. In some embodiments, the storage device 150 may be directly connected to or communicate with one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, etc. ) . In some embodiments, the storage device 150 may be part of the server 110.
In some embodiments, one or more components in the speech recognition system 100 (e.g., the server 110, the input device 130, the output device 140, etc. ) may have permission to access the storage device 150. In some embodiments, one or more components in the speech recognition system 100 may read and/or modify information relating to the speaker 160 and/or the public when one or more conditions are met. For example, the server 110 may read and/or modify one or more speakers’ information after a speech recognition is completed.
One of ordinary skill in the art would understand that when an element of the speech recognition system 100 performs an operation, the element may perform the operation through electrical signals and/or electromagnetic signals. For example, when an input device 130 processes a task, such as making a determination, identifying or selecting an object, the input device 130 may operate logic circuits in its processor to process such task. When the input device 130 sends out a request to the server 110, a processor of the input device 130 may generate electrical signals encoding the request. The processor of the input device 130 may then send the electrical signals to an output port. If the input device 130 communicates with the server 110 via a wired network, the output port may be physically connected to a cable, which may further transmit the electrical signals to an input port of the server 110. If the input device 130 communicates with the server 110 via a wireless network, the output port of the input device 130 may be one or more antennas, which may convert the electrical signals to electromagnetic signals. Similarly, an output device 140 may process a task through operation of logic circuits in its processor, and receive an instruction and/or request from the server 110 via electrical signals or electromagnetic signals. Within an electronic device, such as the input device 130, the output device 140, and/or the server 110, when a processor thereof processes an instruction, sends out an instruction, and/or performs an action, the instruction and/or action is conducted via electrical signals. For example, when the processor retrieves or saves data from a storage medium (e.g., the storage device 150) , it may send out electrical signals to a read/write device of the storage medium, which may read or write structured data in the storage medium. The structured data may be transmitted to the processor in the form of electrical signals via a bus of the electronic device. Here, an electrical signal may refer to one electrical signal, a series of electrical signals, and/or a plurality of discrete electrical signals.
FIG. 2 is a schematic diagram illustrating exemplary hardware and software components of a computing device 200 on which the server 110, the input device 130, and/or the output device 140 may be implemented according to some embodiments of the present disclosure. For example, the processing engine 112  may be implemented on the computing device 200 and configured to perform functions of the processing engine 112 disclosed in this disclosure.
The computing device 200 may be used to implement any component of the speech recognition system 100 as described herein. For example, the processing engine 112 may be implemented on the computing device, via its hardware, software program, firmware, or a combination thereof. Although only one such computer is shown, for convenience, the computer functions relating to the on-demand service as described herein may be implemented in a distributed fashion on a number of similar platforms, to distribute the processing load.
The computing device 200, for example, may include COM ports 250 connected to and/or from a network connected thereto to facilitate data communications. The computing device 200 may also include a processor (e.g., a processor 220) , in the form of one or more processors (e.g., logic circuits) , for executing program instructions. For example, the processor may include interface circuits and processing circuits therein. The interface circuits may be configured to receive electronic signals from a bus 210, wherein the electronic signals encode structured data and/or instructions for the processing circuits to process. The processing circuits may conduct logic calculations, and then determine a conclusion, a result, and/or an instruction encoded as electronic signals. Then the interface circuits may send out the electronic signals from the processing circuits via the bus 210.
The exemplary computer platform may include an internal communication bus 210, program storage and data storage of different forms, for example, a disk 270, and a read only memory (ROM) 230, or a random access memory (RAM) 240, for various data files to be processed and/or transmitted by the computer. The exemplary computer platform may also include program instructions stored in the ROM 230, RAM 240, and/or other type of non-transitory storage medium to be executed by the processor 220. The method and/or process of the present disclosure may be implemented as the program instructions. The computing device 200 also includes an I/O component 260, supporting input/output between the computer and other components. The computing device 200 may also receive programming and data via network communications.
Merely for illustration, only one CPU and/or processor is described in the computing device 200. However, it should be noted that the computing device 200 in the present disclosure may also include multiple CPUs and/or processors, thus operations and/or method steps that are performed by one CPU and/or processor as described in the present disclosure may also be jointly or separately performed by the multiple CPUs and/or processors. For example, if in the present disclosure the CPU and/or processor of the computing device 200 executes both step A and step B, it should be understood that step A and step B may also be performed by two different CPUs and/or processors jointly or separately in the computing device 200 (e.g., the first processor executes step A and the second processor executes step B, or the first and second processors jointly execute steps A and B) .
FIG. 3 is a schematic diagram illustrating exemplary hardware and/or software components of an exemplary mobile device 300 on which an input device 130 and/or an output device 140 may be implemented according to some embodiments of the present disclosure. As illustrated in FIG. 3, the mobile device 300 may include a communication platform 310, a display 320, a graphic processing unit (GPU) 330, a central processing unit (CPU) 340, an I/O 350, a memory 360, a storage 390, a voice input 305, and a voice output 315. In some embodiments, any other suitable component, including but not limited to a system bus or a controller (not shown) , may also be included in the mobile device 300.
In some embodiments, a mobile operating system 370 (e.g., iOS TM, Android TM, Windows Phone TM, etc. ) and one or more applications 380 may be loaded into the memory 360 from the storage 390 in order to be executed by the CPU 340. The applications 380 may include a browser or any other suitable mobile apps for receiving and rendering information relating to speech recognition or other information from the processing engine 112. User interactions with the information stream may be achieved via the I/O 350 and provided to the processing engine 112 and/or other components of the speech recognition system 100 via the network 120. The voice input 305 may include an acoustic input component (e.g., a microphone) . The voice output 315 may include an acoustic generator (e.g., a speaker) that generates sound.
To implement various modules, units, and their functionalities described in the present disclosure, computer hardware platforms may be used as the hardware platform (s) for one or more of the elements described herein. A computer with user interface elements may be used to implement a personal computer (PC) or any other type of work station or terminal device. A computer may also act as a server if appropriately programmed.
FIGs. 4A and 4B are block diagrams illustrating exemplary processing engines 112A and 112B according to some embodiments of the present disclosure. In some embodiments, the processing engine 112A may be configured to translate a speech based on a trained speech recognition neural network model. The processing engine 112B may be configured to train a preliminary neural network model to generate a trained speech recognition neural network model. In some embodiments, the processing engines 112A and 112B may respectively be implemented on a computing device 200 (e.g., the processor 220) illustrated in FIG. 2 or a CPU 340 as illustrated in FIG. 3. Merely by way of example, the processing engine 112A may be implemented on a CPU 340 of a mobile device and the processing engine 112B may be implemented on a computing device 200. Alternatively, the processing engines 112A and 112B may be implemented on a same computing device 200 or a same CPU 340.
The processing engine 112A may include an acquisition module 411, a determination module 412, a translation module 413, and a generation module 414.
The acquisition module 411 may be configured to obtain information related to the speech recognition system 100. For example, the acquisition module 411 may obtain target audio signals including a speech 170 of a speaker 160, historical audio signals including one or more historical speeches of the speaker 160, an accent vector of the speaker 160, one or more regional accent model (s) , a trained speech recognition neural network model, or the like, or any combination thereof. The acquisition module 411 may obtain information related to the speech recognition system 100 from an external data source via the network 120, and/or from one or more components of the speech recognition system 100, such as a storage device, the server 110 (e.g., the processing engine 112B) .
The determination module 412 may determine one or more acoustic features of the target audio signals, such as a pitch, a speech rate, a Linear Prediction Coefficient (LPC) , a Mel-scale Frequency Cepstral Coefficient (MFCC) , a Linear Predictive Cepstral Coefficient (LPCC) of the target audio signals. In some embodiments, the determination module 412 may determine an accent vector of the speaker 160. The accent vector may describe the accent of the speaker 160. For example, the accent vector may include one or more elements, each of which may correspond to a regional accent and include a likelihood value related to the regional accent. Details regarding the determination of the accent vector may be found elsewhere in the present disclosure (e.g., operation 530, FIGs. 6 and 7, and the relevant descriptions thereof) .
The translation module 413 may translate the speech 170 of the speaker 160. For example, the translation module 413 may input the one or more acoustic features of target audio signals that includes the speech 170, the accent vector of the speaker 160, and/or the local accent corresponding to the speech 170 into a trained speech recognition neural network model to translate the speech 170 into a target content form. The target content form may be phoneme, syllable, character, or the like, or any combination thereof. Details regarding the translation of the  speech 170 may be found elsewhere in the present disclosure (e.g., operation 550 and the relevant descriptions thereof) .
The generation module 414 may generate an interface through an output device 140 to present the speech 170 in the target content form. The interface generated through the output device 140 may be configured to present the speech 170 in the target content form. In some embodiments, to present the speech 170 in different target content forms, the generation module 414 may generate different interfaces through the output device 140. For example, to present the speech 170 in the form of voices, the generation module 414 may generate a play interface through the output device 140. To present the speech 170 in the form of words, the generation module 414 may generate a display interface through the output device 140.
The processing engine 112B may include an acquisition module 421, a determination module 422, and a training module 423.
The acquisition module 421 may obtain information used to generate a trained speech recognition neural network model. For example, the acquisition module 421 may obtain sample audio signals including a plurality of sample speeches of a plurality of sample speakers, a sample accent vector of each sample speaker, a sample local accent corresponding to each sample speech, a preliminary neural network model for training, or the like, or any combination thereof. The acquisition module 421 may obtain information used to generate the trained speech recognition neural network model from an external data source via the network 120, and/or from one or more components of the speech recognition system 100, such as a storage device, the server 110.
The determination module 422 may determine one or more sample acoustic features of a sample audio signal that includes a sample speech. For a sample speech, the sample acoustic feature (s) of the corresponding sample audio signals may include a pitch, a speech rate, a LPC, a MFCC, a LPCC, or the like, or  any combination thereof. Details regarding the determination of the sample acoustic feature (s) may be found elsewhere in the present disclosure (e.g., operation 820 and the relevant descriptions thereof) .
The training module 423 may train the preliminary neural network model to generate the trained speech recognition neural network model. For example, the training module 423 may train the preliminary neural network model using input data, for example, information related to a plurality of sample audio signals that include a plurality of sample speeches of a plurality of sample speakers. Details regarding the generation of the trained speech recognition neural network model may be found elsewhere in the present disclosure (e.g., operation 860 and the relevant descriptions thereof) .
The modules in the  processing engines  112A and 112B may be connected to or communicate with each other via a wired connection or a wireless connection. The wired connection may include a metal cable, an optical cable, a hybrid cable, or the like, or any combination thereof. The wireless connection may include a Local Area Network (LAN) , a Wide Area Network (WAN) , a Bluetooth, a ZigBee, a Near Field Communication (NFC) , or the like, or any combination thereof. Two or more of the modules may be combined as a single module, and any one of the modules may be divided into two or more units. For example, the  processing engines  112A and 112B may be integrated into a processing engine 112 configured to generate a trained speech recognition neural network model and apply the trained speech recognition neural network model in speech recognition. In some embodiments, a processing engine 112 (the processing engine 112A and/or the processing engine 112B) may include one or more additional modules. For example, the processing engine 112A may include a storage module (not shown) configured to store data.
FIG. 5 is a flowchart illustrating an exemplary process for speech recognition according to some embodiments of the present disclosure. The process 500 may be executed by the speech recognition system 100. For example, the  process 500 may be implemented as a set of instructions (e.g., an application) stored in storage device 150. The processing engine 112 (e.g., the processing engine 112A) may execute the set of instructions and accordingly be directed to perform the process 500.
In 510, the processing engine 112A (e.g., the acquisition module 411) may obtain target audio signals including a speech 170 of a speaker 160.
The target audio signals may be a representation of the speech 170, and record characteristic information (e.g., the frequency) of the speech 170. In some embodiments, the target audio signals may be acquired by one or more components of the speech recognition system 100. For example, the target audio signals may be acquired from the input device 130 (e.g., the smart phone 130-1, a headset 130-2, a microphone 130-3, a navigation device) . The target audio signals may be inputted by the speaker 160 via the input device 130. As another example, the target audio signals may be acquired from a storage device in the speech recognition system 100, such as the storage device 150 and/or the storage 390. In some embodiments, the target audio signals may be acquired from an external data source, such as a speech library, via the network 120.
In some embodiments, the speech 170 may include a request for an on-demand service, such as a taxi-hailing service, a chauffeur service, an express car service, a carpool service, a bus service, a driver hire service, and a shuttle service. Additionally or alternatively, the speech 170 may include information related to the request for the on-demand service. Merely by way of example, the speech 170 may include a start location and/or a destination related to request for the taxi-hailing service. In some embodiments, the speech 170 may include a command to direct the input device 130 and/or the output device 140 to perform a certain action. Merely by way of example, the speech 170 may include a command to direct the input device 130 to call someone.
In 520, the processing engine 112A (e.g., the determination module 412)  may determine one or more acoustic features of the target audio signals.
An acoustic feature may refer to an acoustic property of the speech 170 that may be recorded and analyzed. Exemplary acoustic features may include a pitch, a speech rate, a Linear Prediction Coefficient (LPC) , a Mel-scale Frequency Cepstral Coefficient (MFCC) , a Linear Predictive Cepstral Coefficient (LPCC) , or the like, or any combination thereof.
In some embodiments, the acoustic feature (s) may be represented or recorded in the form of a feature vector. In some embodiments, the feature vector may be expressed as a vector with one column or one row. For example, the feature vector may be a row vector expressed as a 1×N determinant (e.g., a 1×108 determinant) . In some embodiments, the feature vector may correspond to an N-dimensional coordinate system. The N-dimensional coordinate system may be associated with N acoustic features. In some embodiments, the determination module 412 may process one or more feature vectors at once. For example, m feature vectors (e.g., three row vectors) may be integrated into a 1×mN vector or an m×N matrix, where m is an integer.
In some embodiments, the determined feature vector may indicate one or more acoustic features of the target audio signals during a time window, such as 10 milliseconds (ms) , 25 ms, or 50 ms. For example, the determination module 412 may segment the target audio signals into a plurality of time windows. Different time windows may have the same duration or different durations. Two consecutive time windows may or may not be overlapped with each other. The determination module 412 may then perform a Fast Fourier Transform (FFT) on the audio signal (s) in each time window. From the FFT data for a time window, the determination module 412 may extract acoustic feature (s) that are represented as a feature vector for the time window.
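Merely for illustration, the following is a minimal Python sketch of this windowing-and-FFT step, assuming a 16 kHz mono signal held in a NumPy array; the window length, hop size, and number of retained coefficients (e.g., 108) are assumptions for illustration, and a production system would typically derive MFCC or LPCC features with a dedicated signal-processing library.

```python
import numpy as np

def frame_features(signal, sample_rate=16000, win_ms=25, hop_ms=10, n_coeffs=108):
    """Segment an audio signal into overlapping time windows, perform an FFT on
    each window, and keep a fixed number of log-spectral coefficients as the
    feature vector of that window (a simplified stand-in for MFCC/LPCC extraction)."""
    win = int(sample_rate * win_ms / 1000)      # e.g., 400 samples per 25 ms window
    hop = int(sample_rate * hop_ms / 1000)      # e.g., 160 samples (10 ms) between windows
    assert len(signal) >= win, "this sketch assumes at least one full window of audio"
    n_frames = 1 + (len(signal) - win) // hop
    window = np.hamming(win)
    feats = []
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + win] * window
        spectrum = np.abs(np.fft.rfft(frame))   # FFT magnitude for this time window
        feats.append(np.log(spectrum + 1e-10)[:n_coeffs])   # 1 x N feature vector
    return np.stack(feats)                      # m windows -> an m x N matrix

# m consecutive feature vectors may also be flattened into a single 1 x mN vector:
# flat = frame_features(signal).reshape(1, -1)
```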
In 530, the processing engine 112A (e.g., the acquisition module 411) may obtain an accent vector of the speaker 160.
The accent vector may refer to a vector that describes the accent of the speaker 160. The accent of the speaker 160 is a manner of pronunciation peculiar to the speaker 160. In speech recognition, an accented pronunciation may result in a misidentification and failed recognition of words. The accuracy of speech recognition may be improved if the accent of the speaker 160 is taken into consideration. Normally, an accent of a person is related to one or more locations where the person has lived. In the case of a person who has lived in a plurality of locations, his/her accent may be a hybrid accent. Merely by way of example, for a person who was born in Shandong, studied in Shanghai, and now works in Beijing, his/her accent may be a mixture and/or combination of Shandong accent, Shanghai accent, and Beijing accent.
For the purpose of describing the accent of the speaker 160, the accent vector is provided. In some embodiments, the accent vector may include one or more elements, each of which may correspond to a regional accent and include a likelihood value related to the regional accent. A regional accent may represent a manner of pronunciation peculiar to one or more specific regions. As used herein, since an accent is almost always local, a regional accent is in fact equivalent to the geographic region where the accent comes from, i.e., a regional accent may also be referred to as one or more geographic regions to which the accent belongs. For example, the Shandong accent may also be referred to as Shandong because the Shandong accent is mostly spoken by people who grew up or have lived in Shandong. As another example, China’s Northeast accent may also be referred to as Liaoning province, Jilin province, Heilongjiang province, and/or other places to which China’s Northeast accent belongs. In some embodiments, a relatively large region may be divided into a plurality of sub-regions according to respective regional accents. For example, China may be divided into a plurality of regions each of which has a distinctive regional accent. Liaoning province, Jilin province, and Heilongjiang province may be classified into one region since the accents of the three provinces are similar. In some embodiments, the corresponding relationship between regions and regional accents may be stored in a storage device of the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390.
A likelihood value related to a regional accent may represent a likelihood that the accent of the speaker 160 is the regional accent. In other words, the likelihood value related to the regional accent may measure a similarity or difference between the accent of the speaker 160 and the regional accent. For example, a speaker 160 speaks pure Beijing dialect if the likelihood value related to the Beijing accent is or approaches 100%. As another example, a speaker 160 speaks Mandarin mixed with a slight Taiwan accent if the likelihood values related to Mandarin and the Taiwan accent are 90% and 10%, respectively.
In some embodiments, the accent vector of the speaker 160 may be a one-hot vector or an embedding vector. In the one-hot vector, the likelihood value (s) related to the regional accent (s) that the speaker 160 has may be denoted as 1, while the likelihood value (s) of the regional accent (s) that the speaker 160 does not have may be denoted as 0. A regional accent that the speaker 160 has may refer to a regional accent whose corresponding likelihood value is equal to or greater than a certain value, e.g., 0%, 10%, etc. In the embedding vector, the value (s) of the regional accent (s) that the speaker 160 has may be denoted as real number (s) , while the likelihood value (s) of the regional accent (s) that the speaker 160 does not have may be denoted as 0. Merely by way of example, the accent vector may be expressed as Equation (1) below:
X= {X (1) , X (2) , ..., X (i) }  (1)
wherein X refers to the accent vector of the speaker 160; i refers to the i th regional accent; and X (i) refers to a likelihood value related to the i th regional accent. The likelihood value X (i) may be any positive value. For example, if Mandarin is the 1st element in the vector X, the Shandong accent is the 2nd element in the vector X, and the Shanghai accent is the 3rd element in the vector X, then a vector of X= {0.7, 0.2, 0.1} may mean that the speaker’s accent includes or substantially includes 70% of Mandarin, 20% of the Shandong accent, and 10% of the Shanghai accent.
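For illustration only, a short Python sketch of the two accent-vector representations is given below; the fixed ordering of regional accents and the example likelihood values are assumptions taken from the example above, not a prescribed encoding.

```python
import numpy as np

# A fixed ordering of regional accents; the accents chosen here are illustrative only.
REGIONAL_ACCENTS = ["Mandarin", "Shandong accent", "Shanghai accent"]

def one_hot_accent_vector(accents_of_speaker):
    """One-hot style: 1 for each regional accent the speaker has, 0 otherwise."""
    return np.array([1.0 if a in accents_of_speaker else 0.0 for a in REGIONAL_ACCENTS])

def embedding_accent_vector(likelihoods):
    """Embedding style: a real-valued likelihood per element, as in Equation (1)."""
    return np.array([likelihoods.get(a, 0.0) for a in REGIONAL_ACCENTS])

print(one_hot_accent_vector({"Mandarin", "Shandong accent"}))             # [1. 1. 0.]
print(embedding_accent_vector({"Mandarin": 0.7, "Shandong accent": 0.2,
                               "Shanghai accent": 0.1}))                  # [0.7 0.2 0.1]
```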
In some embodiments, the acquisition module 411 may acquire the accent vector of the speaker 160 from an external source and/or one or more components of the speech recognition system 100. Further, the accent profile (e.g., accent vector X) of a speaker may be predetermined by the speech recognition system 100. For example, the accent vector of the speaker 160 may be recorded and/or included in his/her user profile and stored in a storage device of the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390. The acquisition module 411 may access the storage device and retrieve the accent vector of the speaker 160. In some embodiments, the accent vector of the speaker 160 may be inputted by the speaker 160 via the input device 130. Additionally or alternatively, the accent vector may be determined by the processing engine 112A (e.g., the determination module 412) and transmitted to the storage device for storage.
In some embodiments, the determination module 412 may determine one or more regional accents and the corresponding likelihood values based on the user profile of the speaker 160. The user profile of the speaker 160 may include the hometown, telephone number, education experience, work experience, log information (e.g., web log, software log) , historical service orders, or the like, or any combination thereof. Since the determination of the accent vector based on the user profile does not depend on his/her actual accent, the value (s) (or likelihood value (s) ) of the vector element (s) may be binary value (s) . For example, the determination module 412 may determine one or more regional accents that the speaker 160 has based on the regions where the speaker 160 was born, studied, worked, or lived long-term (e.g., the length of residence being longer than a threshold) , or the like, or any combination thereof. Accordingly, based on where the speaker was born, the likelihood value of the regional accent related to the geographic region where the speaker 160 was born may be designated as 1, and that of a regional accent related to a geographic region other than where the speaker was born may be designated as 0. Based on the length of time that the speaker 160 has stayed in a region, the regional accent related to a geographic region where the speaker has lived longer than the threshold value may be designated as 1, and the regional accent related to a geographic region where the speaker has lived shorter than the threshold value may be designated as 0.
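The profile-based determination can be sketched in Python as follows; the region names, the five-year residence threshold, and the helper function are hypothetical and serve only to illustrate how binary likelihood values may be assigned.

```python
def profile_accent_vector(birth_region, residence_years, regions, years_threshold=5):
    """Binary accent vector from a user profile: 1 for the regional accent of the
    birth region and of any region of long-term residence, 0 otherwise."""
    vector = []
    for region in regions:                       # each region stands for its regional accent
        lived_long = residence_years.get(region, 0) >= years_threshold
        vector.append(1.0 if region == birth_region or lived_long else 0.0)
    return vector

# e.g., born in Shandong, worked 8 years in Beijing, studied 3 years in Shanghai
print(profile_accent_vector("Shandong", {"Beijing": 8, "Shanghai": 3},
                            ["Beijing", "Shandong", "Shanghai"]))   # [1.0, 1.0, 0.0]
```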
In some embodiments, the user profile the speaker 160 may include historical audio signals including one or more historical speeches of the speaker 160. The determination module 412 may determine the accent vector of the speaker 160 by analyzing the historical audio signals. Details regarding the determination of the accent vector based on historical audio signals may be found elsewhere in the present disclosure (e.g., FIGs. 6 and 7 and the relevant descriptions thereof) .
In 540, the acquisition module 411 may obtain a local accent of a region from which the speech 170 is originated. The way people speak may be affected by the language environment. For example, for a speaker who has a hybrid accent of Shandong and Beijing, he/she may have a more obvious Shandong accent in Shandong and a more obvious Beijing accent in Beijing. As another example, for a speaker who does not have a Taiwan accent, he/she may speak with a slight Taiwan accent when in Taiwan. Accordingly, the local accent of the region from which the speech 170 is originated may need to be taken into consideration in the speech recognition.
In some embodiments, the input device 130 may be a device with positioning technology for locating the speaker 160 and/or the input device 130. Alternatively, the input device 130 may communicate with another positioning device to determine the position of the speaker 160 and/or the input device 130. When or after the speaker 160 inputs the speech 170 via the input device 130, the input  device 130 may transmit the position information of the speaker 160 to the server 110 (e.g., the processing engine 112A) . Based on the position information, the server 110 (e.g., the processing engine 112A) may determine the local accent of the region from which the speech 170 is originated. For example, the determination module 412 may determine the regional accent of the region where the speech 170 is generated based on the corresponding relationship between regions and regional accents. The determination module 412 may further designate the regional accent of the region where the speech 170 is generated as the local accent.
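As a simple illustration of this lookup, the sketch below maps a device's reported region to a local accent using a hypothetical region-to-accent table; the region names, accent labels, and the Mandarin fallback are assumptions, not part of the disclosure.

```python
# Hypothetical mapping between geographic regions and regional accents.
REGION_TO_ACCENT = {
    "Beijing": "Beijing accent",
    "Shandong": "Shandong accent",
    "Liaoning": "Northeast accent",
    "Jilin": "Northeast accent",
    "Heilongjiang": "Northeast accent",
}

def local_accent_from_position(region_name):
    """Designate the regional accent of the region where the speech is generated
    as the local accent, based on the device's reported position."""
    return REGION_TO_ACCENT.get(region_name, "Mandarin")   # fall back to a default accent

print(local_accent_from_position("Jilin"))   # Northeast accent
```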
In 550, the processing engine 112A (e.g., the translation module 413) may input the one or more acoustic features of the target audio signals, the accent vector of the speaker 160, and the local accent into a trained speech recognition neural network model to translate the speech 170 into a target content form.
The target content form may be phoneme, syllable, character, or the like, or any combination thereof. A phoneme refers to a perceptually distinct unit of speech in a specified language that distinguishes one word from another. A syllable refers to a unit of pronunciation having one vowel sound, with or without surrounding consonants, forming the whole or a part of a word, and the translated speech in the form of syllables may be a voice or sound of words. The phoneme and/or syllable may be that of the speech under a different accent. For example, the target content form of an audio speech under the Shandong accent may include phonemes of the same speech spoken in Mandarin or Cantonese. The phoneme and/or syllable may also be of another language different from that of the original speech. For example, the audio speech may be spoken in Chinese, and the target content of the audio speech may include phonemes of the same speech spoken in English. A character refers to a unit of written language, and the translated speech in the form of characters may be one or more words. For example, the processing engine 112A may identify the phonemes and/or syllables in the audio speech and translate the phonemes and/or syllables into characters of written words in a specific language same as or different from the language of the audio speech. In some embodiments, the target content form may be of a target language or accent. For example, the translated speech may be English words.
The trained speech recognition neural network model may be configured to generate a neural network output based on the input. For example, the trained speech recognition neural network model may output likelihoods that the input (e.g., the acoustic feature (s) , the accent vector, and the local accent) corresponds to specific phonemes. In some embodiments, the neural network output may be the same content form as the target content form. For example, the neural network output may be in the form of phoneme that is the same as the target content form. There is no need to transform the neural network output into the target content form. In some embodiments, the neural network output may be in a different content form from the target content form. For example, the neural network output may be in the form of phoneme while the target content form is syllable or character. The translation module 413 may further transform the neural network output into the target content form. Merely by way of example, the translation module 413 may input the neural network output in the form of phoneme into a set of Weighted Finite-State Transducers (WFST) to generate a word lattice. The translation module 413 may further derive a transcription of the speech 170 from the word lattice. The transcription of the speech 170 may be a voice including spoken words or text including words. In some embodiments, the target content form may be of a target language or accent. The transcription of the speech 170 may be further translated into the target language or accent. The translation of the transcription may be performed by the translation module 413 based on a translation technique or an external translation platform (e.g., a translation website, a translation software) .
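Purely as a sketch of how the three kinds of input may be combined, the following Python code concatenates the acoustic features, the accent vector, and a one-hot local accent and passes them through a toy stand-in for the trained network to obtain phoneme likelihoods; the dimensions (108 acoustic features, 10 accent elements, 10 regions, 50 phonemes) and the single linear layer are assumptions, not the actual trained model.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def recognize_frame(acoustic_feats, accent_vec, local_accent_onehot, weights, bias):
    """Concatenate the acoustic features, the speaker's accent vector, and a one-hot
    local accent, then map them to phoneme likelihoods; a single linear layer stands
    in for the trained speech recognition neural network."""
    x = np.concatenate([acoustic_feats, accent_vec, local_accent_onehot])
    return softmax(x @ weights + bias)   # likelihoods over the phoneme inventory

# Illustrative sizes: 108 acoustic features, 10 accent elements, 10 regions, 50 phonemes.
rng = np.random.default_rng(0)
probs = recognize_frame(rng.normal(size=108), np.zeros(10), np.eye(10)[3],
                        rng.normal(size=(128, 50)), np.zeros(50))
print(probs.shape, round(float(probs.sum()), 3))   # (50,) 1.0
```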
In some embodiments, the trained speech recognition neural network model may be acquired by the translation module 413 from a storage device in the speech recognition system 100 (e.g., the storage device 150) and/or an external  data source (not shown) via the network 120. In some embodiments, the processing engine 112B may generate the trained speech recognition neural network model, and store it in the storage device. The translation module 413 may access the storage device and retrieve the trained speech recognition neural network model.
In some embodiments, the processing engine 112B may generate the trained speech recognition neural network model based on a machine learning method. The machine learning method may include but is not limited to an artificial neural networks algorithm, a deep learning algorithm, a decision tree algorithm, an association rule algorithm, an inductive logic programming algorithm, a support vector machines algorithm, a clustering algorithm, a Bayesian networks algorithm, a reinforcement learning algorithm, a representation learning algorithm, a similarity and metric learning algorithm, a sparse dictionary learning algorithm, a genetic algorithm, a rule-based machine learning algorithm, or the like, or any combination thereof. In some embodiments, the processing engine 112B may generate the trained speech recognition neural network model by performing process 800 illustrated in FIG. 8.
In 560, the generation module 414 may generate an interface through an output device 140 to present the speech 170 in the target content form.
The interface generated through the output device 140 may be configured to present the speech 170 in the target content form. To present the speech 170 in different target content forms, the generation module 414 may generate different interfaces through the output device 140. In some embodiments, the interface for presenting the speech 170 may be visible to humans. For example, the speech 170 in the target content form may be one or more voices and/or words as described in connection with operation 550. To present the speech 170 in the form of voices, the generation module 414 may generate a play interface through the output device 140. To present the speech 170 in the form of words, the generation module 414 may generate a display interface through the output device 140. In some embodiments, the interface for presenting the speech 170 may be invisible to humans but machine readable. For example, the speech 170 in the target content form may be phonemes. The generation module 414 may generate a storage interface through the output device 140 to store and/or cache the speech 170 in the form of phonemes, which may not be read directly by human eyes but may be read, translated, and used by a smart device (e.g., a computer or other device) .
The output device 140 may be, for example, a mobile device 140-1, a display device 140-2, a loudspeaker 140-3, a built-in device in a motor vehicle 140-4, a storage device, and/or any device that can output and/or display information. In some embodiments, the input device 130 and the output device 140 may be integrated into one device or be two separate devices. For example, the speech 170 may be recorded by a mobile phone and then translated into words by the processing engine 112A. The generation module 414 may generate an interface for displaying the translated words through the mobile phone itself or another output device 140, such as another mobile phone, a laptop computer, etc.
It should be noted that the above description of the process 500 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure.
In some embodiments, one or more additional operations may be added or one or more operations of the process 500 may be omitted. For example, operation 540 may be omitted. In 550, the translation module 413 may input the one or more acoustic features of the target audio signals and the accent vector of the speaker 160 into the trained speech recognition neural network model. In some embodiments, the order of the operations of the process 500 may be changed. For example, operations 530 and 540 may be performed simultaneously or in any order.
In some embodiments, in 530, the accent of the speaker 160 may be expressed in a form other than an accent vector. For example, the accent may be expressed by a polynomial or a matrix. The polynomial or the matrix may include one or more elements that are similar to those of the accent vector. Each element may correspond to a regional accent and include a likelihood value related to the regional accent. In some embodiments, in 530, a plurality of accent vectors may be acquired by the acquisition module 411. The plurality of accent vectors may correspond to different regional accents. Optionally, the plurality of accent vectors may be integrated into a single accent vector by the determination module 412 before being inputted into the trained speech recognition neural network model. In some embodiments, the determination module 412 may also obtain or determine one or more audio characteristics related to the target audio signals. The audio characteristics may be independent of the words in the speech 170 spoken by the speaker 160. For example, the audio characteristics may indicate characteristics that correspond to one or more of background noise, recording channel properties, the speaker’s speaking style, the speaker’s gender, the speaker’s age, or the like, or any combination thereof. In 550, the audio characteristic (s) may be inputted into the trained speech recognition neural network model together with the one or more acoustic features, the accent vector, and/or the local accent.
FIG. 6 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on one or more regional accent models according to some embodiments of the present disclosure. The process 600 may be executed by the speech recognition system 100. For example, the process 600 may be implemented as a set of instructions (e.g., an application) stored in storage device 150. The processing engine 112 (e.g., the processing engine 112A) may execute the set of instructions and may accordingly be directed to perform the process 600. In some embodiments, the process 600 may be an embodiment of operation 530 with reference to FIG. 5.
In 610, the processing engine 112A (e.g., the acquisition module 411) may obtain historical audio signals including one or more historical speeches of the speaker 160. Each of the historical speeches may be encoded in one or more of the historical audio signals. The historical audio signals may be acquired from one or more components of the speech recognition system 100 or an external data source. For example, the historical audio signals may be acquired from a storage device in the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390. In some embodiments, the historical speeches may be inputted by the speaker 160 via an input device 130. In some embodiments, a historical speech may include a historical request for an on-demand service or information related to the historical request. The historical audio signals including a historical speech may be similar to the target audio signals including the speech 170 as described in connection with operation 510, and the descriptions thereof are not repeated.
In 620, for each of the one or more historical speeches, the processing engine 112A (e.g., the determination module 412) may determine one or more historical acoustic features of the corresponding historical audio signals. For a historical speech, the historical acoustic feature (s) of the corresponding historical audio signals may include a pitch, a speech rate, a LPC, a MFCC, a LPCC, or the like, or any combination thereof. In some embodiments, the historical acoustic features corresponding to a historical speech may be represented or recorded in the form of a feature vector. Operation 620 may be performed in a similar manner with operation 520, and the descriptions thereof are not repeated here.
In 630, the processing engine 112A (e.g., the acquisition module 411) may obtain one or more regional accent models. A regional accent model may be a particular accent model that corresponds to a language or a regional accent of speakers. Merely by way of example, the regional accent models may include ones that correspond to different languages (e.g., English, Japanese, or Spanish) and/or ones that correspond to different regional accents (e.g., for Chinese, the regional accents may include Mandarin, Cantonese, etc.; for English, the regional accents may include American English, British English, Indian English, etc. ) . A regional accent model that corresponds to a regional accent or a language may be configured to generate a model output based on acoustic feature (s) of a speech. The model output may indicate a likelihood or probability that the speaker of the speech has the corresponding regional accent or speaks the language. The model output may be further used to construct an accent vector of the speaker, which will be described in detail in connection with operation 650.
In some embodiments, the acquisition module 411 may acquire the regional accent model (s) from an external data source or one or more components of the speech recognition system 100. For example, the acquisition module 411 may acquire the regional accent model (s) from, for example, a linguistic database, and/or a language library external to the speech recognition system 100. As another example, the acquisition module 411 may acquire the regional accent models (s) from a storage device of the speech recognition system 100, such as the storage device 150, the ROM 230, the RAM 240, and/or the storage 390. In some embodiments, the regional accent model (s) may be trained by the server 110 (e.g., the processing engine 112B) and stored in the storage device. For example, a regional accent model that corresponds to a particular regional accent may be trained by the processing engine 112B using a training set. The training set may include, for example, acoustic features of a plurality of sample speeches that belong to the regional accent.
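One possible realization of a regional accent model, offered as an assumption rather than a requirement of the disclosure, is a per-accent Gaussian mixture model fitted on sample acoustic features of speeches that belong to that accent; its average log-likelihood then serves as the model output for a historical speech.

```python
import numpy as np
from sklearn.mixture import GaussianMixture   # one possible modeling choice, not mandated here

def train_regional_accent_model(sample_features):
    """Fit a simple generative model on acoustic features of sample speeches that
    belong to one regional accent; its likelihood score later serves as the model
    output for that accent."""
    gmm = GaussianMixture(n_components=8, covariance_type="diag", random_state=0)
    gmm.fit(sample_features)                  # sample_features: (num_frames, num_features)
    return gmm

def accent_model_output(model, historical_features):
    """Average per-frame log-likelihood of one historical speech under one accent model."""
    return float(np.mean(model.score_samples(historical_features)))
```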
In 640, for each of the one or more historical speeches, the processing engine 112A (e.g., the determination module 412) may input the one or more corresponding historical acoustic features into each of the one or more regional accent models. For a historical speech, each regional accent model may generate a model output indicating a likelihood or probability that the speaker 160 has the corresponding regional accent. For example, historical acoustic features derived from three historical speeches may be inputted into a Beijing accent model respectively. The outputs of the Beijing accent model may indicate an 80% likelihood, an 85% likelihood, and a 70% likelihood that the speaker 160 has the Beijing accent.
In 650, the determination module 412 may determine the plurality of elements of the accent vector of the speaker 160 based on at least the output (s) of the one or more regional accent models. Each of the elements of the accent vector may correspond to a regional accent and include a likelihood value related to the regional accent as described in connection with FIG. 6. The likelihood value related to a regional accent may be determined according to the output (s) of the regional accent model that corresponds to the regional accent.
Taking the Beijing accent as an example, the corresponding likelihood value may be determined based on the output (s) of the Beijing accent model with respect to the one or more historical speeches. For example, the determination module 412 may determine an overall likelihood that the speaker 160 has a Beijing accent, and further determine the likelihood value accordingly. The overall likelihood that the speaker 160 has the Beijing accent may be, for example, a maximum, an average, or a median value of the outputs of the Beijing accent model. In some embodiments, the likelihood value of the Beijing accent may be determined by normalizing the overall likelihood of the Beijing accent together with the overall likelihoods of other regional accents. In some embodiments, the determination module 412 may apply a minimum threshold in the likelihood value determination. In such instances, when the (normalized) overall likelihood of the Beijing accent is lower than the minimum threshold, the determination module 412 may determine that the speaker 160 does not have the Beijing accent and the corresponding likelihood value may be denoted as 0. When the (normalized) overall likelihood of the Beijing accent is not lower than the minimum threshold, the determination module 412 may determine that the speaker 160 has the Beijing accent and the corresponding likelihood value may be denoted as the (normalized) overall likelihood or 1. In some embodiments, the determination module 412 may rank the overall likelihoods of different regional accents. When the Beijing accent ranks in the top N (e.g., 1, 3, 5) of the ranking result, the determination module 412 may determine that the speaker 160 has the Beijing accent and the corresponding likelihood value may be denoted as the (normalized) overall likelihood or 1. Otherwise, the likelihood value related to the Beijing accent may be denoted as 0.
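The aggregation described above (an overall likelihood per accent, normalization across accents, and a minimum threshold) might look like the following sketch; the use of the mean as the overall likelihood and the 0.1 threshold are illustrative choices only.

```python
import numpy as np

def accent_vector_elements(outputs_per_accent, min_threshold=0.1):
    """Aggregate the per-speech outputs of each regional accent model into accent-vector
    elements: take an overall likelihood per accent (here the mean), normalize across
    accents, and zero out accents below a minimum threshold."""
    accents = list(outputs_per_accent)
    overall = np.array([np.mean(outputs_per_accent[a]) for a in accents])  # or max/median
    normalized = overall / overall.sum()
    normalized[normalized < min_threshold] = 0.0
    return dict(zip(accents, normalized.round(3)))

# e.g., three historical speeches scored by two regional accent models
print(accent_vector_elements({"Beijing accent": [0.80, 0.85, 0.70],
                              "Shandong accent": [0.10, 0.05, 0.15]}))
```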
FIG. 7 is a flowchart illustrating an exemplary process for determining an accent vector of a speaker based on an accent classification model according to some embodiments of the present disclosure. The process 700 may be executed by the speech recognition system 100. For example, the process 700 may be implemented as a set of instructions (e.g., an application) stored in storage device 150. The processing engine 112 (e.g., the processing engine 112A) may execute the set of instructions and may accordingly be directed to perform the process 700. In some embodiments, the process 700 may be an embodiment of operation 530 with reference to FIG. 5.
In 710, the processing engine 112A (e.g., the acquisition module 411) may obtain historical audio signals including one or more historical speeches of the speaker. In 720, for each of the one or more historical speeches, the determination module 412 may determine one or more historical acoustic features of the corresponding historical audio signals.  Operations  710 and 720 may be performed in a similar manner with  operations  610 and 620 respectively, and the descriptions thereof are not repeated here.
In 730, the processing engine 112A (e.g., the acquisition module 411) may obtain an accent classification model. The accent classification model may be configured to receive acoustic features of a speech and classify the accent of the speaker who speaks the speech into one or more accent classifications. The  accent classification (s) may include, for example, one or more languages (e.g., English, Japanese, or Spanish) and/or one or more regional accents (e.g., Mandarin, Cantonese, Taiwan accent, American English, British English) . In some embodiments, the classification result may be represented by one or more regional accents that the speaker has. Additionally or alternatively, the classification result may include a probability or likelihood that the speaker has a particular regional accent. For example, the classification result may indicate that the accent of the speaker has a 70%likelihood of Mandarin and a 30%likelihood of Cantonese.
In some embodiments, the acquisition module 411 may acquire the accent classification model from an external data source and/or one or more components of the speech recognition system 100, such as a storage device, the server 110. The obtaining of the accent classification model may be similar to that of the regional accent model (s) as described in connection with operation 630, and the descriptions thereof are not repeated. In some embodiments, the accent classification model may be trained by the server 110 (e.g., the processing engine 112B) and stored in the storage device. For example, the accent classification model may be trained by the processing engine 112B using a set of training samples, each of which may be marked as belonging to a particular accent classification.
In 740, for each of the one or more historical speeches, the processing engine 112A (e.g., the determination module 412) may input the one or more corresponding historical acoustic features into the accent classification model. For a historical speech, the accent classification model may output a classification result regarding the accent of the speaker 160.
In 750, the processing engine 112A (e.g., the determination module 412) may determine the plurality of elements of the accent vector of the speaker 160 based on at least the output (s) of the accent classification model. Each of the elements of the accent vector may correspond to a regional accent and include a likelihood value related to the regional accent as described in connection with FIG. 6.  The likelihood value related to a regional accent may be determined according to the classification result (s) of the accent classification model.
In some embodiments, the classification result corresponding to a historical speech may include one or more regional accents that the speaker 160 has. The determination module 412 may determine an overall likelihood that the speaker 160 has a particular regional accent and designate it as the likelihood value related to the regional accent. Merely by way of example, a classification result based on a historical speech shows that the speaker 160 has a Beijing accent and a Shandong accent, and a classification result based on another historical speech shows that the speaker 160 only has the Beijing accent. The determination module 412 may determine that the overall likelihoods that the speaker 160 has Beijing accent and Shandong accent are 2/3 and 1/3. In some embodiments, the classification result corresponding to the historical speech may include a likelihood that the speaker 160 has a particular regional accent. The determination module 412 may determine an overall likelihood that the speaker 160 has a particular regional accent, and further determine the likelihood value accordingly. Details regarding the determination of a likelihood value related to a regional accent based on its overall likelihood may be found elsewhere in the present disclosure (e.g., operation 650 and the relevant descriptions thereof) .
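The example above (the Beijing accent assigned in two of three assignments, the Shandong accent in one) can be reproduced with the short sketch below; the simple counting scheme is one way to obtain the 2/3 and 1/3 overall likelihoods and is not mandated by the disclosure.

```python
from collections import Counter

def overall_likelihoods(classification_results):
    """Turn per-speech classification results (each a list of regional accents assigned
    to the speaker) into overall likelihoods over all assignments."""
    counts = Counter(a for result in classification_results for a in result)
    total = sum(counts.values())
    return {accent: count / total for accent, count in counts.items()}

# One speech classified as Beijing + Shandong, another as Beijing only -> 2/3 and 1/3
print(overall_likelihoods([["Beijing accent", "Shandong accent"], ["Beijing accent"]]))
```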
FIG. 8 is a flowchart illustrating an exemplary process for generating a trained speech recognition neural network model according to some embodiments of the present disclosure. The process 800 may be executed by the speech recognition system 100. For example, the process 800 may be implemented as a set of instructions (e.g., an application) stored in storage device 150. The processing engine 112 (e.g., the processing engine 112B) may execute the set of instructions and may accordingly be directed to perform the process 800.
In 810, the processing engine 112B (e.g., the acquisition module 421) may obtain sample audio signals including a plurality of sample speeches of a plurality of  sample speakers. Each sample speech of a sample speaker may be encoded or included in one or more sample audio signals. In some embodiments, the sample audio signals may be acquired from one or more components of the speech recognition system 100, such as a storage device (the storage device 150, the ROM 230, the RAM 240, and/or the storage 390) , the input device 130, etc. For example, the storage device 150 may store historical audio signals including a plurality of historical speeches of users of the speech recognition system 100. The historical audio signals may be retrieved from the storage device 150 and designated as the sample audio signals by the acquisition module 421. In some embodiments, the sample audio signals may be acquired from an external data source, such as a speech library, via the network 120. The sample audio signals including a sample speech may be similar to the target audio signals including the speech 170 as described in connection with operation 510, and the descriptions thereof are not repeated.
In 820, for each of the plurality of sample speeches, the processing engine 112B (e.g., the determination module 422) may determine one or more sample acoustic features of the corresponding sample audio signals. For a sample speech, the sample acoustic feature (s) of the corresponding sample audio signals may include a pitch, a speech rate, an LPC, an MFCC, an LPCC, or the like, or any combination thereof. Operation 820 may be performed in a similar manner as operation 520, and the descriptions thereof are not repeated here.
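For illustration only, a few of these acoustic features may be extracted with the open-source librosa library as sketched below. The 16 kHz sample rate, the MFCC order, the LPC order, and the pitch search range are assumptions of this sketch, not values specified by the present disclosure.

    import librosa
    import numpy as np

    def sample_acoustic_features(wav_path, sr=16000):
        y, sr = librosa.load(wav_path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # frame-level MFCCs, shape (13, n_frames)
        lpc = librosa.lpc(y, order=16)                       # LPC coefficients for the whole signal
        f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr)        # frame-level pitch estimates (Hz)
        return {
            "mfcc": mfcc,
            "lpc": lpc,
            "pitch_mean": float(np.mean(f0)),                # a single summary pitch value
        }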
In 830, for each of the plurality of sample speeches, the processing engine 112B (e.g., the acquisition module 421) may determine a sample accent vector of the corresponding sample speaker. A sample accent vector of a sample speaker may be a vector that describes the accent of the sample speaker. The acquisition of the sample accent vector may be similar to that of the accent vector as described in connection with operation 530, and the descriptions thereof are not repeated.
In 840, for each of the plurality of sample speeches, the processing engine 112B (e.g., the acquisition module 421) may acquire a sample local accent of a region from which the sample speech is originated. Operation 840 may be performed in a similar manner as operation 540, and the descriptions thereof are not repeated here.
In 850, the processing engine 112B (e.g., the acquisition module 421) may obtain a preliminary neural network model. Exemplary preliminary neural network models may include a convolutional neural network (CNN) model, an artificial neural network (ANN) model, a recurrent neural network (RNN) model, a deep belief network (DBN) model, a perceptron neural network model, a stacked autoencoder network model, or any other suitable neural network model. In some embodiments, the preliminary neural network model may include one or more preliminary parameters. The one or more preliminary parameter (s) may be adjusted during the training process of the preliminary neural network model. The preliminary parameters may be default settings of the speech recognition system 100, or may be adjustable under different situations. In some embodiments, the preliminary neural network model may include a plurality of processing layers, e.g., an input layer, a hidden layer, an output layer, a convolutional layer, a pooling layer, an activation layer, or the like, or any combination thereof.
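As one purely illustrative realization, a preliminary model with the layer types listed above could be defined in PyTorch as sketched below. The feature dimensions, layer sizes, and phoneme inventory size are placeholders chosen for the sketch and are not fixed by the present disclosure.

    import torch
    import torch.nn as nn

    class PreliminarySpeechModel(nn.Module):
        def __init__(self, acoustic_dim=40, accent_dim=8, local_accent_dim=8, num_phonemes=60):
            super().__init__()
            input_dim = acoustic_dim + accent_dim + local_accent_dim
            self.net = nn.Sequential(
                nn.Linear(input_dim, 256),    # input layer to first hidden layer
                nn.ReLU(),                    # activation layer
                nn.Linear(256, 256),          # second hidden layer
                nn.ReLU(),
                nn.Linear(256, num_phonemes), # output layer: one logit per phoneme
            )

        def forward(self, acoustic, accent_vector, local_accent):
            # Concatenate the acoustic features, the accent vector, and the local accent.
            x = torch.cat([acoustic, accent_vector, local_accent], dim=-1)
            return self.net(x)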
In 860, the processing engine 112B (e.g., the training module 423) may determine a trained speech recognition neural network model by inputting the one or more sample acoustic features, the sample accent vector of the sample speaker, and the sample local accent corresponding to each of the sample speeches into the preliminary neural network model. For brevity, the sample acoustic feature (s) , the sample accent vector of the sample speaker, and the sample local accent corresponding to each of the sample speeches inputted into the preliminary neural network model may be referred to as input data.
In some embodiments, the input data may be inputted into the preliminary neural network model to generate an actual output. The training module 423 may compare the actual output with a desired or correct output to determine a loss function. The desired or correct output may include, for example, desired or correct likelihoods that the input data corresponds to specific phonemes. In some embodiments, the desired or correct likelihoods may be determined based on correct translations (possibly in the format of words or voices) of the sample speeches. The loss function may measure a difference between the actual output and the desired output. During the training process, the training module 423 may update the preliminary parameter (s) to minimize the loss function. In some embodiments, the minimization of the loss function may be iterative. The iteration of minimizing the loss function may terminate when the newly determined value of the loss function is less than a predetermined threshold. The predetermined threshold may be set manually or determined based on various factors, such as the accuracy of the trained speech recognition neural network model, etc.
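A minimal training-loop sketch in this spirit is shown below, assuming the PreliminarySpeechModel above and a data loader that yields batches of (acoustic features, accent vector, local accent, phoneme labels). The Adam optimizer, cross-entropy loss, learning rate, and stopping threshold are illustrative assumptions rather than choices made by the disclosure.

    import torch
    import torch.nn as nn

    def train(model, data_loader, threshold=0.05, max_epochs=100, lr=1e-3):
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()  # measures the difference between actual and desired output
        for epoch in range(max_epochs):
            epoch_loss = 0.0
            for acoustic, accent_vec, local_accent, phoneme_labels in data_loader:
                optimizer.zero_grad()
                logits = model(acoustic, accent_vec, local_accent)  # actual output
                loss = loss_fn(logits, phoneme_labels)
                loss.backward()
                optimizer.step()                                    # update the preliminary parameters
                epoch_loss += loss.item()
            epoch_loss /= max(len(data_loader), 1)
            if epoch_loss < threshold:  # terminate once the loss falls below the threshold
                break
        return model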
It should be noted that the above description of the process 800 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. In some embodiments, one or more additional operations may be added or one or more operations of the process 800 may be omitted. For example, operation 840 may be omitted. In 860, the training module 423 may train the preliminary neural network model using the one or more sample acoustic features and the sample accent vector of the sample speaker corresponding to each of the sample speeches.
FIG. 9 is a schematic diagram illustrating an exemplary trained speech recognition neural network model 910 according to some embodiments of the present disclosure. As shown in FIG. 9, the trained speech recognition neural network model 910 may include a plurality of processing layers, for example, an input layer 911, a number of hidden layers (e.g., 912A and 912B) , and an output layer 913.
To recognize a speech 170 of a speaker 160, the input layer 911 may receive input data related to the speech 170. For example, one or more acoustic feature (s) 930 derived from the speech 170, an accent vector 920 of the speaker 160, and a local accent of a region from which the speech 170 is originated may be provided to the input layer 911 as input data. In some embodiments, the accent vector 920 of the speaker 160 may be determined based on one or more accent models 921, such as regional accent models or accent classification models. A regional accent model or accent classification model may be a trained neural network model for determining and/or classifying an accent of the speaker 160.
The output layer 913 of the trained speech recognition neural network model 910 may generate a model output. The model output may include, for example, likelihoods that a combination of the acoustic feature (s) 930, the accent vector 920, and the local accent represents a particular phoneme. In some embodiments, the processing engine 112A (e.g., the translation module 413) may further transform the model output into a transcription in a target content form, such as voices, words, etc. The transcription may be transmitted to a user terminal for presentation.
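For illustration only, the inference path through the model 910 might look like the following sketch, which reuses the PreliminarySpeechModel class above. The per-frame greedy (argmax) decoding and the phoneme table are simplifications assumed for the sketch, not a decoding strategy prescribed by the present disclosure.

    import torch

    @torch.no_grad()
    def recognize(model, acoustic_frames, accent_vector, local_accent, phoneme_table):
        # acoustic_frames: tensor of shape (n_frames, acoustic_dim)
        # accent_vector, local_accent: 1-D tensors describing the speaker and the region
        n_frames = acoustic_frames.shape[0]
        accent = accent_vector.expand(n_frames, -1)        # repeat the speaker accent for every frame
        local = local_accent.expand(n_frames, -1)          # repeat the local accent for every frame
        logits = model(acoustic_frames, accent, local)     # (n_frames, num_phonemes)
        likelihoods = torch.softmax(logits, dim=-1)        # model output: per-frame phoneme likelihoods
        phoneme_ids = likelihoods.argmax(dim=-1).tolist()  # greedy decoding
        return [phoneme_table[i] for i in phoneme_ids]     # transcription in phoneme form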
It should be noted that the example illustrated in FIG. 9 is merely provided for the purposes of illustration, and not intended to limit the scope of the present disclosure. For persons having ordinary skills in the art, multiple variations and modifications may be made under the teachings of the present disclosure. However, those variations and modifications do not depart from the scope of the present disclosure. For example, the trained speech recognition neural network model 910 may include other types of processing layers, such as a convolutional layer, a pooling layer, an activation layer, or the like, or any combination thereof. As another example, the trained speech recognition neural network model 910 may include any number of processing layers. As yet another example, the input data may include the acoustic feature (s) 930 and the accent vector 920 without the local accent.
Having thus described the basic concepts, it may be rather apparent to those skilled in the art after reading this detailed disclosure that the foregoing detailed disclosure is intended to be presented by way of example only and is not limiting. Various alterations, improvements, and modifications may occur to and are intended for those skilled in the art, though not expressly stated herein. These alterations, improvements, and modifications are intended to be suggested by this disclosure, and are within the spirit and scope of the exemplary embodiments of this disclosure.
Moreover, certain terminology has been used to describe embodiments of the present disclosure. For example, the terms “one embodiment, ” “an embodiment, ” and/or “some embodiments” mean that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present disclosure. Therefore, it is emphasized and should be appreciated that two or more references to “an embodiment” or “one embodiment” or “an alternative embodiment” in various portions of this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures or characteristics may be combined as suitable in one or more embodiments of the present disclosure.
Further, it will be appreciated by one skilled in the art that aspects of the present disclosure may be illustrated and described herein in any of a number of patentable classes or contexts, including any new and useful process, machine, manufacture, or composition of matter, or any new and useful improvement thereof. Accordingly, aspects of the present disclosure may be implemented entirely in hardware, entirely in software (including firmware, resident software, micro-code, etc. ) , or in a combination of software and hardware implementations that may all generally be referred to herein as a “unit, ” “module, ” or “system. ” Furthermore, aspects of the present disclosure may take the form of a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including electro-magnetic, optical, or the like, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that may communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable signal medium may be transmitted using any appropriate medium, including wireless, wireline, optical fiber cable, RF, or the like, or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Scala, Smalltalk, Eiffel, JADE, Emerald, C++, C#, VB. NET, Python or the like, conventional procedural programming languages, such as the "C" programming language, Visual Basic, Fortran 2003, Perl, COBOL 2002, PHP, ABAP, dynamic programming languages such as Python, Ruby and Groovy, or other programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN) , or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider) or in a cloud computing environment or offered as a service such as a Software as a Service  (SaaS) .
Furthermore, the recited order of processing elements or sequences, or the use of numbers, letters, or other designations therefore, is not intended to limit the claimed processes and methods to any order except as may be specified in the claims. Although the above disclosure discusses through various examples what is currently considered to be a variety of useful embodiments of the disclosure, it is to be understood that such detail is solely for that purpose, and that the appended claims are not limited to the disclosed embodiments, but, on the contrary, are intended to cover modifications and equivalent arrangements that are within the spirit and scope of the disclosed embodiments. For example, although the implementation of various components described above may be embodied in a hardware device, it may also be implemented as a software only solution, e.g., an installation on an existing server or mobile device.
Similarly, it should be appreciated that in the foregoing description of embodiments of the present disclosure, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed subject matter requires more features than are expressly recited in each claim. Rather, claimed subject matter may lie in less than all features of a single foregoing disclosed embodiment.

Claims (20)

  1. A system, comprising:
    at least one audio signal input device configured to receive a speech of a speaker;
    at least one storage medium including a set of instructions for speech recognition;
    at least one processor in communication with the at least one storage medium, wherein when executing the set of instructions, the at least one processor is directed to:
    obtain, from the audio signal input device, target audio signals including the speech of the speaker;
    determine one or more acoustic features of the target audio signals;
    obtain at least one accent vector of the speaker;
    input the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form; and
    generate an interface through an output device to present the speech in the target content form.
  2. The system of claim 1, wherein to input the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into the trained speech recognition neural network model to translate the speech into the target content form, the at least one processor is further directed to:
    obtain a local accent of a region from which the speech is originated; and
    input the one or more acoustic features of the target audio signals, the at least one accent vector of the speaker, and the local accent into the trained speech recognition neural network model.
  3. The system of claim 1, wherein the at least one accent vector includes a plurality of elements, each of which corresponds to a regional accent and includes a likelihood value related to the regional accent.
  4. The system of claim 3, wherein the at least one accent vector of the speaker is determined according to an accent determining process, the accent determining process including:
    obtaining historical audio signals including one or more historical speeches of the speaker;
    for each of the one or more historical speeches, determining one or more historical acoustic features of the corresponding historical audio signals;
    obtaining one or more regional accent models;
    for each of the one or more historical speeches, inputting the one or more corresponding historical acoustic features into each of the one or more regional accent models; and
    determining the plurality of elements of the at least one accent vector of the speaker based on at least an output of the one or more regional accent models.
  5. The system of claim 3, wherein the at least one accent vector of the speaker is determined according to an accent determining process, the accent determining process including:
    obtaining historical audio signals including one or more historical speeches of the speaker;
    for each of the one or more historical speeches, determining one or more historical acoustic features of corresponding historical audio signals;
    obtaining an accent classification model;
    for each of the one or more historical speeches, inputting the corresponding historical acoustic features into the accent classification model; and
    determining the plurality of elements of the at least one accent vector of the  speaker based on at least an output of the accent classification model.
  6. The system of claim 1, wherein the trained speech recognition neural network model is generated by at least one computing device according to a training process, the training process including:
    obtaining sample audio signals including a plurality of sample speeches of a plurality of sample speakers;
    for each of the plurality of sample speeches, determining one or more sample acoustic features of the corresponding sample audio signals;
    for each of the plurality of sample speeches, obtaining at least one sample accent vector of the corresponding sample speaker;
    obtaining a preliminary neural network model; and
    determining the trained speech recognition neural network model by inputting the one or more sample acoustic features and the at least one sample accent vector of the sample speaker corresponding to each of the plurality of sample speeches into the preliminary neural network model.
  7. The system of claim 6, wherein the determining of the trained speech recognition neural network model includes:
    for each of the plurality of sample speeches, obtaining a sample local accent of a region from which the sample audio signal is originated; and
    determining the trained speech recognition neural network model by inputting the one or more sample acoustic features, the at least one sample accent vector of the sample speaker, and the sample local accent corresponding to each of the plurality of sample speeches into the preliminary neural network model.
  8. The system of claim 1, wherein the target content form includes at least one of phoneme, syllable, or character.
  9. The system of claim 1, wherein to input the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into the trained speech recognition neural network model to translate the speech into the target content form, the at least one processor is directed to:
    input the one or more acoustic features and the at least one accent vector of the speaker into the trained speech recognition neural network model; and
    translate the speech into the target content form based on at least an output of the trained neural network model.
  10. A method implemented on a computing device having at least one processor and at least one storage medium, the method comprising:
    obtaining, from an audio signal input device, target audio signals including a speech of a speaker;
    determining one or more acoustic features of the target audio signals;
    obtaining at least one accent vector of the speaker;
    inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form; and
    generating an interface through an output device to present the speech in the target content form.
  11. The method of claim 10, wherein the inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into the trained speech recognition neural network model to translate the speech into the target content form comprises:
    obtaining a local accent of a region from which the speech is originated; and
    inputting the one or more acoustic features of the target audio signals, the at least one accent vector of the speaker, and the local accent into the trained  speech recognition neural network model.
  12. The method of claim 10, wherein the at least one accent vector includes a plurality of elements, each of which corresponds to a regional accent and includes a likelihood value related to the regional accent.
  13. The method of claim 12, wherein the at least one accent vector of the speaker is determined according to an accent determining process, the process comprising:
    obtaining historical audio signals including one or more historical speeches of the speaker;
    for each of the one or more historical speeches, determining one or more historical acoustic features of the corresponding historical audio signals;
    obtaining one or more regional accent models;
    for each of the one or more historical speeches, inputting the one or more corresponding historical acoustic features into each of the one or more regional accent models; and
    determining the plurality of elements of the at least one accent vector of the speaker based on at least an output of the one or more regional accent models.
  14. The method of claim 12, wherein the at least one accent vector of the speaker is determined according to an accent determining process, the accent determining process comprising:
    obtaining historical audio signals including one or more historical speeches of the speaker;
    for each of the one or more historical speeches, determining one or more historical acoustic features of corresponding historical audio signals;
    obtaining an accent classification model;
    for each of the one or more historical speeches, inputting the corresponding  historical acoustic features into the accent classification model; and
    determining the plurality of elements of the at least one accent vector of the speaker based on at least an output of the accent classification model.
  15. The method of claim 10, wherein the trained speech recognition neural network model is generated by at least one computing device according to a training process, the training process comprising:
    obtaining sample audio signals including a plurality of sample speeches of a plurality of sample speakers;
    for each of the plurality of sample speeches, determining one or more sample acoustic features of the corresponding sample audio signals;
    for each of the plurality of sample speeches, determining at least one sample accent vector of the corresponding sample speaker;
    obtaining a preliminary neural network model; and
    determining the trained speech recognition neural network model by inputting the one or more sample acoustic features and the at least one sample accent vector of the sample speaker corresponding to each of the plurality of sample speeches into the preliminary neural network model.
  16. The method of claim 15, wherein the determining of the trained speech recognition neural network model comprises:
    for each of the plurality of sample speeches, obtaining a sample local accent of a region from which the sample audio signal is originated; and
    determining the trained speech recognition neural network model by inputting the one or more sample acoustic features, the at least one sample accent vector of the sample speaker, and the sample local accent corresponding to each of the plurality of sample speeches into the preliminary neural network model.
  17. The method of claim 10, wherein the target content form includes at least one of phoneme, syllable, or character.
  18. The method of claim 10, wherein the inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into the trained speech recognition neural network model to translate the speech into the target content form comprises:
    inputting the one or more acoustic features and the at least one accent vector of the speaker into the trained speech recognition neural network model; and
    translating the speech into the target content form based on at least an output of the trained neural network model.
  19. A non-transitory computer readable medium comprising executable instructions that, when executed by at least one processor, cause the at least one processor to effectuate a method comprising:
    obtaining, from an audio signal input device, target audio signals including a speech of a speaker;
    determining one or more acoustic features of the target audio signals;
    obtaining at least one accent vector of the speaker;
    inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into a trained speech recognition neural network model to translate the speech into a target content form; and
    generating an interface through an output device to present the speech in the target content form.
  20. The non-transitory computer readable medium of claim 19, wherein the inputting the one or more acoustic features of the target audio signals and the at least one accent vector of the speaker into the trained speech recognition neural network model to translate the speech into the target content form comprises:
    obtaining a local accent of a region from which the speech is originated; and
    inputting the one or more acoustic features of the target audio signals, the at least one accent vector of the speaker, and the local accent into the trained speech recognition neural network model.