CN120475003A - Voice processing method, device, equipment, medium and program product

- Publication number: CN120475003A (application number CN202510553886.0A)
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Abstract
The embodiment of the application discloses a voice processing method, apparatus, device, medium and program product. The method includes: receiving a voice signal and displaying an interactive dialogue interface; displaying interactive dialogue stream information triggered by the voice signal in an interactive dialogue area; and displaying content information related to the interactive dialogue stream information in a content display area. The embodiment of the application can enrich the content displayed during an interactive dialogue and improve the dialogue quality and dialogue efficiency of the interactive dialogue.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to the field of artificial intelligence, and more particularly to a speech processing method, a speech processing apparatus, a computer device, a computer readable storage medium and a computer program product.
Background
Interactive dialogue refers to the process of completing voice tasks through two-way, dynamic information transfer between a user and a machine.
Currently, during an interactive dialogue between a user and a system, the system only displays an answer to the query request the user inputs. This answer-only form of interactive dialogue is simple and one-dimensional: the referenceable information provided to the user in each dialogue turn is limited, which not only makes it difficult for the user to reach a good decision based on that limited information, but also reduces the dialogue quality and dialogue efficiency of the interactive dialogue.
Disclosure of Invention
The embodiment of the application provides a voice processing method, apparatus, device, medium and program product, which can enrich the content displayed during the interactive dialogue process and improve the dialogue quality and dialogue efficiency of the interactive dialogue.
In one aspect, an embodiment of the present application provides a method for processing speech, including:
receiving a voice signal and displaying an interactive dialogue interface, wherein the interactive dialogue interface comprises an interactive dialogue area and a content display area;
displaying interactive dialogue stream information triggered by the voice signal in the interactive dialogue area; and
displaying content information related to the interactive dialogue stream information in the content display area.
In another aspect, an embodiment of the present application provides a speech processing apparatus, including:
a receiving unit configured to receive the voice signal and display an interactive dialogue interface, wherein the interactive dialogue interface comprises an interactive dialogue area and a content display area;
a processing unit configured to display interactive dialogue stream information triggered by the voice signal in the interactive dialogue area; and
the processing unit is further configured to display content information related to the interactive dialogue stream information in the content display area.
In one implementation, the interactive dialogue stream information comprises N rounds of interactive dialogue information, and each round of interactive dialogue information comprises problem description information and problem result information;
the content information associated with the interactive session stream information includes at least one of:
one or more pieces of candidate problem description information matching the dialogue intent of an ith round of interactive dialogue information, where the ith round of interactive dialogue information is displayed in the interactive dialogue area, i is an integer, and 1≤i≤N; and
target content information related to target dialogue information in the ith round of interactive dialogue information, the target content information comprising at least one of detail information, source information and comment information of the target dialogue information, where the target dialogue information is determined by default or is user-specified.
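For illustration, the following Python sketch models the data relationships described in this implementation: a stream of N rounds, each pairing problem description information with problem result information, plus the related content information. All class, field and method names are assumptions for this sketch, not the patent's implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DialogueRound:
    """One round of interactive dialogue information."""
    problem_description: str   # the user Query of this round
    problem_result: str        # the machine-generated answer

@dataclass
class TargetContent:
    """Content information related to a piece of target dialogue information."""
    details: Optional[str] = None                       # detail information
    sources: List[str] = field(default_factory=list)    # source information
    comments: List[str] = field(default_factory=list)   # comment information

@dataclass
class DialogueStream:
    """Interactive dialogue stream information: N rounds, indexed 1..N."""
    rounds: List[DialogueRound] = field(default_factory=list)

    def candidate_problems(self, i: int) -> List[str]:
        # Candidate problem description information matching the dialogue
        # intent of round i; the matching logic is application-defined.
        assert 1 <= i <= len(self.rounds)
        return []  # placeholder: intent matching is out of scope here
```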
In one implementation, the interactive dialogue stream information comprises N rounds of interactive dialogue information, the ith round of interactive dialogue information comprises first problem description information and first problem result information, and the first problem result information comprises a resource image corresponding to a video media resource;
when displaying the content information related to the interactive dialogue stream information in the content display area, the processing unit is specifically configured to:
display, by default, target content information related to the resource image in the content display area.
In one implementation, the target content information related to the resource image is a video media resource, and the processing unit is further configured to:
play the video media resource in the content display area;
receive a pause-playing operation for the video media resource during playback of the video media resource; and
pause, in response to the pause-playing operation for the video media resource, the playback of the video media resource in the content display area.
In one implementation, the interactive dialogue stream information comprises N rounds of interactive dialogue information, and when displaying the content information related to the interactive dialogue stream information in the content display area, the processing unit is specifically configured to:
display the target dialogue information in a selected state in the interactive dialogue area in response to an information viewing operation performed on target dialogue information in the ith round of interactive dialogue information in the interactive dialogue area; and
display target content information related to the target dialogue information in the content display area.
In one implementation, there is at least one piece of content information related to the interactive dialogue stream information, the content display area includes at least one content type option, each content type option corresponding to one type of content information, and the processing unit is further configured to:
display, in response to a selection operation on a target content type option, content information corresponding to the selected target content type option in the content display area, where the target content type option is any one of the at least one content type option.
In one implementation, the processing unit is further configured to:
maintain the content display area displaying content information related to the ith round of interactive dialogue information while dialogue viewing operations are performed on the ith round of interactive dialogue information in the interactive dialogue area; and
when a dialogue viewing operation switching from the ith round of interactive dialogue information to a jth round of interactive dialogue information is detected in the interactive dialogue area, update the content display area from content information related to the ith round of interactive dialogue information to content information related to the jth round of interactive dialogue information, where j is a positive integer, j≠i, and 1≤j≤N.
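The synchronization rule above can be made concrete with a small Python sketch; the class and function names are illustrative assumptions rather than part of the embodiment.

```python
from typing import Optional

class ContentPane:
    """Stands in for the content display area."""
    def __init__(self) -> None:
        self.current_round: Optional[int] = None

    def show(self, round_index: int) -> None:
        print(f"displaying content related to round {round_index}")
        self.current_round = round_index

def on_dialogue_view(pane: ContentPane, viewed_round: int) -> None:
    # Viewing operations inside the same round leave the pane unchanged;
    # switching from round i to round j (j != i) refreshes the pane.
    if pane.current_round != viewed_round:
        pane.show(viewed_round)

pane = ContentPane()
on_dialogue_view(pane, 1)  # first view: pane shows round 1
on_dialogue_view(pane, 1)  # same round: pane is kept as-is
on_dialogue_view(pane, 2)  # switch i -> j: pane updates to round 2
```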
In one implementation, the content display area includes dialogue options corresponding to each round of interactive dialogue information in the interactive dialogue stream information, and the processing unit is further configured to:
display, in response to a trigger operation on a target dialogue option in the content display area, content information related to the target interactive dialogue information corresponding to the target dialogue option in the content display area;
the target interactive dialogue information is displayed in the interactive dialogue area.
In one implementation, there are at least two pieces of target dialogue information in the ith round of interactive dialogue information, and the processing unit is further configured to:
when a content locating operation is performed on specified target dialogue information in the interactive dialogue area, display target content information related to the specified target dialogue information in the content display area.
In one implementation, the voice processing method is applied to a media resource application program running in a resource playing device, and when receiving the voice signal, the receiving unit is specifically configured to:
display a service interface of the resource playing device, wherein the service interface comprises voice prompt information in a prompt state;
switch the voice prompt information from the prompt state to a wake state in response to a confirmation operation on the voice prompt information in the prompt state;
switch the voice prompt information from the wake state to a collecting state when the resource playing device starts to collect the voice signal, wherein the voice prompt information in the collecting state indicates that the resource playing device is collecting the voice signal; and
switch the voice prompt information from the collecting state to an understanding state when the resource playing device has finished collecting the voice signal, wherein the voice prompt information in the understanding state indicates that the resource playing device is performing result search processing based on the voice signal.
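The four prompt states above form a simple state machine: prompt, wake, collecting, understanding. The following minimal Python sketch models it under assumed names; the patent does not prescribe an implementation.

```python
from enum import Enum, auto

class PromptState(Enum):
    PROMPT = auto()         # shown on the service interface, awaiting confirmation
    WAKE = auto()           # user confirmed; device is ready to listen
    COLLECTING = auto()     # device is capturing the voice signal
    UNDERSTANDING = auto()  # capture finished; result search processing runs

_TRANSITIONS = {
    PromptState.PROMPT: PromptState.WAKE,               # confirmation operation
    PromptState.WAKE: PromptState.COLLECTING,           # collection starts
    PromptState.COLLECTING: PromptState.UNDERSTANDING,  # collection completes
}

def advance(state: PromptState) -> PromptState:
    # UNDERSTANDING is terminal in this sketch; the result is then rendered
    # in the interactive dialogue area.
    return _TRANSITIONS.get(state, state)
```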
In one implementation, the resource playing device is a device that delivers content over the internet, wherein
the display modes of the interactive dialogue area and the content display area in the interactive dialogue interface include horizontally arranged display, vertically arranged display and mosaic display.
In one implementation, the result search processing includes intent recognition processing and result generation processing, and the processing unit is further configured to:
acquire basic analysis data, wherein the basic analysis data comprises one or more of scene data, object data, permission data and intent list data;
call a fine-tuned generative model to perform intent recognition processing on the voice signal based on the basic analysis data, obtaining an intent recognition result; and
if the intent recognition result indicates that the voice task indicated by the voice signal is a dialogue-type task, perform result generation processing according to the intent recognition result to generate problem result information corresponding to the voice signal, wherein the problem result information and the problem description information corresponding to the voice signal form one round of interactive dialogue information in the interactive dialogue stream information.
In one implementation, the processing unit is further configured to:
if the intent recognition result indicates that the voice task indicated by the voice signal is a direct-type task, perform task processing on the direct-type task;
the direct-type tasks include at least one of a control task, an interface switching task and a player control task.
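Taken together, the intent recognition and the dispatch between dialogue-type and direct-type tasks might be sketched as follows in Python; the model call is a stand-in and all names are assumptions.

```python
from typing import Callable, Dict

def recognize_intent(voice_text: str, analysis_data: dict) -> dict:
    # Stand-in for the fine-tuned generative model call; analysis_data may
    # carry scene data, object data, permission data and intent list data.
    return {"task_type": "dialogue", "intent": voice_text}

DIRECT_HANDLERS: Dict[str, Callable[[], None]] = {
    "control": lambda: print("adjusting a device setting"),
    "interface_switch": lambda: print("switching the interface"),
    "player_control": lambda: print("controlling the player"),
}

def handle_voice_task(voice_text: str, analysis_data: dict) -> None:
    result = recognize_intent(voice_text, analysis_data)
    handler = DIRECT_HANDLERS.get(result["task_type"])
    if handler is not None:
        handler()  # direct-type task: execute immediately
    else:
        # dialogue-type task: run result generation to build one round of
        # problem description information / problem result information
        print("generating problem result information for:", result["intent"])

handle_voice_task("find me a comedy", {"scene": "living-room TV"})
```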
In one implementation, when performing result generation processing according to the intent recognition result to generate the problem result information corresponding to the voice signal, the processing unit is specifically configured to:
acquire at least one round of interactive dialogue information from the interactive dialogue stream information;
perform intranet resource search processing according to the dialogue intent of the at least one round of interactive dialogue information to obtain intranet resources, and
perform extranet resource search processing according to the dialogue intent of the at least one round of interactive dialogue information to obtain extranet resources; and
perform resource recombination on the intranet resources and the extranet resources to generate the problem result information corresponding to the voice signal.
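A hedged Python sketch of this result generation step is given below; the search functions are stand-ins, since the actual intranet/extranet search and recombination logic is service-specific.

```python
from typing import List

def search_intranet(intent: str) -> List[str]:
    # e.g. the platform's own media library
    return [f"intranet resource for '{intent}'"]

def search_extranet(intent: str) -> List[str]:
    # e.g. public web results
    return [f"extranet resource for '{intent}'"]

def generate_problem_result(rounds: List[dict]) -> str:
    # derive the dialogue intent from at least one round of history
    intent = rounds[-1]["problem_description"]
    internal = search_intranet(intent)   # intranet resource search processing
    external = search_extranet(intent)   # extranet resource search processing
    # resource recombination: merge both pools into one piece of
    # problem result information
    return "; ".join(internal + external)

print(generate_problem_result([{"problem_description": "family movies"}]))
```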
In yet another aspect, an embodiment of the present application provides a computer apparatus, including:
a processor adapted to execute a computer program;
a computer readable storage medium having a computer program stored therein, which when executed by a processor, implements the above-described speech processing method.
In yet another aspect, embodiments of the present application provide a computer readable storage medium storing a computer program adapted to be loaded by a processor and to perform the above-described speech processing method.
In yet another aspect, an embodiment of the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described speech processing method.
In an embodiment of the application, a computer device may receive a user's voice signal collected from the physical environment and display an interactive dialogue interface for presenting the interactive dialogue between the user and the computer device. The interactive dialogue interface displays content in partitions, specifically an interactive dialogue area and a content display area. Interactive dialogue stream information triggered by the voice signal is displayed in the interactive dialogue area; because this stream comprises every round of interactive dialogue information in the form of problem description information followed by problem result information, the user can determine answers more conveniently, which improves the question-and-answer experience. Meanwhile, content information related to the interactive dialogue stream information is displayed in the content display area as a further information supplement to the stream, which improves the dialogue quality of the human-computer interaction. The user can make decisions by combining the interactive dialogue stream information in the interactive dialogue area with the content information in the content display area, which reduces the number of dialogue turns to a certain extent, shortens the user's decision path, and thus effectively improves dialogue efficiency.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1a is a schematic diagram of a prior art interactive dialog interface;
FIG. 1b is a schematic illustration of an interactive dialog interface provided in accordance with an exemplary embodiment of the present application;
FIG. 2a is a schematic diagram of an interactive dialog system architecture according to an exemplary embodiment of the present application;
FIG. 2b is a schematic diagram of the architecture of another interactive dialog system provided in accordance with an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method for speech processing according to an exemplary embodiment of the present application;
FIG. 4a is a schematic diagram of an interactive dialog interface displayed in a display screen, provided in accordance with an exemplary embodiment of the present application;
FIG. 4b is a schematic diagram of another interactive dialog interface displayed in a display screen in accordance with an exemplary embodiment of the present application;
FIG. 4c is a schematic diagram of a display of an interactive session area and a content presentation area according to an exemplary embodiment of the present application;
FIG. 4d is a schematic diagram of a display of another interactive session area and content presentation area provided by an exemplary embodiment of the present application;
FIG. 5a is a schematic diagram of an image-and-text type interactive dialogue message, provided in accordance with an exemplary embodiment of the present application;
FIG. 5b is a schematic diagram of a plain text type interactive dialog message, as provided by an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram of displaying candidate problem description information in a content presentation area, provided by an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram of displaying content information related to target dialog information in a content presentation area in accordance with an exemplary embodiment of the present application;
FIG. 8 is a schematic diagram of previewing video media assets in a content presentation area according to one exemplary embodiment of the present application;
FIG. 9a is a schematic diagram of an information viewing operation provided by an exemplary embodiment of the present application;
FIG. 9b is a schematic diagram of another information viewing operation provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a content type option provided by an exemplary embodiment of the present application;
FIG. 11 is a schematic diagram of a coordinated display of an interactive dialog region and a content presentation region provided in accordance with an exemplary embodiment of the present application;
FIG. 12a is a schematic diagram of a dialog option provided by an exemplary embodiment of the present application;
FIG. 12b is a schematic diagram of a dialog option and content type option displayed simultaneously in a content presentation area in accordance with an exemplary embodiment of the present application;
FIG. 13 is a flow chart of another speech processing method according to an exemplary embodiment of the present application;
FIG. 14 is a schematic diagram of voice prompts in different states provided by an exemplary embodiment of the present application;
FIG. 15a is a schematic diagram of a direct class task provided by an exemplary embodiment of the present application;
FIG. 15b is a schematic diagram of another direct class task provided by an exemplary embodiment of the present application;
FIG. 15c is a schematic diagram of yet another direct class task provided by an exemplary embodiment of the present application;
FIG. 16 is a background logic flow diagram of a speech processing method according to an exemplary embodiment of the present application;
FIG. 17 is a schematic diagram of a training data construct provided by an exemplary embodiment of the present application;
FIG. 18 is a flow chart of a model training provided by an exemplary embodiment of the present application;
FIG. 19 is a logical illustration of a video model service provided in accordance with an exemplary embodiment of the present application;
FIG. 20 is a schematic diagram of pausing the response to problem result information in an interactive dialogue area in accordance with an exemplary embodiment of the present application;
FIG. 21 is a schematic diagram of streaming response content information in a content presentation area in accordance with an exemplary embodiment of the present application;
Fig. 22 is a schematic diagram of a voice processing apparatus according to an exemplary embodiment of the present application;
fig. 23 is a schematic structural view of a computer device according to an exemplary embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application provides a voice processing scheme, in particular an AI voice processing scheme that realizes interactive dialogue based on a large language model. To facilitate understanding, the technical terms and related concepts involved in the voice processing scheme are briefly introduced below, wherein:
1. an interactive dialog.
An interactive dialogue may take the form of intelligent question answering (Question Answering, QA), an intelligent dialogue and the like. It belongs to the field of human-computer interaction, is an advanced form of information retrieval system, and, with the rapid development and application of artificial intelligence (Artificial Intelligence, AI), is an attractive direction with broad development prospects. An interactive dialogue may be understood as a human-computer interaction process in which a user and a machine communicate through natural language, and the machine dynamically understands the user Query issued by the user and completes a voice task (also called a voice target) in combination with the context. Wherein:
(1) A machine refers to an application program, a code program, an agent (Agent) and the like having functions such as semantic understanding, intent recognition and result generation. ① An application program is a computer program that completes one or more specific tasks. Classified by operation mode, application programs may include, but are not limited to: a client that requires downloading an installation package, deploying it in a terminal and realizing intelligent question answering by running it; an applet that requires no installation package and runs as a sub-program on a client; and a Web application opened and run through a browser in a terminal. ② A code program may be understood as a piece of code that, when loaded, is capable of performing the functions described above. ③ An agent refers to a physical robot, or a virtual robot, capable of information transfer, decision making and action execution with a user in a physical environment (i.e., a real-world environment) to accomplish a goal. A virtual robot may be called an intelligent assistant (or intelligent assistant application): intelligent software deployed in a hardware device that realizes rich functions such as information query, life services, and learning and tutoring through human-computer interaction with users in forms such as voice, text and images.
There may be overlap between the application programs, code programs and agents mentioned above. For example, an agent such as an intelligent assistant may be integrated into an application program, so that the intelligent assistant can be woken up or invoked in the application program to help the user carry out interactive dialogues within it. For ease of explanation, the following takes as an example the human-computer interaction between a user and an application program, in particular an interactive dialogue with an intelligent assistant integrated in the application program.
(2) The interactive dialogue process may comprise N rounds of interactive dialogue, where N is a positive integer. Each of the N rounds corresponds to one round of interactive dialogue information, and each round of interactive dialogue information comprises problem description information input by the user and problem result information generated by the machine for that problem description information. ① Problem description information refers to a request, question or instruction input by the user to the machine. It may be called a user Query (or simply a Query) and is usually expressed in text, voice or other interactive forms (such as gestures and images); the Query is the starting point that triggers the machine's response or service and is also the core processing object of the interactive dialogue. ② Problem result information is the result the machine returns to the user after processing the problem described by the problem description information, where the processing performed by the machine may sequentially include, but is not limited to, operations such as intent understanding, knowledge search and result generation. The machine can output the problem result information in accurate, concise natural language to reply to the user Query.
To optimize the user's intelligent question-answering experience, interactive dialogue information is typically presented in a streaming-response manner. The streaming response has two aspects. On the one hand, the N rounds of interactive dialogue are presented as a stream: after one round of interactive dialogue information has been presented, the user may continue to send a new user Query, and the machine generates problem result information for the new user Query by combining it with the interactive dialogue information of the historical rounds; that is, the N rounds of interactive dialogue information contain the problem description information and problem result information corresponding to each round of a continuous question-and-answer. On the other hand, when processing a user Query, instead of returning the problem result information only once it has been fully generated, the machine returns the already-generated parts of the problem result information in real time during generation; in other words, the machine feeds back generated sub-information belonging to the problem result information to the user while continuing to generate the remaining sub-information. Streaming response thus enables multi-round interactive dialogue, helps better context understanding, reduces the delay in outputting problem result information, improves user experience, and suits interactive dialogue scenarios with frequent real-time updates or large data volumes.
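The second aspect of the streaming response can be illustrated with a minimal Python generator; this is a sketch under assumed names, not the patent's implementation.

```python
from typing import Iterator

def generate_answer_stream(query: str) -> Iterator[str]:
    answer_chunks = ["The movie ", "you asked about ", "is available now."]
    for chunk in answer_chunks:
        # each already-generated sub-piece is returned immediately,
        # while the remaining sub-pieces are still being generated
        yield chunk

for piece in generate_answer_stream("find me a movie"):
    print(piece, end="", flush=True)  # render incrementally in the dialogue area
print()
```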
2. Artificial intelligence.
Artificial intelligence is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. A machine learning model in the artificial intelligence field is a network model obtained through model training using machine learning. Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. Machine learning typically includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, learning from demonstrations, and deep learning (Deep Learning, DL).
The embodiment of the application mainly relates to the large language model (Large Language Model, LLM), which belongs to the deep learning field of machine learning. A large language model is a general language processing model based on deep learning technology and a massive parameter scale (generally tens of billions to trillions of parameters). It has gradually become a core technology in the field of natural language processing (Natural Language Processing, NLP), driving technical innovation from traditional tasks to generative artificial intelligence. Through large-scale pre-training, large language models achieve remarkable performance improvements in tasks such as text generation, question-answering systems and machine translation, and show strong generalization capability and context understanding capability. As the model scale continues to expand from the initial millions of parameters to hundreds of billions or even trillions of parameters, the performance of large language models grows approximately log-linearly with model scale, which further promotes their wide adoption in scientific research and industrial applications.
A large language model is essentially a generative model with the following characteristics: ① strong generalization capability and multi-round result generation capability: it can recognize long Queries (such as longer voice signals or longer text strings), and thus better handles complex user Queries and previously unseen Queries; ② strong understanding capability: it can better understand the user's intent, so user intent can be recognized by fine-tuning the model in combination with the service; ③ content generation capability: by fine-tuning (or fine adjustment of) the base large language model, the fine-tuned model can combine its generation with the higher-quality content recalled by the service.
In conventional practice, an interactive dialogue is presented only through interactive dialogue stream information, that is, by generating and displaying problem result information for the problem description information input by the user. This presentation mode suffers from a one-dimensional dialogue: questions are answered only through problem result information, so the referenceable information that can be provided to the user is limited, and the user has to perform human-computer interaction many times, attempting to reach a more accurate answer through multiple rounds of interactive dialogue.
The voice processing scheme provided by the embodiment of the application improves the presentation of the interactive dialogue. In particular, it provides the user with interactive dialogue stream information, in which the problem result information directly replies to the problem description information input by the user, and additionally provides the user with content information related to the interactive dialogue stream information, which serves as a further information supplement and offers more referenceable information. The user can thus make decisions by combining the interactive dialogue stream information with the related content information, which significantly improves dialogue quality and reduces the number of dialogue turns to a certain extent, thereby effectively improving dialogue efficiency.
Specifically, the interactive dialogue presentation flow of the voice processing scheme provided by the embodiment of the application can be roughly described as follows: receive a voice signal input by a user, where the voice signal is an acoustic wave signal carrying language information that the user produces through the vocal organs (such as the vocal cords and oral cavity) and through which the user expresses or conveys a dialogue intent; display an interactive dialogue interface comprising an interactive dialogue area and a content display area; display interactive dialogue stream information triggered by the voice signal in the interactive dialogue area; and display content information related to the interactive dialogue stream information in the content display area.
It has been found by practice that the content presentation when an interactive dialog is realized using the speech processing scheme provided by the embodiments of the present application has significant advantages. The following describes advantages of the embodiment of the present application by taking comparison of content presentation performed by the scheme of the present application and the existing speech processing scheme as an example, wherein:
A schematic diagram of content presentation in an existing voice processing scheme is shown in FIG. 1a: only interactive dialogue stream information is presented in the interactive dialogue interface, and the user must find the desired information in the problem result information and decide on that basis. When the user is not satisfied with the current problem result information, the user needs to input different problem description information multiple times, hoping that the machine will eventually give problem result information that better matches the intent. Clearly, such a single-content presentation provides limited referenceable information, forcing many rounds of human-computer interaction and reducing dialogue quality and dialogue efficiency. In contrast, a schematic diagram of content presentation in the voice processing scheme provided by the embodiment of the application is shown in FIG. 1b: the interactive dialogue interface displays content in partitions, being divided into the interactive dialogue area and the content display area. The interactive dialogue area presents the interactive dialogue stream information, helping the user experience streaming question-and-answer, improving human-computer interactivity while better matching the user's query habits. Besides the interactive dialogue area, the interactive dialogue interface also provides the content display area, which displays content information related to the interactive dialogue stream information as a supplement, offering the user more reference material. The user can therefore decide faster and more accurately by combining the richer interactive dialogue stream information with its related content information, significantly improving the dialogue quality and dialogue efficiency of the interactive dialogue.
The voice processing scheme provided by the embodiment of the application is an automatic question-answering solution for human-computer interaction that can understand, analyze and answer the user Query posed by the user, so it is applicable to a variety of interactive dialogue scenes, which may include, but are not limited to: ① Customer support: the intelligent question-answering system can serve as a customer support tool to answer common user questions, reducing the workload of customer service personnel and improving customer satisfaction. ② Enterprise internal knowledge base: an enterprise can use the voice processing scheme of the embodiment of the application to search richer knowledge and build an internal knowledge base, helping employees quickly find required information and improving work efficiency. ③ Virtual assistant: a machine deployed with the voice processing scheme can act as a personal or enterprise virtual assistant, providing daily task management, scheduling, reminder services and the like. ④ Online education: a machine system equipped with the voice processing scheme can be applied to online education, providing students with personalized learning resources and real-time answering services. ⑤ E-commerce: a machine deployed with the voice processing scheme can help users answer questions during shopping and provide shopping suggestions, improving the shopping experience. ⑥ Financial consultation: such a machine can provide real-time consultation services for customers of financial institutions such as banks and insurers, answering questions about accounts, transactions, products and the like. ⑦ Medical consultation: such a machine can provide patients with basic medical consultation services, answering questions about diseases, treatments, medicines and the like. ⑧ Travel assistance: such a machine can provide tourists with real-time travel information, answering questions about scenic spots, hotels, transportation and the like. ⑨ News and information retrieval: such a machine can help users quickly find required news and information, improving information retrieval efficiency. ⑩ Media resource scene: users often need to quickly search multimedia resources, which may include, but are not limited to, audio (such as music), video (such as films or short videos) and images; introducing the voice processing scheme in this scene provides richer resource search results during the interactive dialogue, significantly improving the search efficiency of media resources.
It should be understood that the above description is only exemplary product performance and interactive dialog scenarios presented by embodiments of the present application and is not intended to limit the product performance and interactive dialog scenarios that provide a speech processing scheme of embodiments of the present application. The voice processing scheme provided by the embodiment of the application can provide efficient, accurate and convenient question-answering service in various interactive dialogue scenes, has high value and practicability in various interactive dialogue scenes, and is beneficial to improving user experience and satisfaction.
For convenience of explanation, the following takes the media resource scene as the example interactive dialogue scene for the voice processing scheme provided by the embodiment of the application. In the media resource scene, the voice processing scheme is applied to a media resource application program, which can be understood as a client that integrates the voice processing scheme and has capabilities such as searching and playing multimedia resources.
A schematic system architecture of an interactive dialog system is given below in connection with fig. 2a, where, as shown in fig. 2a, the interactive dialog system includes an object 201, a terminal 202, and a server 203, and the number and naming of the object 201, the terminal 202, and the server 203 are not limited in the embodiment of the present application. Wherein:
① The object 201 is a user capable of emitting a voice signal. The object 201 and the terminal 202 may be in the same physical environment or in different physical environments: when they are in the same physical environment, the object 201 can interact with the terminal 202 in the near field; when they are in different physical environments, the object 201 can interact with the terminal 202 remotely.
② The terminal 202 refers to a terminal device having the interactive dialogue function. The terminal device may include, but is not limited to, a smartphone (such as one running Android or iOS), a portable personal computer (such as a tablet computer or smart computer), a mobile internet device (Mobile Internet Device, MID), a vehicle-mounted device, a head-mounted device, an intelligent chat robot, a smart television, a set-top box, a projector, a smart billboard and the like. It should be appreciated that the terminals 202 involved in different interactive dialogue scenes may differ.
For example, when the aforementioned interactive dialogue scene is a media resource scene with a large screen (i.e., a display screen with a large display area), the terminal 202 may be a resource playing device that runs the media resource application program, is deployed with the voice processing scheme provided by the embodiment of the application and has a large display screen. The resource playing device may be called an OTT (Over-The-Top) device or OTT large screen, including but not limited to the aforementioned internet-connected smart televisions and set-top boxes, smart computers with larger display screens, and internet-connected electronic billboards. Unlike the services of traditional telecom operators, with an OTT device the application service provider delivers its application services directly over the internet. The resource playing device is provided with a player through which audio and video can be played, and also has voice acquisition capability: for example, it integrates a signal collector, or is externally connected to one, to collect the user's voice signal. Since the voice processing scheme provided by the embodiment of the application displays content in partitions, this display mode makes better use of the OTT device's larger display screen and thereby improves its utilization.
③ The server 203 corresponds to the terminal 202 and interacts with it to provide computing and application service support, in particular application services and technical support for the machine capable of interactive dialogue (e.g., the media resource application program, a code program, or an intelligent assistant) running in the terminal 202. The server 203 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms. For example, if the server 203 is an independent physical server, the interactive dialogue system shown in FIG. 2a is a centralized system; if the server 203 is a server cluster formed by a plurality of physical servers, the interactive dialogue system shown in FIG. 2a is a distributed system. The terminal 202 and the server 203 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The voice processing scheme provided by the embodiment of the application may be executed by the terminal 202 or the server 203 in the system shown in FIG. 2a, or executed jointly by the terminal 202 and the server 203; that is, the computer device executing the voice processing scheme may be the terminal 202, the server 203, or both. Taking joint execution by the terminal 202 and the server 203 as an example, the flow of the voice processing scheme is described below in conjunction with the interactive dialogue system shown in FIG. 2a, wherein:
The terminal 202 receives a voice signal input by the user and transmits it to the server 203. The server 203 performs result search processing on the voice signal and outputs the generated problem result information to the terminal 202 in a streaming-response manner. After receiving the problem result information, the terminal 202 displays the interactive dialogue interface through the media resource application program and outputs, in a streaming-response manner, the interactive dialogue stream information and the content information related to it: specifically, it displays the interactive dialogue stream information triggered by the voice signal in the interactive dialogue area of the interactive dialogue interface, and displays the related content information in the content display area. In this way, the object 201 can make decisions based on both the interactive dialogue stream information in the interactive dialogue area and the related content information in the content display area, which significantly improves decision correctness and improves dialogue quality and dialogue efficiency.
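The division of labour just described between the terminal 202 and the server 203 can be sketched as follows in Python; the chunked protocol and all names are assumptions made for illustration.

```python
from typing import Iterator

def server_result_search(voice_signal: bytes) -> Iterator[dict]:
    # server side: result search processing, streamed back chunk by chunk
    yield {"region": "dialogue", "text": "Here are three matching shows."}
    yield {"region": "content", "text": "Cast, ratings and sources."}

def terminal_handle_voice(voice_signal: bytes) -> None:
    # in practice this iteration would run over a network connection
    for chunk in server_result_search(voice_signal):
        if chunk["region"] == "dialogue":
            print("[interactive dialogue area]", chunk["text"])
        else:
            print("[content display area]", chunk["text"])

terminal_handle_voice(b"raw-pcm-audio")
```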
In this way, displaying the interactive dialogue stream information in the interactive dialogue area of the interactive dialogue interface helps the user experience streaming question-and-answer and improves human-computer interactivity. Moreover, the content information related to the interactive dialogue stream information is displayed in the content display area as an information supplement, providing the user with more referenceable information. Compared with conventionally displaying only the interactive dialogue stream information, the user can make decisions faster and more accurately by combining the richer interactive dialogue stream information with its related content information, significantly improving the dialogue quality and dialogue efficiency of the interactive dialogue.
Based on the above simple introduction of the voice processing scheme, the interactive dialogue scene and the interactive dialogue system provided by the embodiment of the application, the following points are also needed to be described:
① The system architecture shown in fig. 2a mentioned above in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation on the technical solution provided by the embodiment of the present application. As can be known to those skilled in the art, with the evolution of the system architecture and the appearance of new service scenarios, the technical solution provided by the embodiment of the present application is also applicable to similar technical problems. That is, the architecture diagram of the interactive dialogue system shown in fig. 2a is merely an exemplary architecture diagram, and in practical applications, the number and distribution of the computer devices included in the interactive dialogue system may vary, and the architecture diagram of the interactive dialogue system is not limited in this embodiment of the present application.
For example, FIG. 2a illustrates the terminal 202 collecting the voice signal of the object 201 in the near field, taking as an example that the object 201 and the terminal 202 are in the same physical environment. In practical applications, however, the object 201 and the terminal 202 may be in different physical environments; in that case, the object 201 also holds a terminal device (such as a smartphone) through which the object 201 inputs the voice signal, and the terminal device transmits the voice signal to the terminal 202 or the server 203 for result search processing. As shown in FIG. 2b, the interactive dialogue system further includes a terminal 204, through which the object 201 achieves remote voice control of the terminal 202.
② In the embodiment of the application, the collection and processing of relevant data must strictly comply with the requirements of relevant laws and regulations: the subject of personal information must be informed of, or consent to, the collection (or there must be a legal basis for it), and subsequent use and processing of the data must stay within the scope authorized by laws, regulations and the personal information subject. For example, when the embodiment of the application is applied in a specific product or technology, such as when a user's voice signal is collected, the user's permission or consent must be obtained, and the collection, use and processing of the relevant data (such as performing result search processing on the voice signal) must comply with the relevant laws, regulations and standards of the relevant countries and regions.
Based on the above-described speech processing schemes, the embodiments of the present application provide a more detailed speech processing method, and the speech processing method provided by the embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 is a flow chart illustrating a voice processing method according to an exemplary embodiment of the present application, where the voice processing method shown in fig. 3 may be performed by a computer device, for example, the computer device is a terminal, or the computer device includes a terminal and a server. The voice processing method may include, but is not limited to, steps S301-S303:
S301: receiving a voice signal and displaying an interactive dialogue interface.
A voice signal is a user instruction sent by a user who wants to voice-control the computer device, and it expresses the user's intent in doing so; for example, the intent expressed by the voice signal may be to increase the volume of the computer device, or to have the computer device search for a television series. The embodiment of the application supports the computer device receiving or acquiring the user's voice signal based on near-field communication or telecommunication (also called far-field communication). The two procedures are described separately below, wherein:
(1) The computer device receiving the user's voice signal based on near-field communication means that, when the user and the computer device are in the same physical environment, the computer device directly collects the voice emitted by the user. Being in the same physical environment can be understood as the distance between the user and the computer device being smaller than a distance threshold (such as 5 meters); conversely, if that distance is greater than or equal to the distance threshold, the user and the computer device are determined to be in different physical environments. For example, a signal collector is deployed in the computer device; when the signal collection function is turned on, the computer device collects the user's voice signal by collecting the sound present in the near-field range through the signal collector.
(2) The computer device receiving the user's voice signal based on telecommunication means that, when the user and the computer device are in different physical environments, the computer device receives the voice signal transmitted by a signal transmission device. The signal transmission device is a terminal that is in the same physical environment as the user, collects the user's voice signal, and transmits the collected voice signal to the computer device; for example, the signal transmission device is the terminal 204 in the system shown in FIG. 2b.
Therefore, the embodiment of the application provides two ways for collecting the voice signals of the user, is convenient for the user to quickly realize the voice control of the computer equipment under any condition, meets the personalized voice control requirement of the user, and widens the use scene of the embodiment of the application to a certain extent.
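The two acquisition paths can be summarized in the short Python sketch below; the 5-meter threshold echoes the example above, and the capture functions are hypothetical stand-ins.

```python
DISTANCE_THRESHOLD_M = 5.0  # echoes the example threshold above

def capture_near_field() -> bytes:
    # built-in or externally connected signal collector
    return b"pcm-from-signal-collector"

def receive_from_remote_device() -> bytes:
    # e.g. relayed by the user's smartphone (terminal 204 in FIG. 2b)
    return b"pcm-relayed-by-smartphone"

def receive_voice_signal(distance_to_user_m: float) -> bytes:
    # same physical environment: distance below the threshold, so near field
    if distance_to_user_m < DISTANCE_THRESHOLD_M:
        return capture_near_field()
    return receive_from_remote_device()
```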
Further, after receiving the user's voice signal, if the computer device determines, from the user intent expressed by the voice signal, that problem result information needs to be presented in an interactive dialogue mode, the computer device displays the interactive dialogue interface. The interactive dialogue interface is a user interface (User Interface, UI) provided by the computer device to facilitate human-computer interaction; the UI is the medium for interaction and information exchange between the operating system of the computer device and the user, converting between the internal form of information and a form acceptable to humans.
The interactive dialogue interface comprises an interactive dialogue area and a content display area. The interactive dialogue area is used for displaying interactive dialogue stream information triggered by the voice signal of the user, namely, stream information used for presenting the interactive dialogue between the user and the computer equipment, wherein the interactive dialogue stream information comprises at least one round of interactive dialogue information. The content presentation area is used for displaying content information related to the interactive dialog flow information, that is to say, the content presentation area can be understood as a display area opened up in a display screen of the computer device (in particular, the interactive dialog interface) and used for presenting the content information which has an information supplementing effect on the interactive dialog flow information so as to enrich information available for reference by a user in the interactive dialog interface.
For example, a schematic diagram of displaying the interactive dialogue interface on the display screen corresponding to a computer device (the computer device integrates the display screen itself, or projects a virtual screen, e.g., a projector projecting onto a wall) can be seen in FIG. 4a. As shown in FIG. 4a, the interactive dialogue interface 401 is divided into a left-right structure, with the interactive dialogue area 402 on the left and the content display area 403 on the right (or, alternatively, the content display area on the left and the interactive dialogue area on the right).
Based on the exemplary interactive dialog interface shown in fig. 4a, two points need to be explained:
① In addition to the interactive dialogue interface being displayed as a separate user interface (which can make full use of the display screen) as shown in fig. 4a, in some scenarios (e.g., where a movie search is implemented by voice control in a media resource application), the embodiment of the application further supports displaying the interactive dialogue interface as a floating layer window, in order to better help the user perceive that the interactive dialogue interface is a temporary dialogue interface evoked from the computer device. As shown in fig. 4b, the interactive dialogue interface 401 is displayed on a service interface of the computer device (such as a system interface of the computer device or an application interface in a media resource application program) in the form of a floating layer window, and the display area of the floating layer window is smaller than the display area of the service interface provided by the computer device. For ease of description, the interactive dialogue interface will be described below in the form of the floating layer window shown in fig. 4b, which is specifically noted here.
② The display modes of the interactive dialogue region and the content presentation region in the interactive dialogue interface are not limited to the horizontally arranged display modes shown in fig. 4a and 4b, and may also include, but are not limited to, vertically arranged display, mosaic display, etc. Vertically arranged display means that the two regions are displayed along the vertical direction of the display screen of the computer device; as shown in fig. 4c, an interactive dialogue region and a content presentation region are arranged one above the other in the vertical direction of the display screen. Mosaic display means that one region is displayed at the outer edge of the other region to form a mosaic pattern: as shown in drawing (1) of fig. 4d, the interactive dialogue region 402 is a rectangular frame area in the middle of the display screen, and the content display region 403 is the display area between the four sides of the rectangular frame area and the frame of the display screen; as shown in drawing (2) of fig. 4d, the interactive dialogue region 402 is a rectangular frame area on the upper left of the display screen, and the content display region 403 is the display area between the right and lower sides of the rectangular frame area and the frame of the display screen. For convenience of explanation, the following description takes as an example the case where the two display regions in the interactive dialogue interface are displayed in the horizontally arranged manner shown in fig. 4a, which is specifically noted here.
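Purely for illustration, the region layouts above can be captured in a small configuration structure. The following Python sketch uses hypothetical names (RegionLayout, DialogInterfaceConfig) that are not part of the embodiment itself:

```python
from dataclasses import dataclass
from enum import Enum, auto

class RegionLayout(Enum):
    HORIZONTAL = auto()  # side-by-side display, as in fig. 4a / fig. 4b
    VERTICAL = auto()    # stacked along the screen's vertical direction, fig. 4c
    MOSAIC = auto()      # one region wraps the outer edge of the other, fig. 4d

@dataclass
class DialogInterfaceConfig:
    layout: RegionLayout = RegionLayout.HORIZONTAL
    floating_window: bool = False  # display as a floating layer over the service interface (fig. 4b)
```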
S302, displaying interactive dialogue stream information triggered by the voice signal in the interactive dialogue area.
The interactive dialogue stream information triggered by the voice signal is stream information formed by the interactive dialogue information corresponding to at least one round of interactive dialogue started by the voice signal; the stream information includes the interactive dialogue information corresponding to each round of interactive dialogue. For example, after receiving the voice signal, the computer device converts the voice signal into the problem description information of the first round of interactive dialogue information, and generates problem result information according to the problem description information to form the first round of interactive dialogue information. If the user wants to continue the dialogue, the computer device collects the voice signal of the second round of interactive dialogue and forms the second round of interactive dialogue information in the same way as the first round; by repeating these steps, the interactive dialogue stream information triggered by the voice signal is obtained.
In the embodiment of the application, two result styles are provided for the interactive dialogue stream information; specifically, two information types of the problem result information are provided, namely an image-text type and a plain-text type. ① When the information type of the interactive dialogue information is the image-text type, the interactive dialogue information simultaneously includes modal information such as text and images, so that rich search results are provided for the user in a multi-modal manner. A schematic diagram of an image-text type interactive dialogue message is shown in fig. 5a; as shown in fig. 5a, the problem result information included in a round of interactive dialogue information 501 contains both images and text. ② When the information type of the interactive dialogue information is the plain-text type, the interactive dialogue information does not include modalities such as images, i.e. it only includes modal information such as text, links or icons, which does not occupy a large display area. A schematic diagram of a plain-text type interactive dialogue message is shown in fig. 5b; as shown in fig. 5b, only text is included in the problem result information included in a round of interactive dialogue information 502. It should be noted that the result styles of the interactive dialogue stream information are not limited to these two types, which do not limit the embodiments of the present application.
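As a minimal, non-authoritative sketch, the turn-by-turn formation of the stream and the two result styles might be modeled as follows; all names (InfoType, DialogTurn, DialogStream) are illustrative assumptions:

```python
from dataclasses import dataclass, field
from enum import Enum

class InfoType(Enum):
    IMAGE_TEXT = "image_text"  # multi-modal: text plus images (fig. 5a)
    PLAIN_TEXT = "plain_text"  # text, links or icons only (fig. 5b)

@dataclass
class DialogTurn:
    question: str                               # problem description information (from the voice signal)
    answer: str                                 # problem result information
    info_type: InfoType = InfoType.PLAIN_TEXT
    images: list = field(default_factory=list)  # resource images for image-text turns

@dataclass
class DialogStream:
    turns: list = field(default_factory=list)  # one entry per round of interactive dialogue

    def add_turn(self, turn: DialogTurn) -> None:
        self.turns.append(turn)  # each further round extends the stream
```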
S303, displaying content information related to the interactive dialogue stream information in the content display area.
The content information related to the interactive dialogue information, in particular to any round of interactive dialogue information in the interactive dialogue stream information, is displayed in the content display area; this content information is different from that round of interactive dialogue information but is related to it in content.
Specifically, assume that the i-th round of interactive dialogue information among N rounds of interactive dialogue information is displayed in the interactive dialogue region, that is, all or part of the i-th round of interactive dialogue information is displayed in the interactive dialogue region and occupies all or part of the display area of the interactive dialogue region, where i is an integer and 1 ≤ i ≤ N. Then the content information related to the interactive dialogue stream information displayed in the content presentation region includes at least one of the following:
① One or more pieces of candidate problem description information matched with the dialogue intention of the i-th round of interactive dialogue information. The dialogue intention of the i-th round of interactive dialogue information refers to the user intention with which the user conducts the i-th round of interactive dialogue. The candidate problem description information matched with this dialogue intention is problem description information that is related to the dialogue intention and can be selected by the user. Providing candidate problem description information helps the user refine the description of the user intention when the currently output problem result information does not match that intention, which helps improve the accuracy of the searched and generated problem result information and thereby improves the dialogue quality.
For example, a schematic diagram of displaying candidate problem description information in the content display area may be seen in fig. 6. As shown in fig. 6, assume that the problem description information of the i-th round of interactive dialogue information displayed in the interactive dialogue area is a weather query and the problem result information is 20 degrees; then the candidate problem description information 601 and the candidate problem description information 602 displayed in the content display area are both matched with the intention of the i-th round of interactive dialogue information, i.e. both are related to the intention of inquiring about the weather. For instance, the candidate problem description information 601 may be "will it get cooler later", the candidate problem description information 602 may be "what to wear in 20-degree weather", and so on.
② Target content information related to target dialogue information, where the target content information includes at least one of detail information of the target dialogue information, source information of the target dialogue information, and comment information of the target dialogue information. The target dialogue information is the information selected (by default or by the user) in the i-th round of interactive dialogue information that needs information supplementation (i.e. having related content information displayed in the content display area); it may belong to the problem description information or to the problem result information, which is not limited by the embodiment of the application. The detail information of the target dialogue information can be understood as information that further explains the target dialogue information. The source information of the target dialogue information includes the sources, such as web pages or journals, referred to when generating the target dialogue information (such as a journal name or a web page address). The comment information of the target dialogue information may refer to evaluation information generated by the computer device (specifically, the server) on the target dialogue information; for example, if the target dialogue information is a video media resource, the comment information may be a recommendation reason for that video media resource generated by the computer device.
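As a sketch only, the two kinds of content information listed in ① and ② might be modeled like this (field names are assumptions, not the embodiment's actual schema):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TargetContentInfo:
    detail: Optional[str] = None                 # further explanation of the target dialogue information
    sources: list = field(default_factory=list)  # e.g. journal names, web page addresses
    comment: Optional[str] = None                # e.g. a server-generated recommendation reason

@dataclass
class ContentPanelData:
    candidate_questions: list = field(default_factory=list)  # ① matched to the round's dialogue intention
    target_content: Optional[TargetContentInfo] = None       # ② for the selected target dialogue information
```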
Notably, the target dialogue information can be determined by default or customized by the user in the interactive dialogue region. Taking the determination of the target dialogue information in the i-th round of interactive dialogue information as an example, the two determination modes of the target dialogue information are introduced below:
(1) The target dialogue information is determined by default.
The embodiment of the application supports pre-configuring one or more types of target dialogue information; when target dialogue information of a predefined type appears in the problem description information or the problem result information in the interactive dialogue area, the content information related to that target dialogue information is displayed in the content display area by default. Through this predefined mode, important dialogue information in the interactive dialogue process can automatically receive further explanation in the content display area, which improves the intelligence and automation of content display in the interactive dialogue and improves the user's dialogue interaction experience.
The predefined types of target dialogue information may differ depending on the interactive dialogue scene. For example, if it is predefined in a media asset scene that target dialogue information of the video type needs to be displayed in the content display area, then when target dialogue information of the video type (such as a resource image corresponding to a video media resource, e.g. a movie poster) appears in the interactive dialogue area, the content information related to that target dialogue information is displayed in the content display area by default. For another example, if it is predefined in a music scene that target dialogue information of the music type needs to be displayed in the content display area, then when target dialogue information of the music type appears in the interactive dialogue area, the content information related to it (such as a music cover corresponding to the music resource) is displayed in the content display area by default.
In the following, with reference to fig. 7, the process of displaying content information related to target dialogue information in the content display area by default is described, taking as an example that content information related to target dialogue information of a predefined video type in a media asset scene needs to be displayed in the content display area. As shown in fig. 7, assume that the i-th round of interactive dialogue information includes first problem description information and first problem result information, and the first problem result information includes a resource image 701 corresponding to a video media resource, i.e. a video poster (this is the target dialogue information displayed in the interactive dialogue area). The computer device then determines by default that the resource image displayed in the interactive dialogue area is the target dialogue information, and displays by default target content information related to the resource image 701 in the content presentation area, the target content information being at least one of one or more pieces of content information related to the target dialogue information; for example, the target content information includes comment information of the target dialogue information, i.e. recommendation reason information 702 for the resource image 701.
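A minimal sketch of this default determination, assuming a simple mapping from interactive dialogue scene to the predefined information types (all names hypothetical):

```python
# Hypothetical scene-to-type mapping: which dialogue-information types are
# supplemented in the content presentation area by default.
PREDEFINED_TARGET_TYPES = {
    "media_asset": {"video"},  # e.g. a resource image such as a movie poster
    "music": {"music"},        # e.g. a music cover
}

def default_targets(scene: str, round_items: list) -> list:
    """Return the items of the current round whose type is predefined for this scene."""
    wanted = PREDEFINED_TARGET_TYPES.get(scene, set())
    return [item for item in round_items if item["type"] in wanted]
```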
Furthermore, in a media asset scene, the embodiment of the application also supports previewing or playing, in the content display area, the video media resource corresponding to the resource image in the interactive dialogue area, so that the user is helped to make better decisions by previewing the video media resource in the content display area, improving the user's video search experience. As described above, the content information related to the target dialogue information displayed in the content display area may include detail information; when the target dialogue information in the media asset scene is a resource image corresponding to a video media resource, the detail information related to the resource image may include the video media resource itself, that is, the target content information displayed in the content display area for the resource image is the video media resource itself. In this case, the computer device plays the video media resource in the content presentation area via a player (e.g. automatically, or upon a user click).
During the playing of the video media resource in the content display area, the embodiment of the application also supports user operations on the playback, such as pausing or resuming the video according to the user's playing requirements, so as to meet the user's personalized video browsing needs. Specifically, during the playing of the video media resource in the content presentation area, the computer device receives a pause playing operation for the video content, such as, but not limited to, a trigger operation on a pause component (or option, button, control) in the player, or a click operation on any display position in the playing interface of the video media resource. The computer device pauses playing the video media resource in the content presentation area in response to the pause playing operation. Then, if the computer device detects a continue playing operation of the user for the video media resource, it continues to play the video media resource in the content presentation area.
Furthermore, in the media asset scene, the user can quickly play a video media resource once the desired video media resource has been found. The embodiment of the application also supports displaying, in the content display area, a video playing entry (presented on the interface as a control, component, key, option, or the like) for triggering large-screen playing of the video media resource, thereby realizing a quick jump from the interactive dialogue interface to the media playing interface of the video media resource.
A schematic diagram of previewing a video media resource in the content presentation area and triggering large-screen playing of the video media resource from the content presentation area can be seen in fig. 8. As shown in fig. 8, when the i-th round of interactive dialogue information includes a resource image corresponding to a video media resource (such as resource image 801), and the resource image 801 is in focus (i.e. the cursor in the display screen selects the resource image 801), the target content information related to the resource image 801, namely the video media resource 802, is displayed in the content display area and played; the user can perform a pause playing operation or a continue playing operation to autonomously control the video media resource 802. In addition, a video playing entry 803 for the video media resource 802 is displayed in the content display area. When the video playing entry 803 is triggered, the computer device closes the interactive dialogue interface and displays the media playing interface 804 of the video media resource 802, so as to realize full-screen playing of the video media resource 802 on the media playing interface 804 (e.g. playing from the beginning, or continuing from the playing progress reached in the content display area).
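The preview, pause/continue, and large-screen jump behavior described above might be sketched as follows; the class and method names are illustrative assumptions rather than the embodiment's actual interfaces:

```python
class ContentAreaPlayer:
    """Sketch of preview playback in the content presentation area."""

    def __init__(self, video_asset):
        self.video = video_asset
        self.playing = False
        self.progress = 0.0  # seconds of preview already played

    def on_focus(self):
        # The resource image gains focus: preview the asset in the content area.
        self.playing = True

    def on_pause(self):
        # Pause component triggered, or a click on the playing interface.
        self.playing = False

    def on_resume(self):
        # Continue-playing operation detected.
        self.playing = True

    def on_play_entry_triggered(self, ui):
        # Close the interactive dialogue interface and open the media playing
        # interface, e.g. resuming from the preview progress.
        ui.close_dialog_interface()
        ui.open_media_player(self.video, start_at=self.progress)
```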
(2) The target dialog information is user-defined in the interactive dialog region.
The embodiment of the application supports the user in customizing, from the interactive dialogue area and according to the user's content viewing requirements during the interactive dialogue, the target dialogue information for which content information needs to be displayed. This gives the user richer information selection authority, can meet the user's personalized content viewing requirements, and significantly improves the user's interactive dialogue experience.
In a specific implementation, if the user needs to supplement information for specific dialogue information in the i-th round of interactive dialogue information in the interactive dialogue area, the user performs an information viewing operation on that specific dialogue information, which is thereby determined to be the target dialogue information. The computer device, in response to the information viewing operation performed on the target dialogue information of the i-th round of interactive dialogue information in the interactive dialogue area, displays the target dialogue information in a selected state in the interactive dialogue area, so as to prompt the user a second time to confirm the selected target dialogue information. The computer device may then also display target content information related to the target dialogue information in the content presentation area. ① The information viewing operation performed by the user on the target dialogue information in the interactive dialogue area can be understood as a selection operation on the target dialogue information, including, but not limited to, selecting the target dialogue information followed by a key operation and a confirmation operation on an information confirmation option in an option window, or selecting the target dialogue information through an input device such as an electronic pen or a mouse. ② The style in which the target dialogue information is displayed in the selected state in the interactive dialogue area may include, but is not limited to, deepening the background color of the target dialogue information, displaying a prompt tab adjacent to the target dialogue information to prompt the user that the target dialogue information requires information supplementation, and so on.
The operation procedure of the information viewing operation described above, and exemplary styles of the target dialogue information in the selected state, are described below with reference to the accompanying drawings. As shown in fig. 9a, assume that the selection operation on the target dialogue information consists of selecting the target dialogue information, performing a key operation, and performing a confirmation operation on an information confirmation option in an option window. When the computer device is a smart computer or a smart television, the target dialogue information 901 can be selected by controlling a cursor in the display screen of the computer device through a mouse or a remote controller. Then, a key operation is performed on the mouse or the remote controller (e.g. a right-click of the mouse, or pressing a key on the remote controller), and an option window 902 is displayed; the option window 902 includes an information confirmation option 903 for triggering the display of target content information related to the target dialogue information 901 in the content presentation area. The user may continue to perform a confirmation operation on the information confirmation option 903 by controlling the cursor with the mouse or the remote controller; at this time, it is determined that the user needs further supplementation of the target dialogue information 901 in the content presentation area, and the target dialogue information 901 is displayed in a selected state in the interactive dialogue area, such as a style 904 with a deepened background color, or a style in which a prompt tab 905 is displayed adjacent to the target dialogue information.
As shown in fig. 9b, assume that the selection operation on the target dialogue information is performed through an input device such as an electronic pen, or by hand. When the computer device is a device that can be operated via a touch screen, such as a smart computer, the selection operation may be performed on the target dialogue information 901 in the display screen of the computer device with an electronic pen or a finger, for example a circling operation. In this case the computer device determines that the user needs further information supplementation for the target dialogue information 901 in the content display area, and displays the target dialogue information 901 in a selected state in the interactive dialogue area; the selected state is as set forth for fig. 9a and is not repeated here.
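A minimal sketch of this user-defined determination flow, assuming hypothetical ui helpers (mark_selected, fetch_target_content, show):

```python
def on_information_viewing_operation(ui, target_item):
    """Handle an information viewing operation performed on target_item."""
    # Secondary confirmation: render the item in a selected state,
    # e.g. a deepened background or an adjacent prompt tab (figs. 9a / 9b).
    ui.dialog_area.mark_selected(target_item)
    # Then supplement the item in the content presentation area.
    content = ui.fetch_target_content(target_item)  # detail / source / comment information
    ui.content_area.show(content)
```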
Based on the foregoing descriptions of implementations (1) and (2), the embodiment of the application supports both predefined and user-defined target dialogue information. When the number of pieces of content information (including target content information and candidate problem description information) related to the target dialogue information in the i-th round of interactive dialogue information is at least two, the embodiment of the application supports displaying the at least two pieces of content information at once in the content display area, that is, displaying them simultaneously; as shown in fig. 9b, content information 906 and content information 907 related to the target dialogue information are displayed simultaneously in the content display area, making it convenient for the user to view the content information.
Considering that the limited display area of the content display area may not be able to display multiple pieces of content information at the same time when the number of pieces of content information related to the target dialogue information is large, the embodiment of the application also supports switching the display of multiple pieces of content information related to the interactive dialogue stream information in the content display area, in particular switching the display of multiple pieces of content information related to the i-th round of interactive dialogue information displayed in the interactive dialogue area. This not only enables each piece of content information to be displayed completely in the content display area, but also ensures flexible control over showing and hiding the multiple pieces of content information. In a specific implementation, the content information may include content information related to the i-th round of interactive dialogue information (namely candidate problem description information matched with the dialogue intention of the i-th round of interactive dialogue information, and target content information related to the target dialogue information in the i-th round of interactive dialogue information); the content presentation area includes at least one content type option, and each content type option corresponds to one piece of content information. In this way, the computer device, in response to the user's selection operation on a target content type option, displays the content information corresponding to the selected target content type option in the content presentation area, the target content type option being any one of the at least one content type option.
For example, a schematic diagram of content type options included in the content presentation area may be seen in fig. 10. As shown in fig. 10, the content presentation area includes a content type option 1001, a content type option 1002, a content type option 1003, and so on; the content type option 1001 corresponds to the detail information of the target dialogue information, the content type option 1002 corresponds to the candidate problem description information matched with the dialogue intention of the i-th round of interactive dialogue information, and the content type option 1003 corresponds to the comment information of the target dialogue information. When the user wants to view any piece of content information of the target dialogue information, a selection operation may be performed on the content type option corresponding to that content information; for example, when the content type option 1001 is selected, the detail information of the target dialogue information corresponding to the content type option 1001 is displayed in the content presentation area. Further, if there are multiple pieces of target dialogue information in the i-th round of interactive dialogue information, then when the content type option 1001 is selected, the detail information related to each piece of target dialogue information is displayed in the content display area, making it convenient for the user to view the detail information of multiple pieces of target dialogue information in batch, and improving the viewing efficiency of the content information.
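Illustratively, the option-to-content switching could look like the following sketch, reusing the hypothetical names from the earlier data-model sketch:

```python
def on_content_type_selected(ui, option_key, round_ctx):
    """Switch the content presentation area to the information bound to option_key."""
    if option_key == "detail":
        # With several pieces of target dialogue information, show each detail in batch.
        items = [t.detail for t in round_ctx.targets]
    elif option_key == "candidates":
        items = round_ctx.candidate_questions
    elif option_key == "comment":
        items = [t.comment for t in round_ctx.targets]
    else:
        return
    ui.content_area.show(items)
```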
The interactive dialogue area and the content display area in the interactive dialogue interface provided by the embodiment of the application are displayed in linkage. Linkage display between the two areas means that, as long as the i-th round of interactive dialogue information occupies most of the display area (e.g. more than 1/2), content information related to the i-th round of interactive dialogue information is always displayed in the content display area, even while the user is dynamically viewing the i-th round of interactive dialogue information in the interactive dialogue area; and when most of the display area of the interactive dialogue area switches from displaying the i-th round of interactive dialogue information to displaying the j-th round of interactive dialogue information, the content display area switches to displaying the content information related to the j-th round of interactive dialogue information. This linkage display mode keeps the content information related to the i-th round of interactive dialogue information displayed in the content display area while the user views the i-th round of interactive dialogue information, avoiding problems such as information misalignment caused by the content information displayed in the content display area being unrelated to the interactive dialogue information displayed in the interactive dialogue area.
Referring to fig. 11, the process of linkage display between the interactive dialogue area and the content display area is described, taking as an example switching the display between the i-th round and the j-th round of interactive dialogue information among the N rounds of interactive dialogue information in the interactive dialogue area, where j is a positive integer, j ≠ i, and 1 ≤ j ≤ N. As shown in fig. 11, while a dialogue viewing operation is performed on the i-th round of interactive dialogue information 1101 in the interactive dialogue area, the content information 1102 related to the i-th round of interactive dialogue information is kept displayed in the content presentation area; in detail, if the i-th round of interactive dialogue information 1101 always occupies most of the display area of the interactive dialogue area while the dialogue viewing operation is performed, it is determined that the dialogue viewing operation is viewing the i-th round of interactive dialogue information, and the content information 1102 related to that round is kept displayed in the content presentation area. When a dialogue viewing operation switching from the i-th round of interactive dialogue information 1101 to the j-th round of interactive dialogue information 1103 is detected in the interactive dialogue area — specifically, when, as the dialogue viewing operation continues, the display area of the j-th round of interactive dialogue information 1103 in the interactive dialogue area becomes larger than that of the i-th round of interactive dialogue information 1101 — the computer device updates the content presentation area from the content information related to the i-th round to the content information 1104 related to the j-th round of interactive dialogue information.
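The linkage rule — the round occupying the larger display area drives the content panel — can be sketched as follows; visible_fractions is an assumed input mapping each round index to the fraction of the dialogue region that round currently occupies:

```python
def active_round(visible_fractions: dict) -> int:
    """Round index that occupies the largest share of the interactive dialogue region."""
    return max(visible_fractions, key=visible_fractions.get)

def on_dialog_view_changed(ui, visible_fractions: dict, current_round: int) -> int:
    new_round = active_round(visible_fractions)
    if new_round != current_round:
        # The j-th round now occupies the larger display area:
        # switch the content presentation area in linkage.
        ui.content_area.show(ui.fetch_round_content(new_round))
    return new_round
```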
The dialogue viewing operation performed in the interactive dialogue region may include at least one of a page-turning operation (i.e., the interactive dialogue information in the interactive dialogue region is displayed by simulating the style of turning pages left-right or up-down in the real world), a scrolling operation (e.g., scrolling the dialogue stream information by controlling a button on a mouse or a remote controller), a drag operation (e.g., sliding the interactive dialogue information via a sliding axis in the interactive dialogue region), and the like.
Take as an example that the dialogue viewing operation is a scrolling operation, j = i + 1, the information type of the i-th round of interactive dialogue information is the plain-text type, and the information type of the j-th round of interactive dialogue information is the image-text type. The interactive dialogue area scrolls upward by half a screen per operation to realize an updated display of the interactive dialogue stream information, in particular of the i-th round of interactive dialogue information. If the i-th round of interactive dialogue information still occupies most of the display area of the interactive dialogue area while the scrolling operation continues, the content information related to the i-th round of interactive dialogue information continues to be displayed in the content display area. Further, as the scrolling operation continues, the j-th round of interactive dialogue information appears in the interactive dialogue area; when the j-th round of interactive dialogue information occupies a larger display area of the interactive dialogue area than the i-th round, content information related to the j-th round of interactive dialogue information is displayed in the content display area. It should be noted that, since the information type of the j-th round of interactive dialogue information is the image-text type (for example, in a media asset scene a resource image and text coexist in the j-th round of interactive dialogue information), the resource image is considered more likely than plain text to be the information the user is interested in; therefore, while the display is updated to the j-th round of interactive dialogue information, the resource image in the j-th round of interactive dialogue information is preferentially focused in the interactive dialogue area, that is, the resource image is automatically determined as the target dialogue information.
It should be noted that fig. 11 merely shows an exemplary procedure of performing a dialogue viewing operation in the interactive dialogue region according to an embodiment of the present application, and does not limit the embodiments of the present application.
As mentioned above, the interactive dialogue stream information includes N rounds of interactive dialogue information, and several of these rounds may each have related content information. To facilitate the user in quickly locating the content information related to any round of interactive dialogue information in the content display area, the embodiment of the application supports quickly locating any round of interactive dialogue information in the interactive dialogue area, and quickly locating the content information related to that round in the content display area. Especially when N is large, this significantly improves the efficiency of locating content information or interactive dialogue information, allows the user to flexibly locate interactive dialogue information and content information according to their own information viewing requirements, and improves the user's interactive dialogue experience.
In a specific implementation, the content display area includes a dialogue option corresponding to each round of interactive dialogue information in the interactive dialogue stream information; any dialogue option is used for quickly indexing the corresponding interactive dialogue information in the interactive dialogue area and quickly indexing the content information related to that interactive dialogue information in the content display area. In this implementation, the computer device, in response to the user's triggering operation on a target dialogue option in the content presentation area, displays the content information related to the target interactive dialogue information corresponding to the target dialogue option in the content presentation area, and displays the target interactive dialogue information in the interactive dialogue area. As shown in fig. 12a, assume that N = 3, and the content presentation area includes a dialogue option 1201 corresponding to the 1st round of interactive dialogue information, a dialogue option 1202 corresponding to the 2nd round, and a dialogue option 1203 corresponding to the 3rd round; the 2nd round of interactive dialogue information is currently displayed in the interactive dialogue area, and content information related to the 2nd round is displayed in the content presentation area. When the user performs a triggering operation on the dialogue option 1203 in the content presentation area, content information related to the 3rd round of interactive dialogue information is displayed in the content presentation area, and the 3rd round of interactive dialogue information is displayed in the interactive dialogue area.
Two points need to be explained based on the schematic diagram shown in fig. 12 a:
① The dialogue options and the aforementioned content type options may exist in the content presentation area at the same time. When they coexist and any dialogue option is selected, the user is further allowed to perform a selection operation on the content type options, so as to switch the display among the different pieces of content information related to the interactive dialogue information corresponding to that dialogue option. As shown in fig. 12b, with the dialogue option 1202 selected, the user may continue to perform selection operations on the content type options in the content presentation area to switch the display among multiple pieces of content information related to the 2nd round of interactive dialogue information. The coexistence of the dialogue options and the content type options greatly improves the flexibility with which the user can view the interactive dialogue information and the content information, from both the dimension of the dialogue round and the dimension of the content information.
② As described above, there may be at least two pieces of target dialogue information in the i-th round of interactive dialogue information. The embodiment of the application supports the user in quickly displaying the target content information related to a designated piece of target dialogue information in the content display area, by performing a content locating operation on that (user-designated) piece of target dialogue information of the i-th round in the interactive dialogue area. While conveniently viewing the interactive dialogue stream information in the interactive dialogue area, the user does not need to repeatedly perform information viewing operations on the target dialogue information; instead, the target dialogue information can be cached, so that the user can quickly index, from the interactive dialogue area, the content information related to a particular piece of target dialogue information. In a specific implementation, when a content locating operation is performed on target dialogue information designated in the i-th round of interactive dialogue information in the interactive dialogue area, the target content information related to the designated target dialogue information is displayed in the content presentation area. The content locating operation performed on the target dialogue information may include, but is not limited to, a double-click operation, a long-press operation, a single-click operation, etc., which is not limited by the embodiment of the application. Both quick-index paths are sketched below.
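As an illustrative sketch with assumed helper names, the dialogue-option quick index and the content locating operation on cached target dialogue information reduce to simple handlers:

```python
def on_dialog_option_triggered(ui, round_index: int):
    """Quick index from a dialogue option: align both regions to the same round."""
    ui.dialog_area.scroll_to_round(round_index)                # show that round's dialogue
    ui.content_area.show(ui.fetch_round_content(round_index))  # and its related content

def on_content_locating_operation(ui, cached_target):
    """Double-click / long-press on a cached target jumps straight to its content."""
    ui.content_area.show(ui.fetch_target_content(cached_target))
```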
In summary, the embodiment of the application displays content in partitions of the interactive dialogue interface. In particular, displaying the interactive dialogue stream information triggered by the voice signal in the interactive dialogue area can help the user determine the problem result more conveniently through the form of problem description information followed by problem result information, improving the user's question-answering experience. Displaying content information related to the interactive dialogue stream in the content display area realizes further information supplementation of the interactive dialogue stream information, helping the user combine the interactive dialogue stream information in the interactive dialogue area with the content information displayed in the content display area to jointly make decisions, which effectively improves dialogue quality and dialogue efficiency.
Referring to fig. 13, fig. 13 is a flowchart illustrating another voice processing method according to an exemplary embodiment of the present application, where the voice processing method shown in fig. 13 may be performed by a computer device, for example, the computer device is a terminal, or the computer device includes a terminal and a server. The voice processing method may include, but is not limited to, steps S1301-S1307:
S1301, displaying a service interface of the resource playing device.
The resource playing device provides a globally triggerable voice interaction mode, that is, each of multiple service interfaces of the resource playing device is provided with a voice interaction entrance; compared with a fixed voice interaction entrance position, this provides multiple paths for triggering voice interaction and improves the flexibility of triggering it. Specifically, a media resource application program is deployed in the resource playing device, and an intelligent assistant integrating the voice processing method provided by the embodiment of the application is deployed in the media resource application program; the embodiment of the application supports providing a voice interaction entrance in each service interface of the resource playing device, so as to realize globally triggerable voice interaction. The service interfaces of the resource playing device include the system interface of the resource playing device and the application interface of the media resource application program, where the system interface refers to an interface provided by the operating system deployed in the resource playing device, and the application interface refers to an interface provided by the service provider of the media resource application program.
In order to facilitate the user's perception of each stage or progress of voice interaction with the resource playing device, the embodiment of the application supports displaying voice prompt information in the service interface of the resource playing device, and enables the user to perceive the progress or stage of the voice interaction through the states of the voice prompt information. These states include a guiding state, a wake-up state, a recognition state, an understanding state, and an execution state (namely, an interactive dialogue interface is displayed when a dialogue task is performed, and a task execution result is displayed when a direct task is performed). The voice prompt information may be a voice bar based on Automatic Speech Recognition (ASR); ASR is a technology that automatically converts a human speech signal into the corresponding text through an AI algorithm, so that the resource playing device can understand the semantic content expressed by the speech signal, thereby realizing the conversion from Speech To Text (STT). Thus, the voice prompt information can, based on ASR technology, present the collected voice signal of the user converted into the corresponding text (such as the problem description information).
Specifically, when the user initially enters a service interface of the resource playing device, the service interface includes voice prompt information in the prompt state. The prompt state may also be called the guiding state; the voice prompt information in the prompt state is used to prompt that the resource playing device has a voice interaction function. If the user wants to perform voice interaction with the resource playing device, for example wants to invoke the media resource application program running in the resource playing device to perform a video search, the user can perform a confirmation operation on the voice prompt information in the prompt state, at which point the execution of step S1302 is triggered.
S1302, in response to the confirmation operation of the voice prompt information in the prompt state, the voice prompt information is converted from the prompt state to the awakening state.
When the resource playing device detects the confirmation operation of the user on the voice prompt information in the prompt state, the resource playing device determines that the user wants to perform voice interaction, and the resource playing device converts the voice prompt information from the prompt state to the awakening state. The wake-up state may be simply called an awake state, where the voice prompt information in the awake state is used to prompt the resource playing device to be in a voice acquisition state, so that the user starts to output the voice signal when perceiving that the voice prompt information is in the awake state.
It should be noted that the confirmation operation performed by the user on the voice prompt information in the prompt state may be implemented through near-field communication or remote communication; for details of near-field and remote communication, refer to the description of step S301 above, which is not repeated here. For example, if the resource playing device is a smart television: in the near-field communication case, the user may press a key on the remote controller of the smart television to perform the confirmation operation on the voice prompt information in the prompt state; in the remote communication case, when the resource playing device receives a confirmation request transmitted by the signal transmission device on the user side, it determines that the user has performed the confirmation operation on the voice prompt information in the prompt state.
S1303, when the resource playing device starts to collect the voice signal, the voice prompt information is converted from the awakening state to the collecting state.
If the resource playing device starts to collect the voice signal, it converts the voice prompt information from the wake-up state to the collection state. The collection state may also be called the recognition state; the voice prompt information in the collection state is used to prompt that the resource playing device is collecting the voice signal. The manner in which the resource playing device collects the voice signal can include collecting the voice signal emitted by the user in the same physical environment based on near-field communication, or receiving the voice signal transmitted by the signal transmission device on the user side based on remote communication.
S1304, when the collection of the voice signal by the resource playing device is finished, converting the voice prompt information from the collection state to the understanding state.
When the resource playing device has successfully collected the voice signal, it converts the voice prompt information from the collection state to the understanding state; the voice prompt information in the understanding state is used to prompt that the resource playing device is performing result search processing based on the voice signal.
To facilitate understanding of the circulation between the states of the voice prompt information described in steps S1301-S1304, the following describes, with reference to fig. 14, a schematic diagram of the voice prompt information in different states. As shown in fig. 14, voice prompt information 1401 in the guiding state is displayed in the service interface of the resource playing device; the text content included in the voice prompt information 1401 in the guiding state is used to guide the user to perform operations, such as movie searches, through voice control of the resource playing device. The text content may be generated in combination with data such as platform operations, object data (e.g. the user's scheduled viewing and followed videos), and video content (e.g. actors, comments), which is not limited by the embodiment of the present application. When it is detected that the user performs a confirmation operation on the voice prompt information 1401 in the guiding state, the voice prompt information 1401 switches from the guiding state to the wake-up state, prompting the user that output of the voice signal can begin. When the resource playing device starts to collect the voice signal, the voice prompt information 1401 in the wake-up state is converted to the recognition state, to prompt the user that the voice signal is being collected. When the collection of the voice signal is finished, the voice prompt information 1401 in the recognition state is converted to the understanding state, to inform the user that result search processing is being performed on the voice signal. It should be understood that the information style, display position, and specific content of the voice prompt information shown in fig. 14 are all examples and do not limit the embodiments of the present application.
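The state circulation of steps S1301-S1304 amounts to a small state machine; the following sketch is illustrative, with assumed event names:

```python
from enum import Enum, auto

class PromptState(Enum):
    GUIDING = auto()        # prompt/guiding state: voice interaction is available
    AWAKE = auto()          # wake-up state: device ready to collect speech
    RECOGNIZING = auto()    # collection/recognition state: collecting the voice signal
    UNDERSTANDING = auto()  # understanding state: result search processing in progress
    EXECUTING = auto()      # dialogue interface shown, or direct task result shown

TRANSITIONS = {
    (PromptState.GUIDING, "confirm"): PromptState.AWAKE,                       # S1302
    (PromptState.AWAKE, "collection_started"): PromptState.RECOGNIZING,        # S1303
    (PromptState.RECOGNIZING, "collection_ended"): PromptState.UNDERSTANDING,  # S1304
    (PromptState.UNDERSTANDING, "result_ready"): PromptState.EXECUTING,
}

def step(state: PromptState, event: str) -> PromptState:
    # Events with no defined transition leave the state unchanged.
    return TRANSITIONS.get((state, event), state)
```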
The result search processing for the voice signal mainly includes operations such as intention recognition processing and result generation processing on the voice signal, and aims to generate problem result information for the problem description information corresponding to the voice signal, so as to reply to the user Query. In order to realize more accurate interactive dialogue, the embodiment of the application supports result search processing of the voice signal based on a large language model, aiming to generate more accurate problem result information and thereby improve dialogue quality and dialogue efficiency. The application provides an AI voice landing scheme for a large language model (such as an LLM model): the large language model has better intention understanding capability, and the generative model is fine-tuned in combination with specific service data (service data in the media asset scene), so that the fine-tuned generative model can be better combined with services (such as a media asset search service in the media asset scene), and content search on resource playing devices, especially OTT devices, can be effectively realized. On the one hand, in the interactive dialogue scene, intention recognition can be performed on complex user Queries and unseen user Queries based on the fine-tuned generative model, improving generalization capability. On the other hand, by means of the long-text processing capability of the generative model, text content corresponding to longer voice signals can be understood, improving the accuracy of intention recognition and content recall.
In a specific implementation, the general flow of performing result search processing on the voice signal based on the fine-tuned generative model according to the embodiment of the application may be as follows. First, basic analysis data is obtained; the basic analysis data is basic data related to the service to which the voice processing method is applied, and includes, but is not limited to, one or more of scene data (such as application data of the media resource application program), object data (such as the user's historical identity data and behavior data), authority data (e.g. a user who is a member of the media resource application program has greater authority than a non-member), and intention list data (such as the intentions expressed by the user when searching for movies in the media resource application program over historical time). Then, the fine-tuned generative model is invoked, and intention recognition processing is performed on the voice signal based on the basic analysis data to obtain an intention recognition result; specifically, the basic analysis data is given to the fine-tuned generative model so that the model can quickly analyze and obtain the intention recognition result based on it, the intention recognition result indicating the voice task, i.e. the user intention corresponding to the voice signal input by the user.
Finally, the voice task is executed according to the intention recognition result. If the intention recognition result characterizes the voice task indicated by the voice signal as a dialogue task, that is, the intention can be achieved by voice interaction with the user, result generation processing is performed according to the intention recognition result to generate the problem result information corresponding to the voice signal, and the execution of steps S1305-S1307 is triggered; the generated problem result information and the problem description information corresponding to the voice signal form one round of interactive dialogue information included in the interactive dialogue stream information in the interactive dialogue area. In the media asset scene, during result generation according to the intention recognition result, a finer intention classification is further performed on the intention recognition result, and the result generation processing is performed according to the finer classification result to generate the problem result information corresponding to the voice signal.
Conversely, if the intention recognition result characterizes the voice task indicated by the voice signal as a direct task, that is, no voice interaction with the user is needed and the corresponding task processing only needs to be executed directly, the direct task is processed and the task processing result (also called the direct result) is presented in the resource playing interface. Direct tasks may include at least one of a control task, an interface switching task, a player control task, and the like. ① When the direct task is a control task, the control task indicates control of a device capability of the resource playing device, such as volume control, brightness control, or bullet-screen control, and the corresponding task processing result can be conveyed through the voice prompt information. As shown in fig. 15a, assuming the user intention indicated by the intention recognition result is to close the bullet screen on the display screen, the resource playing device may automatically close the bullet screen in the display screen and display the task execution result in the voice prompt information 1501 to indicate that the bullet screen has been closed. ② When the direct task is an interface switching task, cross-application switching on the resource playing device can be realized, where "application" may refer to different application programs running in the resource playing device, or to different application interfaces within the same application program. As shown in fig. 15b, assuming the user intention indicated by the intention recognition is to jump to a news channel, the channel interface corresponding to the news channel is opened directly on the resource playing device, and the task execution result is displayed in the voice prompt information 1502 to indicate that the news channel is being opened. ③ When the direct task is a player control task, the playing interface of a video media resource can be opened directly when the video media resource is uniquely determined, realizing rapid resource playing. As shown in fig. 15c, assuming the user intention indicated by the intention recognition result is to open a video media resource, the resource playing device opens the player and plays the video media resource, and displays the task execution result in the voice prompt information 1503 to indicate that the video media resource has been opened successfully.
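The dialogue-task / direct-task branching might be sketched as follows; all collaborator names (generate_problem_result and the ui methods) are assumptions for illustration only:

```python
def handle_intent(result, ui, generate_problem_result):
    """Route a recognized intent to a dialogue task or a direct task."""
    if result.task_kind == "dialogue":
        # Result generation processing: the answer plus the query text form
        # one round of interactive dialogue information (steps S1305-S1307).
        answer = generate_problem_result(result)
        ui.dialog_area.append_turn(result.query_text, answer)
    elif result.task_kind == "device_control":    # ① e.g. close the bullet screen
        ui.device.apply(result.action)
        ui.voice_prompt.show(result.feedback_text)
    elif result.task_kind == "interface_switch":  # ② e.g. jump to the news channel
        ui.open_channel(result.target)
        ui.voice_prompt.show(result.feedback_text)
    elif result.task_kind == "player_control":    # ③ a uniquely determined asset
        ui.open_media_player(result.asset)
        ui.voice_prompt.show(result.feedback_text)
```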
The detailed process of the above result search processing for the voice signal is described below with reference to fig. 16. After the user inputs the voice signal through the media resource application program, the user request (i.e. the user Query, or voice signal) is determined to enter the access layer (corresponding to the recognition state of the voice prompt information). The access layer invokes the AI voice logic layer to perform user request analysis, scene analysis, authority analysis, intention analysis, and the like (general logic including text error correction, word segmentation, etc.) to obtain the basic analysis data; the AI voice logic layer is mainly used to implement some UI componentized construction logic in the service interface and to help the access layer complete the analysis of the general logic, so as to obtain the basic analysis data.
Then, the voice signal and the basic analysis data are input into an intention decision layer. The intention decision layer is connected to both the large language model service provided by the embodiment of the application and the original traditional intention capability service of the resource playing device, so that it can decide, based on the voice signal and the basic analysis data, whether to process the user Query with the traditional intention capability service or with the large language model service. The traditional intention capability service may include a task execution service, a template capability, a model service capability, an object data service, a search service, and a scene matching service. The task execution service implements task execution for direct-class tasks. The template capability supports rapid recognition of intent through string matching (e.g., exact string match, fuzzy string match, exact slot match, fuzzy slot match, etc.). The model service capability can use traditional models (e.g., a BERT model) to perform intent recognition on simpler voice signals (e.g., those whose corresponding problem description information is short). The object data service supports returning the user intent based on the user's historical voice usage. The search service can support querying video media resources through search capabilities for simpler user Queries. For example, three types of buttons exist on the home page of the media resource application program (application, media resource, and function buttons), each carrying text content that describes its function; the button text can be matched against the user Query in a simple string-matching manner to judge whether the scene information the user wants to click exists on the home page.
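A minimal sketch of this routing decision is given below, using simple stand-in heuristics; the helper functions, patterns, and thresholds are illustrative assumptions, not the actual decision logic of the intention decision layer.

```python
def template_match(query: str) -> bool:
    # stand-in for the template capability's string/slot exact and fuzzy matching
    return query in {"turn up the volume", "next episode"}

def button_text_match(query: str, scene: dict) -> bool:
    # stand-in for matching home-page button text against the user Query
    return any(query == text for text in scene.get("button_texts", []))

def decide_route(query: str, basic_analysis: dict) -> str:
    """Return which service should handle the user Query."""
    if template_match(query):
        return "traditional/template"
    if button_text_match(query, basic_analysis.get("scene", {})):
        return "traditional/scene_match"
    if len(query) <= 10:
        return "traditional/model"  # e.g. a BERT-style model for short Queries
    return "llm"                    # long or complex Query: large language model service
```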
If the intention decision layer determines that the large language model service should perform the intention recognition processing and result generation processing for the user Query, the voice signal (i.e., the user Query) and the basic analysis data are input together to the intent model service. The intent model service interfaces with at least one generative model for identifying intent; that is, the embodiment of the application supports using multiple generative models to enable voice interaction. The generative models may include, but are not limited to, a multi-intent classification model, an intent classification model, a scene matching model, and the like. The scene matching model realizes intention recognition by performing scene matching on the user Query. The core functions of the intent classification model and the multi-intent classification model are to identify the intent of the user's Query and extract its slots: semantic analysis by the intent classification model or the multi-intent classification model determines the user's main target or operation intention and extracts the related key information (slots) from it, so as to identify the user's intent. Compared with the intent classification model, the multi-intent classification model can identify multiple intents. For example, the intent classification model performs single-intent recognition: given the input "i want to hear the song of singer", the recognized slot includes the singer (singer_name), and the output intention recognition result is music-singer. For another example, the multi-intent classification model performs multi-intent recognition: given the input "open music player plays song A", one recognized slot is the music player, with the corresponding intent open-app, and another recognized slot is song A, with the corresponding intent music-song.
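For illustration, assumed output structures for the two classification cases above might look as follows (the field names are hypothetical):

```python
# single-intent recognition by the intent classification model
single_intent_result = {
    "query": "i want to hear the song of singer",
    "intent": "music-singer",
    "slots": {"singer_name": "singer"},
}

# multi-intent recognition by the multi-intent classification model
multi_intent_result = {
    "query": "open music player plays song A",
    "intents": [
        {"intent": "open-app",   "slots": {"app_name": "music player"}},
        {"intent": "music-song", "slots": {"song_name": "song A"}},
    ],
}
```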
Notably, to improve the accuracy of intention recognition, after processing the first user Query, the resource playing device (specifically, the media resource application program running in it) judges according to the user's actual operation whether the interactive session spans multiple rounds; if so, the next user request carries the session information of the historical user requests. In this way, the generative model jointly determines the intent type of the next user request based on the dialog information of the multiple rounds of interaction. Exemplary content of at least one round of historical interactive dialogue information carried by the next user request is as follows:
Round1: "role": user, content: "xxxxQuery: set alarm clock answer:" // user Query requests to set an alarm clock
Round1: "role": system, content: "{\"intent\":\"CLOCK_REMINDAER/CLOCK\"}" // alarm clock set in response to the user Query
Round1: "role": user, content: "Query: five-point alarm clock answer:" // user Query requests to set a 5 o'clock alarm
Round1: "role": system, content: "CLOCK_REMINDAER/CLOCK" // 5 o'clock alarm set in response to the user Query
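A minimal Python sketch of how a client might assemble such a history into the next request follows; the build_request function and its field names are assumptions for illustration only.

```python
def build_request(history: list[dict], new_query: str) -> dict:
    """Attach historical rounds so the generative model can judge multi-round intent."""
    messages = []
    for round_info in history:
        messages.append({"role": "user", "content": round_info["query"]})
        messages.append({"role": "system", "content": round_info["intent"]})
    messages.append({"role": "user", "content": new_query})
    return {"messages": messages}

request = build_request(
    history=[{"query": "Query: set alarm clock answer:",
              "intent": "{\"intent\":\"CLOCK_REMINDAER/CLOCK\"}"}],
    new_query="Query: five-point alarm clock answer:",
)
```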
The three generative models interfaced by the intent model service can jointly perform rapid domain narrowing of the user intent of the voice signal (that is, quickly identifying the range the intent falls into, such as television or non-television), thereby remarkably improving the efficiency of intention recognition. The embodiment of the application supports fine-tuning any generative model interfaced by the intent model service in combination with service data, so as to improve the fit between the generative model and the specific service, thereby better completing intention recognition under the specific service and improving the accuracy and efficiency of intention recognition. The construction of training data for the generative model and the model training process are described in detail below, wherein:
(1) Training data is constructed.
The general flow of constructing the training data can be seen in fig. 17. As shown in fig. 17, training data may be uploaded directly, or constructed based on reference data, a prompt word (Prompt) template, and a parameter data set (including various parameters in the media resource scene). In the latter case, a base large model first generates preliminary training data conforming to the service scene based on the reference data; the preliminary training data is then cleaned (or corrected) to obtain training data, which can be published into a data warehouse and merged with training data obtained in other ways (i.e., user-uploaded training data) to obtain the training data finally used for model training. Model training (i.e., fine-tuning) is then performed on the generative model with this training data, and the fine-tuned generative model is published on line for use.
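A hedged sketch of this construction flow is given below; base_model.generate and all other names are assumptions standing in for the actual platform interfaces.

```python
import json

def clean(sample: str):
    # data cleaning (correction): keep only samples that parse as valid json
    try:
        parsed = json.loads(sample)
        return parsed if "intent" in parsed else None
    except (json.JSONDecodeError, TypeError):
        return None

def build_training_data(reference_data, prompt_template, parameter_set, base_model):
    # the base large model generates preliminary training data per reference/parameter pair
    preliminary = [
        base_model.generate(prompt_template.format(reference=ref, **params))
        for ref in reference_data
        for params in parameter_set
    ]
    cleaned = [c for c in (clean(s) for s in preliminary) if c is not None]
    return cleaned  # published to the data warehouse and merged with uploaded data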
Taking the generative model as a base Hunyuan model as an example, an exemplary prompt word template includes:
You are an assistant for intent recognition and slot extraction of natural speech text. The following is an explanation of the intents and slots.
### Alternative intents (the value of the intent must be one or more of the following options only; do not generate other options) ###
['MUSIC/SONG', 'MUSIC/ALBUM', 'MUSIC/SINGER'] // the intents are limited to the above options; no new option may be generated
### Intent explanation ###
MUSIC/SONG: play a song; the entity contains only songs. ### Examples ### "[play](act_play) [celadon](music_tag)", "[open](act_open) [celadon](music_tag)", "[i want to listen](act_want) [celadon](music_tag)".
### Multi-intent combinations (when the value of the intent exceeds one, there is one and only one combination form) ###
Multiple intents may only combine ['MUSIC/SONG', 'MUSIC/ALBUM', ...]; generating new combinations on one's own is forbidden.
### Special-format Query ###
Query text in the following format is attributed to the specified intent type:
(ACT_PLAY) TV drama xxxx
In connection with the above description, perform intent recognition and slot extraction line by line; do not introduce additional information beyond the information in the question, and strictly adhere to the defined ranges of intents and slots. Output the result in json format.
question:
[xxxx]
answer:
It can be seen that the prompt word template not only gives the target of the intention recognition (for example, "perform intent recognition and slot extraction; do not introduce additional information beyond the information in the question; strictly adhere to the defined ranges of intents and slots; output the result in json format"), but also gives examples of the recognition target (for example, "You are an assistant for intent recognition and slot extraction of natural speech text", followed by the explanation of the intents and slots). This helps the generative model learn from examples how to perform the target intention recognition, improving the recognition performance of the fine-tuned generative model.
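As a sketch only, using this template at inference time could look like the following; the model.generate call is a hypothetical interface to the fine-tuned generative model, not an actual API of the embodiment.

```python
import json

PROMPT_TEMPLATE = """You are an assistant for intent recognition and slot extraction...
question:
[{question}]
answer:
"""

def recognize_intent(question: str, model) -> dict:
    prompt = PROMPT_TEMPLATE.format(question=question)
    answer = model.generate(prompt)  # the template instructs json-format output
    return json.loads(answer)        # e.g. {"intent": "MUSIC/SONG", "slots": {...}}
```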
(2) Model training.
The general flow of model training can be seen in fig. 18. As shown in fig. 18, unlabeled data is first obtained directly from the live service or injected manually into an unlabeled training data set. Specific unlabeled data (e.g., recent data or data more relevant to the service data) is then queried from the unlabeled training data set. The unlabeled data is cleaned, labeled, and landed into a labeled training data set. Next, data is extracted from the labeled training data set, converted into the data format required by the training platform (e.g., packaged into a json-format training data set file), and packaged and uploaded to a training data warehouse. Finally, training data is fetched from the training data warehouse for model training; the trained generative model can undergo model evaluation to assess its performance, and is published on line for use when the performance is satisfactory. The above-mentioned data cleaning, data extraction, and model evaluation can be performed on a voice data operation platform, a platform for developers to perform visual operations and data queries; the whole model training process is thereby implemented on a one-stop platform for the generative model, which improves the efficiency of model training.
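A compact sketch of this one-stop flow follows; every method on the assumed platform object is a stand-in for the operation platform's actual interfaces, and the release threshold is an arbitrary illustrative value.

```python
import json

def training_pipeline(unlabeled_pool: list, platform):
    batch = [x for x in unlabeled_pool if platform.is_recent_or_relevant(x)]
    batch = [platform.clean(x) for x in batch]    # data cleaning
    labeled = [platform.label(x) for x in batch]  # landed to the labeled training set
    dataset_file = json.dumps(labeled)            # json-format training data set file
    platform.upload_to_warehouse(dataset_file)
    model = platform.train("warehouse://latest")  # model training (fine-tuning)
    if platform.evaluate(model) >= 0.9:           # model evaluation, assumed threshold
        platform.release(model)                   # publish on line for use
    return model
```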
Furthermore, the generative model trained as above can perform intention recognition processing on the voice signal based on the voice signal and the basic analysis data to obtain the intention recognition result. Wherein:
(1) In the media resource scene, if the intention recognition result indicates that the voice task corresponding to the voice signal is a dialogue-class task, specifically a media resource search task under the dialogue-class tasks, the user Query is input to the film and television model service. The film and television model service interfaces with a sub-intent classification model, which classifies the intent of the user Query more finely, so that high-quality media resources for the user Query can be searched and generated through the external network (e.g., other resource databases or platforms independent of the media resource application program) and the internal network (i.e., the resource database corresponding to the media resource application program) based on the finer intent classification.
The specific implementation process in which the film and television model service calls the sub-intent classification model, the external network, and the internal network to search and generate media resources can be seen in fig. 19, as follows:
① Multi-round interactive dialogue information is first extracted from the user Query; specifically, the multiple rounds of interactive dialogue information in the interactive dialogue stream information reported by the media resource application program are obtained.
② Intranet searching is performed. Intent classification judgment is carried out on the multiple rounds of user Queries (i.e., at least one round of interactive dialogue information), aiming to determine the search-path intention for the media resource (for example, "search recommendation", "search play history", "search media resource name", "search actor name", "search scene", "search list", "search movie information", and the like), that is, the dialogue intention of each round of interactive dialogue information in the at least one round of interactive dialogue stream information. Then, intranet resource search processing is performed according to the dialogue intention of each round of interactive dialogue information to obtain intranet resources; specifically, different interfaces (Application Programming Interface, API) are called according to the different search-path intentions, for example interfaces for media resource information, lists, recommendations, user history, search, and the like, to search resources from the database corresponding to the media resource application program. Notably, to promote the richness of the intranet resources, the embodiment of the application supports dimensions such as expanding the return quantity of intranet search results and generating keywords for content recall, recalling richer intranet resources from the database corresponding to the media resource application program, which is more conducive to providing the user with accurate media resources based on richer intranet resources.
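The dispatch from search-path intention to interface might be sketched as below; the intent names follow the text above, while the API paths and the client object are assumptions for illustration.

```python
INTENT_TO_API = {
    "search_recommendation": "/api/recommend",
    "search_play_history":   "/api/user_history",
    "search_asset_name":     "/api/search",
    "search_actor_name":     "/api/search",
    "search_list":           "/api/list",
    "search_movie_info":     "/api/asset_info",
}

def intranet_search(dialog_intents: list[str], client, top_k: int = 50) -> list:
    results = []
    for intent in dialog_intents:
        api = INTENT_TO_API.get(intent, "/api/search")
        # a larger top_k expands the search-result return quantity,
        # recalling richer intranet resources as described above
        results.extend(client.get(api, top_k=top_k))
    return results
```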
The above intent classification of the multi-round user Query can be realized with the sub-intent classification model, and the embodiment of the application also supports fine-tuning the sub-intent classification model with prompt words. An exemplary fine-tuning prompt is as follows:
You are a video search AI assistant. You are provided with the user's historical Query and the latest Query; you need to classify the intent of the latest Query with reference to the user's historical Query.
# Intent definition #
The primary intents are: find video, find function, and others.
## Find video ##
Defined as intents that seek video content.
### Secondary intent list under "find video" ###
1. Find list: new and hot related needs
2. Find play history: the user's historical viewing behavior
3. Find recommendation: personalized film and television
4. Find scene: film and television viewed in certain given scenes
5. Find similarity: content similar to specified content
6. Find video information: certain specified aspects of a video
7. Find actor name: the film and television works of a specified actor
8. Find video: other film and television demands besides the above intents
## Others ##
Defined as intents other than "find video".
### Secondary intent list under "others" ###
1. Chat: chat-related content
# Query information #
## User historical Query ##
{}
## Latest Query ##
{}
Note: the result must be output using the "primary intent, secondary intent" format.
Multistage intent =
Thus, the prompt word not only gives the target of the intent recognition (such as requiring that "the result must be output using the 'primary intent, secondary intent' format"), but also gives an example of the recognition target (such as "You are a video search AI assistant. You are provided with the user's historical Query and the latest Query; you need to classify the intent of the latest Query with reference to the user's historical Query."). In this way, the sub-intent classification model is helped to learn from examples how to realize the intent classification, improving the recognition performance of the fine-tuned sub-intent classification model. It should be noted that the sub-intent classification model in the embodiment of the present application may be a generative model, which is not limited herein.
③ Extranet searching is performed: external network resource search processing is carried out according to the dialogue intention of each round of interactive dialogue information in the at least one round of interactive dialogue stream information to obtain extranet resources. The background server can extract the multiple rounds of interactive dialogue information and perform Query expansion on the problem description information in each round of interactive dialogue information to obtain expanded information that is easier to understand. High-quality web page content is then searched through other search paths (e.g., web retrieval) independent of the database of the media resource application program. The searched web page content is then reordered (e.g., web page slicing, filtering, and BM25 ranking, a ranking function based on a probability model) to extract the external network web page content. Finally, web page abstract processing is performed on the searched web page content to extract a video information abstract and finally obtain the extranet resources; optionally, identification extraction on the web page content is supported (such as extracting a title, e.g., the name of a drama), and resource searching is performed from the database (also called the media resource library) corresponding to the media resource application program according to the extracted identification. Notably, to improve the search accuracy of the external network resources, the embodiment of the application supports dimensions such as relevance and search-path selection to improve the relevance between the searched external network resources and the intention, thereby improving the generation accuracy of the final problem result information.
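A hedged pipeline sketch of this extranet search follows; web, ranker, and summarizer are assumed components standing in for the actual services, and the cutoff of 10 slices is illustrative.

```python
def extranet_search(question: str, web, ranker, summarizer) -> list[dict]:
    expanded = web.expand_query(question)          # Query expansion
    pages = web.retrieve(expanded)                 # web retrieval outside the app database
    slices = [s for p in pages
              for s in web.slice(p) if web.keep(s)]  # web page slicing and filtering
    top = ranker.bm25_rank(expanded, slices)[:10]  # BM25 reordering (probability model)
    return [{
        "summary": summarizer.summarize(s),        # video information abstract
        "title": web.extract_title(s),             # optional identification extraction
    } for s in top]
```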
④ Resource reorganization is performed on the intranet resources and the extranet resources to generate the problem result information corresponding to the voice signal. Resource reorganization can include, but is not limited to, operations such as resource rearrangement and problem result information generation; specifically, the intranet resources found by the intranet search are reordered according to a reordering policy, and then a large language model with content generation capability is called to generate the problem result information based on the reordered intranet resources and the extranet resources. For example, CID (Click ID) generation is performed during the generation of graphic-type problem result information, so that a resource image in the problem result information can be identified to generate content information related to the resource image, such as recommendation reason information. It is worth noting that the reordering policy comprises various sub-policies for generating problem result information, such as an operation-period weight-raising sub-policy, an in-station weight-raising sub-policy, a classical-content weight-raising sub-policy, and a heat weight-raising sub-policy, so that more accurate problem result information can be generated through this rich set of reordering sub-policies.
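A sketch of the reordering step is shown below; the sub-policy names follow the text, while the weights and the llm.generate_answer interface are illustrative assumptions rather than the embodiment's actual values.

```python
SUB_POLICY_WEIGHTS = {
    "operation_period": 1.3,  # operation-period weight-raising sub-policy
    "in_station":       1.2,  # in-station weight-raising sub-policy
    "classic_content":  1.1,  # classical-content weight-raising sub-policy
    "heat":             1.2,  # heat weight-raising sub-policy
}

def reorganize(intranet: list[dict], extranet: list[dict], llm):
    def score(resource: dict) -> float:
        s = resource.get("base_score", 1.0)
        for tag, weight in SUB_POLICY_WEIGHTS.items():
            if tag in resource.get("tags", []):
                s *= weight  # apply every matching weight-raising sub-policy
        return s
    reordered = sorted(intranet, key=score, reverse=True)
    # a content-generation LLM composes the final problem result information
    return llm.generate_answer(intranet=reordered, extranet=extranet)
```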
(2) In the media resource scene, if the intention recognition result indicates that the voice task corresponding to the voice signal is a dialogue-class task, specifically a non-media-resource search task under the dialogue-class tasks, the user Query is input to the non-film-and-television model service. The non-film-and-television model service is mainly used to process content in the general vertical intention field and content related to playing and operating video media resources in the media resource application program. Specifically, operation data such as application control and television station operation are configured and stored first; then the three generative models are combined to perform intent matching on the slot results parsed from the user Query, determining the specific task content of the direct-class task.
S1305, displaying an interactive dialogue interface.
It should be noted that the implementation process shown in step S1305 is the same as the implementation process of "displaying the interactive session interface" described in step S301 in the embodiment shown in fig. 3, and will not be described herein.
And S1306, displaying interactive dialogue stream information triggered by the voice signal in an interactive dialogue area.
And S1307, displaying content information related to the interactive dialogue stream information in a content display area.
It should be noted that the implementation procedures shown in steps S1306-S1307 are the same as those shown in steps S302-S303 in the embodiment shown in fig. 3, and reference may be made to the description of the implementation procedures shown in steps S302-S303, which is not repeated here.
It should be further noted that the embodiment of the present application also supports fully displaying the process of the result search processing in the interactive dialogue area, presenting to the user a visual effect similar to a human thinking through a problem, which can alleviate to some extent the user's anxiety while waiting for the problem result information and make the generation of the problem result information feel dependable. Furthermore, the embodiment of the application also supports, at any moment while the thinking process or the problem result information is displayed in the interactive dialogue area, pausing the information display according to a pause request of the user, which improves the intelligence and controllability of information display in the interactive dialogue area and improves the user's interaction experience.
The following describes, with reference to fig. 20, the specific procedure for presenting the thinking information corresponding to each round of interactive dialogue information in the interactive dialogue area, and for presenting pause information. As shown in fig. 20(a), the problem result information is output in the interactive dialogue area in a streaming-response manner (see the related description of streaming responses); when the resource playing device (specifically, the server corresponding to the resource playing device) receives a pause request from the user before the generation of the problem result information starts, the generation of the problem result information is paused, and a first notification message 2001 is displayed in the interactive dialogue area to indicate that the generation of the problem result information has been paused. As shown in fig. 20(b), the resource playing device first outputs the generated thinking information 2002 in the process of generating the problem result information; if a pause request from the user is received during the generation, the already-output part of the problem result information is retained in the interactive dialogue area, the generation of the problem result information is paused, and a second notification message 2003 is displayed in the interactive dialogue area to indicate that the generation of the problem result information has been paused. As shown in fig. 20(c), if the resource playing device receives the pause request only after the generation of the problem result information has ended, the generated problem result information is still displayed in the interactive dialogue area, and a third notification message 2004 is output to prompt that the problem result information has been displayed completely.
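The three pause cases of fig. 20 could be handled by a loop like the following sketch; chunks, ui, and the threading.Event pause flag are assumptions for illustration only.

```python
import threading

def stream_answer(chunks, ui, pause_flag: threading.Event):
    started = False
    for chunk in chunks:  # streaming-response output of the problem result information
        if pause_flag.is_set():
            if not started:
                ui.notify("Answer generation paused")                          # fig. 20(a), message 2001
            else:
                ui.notify("Answer generation paused; partial answer retained") # fig. 20(b), message 2003
            return
        ui.append_to_dialog_area(chunk)  # thinking information first, then the answer
        started = True
    if pause_flag.is_set():
        ui.notify("The answer has already been displayed completely")          # fig. 20(c), message 2004
```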
Note that the content information in the content display area is also output in a streaming-response manner; as shown in fig. 21, the content information is gradually rendered and displayed in the content display area. Optionally, if the content information in the content presentation area only starts rendering after the rendering of the problem result information in the interactive dialogue area has ended, then whenever a pause request occurs in the interactive dialogue area, the rendering and display of the content information is suspended in the content presentation area; the process of rendering the interactive dialogue area first and then the content presentation area can be seen in fig. 20. Alternatively, if the content information in the content presentation area can be rendered and displayed synchronously with the problem result information in the interactive dialogue area, then when there is a pause request in the interactive dialogue area, the display of the content information is likewise paused in the content presentation area.
In summary, on the one hand, a conventional OTT device performs resource searching and control through remote-controller input, and the use of a remote controller is limited to certain user groups (for example, the elderly and children may be unable to use it), causing problems such as low operation efficiency and overly long common touch paths. On the other hand, prior to the application, users could already issue voice queries, but traditional schemes have poor intent recognition capability and poor content recall capability for long and complex Queries. In contrast, the voice processing based on the generative large language model in the embodiment of the application has better intent recognition capability for long and complex Queries, and stronger content recall capability, than processing words and intents through traditional word segmentation technology and intent recognition capability.
The foregoing details the methods of the embodiments of the present application; for the purpose of better implementing the foregoing aspects, an apparatus of an embodiment of the present application is provided below. In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program that has a predetermined function and works together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part using software, hardware (such as a processing circuit or a memory), or a combination thereof. Likewise, one processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of that module or unit.
Fig. 22 shows a schematic structural diagram of a speech processing device according to an exemplary embodiment of the present application, which may be used to perform some or all of the steps in the method embodiments shown in fig. 3 and 13. Referring to fig. 22, the apparatus includes the following units:
A receiving unit 2201, configured to receive a voice signal and display an interactive dialogue interface, where the interactive dialogue interface includes an interactive dialogue area and a content display area;
a processing unit 2202 for displaying interactive dialog flow information triggered by the speech signal in the interactive dialog region, and
The processing unit 2202 is further configured to display content information related to the interactive dialog flow information in a content presentation area.
In one implementation, the interactive dialogue stream information comprises N rounds of interactive dialogue information, and each round of interactive dialogue information comprises problem description information and problem result information;
the content information associated with the interactive session stream information includes at least one of:
one or more candidate problem descriptions matching with dialog intents of an ith round of interactive dialog information, the ith round of interactive dialog information being displayed in an interactive dialog region, i being an integer and 1≤i≤N, and
The target content information related to the target dialogue information in the ith round of interactive dialogue information comprises at least one of detail information of the target dialogue information, source information of the target dialogue information and comment information of the target dialogue information, and the target dialogue information is default or custom.
In one implementation, the interactive dialogue stream information comprises N rounds of interactive dialogue information, the ith round of interactive dialogue information comprises first problem description information and first problem result information, and the first problem result information comprises a resource image corresponding to a video media resource;
a processing unit 2202, configured to, when displaying content information related to interactive dialog flow information in a content presentation area, specifically:
Target content information related to the resource image is displayed by default in the content presentation area.
In one implementation, the target content information associated with the asset image is a video media asset, and the processing unit 2202 is further configured to:
playing video media assets in a content presentation area;
receiving a pause playing operation aiming at the video media resource in the playing process of the video media resource;
in response to a pause play operation for the video media asset, the video media asset is paused in the content presentation area.
In one implementation, the interactive dialog flow information includes N rounds of interactive dialog information, and a processing unit 2202, configured to, when displaying content information related to the interactive dialog flow information in the content presentation area, specifically:
displaying the target dialogue information as a selected state in the interactive dialogue area in response to an information viewing operation performed for the target dialogue information in the ith round of interactive dialogue information in the interactive dialogue area;
Target content information related to the target dialogue information is displayed in the content presentation area.
In one implementation, the number of content information related to the interactive dialog flow information is at least one, the content presentation area includes at least one content type option, each content type option corresponding to one type of content information, and the processing unit 2202 is further configured to:
And displaying content information corresponding to the selected target content type option in the content display area in response to a selection operation of the target content type option, wherein the target content type option is any one of the at least one content type option.
In one implementation, the processing unit 2202 is further configured to:
Maintaining the content display area to display content information related to the ith round of interactive dialogue information in the process of executing dialogue viewing operation on the ith round of interactive dialogue information in the interactive dialogue area;
When detecting that the interactive dialogue area has dialogue viewing operation of switching from the ith round of interactive dialogue information to the jth round of interactive dialogue information, updating and displaying content information related to the ith round of interactive dialogue information as content information related to the jth round of interactive dialogue information in a content display area, wherein j is a positive integer, j is not equal to i, and j is not less than 1 and not more than N.
In one implementation, the content presentation area includes a dialogue option corresponding to each round of interactive dialogue information in the interactive dialogue stream information, and the processing unit 2202 is further configured to:
Responding to the triggering operation of the target dialogue option in the content display area, and displaying content information related to target interactive dialogue information corresponding to the target dialogue option in the content display area;
The target interactive session information is displayed in the interactive session area.
In one implementation, the number of target session information in the ith round of interactive session information is at least two, and the processing unit 2202 is further configured to:
When the specified target dialogue information in the interactive dialogue area is subjected to the content locating operation, target content information related to the specified target dialogue information is displayed in the content presentation area.
In one implementation, the voice processing method is applied to a media resource application program, and the media resource application program runs in a resource playing device, and the receiving unit 2201 is specifically configured to, when receiving a voice signal:
displaying a service interface of the resource playing device, wherein the service interface comprises voice prompt information in a prompt state;
responding to the confirmation operation of the voice prompt information in the prompt state, and converting the voice prompt information from the prompt state to the wake state;
When the resource playing device starts to collect the voice signal, the voice prompt information is converted from the awakening state to the collecting state, wherein the voice prompt information in the collecting state is used for prompting the resource playing device to collect the voice signal;
when the resource playing device collects the voice signals, the voice prompt information is converted from the collection state to the understanding state, and the voice prompt information in the understanding state is used for prompting the resource playing device to perform result searching processing based on the voice signals.
In one implementation, the resource playback device is a device that performs content delivery based on the content delivery means of the internet, wherein,
The display modes of the interactive dialogue area and the content display area in the interactive dialogue interface comprise horizontal arrangement display, vertical arrangement display and mosaic display.
In one implementation, the result search process includes an intent recognition process and a result generation process, the processing unit 2202 further configured to:
acquiring basic analysis data, wherein the basic analysis data comprises one or more of scene data, object data, authority data and intention list data;
Calling the finely tuned generated model, and carrying out intention recognition processing on the voice signal based on basic analysis data to obtain an intention recognition result;
If the intention recognition result represents that the voice task indicated by the voice signal is a dialogue type task, performing result generation processing according to the intention recognition result to generate problem result information corresponding to the voice signal, wherein the problem result information and the problem description information corresponding to the voice signal form a round of interactive dialogue information in the interactive dialogue stream information.
In one implementation, the processing unit 2202 is further configured to:
If the intention recognition result represents that the voice task indicated by the voice signal is a direct task, performing task processing on the direct task;
the direct class tasks comprise at least one of a control task, an interface switching task and a player control task.
In one implementation, the processing unit 2202 is configured to perform a result generation process according to the intention recognition result, and specifically configured to, when generating problem result information corresponding to the voice signal:
acquiring at least one round of interactive dialogue stream information in the interactive dialogue stream information;
Performing intranet resource searching processing according to the dialogue intention of at least one round of interactive dialogue flow information to obtain intranet resources, and
According to the dialogue intention of at least one round of interactive dialogue flow information, performing external network resource searching processing to obtain external network resources;
And carrying out resource recombination on the intranet resources and the extranet resources to generate problem result information corresponding to the voice signals.
According to an embodiment of the present application, the units in the speech processing apparatus shown in fig. 22 may be separately or wholly combined into one or several additional units, or some unit(s) thereof may be further split into multiple functionally smaller units, which can achieve the same operation without affecting the technical effects of the embodiment of the present application. The above units are divided based on logical functions; in practical applications, the function of one unit may be implemented by multiple units, or the functions of multiple units may be implemented by one unit. In other embodiments of the application, the speech processing apparatus may also comprise other units; in practical applications, these functions may also be assisted by other units and implemented by multiple units in cooperation. According to another embodiment of the present application, the speech processing apparatus shown in fig. 22 may be constructed, and the speech processing method of an embodiment of the present application implemented, by running a computer program capable of executing the steps involved in the methods shown in fig. 3 and fig. 13 on a general-purpose computing device, such as a computer, that includes processing elements such as a central processing unit (Central Processing Unit, CPU) and storage elements such as a random access storage medium (Random Access Memory, RAM) and a read-only storage medium (Read-Only Memory, ROM). The computer program may be recorded on, for example, a computer-readable recording medium, and loaded into and run on the above-described computing device through the computer-readable recording medium.
Based on the same inventive concept, the principle and beneficial effects of the computer device provided in the embodiment of the present application are similar to those of the voice processing method in the embodiment of the present application, and may refer to the principle and beneficial effects of implementation of the method, which are not described herein for brevity.
Fig. 23 is a schematic structural diagram of a computer device according to an exemplary embodiment of the present application. Referring to fig. 23, the computer device includes a processor 2301, a communication interface 2302, and a computer-readable storage medium 2303, which may be connected by a bus or in other manners. The communication interface 2302 is used to receive and transmit data. The computer-readable storage medium 2303 may be stored in a memory of the computer device; it stores a computer program, and the processor 2301 executes the computer program stored in the computer-readable storage medium 2303. The processor 2301 (or CPU) is the computing core and control core of the computer device, adapted to implement one or more computer programs, and in particular to load and execute the one or more computer programs so as to implement the corresponding method flows or functions.
The embodiment of the application further provides a computer-readable storage medium (Memory), which is a memory device in the computer device used to store programs and data. It is understood that the computer-readable storage medium here may include both a built-in storage medium of the computer device and an extended storage medium supported by the computer device. The computer-readable storage medium provides storage space that stores the processing system of the computer device. One or more computer programs adapted to be loaded and executed by the processor 2301 are also stored in this storage space. The computer-readable storage medium may be a high-speed RAM memory, or a non-volatile memory such as at least one magnetic disk memory; optionally, it may be at least one computer-readable storage medium located remotely from the processor.
In one embodiment, the computer device may be a desktop device, a mobile device, or an embedded device mentioned in the foregoing embodiments. One or more computer programs are stored in the computer-readable storage medium, and the processor 2301 loads and executes them to implement the corresponding steps in the foregoing embodiments of the voice processing method. In specific implementations, the processor 2301 loads the one or more computer programs in the computer-readable storage medium and executes the steps of the embodiments of the present application; these steps can be seen in the foregoing description of the embodiments and are not repeated here.
The embodiment of the application also provides a computer program product, which comprises a computer program, and the computer program realizes the voice processing method when being executed by a processor.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the above embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may be in whole or in part in the form of a computer program product. The computer program product comprises one or more computer programs. When the computer program is loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer program may be stored in, or transmitted via, a computer-readable storage medium. The computer program may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center containing an integration of one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., Digital Video Disk, DVD), or a semiconductor medium (e.g., Solid State Disk, SSD), or the like.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (18)
1. A method of speech processing, comprising:
Receiving a voice signal and displaying an interactive dialogue interface, wherein the interactive dialogue interface comprises an interactive dialogue area and a content display area;
displaying interactive dialog flow information triggered by the voice signal in the interactive dialog region, and
Content information associated with the interactive dialog flow information is displayed in the content presentation area.
2. The method of claim 1, wherein the interactive dialog flow information comprises N rounds of interactive dialog information, each round of interactive dialog information comprising question description information and question result information, N being a positive integer;
content information associated with the interactive dialog flow information includes at least one of:
one or more candidate problem description information matching with a dialog intention of an ith round of interactive dialog information, the ith round of interactive dialog information being displayed in the interactive dialog region, i being an integer and 1≤i≤N, and
The target content information related to the target dialogue information in the ith round of interactive dialogue information comprises at least one of detail information of the target dialogue information, source information of the target dialogue information and comment information of the target dialogue information, and the target dialogue information is default or customized.
3. The method of claim 1 or 2, wherein the interactive dialog flow information comprises N rounds of interactive dialog information, wherein the ith round of interactive dialog information comprises first problem description information and first problem result information, wherein the first problem result information comprises a resource image corresponding to a video media resource, and wherein the resource image is defaulted to be target dialog information in the ith round of interactive dialog information;
the displaying content information related to the interactive dialog flow information in the content presentation area includes:
Target content information related to the resource image is displayed by default in the content presentation area.
4. The method of claim 3, wherein the target content information associated with the asset image is a video media asset, the method further comprising:
Playing the video media asset in the content presentation area;
Receiving a pause play operation for the video media resource in the play process of the video media resource;
In response to a pause play operation for the video media asset, pausing play of the video media asset in the content presentation area.
5. The method of claim 1 or 2, wherein the interactive session stream information includes N rounds of interactive session information, wherein the displaying content information related to the interactive session stream information in the content presentation area includes:
Displaying target dialogue information as a selected state in the interactive dialogue area in response to an information viewing operation performed in the interactive dialogue area for the target dialogue information in the ith round of interactive dialogue information;
and displaying target content information related to the target dialogue information in the content display area.
6. The method of claim 1 or 2, wherein the amount of content information associated with the interactive session stream information is at least one, wherein the content presentation area includes at least one content type option, one for each content type option, and wherein the method further comprises:
And in response to a selection operation of a target content type option, displaying the content information corresponding to the selected target content type option in the content display area, wherein the target content type option is any one of at least one content type option.
7. The method of claim 1 or 2, wherein the method further comprises:
Maintaining the content display area to display content information related to the ith round of interactive dialogue information in the process of executing dialogue viewing operation on the ith round of interactive dialogue information in the interactive dialogue area, wherein the dialogue viewing operation comprises at least one of page turning operation, scrolling operation and dragging operation;
When detecting that the interactive dialogue area has dialogue viewing operation of switching from the ith round of interactive dialogue information to the jth round of interactive dialogue information, updating and displaying content information related to the ith round of interactive dialogue information as content information related to the jth round of interactive dialogue information in the content display area, wherein j is a positive integer, j is not equal to i, and 1 is not less than j is not less than N.
8. The method of claim 1 or 2, wherein the content presentation area includes a dialog option corresponding to each turn of interactive dialog information in the interactive dialog flow information, and wherein the method further comprises:
Responding to the triggering operation of the target dialogue option in the content display area, and displaying content information related to target interactive dialogue information corresponding to the target dialogue option in the content display area, wherein the target dialogue option is a dialogue option corresponding to any round of interactive dialogue information;
and displaying the target interactive dialogue information in the interactive dialogue area.
9. The method of claim 8, wherein the number of targeted session information in the ith round of interactive session information is at least two, the method further comprising:
and displaying target content information related to the specified target dialogue information in the content display area when the specified target dialogue information in the interactive dialogue area is subjected to a content positioning operation.
10. The method of claim 1 or 2, wherein the method is applied to a media asset application, the media asset application running in an asset playback device, the receiving a voice signal comprising:
Displaying a service interface of the resource playing device, wherein the service interface comprises voice prompt information in a prompt state, and the voice prompt information in the prompt state is used for prompting the resource playing device to have a voice interaction function;
responsive to a confirmation operation of the voice prompt in a prompt state, converting the voice prompt from the prompt state to an awake state; the voice prompt information in the awakening state is used for prompting that the resource playing equipment is in a voice acquisition state;
When the resource playing device starts to acquire the voice signal, the voice prompt information is converted from the awakening state to the acquisition state, wherein the voice prompt information in the acquisition state is used for prompting the resource playing device to acquire the voice signal;
When the collection of the voice signals by the resource playing equipment is finished, the voice prompt information is converted from the collection state to the understanding state, and the voice prompt information in the understanding state is used for prompting the resource playing equipment to perform result searching processing based on the voice signals.
11. The method of claim 10, wherein the asset playback device is an Internet-based content delivery device, wherein,
The display modes of the interactive dialogue area and the content display area in the interactive dialogue interface comprise horizontal arrangement display, vertical arrangement display and mosaic display.
12. The method of claim 10, wherein the results search process includes an intent recognition process and a results generation process, the method further comprising:
obtaining basic analysis data, wherein the basic analysis data comprises one or more of scene data, object data, rights data and intention list data;
Invoking the finely tuned generated model, and carrying out intention recognition processing on the voice signal based on the basic analysis data to obtain an intention recognition result;
And if the intention recognition result represents that the voice task indicated by the voice signal is a dialogue type task, performing result generation processing according to the intention recognition result to generate problem result information corresponding to the voice signal, wherein the problem result information and the problem description information corresponding to the voice signal form a round of interactive dialogue information in the interactive dialogue stream information.
13. The method of claim 12, wherein the method further comprises:
if the intention recognition result represents that the voice task indicated by the voice signal is a direct task, performing task processing on the direct task;
the direct class tasks comprise at least one of a control task, an interface switching task and a player control task.
14. The method of claim 12, wherein the performing a result generation process according to the intention recognition result to generate problem result information corresponding to the voice signal comprises:
acquiring at least one round of interactive dialogue stream information in the interactive dialogue stream information;
Performing intranet resource searching processing according to the dialogue intention of at least one round of interactive dialogue flow information to obtain intranet resources, and
According to the dialogue intention of at least one round of interactive dialogue flow information, performing external network resource searching processing to obtain external network resources;
and carrying out resource recombination on the intranet resources and the extranet resources to generate problem result information corresponding to the voice signals.
15. A speech processing apparatus, comprising:
the receiving unit is used for receiving the voice signal and displaying an interactive dialogue interface, wherein the interactive dialogue interface comprises an interactive dialogue area and a content display area;
A processing unit for displaying the interactive dialogue stream information triggered by the voice signal in the interactive dialogue area, and
The processing unit is further configured to display content information related to the interactive dialog flow information in the content presentation area.
16. A computer device, characterized by comprising:
a processor adapted to execute a computer program; and
a computer readable storage medium having a computer program stored therein, wherein the computer program, when executed by the processor, implements the speech processing method according to any one of claims 1-14.
17. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor to perform the speech processing method according to any one of claims 1-14.
18. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the speech processing method according to any one of claims 1-14.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202510553886.0A | 2025-04-28 | 2025-04-28 | Voice processing method, device, equipment, medium and program product |
Publications (1)
Publication Number | Publication Date |
---|---|
CN120475003A | 2025-08-12 |
Family
ID=96641540
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202510553886.0A (Pending) | Voice processing method, device, equipment, medium and program product | 2025-04-28 | 2025-04-28 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN120475003A |
Similar Documents
Publication | Title |
---|---|
US12314317B2 | Video generation |
US12177522B2 | Smart television and server |
JP7240505B2 | Voice packet recommendation method, device, electronic device and program |
CN109165302A | Multimedia file recommendation method and device |
CN118155023B | Text graph and model training method and device, electronic equipment and storage medium |
CN119088820A | Intelligent question and answer large screen display system and method based on multimodal large model |
CN111625716A | Media asset recommendation method, server and display device |
CN112182196A | Service equipment applied to multi-turn conversation and multi-turn conversation method |
CN109547840A | Films and television programs search index method, TV and computer readable storage medium |
CN116980718A | Scenario recomposition method and device for video, electronic equipment and storage medium |
CN113411674A | Video playing control method and device, electronic equipment and storage medium |
KR20250096566A | Ambient multi-device framework for agent companions |
CN114627864A | Display device and voice interaction method |
CN118586501A | Display device and task processing method based on large language model |
CN120164466A | Display device, display device control method, and sign language interaction method |
CN115862615A | Display device, voice search method and storage medium |
CN119127377A | Operation guidance method, operation guidance device and electronic equipment |
CN118779402A | Terminal device and search method |
CN118445485A | Display device and voice searching method |
CN113542797A | Interaction method and device in video playing and computer readable storage medium |
CN111950288A | Entity labeling method in named entity recognition and intelligent equipment |
CN117809649A | Display device and semantic analysis method |
CN120475003A | Voice processing method, device, equipment, medium and program product |
CN114840711A | Intelligent device and theme construction method |
CN119440374A | A display device, a server and an interactive processing method |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |