
CN120579009A - A dynamic intention understanding method based on multimodal fusion - Google Patents

A dynamic intention understanding method based on multimodal fusion

Info

Publication number
CN120579009A
CN120579009A
Authority
CN
China
Prior art keywords
intent
intention
user
conversation
types
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202511088840.2A
Other languages
Chinese (zh)
Other versions
CN120579009B (en)
Inventor
欧阳禄萍
刘继鹏
高建文
赵耘逸
唐杰
唐湘峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhixueyun Beijing Technology Co ltd
Original Assignee
Zhixueyun Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhixueyun Beijing Technology Co ltd filed Critical Zhixueyun Beijing Technology Co ltd
Priority to CN202511088840.2A
Publication of CN120579009A
Application granted
Publication of CN120579009B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/25 Fusion techniques
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention discloses a dynamic intention understanding method based on multimodal fusion. The method acquires a user's multimodal behavior data in the current dialogue, rewrites the current dialogue text depending on whether the current dialogue is the first dialogue, and provides the rewritten dialogue text together with the other multimodal behavior data to a large model, prompting the large model to identify the intention of the current dialogue. Specifically, the prompt designates the role of the large model as a user dialogue intention understanding expert, specifies understanding skills for typical dialogue patterns, and specifies the intention types to be identified together with their descriptions; the large model is then instructed to interpret the rewritten dialogue and the other multimodal behavior data according to the designated role and information and to judge which of the intention types the current dialogue belongs to. Finally, a corresponding business process is invoked according to the intention recognition result to provide useful information for the current dialogue. The embodiment can improve user intention recognition in online learning scenarios.

Description

Dynamic intention understanding method based on multimodal fusion
Technical Field
Embodiments of the invention relate to the field of artificial intelligence, and in particular to a dynamic intention understanding method based on multimodal fusion.
Background
Intent understanding refers to identifying the purpose behind an individual's behavior; introducing intent understanding into intelligent dialogue makes it easier to provide users with accurate and useful information.
Existing intention recognition technology relies mainly on single-modality (e.g., text) analysis, making it difficult to effectively fuse multidimensional user behavior data, and the specific intention types of online learning scenarios remain unaddressed. Patent application CN115186076A provides a text-understanding intent extraction method, and CN119886352A provides a deep-learning-based method for dialogue intention understanding and explanation text generation; neither solves the above problems.
Disclosure of Invention
The embodiment of the invention provides a dynamic intention understanding method based on multimodal fusion to solve the above technical problems.
In a first aspect, an embodiment of the present invention provides a dynamic intention understanding method based on multimodal fusion, including:
acquiring multimodal behavior data of a user in the current dialogue, the multimodal behavior data including voice, images, and dialogue text;
rewriting the text of the current dialogue according to whether the current dialogue is the first dialogue;
providing the rewritten dialogue text together with the other multimodal behavior data to a large model to prompt the large model to recognize the intention of the current dialogue; specifically, designating in the prompt the role of the large model as a user dialogue intention understanding expert, specifying understanding skills for typical dialogue patterns, specifying the intention types to be recognized and their descriptions, and instructing the large model to interpret the rewritten dialogue and the other multimodal behavior data according to the designated role and information and judge which of the intention types the current dialogue belongs to;
invoking a corresponding business process according to the intention recognition result to provide useful information for the current dialogue;
wherein the intention types include question-and-answer intent, search intent, to-do task intent, learning task intent, teaching plan authoring intent, and PPT authoring intent.
In a second aspect, an embodiment of the present invention provides an electronic device, including:
one or more processors; and
a memory for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the dynamic intention understanding method based on multimodal fusion described in any of the embodiments.
In a third aspect, embodiments of the present invention further provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the dynamic intention understanding method based on multimodal fusion according to any of the embodiments.
The embodiment of the invention provides a dynamic intention understanding method based on multimodal fusion. It fuses multimodal information and historical dialogue into intention recognition, improving accuracy, and uses a large model to recognize user intention automatically. The prompt designates the large model's role as a user intention understanding expert, specifies understanding skills for typical dialogue patterns, and specifies the intention types to be recognized and their descriptions, providing the key information the large model needs to recognize user intention. The large model is then instructed to interpret the rewritten dialogue and the other multimodal behavior data according to the designated role and information and judge which of the intention types the current dialogue belongs to, so that basic user intentions are recognized accurately. Finally, a corresponding business process is designed for each intention type, providing effective information to users more conveniently.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow chart of a dynamic intent understanding method based on multimodal fusion provided by an embodiment of the invention;
FIG. 2 is a schematic diagram of a reinforcement learning network according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the invention, are within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should also be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be, for example, fixedly connected, detachably connected, integrally connected, mechanically connected, electrically connected, directly connected, indirectly connected through an intermediary, or in communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Fig. 1 is a flowchart of a dynamic intention understanding method based on multimodal fusion according to an embodiment of the present invention. The method is suitable for online learning scenarios and is executed by an electronic device. As shown in fig. 1, the method specifically includes:
S110: acquire multimodal behavior data of the user in the current dialogue, where the multimodal behavior data includes voice, images, and dialogue text.
As described above, this embodiment applies to an online learning scenario, such as an online learning platform. The platform provides the user with an intelligent dialogue window through which the user can obtain useful information and thus complete various tasks on the platform more conveniently.
Optionally, the text the user enters in the current dialogue is obtained first; at the same time, the user's voice data and image data are captured through the camera and audio device of the learning device (such as a computer or mobile phone). These three kinds of data together form the multimodal behavior data of the current dialogue. The voice data can provide tone, mood, and other semantic information not present in the text, and the image data can provide the user's facial expression, mental state, and so on.
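The three data streams of one turn can be modeled as a simple record; this is a minimal sketch, and the field names are illustrative rather than taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class MultimodalTurn:
    """One dialogue turn's multimodal behavior data (hypothetical structure)."""
    text: str               # dialogue text typed into the intelligent dialogue window
    audio: bytes = b""      # voice capture from the learning device's audio hardware
    image: bytes = b""      # camera frame carrying facial expression / mental state
    is_first: bool = False  # True when this turn opens a new session
```

A turn with only text still carries empty slots for the other modalities, so downstream steps can treat every turn uniformly.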
S120: rewrite the text of the current dialogue according to whether the current dialogue is the first dialogue.
Optionally, a "start new session" button is provided in the intelligent dialogue window; when the user clicks this button or logs into the dialogue window for the first time, the current dialogue is considered the first dialogue.
Further, if the current dialogue is the first dialogue, its text is used directly as the rewritten dialogue text in subsequent operations. If it is not the first dialogue, the historical dialogue text between the current dialogue and the most recent first dialogue is extracted, and the history and the current dialogue are merged to form the rewritten dialogue.
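The rewriting rule above can be sketched in a few lines; the newline join is an assumption, since the patent does not fix a concatenation format:

```python
def rewrite_dialogue(current: str, history: list[str], is_first: bool) -> str:
    """Step S120: a first dialogue passes through unchanged; otherwise the
    turns since the most recent first dialogue are merged with the current turn."""
    if is_first or not history:
        return current
    return "\n".join(history + [current])
```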
S130: provide the rewritten dialogue text together with the other multimodal behavior data to the large model, prompting it to recognize the intention of the current dialogue.
This embodiment classifies intention types for online learning scenarios: question-and-answer intent, search intent, to-do task intent, learning task intent, teaching plan authoring intent, and PPT authoring intent. These types cover the broad categories of basic requirements in an online learning scenario.
This embodiment uses a large model to identify dialogue intent, and the intention types above together with their descriptions play a vital role in the large model prompt. The prompt must also contain the following to ensure the accuracy of intention recognition: designate the role of the large model as a user dialogue intention understanding expert; specify understanding skills for typical dialogue patterns; and instruct the large model to interpret the rewritten dialogue and the other multimodal behavior data according to the designated role and information and judge which of the intention types the current dialogue belongs to.
Illustratively, one prompt is as follows:
# Designate role and task
You are an experienced and professional expert in understanding and judging the intent of user questions. You can understand the intent behind a question the user poses and, based on the given intention classification, feed back the user's actual intent in the dialogue.
# Specify understanding skills for typical dialogue patterns
1. If a specific intent word is explicitly mentioned in the user question, return according to the intent the user mentioned;
2. If the intent expressed in the user question coincides closely with the meaning of several intent descriptions in the intention classification, feed back all of the coinciding intention classifications;
3. If there is no explicit intent in the user question, return "question-and-answer intent" by default.
# Specify output format requirements
Output the result according to the format requirement (delimited by <format requirement></format requirement>), ensuring that the output can be parsed correctly by Python json.
# Specify the intention types to be recognized and their descriptions
<intention classification>
Question-and-answer intent: the user mainly hopes to resolve a question quickly and wants you to answer it professionally;
Search intent: the user wants to actively query certain resources and hopes you will perform the search and return results as required;
To-do task intent: the user mainly wants to find the set of to-do items/work to be completed in the future and hopes you can help find the task items to be resolved and their number;
Learning task intent: the user mainly wants to find the set of learning content to be completed in the future, with more emphasis on retrieving the required learning resources;
Teaching plan authoring intent: the user mainly wants to create article content and hopes you will carefully write a teaching article;
PPT authoring intent: the user mainly wants to produce a PPT and hopes you can plan and author a polished PPT.
</intention classification>
# Specify other parameter variables
<user question>
{question}
</user question>
<format requirement>
["the dialogue intent is returned here"]
</format requirement>
Here, # marks annotations within the prompt, { and } mark parameter slots, the preset intention types and their descriptions are filled in between <intention classification> and </intention classification>, and the rewritten user dialogue is filled in between <user question> and </user question>.
Based on this prompt, the large language model can accurately judge which of the intention types the current dialogue belongs to.
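Prompt assembly and result parsing can be sketched as follows; the template is abbreviated relative to the full prompt above, and the function names are illustrative:

```python
import json

# Abbreviated stand-in for the full prompt; most section text is elided here.
PROMPT_TEMPLATE = """# Designate role and task
You are an expert in understanding and judging the intent of user questions.
# Specify the intention types to be recognized and their descriptions
<intention classification>
{intent_types}
</intention classification>
<user question>
{question}
</user question>
<format requirement>
["the dialogue intent is returned here"]
</format requirement>"""

def build_prompt(intent_types: list[str], question: str) -> str:
    """Fill the {…} parameter slots of the template."""
    return PROMPT_TEMPLATE.format(intent_types="\n".join(intent_types),
                                  question=question)

def parse_intents(model_output: str) -> list[str]:
    """The format requirement asks for a JSON list parseable by Python json."""
    return json.loads(model_output)
```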
S140: invoke the corresponding business process according to the intention recognition result to provide useful information for the current dialogue.
In this embodiment, different business processes are designed for different intention types to provide faster and more accurate personalized services.
For the question-and-answer intent, the business process calls the large model to analyze and trace the knowledge points, then retrieves the knowledge points and their sources from the knowledge vector base to obtain useful information. For the search intent, the business process directly queries the knowledge vector base to retrieve matching information while also executing the question-and-answer business process; the two run in parallel, and both the direct search result and the analyzed search result are offered to the user to choose from. For the to-do task intent, the large model is called to parse the to-do dialogue text, the statistics of the to-do tasks are fetched remotely according to the to-do object, and a to-do task list for that object is fed back to the user. For the learning task intent, the large model is called to parse the learning dialogue text, learning resource data is fetched remotely according to the object to be learned, and specific learning resources are fed back to the user. For the teaching plan authoring intent, the large model is called to analyze the authoring points, related texts are retrieved from the knowledge vector base, and the texts are fed back to the large model to be composed into a complete manuscript. For the PPT authoring intent, the large model is called to analyze the authoring points and plan the page layout, related texts and multimedia materials are retrieved from the knowledge vector base, and these are fed back to the large model to be composed into the corresponding PPT.
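The per-intent routing can be sketched as a dispatch table; the flow bodies below are stubs standing in for the knowledge-base and remote-scheduling calls described above:

```python
# Stub flows; the real ones would query the knowledge vector base, call the
# large model, or schedule remote data as described in the text.
def qa_flow(text): return "qa:" + text          # trace knowledge points, query vector base
def search_flow(text): return "search:" + text  # direct query (qa_flow runs in parallel)
def todo_flow(text): return "todo:" + text
def learning_flow(text): return "learning:" + text
def lesson_flow(text): return "lesson:" + text
def ppt_flow(text): return "ppt:" + text

BUSINESS_FLOWS = {
    "question-and-answer intent": qa_flow,
    "search intent": search_flow,
    "to-do task intent": todo_flow,
    "learning task intent": learning_flow,
    "teaching plan authoring intent": lesson_flow,
    "PPT authoring intent": ppt_flow,
}

def dispatch(intent: str, text: str) -> str:
    # Unknown intents fall back to question answering, mirroring the prompt's default.
    return BUSINESS_FLOWS.get(intent, qa_flow)(text)
```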
The above flow is performed for each dialogue. The method fuses multimodal information and historical dialogue, improving the accuracy of intention recognition, and uses a large model to recognize user intention automatically: the prompt designates the large model's role as a user intention understanding expert, specifies understanding skills for typical dialogue patterns, and specifies the intention types to be recognized and their descriptions, giving the large model the key information it needs to recognize user intention; the large model is then instructed to interpret the rewritten dialogue and the other multimodal behavior data according to the designated role and information and judge which of the intention types the current dialogue belongs to, so that basic user intentions are recognized accurately; finally, a corresponding business process is designed for each intention type, providing effective information to users more conveniently.
Further, as noted above, the intention types preset in the prompt are key to ensuring that the large model recognizes intent accurately. To adapt to dynamic changes in users' dialogue intent and adjust the large model's intention recognition capability in time, the preset intention types can be dynamically adjusted. In one embodiment, this process includes the following steps:
Step 1: generate the user's intention analysis chain according to how the user collected useful information in previous dialogues. The chain consists of the intention recognition results of each dialogue preceding the dialogue in which the user adopted the useful information, or preceding the dialogue that terminated the dialogue chain. The intention analysis chain reflects how the large model explored the user's real intent: the shorter the chain, the more accurate the large model's intention recognition.
Optionally, the link points that partition the intention analysis chain are identified first. Specifically, if the user clicks or copies the useful information of some dialogue, the user is judged to have adopted that dialogue's useful information and the dialogue is marked as a link point; if, in some dialogue, the user enters no dialogue text, exits the dialogue interface, or clicks the option to start a new dialogue, the user is judged to have terminated the dialogue chain and that dialogue is likewise marked as a link point. There is one special case: if, after clicking or copying the useful information of a dialogue, the user enters no dialogue text within a set duration, exits the dialogue interface, or performs any of the actions that open a new dialogue, that dialogue is marked as a link point.
After the link points are marked, the intention recognition results of the dialogues between two link points of the same user are arranged in order to form that user's intention analysis chain. Note that the chain between two link points includes the later of the two link points but not the earlier one; that is, each link point counts as the end point of the preceding intention analysis chain and not as part of the next one.
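The segmentation rule, where each link point closes the chain it belongs to, can be sketched as follows (an illustrative sketch; names are not from the patent):

```python
def split_chains(intents, link_flags):
    """Split a per-dialogue intent sequence into intention analysis chains.
    link_flags[i] is True when dialogue i was marked as a link point; a link
    point ends the current chain and is included in it, not the next one."""
    chains, current = [], []
    for intent, is_link in zip(intents, link_flags):
        current.append(intent)
        if is_link:
            chains.append(current)
            current = []
    if current:  # trailing dialogues not yet closed by a link point
        chains.append(current)
    return chains
```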
Step 2: adjust the intention types according to how the intention recognition results change within the intention analysis chain, and apply the adjusted intention types in subsequent large model prompts.
As described above, this embodiment uses the intention analysis chain to reflect how the large model explored the user's real intent. The shorter the chain, the more accurate the recognition: the real intent was found and useful information provided within a few dialogues. The longer the chain, the more rounds of dialogue the large model needed before reaching the real intent and providing useful information, which very likely means the intention classification provided to the large model is imperfect and is hurting the accuracy of intention analysis; the classification should then be adjusted. Note that "real intent" here means the user's actual intent, which is not limited to the intention types set in the prompt: it may be close to those types, a finer subdivision of them, or entirely different. As long as one of the intention types set in the prompt is very close to the real intent, the large model can recognize the real intent more accurately, so the goal of this embodiment is to bring the intention types in the prompt closer to real intents. Optionally, this embodiment provides the following two alternative implementations that dynamically adjust the intention types set in the prompt (hereinafter "intention classification adjustment") to achieve this goal:
The first alternative implementation applies to an intention analysis chain whose end point is the user adopting the useful information of the last dialogue. If the chain's length exceeds a set threshold, the user's intent exploration took too long: the useful information was obtained only after many dialogues. Intention classification adjustment can then proceed in two cases:
Case 1: all intention recognition results in the chain are the same, indicating that the large model recognized the user's intent correctly but failed to provide trustworthy useful information early on. This is most likely because the intention classification is not fine-grained enough, so the user kept being drawn into deeper dialogue. In this case, the shared intention type can be subdivided according to the last two dialogues in the chain. Optionally, keyword analysis is performed on both dialogues, the keyword of the last dialogue with the largest semantic difference from the second-to-last dialogue is determined, and the shared intention type of the current chain is subdivided by that keyword. For example, suppose all intention recognition results in the chain are "search intent", the second-to-last dialogue is "please help me search for the rescue essentials in the flood-fighting and disaster-relief course", and the last dialogue is "please help me search for the rescue practice content in the flood-fighting and disaster-relief course". The keyword of the last dialogue with the largest semantic difference is "practice", so the search intent can be subdivided into "practical search intent" and "theoretical search intent", each with its own description and corresponding business process.
In this way, the large model can attend to practical versus theoretical information as early as possible in each dialogue and recognize the user's more finely subdivided real intent earlier, or present the practical and theoretical search intents to the user early so that the user selects the real subdivided intent directly. The operations from keyword analysis to business process design may be implemented automatically with a large model, by a developer from experience, or by a combination of automation and human experience; this embodiment does not limit them.
Case 2: the intention recognition results in the chain are not all the same, meaning the large model reached the correct intent only after repeatedly exploring the user's intent, and only then provided trustworthy useful information. Here this embodiment focuses on the turning point from wrong intent to correct intent in order to correct the most basic intent direction. Optionally, the continuously repeated segment of the final intention recognition result at the end of the chain is extracted, and the intention type of the last dialogue before that segment is subdivided according to the first dialogue of the segment. For example, given the chain "question-and-answer, search, to-do task, learning task, learning task", the final result "learning task" repeats twice at the end, so the continuously repeated segment is the trailing "learning task, learning task"; its first dialogue is the second-to-last dialogue, and the last dialogue before the segment is the third-to-last dialogue, whose intention type "to-do task" is subdivided according to the content of the second-to-last dialogue. Optionally, keyword analysis is performed on those two dialogues, the keyword of the second-to-last dialogue with the largest semantic difference from the third-to-last dialogue is determined, and the "to-do task" intention type is subdivided by that keyword.
Continuing the example, the third-to-last dialogue is "please help me find the courses I have not completed" and the second-to-last is "please help me preview the content of the courses I have not completed"; the keywords with the largest semantic difference are "preview" and "content". The "to-do task" intent can therefore be subdivided into "to-do task list" and "to-do content preview", each with its own description, e.g., "provide the user with a to-do task list" versus "provide the user with a to-do task list and a content preview thumbnail for each task", to distinguish the two intents. Corresponding business processes are designed likewise; for example, the "to-do content preview" process adds the retrieval of to-do task content and thumbnails on top of the "to-do task list" process. As before, the operations from keyword analysis to business process design may be implemented automatically with a large model, by a developer from experience, or by a combination of the two; this embodiment does not limit them.
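Extracting the trailing repeated segment and the dialogue whose intent type gets subdivided can be sketched as follows (illustrative names, not from the patent):

```python
def trailing_repeat(chain):
    """Return (segment_start, segment): segment is the maximal run of the
    chain's final intent at its end, and the dialogue at segment_start - 1
    is the one whose intent type gets subdivided."""
    last = chain[-1]
    start = len(chain)
    while start > 0 and chain[start - 1] == last:
        start -= 1
    return start, chain[start:]
```

Applied to the example chain, the segment is the two trailing "learning task" results and the dialogue before it carries the "to-do task" intent.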
The second alternative implementation applies to an intention analysis chain whose end point is the user terminating the dialogue chain. If the chain's length exceeds a set threshold, the exploration took too long and no trustworthy useful information was obtained after many dialogues; the intention recognition was then likely inaccurate and the whole series of questions and answers was ineffective. Intention classification adjustment can proceed in two cases:
In the first case, the intent recognition results in the intent analysis chain cover all existing intent types, which indicates that none of the existing intent types captures the user's actual intent; new intent types can then be added according to the user's historical conversation content. Optionally, keyword parsing can be performed on all conversations in the intent analysis chain, association rules can be used to determine the keywords with the highest co-occurrence frequency across those conversations, and new intent types can be added according to these keywords. For example, if the most frequently co-occurring keywords across all conversations are "bidding scheme", "technical indexes", "business indexes", and so on, a new "bidding scheme generation" intent type can be added and a corresponding business process designed. As before, the operations from keyword analysis to business process design may be implemented automatically by the large model, implemented empirically by a developer, or implemented by a combination of the two; this embodiment is not specifically limited in this regard.
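The association-rule step can be sketched with simple pair-support counting. This is an illustrative heuristic only; the per-conversation keyword lists are assumed to come from an upstream keyword parser, and the example data mirrors the "bidding scheme" example above:

```python
from collections import Counter
from itertools import combinations

def top_cooccurring_keywords(dialog_keywords, k=3):
    """Rank keywords by how often they participate in co-occurring pairs:
    count the support of every keyword pair within a conversation, then
    sum each keyword's pair supports and return the top-k."""
    pair_support = Counter()
    for kws in dialog_keywords:
        pair_support.update(combinations(sorted(set(kws)), 2))
    score = Counter()
    for (a, b), c in pair_support.items():
        score[a] += c
        score[b] += c
    return [w for w, _ in score.most_common(k)]

parsed = [
    ["bidding scheme", "technical indexes"],
    ["bidding scheme", "business indexes"],
    ["bidding scheme", "technical indexes", "business indexes"],
]
print(top_cooccurring_keywords(parsed))  # 'bidding scheme' ranks first
```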
In the second case, the intent recognition results in the intent analysis chain do not cover all existing intent types. The user can then be asked about the intent types the chain did not cover, to determine whether the user's intent belongs to one of them. For example, if the "PPT creation intent" does not appear in the current intent analysis chain, the user may be asked "Would you like to create a PPT?".
The above embodiments focus on intent analysis chains in which the user's exploration process is overly long, analyze how the intent classification itself may be blocking recognition of the true intent according to the different exploration paths the chains exhibit, give concrete ways to eliminate those causes, and thereby provide a direction for improving the rationality and accuracy of the intent classification.
However, when intent exploration runs too long, recognition of the true intent may be blocked not only by the intent classification but also by factors such as poorly phrased user conversation content or fluctuations in the user's own intent. Whether an adjusted intent classification, once applied in the large model prompt, genuinely helps identify the true intent therefore needs further verification. For this reason, this embodiment treats the set of intent types after each adjustment as one intent classification scheme and constructs a reinforcement learning network to determine the optimal scheme. For example, the six intent types in the prompt example of S130 constitute one intent classification scheme; after the search intent is subdivided into a practical-operation search intent and a theoretical search intent, the resulting seven intent types (question-answer intent, practical-operation search intent, theoretical search intent, to-do task intent, learning task intent, lesson-plan creation intent and PPT creation intent) constitute another scheme.
In one embodiment, the reinforcement learning network takes the users' intent analysis chains in actual conversations as the state variable and the various intent classification schemes as the action variable, and constructs a reward function from the final results and the average length of the users' intent analysis chains to select the optimal intent classification scheme, which is then applied in subsequent large model prompts. Optionally, the higher the coverage of the currently used intent classification scheme by the end results of all users' intent analysis chains over a period of time, the higher the positive reward; and the shorter the average length of all users' intent analysis chains over that period, the higher the positive reward. By way of example, the following reward function may be constructed:
$$R = \alpha \cdot \frac{N_c}{N_t} - \beta \cdot \bar{L}$$

where $R$ denotes the reward value, $N_c$ denotes the number of intent types in the currently used intent classification scheme that are covered by the end results of all users' intent analysis chains over a period of time, $N_t$ denotes the total number of intent types in the intent classification scheme, $\bar{L}$ denotes the average length of all users' intent analysis chains over that period, and $\alpha$ and $\beta$ are positive scaling factors.
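The reward described here, a coverage ratio minus a penalty on average chain length, can be computed as below; the function name and the coefficient values are illustrative assumptions, not values specified by this embodiment:

```python
def reward(num_covered, num_total, avg_chain_len, alpha=1.0, beta=0.1):
    """Reward = alpha * (covered intent types / total intent types)
    minus beta * average intent-analysis-chain length.
    alpha and beta are positive scaling factors (values here are illustrative)."""
    return alpha * (num_covered / num_total) - beta * avg_chain_len

# e.g. 6 of the scheme's 7 intent types covered, average chain length 3.0
print(round(reward(6, 7, 3.0), 3))  # 0.557
```

Higher coverage raises the reward, longer average chains lower it, matching the two optional criteria above.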
Optionally, the user's intent analysis chain in the state variable may be encoded as a vector whose dimension is greater than the maximum length of any historical intent analysis chain. Each intent type corresponds to a distinct non-zero value: the first dimension of the vector holds the non-zero value corresponding to the first intent recognition result in the chain, the second dimension holds the value corresponding to the second result, and so on, until every intent recognition result in the chain has a corresponding vector element; the remaining vector elements are filled with 0.
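The described encoding can be sketched as follows; the intent-to-value mapping shown is a hypothetical example (any assignment of distinct non-zero values per intent type would do):

```python
def encode_chain(chain, intent_value, dim):
    """Encode an intent analysis chain into a fixed-length vector:
    element i holds the non-zero value of the i-th intent recognition
    result, and the unused tail is zero-padded."""
    if len(chain) > dim:
        raise ValueError("vector dimension must cover the longest chain")
    vec = [0.0] * dim
    for i, intent in enumerate(chain):
        vec[i] = float(intent_value[intent])
    return vec

intent_value = {"question-answer": 1, "search": 2,
                "to-do task": 3, "learning task": 4}  # hypothetical mapping
print(encode_chain(["question-answer", "search", "to-do task"],
                   intent_value, dim=6))
# [1.0, 2.0, 3.0, 0.0, 0.0, 0.0]
```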
Optionally, the reinforcement learning network is structured as shown in fig. 2: it takes the intent analysis chains (i.e., state variables) of all users over a period of time as batch input, and outputs the optimal intent classification scheme (i.e., action variable) to adopt for the next period. The network comprises a feature extraction layer and a Q value calculation layer. The feature extraction layer encodes the input features to obtain corresponding encoded features, and the Q value calculation layer outputs a Q value vector from the encoded features, in which each element is the probability of selecting the corresponding intent classification scheme; the scheme with the highest probability is the action variable finally output by the network. Optionally, both the feature extraction layer and the Q value calculation layer are fully connected structures, where the number of nodes in the feature extraction layer grows layer by layer to enrich the feature information and the number of nodes in the Q value calculation layer shrinks layer by layer to match the dimension of the Q value vector; the specific numbers of layers and nodes can be set flexibly as required. The network updates its Q values as follows:
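The expand-then-contract fully connected structure can be sketched with NumPy as below. The layer widths, the softmax readout, and the random weights are illustrative assumptions; fig. 2 and the actual trained parameters are not reproduced here:

```python
import numpy as np

def q_network_forward(states, sizes, rng=np.random.default_rng(0)):
    """Fully connected stack: widths in `sizes` first grow (feature
    extraction layers) then shrink to the number of schemes (Q value
    calculation layers); a final softmax yields a selection probability
    for each intent classification scheme."""
    x = states
    for i, (d_in, d_out) in enumerate(zip(sizes[:-1], sizes[1:])):
        w = rng.standard_normal((d_in, d_out)) * 0.1  # untrained weights
        x = x @ w
        if i < len(sizes) - 2:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# batch of 4 encoded chains (dim 6), choosing among 3 candidate schemes
probs = q_network_forward(np.zeros((4, 6)), [6, 16, 32, 16, 3])
print(probs.shape)  # (4, 3); each row sums to 1
```

Each row of `probs` is one user's Q value vector over the candidate schemes; `probs.argmax(axis=-1)` would give the selected scheme per input.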
First, the updated Q value $Q'(s_t, a_t)$ is calculated:

$$Q'(s_t, a_t) = Q(s_t, a_t) + \eta \left[ r + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]$$

where $Q(s_t, a_t)$ is the Q value of taking action $a_t$ in state $s_t$, $Q'(s_t, a_t)$ is the updated Q value, $\max_{a'} Q(s_{t+1}, a')$ is the largest of all Q values in state $s_{t+1}$, and $r$ is the reward obtained by taking action $a_t$ in state $s_t$ and reaching state $s_{t+1}$; $\eta$ and $\gamma$ are adjustable coefficients.
Then, the following loss is constructed from the difference between $Q'(s_t, a_t)$ and $Q(s_t, a_t)$:

$$\mathcal{L} = \left( Q'(s_t, a_t) - Q(s_t, a_t) \right)^2$$

According to $\mathcal{L}$, the network parameters of the feature extraction layer and the Q value calculation layer can be updated.
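The update and loss can be sketched in tabular form as below. In this embodiment the Q values come from the network and the loss drives parameter updates by gradient descent, whereas this illustration simply applies the update to a lookup table; the coefficient values are assumptions:

```python
def q_step(Q, s, a, r, s_next, eta=0.1, gamma=0.9):
    """One Q update: Q'(s,a) = Q(s,a) + eta*(r + gamma*max_a' Q(s',a') - Q(s,a)),
    followed by the squared-difference loss (Q'(s,a) - Q(s,a))**2."""
    q_old = Q[s][a]
    q_new = q_old + eta * (r + gamma * max(Q[s_next]) - q_old)
    loss = (q_new - q_old) ** 2
    Q[s][a] = q_new
    return q_new, loss

Q = {0: [0.0, 0.0], 1: [1.0, 0.0]}  # two states, two candidate schemes
q_new, loss = q_step(Q, s=0, a=0, r=1.0, s_next=1)
print(q_new, loss)  # approximately 0.19 and 0.0361
```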
Based on this algorithm, the reinforcement learning network observes, over continuing user conversations, how each intent classification scheme changes the intent analysis chains after it is put into use, and gradually learns the optimal scheme that genuinely improves intent recognition. Applying that optimal scheme in subsequent large model prompts makes the large model's intent recognition, and hence the outcomes of user conversations, better. Whenever a new intent classification scheme is added, a corresponding node, fully connected to the previous layer, is added to the last layer of the Q value calculation layer, and the network parameters are then updated gradually according to subsequent conversations.
Of course, besides the reinforcement learning network, the various intent classification schemes can also be evaluated according to user feedback, so that intent classification adjustments are assessed in time and degraded or unstable post-adjustment performance is avoided.
Further, in another specific embodiment, the operations of adjusting and evaluating the intent classification scheme may be performed per user: an intent analysis chain is generated for each user according to that user's acceptance of useful information in previous conversations, each user's set of intent types is adjusted according to the changes in the intent recognition results in that user's intent analysis chains, and each user's adjusted intent types are applied in that user's subsequent large model prompts.
Correspondingly, for each user, the set of intent types after each adjustment is treated as one intent classification scheme, and a reinforcement learning network is constructed per user that takes the current user's intent analysis chains as the state variable and the current user's intent classification schemes as the action variable; a reward function is constructed from the final results and the average length of the current user's intent analysis chains to select the optimal intent classification scheme for that user, which is then applied in the large model prompts for that user.
The advantage of adjusting the intent classification scheme separately for each user is that it adapts to each user's personal learning habits and avoids the adverse effects that could arise from applying a single user's intent classification adjustments to all users; the disadvantage is that it requires more storage space and computing resources. In practical applications, the implementation can be chosen flexibly according to the actual situation of the system.
In addition, it should be noted that, the user data (including voice data, image data, and dialogue text data) related to the present application are all information and data authorized by the user or fully authorized by each party, and the collection, use and processing of the related data need to comply with the related laws and regulations and standards of the related country and region, and provide corresponding operation entries for the user to select authorization or rejection.
Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the device includes a processor 60, a memory 61, an input device 62 and an output device 63. The number of processors 60 in the device may be one or more; one processor 60 is taken as an example in fig. 3. The processor 60, the memory 61, the input device 62 and the output device 63 in the device may be connected by a bus or in other manners; connection by a bus is taken as an example in fig. 3.
The memory 61 is used as a computer readable storage medium for storing software programs, computer executable programs and modules, such as program instructions/modules corresponding to the dynamic intent understanding method based on multimodal fusion in the embodiments of the invention. The processor 60 executes various functional applications of the device and data processing, i.e., implements the above-described dynamic intent understanding method based on multimodal fusion, by running software programs, instructions, and modules stored in the memory 61.
The memory 61 may mainly include a storage program area that may store an operating system, an application program required for at least one function, and a storage data area that may store data created according to the use of the terminal, etc. In addition, the memory 61 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, memory 61 may further comprise memory remotely located relative to processor 60, which may be connected to the device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 62 may be used to receive entered numeric or character information and to generate key signal inputs related to user settings and function control of the device. The output device 63 may comprise a display device such as a display screen.
Embodiments of the present invention also provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the dynamic intent understanding method based on multimodal fusion of any of the embodiments.
The computer storage media of embodiments of the invention may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the C language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the above embodiments are merely illustrative of the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that the technical solutions described in the above embodiments may still be modified, or some or all of their technical features may be equivalently replaced, without such modifications or substitutions causing the essence of the corresponding technical solutions to depart from the technical solutions of the embodiments of the present invention.

Claims (10)

1. A dynamic intent understanding method based on multimodal fusion, comprising:

acquiring multimodal behavior data of a user in a current conversation, the multimodal behavior data including voice, image and conversation text;

rewriting the current conversation text according to whether the current conversation is a first conversation;

providing the rewritten conversation text together with the other multimodal behavior data to a large model, and prompting the large model to identify the intent of the current conversation; specifically, the prompt designates the role of the large model as an expert in understanding user conversation intent, specifies understanding techniques for typical conversation patterns, specifies multiple intent types to be identified and their descriptions, and instructs the large model to understand the rewritten conversation text and the other multimodal behavior data according to the designated role and information and to judge which of the multiple intent types the current conversation belongs to; and

invoking a corresponding business process according to the intent recognition result to provide useful information for the current conversation;

wherein the multiple intent types include a question-answer intent, a search intent, a to-do task intent, a learning task intent, a lesson-plan creation intent and a PPT creation intent.

2. The method according to claim 1, wherein rewriting the current conversation text according to whether the current conversation is a first conversation comprises:

if the current conversation is a first conversation, taking the current conversation text as the rewritten conversation text; and

if the current conversation is not a first conversation, combining the current conversation text with historical conversation text as the rewritten conversation text.

3. The method according to claim 1, further comprising, after invoking the corresponding business process according to the intent recognition result to provide useful information for the current conversation:

generating an intent analysis chain of the user according to the user's acceptance of the useful information in previous conversations, wherein the intent analysis chain consists of the intent recognition results corresponding to the previous conversations before the user accepted the useful information of a conversation or terminated the conversation chain; and

adjusting the multiple intent types according to changes in the intent recognition results in the intent analysis chain, and applying the adjusted intent types in subsequent large model prompts.

4. The method according to claim 3, wherein generating the intent analysis chain of the user according to the user's acceptance of the useful information in previous conversations further comprises:

if the user clicks or copies the useful information of a conversation, judging that the user accepted the useful information of that conversation, and marking that conversation as a chain node;

if the user enters no conversation text for longer than a set duration in a conversation, exits the conversation interface, or clicks the option to start a new conversation, judging that the user terminated the conversation chain, and marking that conversation as a chain node; and

arranging in sequence the intent recognition results corresponding to the conversations of the same user between two chain nodes, to form the intent analysis chain of that user.

5. The method according to claim 3, wherein adjusting the multiple intent types according to changes in the intent recognition results in the intent analysis chain comprises:

when the endpoint of the intent analysis chain is the user accepting the useful information of the last conversation:

if all intent recognition results in the intent analysis chain are the same, subdividing the shared intent type corresponding to those results according to the last two conversations in the intent analysis chain; and

otherwise, extracting the continuous repeated segment of the last intent recognition result at the end of the intent analysis chain, and subdividing the intent type of the last conversation before the continuous repeated segment according to the first conversation of that segment.

6. The method according to claim 3, wherein adjusting the multiple intent types according to changes in the intent recognition results in the intent analysis chain comprises:

when the endpoint of the intent analysis chain is the user terminating the conversation chain:

if the intent recognition results in the intent analysis chain cover the multiple intent types, adding new intent types according to the user's previous conversation content; and

otherwise, asking the user about the intent types not covered by the intent analysis chain, to determine whether the user belongs to an uncovered intent type.

7. The method according to claim 3, further comprising, after adjusting the multiple intent types according to changes in the intent recognition results in the intent analysis chain:

taking the multiple intent types after each adjustment as one intent classification scheme; and

constructing a reinforcement learning network that takes the user's intent analysis chains as the state variable and the various intent classification schemes as the action variable, and constructing a reward function from the final results and the average length of the user's intent analysis chains, to select the optimal intent classification scheme and apply it in subsequent large model prompts.

8. The method according to claim 3, wherein generating the intent analysis chain of the user according to the user's acceptance of the useful information in previous conversations comprises: generating an intent analysis chain for each user according to that user's acceptance of the useful information in previous conversations; and correspondingly, adjusting the multiple intent types according to changes in the intent recognition results in the intent analysis chain and applying the adjusted intent types in subsequent large model prompts comprises: adjusting the multiple intent types of each user according to changes in the intent recognition results in that user's intent analysis chains, and applying each user's adjusted intent types in that user's subsequent large model prompts.

9. An electronic device, comprising:

one or more processors; and

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the dynamic intent understanding method based on multimodal fusion according to any one of claims 1 to 8.

10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the dynamic intent understanding method based on multimodal fusion according to any one of claims 1 to 8.
CN202511088840.2A 2025-08-05 2025-08-05 Dynamic intention understanding method based on multi-mode fusion Active CN120579009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511088840.2A CN120579009B (en) 2025-08-05 2025-08-05 Dynamic intention understanding method based on multi-mode fusion


Publications (2)

Publication Number Publication Date
CN120579009A true CN120579009A (en) 2025-09-02
CN120579009B CN120579009B (en) 2025-12-16

Family

ID=96860376

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511088840.2A Active CN120579009B (en) 2025-08-05 2025-08-05 Dynamic intention understanding method based on multi-mode fusion

Country Status (1)

Country Link
CN (1) CN120579009B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140257793A1 (en) * 2013-03-11 2014-09-11 Nuance Communications, Inc. Communicating Context Across Different Components of Multi-Modal Dialog Applications
CN117370506A (en) * 2023-07-21 2024-01-09 中图科信数智技术(北京)有限公司 An agricultural intelligent question and answer method and system that supports multi-modal and multi-turn dialogue
CN117573834A (en) * 2023-11-30 2024-02-20 北京快牛智营科技有限公司 Multi-robot dialogue method and system for software-oriented instant service platform
CN118246537A (en) * 2024-05-24 2024-06-25 腾讯科技(深圳)有限公司 Question answering method, device, equipment and storage medium based on large model
CN118899006A (en) * 2024-07-09 2024-11-05 中国联合网络通信集团有限公司 A method, device and readable storage medium for identifying user's incoming call intention
CN119003872A (en) * 2024-08-09 2024-11-22 国网安徽省电力有限公司安庆供电公司 Multi-mode intelligent knowledge retrieval and dialogue system and method
CN119623643A (en) * 2024-12-06 2025-03-14 中电信数智科技有限公司 A multimodal intelligent question-answering reasoning method and device based on knowledge distillation
CN119808789A (en) * 2024-12-16 2025-04-11 广州九四智能科技有限公司 Customer intention recognition and response system, method, device and medium based on LLM
CN119961403A (en) * 2025-01-08 2025-05-09 中国联合网络通信集团有限公司 Intention recognition method, electronic device, storage medium and program product
CN119989268A (en) * 2025-01-16 2025-05-13 北京连屏科技股份有限公司 A multimodal large language model dialogue generation method based on natural language understanding
CN120354860A (en) * 2025-06-24 2025-07-22 联通沃悦读科技文化有限公司 Dialog intention recognition method, system, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN120579009B (en) 2025-12-16

Similar Documents

Publication Publication Date Title
CN117056479B (en) Intelligent question-answering interaction system based on semantic analysis engine
CN117332072B (en) Dialogue processing, voice abstract extraction and target dialogue model training method
CN116415650A (en) Method, device and storage medium for generating dialog language model and generating dialog
CN109871807B (en) Face image processing method and device
CN116882450B (en) Question-answering model editing method and device, electronic equipment and storage medium
WO2024066920A1 (en) Processing method and apparatus for dialogue in virtual scene, and electronic device, computer program product and computer storage medium
CN119336867A (en) User question answering method and its device, equipment, and medium
CN119202211A (en) A controllable supervised simulated conversation method and system based on large language model
CN116913278A (en) Speech processing method, device, equipment and storage medium
CN118312593B (en) Artificial intelligent interaction method and device based on multiple analysis models
CN116306524A (en) Image Chinese description method, device, equipment and storage medium
CN118897878A (en) A customer interactive service system based on artificial intelligence
CN116226411B (en) Interactive information processing method and device for interactive project based on animation
CN117271745A (en) An information processing method, device, computing equipment, and storage medium
CN120804271B (en) Multi-mode fusion and reinforcement learning collaborative retrieval enhancement generation method and system
CN115292620B (en) Region information identification method, device, electronic equipment and storage medium
CN120579009B (en) Dynamic intention understanding method based on multi-mode fusion
CN112925894B (en) Method, system and device for matching bid-asking questions in conversation
CN120067252A (en) Training method for dialogue model, artificial intelligence interview method, computing device, storage medium, and program product
CN119149699A (en) Communication method and device thereof
CN117744801A (en) Conversation method, device and equipment of psychology big model based on artificial intelligence
CN120687541A (en) Task processing methods, text processing methods, automatic question-answering methods, task processing model training methods, information processing methods based on task processing models, and cloud training platforms
CN118210881A (en) Multi-round dialogue prediction method and related products
CN116361423A (en) Statement generation method, device, and computer-readable storage medium
CN119357324B (en) A knowledge-guided interview interaction data processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant