CN111428019B

CN111428019B - Data processing method and device for knowledge base question answering

Info

Publication number: CN111428019B
Application number: CN202010255287.8A
Authority: CN
Inventors: 谷博; 雷欣; 李志飞
Original assignee: Mobvoi Information Technology Co Ltd
Current assignee: Shanghai Mobvoi Information Technology Co ltd
Priority date: 2020-04-02
Filing date: 2020-04-02
Publication date: 2023-07-28
Anticipated expiration: 2040-04-02
Also published as: CN111428019A

Abstract

The present disclosure provides a data processing method and device for knowledge base question answering. The data processing method includes: acquiring any knowledge item from a knowledge base; selecting, from dialogue records, user utterances that match the knowledge item to form a set of user utterances; associating the set of user utterances with the knowledge item; and training a knowledge base question answering model using the associated set of user utterances and the knowledge item as training samples, so that subsequently input user utterances are answered according to the training result. The data processing method improves the timeliness of model optimization based on real online data and helps ensure the best model performance; it makes the work of operators more convenient and more efficient; and it speeds up the discovery of defects in knowledge items, promoting continuous improvement of the knowledge base.

Description

Data processing method and device for knowledge base question answering

Technical field

The present disclosure relates to the technical field of data processing, and in particular to a data processing method and device for knowledge base question answering.

Background

Question answering systems have developed from template-based question answering expert systems, to question answering based on information retrieval, then to community-based question answering, and now to question answering based on a knowledge base. Question answering algorithms based on information retrieval combine information extraction and shallow semantic analysis on top of keyword matching. Community-based question answering relies on contributions from internet users, and the question answering process relies on keyword retrieval. Knowledge base question answering is based on semantic parsing and a knowledge base: a knowledge base question answering model semantically parses the question entered by the user and selects, from the knowledge base, the knowledge items that match that question. Existing optimization of knowledge base question answering models usually has to be performed offline and cannot support operators in tuning the model online in real time. In addition, online annotation for knowledge base question answering is not sufficiently automated: the large amount of real online data is not effectively screened, clustered and recommended, so the annotation work of operators is inefficient, heavy and highly repetitive. Furthermore, much of the user utterance data from online users is not effectively used by the model.

Summary of the invention

In order to solve or at least alleviate at least one of the above technical problems, the present disclosure provides a data processing method and device for knowledge base question answering.

According to one aspect of the present disclosure, a data processing method for knowledge base question answering includes:

acquiring any knowledge item from a knowledge base;

selecting, from dialogue records, user utterances that match the knowledge item to form a set of user utterances;

associating the set of user utterances with the knowledge item; and

training a knowledge base question answering model using the associated set of user utterances and the knowledge item as training samples, so that subsequently input user utterances are answered according to the training result.

According to at least one embodiment of the present disclosure, selecting, from dialogue records, user utterances that match the knowledge item to form a set of user utterances includes:

if the knowledge item was provided to the user as an approximate answer by the knowledge base question answering model and was selected by the user through a reply or a click, setting the corresponding user utterance in the dialogue record to grade A;

if the knowledge item was provided to the user as an approximate answer by the knowledge base question answering model and was not selected by the user through a reply or a click, setting the corresponding user utterance in the dialogue record to grade B;

if the knowledge item was provided to the user by the knowledge base question answering model neither as a best answer nor as an approximate answer, but its confidence is greater than or equal to a preset value, setting the corresponding user utterance in the dialogue record to grade C; and

sorting the user utterances in the priority order grade A > grade B > grade C and deduplicating them to form the set of user utterances.

According to another aspect of the present disclosure, a data processing method for knowledge base question answering includes:

clustering the user utterances in dialogue records to form at least one class of user utterances, each class being a set of user utterances;

for each such set of user utterances, selecting from a knowledge base a set of knowledge items that match the set of user utterances;

associating the set of user utterances with one knowledge item in the set of knowledge items; and

training a knowledge base question answering model using the associated set of user utterances and that one knowledge item as training samples, so that subsequently input user utterances are answered according to the training result.

According to at least one embodiment of the present disclosure, clustering the user utterances in dialogue records to form at least one class of user utterances includes:

in the dialogue records, grouping into one class the user utterances for which the feedback of the knowledge base question answering model was an approximate answer or no answer; or, in the dialogue records, grouping into one class the user utterances for which the confidence given by the knowledge base question answering model is less than a preset value.

The at least one set of user utterances obtained by clustering is then sorted.

According to at least one embodiment of the present disclosure, sorting the at least one set of user utterances obtained by clustering includes:

sorting the sets of user utterances obtained by clustering in descending order of question count, where the question count is the total number of user utterances in a set, counted without deduplication;

sorting the sets of user utterances that have the same question count in ascending order of clustered question count, where the clustered question count is the total number of user utterances in a set after deduplication; and

sorting the sets of user utterances that have the same clustered question count by time, from the most recent to the oldest.
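The three-level ordering above can be sketched as a single sort with a composite key. This is a minimal illustration, not the patented implementation: the `Cluster` class and its `latest` field are hypothetical, and the source does not say which timestamp of a set is used for the time ordering, so the time of the newest utterance is assumed here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Cluster:
    utterances: list   # every utterance in the set, duplicates kept
    latest: datetime   # hypothetical field: time of the newest utterance


def sort_clusters(clusters):
    """Order clusters by question count (descending), then by clustered
    question count (ascending), then by recency (newest first)."""
    def key(c):
        question_count = len(c.utterances)        # counted without deduplication
        distinct_count = len(set(c.utterances))   # counted after deduplication
        return (-question_count, distinct_count, -c.latest.timestamp())
    return sorted(clusters, key=key)
```

Negating the question count and the timestamp lets one ascending sort express two descending keys and one ascending key at once.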

According to at least one embodiment of the present disclosure, selecting from the knowledge base, for each set of user utterances, a set of knowledge items that match the set of user utterances includes:

matching the knowledge items in the knowledge base one by one against each user utterance in the set of user utterances;

selecting the knowledge items for which the confidence given by the knowledge base question answering model is greater than or equal to a preset value to form the set of knowledge items; and

within the set of knowledge items, sorting the knowledge items in descending order of their cumulative number of occurrences and deduplicating them.
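As a rough illustration of the selection steps above, the sketch below counts how often each knowledge item reaches the preset confidence across the utterances of one set and returns the items ranked by that count. The function name, the input layout (one dictionary of item-to-confidence scores per utterance) and the default threshold of 0.5 are assumptions made for the example; the source does not fix a value for this preset.

```python
from collections import Counter


def candidate_items(scores_per_utterance, preset=0.5):
    """scores_per_utterance: one dict per user utterance, mapping a
    knowledge item id to the confidence the model gave that pair.
    Returns item ids whose confidence reached `preset` at least once,
    deduplicated and ordered by cumulative number of occurrences."""
    counts = Counter()
    for scores in scores_per_utterance:
        for item_id, confidence in scores.items():
            if confidence >= preset:
                counts[item_id] += 1
    # most_common() sorts by count, highest first; each id appears once.
    return [item_id for item_id, _ in counts.most_common()]
```

Counting in a `Counter` performs the deduplication and the occurrence tally in one pass, so the ranked list can be shown to an operator with the most frequently matched item first.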

According to another aspect of the present disclosure, a processing device for knowledge base question answering includes:

a memory storing execution instructions; and

a processor that executes the execution instructions stored in the memory, causing the processor to perform the method according to any one of the foregoing.

Brief description of the drawings

The accompanying drawings illustrate exemplary embodiments of the present disclosure and, together with the description, serve to explain its principles. The drawings are included to provide a further understanding of the disclosure and are incorporated in and constitute a part of this specification.

Fig. 1 is a schematic flowchart of an exemplary embodiment of the data processing method for knowledge base question answering of the present disclosure.

Fig. 2 is a schematic flowchart of another exemplary embodiment of the data processing method for knowledge base question answering of the present disclosure.

Fig. 3 is a schematic flowchart of another exemplary embodiment of the data processing method for knowledge base question answering of the present disclosure.

Fig. 4 is a schematic structural diagram of an exemplary embodiment of the data processing device for knowledge base question answering of the present disclosure.

Detailed description

The present disclosure is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the relevant content and do not limit the present disclosure. It should also be noted that, for ease of description, only the parts related to the present disclosure are shown in the drawings.

It should be noted that, where no conflict arises, the embodiments of the present disclosure and the features of those embodiments may be combined with one another. The present disclosure is described in detail below with reference to the drawings and in conjunction with the embodiments.

A knowledge base question answering system includes a knowledge base question answering model and an established knowledge base. The knowledge base contains a number of knowledge items, and the knowledge item is the smallest unit of the knowledge base. When the knowledge base question answering model receives a user utterance (a question asked by a user), it computes similarity with a semantic model, obtains an answer from the knowledge base and returns it to the user; this is typical of FAQ-style question answering, in which each question has one answer. A knowledge base question answering system can be implemented in various forms, for example as an intelligent dialogue robot. During online use of the system, one user utterance together with the knowledge items returned in response to it forms one dialogue record.

In one application scenario, a user inputs a user utterance into the knowledge base question answering model, and the model finds matching knowledge items in the knowledge base and returns them as feedback. For the same user utterance, the model may return more than one knowledge item. For some user utterances, no knowledge item with a suitable degree of matching can be found in the knowledge base, so no answer is available; in that case the model returns a reply such as "unable to provide an answer". While matching the user utterance against each knowledge item in the knowledge base, the model gives a confidence for each knowledge item. The confidence is the degree to which the model, after evaluating the user utterance and a given knowledge item, believes that the two match. Based on the confidence of each knowledge item, the model returns a best answer, approximate answers, or no answer to the user.
A best answer means that, when the user initiates a dialogue, the model can obtain a knowledge item that has the highest confidence and whose confidence is above a specified value; that knowledge item is provided to the user as the best answer. An approximate answer means that, when the user initiates a dialogue, the model obtains several knowledge items whose confidence lies within a specified range; those knowledge items are provided to the user as approximate answers. No answer means that, when the user initiates a dialogue, the model cannot obtain any knowledge item whose confidence lies within the specified range, and the feedback is "no answer" or a similar reply.
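The three kinds of feedback defined above can be sketched as a threshold check over the per-item confidences. This is only an illustrative reading of the text: the source speaks of "a specified value" and "a specified range" without giving numbers, so `best_threshold`, `approx_low` and `max_candidates` below are hypothetical parameters with arbitrary defaults.

```python
from enum import Enum


class Feedback(Enum):
    BEST = "best"
    APPROXIMATE = "approximate"
    NONE = "none"


def classify(scores, best_threshold=0.9, approx_low=0.6, max_candidates=3):
    """scores: dict mapping a knowledge item id to the model's confidence
    for the current user utterance. All thresholds are assumed values."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Best answer: the top-ranked item, if it clears the specified value.
    if ranked and ranked[0][1] > best_threshold:
        return Feedback.BEST, [ranked[0][0]]
    # Approximate answers: items whose confidence falls in the specified range.
    approx = [k for k, s in ranked if approx_low <= s <= best_threshold]
    if approx:
        return Feedback.APPROXIMATE, approx[:max_candidates]
    # Otherwise nothing is confident enough to show.
    return Feedback.NONE, []
```

For example, `classify({"a": 0.8, "b": 0.7, "c": 0.1})` falls into the approximate branch under these assumed thresholds, because no item exceeds 0.9 but two lie between 0.6 and 0.9.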

While a knowledge base question answering system runs online, it generates a large number of dialogue records, which form a data set of dialogue records stored in the system. The prior art does not use these online dialogue records to optimize the model. In other words, existing knowledge base question answering models do not make effective use of much of the user utterance data from online users and cannot be optimized online. Optimization of existing knowledge base question answering models usually has to be performed offline and cannot support operators in tuning the model online in real time.

According to one aspect of the present disclosure, refer to Fig. 1, a schematic flowchart of an exemplary embodiment of the data processing method for knowledge base question answering of the present disclosure. The data processing method processes data generated during knowledge base question answering (such as dialogue records) so that the knowledge base question answering model can be optimized online. The data processing method includes:

S10. Acquire any knowledge item from the knowledge base. For example, the system automatically selects the knowledge items of a certain knowledge domain from the knowledge base and processes them one by one using the data processing method of the present disclosure. Alternatively, the system may use the confidences in the dialogue records to select the knowledge items that received lower confidence and process them one by one.

S20. Select, from the dialogue records, user utterances that match the knowledge item to form a set of user utterances. Each dialogue record includes one user utterance. If the knowledge base question answering model found knowledge items in the knowledge base whose confidence met the requirement, the dialogue record contains those knowledge items; if it found none, the dialogue record contains no knowledge item. From all the dialogue records, the records that contain the knowledge item selected in step S10 are chosen, and the user utterances in those records together form the set of user utterances.
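Step S20 amounts to a filter over the stored dialogue records. The sketch below assumes a hypothetical `DialogueRecord` structure holding one utterance and the ids of the knowledge items returned for it; the source does not specify how records are stored.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueRecord:
    utterance: str
    # Ids of the knowledge items recorded for this utterance;
    # empty when the model found no item with sufficient confidence.
    item_ids: list = field(default_factory=list)


def utterances_for_item(records, item_id):
    """Collect the user utterances of every record that contains item_id."""
    return [r.utterance for r in records if item_id in r.item_ids]
```

A record with an empty `item_ids` list corresponds to the "no knowledge item" case described in S20 and is never selected.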

S30. Associate the set of user utterances with the knowledge item. Those skilled in the art will understand that the user utterances found in the dialogue records generated during system operation that were matched with the knowledge item are all related to that knowledge item; otherwise the knowledge base question answering model would not have returned the knowledge item for those utterances. The set of user utterances matching the knowledge item is selected automatically by the preceding steps, and the knowledge base question answering model associates that set with the knowledge item. When associating, the model may associate all the user utterances in the set with the knowledge item, or only some of them. The selected subset may consist of user utterances whose intent is clearer and whose phrasing is more standard, or of user utterances that occur more frequently.

S40. Train the knowledge base question answering model using the associated set of user utterances and the knowledge item as training samples, so that subsequently input user utterances are answered according to the training result. After the set of user utterances related to the knowledge item has been selected from the large number of dialogue records and associated with it, the pair is input into the model as training samples. After training, the model has an association between the user utterances in the samples and the knowledge item in the samples. The next time a user inputs a user utterance that is the same as or similar to one in the samples, the model can use that association to return the associated knowledge item directly as the reply, which achieves online optimization of the knowledge base question answering model. Here, an input user utterance is considered similar to a user utterance in the samples when the two have at least one keyword in common.
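The keyword-overlap notion of similarity in S40 can be illustrated with a small lookup. This is a stand-in for the trained model, not a description of it: the source does not say how keywords are extracted, so the whitespace tokenizer and the `STOPWORDS` set below are assumptions, as are all function names.

```python
# Hypothetical stopword list; a real system would use proper keyword extraction.
STOPWORDS = {"a", "the", "my", "i", "to", "do", "how", "is", "what"}


def keywords(text):
    """Naive keyword extraction: lowercase tokens minus stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def is_similar(a, b):
    """Two utterances count as similar if they share at least one keyword."""
    return bool(keywords(a) & keywords(b))


def lookup(utterance, associations):
    """associations: dict mapping a sample utterance to its knowledge item id.
    Returns the associated item for an identical or similar utterance."""
    if utterance in associations:
        return associations[utterance]
    for sample, item_id in associations.items():
        if is_similar(utterance, sample):
            return item_id
    return None
```

With `{"reset my password": "kb_reset"}` as the association table, `lookup("password forgotten", ...)` resolves to `"kb_reset"` because both utterances contain the keyword "password".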

The data processing method of the present disclosure uses the user dialogue records generated in the production environment (that is, while the knowledge base question answering system runs online). The system automatically screens out the user utterances that match the knowledge item to be optimized and forms them into a set of user utterances, associates the set with the knowledge item, and automatically imports the associated set and knowledge item into the knowledge base question answering model to train the model online, so that the model is optimized through online learning. This solves the problem in the prior art that much of the user utterance data from online users is not effectively used by the model. At the same time, because the system automatically screens the large number of online dialogue records to form the set of user utterances, the low efficiency, heavy workload and high repetitiveness of manual selection are avoided.

In one embodiment of the present disclosure, step S20, selecting from the dialogue records user utterances that match the knowledge item to form a set of user utterances, may include:

If the knowledge item was provided to the user as an approximate answer by the knowledge base question answering model and was selected by the user through a reply or a click, the corresponding user utterance in the dialogue record is set to grade A. A knowledge item that was provided as an approximate answer and was selected by the user matches the corresponding user utterance well and has a high confidence, so its priority is set to grade A.

If the knowledge item was provided to the user as an approximate answer by the knowledge base question answering model and was not selected by the user through a reply or a click, the corresponding user utterance in the dialogue record is set to grade B. A knowledge item that was provided as an approximate answer but was not selected by the user matches the corresponding user utterance to some degree, with a confidence somewhat lower than that of grade A, so its priority is set to grade B.

If the knowledge item was provided to the user by the knowledge base question answering model neither as a best answer nor as an approximate answer, but its confidence when matched against the user utterance is greater than or equal to a preset value (for example, the preset value may be set to 0.5), the corresponding user utterance in the dialogue record is set to grade C. In other words, only user utterances whose recorded confidence against the knowledge item reaches the preset value are selected; user utterances whose confidence is below the preset value are not added to the set, which avoids introducing noise that does not match the knowledge item.

The user utterances screened out above are sorted in the priority order grade A > grade B > grade C and deduplicated to form the set of user utterances. That is, in the resulting set, user utterances of grade A come first, followed by grade B, and user utterances of grade C come last.
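The grading, sorting and deduplication just described can be sketched as follows. The `Match` record and its field names are hypothetical; the default preset of 0.5 is the example value mentioned in the text. The source assigns no grade to utterances for which the item was shown as the best answer, so this sketch leaves them out, and on duplicates it keeps the highest grade, which is one reasonable reading of "sort, then deduplicate".

```python
from dataclasses import dataclass


@dataclass
class Match:
    utterance: str
    shown_as: str       # how the item was shown: "best", "approximate" or "none"
    selected: bool      # whether the user replied to or clicked the item
    confidence: float   # model confidence for this utterance/item pair


def grade(m, preset=0.5):
    """Return "A", "B", "C", or None when the match is not graded."""
    if m.shown_as == "approximate":
        return "A" if m.selected else "B"
    if m.shown_as == "none" and m.confidence >= preset:
        return "C"
    return None


def build_utterance_set(matches, preset=0.5):
    """Grade, sort by A > B > C, then drop repeated utterances."""
    graded = [(grade(m, preset), m.utterance) for m in matches]
    graded = [(g, u) for g, u in graded if g is not None]
    graded.sort(key=lambda gu: gu[0])   # "A" < "B" < "C"; the sort is stable
    seen, result = set(), []
    for g, u in graded:
        if u not in seen:               # first occurrence carries the best grade
            seen.add(u)
            result.append((u, g))
    return result
```

Because deduplication runs after the sort, an utterance that appears once with grade A and once with grade B is kept only with grade A.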

The above embodiments are suited to the early stage after a knowledge base question answering system goes online, that is, when the system has been running for a short time, a certain number of dialogue records have accumulated, and the knowledge base question answering model needs to be optimized to improve system performance.

According to another aspect of the present disclosure, refer to Fig. 2, a schematic flowchart of another exemplary embodiment of the data processing method for knowledge base question answering of the present disclosure. The data processing method processes data generated during knowledge base question answering (such as dialogue records) so that the knowledge base question answering model can be optimized online. The data processing method includes:

S110. Cluster the user utterances in the dialogue records to form at least one class of user utterances, each class being a set of user utterances. As the knowledge base question answering system runs online for longer, a large number of similar user utterances about the same question accumulate in the system. The knowledge base question answering model automatically clusters the user utterances in the dialogue records by degree of similarity: similar user utterances related to one question are grouped into one class, similar user utterances related to another question are grouped into another class, and so on, forming sets of user utterances of different classes. There may be only one class, or there may be two, three or more different classes.

S220. For each set of user utterances, the knowledge base question answering model selects from the knowledge base a set of knowledge items that match the set of user utterances. This yields a set of user utterances and a set of knowledge items, in which every knowledge item has some degree of matching with every user utterance in the set of user utterances.

S330. Associate the set of user utterances with one knowledge item in the set of knowledge items. Although every knowledge item in the set has some degree of matching with the user utterances in the set of user utterances, the knowledge base question answering model selects only one knowledge item from the set for association, in order to optimize the model without introducing unnecessary noise.

S440. Train the knowledge base question answering model using the associated set of user utterances and that one knowledge item as training samples, so that subsequently input user utterances are answered according to the training result. User utterances are selected from the large number of dialogue records and clustered; for each resulting set of user utterances, the matching knowledge item is screened out and associated with the set, and the pair is input into the model as training samples. After training, the model has an association between the set of user utterances in the samples and the knowledge item in the samples. The next time a user inputs a user utterance that is the same as or similar to one in some set of user utterances in the samples, the model can use that association to return the associated knowledge item directly as the reply, which achieves online optimization of the knowledge base question answering model. Here, an input user utterance is considered similar to a user utterance in the samples when the two have at least one keyword in common.

在本公开的一个实施方式中，步骤S110，将对话记录中的用户说法进行聚类，形成至少一类用户说法的集合，可以包括：In one embodiment of the present disclosure, step S110, clustering the user sayings in the dialogue records to form a set of at least one type of user sayings may include:

在对话记录中，将知识库问答模型的反馈内容包括近似答案或无答案的用户说法聚为一类。或者在对话记录中，将知识库问答模型给出的置信度小于预设值的用户说法聚为一类。这两种聚类方式，均可以将对话记录中初始匹配效果不好的用户说法筛选出来并进行聚类，使这些用户说法在后续步骤中匹配到更好的知识条目，以对模型进行优化。In the dialogue records, the feedback content of the knowledge base question answering model, including user statements with approximate answers or no answers, is clustered into one category. Or in the dialogue records, the user statements whose confidence level given by the knowledge base question answering model is less than the preset value are clustered into one category. These two clustering methods can filter out and cluster user statements with poor initial matching effects in the dialogue records, so that these user statements can be matched to better knowledge items in subsequent steps to optimize the model.

That is, either of the two clustering approaches above may be used, and one of them can be chosen according to the production environment.
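The two selection criteria can be sketched as simple filters over dialogue records. The `DialogueRecord` fields, the feedback labels `"best"`, `"approximate"` and `"none"`, and the default preset value of 0.5 are hypothetical names and values chosen for this sketch; the disclosure does not prescribe a record layout.

```python
from dataclasses import dataclass


@dataclass
class DialogueRecord:
    # Hypothetical record layout, assumed for illustration.
    utterance: str
    feedback_type: str  # "best", "approximate", or "none"
    confidence: float   # confidence reported by the question answering model


def select_by_feedback(records):
    """First approach: keep utterances answered with an approximate answer or no answer."""
    return [r for r in records if r.feedback_type in ("approximate", "none")]


def select_by_confidence(records, preset_value=0.5):
    """Second approach: keep utterances whose confidence is below the preset value."""
    return [r for r in records if r.confidence < preset_value]
```

Both filters return the poorly matched utterances that later steps try to re-match; which one suits a deployment depends on whether the feedback type or the confidence score is the more reliable signal there.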

S111. Sort the sets of at least one class of user utterances obtained by clustering. See FIG. 3, a schematic flowchart of another exemplary embodiment of the data processing method for knowledge base question answering of the present disclosure. If clustering produces more than one class of set of user utterances, the sets are sorted according to certain rules to facilitate further processing of the data.

In one embodiment of the present disclosure, step S111, sorting the sets of at least one class of user utterances obtained by clustering, may include:

Sorting the sets of at least one class of user utterances obtained by clustering in descending order of the number of questions asked, where the number of questions asked is the total number of user utterances in each set before deduplication. If clustering yields two or more sets of user utterances, a set containing more user utterances is placed ahead of a set containing fewer.

Sorting sets of user utterances that have the same number of questions asked in ascending order of cluster question count, where the cluster question count is the total number of user utterances in each set after deduplication. If two sets contain the same number of user utterances, their deduplicated cluster question counts are compared: the set with the smaller cluster question count (that is, the one containing more repeated user utterances) is placed first, and the set with the larger cluster question count (that is, the one containing fewer repeated user utterances) is placed after it.

Sorting sets of user utterances that have the same cluster question count by time, from most recent to least recent. If two sets contain the same number of user utterances and the same number of repeated user utterances, the set closer to the current time is placed first and the set further from the current time is placed after it. The time referred to here is the time at which the user utterance was input to the knowledge base question answering model, that is, the time the user asked the question. In other words, the most recent user utterance of each set is compared, and the set whose most recent utterance is closest to the current time is placed first.
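The three sorting rules above can be combined into a single composite sort key. The sketch below assumes each cluster is a list of `(utterance, timestamp)` pairs with numeric timestamps where a larger value means more recent; that representation and the function name are assumptions for illustration.

```python
def sort_clusters(clusters):
    """Sort clusters of (utterance, timestamp) pairs by the three rules of step S111.

    1. Number of questions asked (utterances before deduplication), descending.
    2. Cluster question count (utterances after deduplication), ascending.
    3. Timestamp of the most recent utterance, most recent first.
    """
    def sort_key(cluster):
        question_count = len(cluster)
        cluster_question_count = len({utterance for utterance, _ in cluster})
        latest = max(timestamp for _, timestamp in cluster)
        # Negate the values that should sort in descending order.
        return (-question_count, cluster_question_count, -latest)

    return sorted(clusters, key=sort_key)
```

Because Python compares tuples element by element, the second and third rules only take effect when the earlier ones tie, which matches the order in which the rules are stated.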

In one embodiment of the present disclosure, step S220, selecting from the knowledge base, for each class of set of user utterances, a set of knowledge items matching that set of user utterances, may include:

Matching the knowledge items in the knowledge base one by one against each user utterance in each class of set of user utterances. Each set contains multiple user utterances, and for each of them the matching knowledge items are looked up in the knowledge base (an utterance may match more than one knowledge item, or none at all). That is, every user utterance in each set is matched exhaustively against the knowledge items in the knowledge base. In other words, this step re-matches each user utterance in each set against the knowledge base in order to screen out better-matching knowledge items for association. A better match means that the confidence between the user utterance and the knowledge item after re-matching in this step is higher than the confidence between the user utterance and the knowledge item initially matched in the dialogue record.

Selecting the knowledge items for which the confidence given by the knowledge base question answering model is greater than or equal to a preset value to form the set of knowledge items. During the one-by-one matching in the previous step, the knowledge base question answering model gives a confidence for each match it finds. A confidence threshold can be set in advance, for example 0.5, so that only knowledge items whose match confidence is greater than or equal to 0.5 are admitted to the set of knowledge items, which avoids introducing noise.

Within the set of knowledge items, sorting the knowledge items in descending order of their cumulative number of occurrences and deduplicating them. The user utterances in each set are similar to one another. Some have a clear topic and purpose and can be matched to a suitable knowledge item in the knowledge base, while others are vague and cannot. If a knowledge item occurs many times in the set of knowledge items, it is a better match and is therefore placed toward the front; otherwise it is placed toward the back, which facilitates further processing of the data.
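The three sub-steps of S220 (one-by-one matching, confidence filtering, and occurrence-count ordering with deduplication) can be sketched as follows. The `match` callable stands in for the knowledge base question answering model and is an assumption of this sketch: it is taken to return `(knowledge_item, confidence)` pairs for an utterance. The 0.5 default follows the example threshold given in the text.

```python
from collections import Counter


def candidate_items(utterances, match, threshold=0.5):
    """Build the ordered, deduplicated set of knowledge items for one cluster.

    match(utterance) is a hypothetical stand-in for the question answering model;
    it is assumed to return an iterable of (knowledge_item, confidence) pairs.
    """
    counts = Counter()
    for utterance in utterances:
        for item, confidence in match(utterance):
            if confidence >= threshold:  # keep only matches at or above the preset value
                counts[item] += 1
    # most_common() orders items by cumulative occurrence count, descending;
    # using Counter keys also deduplicates the items.
    return [item for item, _ in counts.most_common()]
```

An operator could then associate the cluster with the first item in the returned list, or review the next few candidates when the top item does not fit.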

If a class of set of user utterances matches no relevant knowledge item in the current knowledge base, that is, there is no answer, relevant knowledge items need to be added to the knowledge base based on the content of the user utterances, for example by web search or manual entry, to supplement and improve the current knowledge base. The added knowledge items are then admitted to the set of knowledge items. In subsequent steps, each added knowledge item is associated with the matching set of user utterances and input to the knowledge base question answering model as a training sample, so that the model is trained and optimized online.

The above embodiments are suited to the middle and later stages of a knowledge base question answering system's deployment, that is, when the system has been running online for some time, has accumulated a large number of dialogue records, and the knowledge base question answering model needs to be optimized to improve system performance.

Taking the two different embodiments above together, it can be seen that the data processing method for knowledge base question answering of the present disclosure makes optimization of the model on real online data more timely, which helps keep the model performing at its best; makes operation more convenient for operations staff and improves their efficiency; and speeds up the discovery of deficiencies in knowledge items, which promotes continuous improvement of the knowledge base.

The present disclosure also provides a data processing device for knowledge base question answering; see FIG. 4, a schematic structural diagram of an exemplary embodiment of the data processing device for knowledge base question answering of the present disclosure. The device includes a communication interface 1000, a memory 2000, and a processor 3000. The communication interface 1000 is used to communicate and exchange data with external devices. The memory 2000 stores a computer program that can run on the processor 3000. When the processor 3000 executes the computer program, the method of the above embodiments is implemented. There may be one or more memories 2000 and one or more processors 3000.

The memory 2000 may include high-speed RAM and may also include non-volatile memory, for example at least one magnetic disk memory.

If the communication interface 1000, the memory 2000, and the processor 3000 are implemented independently, they may be connected to one another and communicate through a bus. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, the figure shows only one thick line, but this does not mean that there is only one bus or one type of bus.

Optionally, in a specific implementation, if the communication interface 1000, the memory 2000, and the processor 3000 are integrated on a single chip, they may communicate with one another through internal interfaces.

Any process or method description in a flowchart or otherwise described herein may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present disclosure includes additional implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present disclosure belong. The processor performs the methods and processing described above. For example, the method embodiments of the present disclosure may be implemented as a software program tangibly embodied in a machine-readable medium such as a memory. In some embodiments, part or all of the software program may be loaded and/or installed via the memory and/or the communication interface. When the software program is loaded into the memory and executed by the processor, one or more steps of the methods described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform one of the above methods in any other suitable manner (for example, by means of firmware).

The logic and/or steps represented in a flowchart or otherwise described herein may be embodied in any readable storage medium for use by, or in connection with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from an instruction execution system, apparatus, or device and execute them).

For the purposes of this specification, a "readable storage medium" may be any means that can contain, store, communicate, propagate, or transport a program for use by, or in connection with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). The readable storage medium may even be paper or another suitable medium on which the program is printed, since the program can be obtained electronically, for example by optically scanning the paper or other medium and then editing, interpreting, or otherwise processing it in a suitable manner if necessary, and then stored in a memory.

It should be understood that the parts of the present disclosure may be implemented in hardware, software, or a combination of the two. In the above embodiments, multiple steps or methods may be implemented in software that is stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field-programmable gate array (FPGA), and so on.

Those of ordinary skill in the art will understand that all or some of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.

In addition, the functional units in the embodiments of the present disclosure may be integrated in one processing module, each unit may exist physically on its own, or two or more units may be integrated in one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a readable storage medium. The storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

In this specification, a description that refers to the terms "one embodiment/mode", "some embodiments/modes", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment/mode or example is included in at least one embodiment/mode or example of the present application. Such terms do not necessarily refer to the same embodiment/mode or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments/modes or examples. In addition, provided they do not contradict one another, those skilled in the art may combine the different embodiments/modes or examples described in this specification and the features of those embodiments/modes or examples.

In addition, the terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality of" means at least two, for example two or three, unless expressly and specifically defined otherwise.

Those skilled in the art should understand that the above embodiments are intended only to illustrate the present disclosure clearly and not to limit its scope. Those skilled in the art may make other changes or modifications on the basis of the above disclosure, and such changes or modifications remain within the scope of the present disclosure.

Claims

1. A data processing method for knowledge base questions and answers, the method comprising:

acquiring any knowledge item from a knowledge base;

selecting user utterances matching the knowledge item from the dialogue record to form a set of user utterances, wherein the dialogue record comprises records formed by user utterances and the corresponding knowledge items fed back in response to those user utterances;

associating the set of user utterances with the knowledge item; and

training the knowledge base question-answering model by taking the associated set of user utterances and the knowledge item as a training sample, so as to give feedback on subsequently input user utterances according to the training result.

2. The data processing method of claim 1, wherein selecting user utterances matching the knowledge item from the dialogue record to form a set of user utterances comprises:

if the knowledge item was provided to the user as an approximate answer by the knowledge base question-answering model and was replied to or clicked by the user, setting the corresponding user utterance in the dialogue record to level A; the approximate answer includes: when a user initiates a dialogue, the knowledge base question-answering model obtains a plurality of knowledge items whose confidence is within a specified range to answer the question, and the knowledge items returned in this case are determined to be approximate answers and are provided to the user;

if the knowledge item was provided to the user as an approximate answer by the knowledge base question-answering model and was not replied to or clicked by the user, setting the corresponding user utterance in the dialogue record to level B;

if the knowledge item was not provided to the user as the best answer or an approximate answer by the knowledge base question-answering model, but its confidence is greater than or equal to a preset value, setting the corresponding user utterance in the dialogue record to level C; and

sorting and de-duplicating the user utterances in the priority order level A > level B > level C to form the set of user utterances;

the best answer includes: when a user initiates a dialogue, the knowledge base question-answering model obtains the knowledge item whose confidence is the highest and above a specified value to answer the question, and the knowledge item returned in this case is determined to be the best answer and is provided to the user.

3. A data processing method for knowledge base questions and answers, the method comprising:

clustering user utterances in the dialogue record to form a set of at least one class of user utterances, including: in the dialogue record, grouping into one class the user utterances for which the feedback content of the knowledge base question-answering model comprises an approximate answer or no answer; or, in the dialogue record, grouping into one class the user utterances for which the confidence given by the knowledge base question-answering model is less than a preset value; wherein the dialogue record comprises records formed by user utterances and the corresponding knowledge items fed back in response to those user utterances; the approximate answer includes: when a user initiates a dialogue, the knowledge base question-answering model obtains a plurality of knowledge items whose confidence is within a specified range to answer the question, and the knowledge items returned in this case are determined to be approximate answers and are provided to the user;

selecting, for each class of set of user utterances, a set of knowledge items from the knowledge base that match the class of set of user utterances;

associating the set of user utterances with one of the set of knowledge items; and

training the knowledge base question-answering model by taking the associated set of user utterances and the one of the knowledge items as a training sample, so as to give feedback on subsequently input user utterances according to the training result.

4. The data processing method of claim 3, wherein clustering user utterances in the dialogue record to form a set of at least one class of user utterances comprises:

sorting the sets of at least one class of user utterances obtained by clustering.

5. The data processing method of claim 4, wherein sorting the sets of at least one class of user utterances obtained by clustering comprises:

arranging the sets of at least one class of user utterances obtained by clustering in descending order of the number of questions asked; the number of questions asked refers to the total number of user utterances in each class of set of user utterances before de-duplication.

6. The data processing method of claim 5, wherein sorting the sets of at least one class of user utterances obtained by clustering comprises:

arranging the sets of at least one class of user utterances having the same number of questions asked in ascending order of cluster question count; the cluster question count refers to the total number of user utterances in each class of set of user utterances after de-duplication.

7. The data processing method of claim 6, wherein sorting the sets of at least one class of user utterances obtained by clustering comprises:

sorting the sets of at least one class of user utterances having the same cluster question count in order of time, from most recent to least recent.

8. The data processing method of claim 3, wherein the selecting, for each class of user utterance, a set of knowledge items from a knowledge base that match the class of user utterance sets comprises:

matching knowledge items in the knowledge base with each user utterance in the set of each type of user utterance one by one;

selecting knowledge items with confidence degrees larger than or equal to a preset value given by a knowledge base question-answer model to form a set of the knowledge items; and

in the set of knowledge items, arranging the knowledge items in descending order of the cumulative number of occurrences of each knowledge item and de-duplicating them.

9. A data processing apparatus for knowledge base questions and answers, the data processing apparatus comprising:

a memory storing execution instructions; and

a processor executing the memory-stored execution instructions, causing the processor to perform the method of any one of claims 1 to 8.