CN109271495B - Question-answer recognition effect detection method, device, equipment and readable storage medium - Google Patents
- Publication number
- CN109271495B (application number CN201810923157.XA)
- Authority
- CN
- China
- Prior art keywords
- answer
- question
- hot
- word
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Embodiments of the present disclosure provide a question-answer recognition effect detection method, device, equipment and readable storage medium. The method includes: obtaining a hot word set including hot words from the raw data of user questions; sorting and screening the hot words in the hot word set to determine hot questions, and associating the hot words with user feedback and answer information so as to obtain, for each hot word, the distribution of answers associated with that hot word and, for each answer, the hot words associated with that answer; detecting whether the number of deduplicated answers associated with a hot word exceeds a first threshold and whether the number of hot words associated with an answer exceeds a second threshold; and, when the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold, determining that an abnormal hot word-answer matching case has occurred. In this way, the question-answer recognition effect can be monitored centrally.
Description
Technical Field
The embodiments of the present disclosure relate to the field of computer technology, and in particular to a question-answer recognition effect detection method, device, equipment and readable storage medium.
Background
When a question-answer recognition system is built in the related art, checking the recognition effect requires a large amount of offline labeled data. Outsourced and crowdsourced labeling of question-answer recognition results is slow to come back, and quality control incurs high communication costs. Regression testing against a validation set suffers from the arbitrariness and variability of user descriptions, so it cannot effectively evaluate how well real user questions are answered. Online bad case (Badcase) analysis, done case by case, requires substantial operational manpower and takes a long time to reach conclusions. As a result, many online question-answer recognition problems cannot be found and resolved in time, which degrades the user experience.
Hot word discovery methods for question answering systems in the related art monitor hot topics based on the fluctuation trends of hot words and on user feedback, in order to uncover potential business problems or question-answer matching problems. After discovering a hot topic, operators browse, summarize and analyze it, and also analyze, entry by entry, the original user question-answer logs under the hot word to determine the recognition effect. These solutions mainly aim to find problems on which user descriptions are concentrated. They do not centrally monitor the question-answer recognition effect, so the recognition effect for long-tail questions does not receive effective attention.
Therefore, a method that can detect the question-answer recognition effect quickly and effectively is urgently needed.
Summary of the Invention
In view of this, a first aspect of the present disclosure provides a question-answer recognition effect detection method, including:
obtaining a hot word set including hot words from the raw data of user questions;
sorting and screening the hot words in the hot word set to determine hot questions, and associating the hot words with user feedback and answer information so as to obtain, for each hot word, the distribution of answers associated with that hot word and, for each answer, the hot words associated with that answer;
detecting whether the number of deduplicated answers associated with a hot word exceeds a first threshold and whether the number of hot words associated with an answer exceeds a second threshold;
when the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold, determining that an abnormal hot word-answer matching case has occurred.
A second aspect of the present disclosure provides a question-answer recognition effect detection system, including:
a hot word acquisition module configured to obtain a hot word set including hot words from the raw data of user questions;
a question-answer association module configured to sort and screen the hot words in the hot word set to determine hot questions, and to associate the hot words with user feedback and answer information so as to obtain, for each hot word, the distribution of answers associated with that hot word and, for each answer, the hot words associated with that answer;
an anomaly detection module configured to detect whether the number of deduplicated answers associated with a hot word exceeds a first threshold and whether the number of hot words associated with an answer exceeds a second threshold;
an anomaly determination module configured to determine that an abnormal hot word-answer matching case has occurred when the anomaly detection module detects that the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold.
A third aspect of the present disclosure provides an electronic device including a memory and a processor, wherein the memory is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method described in the first aspect.
A fourth aspect of the present disclosure provides a readable storage medium on which computer instructions are stored, wherein the computer instructions, when executed by a processor, implement the method described in the first aspect.
In the embodiments of the present disclosure, a hot word set including hot words is obtained from the raw data of user questions; the hot words in the hot word set are sorted and screened to determine hot questions, and the hot words are associated with user feedback and answer information so as to obtain, for each hot word, the distribution of answers associated with that hot word and, for each answer, the hot words associated with that answer; it is detected whether the number of deduplicated answers associated with a hot word exceeds a first threshold and whether the number of hot words associated with an answer exceeds a second threshold; and when either or both thresholds are exceeded, it is determined that an abnormal hot word-answer matching case has occurred. By obtaining the hot word set and analyzing the two-way association between hot words and answers in combination with user feedback, the question-answer recognition effect can be monitored centrally, and the recognition effect for long-tail questions, such as abnormal matching cases, can receive effective attention.
These and other aspects of the present disclosure will be more readily understood from the description of the following embodiments.
Description of the Drawings
In order to explain the technical solutions in the embodiments of the present disclosure or in the related art more clearly, the drawings needed for describing the exemplary embodiments or the related art are briefly introduced below. Obviously, the drawings described below show some exemplary embodiments of the present disclosure, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 shows a flow chart of a question-answer recognition effect detection method according to an embodiment of the present disclosure;
FIG. 2 shows a flow chart of a question-answer recognition effect detection method according to another embodiment of the present disclosure;
FIG. 3 shows an example of a flow chart of step S101 of the question-answer recognition effect detection method according to another embodiment of the present disclosure;
FIG. 4 shows a structural block diagram of a question-answer recognition effect detection device according to an embodiment of the present disclosure;
FIG. 5 shows a structural block diagram of a question-answer recognition effect detection device according to another embodiment of the present disclosure;
FIG. 6 shows a structural block diagram of the hot word acquisition module 401 in the question-answer recognition effect detection device according to an embodiment of the present disclosure;
FIG. 7 shows a structural block diagram of equipment according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a computer system suitable for implementing the question-answer recognition effect detection method according to an embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the solutions of the present disclosure, the technical solutions in the exemplary embodiments of the present disclosure are described clearly and completely below with reference to the drawings of those embodiments.
Some of the flows described in the specification, the claims and the above drawings of the present disclosure contain multiple operations that appear in a particular order. It should be clearly understood, however, that these operations may be executed out of the order in which they appear herein, or in parallel. Operation numbers such as 101 and 102 serve only to distinguish the operations from one another and do not themselves imply any execution order. In addition, these flows may include more or fewer operations, which may be executed sequentially or in parallel. Note that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules and the like; they do not indicate an order, nor do they require that the "first" and the "second" be of different types.
The technical solutions in the exemplary embodiments of the present disclosure are described clearly and completely below with reference to the drawings of those embodiments. Obviously, the described exemplary embodiments are only some, rather than all, of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments in the present disclosure without creative effort fall within the scope of protection of the present disclosure.
FIG. 1 shows a flow chart of a question-answer recognition effect detection method according to an embodiment of the present disclosure. The method may include steps S101, S102, S103 and S104.
In step S101, a hot word set including hot words is obtained from the raw data of user questions.
In step S102, the hot words in the hot word set are sorted and screened to determine hot questions, and the hot words are associated with user feedback and answer information so as to obtain, for each hot word, the distribution of answers associated with that hot word and, for each answer, the hot words associated with that answer.
In step S103, it is detected whether the number of deduplicated answers associated with a hot word exceeds a first threshold and whether the number of hot words associated with an answer exceeds a second threshold.
In step S104, when the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold, it is determined that an abnormal hot word-answer matching case has occurred.
In the embodiments of the present disclosure, a hot word is a lexical phenomenon that reflects the issues and things people generally pay attention to within a period of time. Hot words have temporal and spatial characteristics: they reflect the hot topics of a certain group during a certain period, and they are mainly expressed as words and phrases.
In an embodiment of the present disclosure, the hot word set including hot words may be obtained from the raw data of user questions by hot word mining. Specifically, a hot word set or a candidate hot word set may be obtained from the raw data through operations such as new word discovery, phrase mining and popularity calculation.
In an embodiment of the present disclosure, sorting and screening the hot words in the hot word set to determine hot questions includes: sorting and screening the hot word results according to the time-period distribution of the hot words, the entry diversity distribution of the hot words, and a periodicity analysis of the hot words, so as to determine the hot questions. The time-period distribution of a hot word may refer to the trend in the number of occurrences of the hot word per time period, for example per 10 minutes.
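The time-period distribution described above can be sketched as a simple bucketed count. This is only an illustration: the function name `bucket_counts`, the sample timestamps and the assumption that the bucket width divides 60 evenly are not taken from the patent.

```python
from collections import Counter
from datetime import datetime


def bucket_counts(timestamps, bucket_minutes=10):
    """Count hot word occurrences per fixed-width time bucket.

    Assumes bucket_minutes divides 60 evenly (hypothetical simplification).
    Returns a dict mapping each bucket start time to its occurrence count.
    """
    counts = Counter()
    for ts in timestamps:
        # Floor the timestamp to the start of its bucket.
        minute = (ts.minute // bucket_minutes) * bucket_minutes
        counts[ts.replace(minute=minute, second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))


# Made-up occurrence times of one hot word.
hits = [
    datetime(2018, 8, 14, 9, 1),
    datetime(2018, 8, 14, 9, 7),
    datetime(2018, 8, 14, 9, 12),
    datetime(2018, 8, 14, 9, 38),
]
trend = bucket_counts(hits)
```

The resulting per-bucket series could then feed the sorting, screening and periodicity analysis, none of which the patent specifies in detail.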
In an embodiment of the present disclosure, associating the hot words with user feedback and answer information includes: building an inverted index keyed by hot word, and associating the hot words with the user feedback and answer information through it. Thus, once hot word acquisition (hot word mining) is complete, an inverted index keyed by hot word can be built to link each hot word to user feedback and answer information, which speeds up the subsequent analysis. User feedback may include, for example, the user's evaluation of the question answering system.
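A minimal sketch of an inverted index keyed by hot word follows. The record fields (`question`, `answer_id`, `feedback`) and the plain substring match are assumptions made for illustration; the patent does not describe the log schema or the matching rule.

```python
from collections import defaultdict


def build_inverted_index(records, hot_words):
    """Map each hot word to the Q&A log records whose question contains it."""
    index = defaultdict(list)
    for record in records:
        for word in hot_words:
            # Hypothetical matching rule: plain substring containment.
            if word in record["question"]:
                index[word].append(record)
    return dict(index)


# Made-up log records with hypothetical field names.
records = [
    {"question": "credit card repayment failed", "answer_id": "A1", "feedback": "helpful"},
    {"question": "how to set up credit card repayment", "answer_id": "A2", "feedback": "unhelpful"},
    {"question": "change login password", "answer_id": "A3", "feedback": "helpful"},
]
index = build_inverted_index(records, ["credit card repayment", "password"])
```

With such an index, the answers and feedback for a hot word are available in one lookup instead of a scan over the whole log.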
In an embodiment of the present disclosure, obtaining the distribution of answers associated with a hot word may include: analyzing, hot word by hot word, the question-answer pairs associated with the hot word to obtain the distribution of answers associated with that hot word. Several question-answer pairs may be associated with one hot word, so these pairs can be analyzed together to obtain the answer distribution. For example, for a semantically clear hot word such as "credit card repayment", the corresponding answers should all revolve around the topic of that hot word.
In an embodiment of the present disclosure, obtaining the hot words associated with an answer includes: clustering, answer by answer, the hot words of the user questions that correspond to the answer to obtain the hot words associated with that answer. For example, the original questions that correspond to a given answer should revolve around one or a few hot topics, and should not relate to more hot words than a certain threshold.
In an embodiment of the present disclosure, determining that an abnormal hot word-answer matching case has occurred when the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold means the following: an abnormal matching case is determined in any of three situations, namely when the number of deduplicated answers associated with a hot word exceeds the first threshold, when the number of hot words associated with an answer exceeds the second threshold, or when both conditions hold at once. In other words, if a hot word is associated with too many deduplicated answers, the questions associated with that hot word are likely to have received answers unrelated to them; if an answer is associated with too many hot words, the original questions that correspond to that answer cover too many topics (hot words). In either case an abnormal hot word-answer matching case can be considered to have occurred, or the case can be treated as a candidate abnormal matching case for later investigation. In the embodiments of the present disclosure, an abnormal hot word-answer matching case can be regarded as one kind of bad case. Those skilled in the art will understand that the first threshold and the second threshold can be set to any value according to actual needs. Deduplication refers to removing duplicate answers, which avoids distorting the statistics and calculations.
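The two-way threshold check can be sketched as follows. The input format (a list of hot word and answer id pairs), the function name and the threshold values are assumptions for illustration; the patent leaves the thresholds to be chosen as needed.

```python
from collections import defaultdict


def find_matching_anomalies(pairs, first_threshold, second_threshold):
    """Flag hot words with too many distinct answers and answers with too many hot words.

    pairs: iterable of (hot_word, answer_id) tuples, possibly with repeats.
    Returns (flagged_hot_words, flagged_answers), both sorted.
    """
    answers_by_word = defaultdict(set)
    words_by_answer = defaultdict(set)
    for word, answer_id in pairs:
        answers_by_word[word].add(answer_id)   # the set deduplicates answers
        words_by_answer[answer_id].add(word)
    flagged_words = sorted(
        w for w, answers in answers_by_word.items() if len(answers) > first_threshold
    )
    flagged_answers = sorted(
        a for a, words in words_by_answer.items() if len(words) > second_threshold
    )
    return flagged_words, flagged_answers


# Made-up associations; ("repayment", "A1") is repeated on purpose.
pairs = [
    ("repayment", "A1"), ("repayment", "A1"), ("repayment", "A2"),
    ("repayment", "A3"), ("password", "A3"), ("refund", "A3"),
    ("password", "A4"),
]
flagged_words, flagged_answers = find_matching_anomalies(pairs, 2, 2)
```

Either list being non-empty corresponds to the "and/or" condition in the text: a flagged hot word, a flagged answer, or both count as an abnormal matching case.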
In the embodiments of the present disclosure, a hot word set including hot words is obtained from the raw data of user questions; the hot words in the hot word set are sorted and screened to determine hot questions, and the hot words are associated with user feedback and answer information so as to obtain, for each hot word, the distribution of answers associated with that hot word and, for each answer, the hot words associated with that answer; it is detected whether the number of deduplicated answers associated with a hot word exceeds a first threshold and whether the number of hot words associated with an answer exceeds a second threshold; and when either or both thresholds are exceeded, it is determined that an abnormal hot word-answer matching case has occurred. By obtaining the hot word set and analyzing the two-way association between hot words and answers in combination with user feedback, the question-answer recognition effect can be monitored centrally, and the recognition effect for long-tail questions, such as abnormal matching cases, can receive effective attention. Hot word mining in the related art can be done with methods such as Apriori association rule mining or SegPhrase phrase discovery, which place high demands on word segmentation accuracy or corpus quality. Moreover, such solutions stop at hot word mining, whereas the solution of the embodiments of the present disclosure goes deeper into the question answering system to evaluate hot words and to discover bad cases of question-answer matching.
In the embodiments of the present disclosure, long-tail questions may refer to questions that do not occur very frequently.
FIG. 2 shows a flow chart of a question-answer recognition effect detection method according to another embodiment of the present disclosure. The embodiment shown in FIG. 2 differs from the embodiment shown in FIG. 1 in that, in addition to steps S101, S102, S103 and S104, it also includes steps S201 and S202.
In step S201, it is detected whether the deduplicated answers associated with a hot word include that hot word.
In step S202, when none of the deduplicated answers associated with a hot word includes that hot word, it is determined that an answer-missing case has occurred.
In an embodiment of the present disclosure, further analysis can be performed when an abnormal hot word-answer matching case is determined. If none of the answer titles associated with a hot word contains the hot word string, the answer library can be considered to be missing an answer. An answer-missing case can therefore be considered to have occurred, or the case can be treated as a candidate answer-missing case for later investigation. In the embodiments of the present disclosure, an answer-missing case can be regarded as one kind of bad case.
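The answer-missing check of steps S201 and S202 can be sketched as a containment test over answer titles. The mapping from hot word to answer titles and the case-sensitive substring test are assumptions for illustration.

```python
def find_missing_answer_words(titles_by_word):
    """Return hot words for which no associated answer title contains the hot word.

    titles_by_word: dict mapping a hot word to the titles of its associated answers.
    """
    missing = []
    for word, titles in titles_by_word.items():
        unique_titles = set(titles)  # deduplicate answers before checking
        if unique_titles and all(word not in title for title in unique_titles):
            missing.append(word)
    return sorted(missing)


# Made-up hot words and answer titles.
titles_by_word = {
    "repayment": ["Credit card repayment guide", "Repayment limits"],
    "face login": ["Reset your password", "Change phone number"],
}
missing_words = find_missing_answer_words(titles_by_word)
```

A hot word with no associated answers at all is skipped here, since the text only covers hot words that already have answers attached.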
In the embodiments of the present disclosure, by obtaining the hot word set and analyzing the two-way association between hot words and answers in combination with user feedback, the question-answer recognition effect can be monitored centrally, and the recognition effect for long-tail questions, such as abnormal matching cases and answer-missing cases, can receive effective attention.
FIG. 3 shows an example of a flow chart of step S101 of the question-answer recognition effect detection method according to another embodiment of the present disclosure. Step S101 includes steps S301, S302, S303 and S304.
In step S301, a preset new word discovery algorithm is applied to the raw data of user questions to obtain new words, where a new word is composed of characters.
In step S302, a word segmentation dictionary is generated from the basic words in a preset basic lexicon and the obtained new words, and the user questions are segmented with the word segmentation dictionary to obtain question segmentation results.
In step S303, a preset phrase discovery algorithm is applied to the segmentation results to obtain phrases, where a phrase is composed of the words obtained by segmentation.
In step S304, the popularity of each phrase is calculated according to a preset popularity algorithm, and phrases whose popularity is greater than a preset popularity threshold are determined to be hot words.
In an embodiment of the present disclosure, the preset new word discovery algorithm may be any algorithm for obtaining new words in the related art, as long as it can process the raw data of user questions to obtain new words.
In an embodiment of the present disclosure, step S301 includes: calculating, by the preset new word discovery algorithm, the degree of freedom and the degree of cohesion of character strings in the raw data of user questions, and filtering the character strings with preset new word thresholds to obtain new words. Calculating the degree of freedom and the degree of cohesion with the character as the unit, together with the preset new word thresholds, makes it possible to identify new expressions effectively and to avoid bad cases in the new word discovery process. The new word thresholds may refer to the thresholds set, according to actual needs, for the degree of freedom and the degree of cohesion of a character string. The new words obtained may be potential new words in new questions and may represent new products, new services, new ways of expression and so on. The discovered new words may also be new words of coarse word granularity.
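The patent does not give formulas for the degree of freedom or the degree of cohesion. The sketch below uses one common choice as an assumption: cohesion as the minimum, over all binary splits, of the ratio between the string's probability and the product of its parts' probabilities, and freedom as the smaller of the left and right neighbor entropies. The function names, the thresholds and the Latin sample strings are illustrative only; the same code applies unchanged to Chinese characters.

```python
import math
from collections import Counter, defaultdict


def neighbor_entropy(neighbors):
    """Shannon entropy (natural log) of a neighbor-character counter."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbors.values())


def score_candidates(texts, max_len=4):
    """Return {string: (cohesion, freedom)} for substrings of length 2..max_len."""
    counts = Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    total_chars = sum(len(t) for t in texts)
    for text in texts:
        for i in range(len(text)):
            for n in range(1, min(max_len, len(text) - i) + 1):
                s = text[i:i + n]
                counts[s] += 1
                if i > 0:
                    left[s][text[i - 1]] += 1
                if i + n < len(text):
                    right[s][text[i + n]] += 1
    scores = {}
    for s, c in counts.items():
        if len(s) < 2:
            continue
        p = c / total_chars
        # Cohesion: weakest binary split of the string (assumed definition).
        cohesion = min(
            p / ((counts[s[:k]] / total_chars) * (counts[s[k:]] / total_chars))
            for k in range(1, len(s))
        )
        # Freedom: the less varied of the two sides (assumed definition).
        freedom = min(neighbor_entropy(left[s]), neighbor_entropy(right[s]))
        scores[s] = (cohesion, freedom)
    return scores


scores = score_candidates(["abxy", "cabz", "dabw"])
# Hypothetical new word thresholds on cohesion and freedom.
new_words = sorted(s for s, (coh, free) in scores.items() if coh >= 2.0 and free >= 0.5)
```

In this toy corpus only "ab" appears with varied neighbors on both sides, so it is the only string that passes both thresholds.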
In an embodiment of the present disclosure, step S303 includes: calculating, by the preset phrase discovery algorithm, the degree of freedom and the degree of cohesion of the words in the segmentation results, and filtering the word sequences with preset phrase thresholds to obtain phrases. Calculating the degree of freedom and the degree of cohesion with the word as the unit, together with the preset phrase thresholds, makes it possible to identify phrase expressions effectively and to avoid bad cases in the phrase discovery process. The phrase thresholds may refer to the thresholds set, according to actual needs, for the degree of freedom and the degree of cohesion of a word sequence.
In an embodiment of the present disclosure, when the popularity of a phrase is calculated according to the preset popularity algorithm, a weighted analysis may be performed over indicators such as the information entropy and mutual information of the phrase, the frequency of occurrence of the phrase, and the balance of occurrences of its left and right neighbors, to obtain a popularity value for the phrase. In the embodiments of the present disclosure, information entropy is a concept from information theory used to measure the amount of information, and it directly reflects how ordered a system is: the more regular and ordered the system, the lower the information entropy; conversely, the more chaotic and disordered the system, the higher the information entropy. Mutual information is a useful measure in information theory. It can be seen as the amount of information that one random variable contains about another, or as the reduction in the uncertainty of one random variable due to knowledge of another. The left and right neighbors are the phrases to the left and right of the current phrase. Based on the teachings of the embodiments of the present disclosure, those skilled in the art will understand that the popularity of a phrase can be calculated with calculation methods in the related art.
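The weighted analysis can be sketched as a linear combination of the indicators named above. The patent gives neither the weights nor the exact indicator definitions, so the weights, the log scaling of frequency, and the balance measure (ratio of the smaller to the larger neighbor count) are all assumptions for illustration.

```python
import math


def heat_score(entropy, mutual_info, frequency, left_neighbors, right_neighbors,
               weights=(0.3, 0.3, 0.3, 0.1)):
    """Weighted popularity score of a phrase (hypothetical formula).

    entropy, mutual_info: precomputed information entropy and mutual information.
    frequency: number of occurrences of the phrase.
    left_neighbors, right_neighbors: occurrence counts of left and right neighbors.
    """
    largest = max(left_neighbors, right_neighbors)
    # Balance is 1.0 when both sides occur equally often, lower when lopsided.
    balance = min(left_neighbors, right_neighbors) / largest if largest > 0 else 0.0
    features = (entropy, mutual_info, math.log1p(frequency), balance)
    return sum(w * f for w, f in zip(weights, features))


# Made-up indicator values for one phrase, and a hypothetical popularity threshold.
score = heat_score(entropy=1.2, mutual_info=3.5, frequency=40,
                   left_neighbors=12, right_neighbors=9)
is_hot_word = score > 1.0
```

Comparing the score with a preset threshold mirrors step S304, where phrases above the popularity threshold are treated as hot words.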
In the embodiments of the present disclosure, unlike conventional schemes that mine phrases through association mining and similar methods, this scheme treats phrase aggregation as a special case of new word discovery: the degree of freedom and the degree of cohesion are computed with the word as the unit. This not only yields aggregated phrases but also effectively identifies emerging phrase expressions and avoids bad cases produced during new word discovery. For example, if new word discovery mistakenly mines the two words “怎么开” and “通支付工具” (a wrong split of “怎么开通支付工具”, "how to activate the payment tool"), phrase mining (acquisition) can recombine them into the valid phrase “怎么开通支付工具”, thereby correcting the error.
Moreover, the question-answer recognition effect detection scheme according to the embodiments of the present disclosure analyzes features specific to the question-answer system, such as fluctuation over time, user feedback, and answer distribution, and therefore helps with both popularity identification and the discovery of long-tail questions.
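The question-answer recognition effect detection device of the present disclosure is described below with reference to FIG. 4.
FIG. 4 shows a structural block diagram of a question-answer recognition effect detection device according to another embodiment of the present disclosure. The device shown in FIG. 4 includes a hot word acquisition module 401, a question-answer association module 402, an anomaly detection module 403, and an anomaly determination module 404.
The hot word acquisition module 401 is configured to obtain a hot word set including hot words from the raw data of user questions.
The question-answer association module 402 is configured to sort and filter the hot words in the hot word set to determine hot questions, and to associate the hot words with user feedback and answer information, so as to obtain, for each hot word, the distribution of the answers associated with it and, for each answer, the hot words associated with it.
The anomaly detection module 403 is configured to detect whether the number of deduplicated answers associated with a hot word exceeds a first threshold and whether the number of hot words associated with an answer exceeds a second threshold.
The anomaly determination module 404 is configured to determine that a hot word/answer matching anomaly has occurred when the anomaly detection module 403 detects that the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold.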
In one embodiment of the present disclosure, the hot word set may be obtained from the raw data of user questions by hot word mining: operations such as new word discovery, phrase mining, and popularity calculation are applied to the raw data of user questions to obtain a hot word set or a candidate hot word set.
In one embodiment of the present disclosure, the question-answer association module 402 is further configured to sort and filter the hot word results according to the time-period distribution of the hot words, the distribution of their entry-point diversity, and a periodicity analysis of the hot words, so as to determine hot questions. The time-period distribution of a hot word may refer to the trend of its count per time period, for example per 10 minutes.
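The time-period distribution mentioned above amounts to counting a hot word's occurrences per fixed-width bucket. The sketch below assumes each occurrence is available as a naive `datetime`; the function name `bucket_counts` and the choice to align buckets to the Unix epoch are illustrative assumptions, not details from the disclosure.

```python
from collections import Counter
from datetime import datetime, timedelta


def bucket_counts(timestamps, minutes=10):
    """Count hot word occurrences per fixed-width time bucket.

    Returns a dict mapping each bucket's start time to its count,
    ordered by time, so the values form the trend of the hot word.
    """
    width = timedelta(minutes=minutes)
    epoch = datetime(1970, 1, 1)
    counts = Counter()
    for ts in timestamps:
        index = (ts - epoch) // width      # integer bucket number
        counts[epoch + index * width] += 1
    return dict(sorted(counts.items()))
```

The resulting series can feed both the ranking step (recent spikes rank higher) and the periodicity analysis (a word that peaks at the same time every day is less likely to signal a new issue).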
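In one embodiment of the present disclosure, the question-answer association module 402 is further configured to build an inverted index keyed by hot word and to associate the hot words with user feedback and answer information. Thus, once hot word acquisition (hot word mining) is complete, an inverted index keyed by hot word can be built to associate each hot word with user feedback and answer information, which speeds up the subsequent analysis. In addition, user feedback may include the user's evaluation of the question-answer system, among other things.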
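A hot-word-keyed inverted index can be sketched as a mapping from each hot word to the question-answer records that mention it. The record fields `question`, `answer_id`, and `feedback`, the function name `build_inverted_index`, and the use of plain substring matching are assumptions made for this illustration.

```python
from collections import defaultdict


def build_inverted_index(records, hot_words):
    """Map each hot word to the (answer_id, feedback) pairs of matching questions.

    records: iterable of dicts with 'question', 'answer_id' and 'feedback' keys.
    Matching is by substring, which is a simplification.
    """
    index = defaultdict(list)
    for record in records:
        for word in hot_words:
            if word in record["question"]:
                index[word].append((record["answer_id"], record["feedback"]))
    return dict(index)
```

With the index in place, later steps can look up all answers and feedback for a hot word directly instead of rescanning the question log.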
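In one embodiment of the present disclosure, the question-answer association module 402 is further configured to analyze, for each individual hot word, the question-answer pairs associated with that hot word to obtain the distribution of answers associated with it. A single hot word may be associated with multiple question-answer pairs, so these pairs can be analyzed together to obtain the answer distribution for the hot word. For example, for a hot word with clear semantics, such as "credit card repayment", the corresponding answers should all center on the topic of that hot word.
In one embodiment of the present disclosure, the question-answer association module 402 is further configured to cluster, for each individual answer, the hot words of the user questions that correspond to that answer, to obtain the hot words associated with the answer. For example, the original questions that map to a given answer should revolve around one or a small number of hot topics, and should not relate to more hot words than a certain threshold.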
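The two directions of the association can be kept in two small mappings. This sketch assumes the input has already been reduced to `(hot_word, answer_id)` pairs, one per answered question; `two_way_association` is a hypothetical name, and grouping hot words into a set per answer is a simplification of the hot word clustering the text describes.

```python
from collections import Counter, defaultdict


def two_way_association(pairs):
    """Build both directions of the hot word/answer association.

    pairs: iterable of (hot_word, answer_id), one per answered question.
    Returns (answer_distribution, words_by_answer) where
      answer_distribution[hot_word][answer_id] is how often that answer was given,
      words_by_answer[answer_id] is the set of hot words seen for that answer.
    """
    answer_distribution = defaultdict(Counter)
    words_by_answer = defaultdict(set)
    for hot_word, answer_id in pairs:
        answer_distribution[hot_word][answer_id] += 1
        words_by_answer[answer_id].add(hot_word)
    return answer_distribution, words_by_answer
```

The number of keys in `answer_distribution[hot_word]` is the deduplicated answer count for that hot word, and `len(words_by_answer[answer_id])` is the hot word count for that answer.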
In one embodiment of the present disclosure, determining that a hot word/answer matching anomaly has occurred when the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold means the following: a matching anomaly can be determined in any of three situations, namely (1) the number of deduplicated answers associated with a hot word exceeds the first threshold, (2) the number of hot words associated with an answer exceeds the second threshold, or (3) both conditions hold at once. In other words, if a hot word is associated with too many deduplicated answers, the questions associated with that hot word are likely to have received answers unrelated to them; if an answer is associated with too many hot words, the original questions mapped to that answer span too many topics (hot words). In either case a hot word/answer matching anomaly can be considered to have occurred, or the case can be recorded as a candidate matching anomaly for later investigation. In the embodiments of the present disclosure, a hot word/answer matching anomaly can be regarded as one kind of bad case. Those skilled in the art will understand that the first threshold and the second threshold may be set to any value according to actual needs. In addition, deduplication refers to removing duplicate answers, which prevents the statistics and calculations from being distorted.
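The two threshold checks can be written as a short function. This is a sketch under assumed data shapes: `answers_by_word` maps a hot word to the answers it was matched with (possibly with repeats) and `words_by_answer` maps an answer to its hot words; the function and parameter names are hypothetical.

```python
def find_matching_anomalies(answers_by_word, words_by_answer,
                            first_threshold, second_threshold):
    """Return (suspicious_hot_words, suspicious_answers).

    A hot word is suspicious when its number of distinct answers exceeds
    first_threshold; an answer is suspicious when its number of distinct
    hot words exceeds second_threshold. Either list being non-empty
    signals a hot word/answer matching anomaly.
    """
    suspicious_words = sorted(
        word for word, answers in answers_by_word.items()
        if len(set(answers)) > first_threshold      # set() removes duplicate answers
    )
    suspicious_answers = sorted(
        answer for answer, words in words_by_answer.items()
        if len(set(words)) > second_threshold
    )
    return suspicious_words, suspicious_answers
```

Converting to a set before counting is what keeps a frequently repeated answer from inflating the count, which is the distortion that deduplication is meant to prevent.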
In the embodiments of the present disclosure, the hot word acquisition module is configured to obtain a hot word set including hot words from the raw data of user questions; the question-answer association module is configured to sort and filter the hot words in the hot word set to determine hot questions, and to associate the hot words with user feedback and answer information, so as to obtain, for each hot word, the distribution of answers associated with it and, for each answer, the hot words associated with it; the anomaly detection module is configured to detect whether the number of deduplicated answers associated with a hot word exceeds the first threshold and whether the number of hot words associated with an answer exceeds the second threshold; and the anomaly determination module is configured to determine that a hot word/answer matching anomaly has occurred when the anomaly detection module detects that the number of deduplicated answers associated with a hot word exceeds the first threshold and/or the number of hot words associated with an answer exceeds the second threshold. By obtaining the hot word set, analyzing the two-way association between hot words and answers, and taking user feedback into account, the question-answer recognition effect can be monitored centrally, and the recognition effect on long-tail questions such as matching anomalies can be tracked effectively. Hot word mining schemes in the related art can be built on methods such as Apriori association rule mining and SegPhrase phrase discovery, which place high demands on segmentation accuracy or corpus quality. Moreover, schemes in the related art stop at hot word mining, whereas the scheme of the embodiments of the present disclosure goes into the question-answer system itself to evaluate hot words and to discover bad cases in question-answer matching.
In the embodiments of the present disclosure, long-tail questions may refer to questions that occur infrequently.
FIG. 5 shows a structural block diagram of a question-answer recognition effect detection device according to another embodiment of the present disclosure. The embodiment shown in FIG. 5 differs from the embodiment shown in FIG. 4 in that, in addition to the hot word acquisition module 401, the question-answer association module 402, the anomaly detection module 403, and the anomaly determination module 404, it further includes a missing detection module 501 and a missing determination module 502.
The missing detection module 501 is configured to detect whether the deduplicated answers associated with a hot word include that hot word.
The missing determination module 502 is configured to determine that an answer-missing case has occurred when the missing detection module 501 detects that none of the deduplicated answers associated with a hot word includes that hot word.
In one embodiment of the present disclosure, further analysis may be performed once a hot word/answer matching anomaly has been determined: if none of the answer titles associated with a hot word contains the hot word string, the answer library can be considered to lack an answer, so an answer-missing case can be considered to have occurred, or the case can be recorded as a candidate answer-missing case for later investigation. In the embodiments of the present disclosure, an answer-missing case can be regarded as one kind of bad case.
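The answer-missing check reduces to a substring test over the deduplicated answer titles. In this sketch the name `is_answer_missing` is hypothetical, and returning `False` for a hot word with no associated answers at all is an assumption, since the text does not cover that case.

```python
def is_answer_missing(hot_word, answer_titles):
    """True when no deduplicated answer title contains the hot word string."""
    titles = set(answer_titles)                 # deduplicate first
    if not titles:
        return False                            # assumption: nothing to judge
    return all(hot_word not in title for title in titles)
```

A positive result suggests that the answer library has no entry for the topic, so the case can be queued for a content owner rather than for a matching-model fix.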
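In the embodiments of the present disclosure, by obtaining the hot word set, analyzing the two-way association between hot words and answers, and taking user feedback into account, the question-answer recognition effect can be monitored centrally, and the recognition effect on long-tail questions such as matching anomalies and answer-missing cases can be tracked effectively.
FIG. 6 shows a structural block diagram of the hot word acquisition module 401 in a question-answer recognition effect detection device according to an embodiment of the present disclosure. The hot word acquisition module 401 includes a new word acquisition submodule 601, a question segmentation submodule 602, a phrase acquisition submodule 603, and a popularity calculation submodule 604.
The new word acquisition submodule 601 is configured to process the raw data of user questions with a preset new word discovery algorithm to obtain new words, where a new word is composed of characters.
The question segmentation submodule 602 is configured to generate a segmentation dictionary from the base words in a preset base lexicon together with the acquired new words, and to segment the user questions with the segmentation dictionary to obtain the question segmentation result.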
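The dictionary construction and segmentation step can be illustrated with forward maximum matching. The disclosure does not name a segmentation algorithm, so forward maximum matching is a stand-in chosen for brevity, and `build_dictionary` and `segment` are hypothetical names.

```python
def build_dictionary(base_words, new_words):
    """Merge the base lexicon with the newly discovered words."""
    return set(base_words) | set(new_words)


def segment(text, dictionary):
    """Forward maximum matching: take the longest dictionary word at each position.

    Characters that start no dictionary word are emitted as single tokens.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

For instance, with base words “怎么” and “开通” and the new word “支付工具”, the question “怎么开通支付工具” is split into those three words, which then serve as the units for phrase discovery.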
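The phrase acquisition submodule 603 is configured to process the segmentation result with a preset phrase discovery algorithm to obtain phrases, where a phrase is composed of the words obtained by segmentation.
The popularity calculation submodule 604 is configured to calculate the popularity of each phrase according to a preset popularity algorithm and to determine phrases whose popularity exceeds a preset popularity threshold as hot words.
In one embodiment of the present disclosure, the preset new word discovery algorithm may be any algorithm for acquiring new words in the related art, as long as it can process the raw data of user questions to obtain new words.
In one embodiment of the present disclosure, the new word acquisition submodule 601 is configured to compute, with a preset new word discovery algorithm, the degree of freedom and the degree of cohesion of the character strings in the raw data of user questions, and to filter the character strings against a preset new word threshold to obtain new words. Because the degree of freedom and the degree of cohesion are computed with the character as the unit and checked against the preset new word threshold, new word expressions can be identified effectively and bad cases in new word discovery can be avoided. The new word threshold may refer to the thresholds set, according to actual needs, on the degree of freedom and the degree of cohesion of a character string. In addition, the new words obtained may be potential new words in new questions and may represent new products, new services, or new ways of expression. The new words discovered may also be new words of high word granularity.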
In one embodiment of the present disclosure, the phrase acquisition submodule 603 is further configured to compute, with a preset phrase discovery algorithm, the degree of freedom and the degree of cohesion of the words in the segmentation result, and to filter the words against a preset phrase threshold to obtain phrases. Because the degree of freedom and the degree of cohesion are computed with the word as the unit and checked against the preset phrase threshold, phrase expressions can be identified effectively and bad cases in phrase discovery can be avoided. The phrase threshold may refer to the thresholds set, according to actual needs, on the degree of freedom and the degree of cohesion of a word sequence.
In one embodiment of the present disclosure, when the popularity of a phrase is calculated according to the preset popularity algorithm, a weighted analysis may be performed over indicators such as the information entropy and mutual information of the phrase, its frequency of occurrence, and the balance between the occurrences of its left and right neighbors, to obtain a popularity value for the phrase. In the embodiments of the present disclosure, information entropy is a concept from information theory that measures the amount of information and directly reflects how ordered a system is: the more regular and ordered the system, the lower its information entropy; the more chaotic and disordered the system, the higher its information entropy. Mutual information is a useful measure in information theory; it can be regarded as the amount of information that one random variable contains about another, or as the reduction in the uncertainty of one random variable given knowledge of another. Left and right neighbors refer to the phrases to the left and right of the current phrase. Based on the teachings of the embodiments of the present disclosure, those skilled in the art will understand that the popularity of a phrase may be calculated by calculation methods in the related art.
In the embodiments of the present disclosure, unlike conventional schemes that mine phrases through association mining and similar methods, this scheme treats phrase aggregation as a special case of new word discovery: the degree of freedom and the degree of cohesion are computed with the word as the unit. This not only yields aggregated phrases but also effectively identifies emerging phrase expressions and avoids bad cases produced during new word discovery. For example, if new word discovery mistakenly mines the two words “怎么开” and “通支付工具” (a wrong split of “怎么开通支付工具”, "how to activate the payment tool"), phrase mining (acquisition) can recombine them into the valid phrase “怎么开通支付工具”, thereby correcting the error.
Moreover, the question-answer recognition effect detection scheme according to the embodiments of the present disclosure analyzes features specific to the question-answer system, such as fluctuation over time, user feedback, and answer distribution, and therefore helps with both popularity identification and the discovery of long-tail questions.
The internal functions and structure of the question-answer recognition effect detection device have been described above. In one possible design, the structure of the device may be implemented as question-answer recognition effect detection equipment. As shown in FIG. 7, the equipment 700 may include a processor 701 and a memory 702.
The memory 702 is configured to store a program that supports the question-answer recognition effect detection device in executing the question-answer recognition effect detection method of any of the above embodiments, and the processor 701 is configured to execute the program stored in the memory 702.
The memory 702 is configured to store one or more computer instructions, where the one or more computer instructions are executed by the processor 701.
The processor 701 is configured to perform all or some of the foregoing method steps.
The structure of the question-answer recognition effect detection equipment may further include a communication interface, through which the equipment communicates with other devices or with a communication network.
Exemplary embodiments of the present disclosure further provide a computer storage medium for storing the computer software instructions used by the question-answer recognition effect detection device, including the program involved in executing the question-answer recognition effect detection method of any of the above embodiments.
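FIG. 8 is a schematic structural diagram of a computer system suitable for implementing the question-answer recognition effect detection method according to an embodiment of the present disclosure.
As shown in FIG. 8, the computer system 800 includes a central processing unit (CPU) 801, which can perform the various processes of the embodiment shown in FIG. 1 above according to a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The RAM 803 also stores the various programs and data required for the operation of the system 800. The CPU 801, the ROM 802, and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as needed. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as needed, so that a computer program read from it can be installed into the storage section 808 as needed.
In particular, according to the embodiments of the present disclosure, the method described above with reference to FIG. 1 may be implemented as a computer software program. For example, the embodiments of the present disclosure include a computer program product that includes a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the data processing method of FIG. 1. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 809 and/or installed from the removable medium 811.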
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units or modules described in the embodiments of the present disclosure may be implemented in software or in hardware. The described units or modules may also be provided in a processor, and in some cases the names of these units or modules do not limit the units or modules themselves.
In another aspect, the present disclosure further provides a computer-readable storage medium. The computer-readable storage medium may be the one included in the device described in the above embodiments, or it may be a standalone computer-readable storage medium that is not assembled into any equipment. The computer-readable storage medium stores one or more programs, which are used by one or more processors to execute the methods described in the present disclosure.
The above description covers only preferred embodiments of the present disclosure and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present disclosure.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810923157.XA CN109271495B (en) | 2018-08-14 | 2018-08-14 | Question-answer recognition effect detection method, device, equipment and readable storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810923157.XA CN109271495B (en) | 2018-08-14 | 2018-08-14 | Question-answer recognition effect detection method, device, equipment and readable storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109271495A CN109271495A (en) | 2019-01-25 |
| CN109271495B true CN109271495B (en) | 2023-02-17 |
Family
ID=65153351
Family Applications (1)
| Application Number | Status | Publication | Title | Priority Date | Filing Date |
|---|---|---|---|---|---|
| CN201810923157.XA | Active | CN109271495B (en) | Question-answer recognition effect detection method, device, equipment and readable storage medium | 2018-08-14 | 2018-08-14 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109271495B (en) |
Families Citing this family (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN110415705B (en) * | 2019-08-01 | 2022-03-01 | 苏州奇梦者网络科技有限公司 | Hot word recognition method, system, device and storage medium |
| CN111680134B (en) * | 2020-04-20 | 2023-05-02 | 重庆兆光科技股份有限公司 | Method for measuring inquiry and answer consultation information by information entropy |
| CN112487140B (en) * | 2020-11-27 | 2024-06-07 | 平安科技(深圳)有限公司 | Question-answer dialogue evaluating method, device, equipment and storage medium |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101520802A (en) * | 2009-04-13 | 2009-09-02 | 腾讯科技(深圳)有限公司 | Question-answer pair quality evaluation method and system |
| CN103577556A (en) * | 2013-10-21 | 2014-02-12 | 北京奇虎科技有限公司 | Device and method for obtaining association degree of question and answer pair |
| CN106997399A (en) * | 2017-05-24 | 2017-08-01 | 海南大学 | A kind of classification question answering system design method that framework is associated based on data collection of illustrative plates, Information Atlas, knowledge mapping and wisdom collection of illustrative plates |
| CN106997245A (en) * | 2016-01-24 | 2017-08-01 | 杨文韬 | A kind of method that input method dictionary is built according to Chinese language model |
| CN107424461A (en) * | 2017-08-01 | 2017-12-01 | 深圳市鹰硕技术有限公司 | Information screen method and system |
| WO2018000282A1 (en) * | 2016-06-29 | 2018-01-04 | 深圳狗尾草智能科技有限公司 | Extended learning method of chat dialogue system and chat dialogue system |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20020070967A1 (en) * | 2000-12-08 | 2002-06-13 | Tanner Timothy T. | Method and apparatus for map display of news stories |
| US8825488B2 (en) * | 2010-04-12 | 2014-09-02 | Adobe Systems Incorporated | Method and apparatus for time synchronized script metadata |
| CN102710795B (en) * | 2012-06-20 | 2015-02-11 | 北京奇虎科技有限公司 | Hot spot polymerization method and device |
| US20140120513A1 (en) * | 2012-10-25 | 2014-05-01 | International Business Machines Corporation | Question and Answer System Providing Indications of Information Gaps |
| CN105786875B (en) * | 2014-12-23 | 2019-06-14 | 北京奇虎科技有限公司 | Method and apparatus for providing question-and-answer data search results |
| WO2016101727A1 (en) * | 2014-12-23 | 2016-06-30 | 北京奇虎科技有限公司 | Question-and-answer-based search result adjustment method and device |
| US20160196490A1 (en) * | 2015-01-02 | 2016-07-07 | International Business Machines Corporation | Method for Recommending Content to Ingest as Corpora Based on Interaction History in Natural Language Question and Answering Systems |
| CN105654945B (en) * | 2015-10-29 | 2020-03-06 | 乐融致新电子科技(天津)有限公司 | Language model training method, device and equipment |
Non-Patent Citations (3)
| Title |
|---|
| 基于混合策略的公众健康领域新词识别方法研究;侯丽等;《图书情报工作》;20151205(第23期);全文 * |
| 基于社区的问答搜索引擎搜索结果重合率研究;黄玉等;《山东科学》;20090815(第04期);全文 * |
| 网络用语词典的构建及问题分析;昝红英等;《中文信息学报》;20161115(第06期);全文 * |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109271495A (en) | 2019-01-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US11710054B2 (en) | Information recommendation method, apparatus, and server based on user data in an online forum | |
| US9288124B1 (en) | Systems and methods of classifying sessions | |
| US9817893B2 (en) | Tracking changes in user-generated textual content on social media computing platforms | |
| US20200151392A1 (en) | System and method automated analysis of legal documents within and across specific fields | |
| CN106557558B (en) | Data analysis method and device | |
| US20210224481A1 (en) | Method and apparatus for topic early warning, computer equipment and storage medium | |
| US20190303395A1 (en) | Techniques to determine portfolio relevant articles | |
| CN110135976A (en) | User's portrait generation method, device, electronic equipment and computer-readable medium | |
| CN112084150B (en) | Model training, data retrieval method, device, equipment and storage medium | |
| CN110020176A (en) | A kind of resource recommendation method, electronic equipment and computer readable storage medium | |
| CN110390408A (en) | Trading object prediction technique and device | |
| KR20220143766A (en) | Dynamic discovery and correction of data quality issues | |
| CN106354856A (en) | Deep neural network enhanced search method and device based on artificial intelligence | |
| CN112148776A (en) | Academic relation prediction method and device based on neural network introducing semantic information | |
| CN109271495B (en) | Question-answer recognition effect detection method, device, equipment and readable storage medium | |
| CN109582967B (en) | Public opinion abstract extraction method, device, equipment and computer-readable storage medium | |
| CN111241381A (en) | Information recommendation method, apparatus, electronic device, and computer-readable storage medium | |
| US20250088542A1 (en) | Data set and algorithm validation, bias characterization, and valuation | |
| CN112101447B (en) | Quality evaluation method, device, equipment and storage medium for data set | |
| CN109947728A (en) | A kind of processing method and processing device of journal file | |
| CN111651643B (en) | Candidate content processing method and related equipment | |
| Ting et al. | Applying social network embedding and word embedding for socialbots detection | |
| US11989513B2 (en) | Quantitative comment summarization | |
| CN113449507B (en) | Quality improvement method and device, electronic equipment and storage medium | |
| CN119336530A (en) | Abnormality detection method and device, electronic device, and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20200922. Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands. Applicant after: Advanced innovation technology Co.,Ltd. Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands. Applicant before: Alibaba Group Holding Ltd. |
| | TA01 | Transfer of patent application right | Effective date of registration: 20200922. Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands. Applicant after: Innovative advanced technology Co.,Ltd. Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands. Applicant before: Advanced innovation technology Co.,Ltd. |
| | GR01 | Patent grant | |