CN118838993A - Method for constructing keyword library and related products thereof - Google Patents
- Publication number: CN118838993A
- Application number: CN202410869190.4A
- Authority: CN (China)
- Prior art keywords: keywords, keyword, frequency, user, text
- Prior art date
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The present invention discloses a method for constructing a keyword library, and related products. The method includes: obtaining text data entered by a user in a dialogue system and/or historical log data of the user's operations; mining keywords from the user-entered text data against a preset usage frequency to obtain high-frequency keywords; mining keywords from the historical log data with a predetermined strategy to obtain extended keywords; analyzing the high-frequency keywords and/or the extended keywords to obtain candidate keywords; and reviewing the candidate keywords so that the keywords that pass review are used to construct the keyword library. The provided solution quickly accumulates high-quality keywords and improves the efficiency and accuracy of keyword-library construction in a dialogue system.
Description
Technical Field
The present invention relates generally to the field of natural language processing. More specifically, it relates to a method for constructing a keyword library, an electronic device, and a computer-readable storage medium.
Background Art
With the development of natural language processing, keyword extraction, one of the core text-mining techniques, has been widely applied in fields such as artificial intelligence and semantic recognition. Existing keyword-library construction techniques are mainly used in search engine optimization, advertising, content recommendation, and similar areas. Their main purpose is to analyze large amounts of text data and extract representative, relevant keywords so that text content can be better understood and indexed, thereby improving search accuracy and efficiency. Existing techniques work well for building an initial keyword library, but as time passes and domain knowledge changes, the library can become outdated and must then be updated or extended.
When building an initial keyword library, existing techniques usually include the following steps:
Data preprocessing: clean the raw text data, segment it into words, and remove stop words in preparation for subsequent processing.
Feature extraction: extract keywords from the preprocessed text data; commonly used methods include TF-IDF, TextRank, and LDA.
Keyword screening and optimization: according to preset thresholds or rules, screen representative and relevant keywords out of the extracted candidates and optimize them to improve search accuracy and efficiency.
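As a minimal, self-contained illustration of the feature-extraction step above, the following is a toy pure-Python TF-IDF ranker (not the patented method; the sample documents and stop-word list are invented):

```python
import math
from collections import Counter

STOPWORDS = {"how", "to", "the", "for", "a", "an", "of"}  # illustrative only

docs = [
    "how to reset the device password",
    "device firmware update failed",
    "reset password for the admin account",
]

def tokenize(text):
    return [w for w in text.lower().split() if w not in STOPWORDS]

tokenized = [tokenize(d) for d in docs]

def idf(term):
    df = sum(1 for doc in tokenized if term in doc)  # document frequency
    return math.log(len(tokenized) / df) + 1.0       # +1 keeps ubiquitous terms > 0

def top_keywords(doc_index, k=3):
    counts = Counter(tokenized[doc_index])
    total = sum(counts.values())
    scores = {t: (c / total) * idf(t) for t, c in counts.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(top_keywords(0))  # ['reset', 'device', 'password'] (ties keep first-seen order)
```

In a real pipeline the tokenizer and stop-word list would be domain-specific, which is exactly the weakness of general-purpose segmenters discussed below.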
However, these solutions still have problems. General-purpose word segmenters often perform poorly on specific business data and cannot distinguish user-input text from knowledge-base text, and traditional segmentation has no dedicated strategy for compound phrases such as noun phrases and verb phrases. In addition, existing techniques usually rely on statistical methods (such as TF-IDF) or graph-based methods (such as TextRank) to extract keywords. These methods may work well on large-scale data but are often not precise enough to capture the nuances of a specific domain or of dialogue context. Statistical methods such as TF-IDF frequently misidentify common words as keywords, filling the keyword library with non-key information. Likewise, word segmentation alone cannot handle abbreviations and noun aliases. These problems leave the keyword library too sparse to be used in downstream tasks, and key phrases composed of multiple words remain difficult to process.
As a result, existing keyword libraries often suffer from low keyword accuracy and coverage, and their construction requires substantial manual intervention.
In view of this, there is an urgent need for a method of constructing a keyword library that improves both the efficiency and the quality of the construction process.
Summary of the Invention
To solve at least one or more of the technical problems mentioned above, the present invention proposes, in multiple aspects, solutions for constructing a keyword library.
In a first aspect, the present invention provides a method for constructing a keyword library, comprising: obtaining text data entered by a user in a dialogue system and/or historical log data of the user's operations; mining keywords from the user-entered text data against a preset usage frequency to obtain high-frequency keywords; mining keywords from the historical log data with a predetermined strategy to obtain extended keywords; analyzing the high-frequency keywords and/or the extended keywords to obtain candidate keywords; and reviewing the candidate keywords so that the keywords that pass review are used to construct the keyword library.
In some embodiments, analyzing the high-frequency keywords and/or the extended keywords to obtain candidate keywords includes: acquiring knowledge-base text data from the dialogue system; training a model on the knowledge-base text data to obtain a text-analysis model; and using the text-analysis model to analyze the high-frequency keywords and/or the extended keywords to obtain candidate keywords.
In some embodiments, using the text-analysis model to analyze the high-frequency keywords and/or the extended keywords includes: performing relationship analysis on the high-frequency keywords and/or the extended keywords through part-of-speech rules combined with character-similarity and vector-similarity rules, so as to filter out keyword pairs.
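One way to sketch such joint similarity filtering is to combine character-level similarity (here via Python's difflib) with cosine similarity over embedding vectors. The embeddings, thresholds, and keywords below are all invented placeholders; in the patented flow the vectors would come from the text-analysis model:

```python
from difflib import SequenceMatcher

def char_similarity(a, b):
    """Character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = lambda w: sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v))

# Hypothetical 2-d embeddings standing in for model output
emb = {
    "vaccine schedule": (0.90, 0.10),
    "vaccination schedule": (0.88, 0.15),
    "feed ratio": (0.10, 0.95),
}

def keyword_pairs(keywords, char_t=0.6, vec_t=0.9):
    """Keep pairs that pass both the character- and vector-similarity rules."""
    pairs = []
    for i, a in enumerate(keywords):
        for b in keywords[i + 1:]:
            if char_similarity(a, b) >= char_t and cosine(emb[a], emb[b]) >= vec_t:
                pairs.append((a, b))
    return pairs

print(keyword_pairs(list(emb)))
```

Requiring both rules to pass keeps surface variants of the same concept together while rejecting pairs that are similar on only one axis.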
In some embodiments, mining keywords from the user-entered text data against a preset usage frequency to obtain high-frequency keywords includes: performing a keyword usage-frequency analysis on the user-entered text data; and, based on the results of that analysis, screening out keywords whose usage frequency within a predetermined time window exceeds the preset usage frequency, so as to obtain high-frequency keywords.
In some embodiments, analyzing the high-frequency keywords to obtain candidate keywords includes: searching the dialogue system's knowledge base for the obtained high-frequency keywords; in response to finding a high-frequency keyword in the knowledge base, marking it as a known keyword; in response to not finding it, marking it as a new keyword; and scoring each new keyword using the distribution frequency and/or category of similar keywords in the knowledge base, for use in the analysis of the high-frequency keywords.
In some embodiments, the historical log data of user operations includes one or more of the following: the user's search data in the dialogue system; interaction data generated by the user's exchanges with the question-answering assistant in the dialogue system; and content data generated when the user clicks content titles in the dialogue system.
In some embodiments, mining keywords from the historical log data with a predetermined strategy to obtain extended keywords includes: applying an edit-distance algorithm and/or a longest-common-subsequence algorithm to the historical log data to obtain the extended keywords; this includes: sorting the search data in the historical log data by the usage frequency of the search keywords and selecting a preset number of search keywords in order; and computing the longest contiguous common subsequence of the selected search keywords to obtain the extended keywords in the search data.
In some embodiments, reviewing the candidate keywords and using the keywords that pass review to construct the keyword library includes: retaining the candidate keywords that pass review as valid keywords; performing part-of-speech analysis on the valid keywords to split the keyword phrases they contain into entities and attributes; generating new keyword phrases by randomly replacing the entity or attribute of a keyword phrase with words of the same kind; using the text-analysis model to analyze and evaluate the new keyword phrases and retaining the reasonable ones; and constructing the keyword library from the valid keywords and the reasonable keyword phrases.
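The entity/attribute recombination step can be sketched as follows. For determinism this toy version recombines exhaustively rather than replacing randomly, and the phrases are invented; in the patented flow the generated candidates would still be scored by the text-analysis model:

```python
import itertools

# Hypothetical valid keyword phrases already split into (entity, attribute)
phrases = [("vaccine", "schedule"), ("feed", "ratio")]
entities = {e for e, _ in phrases}
attributes = {a for _, a in phrases}

# Recombine entities with attributes and drop the phrases we started from
originals = {f"{e} {a}" for e, a in phrases}
candidates = {f"{e} {a}" for e, a in itertools.product(entities, attributes)} - originals
print(sorted(candidates))  # ['feed schedule', 'vaccine ratio']
```

Some generated phrases (like "feed schedule") are plausible while others are nonsense, which is why the evaluation step that keeps only reasonable phrases is essential.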
In a second aspect, the present invention provides an electronic device for constructing a keyword library, comprising: a processor; and a memory storing a computer program for constructing a keyword library which, when executed by the processor, implements the method for constructing a keyword library described in any embodiment of the first aspect.
In a third aspect, the present invention provides a computer-readable storage medium storing a computer-readable program for constructing a keyword library which, when executed by one or more processors, implements the method for constructing a keyword library described in any embodiment of the first aspect.
Through the method provided above, embodiments of the present invention extract high-frequency keywords and extended keywords from the text data entered by the user in the dialogue system and/or the historical log data of the user's operations, and obtain candidate keywords for building the keyword library by analyzing the high-frequency and/or extended keywords. Finally, after the candidate keywords are reviewed, the keywords used to build the library are obtained. This solution quickly accumulates high-quality keywords and improves the efficiency and accuracy of keyword-library construction in a dialogue system.
Furthermore, in some embodiments, training the text-analysis model on text data from the knowledge base compensates for the data deficiency of the pre-trained model. The present invention can effectively analyze the nesting relationships between keywords and the parts of speech within phrases, and identify topics and subtopics, which helps further optimize the structure and content of the keyword library and thereby improves the operational efficiency of the question-answering system and the knowledge base.
The provided solution therefore determines keywords through automated mining followed by review, which both accumulates high-quality keywords quickly and mines low-frequency long-tail keywords in depth, improving the efficiency and accuracy of keyword-library construction in the dialogue system. At the same time, by combining public search engines with the operation of the proprietary knowledge base (for example, additions and modifications that update its knowledge), newly emerging keywords can be captured promptly, keeping the keyword library synchronized. The keyword library thus remains timely and comprehensive, so the solution can both improve the operational efficiency of the question-answering system and the knowledge base and organize keywords quickly and accurately, which directly helps the analysis of user-question tags and improves user satisfaction.
Brief Description of the Drawings
By reading the following detailed description with reference to the accompanying drawings, the above and other objects, features, and advantages of the exemplary embodiments of the present invention will become readily understood. The drawings show several embodiments of the present invention in an exemplary, non-limiting manner, with the same or corresponding reference numerals denoting the same or corresponding parts, wherein:
FIG. 1 shows an exemplary flow chart of a method for constructing a keyword library according to an embodiment of the present invention;
FIG. 2 shows an exemplary flow chart of keyword analysis according to an embodiment of the present invention; and
FIG. 3 shows an exemplary structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on these embodiments, all other embodiments obtained by those skilled in the art without creative effort fall within the scope of protection of the present invention.
It should be understood that when the terms "first", "second", "third", and "fourth" are used in the claims, description, and drawings of the present invention, they serve only to distinguish different objects, not to describe a particular order. The terms "include" and "comprise" used in the description and claims indicate the presence of the described features, wholes, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.
It should also be understood that the terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in the specification and claims, the singular forms "a", "an", and "the" are intended to include the plural forms unless the context clearly indicates otherwise. It should further be understood that the term "and/or" used in the specification and claims refers to any and all possible combinations of one or more of the associated listed items, and includes those combinations.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining", or "in response to detecting", depending on the context. Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]", depending on the context.
To facilitate a full understanding of the technical solution of the present invention, some common terms of question answering (QA) systems, from data preprocessing to system evaluation, together with the specific techniques and operating principles involved, are introduced below.
Information Retrieval (IR): the process of retrieving and extracting relevant information from large-scale unstructured data sets. In question-answering systems, information retrieval is typically used to find text fragments potentially relevant to a question from a large collection of documents.
Knowledge Base (KB): a database containing structured information, usually composed of knowledge entries. Each entry consists of fields such as title, content, category, and content type. The knowledge base stores the knowledge that a question-answering system may need.
Natural Language Processing (NLP): the intersection of computer science, artificial intelligence, and linguistics, which aims to enable computers to understand, interpret, and generate human language.
Named Entity Recognition (NER): a natural language processing task that identifies entities with specific meanings in text, such as names of people, places, organizations, and times.
Semantic Parsing: the process of converting natural-language text into a machine-understandable logical form (such as a semantic representation).
Question-Answer Pairs: the basic data unit of a question-answering system, consisting of a question and its corresponding correct answer.
Accuracy: a metric of question-answering system performance, indicating the proportion of questions the system answers correctly.
Recall: a metric of question-answering system performance, indicating the proportion of questions the system answers correctly out of all questions that could be answered correctly.
F1 Score: the harmonic mean of precision and recall, used to evaluate the overall performance of a question-answering system.
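For reference, the three evaluation metrics above relate as follows (a small illustrative computation with invented counts, not data from the patent):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```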
Clustering Algorithm: an algorithm in data mining and machine learning that groups the objects in a data set so that objects within a group are highly similar while objects in different groups are less similar. The goal is to discover the inherent structure of the data in order to understand and interpret it better. Common clustering algorithms include K-means, hierarchical clustering, and DBSCAN.
K-means: a simple, widely used clustering algorithm that partitions n data points into k clusters so that the distance between each point and the center (mean) of its cluster is minimized. K-means optimizes cluster formation by iteratively adjusting the cluster centers and reassigning data points to the nearest center. It is applicable to a variety of clustering problems, such as market segmentation, social-network analysis, and image segmentation.
BERT (Bidirectional Encoder Representations from Transformers): a pre-trained language-representation model based on the Transformer architecture that learns rich language features and knowledge by pre-training on large amounts of text data. BERT understands the contextual meaning of words and has achieved significant performance improvements in many natural language processing tasks, such as text classification, named entity recognition, and question answering.
Pre-training: the process of training a model on a large-scale data set before a specific task, so that the model learns general language knowledge or feature representations that can be reused in subsequent downstream tasks. Pre-trained models are usually trained in an unsupervised manner, for example with masked language modeling (MLM) or next-sentence prediction (NSP) objectives.
MLM (Masked Language Model): a pre-training task that trains a language model to understand the context of words. In MLM, some words in the input text are randomly masked (for example, replaced with a special [MASK] token), and the model must predict them. In this way the model learns the meaning and usage of words in context, enhancing its language understanding. BERT is a typical example of a model pre-trained with MLM.
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
FIG. 1 shows an exemplary flow chart of a method 100 for constructing a keyword library according to an embodiment of the present invention.
As shown in FIG. 1, at step S101, text data entered by a user in the dialogue system and/or historical log data of the user's operations are obtained. In one embodiment, the dialogue system lets the user enter text data such as keywords, phrases, or sentences, through which the user interacts with the system, for example for information retrieval. The user's operations in the dialogue system are also saved by the system as historical records in log data, so that on subsequent use the system can respond quickly based on the user's habits or past dialogue results. By analyzing this text data and historical log data, the current users' points of interest can be grasped quickly, allowing timely improvement of the system's knowledge base and retrieval algorithms, which in turn improves the users' efficiency, accuracy, and overall experience.
To better construct the keyword library, the obtained text data or historical log data can be mined for keywords in a targeted way. For example, the text data can be mined for frequently used keywords, while the historical log data can be mined for extended keywords.
After data acquisition, in step S102, keywords are mined from the user-entered text data against a preset usage frequency to obtain high-frequency keywords. In one embodiment, a keyword usage-frequency analysis is first performed on the user-entered text data. Then, based on the results, keywords whose usage frequency within a predetermined time window exceeds the preset usage frequency are screened out as high-frequency keywords. For example, the preset usage frequency may be 3 and the effective time window 2 to 3 months; keyword frequency analysis of the text data is then performed over that window.
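The windowed frequency filter described in this step can be sketched as follows (the records, the 90-day window, and the threshold of 3 are illustrative placeholders consistent with the example above):

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical user-input records: (timestamp, keyword text)
records = [
    (datetime(2024, 5, 1), "vaccine schedule"),
    (datetime(2024, 5, 3), "vaccine schedule"),
    (datetime(2024, 6, 2), "vaccine schedule"),
    (datetime(2024, 6, 2), "vaccine schedule"),
    (datetime(2024, 6, 5), "feed ratio"),
    (datetime(2023, 1, 1), "vaccine schedule"),  # falls outside the window below
]

def high_frequency_keywords(records, now, window_days=90, min_count=3):
    """Keep keywords used more than `min_count` times inside the time window."""
    start = now - timedelta(days=window_days)
    counts = Counter(kw for ts, kw in records if start <= ts <= now)
    return {kw for kw, c in counts.items() if c > min_count}

print(high_frequency_keywords(records, now=datetime(2024, 6, 30)))  # {'vaccine schedule'}
```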
Further, the data in the dialogue system's business vocabulary or knowledge base can be used to process the obtained high-frequency keywords and label them accordingly. For example, if a high-frequency keyword already exists in the business vocabulary or knowledge base, it is labeled as an existing keyword; otherwise it is labeled as a new keyword. In practice, labeling can be implemented by many technical means, for example through keyword attributes or preset identifiers; this application places no particular restriction on this aspect.
Further, in one embodiment, the present invention provides a composite scoring mechanism that scores a keyword based on its distribution frequency in the knowledge base and the category it belongs to, and keeps the highest-scoring results. The purpose of scoring is to find keywords valuable to the knowledge base, jointly considering the keyword's frequency and its scarcity across knowledge-base documents.
The scoring formula can be: composite score = tf × S; (1)
In formula (1), tf denotes the word-frequency score, where the frequency is computed from users' inputs, and S denotes the importance of the category the user's question belongs to.
Regarding the computation of the word-frequency score in formula (1), in practice the input question text falls into three cases. First, the user's input question is in the dialogue system's existing business vocabulary. Second, the input question is not in the business vocabulary but has appeared in knowledge-base titles and content. Third, the complete input question appears in neither, but after word segmentation each individual word has appeared in the business vocabulary.
对于第一和第二种情况,可以直接计算当前输入问题在业务词库或知识库所有问题中出现的频率。对于第三种情况,把当前输入问题进行分词后,再分别计算每个词在所有问题中的频率,取最小值。For the first and second cases, the frequency of the current input question in all questions in the business vocabulary or knowledge base can be directly calculated. For the third case, the current input question is segmented, and then the frequency of each word in all questions is calculated and the minimum value is taken.
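The three cases above can be sketched as a single scoring helper. This is an illustrative reconstruction, not the application's actual code; the function and tokenizer names are assumptions:

```python
def tf_score(query, vocab, kb_questions, tokenize):
    """Word-frequency score tf for formula (1); an illustrative sketch."""
    total = len(kb_questions)
    if query in vocab or any(query in q for q in kb_questions):
        # Cases 1 and 2: score the whole query directly against all questions.
        return sum(1 for q in kb_questions if query in q) / total
    # Case 3: segment the query and take the minimum per-token frequency.
    freqs = [sum(1 for q in kb_questions if t in q) / total
             for t in tokenize(query)]
    return min(freqs, default=0.0)
```

A whitespace tokenizer stands in here for a real segmenter such as Jieba.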
The category importance in formula (1) is computed as:

S = Σ( P(c|Q) × (N_c / T) ); (2)

In formula (2), S denotes the category importance; P(c|Q) denotes the probability that user question Q belongs to category c; N_c denotes the number of knowledge entries under category c; and T denotes the total number of knowledge entries.
In one embodiment, the category importance score is computed in detail as follows. First, BERT is fine-tuned for classification, training on the knowledge base and category data, which yields a BERT text classification model. The model then produces the probability of each category for the current input, for example 0.8 for veterinary medicine, 0.1 for equipment, and 0.1 for informatization. Each category's probability is then multiplied by that category's share of the total knowledge entries, and the products are summed.
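Putting formula (2) together with the example above, the computation can be sketched as follows. The class probabilities are hard-coded where the fine-tuned BERT classifier would supply them, and all names and counts are illustrative:

```python
def category_importance(class_probs, category_counts):
    """S = sum over categories c of P(c|Q) * (N_c / T), per formula (2).
    class_probs: P(c|Q), assumed here to come from the fine-tuned classifier.
    category_counts: number of knowledge entries N_c per category."""
    total = sum(category_counts.values())  # T, the total knowledge count
    return sum(p * category_counts.get(c, 0) / total
               for c, p in class_probs.items())

# Worked example matching the text: veterinary 0.8, equipment 0.1, informatization 0.1
probs = {"veterinary": 0.8, "equipment": 0.1, "informatization": 0.1}
counts = {"veterinary": 600, "equipment": 300, "informatization": 100}  # T = 1000
s = category_importance(probs, counts)  # 0.8*0.6 + 0.1*0.3 + 0.1*0.1 = 0.52
score = 0.5 * s                         # formula (1), with an assumed tf of 0.5
```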
Further, step S103 provides a technical solution for analyzing the historical log data: keywords are mined from the historical log data through a predetermined strategy to obtain extended keywords. The historical log data of user operations includes one or more of the following: the user's search data in the dialogue system; interaction data generated between the user and the question-answering assistant in the dialogue system; and content data generated when the user clicks content titles in the dialogue system.

In one embodiment, the predetermined strategy includes applying an edit-distance algorithm and/or a longest-common-subsequence algorithm to the historical log data to obtain the extended keywords. One strategy is to perform common-word analysis on clicked content titles and search terms, then run the edit-distance and/or longest-common-subsequence algorithm on the resulting common words to extract keywords from them as extended keywords. Another strategy is to use the search data in the historical logs: sort the search keywords by frequency of use, select a preset number of them in order (for example, from highest to lowest), and compute the longest contiguous common subsequence of the selected search keywords to obtain the extended keywords.
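The second strategy can be sketched as below; the helper names are illustrative, and the longest contiguous common subsequence is implemented as the classic longest-common-substring dynamic program:

```python
from collections import Counter

def longest_common_substring(a, b):
    """Longest contiguous common subsequence (common substring) of two strings."""
    best = ""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
                if dp[i][j] > len(best):
                    best = a[i - dp[i][j]:i]
    return best

def expand_from_search_log(search_terms, top_k=2):
    """Rank logged search terms by use frequency, take the top_k, and return
    their longest contiguous common subsequence as an extended-keyword candidate."""
    ranked = [t for t, _ in Counter(search_terms).most_common(top_k)]
    candidate = ranked[0]
    for term in ranked[1:]:
        candidate = longest_common_substring(candidate, term)
    return candidate
```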
Similar to the keyword-scoring process described for step S102 above, the extended keywords in step S103 can also be scored using the algorithms of formulas (1) and (2), based on their distribution frequency and category in the knowledge base, retaining the highest-scoring results.
Having obtained the high-frequency keywords and extended keywords in steps S102 and S103 respectively, the process proceeds to step S104. This step obtains the knowledge base text data in the dialogue system; trains a model on the knowledge base text data to obtain a text analysis model; and uses the text analysis model to analyze the high-frequency keywords and/or extended keywords to obtain candidate keywords. In one embodiment, the specific analysis may include: obtaining a set of noun keywords; generating keyword pairs from the noun keyword set; and screening the keyword pairs through a part-of-speech-joint character-similarity rule and a vector-similarity rule.
In one embodiment, analyzing the high-frequency keywords to obtain candidate keywords includes: searching the dialogue system's knowledge base for the obtained high-frequency keywords; in response to a high-frequency keyword being found in the knowledge base, marking it as a known keyword; in response to a high-frequency keyword not being found, marking it as a new keyword; and scoring each new keyword using the distribution frequency and/or category of similar keywords in the knowledge base, for use in the analysis of the high-frequency keywords.
The specific operation of step S104 is described below through a complete embodiment.

First, an unsupervised training model is built from the titles and content of the dialogue system's knowledge base. Specifically, a Transformer-based bidirectional encoding model (BERT) can be used for training.
In one embodiment, the Transformer-based bidirectional encoding model undergoes incremental unsupervised training as a masked language model. In this method, some words in the model's input text are randomly replaced with a special mask token (e.g., [MASK]), and the model's task is to predict the masked words, thereby learning deep semantic and syntactic information in the text. This pre-training task helps the model capture inter-word dependencies and contextual information, providing strong semantic representations for downstream tasks.
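The masking step of that objective can be sketched in a few lines. This is a simplified illustration: real BERT pre-training also keeps some selected tokens unchanged or swaps them for random tokens, which is omitted here:

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=None):
    """Corrupt a token sequence for masked-language-model training: each
    selected position is replaced by [MASK] and its original token becomes
    the prediction target (label) for the model."""
    rng = random.Random(seed)
    masked, labels = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            labels[i] = tok          # the model must recover this token
            masked.append(mask_token)
        else:
            masked.append(tok)
    return masked, labels
```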
The Transformer-based bidirectional encoding model is a pre-trained model with a good understanding of open-domain text, but it learns business-domain text insufficiently. Augmented training on text built from the knowledge base compensates well for this shortcoming of the pre-trained model.
Second, keyword relationship analysis is performed through statistics and analysis.

In one embodiment, natural language processing techniques, including part-of-speech analysis (e.g., the Jieba word segmentation tool), are first used to process the text, so as to count and identify nesting relationships between keywords and the parts of speech of elements within phrases.

Specifically, all noun keywords are first obtained. They are then sorted by length, and the Cartesian product of the keyword set is taken, generating n*(n-1)/2 keyword pairs. The keyword pairs are then further screened through the part-of-speech-joint character-similarity rule and the vector-similarity rule.
In one embodiment scenario, keyword phrases are screened by part of speech, character similarity, and pointwise mutual information, and must satisfy the following rules:

1. After segmentation, a keyword phrase must not contain any part of speech other than verbs or nouns.
2. Character similarity uses Jaccard similarity; phrase pairs with similarity greater than 0.3 are retained.
3. Phrase pairs with extremely high pointwise mutual information are retained.
Through the Transformer-based bidirectional encoder model from the steps above, phrase text can be converted into vectors of a predetermined dimension, such as 768. Cosine similarity is then used to measure how similar the vectors of two phrases are. Finally, phrase pairs with a similarity greater than 0.85 are retained as candidate phrase pairs.
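The pair generation and the two similarity gates can be sketched together as below. The `embed` callable stands in for the fine-tuned BERT encoder, the toy test vectors are assumptions, and the pointwise-mutual-information rule is omitted for brevity:

```python
from itertools import combinations

def jaccard(a, b):
    """Character-level Jaccard similarity between two phrases."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = (sum(x * x for x in u) ** 0.5) * (sum(y * y for y in v) ** 0.5)
    return dot / norm

def candidate_pairs(keywords, embed, jac_thresh=0.3, cos_thresh=0.85):
    """Sort by length, form the n*(n-1)/2 keyword pairs, and keep pairs that
    pass both the character-similarity and the vector-similarity gates."""
    ordered = sorted(keywords, key=len)
    return [(a, b) for a, b in combinations(ordered, 2)
            if jaccard(a, b) > jac_thresh
            and cosine(embed(a), embed(b)) > cos_thresh]
```

A toy character-count embedding is enough to exercise the gates:

```python
embed = lambda s: [s.count(c) for c in "acdgost"]  # stand-in for BERT vectors
pairs = candidate_pairs(["cat", "cats", "dog"], embed)  # [("cat", "cats")]
```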
For the candidate phrase pairs, the shorter phrase in each pair serves as the core word, and the pairs are grouped by their shorter phrase; each group can retain a set number of phrase pairs, such as 20.
Third, sub-topics within keywords and their phrases are identified, performing topic and sub-topic recognition.

In one embodiment, for each phrase corresponding to a keyword, the keyword (topic) and the other components of the phrase (sub-topics) are identified; for example, a disease keyword combines with a sub-topic to form a sub-topic phrase. A concrete implementation can use the unsupervised training model described above, combined with the K-means clustering algorithm, to identify and extract topics and sub-topics.
In this process, the text is first vectorized with the unsupervised training model, producing high-dimensional vectors (such as the 768-dimensional vectors mentioned above). The K-means clustering algorithm then clusters these vectors to identify topic entities and related sub-topic information in the text. After clustering, the center of each cluster is obtained and a predetermined number of keywords closest to the center are recalled, for example 50 candidate keywords. Next, an N-gram analysis is performed on each keyword, the results are merged, and the N-gram core words with relatively high occurrence probability (e.g., above 0.2) are collected; these core words can be regarded as potential topic or sub-topic keywords.
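The recall-and-N-gram stage after clustering can be sketched as below. The cluster center is assumed to come from a K-means run over the encoder vectors, and the helper names, dimensions, and thresholds are illustrative:

```python
from collections import Counter

def recall_nearest(keyword_vectors, center, k):
    """Recall the k keywords whose vectors lie closest to a cluster center."""
    dist = lambda v: sum((a - b) ** 2 for a, b in zip(v, center)) ** 0.5
    ranked = sorted(keyword_vectors.items(), key=lambda kv: dist(kv[1]))
    return [kw for kw, _ in ranked[:k]]

def core_ngrams(keywords, n=2, min_prob=0.2):
    """N-gram analysis over recalled keywords: keep the n-grams whose relative
    frequency exceeds min_prob as candidate topic / sub-topic cores."""
    grams = Counter(kw[i:i + n] for kw in keywords
                    for i in range(len(kw) - n + 1))
    total = sum(grams.values())
    return {g for g, c in grams.items() if c / total > min_prob}
```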
Finally, keywords are expanded based on the topic and sub-topic components.

In one embodiment, the high-quality word set A found in the steps above is collected and reviewed, valid keywords are retained, and they are saved to a user dictionary. The user dictionary is then reloaded and part-of-speech analysis is performed on the text. Through part-of-speech tagging, the keyword phrases are split into entities and attributes. For example, the keyword phrase "blue ear disease treatment" can be split into "blue ear disease" (entity) + "treatment" (attribute). Next, after review, a series of new keyword phrases can be expanded by randomly replacing the entity or attribute of a phrase with words of the same class. Finally, a large text model evaluates the semantic validity of these newly generated keyword phrases, retaining those that appear in internal or external knowledge bases or are semantically reasonable, thereby yielding a new batch of high-quality keyword candidates.
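The entity/attribute substitution step can be sketched as below. The word pools are assumptions, and a simple membership check stands in for the large-model semantic screening:

```python
from itertools import product

def expand_phrases(pairs, entity_pool, attribute_pool):
    """Expand new keyword phrases by swapping the entity or attribute of each
    reviewed (entity, attribute) pair with same-class words from the pools."""
    out = set()
    for entity, attribute in pairs:
        for e, a in product(entity_pool | {entity}, attribute_pool | {attribute}):
            out.add(f"{e} {a}")
    return out

def screen(phrases, is_reasonable):
    """Keep only the phrases the (stand-in) semantic validity check accepts."""
    return {p for p in phrases if is_reasonable(p)}
```

For example, expanding ("blue ear disease", "treatment") with an extra entity and an extra attribute yields four candidate phrases, which are then screened.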
Returning to the process shown in FIG. 1, at step S105 the candidate keywords obtained at step S104 are reviewed so that the keywords passing review can be used to construct the keyword library. This step further includes: retaining the candidate keywords that pass review as valid keywords; performing part-of-speech analysis on the valid keywords to split the keyword phrases they contain into entities and attributes; generating new keyword phrases by randomly replacing the entities or attributes of those phrases with words of the same class; evaluating the new keyword phrases with the text analysis model and retaining the reasonable ones; and constructing the keyword library from the valid keywords and the reasonable keyword phrases.
In summary, based on the teachings of the embodiments provided herein, those skilled in the art can understand that the present invention mines keywords from the text that users input into the dialogue system. This includes mining high-frequency keywords from frequently entered historical user text in combination with the dialogue system's knowledge base, and applying an expanded keyword-extraction strategy to the users' historical log data. By distinguishing mining strategies for high-frequency and low-frequency keywords, effective analysis of user input and accurate extraction of keywords are achieved. High-frequency keyword mining focuses on quickly locking onto key information; its main purpose is to rapidly accumulate high-quality keywords by efficiently identifying frequently used terms in user interactions, thereby improving the operational efficiency of the question-answering system and the knowledge base. Low-frequency keyword mining focuses on identifying and expanding long-tail keywords; combining the two builds a more comprehensive keyword library.
In one embodiment, mining keywords from infrequent user inputs enables heuristic mining and expansion based on keyword phrases already present in the knowledge base. Its main purpose is to mine low-frequency long-tail keywords. Specifically, by fully considering character semantics and part-of-speech information, combined with the accurate understanding capability of a large model, more keywords can be identified.

Further, the technical solution provided by the present invention can also perform keyword relationship analysis and topic recognition. Through natural language processing techniques and clustering algorithms, the invention can effectively analyze nesting relationships between keywords and the parts of speech within phrases, and identify topics and sub-topics, which helps further optimize the structure and content of the keyword library.

Still further, in one embodiment, the present invention proposes a comprehensive scoring mechanism. This mechanism combines the TF-IDF algorithm and a classification probability model, together with rules such as character similarity and pointwise mutual information, to evaluate and screen keywords, ensuring the quality of the keyword library.
FIG. 2 shows an exemplary flow of keyword analysis 200 according to an embodiment of the present invention. Keyword analysis performs heuristic discovery and expansion based on existing keyword phrases. Its main purpose is to mine low-frequency long-tail keywords; by fully considering character semantics and part-of-speech information, combined with the accurate understanding capability of a large model, more keywords can be identified.

As shown in FIG. 2, the flow gives the concrete implementation steps for performing keyword analysis on the high-frequency keywords and/or extended keywords.
Specifically, at step S201, the knowledge base text data in the dialogue system is obtained. In an existing dialogue system, the knowledge base text includes the business vocabulary and/or the question-and-answer text data in the knowledge base. By reading out the knowledge base text, the keywords already present in the dialogue system can be learned.

After the knowledge base text data is obtained, the flow proceeds to step S202, where a model is trained on the obtained knowledge base text data so that a text analysis model can be produced. The text analysis model is trained as follows:
First, an unsupervised training model is built from the knowledge base titles and content in the knowledge base text data. Specifically, a Transformer-based bidirectional encoding model (BERT) can be used for pre-training. This pre-training task helps the model capture inter-word dependencies and contextual information, providing strong semantic representations for downstream tasks.

In one embodiment, the Transformer-based bidirectional encoding model undergoes incremental unsupervised training as a masked language model. In this method, some words in the model's input text are randomly replaced with a special mask token (e.g., [MASK]), and the model's task is to predict the masked words, thereby learning deep semantic and syntactic information in the text.

The Transformer-based bidirectional encoding model is a pre-trained model with a good understanding of open-domain text, but it learns business-domain text insufficiently. For example, technical terms or phrases in the livestock-farming domain have specific meanings, and augmented training on text built from the knowledge base compensates well for this shortcoming of the pre-trained model.
Further, the flow proceeds to step S203, where the text analysis model analyzes the high-frequency keywords and/or the extended keywords to obtain candidate keywords. In this step, keyword relationship analysis is performed mainly through statistical and analytical methods.

In one embodiment, natural language processing techniques, including part-of-speech analysis (e.g., the Jieba word segmentation tool), can be used to process the text, so as to count and identify nesting relationships between keywords and the parts of speech of elements within phrases.

Specifically, all noun keywords are first obtained; they are then sorted by length, and the Cartesian product of the keyword set is taken, generating n*(n-1)/2 keyword pairs; the pairs are then further screened through the part-of-speech-joint character-similarity rule and the vector-similarity rule.
In one embodiment scenario, keyword phrases are screened by part of speech, character similarity, and pointwise mutual information, and must satisfy the following rules:

1. After segmentation, a keyword phrase must not contain any part of speech other than verbs or nouns.
2. Character similarity uses Jaccard similarity; phrase pairs with similarity greater than 0.3 are retained.
3. Phrase pairs with extremely high pointwise mutual information are retained.
Through the Transformer-based bidirectional encoder model from step S202, phrase text can be converted into vectors of a predetermined dimension, such as 768. Cosine similarity then measures how similar the vectors of two phrases are, and phrase pairs with a similarity greater than 0.85 are retained as candidate phrase pairs.

For the candidate phrase pairs, the shorter phrase in each pair serves as the core word, and the pairs are grouped by their shorter phrase; each group can retain 20 phrase pairs.
Next, sub-topics within keywords and their phrases are identified, performing topic and sub-topic recognition.

In one embodiment, for each phrase corresponding to a keyword, the keyword (topic) and the other components of the phrase (sub-topics) are identified; for example, a disease keyword combines with a sub-topic to form a sub-topic phrase. The unsupervised training model described above is combined with the K-means clustering algorithm to identify and extract topics and sub-topics.

In an embodiment of the text analysis process, the text is first vectorized with the unsupervised training model, outputting, for example, 768-dimensional vectors. The K-means clustering algorithm then clusters these vectors to identify topic entities and related sub-topic information in the text. After clustering, the center of each cluster is obtained and a predetermined number of keywords closest to the center are recalled, for example 50 candidate keywords. Next, an N-gram analysis is performed on each keyword, the results are merged, and the N-gram core words with relatively high occurrence probability (e.g., above 0.2) are collected; these core words can be regarded as potential topic or sub-topic keywords.
Finally, keywords are expanded based on the topic and sub-topic components.

In one embodiment, the high-quality word set A found in the steps above is first collected and reviewed. Review can be performed manually or automated through a review mechanism. For example, if the expected keyword is "blue ear disease treatment" but the word under review is "after blue ear disease treatment", that word does not meet the retention criteria and should be screened out. The valid keywords among the words passing review are retained and saved to the user dictionary. The user dictionary is then reloaded and part-of-speech analysis is performed on the text; through part-of-speech tagging, the keyword phrases are split into entities and attributes, e.g., "blue ear disease treatment" can be split into "blue ear disease" (entity) + "treatment" (attribute). Next, after review, a series of new keyword phrases is expanded by randomly replacing the entity or attribute of a phrase with words of the same class. Finally, a large text model re-evaluates the semantic validity of these newly generated keyword phrases, retaining those that appear in internal or external knowledge bases or are semantically reasonable, thereby yielding a new batch of high-quality keyword candidates.
FIG. 3 schematically shows an exemplary structural diagram of an electronic device 300 according to an embodiment of the present invention. As shown in FIG. 3, the electronic device 300 of the present invention may include: a processor 301; and a memory 302 storing a computer program for constructing a keyword library which, when executed by the processor 301, implements the method described in the first aspect above or any of its embodiments.

Specifically, in the electronic device 300, the processor 301 may adopt various types of chips depending on the application scenario, such as a central processing unit (CPU), another general-purpose microprocessor, or any conventional processor. These processors can be selected and configured as needed; for example, if high-speed computation and complex image processing are required, a high-performance GPU or other specialized processor can be chosen.
Based on the above, the present invention further discloses a computer-readable storage medium containing a computer-readable program for constructing a keyword library which, when executed by one or more processors, implements the method described in the embodiments or implementations above.

In some implementation scenarios, the computer-readable storage medium may be any suitable magnetic or magneto-optical storage medium. Any such computer storage medium may be part of a device or apparatus, or may be accessible by or connectable to a device or apparatus. Any application or module described herein may be implemented using computer-readable/executable instructions stored or otherwise maintained by such a computer-readable medium.
Although multiple embodiments of the present invention have been shown and described herein, it will be apparent to those skilled in the art that such embodiments are provided by way of example only. Those skilled in the art may conceive of many modifications, changes, and substitutions without departing from the spirit of the present invention. It should be understood that various alternatives to the embodiments described herein may be adopted in practicing the invention. The appended claims are intended to define the scope of protection of the present invention and therefore cover equivalents and alternatives within the scope of those claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410869190.4A CN118838993A (en) | 2024-07-01 | 2024-07-01 | Method for constructing keyword library and related products thereof |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118838993A true CN118838993A (en) | 2024-10-25 |
Family
ID=93138257
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410869190.4A Pending CN118838993A (en) | 2024-07-01 | 2024-07-01 | Method for constructing keyword library and related products thereof |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118838993A (en) |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN119941194A * | 2025-04-09 | 2025-05-06 | 盛业信息科技服务(深圳)有限公司 | Key information extraction method based on business audit |
| CN119941194B * | 2025-04-09 | 2025-08-05 | 盛业信息科技服务(深圳)有限公司 | Key information extraction method based on business audit |
| CN120508659A * | 2025-07-18 | 2025-08-19 | 北京火山引擎科技有限公司 | Label generation method, device, equipment and product based on large model and configuration |

2024-07-01: CN application CN202410869190.4A filed; patent CN118838993A (en), status Pending.
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||