CN116991969B - Configurable grammatical relationship retrieval method, system, electronic device and storage medium - Google Patents
- Publication number
- CN116991969B CN202310590928.9A
- Authority
- CN
- China
- Prior art keywords
- list
- grammar
- word
- relationship
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Description
Technical Field
The present invention relates to the field of natural language processing, and in particular to a retrieval method, system, electronic device and storage medium with configurable grammatical relations.
Background Art
Grammatical relations such as part of speech and syntactic structure describe the properties of individual words in a sentence and the complex relationships between words. With the emergence of large-scale datasets annotated with sentence-level grammatical relations, the performance of grammatical relationship analysis (such as part-of-speech tagging and syntactic parsing) has improved greatly, and it is now widely used in many fields, including information retrieval. One reason is that grammatical relationship analysis, as a basic component of natural language processing, is relatively mature, and much open-source software can be used out of the box. Another is that, compared with model training, grammatical relationship analysis costs far less in both time and money. In retrieval systems, traditional string matching cannot match semantics, whereas grammatical relationship analysis can use its description of the relationships inside a sentence to recognize multiple expressions of the same meaning.
Many studies have tried to use grammatical relations to solve these problems. The Chinese invention patent with application publication number CN106716408A discloses a semantic text search method that performs syntactic analysis and part-of-speech tagging on the text to be searched and builds search trees for different semantic categories according to the different syntactic structures, so that a search term matches only the relevant semantic category and noise is excluded. The Chinese invention patent with application publication number CN113407739A discloses a method, device and storage medium for determining concepts in information titles, in which dependency parsing is performed on the text to be searched and the conceptual trunk of the title is extracted (for example the noun phrase "起泡胶做法", "how to make bubble glue"), which simplifies and focuses the semantics, removes irrelevant information and improves the search results. The Chinese invention patent with application publication number CN105786963A discloses a corpus retrieval method and system that provides a new type of search expression supporting mixed retrieval over regular expressions and syntactic structures. In particular, the search expression can flexibly describe a fuzzy search by specifying only grammatical relations, without giving specific words. By performing syntactic analysis and part-of-speech tagging on the text to be searched, results that satisfy the search expression can then be matched.
All of the above solutions improve retrieval by embedding grammatical relations with lightweight changes. However, the grammatical relations are processed only for the text to be searched. For the search term itself, only the fuzzy part is supported, and the grammatical relations of the search term itself cannot be configured. As a result, more flexible retrieval requirements cannot be met, and retrieval of complex sentences cannot be handled.
Summary of the Invention
In view of the above defects, the embodiments of the present invention disclose a retrieval method, system, electronic device and storage medium with configurable grammatical relations. Retrieval is performed by configuring the grammatical relations of the search term itself and is implemented with a graph algorithm to reduce computational complexity. The approach is simpler and more flexible, does not require enumerating the grammatical relations of the fuzzy-match part, and suits sentences with more complex grammatical relations.
A first aspect of the embodiments of the present invention discloses a retrieval method with configurable grammatical relations, comprising:
obtaining a target to be searched, the target to be searched including a search text, and splitting the search text into sentences; each sentence obtained by the split forms a search term, and all the sentences together form a search term list;
performing regular-expression matching between the search term list and a database, and outputting all terms that satisfy the regular-expression matching rules to form a full term list;
identifying whether a grammatical relationship configuration table exists in the target to be searched; if so, extracting the grammatical relationship configuration table, and if not, constructing a grammatical relationship configuration table;
performing grammatical relationship matching between the grammatical relationship configuration table and the full term list, and outputting the terms that satisfy the grammatical relationship matching rules to obtain a final matching list.
As an optional implementation, in the first aspect of the embodiments of the present invention, each search term comprises several word segments; the grammatical relationship configuration table contains the grammatical relations between the word segments, and the grammatical relations include the dependency relations between the word segments and the parts of speech of the word segments themselves.
As an optional implementation, in the first aspect of the embodiments of the present invention, the storage format of a grammatical relation is: termSRC|posSRC,termDST|posDST,dep;
where termSRC is the dominant word, posSRC is the part of speech of the dominant word, termDST is the subordinate word, posDST is the part of speech of the subordinate word, and dep is the dependency relation between the dominant word and the subordinate word;
the dominant word is a specific word, a fuzzy word or an unknown word; when the dominant word is a fuzzy word, termSRC=*; when the dominant word is an unknown word, termSRC=[n] (n = 0, 1, 2, ...);
the subordinate word is a specific word, a fuzzy word or an unknown word; when the subordinate word is a fuzzy word, termDST=*; when the subordinate word is an unknown word, termDST=[n] (n = 0, 1, 2, ...);
the part of speech is a specific part of speech or a fuzzy part of speech; when the part of speech is a fuzzy part of speech, posSRC=*;
the dependency relation is a specific dependency relation or a fuzzy dependency relation; when the dependency relation is a fuzzy dependency relation, dep=*.
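The storage format above can be illustrated with a small parser and matcher. This is a minimal sketch, not the patented implementation: the names `Relation`, `parse_relation`, `field_matches` and `relation_matches` are hypothetical, the sketch assumes that no word contains "," or "|", it applies the `*` wildcard to posDST as well as posSRC, and it treats an unknown-word placeholder `[n]` as accepting any word without tracking which word it was bound to.

```python
import re
from dataclasses import dataclass

WILDCARD = "*"                      # fuzzy word, fuzzy part of speech, or fuzzy dependency
UNKNOWN = re.compile(r"^\[\d+\]$")  # unknown-word placeholders such as [0], [1]


@dataclass(frozen=True)
class Relation:
    term_src: str  # dominant word
    pos_src: str   # part of speech of the dominant word
    term_dst: str  # subordinate word
    pos_dst: str   # part of speech of the subordinate word
    dep: str       # dependency relation between the two words


def parse_relation(entry: str) -> Relation:
    """Parse one 'termSRC|posSRC,termDST|posDST,dep' entry (assumes no ',' or '|' inside words)."""
    src, dst, dep = (part.strip() for part in entry.split(","))
    term_src, pos_src = src.split("|")
    term_dst, pos_dst = dst.split("|")
    return Relation(term_src, pos_src, term_dst, pos_dst, dep)


def field_matches(pattern: str, value: str) -> bool:
    """A '*' field or an '[n]' placeholder accepts any value; otherwise require equality."""
    return pattern == WILDCARD or bool(UNKNOWN.match(pattern)) or pattern == value


def relation_matches(config: Relation, parsed: Relation) -> bool:
    """Check every field of a configured relation against a relation found in a sentence."""
    names = ("term_src", "pos_src", "term_dst", "pos_dst", "dep")
    return all(field_matches(getattr(config, n), getattr(parsed, n)) for n in names)
```

For instance, the entry `打击|*,势力|*,dobj` would accept any relation whose dominant word is "打击", whose subordinate word is "势力" and whose dependency label is dobj, whatever the parts of speech are.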
As an optional implementation, in the first aspect of the embodiments of the present invention, in the step of identifying whether a grammatical relationship configuration table exists in the target to be searched, extracting it if so and constructing it if not, the method of constructing the grammatical relationship configuration table comprises:
randomly sampling the full term list to obtain a sample term list;
obtaining a sample grammatical relationship list based on the sample term list and a grammatical relationship analysis method;
constructing grammatical relationship options based on the sample grammatical relationship list, and compiling statistics to form a grammatical relationship option list;
reading grammatical relationship requirement option information, and generating the grammatical relationship configuration table based on the grammatical relationship requirement option information and the grammatical relationship option list.
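The construction steps above (sample, analyze, count options, pick) can be sketched as follows. This is an illustrative sketch under assumptions: `analyze` is a hypothetical callback standing in for the grammatical relationship analysis method and is assumed to return `(termSRC, posSRC, termDST, posDST, dep)` tuples for one term; `sample_size`, `seed` and `chosen_indexes` are likewise hypothetical parameters, with `chosen_indexes` standing in for the requirement option information.

```python
import random
from collections import Counter


def build_option_list(full_terms, analyze, sample_size=100, seed=0):
    """Randomly sample the full term list and count the grammatical relations found."""
    rng = random.Random(seed)
    sample = rng.sample(full_terms, min(sample_size, len(full_terms)))
    counts = Counter()
    for term in sample:
        for src, pos_src, dst, pos_dst, dep in analyze(term):
            counts[f"{src}|{pos_src},{dst}|{pos_dst},{dep}"] += 1
    # Most frequent options first, as (option, count) pairs.
    return counts.most_common()


def build_config(options, chosen_indexes):
    """Keep the options selected by the user to form the configuration table."""
    return [options[i][0] for i in chosen_indexes]
```

Sorting the options by frequency is a design choice of this sketch: it puts the relations most likely to express the user's intent at the top of the list.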
As an optional implementation, in the first aspect of the embodiments of the present invention, the step of performing grammatical relationship matching between the grammatical relationship configuration table and the full term list, outputting the terms that satisfy the grammatical relationship matching rules and obtaining the final matching list comprises:
obtaining a full grammatical relationship list based on the full term list and the grammatical relationship analysis method;
extracting from the full grammatical relationship list the terms whose dependency relation information is consistent with that in the grammatical relationship configuration table, to obtain a matching term list;
extracting from the matching term list the terms whose part-of-speech information is consistent with that in the grammatical relationship configuration table, to obtain the final matching list.
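The two filtering passes above can be sketched as below. This is a simplified illustration, not the patent's implementation: relations are represented as plain dictionaries with hypothetical keys, only the `*` wildcard is handled, and a term is kept when every configured relation is satisfied by at least one relation parsed from that term.

```python
def _same(pattern, value):
    """'*' accepts anything; otherwise the configured value must equal the parsed one."""
    return pattern == "*" or pattern == value


def dependency_ok(cfg, rel):
    """Pass 1 criterion: the two words and the dependency label agree."""
    return (_same(cfg["term_src"], rel["term_src"])
            and _same(cfg["term_dst"], rel["term_dst"])
            and _same(cfg["dep"], rel["dep"]))


def pos_ok(cfg, rel):
    """Pass 2 criterion: the parts of speech of both words agree."""
    return _same(cfg["pos_src"], rel["pos_src"]) and _same(cfg["pos_dst"], rel["pos_dst"])


def grammatical_match(config, full_relations):
    """full_relations maps each term to the list of relations parsed from it."""
    # Pass 1: keep terms whose dependency information is consistent with the configuration.
    matching = [term for term, rels in full_relations.items()
                if all(any(dependency_ok(c, r) for r in rels) for c in config)]
    # Pass 2: among those, keep terms whose part-of-speech information is also consistent.
    return [term for term in matching
            if all(any(dependency_ok(c, r) and pos_ok(c, r) for r in full_relations[term])
                   for c in config)]
```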
As an optional implementation, in the first aspect of the embodiments of the present invention, the grammatical relationship analysis method comprises:
performing word segmentation on each term in the sample term list or the full term list to obtain the word segments of each term, forming a first nested list;
performing dependency parsing on each term based on the first nested list and a trained dependency parsing model to obtain the dependency relation information between the word segments of each term, forming a second nested list;
performing part-of-speech tagging on the word segments of each term in the first nested list to obtain the part-of-speech information of each word segment of each term, forming a third nested list;
constructing the sample grammatical relationship list or the full grammatical relationship list based on the first nested list, the second nested list and the third nested list.
As an optional implementation, in the first aspect of the embodiments of the present invention, the step of constructing the sample grammatical relationship list or the full grammatical relationship list based on the first nested list, the second nested list and the third nested list comprises:
creating a bidirectional weighted graph based on the first nested list, the second nested list and the third nested list, wherein
the start node of the bidirectional weighted graph is termSRC|posSRC, where termSRC is the dominant word in the dependency relation information and posSRC is the part of speech of the dominant word;
the end node of the bidirectional weighted graph is termDST|posDST, where termDST is the subordinate word in the dependency relation information and posDST is the part of speech of the subordinate word;
the edge of the bidirectional weighted graph is dep, where dep is the dependency relation between the dominant word and the subordinate word in the dependency relation information;
converting the node information of the bidirectional weighted graph into a graph data structure, and constructing the sample grammatical relationship list or the full grammatical relationship list based on the graph data structure.
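One way to picture the graph construction is the sketch below. It rests on assumptions the text does not state: for a single term, the first nested list is taken to hold its word segments, the second to hold `(dominant_index, subordinate_index, dep)` triples, and the third to hold one part-of-speech tag per segment. The graph is stored as a dictionary of dictionaries with the dependency label as the edge value, and only directed edges from dominant to subordinate word are stored; the function names are hypothetical.

```python
from collections import defaultdict


def build_graph(tokens, deps, pos_tags):
    """Build {termSRC|posSRC: {termDST|posDST: dep}} for one term."""
    graph = defaultdict(dict)
    for head, dependent, dep in deps:
        src = f"{tokens[head]}|{pos_tags[head]}"
        dst = f"{tokens[dependent]}|{pos_tags[dependent]}"
        graph[src][dst] = dep
    return dict(graph)


def relation_list(graph):
    """Flatten the graph back into 'termSRC|posSRC,termDST|posDST,dep' strings."""
    return [f"{src},{dst},{dep}"
            for src, edges in graph.items()
            for dst, dep in edges.items()]


# The sentence "这个是我的": only the relation described in the text is encoded here;
# part-of-speech tags the text does not give are left as "_".
tokens = ["这个", "是", "我", "的"]
pos_tags = ["PN", "_", "PN", "_"]
deps = [(2, 0, "nsubj")]  # dominant word "我" -> subordinate word "这个"
graph = build_graph(tokens, deps, pos_tags)
relations = relation_list(graph)
```

With a dictionary-based graph, checking whether a configured edge exists is a constant-time lookup on average, which is one plausible reading of how a graph structure reduces matching cost.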
A second aspect of the embodiments of the present invention discloses a retrieval system with configurable grammatical relations, comprising:
an input module, used to input or read the target to be searched;
a text segmentation module, used to identify the search text in the target to be searched and split the search text to form the search term list;
a first matching module, used to perform regular-expression matching between the search term list and the database to form the full term list;
a grammar configuration identification module, used to identify whether a grammatical relationship configuration table exists in the target to be searched;
a grammar configuration construction module, used to construct the grammatical relationship configuration table;
a second matching module, used to perform grammatical relationship matching between the grammatical relationship configuration table and the full term list to form the final matching list;
a display module, used to output the final matching list and display it in sorted order.
A third aspect of the embodiments of the present invention discloses an electronic device, comprising: a memory storing executable program code; and a processor coupled to the memory; the processor calls the executable program code stored in the memory to execute the retrieval method with configurable grammatical relations disclosed in the first aspect of the embodiments of the present invention.
A fourth aspect of the embodiments of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the retrieval method with configurable grammatical relations disclosed in the first aspect of the embodiments of the present invention.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects:
The retrieval method with configurable grammatical relations in the embodiments of the present invention matches by configuring the grammatical relations of the search term itself. It is simpler and more flexible, does not require enumerating the grammatical relations of the fuzzy-match part, and suits sentences with more complex grammatical relations. Introducing graph-based processing greatly improves the computational efficiency of the whole retrieval and matching process. This retrieval method is practical in many applications, and the database can be implemented as a large database, a corpus, a target file library and so on, depending on the application scenario. For example, in keyword statistics for public opinion information, the retrieval method of this embodiment allows keywords to be retrieved through a configuration template that is easier to maintain, improving public opinion monitoring. As another example, in linguistics and literary research it is very important to support one's views with example sentences from existing corpora or literature. With the retrieval method of this embodiment, example sentences with specific grammatical relations can be retrieved from a corpus more easily and accurately, which speeds up research.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flow chart of the retrieval method with configurable grammatical relations disclosed in an embodiment of the present invention;
FIG. 2 is the dependency syntax tree of the sentence "这个是我的" ("This is mine") in an embodiment of the present invention;
FIG. 3 is the dependency syntax tree of sentence 1) in an embodiment of the present invention;
FIG. 4 is the dependency syntax tree of sentence 2) in an embodiment of the present invention;
FIG. 5 is the dependency syntax tree of sentence 3) in an embodiment of the present invention;
FIG. 6 is the dependency syntax tree of sentence 4) in an embodiment of the present invention;
FIG. 7 is a detailed schematic flow chart of step S3 in an embodiment of the present invention;
FIG. 8 is a detailed schematic flow chart of step S32 in an embodiment of the present invention;
FIG. 9 is a detailed schematic flow chart of step S324 in an embodiment of the present invention;
FIG. 10 is a detailed schematic flow chart of step S4 in an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of a retrieval system with configurable grammatical relations provided in an embodiment of the present invention;
FIG. 12 is a schematic structural diagram of an electronic device provided in an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", "third", "fourth" and the like in the specification and claims of the present invention are used to distinguish different objects, not to describe a specific order. The terms "including" and "having" and any variations of them in the embodiments of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
In current retrieval methods, grammatical relations are processed only for the text to be searched. For the search term itself, only the fuzzy part is supported, and the grammatical relations of the search term itself cannot be configured for retrieval, so the need to retrieve more flexible or more complex sentences cannot be met. Based on this, the embodiments of the present invention disclose a retrieval method, system, electronic device and storage medium with configurable grammatical relations. Retrieval is performed by configuring the grammatical relations of the search term itself and is implemented with a graph algorithm to reduce computational complexity. The approach is simpler and more flexible, does not require enumerating the grammatical relations of the fuzzy-match part, and suits sentences with more complex grammatical relations.
Embodiment 1
Please refer to FIGS. 1-10. FIG. 1 is a schematic flow chart of the retrieval method with configurable grammatical relations disclosed in an embodiment of the present invention. The method is applicable to smart devices with processing capability, such as mobile phones and tablet computers, and to computing devices such as computers and servers. As shown in FIG. 1, the retrieval method with configurable grammatical relations includes the following steps:
Step S1: obtain a target to be searched, the target to be searched including a search text, and split the search text into sentences; each sentence obtained by the split forms a search term, and all the sentences together form a search term list.
Step S2: perform regular-expression matching between the search term list and the database, and output all terms that satisfy the regular-expression matching rules to form a full term list R.
Step S3: identify whether a grammatical relationship configuration table exists in the target to be searched; if so, extract the grammatical relationship configuration table, and if not, construct a grammatical relationship configuration table.
Step S4: perform grammatical relationship matching between the grammatical relationship configuration table and the full term list R, and output the terms that satisfy the grammatical relationship matching rules to obtain a final matching list.
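Steps S1 to S4 can be read as a simple pipeline. The skeleton below is only a sketch of that control flow: the function `retrieve` and its keyword parameters `split`, `regular_match`, `build_config` and `grammatical_match` are hypothetical names for the four stages, passed in as callables so that the skeleton makes no claim about how each stage is implemented.

```python
def retrieve(search_text, database, config=None, *,
             split, regular_match, build_config, grammatical_match):
    """Run steps S1-S4; each stage is supplied by the caller as a function."""
    # S1: split the search text into sentences (the search term list).
    search_terms = split(search_text)
    # S2: regular-expression matching against the database gives the full term list R.
    full_terms = regular_match(search_terms, database)
    if not full_terms:
        # Nothing matched, so the later steps can be skipped.
        return []
    # S3: use the supplied configuration table, or construct one if none was given.
    if config is None:
        config = build_config(full_terms)
    # S4: grammatical relationship matching gives the final matching list.
    return grammatical_match(config, full_terms)
```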
This embodiment is explained using the search term "打击*黑恶势力" ("crack down on * evil forces", where "*" stands for any content). Assume the purpose of the search is to obtain the following matching results:
1) 打击S市涉网络黑恶势力。 (Crack down on cyber-related evil forces in S City. Here "S市涉网络" is the content matched by the arbitrary part.)
2) 加大打击黑恶势力的力度。 (Step up the crackdown on evil forces. Here "加大" and "的力度" are sentence components outside the search term.)
3) 依法严厉打击任何形式的黑恶势力。 (Severely crack down on evil forces of any form in accordance with the law. Here "任何形式的" is the content matched by the arbitrary part, and "依法严厉" is a sentence component outside the search term.)
The search term is first converted into the regular expression "打击.*?黑恶势力". The full term list R is then obtained by regular-expression matching against the terms in the database.
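The conversion from a search term containing "*" to a regular expression can be sketched as follows. This is a minimal sketch assuming Python 3.7 or later: the helper names `term_to_regex` and `regular_match` are hypothetical, the database is modeled as a plain list of strings, and the last database entry is an invented non-matching sentence added only for contrast.

```python
import re


def term_to_regex(search_term: str) -> re.Pattern:
    """Replace each '*' with a non-greedy '.*?' and escape everything else."""
    parts = [re.escape(part) for part in search_term.split("*")]
    return re.compile(".*?".join(parts))


def regular_match(search_terms, database):
    """Return the database entries matched by at least one search term (list R)."""
    patterns = [term_to_regex(term) for term in search_terms]
    return [entry for entry in database if any(p.search(entry) for p in patterns)]


database = [
    "打击S市涉网络黑恶势力。",
    "加大打击黑恶势力的力度。",
    "依法严厉打击任何形式的黑恶势力。",
    "打击贪腐问题和惩治黑恶势力。",
    "今天天气很好。",  # invented unrelated entry
]
R = regular_match(["打击*黑恶势力"], database)
```

Here `R` contains the first four entries: the regular expression alone cannot tell sentences 1) to 3) apart from the fourth sentence, which is why the grammatical relationship matching of step S4 is applied afterwards.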
In the prior art, matching sentences 1), 2) and 3) precisely would require three separate search expressions, one per sentence, to describe their grammatical relations. In this embodiment, by introducing the grammatical relationship configuration table as a search condition, the three sentences are treated as one class and the matching results are obtained directly.
Dependency syntax theory holds that words stand in a master-slave relationship of "dominant word" (head) and "subordinate word" (dependent), and the relationship between the two is called a "dependency relation". Referring to FIG. 2 and taking the sentence "这个是我的" ("This is mine") as an example, in its dependency syntax tree the dominant word "我" points to the subordinate word "这个", the dependency relation between them is nsubj (nominal subject), and the part of speech of both "我" and "这个" is PN (pronoun). For the specific labels and their meanings, see Tables 2 and 3.
Syntactic analysis of sentences 1), 2) and 3) yields the dependency syntax trees shown in FIGS. 3-5. These trees show that, with the solution of this embodiment, the search term "打击*黑恶势力" needs only one dependency condition: in a sentence that passes regular-expression matching, the dependency relation between the dominant word "打击" and the subordinate word "势力" must be dobj (direct object). In this way the three sentences, whose overall grammatical relations differ, are treated as one class without designing three separate search conditions, which saves time and makes the retrieval more robust.
With the method of this embodiment, consider the following sentence: 4) 打击贪腐问题和惩治黑恶势力。 (Crack down on corruption and punish evil forces. Here "贪腐问题和惩治" is the content matched by the arbitrary part.) The dependency syntax tree of sentence 4) is shown in FIG. 6. The sentence passes regular-expression matching, but because its grammatical relations do not satisfy the preset grammatical relationship configuration table, it is not included in the final match.
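The effect of the single dependency condition on sentences 1) to 4) can be shown with a small worked example. The dependency triples below are hand-written for illustration and are assumptions, not parser output: for sentences 1) to 3) they encode only the relation the text states (打击 governs 势力 with dobj), and for sentence 4) they encode one plausible reading in which 势力 is the object of 惩治 instead.

```python
import re

# (dominant word, subordinate word, dependency) triples, written by hand for illustration.
PARSES = {
    "打击S市涉网络黑恶势力。": [("打击", "势力", "dobj")],
    "加大打击黑恶势力的力度。": [("打击", "势力", "dobj")],
    "依法严厉打击任何形式的黑恶势力。": [("打击", "势力", "dobj")],
    "打击贪腐问题和惩治黑恶势力。": [("打击", "问题", "dobj"), ("惩治", "势力", "dobj")],
}

# The only configured condition: 打击 must take 势力 as its direct object.
CONDITION = ("打击", "势力", "dobj")

pattern = re.compile("打击.*?黑恶势力")
regex_hits = [sentence for sentence in PARSES if pattern.search(sentence)]
final_matches = [sentence for sentence in regex_hits if CONDITION in PARSES[sentence]]
```

All four sentences appear in `regex_hits`, while `final_matches` keeps only sentences 1) to 3).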
This embodiment proposes a new retrieval method: matching by configuring the grammatical relations of the search term itself. It is simpler and more flexible, does not require enumerating the grammatical relations of the fuzzy-match part, and suits sentences with more complex grammatical relations. This matching method is practical in many applications, and the database can be implemented as a large database, a corpus, a target file library and so on, depending on the application scenario. For example, in keyword statistics for public opinion information, the retrieval method of this embodiment allows keywords to be retrieved through a configuration template that is easier to maintain, improving public opinion monitoring. As another example, in linguistics and literary research it is very important to support one's views with example sentences from existing corpora or literature. With the retrieval method of this embodiment, example sentences with specific grammatical relations can be retrieved from a corpus more easily and accurately, which speeds up research.
具体地,在步骤S1中,为了保证构建的依存句法树更加合理化,避免影响依存关系的复杂性和合理性。因此本实施例在步骤S1中对检索文本进行句子分割。分割的规则是将标点符号“。”、“?”、“!”作为分割符。例如:“李白(701年-762年),字太白,号青莲居士,又号‘谪仙人’。是唐代伟大的浪漫主义诗人。被后人誉为‘诗仙’,与杜甫并称为‘李杜’”,将会被分割为三个句子,形成三个检索词条,分别为“李白(701年-762年),字太白,号青莲居士,又号‘谪仙人’”,“是唐代伟大的浪漫主义诗人”和“被后人誉为‘诗仙’,与杜甫并称为‘李杜’”,这三个检索词条之集形成检索词条列表。Specifically, in step S1, in order to ensure that the constructed dependency syntax tree is more reasonable and avoid affecting the complexity and rationality of the dependency relationship. Therefore, this embodiment performs sentence segmentation on the search text in step S1. The segmentation rule is to use punctuation marks ".", "?", and "!" as separators. For example: "Li Bai (701-762), with the name Taibai, the name Qinglian Jushi, and the name 'Zhexianren'. He is a great romantic poet in the Tang Dynasty. He was praised as the 'Poet Immortal' by later generations, and was called 'Li Du' together with Du Fu." It will be divided into three sentences to form three search terms, namely "Li Bai (701-762), with the name Taibai, the name Qinglian Jushi, and the name 'Zhexianren'", "He is a great romantic poet in the Tang Dynasty" and "He was praised as the 'Poet Immortal' by later generations, and was called 'Li Du' together with Du Fu." The collection of these three search terms forms a search term list.
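The splitting rule of step S1 can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_sentences` is hypothetical, and accepting the full-width forms of "?" and "!" alongside the ASCII ones is an added assumption.

```python
import re

# Sentence-final marks named in the text; the full-width "?" and "!" are an
# added assumption, since Chinese text commonly uses them.
_SENTENCE_END = re.compile(r"[。?!?!]")

def split_sentences(text: str) -> list:
    """Split text on sentence-final punctuation and drop empty pieces."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]

text = ("李白(701年-762年),字太白,号青莲居士,又号‘谪仙人’。"
        "是唐代伟大的浪漫主义诗人。"
        "被后人誉为‘诗仙’,与杜甫并称为‘李杜’")
entries = split_sentences(text)  # the search term list: three entries
```

Empty pieces are dropped so that a trailing separator does not produce an empty search entry.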
In step S2, the search entries in the search term list are first matched against the entries in the database using regular expressions. If the result is empty, the database contains no expected result, so the method outputs "no match" directly without entering the next step. Otherwise, the entries that satisfy the regular-expression rule are output and called full terms, and all matched full terms are collected into the full term list R.
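A minimal sketch of this regular-expression pass is given below. It assumes that "*" in a search term marks the fuzzy part and is translated to the non-greedy pattern ".*?"; the helper names `term_to_pattern` and `regex_match` and the three-entry database are hypothetical.

```python
import re

def term_to_pattern(term: str):
    """Compile a search term such as '打击*黑恶势力' into a regex.

    Assumption: '*' becomes the non-greedy '.*?'; every other character
    is matched literally.
    """
    literal_parts = [re.escape(part) for part in term.split("*")]
    return re.compile(".*?".join(literal_parts))

def regex_match(terms, database):
    """Return the full term list R: entries matched by any search term."""
    patterns = [term_to_pattern(t) for t in terms]
    return [entry for entry in database
            if any(p.search(entry) for p in patterns)]

# Illustrative database only.
database = ["打击贪腐问题和惩治黑恶势力", "严厉打击黑恶势力", "保护环境"]
R = regex_match(["打击*黑恶势力"], database)
```

An empty `R` corresponds to the early exit described above: there is nothing to pass on to the grammatical matching step.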
In step S3, the method identifies whether the target to be searched contains a grammatical relationship configuration table. This embodiment provides several ways of obtaining the table for different users. For users who have a clear matching requirement, know the dependency relations of the search term and can supply a configuration table directly, the table is extracted and the method proceeds to the next step. For users who cannot yet express their search requirement as a dependency-syntax query, a grammatical relationship configuration table has to be built for them.
In this embodiment, a search entry comprises several segmented words. The grammatical relationship configuration table describes the grammatical relationships between the segmented words of each search entry, where a grammatical relationship comprises the dependency relation between segmented words and the part of speech of each segmented word. In some implementations, the table is stored directly as a "txt" text file in which each line gives the dependency relation between two segmented words together with their parts of speech, and lines are separated by line breaks. The format of each line is: termSRC|posSRC,termDST|posDST,dep.
Here "termSRC" is the dominant word (the head) and "posSRC" is its part of speech, separated by "|"; "termDST" is the subordinate word (the dependent) and "posDST" is its part of speech, also separated by "|"; the last item, "dep", is the dependency relation between the dominant word and the subordinate word. As before, the parts of speech and dependency relations are listed in Table 2 and Table 3. The three fields (dominant word description, subordinate word description and dependency relation) are separated by ",".
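To make the line format concrete, the following sketch parses one configuration line into its five fields. The names `Rule`, `parse_rule` and `parse_config` are hypothetical, and the sketch assumes that terms contain neither "," nor "|".

```python
from typing import NamedTuple

class Rule(NamedTuple):
    term_src: str  # dominant word (head)
    pos_src: str   # part of speech of the dominant word
    term_dst: str  # subordinate word (dependent)
    pos_dst: str   # part of speech of the subordinate word
    dep: str       # dependency relation between the two

def parse_rule(line: str) -> Rule:
    """Parse 'termSRC|posSRC,termDST|posDST,dep' into a Rule."""
    src, dst, dep = (field.strip() for field in line.split(","))
    term_src, pos_src = src.split("|")
    term_dst, pos_dst = dst.split("|")
    return Rule(term_src, pos_src, term_dst, pos_dst, dep)

def parse_config(text: str) -> list:
    """Parse a whole table: one rule per non-empty line."""
    return [parse_rule(line) for line in text.splitlines() if line.strip()]

rule = parse_rule("打击|VV,势力|NN,dobj")
```

A line with a missing field raises `ValueError` during unpacking, which is one simple way to reject malformed tables early.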
Thus, for the search term "打击*黑恶势力" ("crack down on * evil forces"), to match sentences 1), 2) and 3), the dependency relation between the dominant word "打击" (part of speech "VV") and the subordinate word "势力" (part of speech "NN") can be set to "dobj" (direct object), based on the dependency and part-of-speech information. In the configuration table this is written as: 打击|VV,势力|NN,dobj.
Furthermore, a part of speech that should not be constrained can be written as "*", meaning any part of speech. In the configuration table this is written as: 打击|*,势力|NN,dobj.
This configuration states that the dependency relation between the dominant word "打击" (any part of speech) and the subordinate word "势力" (part of speech "NN") is "dobj" (direct object). In the same way, the part of speech of "势力" can also be set to "*". Both parts of speech may be "*" at the same time, in which case part of speech is ignored and only the dependency relation, the dominant word and the subordinate word have to match.
The dependency relation can also be arbitrary. From the graph point of view, this turns the bidirectional weighted graph into a bidirectional unweighted graph: the direction from dominant word to subordinate word is kept, but the relation label is ignored. It is enough that "打击" and "势力" are related and that "打击" is the dominant word. In the configuration table this is written as: 打击|*,势力|*,*.
To match more complex relationships, for example sentences like sentence 4), the dependency relation between "打击" and an unknown verb in the fuzzy-matched part (VV; in sentence 4) this is "惩治", "punish") can be specified as "conj" (conjunct). The part of speech of the unknown verb may also be "*", meaning that the existence of a "conj" relation is sufficient. In the configuration table this is written as: 打击|*,[0]|VV,conj.
This introduces a new expression: "[0]" denotes an unknown word that occurs in the fuzzy-matched part but not in the search term. A second unknown word is written "[1]", and so on; the square brackets contain a natural number. The numbers must increase consecutively in order, without gaps. Such a configuration usually combines several expressions to describe the dependency relations between an unknown word and several words of the sentence. For example, more conditions can be added to make sure that sentences like 4) are matched. In the configuration table this is expressed as follows:
打击|*,[0]|VV,conj
[0]|VV,势力|NN,dobj
打击|*,[1]|*,cc
The second line states that the dependency relation between the unknown verb and "势力" is "dobj". The third line states that the relation between "打击" and the second unknown word is "cc". Note that when the table contains several expressions, each unknown word must carry the same part of speech everywhere, otherwise an error is reported. For example, "[0]|VV" on the first line may not be combined with "[0]|NN" or "[0]|*" on the second line. The expressions within one configuration table are combined with "AND" (no explicit operator is needed). To perform an "OR" operation, configure several files: retrieval with one search term and several configuration tables is supported.
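The two constraints on unknown-word placeholders (consecutive numbering from 0 and one consistent part of speech per placeholder) can be checked before any matching is attempted. The sketch below is one possible check; the function name `check_placeholders` and the error messages are hypothetical.

```python
import re

_PLACEHOLDER = re.compile(r"\[(\d+)\]")

def check_placeholders(lines):
    """Raise ValueError if the unknown-word placeholders are inconsistent."""
    pos_by_index = {}
    for line in lines:
        src, dst, _dep = line.split(",")
        for field in (src, dst):
            term, pos = field.split("|")
            m = _PLACEHOLDER.fullmatch(term)
            if m is None:
                continue  # a known word, nothing to check
            index = int(m.group(1))
            # Remember the first tag seen and compare later ones with it.
            previous = pos_by_index.setdefault(index, pos)
            if previous != pos:
                raise ValueError(
                    f"[{index}] is tagged both {previous} and {pos}")
    # Numbers must run 0, 1, 2, ... without gaps.
    if sorted(pos_by_index) != list(range(len(pos_by_index))):
        raise ValueError("placeholder numbers must be consecutive from 0")

check_placeholders([
    "打击|*,[0]|VV,conj",
    "[0]|VV,势力|NN,dobj",
    "打击|*,[1]|*,cc",
])  # passes silently
```

Treating "[0]|VV" and "[0]|*" as a conflict follows the rule stated above, even though "*" would be compatible with "VV" under a looser reading.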
The method for constructing the grammatical relationship configuration table in step S3 includes:
Step S31: randomly sample the full term list R to obtain a sample term list Ra.
This embodiment samples the full term list R at random at a ratio of 1% to obtain the sample term list Ra. The ratio can be adjusted to the size of the database; the principle is to keep the amount of sampled data under control so that the subsequent screening step is not overloaded.
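The sampling of step S31 can be sketched with the standard library. The function name `sample_terms`, the optional `seed` parameter and the rule of keeping at least one entry are assumptions added for illustration.

```python
import random

def sample_terms(full_list, ratio=0.01, seed=None):
    """Draw a random sample Ra from the full term list R without replacement."""
    if not full_list:
        return []
    # Assumption: keep at least one entry so that small lists still yield a sample.
    k = max(1, round(len(full_list) * ratio))
    return random.Random(seed).sample(full_list, k)

R = [f"entry {i}" for i in range(1000)]  # placeholder entries
Ra = sample_terms(R, ratio=0.01, seed=0)  # 10 of the 1000 entries
```

Passing a seed makes the sample reproducible, which helps when the resulting relationship statistics need to be compared between runs.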
Step S32: obtain a sample grammatical relationship list Ga from the sample term list Ra using the grammatical relationship analysis method. The specific steps include:
Step S321: segment each entry in the sample term list Ra to obtain its word segmentation information, forming a first nested list Ta.
Before dependency parsing, each search entry has to be segmented into words. This embodiment uses "jieba", a third-party Python library dedicated to Chinese word segmentation. If the search term contains a technical term or compound such as "技术创新中心" ("technology innovation center") that should be treated as a single word, the "jieba.add_word" method can add it to the default dictionary so that it is kept as one word during segmentation. After that, the "jieba.lcut" method returns a list of segmented words for each search entry. The collection of these lists for all entries in the sample term list Ra is a nested list, named the first nested list Ta. For example, in Python the search entry "这个是我的" ("this is mine") is a string, and jieba.lcut("这个是我的") outputs the segmented list: ["这个","是","我","的"].
Step S322: perform dependency parsing on each entry based on the first nested list Ta and a trained dependency parsing model, obtain the dependency relations between the segmented words of each entry, and form a second nested list Da.
This step uses "hanlp", a third-party Python library that supports many Chinese natural language processing tasks, to perform dependency parsing on each sentence. For ease of presentation, the sentence "这个是我的" ("this is mine") is used as the example below.
The pre-trained dependency parsing model is loaded with hanlp: dep=hanlp.load(hanlp.pretrained.dep.CTB9_UDC_ELECTRA_SMALL). "CTB9_UDC_ELECTRA_SMALL" is an open-source pre-trained dependency parsing model for Chinese. hanlp also supports loading pre-trained models for various other languages; Chinese retrieval is used in this embodiment only for ease of explanation, and implementations for other languages all fall within the scope of protection of this patent. After loading, passing in the segmented list, dep(["这个","是","我","的"]), yields the dependency relations of the search entry: [(3,"nsubj"),(3,"cop"),(0,"root"),(3,"case")]. The result is a list of tuples, each consisting of a position number within the entry and a dependency relation, and it must be read together with the first nested list Ta. Referring to Table 1, the first tuple (3,"nsubj") means that the first segmented word "这个" is the subordinate word and the third segmented word ("我") is its dominant word, with the dependency relation "nsubj". The remaining tuples are read in the same way. In particular, the third tuple (0,"root") indicates that the third segmented word is the root: as a subordinate word it has no dominant word.
After dependency parsing has been performed on every element of the first nested list Ta, the second nested list Da is obtained.
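Reading the parser output as described above amounts to turning each (head position, relation) pair into a (dominant word, subordinate word, relation) triple. The sketch below works on the literal output quoted in the text and does not call hanlp; the function name `to_triples` is hypothetical.

```python
def to_triples(heads_and_rels):
    """Convert [(head, rel), ...] into (src, dst, dep) triples.

    Positions are 1-based; a head of 0 marks the root, which has no
    dominant word and therefore contributes no edge.
    """
    triples = []
    for position, (head, rel) in enumerate(heads_and_rels, start=1):
        if head == 0:
            continue
        triples.append((head, position, rel))
    return triples

tokens = ["这个", "是", "我", "的"]
parsed = [(3, "nsubj"), (3, "cop"), (0, "root"), (3, "case")]
triples = to_triples(parsed)
# Each triple reads: dominant word -> subordinate word, with relation dep.
readable = [(tokens[s - 1], tokens[d - 1], dep) for s, d, dep in triples]
```

The integer triples are the form in which the relations can later be added to a graph as edges.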
Step S323: perform part-of-speech tagging on the segmented words of each entry in the first nested list Ta, obtain the part of speech of each segmented word of each entry, and form a third nested list Pa.
This step tags the first nested list Ta with parts of speech, using the part-of-speech tagging interface of hanlp: pos=hanlp.load(hanlp.pretrained.pos.CTB9_POS_ELECTRA_SMALL). For example, pos(["这个","是","我","的"]) returns the part-of-speech list ["PN","VC","PN","DEG"]. Applying this to every element of the first nested list Ta yields the third nested list Pa.
Step S324: construct the sample grammatical relationship list Ga based on the first nested list Ta, the second nested list Da and the third nested list Pa. The specific steps include:
Step S3241: create a bidirectional weighted graph based on the first nested list Ta, the second nested list Da and the third nested list Pa.
The starting point of the bidirectional weighted graph is termSRC|posSRC, where termSRC is the dominant word in the dependency relation information and posSRC is the part of speech of the dominant word;
the end point of the bidirectional weighted graph is termDST|posDST, where termDST is the subordinate word in the dependency relation information and posDST is the part of speech of the subordinate word;
the edge of the bidirectional weighted graph is dep, where dep is the dependency relation between the dominant word and the subordinate word in the dependency relation information;
Step S3242: convert the node information of the bidirectional weighted graph into a graph data structure, and construct the sample grammatical relationship list Ga based on the graph data structure.
In terms of implementation, a dependency syntax tree can be treated as a bidirectional weighted graph. A graph (or network) consists of nodes and edges. In the bidirectional weighted graph used here, every edge has a direction and carries a weight (an attribute); in graph-library terms this is a directed graph with edge attributes. The relation from one node to another can therefore be written as a triple: (src, dst, dep). "src" is the starting point and corresponds to the dominant word in dependency grammar, "dst" is the end point and corresponds to the subordinate word, and "dep" is the edge attribute and corresponds to the dependency relation. The graph representation of the sentence "这个是我的" is given in Table 1. Once the dependency syntax tree is represented as a graph, the required information can be obtained with graph algorithms or simple node attributes. For example, all nodes connected to a given node can be obtained directly, which greatly reduces computational complexity and makes the method simpler and more flexible.
The graph is built with "networkx", a third-party Python library commonly used to construct graphs (networks), which has many built-in graph algorithms that can be called directly. Taking "这个是我的" as the example, first create the graph: g=networkx.DiGraph(). Next, a script rewrites the relations of the syntax tree into the form described in Table 1, represented in Python as a nested list in which each segmented word is denoted by its position in the segmented list: dep=[["3","1","nsubj"],["3","2","cop"],["3","4","case"]]. Adding the relations in this nested list to the newly created graph gives the complete bidirectional weighted graph: g.add_weighted_edges_from(dep,weight="dep"). The part-of-speech list is then used to add the part-of-speech information: networkx.set_node_attributes(g,{"1":"PN","2":"VC","3":"PN","4":"DEG"},name='pos'). Here g is the graph being built, updated step by step from an empty graph. Applying this construction to every entry, using the second nested list Da together with Ta and Pa, yields the sample grammatical relationship list Ga, whose elements are graphs.
Table 1
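The graph representation of "这个是我的" can also be illustrated without third-party libraries. The sketch below is a plain-dictionary stand-in for a networkx.DiGraph, not the networkx API itself: the class name `DepGraph` and its layout are hypothetical, and only the method names `out_edges` and `in_edges` are chosen to echo networkx.

```python
class DepGraph:
    """Tiny directed graph with node and edge attributes (illustrative only)."""

    def __init__(self, tokens, tags, triples):
        # Node i (1-based) carries the i-th segmented word and its part of speech.
        self.nodes = {i: {"term": t, "pos": p}
                      for i, (t, p) in enumerate(zip(tokens, tags), start=1)}
        # Edge (src, dst) points from dominant word to subordinate word.
        self.edges = {(src, dst): dep for src, dst, dep in triples}

    def out_edges(self, node):
        """Edges on which `node` is the dominant word."""
        return [(s, d, dep) for (s, d), dep in self.edges.items() if s == node]

    def in_edges(self, node):
        """Edges on which `node` is the subordinate word."""
        return [(s, d, dep) for (s, d), dep in self.edges.items() if d == node]

g = DepGraph(["这个", "是", "我", "的"],
             ["PN", "VC", "PN", "DEG"],
             [(3, 1, "nsubj"), (3, 2, "cop"), (3, 4, "case")])
```

Node 3 ("我") has three outgoing edges and no incoming edge, which reflects its role as the root of this sentence.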
Step S33: construct grammatical relationship options based on the sample grammatical relationship list Ga, and compile statistics to form a grammatical relationship option list.
Step S34: read the grammatical relationship requirement option information, and generate the grammatical relationship configuration table based on that information and the grammatical relationship option list.
Once the sample grammatical relationship list Ga has been built, the user can run a Python script to count the grammatical relationship items and list the ten most frequent ones, or visualize the dependency relations directly, in order to confirm the search requirement and configure the grammatical relationship table of the search term. At this point the table is complete. This step is mainly intended for users who have not prepared a configuration table in advance. Small-scale sampling shows how the dependency relations are distributed, which lets the user confirm the search conditions, finish the configuration table and run the search in the next step. This considerably lowers the effort needed to learn the method.
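The counting step can be sketched with `collections.Counter`. The sketch assumes that each sample graph has already been flattened into (src, dst, dep) edges whose end points are written as "term|pos"; the function name `relation_options` and the two toy sentences, including their parses, are made up for illustration.

```python
from collections import Counter

def relation_options(sample_graphs, top_n=10):
    """Count relations across all sample graphs and return the most frequent.

    Each option is formatted like a configuration line:
    'termSRC|posSRC,termDST|posDST,dep'.
    """
    counts = Counter(
        f"{src},{dst},{dep}"
        for edges in sample_graphs
        for src, dst, dep in edges
    )
    return counts.most_common(top_n)

# Toy stand-in for Ga: two sentences, each given as a list of edges.
Ga = [
    [("打击|VV", "势力|NN", "dobj"), ("势力|NN", "黑恶|JJ", "amod")],
    [("打击|VV", "势力|NN", "dobj"), ("打击|VV", "严厉|AD", "advmod")],
]
options = relation_options(Ga)
```

Because every option already has the configuration-line format, a chosen option can be copied into the configuration table unchanged.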
In step S4, grammatical relationship matching is performed based on the grammatical relationship configuration table and the full term list R, and the entries that satisfy the grammatical relationship matching rules are output to obtain the final matching list. The specific steps include:
Step S41: obtain a full grammatical relationship list G from the full term list R using the grammatical relationship analysis method.
This step uses the same method as step S32, applied to the full term list R instead of the sample term list Ra. The result is the full grammatical relationship list G, whose elements are individual graphs, each containing the dependency relations and parts of speech of one sentence. The details are not repeated here.
Step S42: extract from the full grammatical relationship list G the entries whose dependency relations agree with the dependency relation information in the grammatical relationship configuration table, obtaining a matching entry list.
Each graph in the full grammatical relationship list G is searched according to the dependency relation information in the configuration table. For example, let the search term be "打击.*?黑恶势力" and the configuration table be "打击|VV,势力|NN,dobj", and assume that all full terms returned by the regular-expression match have been built into graphs. In the graphs of sentences 1), 2) and 3), the built-in networkx method can then be used to query whether "打击" and "势力" are linked by the dependency relation "dobj": gi.get_edge_data(ni,mi). Here gi is the graph of each of the three sentences, and ni and mi are the position numbers of the dominant and subordinate words, that is, the positions of "打击" and "势力" in the segmented list of each entry. If no such edge exists, the method returns None and "no match" is output. Otherwise it returns a dictionary: {"dep":depi}, where depi is the dependency relation between "打击" and "势力" in that sentence. In this example depi is "dobj" for all three sentences, so the dependency relation matches in all three.
Step S43: extract from the matching entry list the entries whose part-of-speech information agrees with the grammatical relationship configuration table, obtaining the final matching list F.
Next, the parts of speech are checked. The node attributes, that is, the parts of speech, are accessed through networkx: gi.nodes[ni] and gi.nodes[mi]. If the outputs are {"pos":"VV"} and {"pos":"NN"} respectively, the parts of speech also satisfy the configuration table. Since both the dependency relation and the parts of speech conform to the grammatical relationship configuration table, all three sentences are output and shown as the final matching list F.
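The two checks of steps S42 and S43 (dependency relation first, then parts of speech, with "*" accepting anything) can be combined in one small function. The sketch below uses plain lists and a dictionary instead of networkx; the function names and the toy sentence with its parse are assumptions made for illustration.

```python
def _accepts(expected, actual):
    """A configured value matches if it is the wildcard '*' or equal."""
    return expected == "*" or expected == actual

def matches_rule(tokens, tags, edges, rule):
    """Check one rule (term_src, pos_src, term_dst, pos_dst, dep) on a sentence.

    tokens/tags are 0-based lists; edges maps 1-based (src, dst) to a relation.
    """
    term_src, pos_src, term_dst, pos_dst, dep = rule
    for (src, dst), actual_dep in edges.items():
        if tokens[src - 1] != term_src or tokens[dst - 1] != term_dst:
            continue
        if not _accepts(dep, actual_dep):               # step S42
            continue
        if (_accepts(pos_src, tags[src - 1])
                and _accepts(pos_dst, tags[dst - 1])):  # step S43
            return True
    return False

# Toy sentence "严厉打击黑恶势力" with an assumed parse.
tokens = ["严厉", "打击", "黑恶", "势力"]
tags = ["AD", "VV", "JJ", "NN"]
edges = {(2, 1): "advmod", (4, 3): "amod", (2, 4): "dobj"}
hit = matches_rule(tokens, tags, edges, ("打击", "VV", "势力", "NN", "dobj"))
```

Since the expressions of a configuration table are combined with "AND", a sentence would be kept only if `matches_rule` holds for every rule in the table.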
When the grammatical relationship configuration table contains unknown words, the script first confirms whether such a word exists in the actual grammatical relationships. In this case the edge attribute between two nodes cannot be fetched directly; instead, the edges of the known word have to be traversed. For example, with the configuration table "打击|*,[0]|VV,conj", the known word is the dominant word, so the "out_edges" of its node are traversed: for src,dst,dep in g.out_edges(xi,data=True). Here g is the graph built from the sentence containing "打击", and xi is the position of "打击" in the segmented list of that sentence. If anything is yielded, "打击" has outgoing edges: src is xi, dst is the position of a candidate unknown word in the segmented list, and dep is the edge attribute dictionary, whose "dep" entry holds the dependency relation between the two. The match succeeds when some yielded dst has the part of speech "VV" and its relation is "conj"; that dst is then the unknown word [0]. Otherwise the match fails. If the configuration table is "[0]|VV,势力|NN,dobj", the procedure is the same except that the known word is the subordinate word, so its "in_edges" are traversed: for src,dst,dep in g.in_edges(xi,data=True).
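The traversal for unknown words can be sketched as follows, again with a plain edge dictionary in place of networkx. The function name `candidates` is hypothetical, and the parse given for sentence 4) is an assumed one that is merely consistent with the relations named in the text ("conj", "dobj", "cc").

```python
def candidates(tokens, tags, edges, known_term, other_pos, dep, known_is_head):
    """Return 1-based positions that can fill an unknown-word slot.

    The known word's outgoing edges are scanned when it is the dominant
    word, and its incoming edges when it is the subordinate word.
    """
    found = []
    for (src, dst), actual_dep in edges.items():
        known, other = (src, dst) if known_is_head else (dst, src)
        if tokens[known - 1] != known_term:
            continue
        if dep not in ("*", actual_dep):
            continue
        if other_pos not in ("*", tags[other - 1]):
            continue
        found.append(other)
    return found

# Sentence 4) with an assumed segmentation, tagging and parse.
tokens = ["打击", "贪腐", "问题", "和", "惩治", "黑恶", "势力"]
tags = ["VV", "NN", "NN", "CC", "VV", "JJ", "NN"]
edges = {(1, 3): "dobj", (3, 2): "nn", (1, 4): "cc",
         (1, 5): "conj", (5, 7): "dobj", (7, 6): "amod"}

# Rule "打击|*,[0]|VV,conj": the known word is the dominant word.
from_head = candidates(tokens, tags, edges, "打击", "VV", "conj", True)
# Rule "[0]|VV,势力|NN,dobj": the known word is the subordinate word.
from_dependent = candidates(tokens, tags, edges, "势力", "VV", "dobj", False)
```

Both rules must bind "[0]" to the same word, so a combined check can intersect the two candidate lists; here both contain only position 5, "惩治".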
In some preferred implementations, a GPU is additionally used to accelerate the retrieval process. The Python-based GPU data analysis framework "RAPIDS" provides "cuGraph", a counterpart to networkx on the GPU platform, so the computation can be accelerated without modifying much code. Parallel computing can also be used during retrieval. These optimizations help keep retrieval fast.
In summary, this embodiment proposes a retrieval method with configurable grammatical relationships, which completes matching more conveniently and with more tolerance for variation than traditional methods. In terms of implementation, it introduces a graph representation that greatly improves the computational efficiency of the matching. The method can be widely applied to public opinion monitoring and linguistic corpus research, and it is also expected to be considerably more accurate than earlier retrieval tools.
Embodiment 2
Please refer to FIG. 11, which is a schematic structural diagram of the retrieval system with configurable grammatical relationships disclosed in an embodiment of the present invention. As shown in FIG. 11, the system may include:
an input module, used to input or read the target to be searched;
a text segmentation module, used to identify the search text in the target to be searched and to segment the search text to form a search term list;
a first matching module, used to perform regular-expression matching between the search term list and the database to form a full term list;
a grammar configuration identification module, used to identify whether a grammatical relationship configuration table exists in the target to be searched;
a grammar configuration building module, used to build the grammatical relationship configuration table;
a second matching module, used to perform grammatical relationship matching on the full term list according to the grammatical relationship configuration table to form a final matching list;
a display module, used to output the final matching list and display it in sorted order.
The retrieval system with configurable grammatical relationships of this embodiment completes matching more conveniently and with more tolerance for variation than traditional methods. It introduces a graph representation that greatly improves the computational efficiency of the matching. The system can be widely applied to public opinion monitoring and linguistic corpus research, and it is also expected to be considerably more accurate than earlier retrieval tools.
Embodiment 3
Please refer to FIG. 12, which is a schematic structural diagram of an electronic device disclosed in an embodiment of the present invention. The electronic device may be a computer, a server or the like; in certain cases it may also be a smart device such as a mobile phone, a tablet computer or a monitoring terminal, or an image acquisition device with processing capability. As shown in FIG. 12, the electronic device may include:
a memory 510 storing executable program code;
a processor 520 coupled to the memory 510;
wherein the processor 520 calls the executable program code stored in the memory 510 to execute some or all of the steps of the retrieval method with configurable grammatical relationships of Embodiment 1.
An embodiment of the present invention discloses a computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute some or all of the steps of the retrieval method with configurable grammatical relationships of Embodiment 1.
An embodiment of the present invention further discloses a computer program product which, when run on a computer, causes the computer to execute some or all of the steps of the retrieval method with configurable grammatical relationships of Embodiment 1.
An embodiment of the present invention further discloses an application publishing platform used to publish a computer program product which, when run on a computer, causes the computer to execute some or all of the steps of the retrieval method with configurable grammatical relationships of Embodiment 1.
In the various embodiments of the present invention, it should be understood that the sequence numbers of the processes do not imply a necessary order of execution. The execution order of the processes should be determined by their functions and internal logic, and the numbering should not limit the implementation of the embodiments of the present invention in any way.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-accessible memory. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in the computer device) to execute some or all of the steps of the methods described in the embodiments of the present invention.
In the embodiments provided by the present invention, it should be understood that "B corresponding to A" means that B is associated with A and that B can be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
A person of ordinary skill in the art will understand that some or all of the steps in the methods of the embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which includes Read-Only Memory (ROM), Random Access Memory (RAM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), One-Time Programmable Read-Only Memory (OTPROM), Electrically-Erasable Programmable Read-Only Memory (EEPROM), Compact Disc Read-Only Memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The configurable grammatical relationship retrieval method, system, electronic device, and storage medium disclosed in the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Table 2: CTB part-of-speech tags
Table 3: UD dependency labels
Claims (8)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310590928.9A CN116991969B (en) | 2023-05-23 | 2023-05-23 | Configurable grammatical relationship retrieval method, system, electronic device and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN116991969A (en) | 2023-11-03 |
| CN116991969B (en) | 2024-03-19 |
Family
ID=88523904
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310590928.9A Active CN116991969B (en) | 2023-05-23 | 2023-05-23 | Configurable grammatical relationship retrieval method, system, electronic device and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116991969B (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN104252533A (en) * | 2014-09-12 | 2014-12-31 | 百度在线网络技术(北京)有限公司 | Search method and search device |
| CN104484374A (en) * | 2014-12-08 | 2015-04-01 | 百度在线网络技术(北京)有限公司 | Method and device for creating Internet encyclopedia entry |
| CN110502642A (en) * | 2019-08-21 | 2019-11-26 | 武汉工程大学 | A Method of Entity Relationship Extraction Based on Dependency Syntax Analysis and Rules |
| CN112347767A (en) * | 2021-01-07 | 2021-02-09 | 腾讯科技(深圳)有限公司 | Text processing method, device and equipment |
| CN113886527A (en) * | 2021-10-20 | 2022-01-04 | 前锦网络信息技术(上海)有限公司 | A natural language semantic extraction method and system |
| CN116090450A (en) * | 2022-11-28 | 2023-05-09 | 荣耀终端有限公司 | Text processing method and computing device |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7526425B2 (en) * | 2001-08-14 | 2009-04-28 | Evri Inc. | Method and system for extending keyword searching to syntactically and semantically annotated data |
Non-Patent Citations (1)
| Title |
|---|
| 基于依存语法的汉语复句关系词自动标识;荣蕾;《基于依存语法的汉语复句关系词自动标识(月刊)》(第01期);全文 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |