CN114742068A - Multi-sentence correlation analysis method and system for ISO19650 standard text
- Publication number
- CN114742068A (CN202210355791.4A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- sentences
- standard
- words
- iso19650
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
Description
Technical Field
The invention relates to the technical field of information processing, and in particular to a method, based on NLP and an ontology model, for analyzing the associations between multiple sentences in the texts of the ISO 19650 standard series.
Background
Engineering project development requires all participants to communicate clear information in a timely manner. In addition to the IFC file format, participants need an information management framework to support their collaboration. The ISO 19650 standard series provides such a framework for establishing reliable information sources. Because the five-part ISO 19650 series forms a complex system, the construction industry wants to capture the semantic information contained in these standards.
However, manually extracting semantic information from the ISO 19650 standard series is both time-consuming and costly. An NLP-based semantic information extraction method is therefore needed, so that the associations and cross-references between the clauses of the standards can be analyzed automatically with the help of an ontology model of ISO 19650.
The invention patent with publication number CN110096692B discloses a semantic information processing method and device. The method includes: dividing an obtained question stem into two parts, the known conditions and the conclusion; extracting the explicit semantic information from the known conditions and the conclusion; when implicit semantic information exists in the known conditions and/or the conclusion, extracting that implicit semantic information; and merging the extracted explicit and implicit semantic information to obtain the semantic information of the question stem.
Summary of the Invention
In view of the defects in the prior art, the present invention provides a multi-sentence association analysis method and system for ISO 19650 standard text.
The scheme of the multi-sentence association analysis method and system for ISO 19650 standard text provided by the present invention is as follows:
In a first aspect, a multi-sentence association analysis method for ISO 19650 standard text is provided, the method including:
Step S1: performing word segmentation and word replacement on the sentences in the ISO 19650 standard series to obtain preprocessed sentences;
Step S2: performing dependency parsing on the preprocessed sentences to obtain the dependency relations between the words in each sentence;
Step S3: reasoning over the dependency relations between the words in a sentence, according to rules that convert dependency relations into semantic relations, to obtain the semantic relations between the words in a single sentence;
Step S4: importing the semantic relations of the single sentences into a graph database, importing the ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the associations between multiple sentences.
Preferably, step S1 includes:
Step S1.1: obtaining the text files of the Chinese version of the ISO 19650 standard series;
Step S1.2: extracting sentences clause by clause and segmenting each sentence into words;
Step S1.3: performing word replacement on the words obtained by segmentation, replacing specialized terms with their hypernyms.
Preferably, the text file is a docx file. The open-source ZLib library is used to decompress the docx file into a set of XML files. The XML files are then analyzed according to the clause numbering pattern of the ISO 19650 standard series to extract the content of each clause, all font and paragraph formatting is removed, and a plain text file containing a list of sentences is finally generated.
Preferably, step S2 includes: performing syntax tree analysis on the sentence with a dependency parser; tagging every word in the sentence with a part of speech; finding the head word of the sentence; determining the non-head words attached to the head word; treating each non-head word in turn as a head word to start the next round of searching for its attached non-head words; and finally obtaining a multi-level dependency syntax tree.
Preferably, step S3 includes semantic relation reasoning: designing mapping rules from dependency relations to semantic relations, and converting the dependency syntax tree into binary semantic relations according to these mapping rules.
In a second aspect, a multi-sentence association analysis system for ISO 19650 standard text is provided, the system including:
Module M1: performing word segmentation and word replacement on the sentences in the ISO 19650 standard series to obtain preprocessed sentences;
Module M2: performing dependency parsing on the preprocessed sentences to obtain the dependency relations between the words in each sentence;
Module M3: reasoning over the dependency relations between the words in a sentence, according to rules that convert dependency relations into semantic relations, to obtain the semantic relations between the words in a single sentence;
Module M4: importing the semantic relations of the single sentences into a graph database, importing the ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the associations between multiple sentences.
Preferably, module M1 includes:
Module M1.1: obtaining the text files of the Chinese version of the ISO 19650 standard series;
Module M1.2: extracting sentences clause by clause and segmenting each sentence into words;
Module M1.3: performing word replacement on the words obtained by segmentation, replacing specialized terms with their hypernyms.
Preferably, the text file is a docx file. The open-source ZLib library is used to decompress the docx file into a set of XML files. The XML files are then analyzed according to the clause numbering pattern of the ISO 19650 standard series to extract the content of each clause, all font and paragraph formatting is removed, and a plain text file containing a list of sentences is finally generated.
Preferably, module M2 includes: performing syntax tree analysis on the sentence with a dependency parser; tagging every word in the sentence with a part of speech; finding the head word of the sentence; determining the non-head words attached to the head word; treating each non-head word in turn as a head word to start the next round of searching for its attached non-head words; and finally obtaining a multi-level dependency syntax tree.
Preferably, module M3 includes semantic relation reasoning: designing mapping rules from dependency relations to semantic relations, and converting the dependency syntax tree into binary semantic relations according to these mapping rules.
Compared with the prior art, the present invention has the following beneficial effects:
The invention infers domain semantic relations from syntactic relations by means of mapping rules, which largely overcomes the difficulty caused by an insufficient corpus, and experiments have verified the feasibility and practicality of the proposed information extraction method.
Description of Drawings
Other features, objects and advantages of the present invention will become more apparent from the detailed description of non-limiting embodiments given with reference to the following drawings:
Figure 1 is a schematic diagram of the overall flow of the present invention;
Figure 2 is a schematic diagram of the dependency parsing framework;
Figure 3 is an example of the dependency tree of a sentence;
Figure 4 is a schematic diagram of typical semantic relations;
Figure 5 is a schematic diagram of the conversion from syntactic relations to semantic relations;
Figure 6 shows the knowledge graph stored in the Neo4J graph database;
Figure 7 is a case of inferring the association between sentences of the standard.
Detailed Description
The present invention is described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the present invention, but do not limit it in any form. It should be noted that those of ordinary skill in the art can make several changes and improvements without departing from the inventive concept, all of which fall within the protection scope of the present invention.
An embodiment of the present invention provides an NLP-based method for analyzing the associations between multiple sentences of standard text. Referring to Figure 1, the method specifically includes:
Step S1: performing word segmentation and word replacement on the sentences in ISO 19650 to obtain preprocessed sentences.
Step S1 specifically includes:
Step S1.1: obtaining the files containing the Chinese translation of the ISO 19650 standard series.
Step S1.2: extracting sentences clause by clause and segmenting each sentence into words.
Step S1.3: performing word replacement on the words obtained by segmentation (the named entities corresponding to the words in the sentence), replacing specialized terms with their hypernyms. The hypernyms used as replacements are words that DDParser recognizes and analyzes with high accuracy.
The text file is a docx file. The open-source ZLib library is used to decompress the docx file into a set of XML files. The XML files are then analyzed according to the clause numbering pattern of ISO 19650 to extract the content of each clause, all font and paragraph formatting is removed, and a plain text file containing a list of sentences is finally generated.
Step S2: performing dependency parsing on the preprocessed sentences to obtain the dependency relations between the words in each sentence. A dependency parser performs syntax tree analysis on the sentence: every word is tagged with a part of speech, the head word of the sentence is found, the non-head words attached to the head word are determined, each non-head word is then treated as a head word to start the next round of searching for its attached non-head words, and a multi-level dependency syntax tree is finally obtained.
Step S3: reasoning over the dependency relations between the words in a sentence, according to rules that convert dependency relations into semantic relations, to obtain the semantic relations between the words in a single sentence. For this semantic relation reasoning, mapping rules from dependency relations to semantic relations are designed, and the dependency syntax tree is converted into binary semantic relations according to these rules.
Step S4: importing the semantic relations of the single sentences into a graph database, importing the ontology model of the ISO standard into the graph database, establishing links between the words in each sentence (the named entities corresponding to the words in the sentence) and the words in the ontology model, and inferring the associations between multiple sentences.
The present invention also provides an NLP-based system for analyzing the associations between multiple sentences of standard text. The system specifically includes:
Module M1: performing word segmentation and word replacement on the sentences in ISO 19650 to obtain preprocessed sentences.
Module M1 specifically includes:
Module M1.1: obtaining the files containing the Chinese translation of the ISO 19650 standard series.
Module M1.2: extracting sentences clause by clause and segmenting each sentence into words.
Module M1.3: performing word replacement on the words obtained by segmentation (the named entities corresponding to the words in the sentence), replacing specialized terms with their hypernyms. The hypernyms used as replacements are words that DDParser recognizes and analyzes with high accuracy.
The text file is a docx file. The open-source ZLib library is used to decompress the docx file into a set of XML files. The XML files are then analyzed according to the clause numbering pattern of ISO 19650 to extract the content of each clause, all font and paragraph formatting is removed, and a plain text file containing a list of sentences is finally generated.
Module M2: performing dependency parsing on the preprocessed sentences to obtain the dependency relations between the words in each sentence. A dependency parser performs syntax tree analysis on the sentence: every word is tagged with a part of speech, the head word of the sentence is found, the non-head words attached to the head word are determined, each non-head word is then treated as a head word to start the next round of searching for its attached non-head words, and a multi-level dependency syntax tree is finally obtained.
Module M3: reasoning over the dependency relations between the words in a sentence, according to rules that convert dependency relations into semantic relations, to obtain the semantic relations between the words in a single sentence. For this semantic relation reasoning, mapping rules from dependency relations to semantic relations are designed, and the dependency syntax tree is converted into binary semantic relations according to these rules.
Module M4: importing the semantic relations of the single sentences into a graph database, importing the ontology model of the ISO standard into the graph database, establishing links between the words in each sentence and the words in the ontology model, and inferring the associations between multiple sentences.
Next, the present invention is described in more detail.
The present invention provides a method, based on NLP and an ontology model, for analyzing the associations between multiple sentences in the texts of the ISO 19650 standard series. Specifically, it is a method for semantic information extraction and sentence association analysis on the Chinese translation of the ISO 19650 standard series. The method includes:
1. Preprocessing of the ISO 19650 text:
Figure 1 shows the preprocessing flow for the sentences in the ISO 19650 standard series: each sentence is first segmented into words, and the specialized terms among them are then replaced with their hypernyms.
The original English text of the standard is first translated into Chinese and stored in Microsoft docx format. The preprocessing program shown in Figure 1 is then used to extract all sentences from each part of the ISO standard.
A docx file is actually a compressed archive of XML files that contains the text content together with markup defining formatting and typesetting. The open-source ZLib library is used to decompress the docx file into a set of XML files; the text content of the standard is then extracted from these files and all font and paragraph formatting is removed. In this way, the program generates a plain text file containing a list of sentences.
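The extraction step described above can be sketched with Python's standard library, whose `zipfile` module relies on zlib for decompression. This is only a minimal illustration, not the patent's program: the function names and the sample paragraphs are made up, the clause-numbering analysis is omitted, and the sketch only assumes the usual docx layout in which the body text lives in `word/document.xml`.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a docx archive.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_bytes: bytes) -> list[str]:
    """Return the plain text of each non-empty paragraph, dropping all formatting."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W_NS + "p"):
        # Join the text runs (w:t); run properties such as fonts are simply ignored.
        text = "".join(t.text or "" for t in p.iter(W_NS + "t")).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def make_sample_docx() -> bytes:
    """Build a tiny in-memory archive laid out like a docx (made-up sample text)."""
    xml = (
        '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
        "<w:body>"
        "<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>公共数据环境</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>公共数据环境是</w:t></w:r><w:r><w:t>信息源。</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", xml)
    return buf.getvalue()


sentences = docx_paragraphs(make_sample_docx())
```

Each returned string corresponds to one paragraph with bold, font and other run properties stripped, which is the kind of plain sentence list the later steps consume.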
Unlike English sentences, Chinese sentences have no spaces separating the characters into words, which makes it difficult to evaluate the relations between words. The Jieba library is therefore used to split each sentence of the ISO 19650 standard text into an array of Chinese words, each consisting of one or more Chinese characters. Because Jieba's ability to recognize new words is limited, an ISO 19650 domain vocabulary is used to strengthen it. For example, without the ISO 19650 domain vocabulary, the term "公共数据环境" (common data environment) is usually split by Jieba into the three words "公共" (common), "数据" (data) and "环境" (environment), which is clearly not the expected result. Once the ISO 19650 domain vocabulary containing the term "公共数据环境" has been merged into Jieba's custom dictionary, the correct segmentation is obtained.
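The effect of adding a domain vocabulary can be illustrated without Jieba itself. The sketch below uses greedy forward maximum matching, which is a simpler stand-in and not Jieba's actual algorithm; the two lexicons are hypothetical excerpts. With Jieba, the comparable step would be registering the terms through its user-dictionary facilities such as `jieba.load_userdict` or `jieba.add_word`.

```python
def segment(sentence: str, lexicon: set[str]) -> list[str]:
    """Greedy forward maximum matching: at each position take the longest lexicon word."""
    max_len = max((len(word) for word in lexicon), default=1)
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + size]
            # Fall back to a single character when nothing in the lexicon matches.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


GENERAL_LEXICON = {"公共", "数据", "环境", "信息源"}  # hypothetical general-purpose words
DOMAIN_LEXICON = {"公共数据环境"}                     # term taken from the domain vocabulary

sentence = "公共数据环境是信息源"
without_domain = segment(sentence, GENERAL_LEXICON)
with_domain = segment(sentence, GENERAL_LEXICON | DOMAIN_LEXICON)
```

Here `without_domain` breaks the term into "公共", "数据", "环境", while `with_domain` keeps "公共数据环境" as a single word, mirroring the behaviour described for Jieba.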
After word segmentation, the specialized terms are replaced with their hypernyms. The hypernyms used as replacements are words that DDParser recognizes and analyzes with high accuracy. To make this replacement possible, the ISO 19650 domain vocabulary defines a hypernym for every specialized term. For example, the hypernym of "公共数据环境" (common data environment) is "信息源" (information source). Because many of the specialized terms in ISO 19650 would make the subsequent dependency parsing difficult, they are replaced with hypernyms that are easier to recognize and analyze.
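A minimal sketch of the replacement step follows. The vocabulary excerpt contains only the one pair named in the text; the sample word list and the idea of recording each substitution, so the original term can be restored after parsing, are assumptions added for illustration.

```python
# Hypothetical excerpt of the domain vocabulary: specialized term -> hypernym.
HYPERNYMS = {"公共数据环境": "信息源"}


def replace_terms(words, hypernyms):
    """Replace each specialized term by its hypernym and record what was replaced."""
    replaced, originals = [], {}
    for index, word in enumerate(words):
        if word in hypernyms:
            originals[index] = word  # remember the term so it can be restored later
            replaced.append(hypernyms[word])
        else:
            replaced.append(word)
    return replaced, originals


# Made-up segmented sentence: "the project uses the common data environment".
words = ["项目", "使用", "公共数据环境"]
replaced, originals = replace_terms(words, HYPERNYMS)
```

The parser then sees the easier word "信息源", while `originals` keeps the position of the term it stands for.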
2. Semantic information extraction from the ISO 19650 standard text:
2.1. Obtaining the dependency relations:
The structure of a sentence can be obtained by dependency parsing. This process tags the part of speech of each Chinese word and determines its dependency relations with the other constituents of the sentence, so that the sentence is finally represented as a dependency syntax tree that is easier for a computer to process. The open-source Baidu dependency parser (DDParser) is used to analyze the dependency relations between the words of a sentence.
Referring to Figure 2, DDParser is a syntactic parsing model built on the Deep Biaffine Attention model. The input vector $e_i$ of the i-th word is the concatenation of its word embedding $e_i^{word}$ and the character-level LSTM vector $\mathrm{charLSTM}(w_i)$:
$e_i = e_i^{word} \oplus \mathrm{charLSTM}(w_i)$
where $\mathrm{charLSTM}(w_i)$ is the concatenated vector produced by feeding the characters of the i-th word $w_i$ sequentially into a BiLSTM layer. Each $e_i$ is then fed into a three-layer BiLSTM, producing a high-dimensional vector $r_i$. Next, a multilayer perceptron (MLP) reduces the dimensionality of each $r_i$. This dimensionality reduction discards information that has very little influence on the dependency relations between words. Finally, biaffine attention is used to identify the dependencies and their syntactic types.
Each word can be regarded as either a head or a dependent. After its word vector is fed into the deep biaffine attention model, the dependency arc score $S_i^{arc}$ and the relation score $S_i^{rel}$ are computed. From these two scores, the dependency relation between the two corresponding words can be derived. For the example in Figure 2, the syntactic relation between "Is" and "Information source" is evaluated as Verb-Object (VOB).
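For orientation, the arc scoring can be written out in the general form used by deep biaffine parsers. This is a sketch of that published form under assumed notation ($h_i^{dep}$, $h_j^{head}$, $U$, $u$ and $\hat{y}_i$ do not appear in the text above), and DDParser's exact parameterization may differ.

```latex
\begin{aligned}
h_i^{dep} &= \mathrm{MLP}^{dep}(r_i), \qquad h_j^{head} = \mathrm{MLP}^{head}(r_j) \\
S^{arc}_{ij} &= (h_j^{head})^{\top} U \, h_i^{dep} + (h_j^{head})^{\top} u \\
\hat{y}_i &= \arg\max_j \; S^{arc}_{ij}
\end{aligned}
```

Here $S^{arc}_{ij}$ scores word $j$ as the head of word $i$, and $\hat{y}_i$ is the predicted head. The relation score is computed in the same biaffine manner between a word and its predicted head, with one output per dependency type.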
DDParser has been trained on the manually annotated dataset DuCTB, in which the dependency relations between pairs of Chinese words are labeled. DuCTB has 24 part-of-speech tags and 14 types of dependency relation. Table 1 below lists some of the dependency relations.
Table 1
Figure 3 shows an example of a dependency syntax tree. A directed arc runs from each head word to its modifier, and each arc represents a grammatical dependency. For example, there is an arc between the subject "公共数据环境" (common data environment) and the predicate "是" (is) representing the dependency "SBV", and the object "信息源" (information source) has the dependency "VOB" with the same predicate "是". The semantic relation between the two words can then be inferred from these two dependencies.
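The arcs of such a tree can be held in a very small data structure. The sketch below is an illustration only: it identifies words by their text, whereas a real parser output would use token positions, and the ADV arc from "认可" to "共同" is an assumption based on the later description that "共同" modifies "认可".

```python
from collections import defaultdict
from typing import NamedTuple


class Arc(NamedTuple):
    head: str
    dependent: str
    label: str  # dependency type such as SBV, VOB, ATT, ADV


ARCS = [
    Arc("是", "公共数据环境", "SBV"),
    Arc("是", "信息源", "VOB"),
    Arc("信息源", "认可", "ATT"),
    Arc("认可", "共同", "ADV"),  # assumed attachment, see lead-in
]


def build_tree(arcs):
    """Group dependents under their head and locate the root of the sentence."""
    children = defaultdict(list)
    dependents = {arc.dependent for arc in arcs}
    for arc in arcs:
        children[arc.head].append((arc.dependent, arc.label))
    # The root is the head that is not itself a dependent of any word.
    root = next(arc.head for arc in arcs if arc.head not in dependents)
    return root, children


def depths(root, children):
    """Level-by-level walk: each dependent becomes a head in the next round."""
    level, frontier, result = 0, [root], {}
    while frontier:
        next_frontier = []
        for word in frontier:
            result[word] = level
            next_frontier.extend(dep for dep, _ in children[word])
        frontier, level = next_frontier, level + 1
    return result


root, children = build_tree(ARCS)
levels = depths(root, children)
```

The level-by-level walk mirrors the iterative head search described for step S2, where each non-head word is treated as the head word of the next round.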
2.2. Semantic relation reasoning:
DDParser mainly handles the syntactic structure of a sentence rather than its semantic relations. Syntactic relations convey information about the grammatical structure of the sentence, whereas semantic information extraction is concerned with the semantic connections between entities (concepts). Semantic relations are therefore more important for understanding the text, and the associations between multiple sentences can later be inferred from them.
To infer semantic information from the grammatical structure of a sentence, a set of mapping rules from one or more dependency relations to a semantic relation is designed. Table 2 below lists typical mapping rules. Using these rules, the dependency relations derived by DDParser can be converted into semantic relations. The letters A, I, F and O in the table denote entity types, with I standing for information, F for function and O for other.
Table 2
In addition, to better explain the semantic relations, an ontology model of ISO 19650 was proposed on the basis of a manual analysis of the ISO 19650 series. The model defines three core concepts (information, role and behavior) and the semantic relations between them, as shown in Figure 4.
Figure 5 illustrates the conversion of the dependency relations in Figure 3 into semantic relations. Following the mapping rules described in Table 2, the two syntactic relations SBV (between "公共数据环境" and "是") and VOB (between "信息源" and "是") are mapped to the semantic relation "hyponymy" between the subject and the object; that is, "公共数据环境" (common data environment) is a hyponym of "信息源" (information source). Likewise, the ATT (attributive) relation between "信息源" and "认可" (agreed) is converted into the semantic relation "negotiation" between the two entities: the head word "信息源" is of type "information", while the modifier "认可" is of type "function" and is the trigger word of the "negotiation" relation. In addition, two other syntactic relations, ATT and ADV, are converted into the semantic relations "attribute" and "restriction".
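The two conversions described above can be expressed as simple rules over (head, dependent, label) arcs. This is a sketch under stated assumptions: the contents of Table 2 are not reproduced here, so only the two rules spelled out in the text are encoded; restricting the hyponymy rule to the predicate "是" and the relation names `Hyponymy` and `Negotiation` as output strings are illustrative choices.

```python
ARCS = [
    ("是", "公共数据环境", "SBV"),
    ("是", "信息源", "VOB"),
    ("信息源", "认可", "ATT"),
]
# Entity types of the words involved: I = information, F = function.
ENTITY_TYPES = {"信息源": "I", "认可": "F"}


def map_to_semantic(arcs, entity_types, copula="是"):
    """Turn dependency arcs into binary semantic relations (subject, relation, object)."""
    subjects = {head: dep for head, dep, label in arcs if label == "SBV"}
    triples = []
    for head, dep, label in arcs:
        if label == "VOB" and head == copula and head in subjects:
            # SBV + VOB on the same predicate "是": the subject is a hyponym of the object.
            triples.append((subjects[head], "Hyponymy", dep))
        elif label == "ATT" and entity_types.get(head) == "I" and entity_types.get(dep) == "F":
            # An information head modified by a function word: the modifier is the trigger.
            triples.append((head, "Negotiation", dep))
    return triples


triples = map_to_semantic(ARCS, ENTITY_TYPES)
```

Each resulting triple is a binary relation of the kind step S3 produces, ready to be stored as an edge in a graph.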
3. Inferring associations between clauses with a graph database:
The ISO 19650 standard text contains many interrelated sentences, which are linked to one another either through shared concepts or through the ISO 19650 ontology model. When consulting the texts of the ISO 19650 series, it is often necessary to determine the references or dependencies between individual sentences. The graph database Neo4J is therefore used to describe the semantic relations obtained from the sentences of the standard, the ISO 19650 ontology model, and the links between the two. Compared with SQL relational databases, Neo4J allows rich properties to be defined on edges, which greatly reduces the join operations an SQL database would require and thus speeds up querying and inference.
Specifically, a node represents a named entity, and a directed edge represents a binary relation between two nodes. Both nodes and edges can have multiple properties. The association between a sentence and the ontology model can also be represented in Neo4J. In Neo4J, the word order within a sentence is represented by an "AFTER" relation from each word to the next. As Figure 6 shows, a syntactic relation is represented by a "Depend_On" relation from the modifier to the head word, and the type of the syntactic relation is stored as a property of the "Depend_On" edge.
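The edge layout just described can be sketched independently of any database driver. The dictionary keys (`start`, `type`, `end`, `props`) are hypothetical names chosen for this illustration, not part of Neo4J or of the patent.

```python
def sentence_to_edges(words, arcs):
    """Build AFTER and Depend_On edges for one sentence as plain dictionaries."""
    edges = []
    # Word order: an AFTER edge from each word to the one that follows it.
    for previous, following in zip(words, words[1:]):
        edges.append({"start": previous, "type": "AFTER", "end": following, "props": {}})
    # Syntax: a Depend_On edge from the modifier to its head, typed by a property.
    for head, dependent, label in arcs:
        edges.append(
            {"start": dependent, "type": "Depend_On", "end": head, "props": {"type": label}}
        )
    return edges


words = ["公共数据环境", "是", "信息源"]
arcs = [("是", "公共数据环境", "SBV"), ("是", "信息源", "VOB")]
edges = sentence_to_edges(words, arcs)
```

Keeping the dependency type as an edge property, rather than as a separate edge type per label, matches the description of the "Depend_On" edges above.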
The graph structure of the Neo4j database can further be used for knowledge reasoning. FIG. 7 shows a typical case of reasoning with Neo4j. In the sentence from Section 3.3.15 of ISO 19650 Part 1, "jointly" (共同) modifies the head word "agreed" (认可), whose hypernym is "behavior" (行为); from the ontology model it can be inferred that a "behavior" must have an "actor" (执行者). "Role" (角色) is the actor defined in Section 3.2.1 of ISO 19650 Part 1. Moreover, the second sentence in FIG. 7 gives three hyponyms of "actor": "person", "organization" and "unit". It follows that the persons, organizations and units taking part in the project construction process should reach agreement on the composition of the common data environment, because it is a shared information source. This reasoning process can be implemented with the following Cypher query:
// Find the 共同 ("jointly") node connected to the common data environment by any path
MATCH (a:公共数据环境)-[*]-(b:共同)
// Follow the trigger-word edge to the 角色 ("role") node
MATCH (b)-[:Trigger_Word]->(c:角色)
// Collect every node linked to the role by a Hyponymy edge
MATCH (c)-[:Hyponymy]-(d)
RETURN d
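For readers without a Neo4j instance, the traversal expressed by the query can be emulated on a toy in-memory graph in Python. The edge list below is hypothetical, chosen to mirror the example of FIG. 7 with English glosses in place of the Chinese labels, and the helper names `connected` and `infer_actors` are assumptions; this is a sketch of the reasoning path, not the Neo4j driver API.

```python
from collections import deque

# Hypothetical toy graph: (source, edge_type, target).
EDGES = [
    ("common data environment", "Depend_On", "is"),
    ("information source", "Depend_On", "is"),
    ("agreed", "Depend_On", "information source"),
    ("jointly", "Depend_On", "agreed"),
    ("jointly", "Trigger_Word", "role"),
    ("person", "Hyponymy", "role"),
    ("organization", "Hyponymy", "role"),
    ("unit", "Hyponymy", "role"),
]


def connected(start, goal, edges):
    """Return True if goal is reachable from start, ignoring edge direction."""
    neighbours = {}
    for source, _, target in edges:
        neighbours.setdefault(source, set()).add(target)
        neighbours.setdefault(target, set()).add(source)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def infer_actors(edges):
    """Emulate: CDE -[*]- jointly -[:Trigger_Word]-> role -[:Hyponymy]- d."""
    if not connected("common data environment", "jointly", edges):
        return set()
    if ("jointly", "Trigger_Word", "role") not in edges:
        return set()
    actors = set()
    for source, kind, target in edges:
        if kind == "Hyponymy":
            # Hyponymy is matched in either direction, as in the query.
            if target == "role":
                actors.add(source)
            elif source == "role":
                actors.add(target)
    return actors


actors = infer_actors(EDGES)
```

The sketch returns a set, whereas a Cypher query returns one row per matched path and may therefore repeat a node when several paths connect the first two nodes.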
Managing engineering infrastructure projects requires BIM information to be shared among all project participants, who need a widely accepted collaboration framework for BIM information management. The ISO 19650 series provides concepts, principles and processes for establishing a reliable shared information source. However, the five-part ISO 19650 series forms a very complex body of text, and infrastructure builders want an efficient tool that captures the semantic information in these standards so that the associations and cross-references between clauses can be discovered and inferred more easily.
Those skilled in the art will appreciate that, besides implementing the system provided by the present invention and its devices, modules and units purely as computer-readable program code, the method steps can be logically programmed so that the system and its devices, modules and units realize the same functions in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. The system and its devices, modules and units can therefore be regarded as a hardware component, and the devices, modules and units it contains for realizing various functions can be regarded as structures within that hardware component; such devices, modules and units can also be regarded both as software modules implementing the method and as structures within a hardware component.
Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to these specific embodiments; those skilled in the art can make various changes or modifications within the scope of the claims without affecting the substance of the invention. Provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with one another arbitrarily.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210355791.4A CN114742068A (en) | 2022-04-06 | 2022-04-06 | Multi-sentence correlation analysis method and system for ISO19650 standard text |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114742068A (en) | 2022-07-12 |
Family ID: 82280132
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||