CN120012729B - Method, device, terminal equipment and storage medium for generating document content based on artificial intelligence - Google Patents

Info

Publication number
CN120012729B
Authority
CN
China
Prior art keywords: sentence, text, word, sub, weight
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510502214.7A
Other languages
Chinese (zh)
Other versions
CN120012729A (en
Inventor
严海
杨宇
王竹欣
李学峰
邹骏毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhuhai Media Sunac Technology Co ltd
Original Assignee
Zhuhai Media Sunac Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhuhai Media Sunac Technology Co ltd
Priority to CN202510502214.7A
Publication of CN120012729A
Application granted
Publication of CN120012729B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention provide an artificial intelligence-based official document content generation method, device, terminal device and storage medium. The method comprises: performing sentence recognition on a first text to obtain a key sentence and a first position of the key sentence; performing keyword recognition on the key sentence to obtain a first word and a first type of the first word; replacing the first word in the first text according to the first type to obtain a second text; clustering the second text to obtain a clustering result; performing similar-text analysis on the clustering result to obtain a general sentence and a second position of the general sentence; performing keyword recognition on the general sentence to obtain a second word and a second type of the second word; extracting a first rule from the key sentence according to the first word and the first type; extracting a second rule from the general sentence according to the second word and the second type; obtaining a target keyword and a third type of an event to be announced; and generating target official document content according to the first position and the second position by combining the first rule and the second rule with the target keyword and the third type.

Description

Official document content generation method, device, terminal device and storage medium based on artificial intelligence

Technical Field

The present invention relates to the field of artificial intelligence technology, and in particular to an artificial intelligence-based official document content generation method, device, terminal device and storage medium.

Background Art

In practical application scenarios, for example in government agencies, enterprises and public institutions, official documents are an extremely important tool for conveying instructions between levels of an organization and for recording matters. Official documents carry key functions such as information transmission and decision deployment, and are subject to strict format specifications and rigorous requirements on language. However, if the content of an official document is generated with a text generation model, the randomness of the model may cause the generated document to deviate from the format specifications, making it difficult to meet the standards for page setup, typesetting and layout. The language may also fail to match the rigor expected of official documents, with problems such as inappropriate wording and casual expression. As a result, the content generated by the existing technology is inconsistent with actual needs, so the generated official documents are of low quality and cannot adequately assist actual work.

Summary of the Invention

The main purpose of the embodiments of the present invention is to provide an artificial intelligence-based official document content generation method, device, terminal device and storage medium, aiming to solve the problem in the related art that official documents produced by text generation are of low quality and cannot adequately assist actual work.

In a first aspect, an embodiment of the present invention provides an artificial intelligence-based official document content generation method, comprising:

performing sentence recognition on a first text to obtain a key sentence and a first position of the key sentence;

performing keyword recognition on the key sentence to obtain a first word and a first type of the first word;

performing word replacement on the first word in the first text according to the first type to obtain a second text;

performing text clustering according to the second text to obtain a target clustering result, and performing similar-text analysis on each sub-cluster in the target clustering result to obtain a general sentence of the sub-cluster and a second position of the general sentence;

performing keyword recognition on the general sentence to obtain a second word and a second type of the second word;

performing rule extraction on the key sentence according to the first word and the first type to obtain a first sentence rule;

performing rule extraction on the general sentence according to the second word and the second type to obtain a second sentence rule;

obtaining a target keyword of an event to be announced and a third type of the target keyword;

generating target official document content for the event to be announced according to the first position and the second position, by combining the first sentence rule and the second sentence rule with the target keyword and the third type.

In a second aspect, an embodiment of the present invention provides an artificial intelligence-based official document content generation device, comprising:

a sentence recognition module, configured to perform sentence recognition on a first text to obtain a key sentence and a first position of the key sentence;

a first word recognition module, configured to perform keyword recognition on the key sentence to obtain a first word and a first type of the first word;

a replacement processing module, configured to perform word replacement on the first word in the first text according to the first type to obtain a second text;

a cluster analysis module, configured to perform text clustering according to the second text to obtain a target clustering result, and to perform similar-text analysis on each sub-cluster in the target clustering result to obtain a general sentence of the sub-cluster and a second position of the general sentence;

a second word recognition module, configured to perform keyword recognition on the general sentence to obtain a second word and a second type of the second word;

a first rule extraction module, configured to perform rule extraction on the key sentence according to the first word and the first type to obtain a first sentence rule;

a second rule extraction module, configured to perform rule extraction on the general sentence according to the second word and the second type to obtain a second sentence rule;

a data acquisition module, configured to obtain a target keyword of an event to be announced and a third type of the target keyword;

an official document generation module, configured to generate target official document content for the event to be announced according to the first position and the second position, by combining the first sentence rule and the second sentence rule with the target keyword and the third type.

In a third aspect, an embodiment of the present invention further provides a terminal device comprising a processor, a memory, a computer program stored in the memory and executable by the processor, and a data bus for connection and communication between the processor and the memory, wherein the computer program, when executed by the processor, implements the steps of any one of the artificial intelligence-based official document content generation methods provided in this specification.

In a fourth aspect, an embodiment of the present invention further provides a storage medium for computer-readable storage, wherein the storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of any one of the artificial intelligence-based official document content generation methods provided in this specification.

Embodiments of the present invention provide an artificial intelligence-based official document content generation method, device, terminal device and storage medium. The method comprises: performing sentence recognition on a first text to obtain a key sentence and a first position of the key sentence, so that the core content of the first text can be located quickly, which supports the subsequent rule extraction; performing keyword recognition on the key sentence to obtain a first word and a first type of the first word; and performing word replacement on the first word in the first text according to the first type to obtain a second text, so that different words of the same type are given a uniform representation, which reduces the interference caused by differing words and supports the subsequent text clustering. Text clustering is then performed according to the second text to obtain a target clustering result, and similar-text analysis is performed on each sub-cluster in the target clustering result to obtain a general sentence of the sub-cluster and a second position of the general sentence. Keyword recognition is performed on the general sentence to obtain a second word and a second type of the second word. Rule extraction is performed on the key sentence according to the first word and the first type to obtain a first sentence rule, and on the general sentence according to the second word and the second type to obtain a second sentence rule. A target keyword of an event to be announced and a third type of the target keyword are obtained, and target official document content for the event to be announced is generated according to the first position and the second position by combining the first sentence rule and the second sentence rule with the target keyword and the third type. In this way, more standardized and consistent official document content can be generated, which improves the quality of the generated documents, helps relevant personnel work more efficiently, and solves the problem in the related art that official documents produced by text generation are of low quality and cannot adequately assist actual work.
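As an illustration only, the word-replacement step described above might be sketched in Python as follows. The placeholder format `<TYPE>`, the function name `replace_by_type`, and the sample texts, words and type labels are all assumptions made for this sketch; the source does not specify how replaced words are represented.

```python
def replace_by_type(text: str, typed_words: dict[str, str]) -> str:
    """Replace each recognised first word with a placeholder naming its type."""
    # Handle longer words first so a short word cannot split a longer one.
    for word in sorted(typed_words, key=len, reverse=True):
        text = text.replace(word, f"<{typed_words[word]}>")
    return text


# Hypothetical first texts with their recognised first words and first types.
first_text_a = "The Finance Bureau will hold a budget meeting on 2025-03-01."
first_text_b = "The Education Bureau will hold a safety meeting on 2025-04-15."
typed_a = {"Finance Bureau": "ORG", "budget meeting": "EVENT", "2025-03-01": "DATE"}
typed_b = {"Education Bureau": "ORG", "safety meeting": "EVENT", "2025-04-15": "DATE"}

second_text_a = replace_by_type(first_text_a, typed_a)
second_text_b = replace_by_type(first_text_b, typed_b)
print(second_text_a)
print(second_text_b)
```

Under these assumptions the two second texts become identical, which is the effect the passage relies on: texts that differ only in their specific words can fall into the same cluster.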

Brief Description of the Drawings

To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of an artificial intelligence-based official document content generation method provided by an embodiment of the present invention;

FIG. 2 is a schematic diagram of the module structure of another artificial intelligence-based official document content generation device provided by an embodiment of the present invention;

FIG. 3 is a schematic structural block diagram of a terminal device provided by an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained from them by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

The flowcharts shown in the drawings are examples only. They need not include every item and operation/step, and the operations/steps need not be executed in the order described. For example, some operations/steps may be decomposed, combined or partially merged, so the actual order of execution may change according to the circumstances.

It should be understood that the terms used in this specification are for the purpose of describing specific embodiments only and are not intended to limit the present invention. As used in this specification and the appended claims, the singular forms "a", "an" and "the" are intended to include the plural forms unless the context clearly indicates otherwise.

Embodiments of the present invention provide an artificial intelligence-based official document content generation method, device, terminal device and storage medium. The method can be applied to a terminal device, which can be an electronic device such as a tablet computer, a laptop computer, a desktop computer, a personal digital assistant or a wearable device. The terminal device can also be a server or a server cluster.

Some embodiments of the present invention are described in detail below with reference to the drawings. Where no conflict arises, the following embodiments and the features in them can be combined with each other.

Please refer to FIG. 1, which is a flowchart of an artificial intelligence-based official document content generation method provided by an embodiment of the present invention.

As shown in FIG. 1, the artificial intelligence-based official document content generation method includes steps S101 to S109.

Step S101: Perform sentence recognition on a first text to obtain a key sentence and a first position of the key sentence.

Illustratively, the first text is a historical official document text, already issued or announced, that is obtained from a database. A target title is obtained from the historical official document text, and the text is segmented into multiple sentences. The degree of association or similarity between each segmented sentence and the target title is then evaluated using cosine similarity, and the segmented sentence with the highest degree of association or similarity is determined to be the key sentence.

Illustratively, the sentence number of each segmented sentence is obtained when the historical official document text is segmented. When the segmented sentence with the highest degree of association or similarity is determined to be the key sentence, its sentence number is determined to be the first position of the key sentence.
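The title-similarity selection described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the bag-of-words vectors, the regular-expression tokenizer, the 1-based sentence numbering and the names `bag_of_words`, `cosine` and `find_key_sentence` are assumptions, since the source states only that cosine similarity is used.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts; a stand-in for whatever vectorisation is used."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_key_sentence(title: str, sentences: list[str]) -> tuple[int, str]:
    """Return (first position, key sentence) for a non-empty sentence list."""
    title_vec = bag_of_words(title)
    scores = [cosine(title_vec, bag_of_words(s)) for s in sentences]
    best = max(range(len(sentences)), key=scores.__getitem__)
    return best + 1, sentences[best]  # sentence numbers assumed to start at 1


title = "Notice on the annual fire safety inspection"
sentences = [
    "All departments are hereby notified.",
    "The annual fire safety inspection will take place next week.",
    "Please cooperate.",
]
position, key_sentence = find_key_sentence(title, sentences)
print(position, key_sentence)
```

Only the second sentence shares words with the title in this sample, so it is selected and its sentence number serves as the first position.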

In some embodiments, performing sentence recognition on the first text to obtain the key sentence and the first position of the key sentence includes: performing keyword extraction on the first text to obtain third words corresponding to the first text and word weights corresponding to the third words; performing sentence segmentation on the first text to obtain initial sentences, and determining sentence weights corresponding to the initial sentences according to the third words and the word weights; determining any one of the initial sentences as a first sentence, and obtaining from the initial sentences a second sentence adjacent to the first sentence; obtaining from the sentence weights a first weight corresponding to the first sentence and a second weight corresponding to the second sentence; adjusting the first weight according to the second weight to obtain a target weight corresponding to the first sentence; and performing sentence recognition on the initial sentences according to the target weight to obtain the key sentence, and obtaining the first position corresponding to the key sentence from the first text according to the key sentence.

Illustratively, the term frequency-inverse document frequency (TF-IDF) algorithm is used to extract keywords from the first text, yielding the third words corresponding to the first text. The frequency of each third word in the first text is obtained, and the word weight of the third word is determined from that frequency. The word weight can be further adjusted according to the position at which the third word appears.
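A rough sketch of deriving word weights from term frequency is given below. It assumes the third words have already been extracted (the TF-IDF step itself is not reproduced) and normalizes each keyword's count by the total count of all keywords; that normalization, the tokenizer and the name `word_weights` are assumptions, because the source says only that the weight is determined from the frequency.

```python
import re
from collections import Counter


def word_weights(first_text: str, third_words: list[str]) -> dict[str, float]:
    """Weight each third word by its share of all third-word occurrences."""
    counts = Counter(re.findall(r"\w+", first_text.lower()))
    freq = {w: counts[w.lower()] for w in third_words}
    total = sum(freq.values())
    return {w: (f / total if total else 0.0) for w, f in freq.items()}


text = ("Inspection teams will begin the inspection on Monday. "
        "Each team reports inspection results.")
weights = word_weights(text, ["inspection", "team", "results"])
print(weights)
```

Here "inspection" occurs three times and the other two keywords once each, so it receives the largest weight. A position-based adjustment, which the passage mentions as optional, is not modeled.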

Illustratively, the first text is divided into multiple initial sentences at sentence separators such as periods, question marks and exclamation marks. For each initial sentence, the word weights of the third words it contains are summed to give the sentence weight of that initial sentence. For example, if an initial sentence contains three keywords with word weights of 0.3, 0.2 and 0.1, its sentence weight is 0.3 + 0.2 + 0.1 = 0.6.
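The segmentation and weight accumulation just described might look like this in Python. The separator set, the substring-based containment test, counting each keyword once per sentence, and the names `split_sentences` and `sentence_weight` are assumptions made for the sketch.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split after Western or Chinese sentence-ending punctuation."""
    parts = re.split(r"(?<=[.!?。？！])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_weight(sentence: str, weights: dict[str, float]) -> float:
    """Sum the weights of the keywords the sentence contains (once each)."""
    lowered = sentence.lower()
    return sum(w for word, w in weights.items() if word.lower() in lowered)


# Hypothetical keyword weights matching the 0.3 / 0.2 / 0.1 example in the text.
keyword_weights = {"inspection": 0.3, "safety": 0.2, "department": 0.1}
initial_sentences = split_sentences(
    "Every department must pass the safety inspection. Thank you!"
)
sentence_weights = [sentence_weight(s, keyword_weights) for s in initial_sentences]
print(initial_sentences)
print([round(w, 2) for w in sentence_weights])
```

The first sentence contains all three keywords, so its weight is 0.3 + 0.2 + 0.1, which is 0.6 up to floating-point rounding; the second contains none and scores zero.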

Illustratively, one sentence at a time is selected in order from the initial sentences as the first sentence; for example, the first initial sentence is selected first. If the first sentence is the first sentence of the first text, its adjacent second sentence is the sentence immediately following it. Otherwise, its adjacent second sentence can be either the sentence before it or the sentence after it, as determined by actual needs.

Illustratively, after the first sentence and the second sentence are determined, the first weight corresponding to the first sentence and the second weight corresponding to the second sentence are obtained from the sentence weights, and a ratio coefficient is calculated from the second weight and the first weight. When the ratio coefficient is greater than or equal to a preset value, the gap between the first weight and the second weight is large, so the first weight is used directly as the target weight without being adjusted by the second weight. In this case the importance of the two sentences is already clearly distinguished, and further adjustment could distort an otherwise reasonable weight relationship and bias key-sentence recognition. When the ratio coefficient is less than the preset value, the gap between the two weights is small. The difference between the first weight and the second weight is calculated first, and an adjustment coefficient is then determined and combined with that difference to adjust the smaller of the two weights downward and the larger upward. This widens the gap between the two, so that the weight of the important sentence stands out and truly important sentences are easier to screen out during subsequent key-sentence recognition. The adjusted value is the target weight corresponding to the first sentence.
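One possible reading of this ratio-based adjustment is sketched below. The source does not define the ratio or the adjustment amount, so several choices here are assumptions: the ratio is taken as the larger weight divided by the smaller, the adjustment is the coefficient multiplied by the difference, and `ratio_threshold`, `adjust_coeff` and `adjust_first_weight` are hypothetical names and values.

```python
def adjust_first_weight(first_w: float, second_w: float,
                        ratio_threshold: float = 1.5,
                        adjust_coeff: float = 0.5) -> float:
    """Return the target weight of the first sentence."""
    smaller, larger = sorted((first_w, second_w))
    if smaller <= 0:
        # Ratio undefined; treated here as an already large gap (assumption).
        return first_w
    ratio = larger / smaller
    if ratio >= ratio_threshold:
        return first_w  # gap is already large, keep the first weight
    delta = adjust_coeff * (larger - smaller)
    # Small gap: push the larger weight up and the smaller weight down.
    return first_w + delta if first_w >= second_w else first_w - delta


print(adjust_first_weight(0.6, 0.2))            # large gap, returned unchanged
print(round(adjust_first_weight(0.6, 0.5), 3))  # small gap, first is larger
print(round(adjust_first_weight(0.5, 0.6), 3))  # small gap, first is smaller
```

With equal weights the difference is zero, so this particular sketch leaves the first weight unchanged; a real implementation would need its own rule for that case.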

Illustratively, every initial sentence is taken in turn as the first sentence and its weight is adjusted, giving a target weight for each initial sentence. A screening threshold is then determined, and the initial sentences whose target weight is greater than the threshold are determined to be key sentences.

Illustratively, the sentence number of each initial sentence is obtained when the first text is segmented. The sentence number of the key sentence is then looked up among these sentence numbers and determined to be the first position of the key sentence.
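Threshold screening and position lookup can be sketched together, as below. The 1-based sentence numbering, the threshold value and the name `select_key_sentences` are assumptions for illustration.

```python
def select_key_sentences(sentences: list[str], target_weights: list[float],
                         threshold: float) -> list[tuple[int, str]]:
    """Return (first position, key sentence) pairs above the threshold."""
    return [
        (number, sentence)
        for number, (sentence, weight) in enumerate(
            zip(sentences, target_weights), start=1)
        if weight > threshold  # strictly greater, as stated in the text
    ]


sample_sentences = ["Opening remarks.", "Inspection begins Monday.", "Report results by Friday."]
sample_target_weights = [0.2, 0.7, 0.5]
key_sentences = select_key_sentences(sample_sentences, sample_target_weights, threshold=0.4)
print(key_sentences)
```

Keeping the sentence number alongside each selected sentence means the first position does not need a second pass over the text.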

In some embodiments, adjusting the first weight according to the second weight to obtain the target weight corresponding to the first sentence includes: obtaining from the initial sentences a first sequence of sentences located to the left of the second sentence, and a second sequence of sentences located to the right of the second sentence; obtaining from the third words the associated keywords corresponding to each first sub-sentence in the first sequence of sentences, and obtaining from the word weights the relevant weights corresponding to the associated keywords; obtaining from the sentence weights a third weight corresponding to each first sub-sentence in the first sequence of sentences and a fourth weight corresponding to each second sub-sentence in the second sequence of sentences; and using the second weight to adjust the first weight, according to the relevant weights combined with the third weights and the fourth weights, to obtain the target weight corresponding to the first sentence; wherein the target weight is obtained according to the following formula:

[Formula not reproduced in the source text.]

The quantities in the formula are: the target weight corresponding to the i-th first sentence; an adjustment parameter; the first weight corresponding to the i-th first sentence; n, the number of sentences in the first sequence of sentences; y, the number of associated keywords corresponding to the h-th first sub-sentence; the relevant weight of the k-th associated keyword corresponding to the h-th first sub-sentence; the third weight corresponding to the h-th first sub-sentence; g, the number of sentences in the second sequence of sentences; the fourth weight corresponding to the t-th second sub-sentence; and the second weight of the second sentence adjacent to the i-th first sentence.

Illustratively, after the first text is segmented into multiple initial sentences, the initial sentences are stored in the order in which they appear in the first text to give a data storage result. After the second sentence corresponding to the first sentence is determined, the position of the second sentence is found in the data storage result. All sentences to the left of the second sentence are then extracted to form the first sequence of sentences, and all sentences to its right are extracted to form the second sequence of sentences.

Exemplarily, the associated keywords of each first sub-sentence in the first sequence of sentences are found by word matching against the third words. The relevant weights of these associated keywords are then looked up using the correspondence between the third words and the word weights.

Exemplarily, using the correspondence between the initial sentences and the sentence weights, the third weight of each first sub-sentence in the first sequence of sentences and the fourth weight of each second sub-sentence in the second sequence of sentences are obtained from the sentence weights.

Exemplarily, the following formula is used to adjust the first weight with the second weight, according to the relevant weights combined with the third weights and the fourth weights, to obtain the target weight corresponding to the first sentence:

where the quantities in the formula are, in order: the target weight of the i-th first sentence; the adjustment parameter; the first weight of the i-th first sentence; n, the number of sentences in the first sequence of sentences; y, the number of associated keywords of the h-th first sub-sentence; the relevant weight of the k-th associated keyword of the h-th first sub-sentence; the third weight of the h-th first sub-sentence; g, the number of sentences in the second sequence of sentences; the fourth weight of the t-th second sub-sentence; and the second weight of the second sentence that is adjacent to the i-th first sentence.

Exemplarily, the adjustment parameter is a preset value that controls the degree of adjustment. Using the weight information of the sentence sequences to the left and right of the second sentence, together with the relevant weights of the associated keywords, the second weight of the second sentence serves to fuse the first sentence with the complete information of the first text. The target weight of the first sentence is thereby obtained more accurately, which provides better support for the subsequent extraction of key sentences.

Specifically, when the target weight of the first sentence is calculated, the first sentence is fused with the complete information of the first text by way of the second sentence. This makes the weight adjustment more comprehensive and flexible and further improves the reliability and accuracy of the weight calculation.

Step S102: perform keyword recognition on the key sentence to obtain a first word and a first type of the first word.

Exemplarily, a named entity recognition model is applied to the key sentence to obtain a word type identifier for each word in the key sentence, and the first word of the key sentence and the first type of that first word are determined from the word type identifiers.

For example, the named entity recognition model is a deep-learning model that combines a bidirectional long short-term memory network with a conditional random field (BiLSTM-CRF). A pre-trained model is fine-tuned on a labeled data set that contains a large number of sentences in which every word is labeled with its word type, such as person name, place name, organization name, date, or other. The model parameters are adjusted continuously during training to improve the accuracy of word type recognition. The key sentence is then input into the trained named entity recognition model, which analyzes each word and outputs the corresponding word type identifier. Words whose word type identifier is person name, place name, organization name, or date are determined to be first words, and the word type identifier of each first word is determined to be its first type.

Step S103: perform word replacement on the first word in the first text according to the first type to obtain a second text.

Exemplarily, the first types include person name, place name, organization name, time, and so on, and a generic target replacement word is defined for each first type. For example, person names can be uniformly replaced with "某人" (someone) and place names with "某地" (somewhere). The first types and their target replacement words are organized into a replacement lexicon for later lookup and use.

Exemplarily, the named entity recognition model described above is applied to the first text to identify the first words and their first types. The target replacement word for each first type is looked up in the replacement lexicon, and the corresponding first word in the first text is replaced with it to obtain the second text. Because the first words are replaced with generic expressions, the second text applies to a wider range of scenarios and provides good support for the subsequent text clustering.

For example, for the first text "张三昨天去了北京" (Zhang San went to Beijing yesterday), named entity recognition determines that "张三" (Zhang San) is a person name and "北京" (Beijing) is a place name. The corresponding target replacement words are looked up in the replacement lexicon: "某人" (someone) for the person name and "某地" (somewhere) for the place name. The second text after replacement is therefore "某人昨天去了某地" (someone went to somewhere yesterday).
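The replacement in this example can be sketched as below. The sketch assumes the entity list has already been produced by a named entity recognition model; the type labels `PERSON` and `LOCATION`, the dictionary `REPLACEMENTS`, and the function `replace_entities` are illustrative names, not part of the patent.

```python
# Hypothetical replacement lexicon: first type -> generic target replacement word.
REPLACEMENTS = {"PERSON": "某人", "LOCATION": "某地"}


def replace_entities(text, entities):
    """Replace each recognized first word with the generic word for its type.

    entities is a list of (word, type) pairs, assumed to come from an NER model.
    """
    for word, entity_type in entities:
        target = REPLACEMENTS.get(entity_type)
        if target is not None:
            # str.replace substitutes every occurrence of the word.
            text = text.replace(word, target)
    return text


first_text = "张三昨天去了北京"
entities = [("张三", "PERSON"), ("北京", "LOCATION")]  # stand-in for NER output
second_text = replace_entities(first_text, entities)
print(second_text)  # 某人昨天去了某地
```

A production version would more likely replace by character offsets returned by the recognizer, so that a word that also occurs as a non-entity is left untouched.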

Step S104: perform text clustering according to the second text to obtain a target clustering result, and perform similar text analysis on each sub-cluster in the target clustering result to obtain a common sentence of the sub-cluster and a second position of the common sentence.

Exemplarily, word embeddings such as Word2Vec or GloVe are used to map the words of the second text into a low-dimensional vector space, yielding a word embedding vector for each word. Each second text is then represented by the average or weighted average of the embedding vectors of its words. Once every second text has been converted into a feature vector, the vectors are combined into a feature matrix in which each row represents a text and each column represents a feature. A clustering algorithm such as hierarchical clustering is then applied to the feature matrix to obtain the target clustering result.
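The construction of the feature matrix can be sketched as follows, using the unweighted average. The two-dimensional vectors in `EMBEDDINGS` are toy values chosen for illustration only; in practice they would come from a trained Word2Vec or GloVe model, and the function name `text_vector` is hypothetical.

```python
# Toy 2-dimensional word embeddings, purely illustrative.
EMBEDDINGS = {
    "someone": [1.0, 0.0],
    "went": [0.0, 1.0],
    "somewhere": [1.0, 1.0],
}


def text_vector(tokens, embeddings):
    """Represent a text by the average of the embedding vectors of its known words."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("text contains no word with a known embedding")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


# Each second text is assumed to be tokenized already.
texts = [["someone", "went", "somewhere"], ["someone", "went"]]

# One row per text, one column per feature.
feature_matrix = [text_vector(tokens, EMBEDDINGS) for tokens in texts]
```

The resulting `feature_matrix` is the input that a hierarchical clustering routine would consume.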

Exemplarily, because generic words have already been used in the second text to unify words of the different types, the key words are already unified when the second texts are clustered. As a result, the texts contained in each sub-cluster of the target clustering result tend to have the same or similar meanings, and similar text analysis on each sub-cluster can therefore summarize the common sentence corresponding to that sub-cluster.

Exemplarily, for the sub-texts in each sub-cluster of the target clustering result, the similarity between the sentences of any two sub-texts is calculated to obtain the most similar sentences between those two sub-texts. All most similar sentences between every pair of sub-texts in the sub-cluster are collected and subjected to intersection processing, and the sentence that occurs most often in the intersection result is determined to be the common sentence of the sub-cluster.

For example, for any two sub-texts in a sub-cluster, each sentence of one sub-text is compared with every sentence of the other. If sub-text A has 3 sentences and sub-text B has 4 sentences, 3 × 4 = 12 similarity calculations are required. Among these 12 results, the sentence pairs with the highest similarity are the most similar sentences between the two sub-texts. The most similar sentences between every pair of sub-texts in the sub-cluster are collected into a set. Intersection processing is then performed on this set, that is, the sentences that appear in all of the most similar sentence combinations are found; the common part can be filtered out step by step by repeatedly comparing the sentences of different combinations. The number of occurrences of each sentence in the intersection result is counted, and the sentence with the most occurrences is determined to be the common sentence of the sub-cluster. This common sentence represents the core semantics shared by most sub-texts in the sub-cluster.

Exemplarily, the positions at which the common sentence appears in each sub-text of the sub-cluster are obtained and summarized to produce the second position of the common sentence. The second position may be the position at which the common sentence appears most often in the sub-texts, or the set of all positions at which it appears.
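Both options for the second position can be sketched as below. The sketch assumes a position is the zero-based sentence index of the common sentence within a sub-text; that encoding, the function name `second_position`, and the `mode` parameter are assumptions made for illustration.

```python
from collections import Counter


def second_position(positions, mode="most_common"):
    """Summarize where the common sentence appears across the sub-texts.

    positions holds one sentence index per occurrence of the common sentence.
    """
    if mode == "most_common":
        # The position at which the common sentence appears most often.
        return Counter(positions).most_common(1)[0][0]
    # Otherwise, the set of all positions, returned as a sorted list.
    return sorted(set(positions))


# Hypothetical positions of the common sentence in five sub-texts.
positions = [0, 2, 0, 1, 0]
```

With these sample positions, the first option yields index 0 and the second yields the positions 0, 1, and 2.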

In some embodiments, performing text clustering according to the second text to obtain the target clustering result includes: performing keyword recognition on the second text to obtain the text keywords corresponding to the second text; merging the text keywords to obtain all keywords, and calculating the similarity between any two of the keywords to obtain relevant similarity values; classifying all the keywords according to the relevant similarity values to obtain a first phrase and a second phrase; performing cluster classification according to the first phrase using the text keywords corresponding to the first text to obtain a first classification result, and obtaining, from the first classification result, a first central word corresponding to each first sub-classification result and a first central weight corresponding to the first central word; performing cluster classification according to the second phrase using the text keywords corresponding to the first text to obtain a second classification result, and obtaining, from the second classification result, a second central word corresponding to each second sub-classification result and a second central weight corresponding to the second central word; determining, according to the first central words and the second central words combined with the first central weights and the second central weights, a first associated word corresponding to each first central word among the second central words and a second associated word corresponding to each second central word among the first central words; matching the first classification result and the second classification result according to the first associated words and the second associated words to obtain the fusion similarity corresponding to each of multiple fused phrases; calculating the text similarity between the second texts according to the fusion similarities and the fused phrases; and performing text clustering on the second texts according to the text similarity to obtain the target clustering result.

Exemplarily, the second text is the first text after target keyword processing. Keyword recognition is performed on the second text with the term frequency-inverse document frequency (TF-IDF) algorithm to obtain the text keywords corresponding to the second text. All recognized text keywords are then gathered together and duplicates are removed to form all keywords.
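A minimal TF-IDF sketch of this step follows. It assumes each second text is already tokenized, uses the plain formulation tf × log(N / df) without smoothing, and keeps the top `top_k` words per text; the function name, `top_k`, and the sample documents are illustrative assumptions rather than the patent's own parameters.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_k=2):
    """Return the top_k TF-IDF keywords of each tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each word.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    results = []
    for doc in documents:
        tf = Counter(doc)
        scores = {
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_k])
    return results


docs = [["someone", "signed", "contract"], ["someone", "visited", "somewhere"]]
keywords = tfidf_keywords(docs)

# Merge the per-text keywords and remove duplicates to form all keywords.
all_keywords = sorted({word for kws in keywords for word in kws})
```

Because "someone" occurs in every document, its inverse document frequency is zero, so it is ranked below the words that distinguish each text.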

Exemplarily, a similarity measure such as edit distance or cosine similarity is used to calculate the similarity between any two of the keywords, giving the relevant similarity values. Keywords whose relevant similarity value is greater than a preset threshold are grouped into one class, and keywords whose relevant similarity value is less than or equal to the preset threshold are grouped into another, forming the first phrase and the second phrase.

Exemplarily, taking the first phrase as the basis and combining it with the text keywords corresponding to the second text, the similarity between the text keywords and the first phrase is calculated, and the second texts are cluster-classified according to the similarity results to obtain the first classification result. Likewise, taking the second phrase as the basis and combining it with the text keywords corresponding to the second text, the similarity between the text keywords and the second phrase is calculated, and the second texts are cluster-classified according to the similarity results to obtain the second classification result.

Exemplarily, for each first sub-classification result in the first classification result, the overall importance of the keywords in that sub-classification is calculated, for example by combining factors such as word frequency and position within the sub-classification, to determine the first central word of the first sub-classification result, and a first central weight is assigned according to that importance. Likewise, for each second sub-classification result in the second classification result, the overall importance of the keywords in that sub-classification is calculated in the same way to determine the second central word of the second sub-classification result, and a second central weight is assigned according to that importance.

Exemplarily, the first central words and the second central words are compared, taking the first central weights and the second central weights into account, to find, for each first central word, the first associated word among the second central words that is most closely associated with it, and, for each second central word, the corresponding second associated word among the first central words.

Exemplarily, similarities are calculated from the first associated words and the second associated words. The first sub-classification result and the second sub-classification result that give the maximum similarity are matched, and the first central word of that first sub-classification result and the second central word of that second sub-classification result form a fused phrase. The first associated word and second associated word belonging to the fused phrase just formed are then deleted, the maximum of the remaining similarity results is taken to form the next fused phrase, and so on until multiple fused phrases have been obtained. For each fused phrase, the similarity between the words it contains and their corresponding weights are considered together to determine its fusion similarity.
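The repeated "take the maximum, delete, continue" matching can be sketched as a greedy pairing. The sketch assumes the pairwise similarities are already available as a dictionary keyed by (first sub-classification id, second sub-classification id); the identifiers, the scores, and the function name `greedy_match` are hypothetical.

```python
def greedy_match(similarity):
    """Greedily pair first and second sub-classification results.

    similarity maps (first_id, second_id) -> similarity score.
    Returns a list of (first_id, second_id, score) in matching order.
    """
    remaining = dict(similarity)
    pairs = []
    while remaining:
        # Take the pair with the largest remaining similarity.
        (first_id, second_id), score = max(remaining.items(), key=lambda item: item[1])
        pairs.append((first_id, second_id, score))
        # Delete every entry that involves either matched result.
        remaining = {
            key: value
            for key, value in remaining.items()
            if key[0] != first_id and key[1] != second_id
        }
    return pairs


# Hypothetical similarities between two first and two second sub-classification results.
sim = {
    ("A1", "B1"): 0.9,
    ("A1", "B2"): 0.4,
    ("A2", "B1"): 0.8,
    ("A2", "B2"): 0.3,
}
fused = greedy_match(sim)
```

In this example A1 is paired with B1 first, which leaves A2 and B2 to form the second fused phrase even though A2 was more similar to B1. Greedy matching does not guarantee the best total similarity, which is one reason to treat this only as a sketch of the described procedure.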

Exemplarily, a pre-trained language model is used to calculate the similarity between the semantic vectors of a fused phrase and a second text, and this similarity serves as the data association degree. The data association degree between each second text and each fused phrase is multiplied by the fusion similarity of that fused phrase, and the products are summed over all fused phrases. The result is the text similarity of the second text over all fused phrases. The calculated text similarities are then used as input to a clustering algorithm that clusters the second texts to obtain the target clustering result.
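The multiply-and-sum step described above can be sketched as follows. The association degrees would come from a pre-trained language model; here they are supplied as plain numbers, and all values and the function name `text_similarity` are assumptions for illustration.

```python
def text_similarity(associations, fusion_similarities):
    """Sum, over all fused phrases, of association degree times fusion similarity."""
    if len(associations) != len(fusion_similarities):
        raise ValueError("one association degree is needed per fused phrase")
    return sum(a * s for a, s in zip(associations, fusion_similarities))


# Hypothetical values for one second text and two fused phrases.
associations = [0.5, 0.2]          # data association degree per fused phrase
fusion_similarities = [0.8, 0.5]   # fusion similarity per fused phrase
score = text_similarity(associations, fusion_similarities)
```

For these values the score is 0.5 × 0.8 + 0.2 × 0.5 = 0.5.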

Specifically, the process of determining associated words and fused phrases helps uncover deep associations between texts. It goes beyond surface-level vocabulary matching to find latent semantic connections, providing more valuable information for the subsequent text clustering.

In some embodiments, calculating the text similarity between the second texts according to the fusion similarities and the fused phrases includes: determining a first sub-text and a second sub-text from the second texts; obtaining a first frequency of each sub-keyword of the fused phrase in the first sub-text and a second frequency of each sub-keyword in the second sub-text; obtaining, from the first classification result, a third sub-classification result corresponding to the fused phrase, and obtaining, from the second classification result, a fourth sub-classification result corresponding to the fused phrase; obtaining a first number of keywords in the third sub-classification result and a second number of keywords in the fourth sub-classification result; and combining the first frequency, the second frequency, the first number, and the second number with the fusion similarity of the fused phrase to obtain the text similarity between the first sub-text and the second sub-text; wherein the text similarity is obtained according to the following formula:

wherein the quantities in the formula are, in order: the text similarity between the i-th first sub-text and the j-th second sub-text; num2, the number of fused phrases; num1, the number of sub-keywords in the q-th fused phrase; the first frequency of the r-th sub-keyword of the q-th fused phrase in the i-th first sub-text; the second frequency of the r-th sub-keyword of the q-th fused phrase in the j-th second sub-text; the first number of keywords in the third sub-classification result corresponding to the q-th fused phrase; the second number of keywords in the fourth sub-classification result corresponding to the q-th fused phrase; and the fusion similarity of the q-th fused phrase.

Exemplarily, any two sub-texts are selected from the second texts and defined as the first sub-text and the second sub-text. They may be selected in sequence or at random, depending on actual needs.

Exemplarily, for each sub-keyword of the fused phrase, word frequencies are counted in the first sub-text and in the second sub-text. The first sub-text is scanned word by word and the number of occurrences of each sub-keyword is recorded as the first frequency; the second sub-text is scanned in the same way and the number of occurrences of each sub-keyword is recorded as the second frequency.

Exemplarily, the sub-classification result corresponding to the current fused phrase is found in the first classification result and determined to be the third sub-classification result. Likewise, the sub-classification result corresponding to the fused phrase is found in the second classification result and determined to be the fourth sub-classification result.

Exemplarily, the keywords in the third sub-classification result are counted to obtain the first number, and the keywords in the fourth sub-classification result are counted to obtain the second number.

Exemplarily, the first frequency, the second frequency, the first number, the second number, and the fusion similarity of the fused phrase obtained above are substituted into the following formula to calculate the text similarity:

where the quantities in the formula are, in order: the text similarity between the i-th first sub-text and the j-th second sub-text; num2, the number of fused phrases; num1, the number of sub-keywords in the q-th fused phrase; the first frequency of the r-th sub-keyword of the q-th fused phrase in the i-th first sub-text; the second frequency of the r-th sub-keyword of the q-th fused phrase in the j-th second sub-text; the first number of keywords in the third sub-classification result corresponding to the q-th fused phrase; the second number of keywords in the fourth sub-classification result corresponding to the q-th fused phrase; and the fusion similarity of the q-th fused phrase.

Exemplarily, the text similarity is calculated according to the above formula. After the first sub-text and the second sub-text have been mined in depth, the extracted text information is contained in the fused phrases and the sub-classification results. When the text similarity between the first sub-text and the second sub-text is calculated, the proportion of the number of words in a fused phrase to the total number of text keywords is used as the weight. This approach reduces unnecessary computation to some extent while still taking the semantic information of words into account, so it captures the similarity relationships between words better.

Specifically, analyzing the fused phrases and sub-keywords and combining the information in the different classification results allows the semantic information of the text to be mined in depth. For example, the frequency of a sub-keyword reflects its importance in the text, and the number of keywords in the different classification results reflects the characteristics of the text under different classification schemes. Both help to capture the semantic content of the text. An accurate text similarity in turn provides a good basis for the subsequent text clustering: semantically similar texts are grouped together more accurately, the quality of the clustering result improves, and the resulting categories are more representative and better separated.

In some embodiments, performing similar text analysis on each sub-cluster in the target clustering result to obtain the common sentence of the sub-cluster and the second position of the common sentence includes: obtaining a third sub-text and a fourth sub-text in the sub-cluster, performing text segmentation on the third sub-text to obtain a first segmentation result, and performing text segmentation on the fourth sub-text to obtain a second segmentation result; obtaining a third sub-sentence from the first segmentation result and a fourth sub-sentence from the second segmentation result; performing keyword recognition on the third sub-sentence to obtain first keywords, and performing keyword recognition on the fourth sub-sentence to obtain second keywords; performing intersection processing on the first keywords and the second keywords to obtain the same keywords shared by the third sub-sentence and the fourth sub-sentence and the target number of the same keywords; counting the first keywords to obtain a third number, and taking the logarithm of the third number to obtain a first result; counting the second keywords to obtain a fourth number, and taking the logarithm of the fourth number to obtain a second result; summing the first result and the second result to obtain a target result, and calculating the ratio of the target number to the target result to obtain the sentence similarity between the third sub-sentence and the fourth sub-sentence; determining the similar sentences between the third sub-text and the fourth sub-text according to the sentence similarity; performing data statistics on the similar sentences over the sub-cluster to obtain the common sentence corresponding to the sub-cluster; and searching the sub-cluster for the position of the common sentence to obtain the second position corresponding to the common sentence.

Exemplarily, two sub-texts are randomly selected from each sub-cluster of the target clustering result as the third sub-text and the fourth sub-text. The third sub-text is split into independent parts at punctuation marks (period, exclamation mark, question mark, and so on), line breaks, or specific delimiters to obtain the first segmentation result. The fourth sub-text is split in the same way to obtain the second segmentation result.

Exemplarily, the third sub-sentence is extracted from the first segmentation result and the fourth sub-sentence from the second segmentation result. The third and fourth sub-sentences are usually short sentences with complete semantics. A keyword recognition method, such as one based on word frequency statistics or a graph-based ranking algorithm, is applied to the third sub-sentence to find its representative and important words, giving the first keywords. The same keyword recognition method is applied to the fourth sub-sentence to obtain the second keywords.

Exemplarily, the first keywords and the second keywords are compared to find the keywords they have in common. These shared keywords are the same keywords of the third sub-sentence and the fourth sub-sentence, and their count is the target number. The first keywords are counted to obtain the third number, and the logarithm of the third number is the first result. The second keywords are counted to obtain the fourth number, and the logarithm of the fourth number is the second result.

Exemplarily, the first result and the second result are added to obtain the target result, and the target number is divided by the target result to obtain the sentence similarity between the third sub-sentence and the fourth sub-sentence.
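The similarity calculation described above (shared keyword count divided by the sum of the logarithms of the two keyword counts) can be sketched as follows. The function name, the use of the natural logarithm, and the handling of degenerate inputs are assumptions: the description does not state the logarithm base or what happens when the denominator is zero.

```python
import math


def sentence_similarity(first_keywords: set[str], second_keywords: set[str]) -> float:
    """Shared keyword count divided by log(|first|) + log(|second|)."""
    target_number = len(first_keywords & second_keywords)
    third_number = len(first_keywords)
    fourth_number = len(second_keywords)
    if third_number == 0 or fourth_number == 0:
        return 0.0  # assumption: no keywords means no measurable similarity
    target_result = math.log(third_number) + math.log(fourth_number)
    if target_result == 0.0:
        # Both sub-sentences have exactly one keyword, so log(1) + log(1) = 0.
        # Assumption: fall back to the raw shared count (1.0 or 0.0).
        return float(target_number)
    return target_number / target_result


similarity = sentence_similarity(
    {"学校", "收费", "禁止"},
    {"学校", "收费", "通知", "发布"},
)
```

Here two keywords are shared, so the value is 2 / (ln 3 + ln 4) = 2 / ln 12.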

Exemplarily, a similarity threshold is set. When the sentence similarity between a third sub-sentence and a fourth sub-sentence is greater than the threshold, the two sub-sentences are regarded as similar sentences. Every third sub-sentence in the third sub-text is compared in this way with every fourth sub-sentence in the fourth sub-text, yielding the similar sentences between the third sub-text and the fourth sub-text.
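The pairwise comparison against a threshold might look like the sketch below. The similarity function is passed in as a parameter; `toy_similarity` is a hypothetical stand-in (character overlap) used only to make the example runnable, not the keyword-based measure of the description.

```python
from itertools import product
from typing import Callable


def find_similar_sentences(
    third_sub_sentences: list[str],
    fourth_sub_sentences: list[str],
    similarity: Callable[[str, str], float],
    threshold: float,
) -> list[tuple[str, str]]:
    """Return every cross-text pair whose similarity exceeds the threshold."""
    return [
        (s3, s4)
        for s3, s4 in product(third_sub_sentences, fourth_sub_sentences)
        if similarity(s3, s4) > threshold
    ]


def toy_similarity(a: str, b: str) -> float:
    # Hypothetical placeholder: share of distinct characters in common.
    chars_a, chars_b = set(a), set(b)
    return len(chars_a & chars_b) / max(len(chars_a | chars_b), 1)


similar_pairs = find_similar_sentences(
    ["特此通知", "请遵照执行"],
    ["特此通知", "现予公布"],
    toy_similarity,
    threshold=0.5,
)
```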

Exemplarily, statistics are computed over all similar sentences in the sub-cluster, for example the frequency of occurrence of each similar sentence. Similar sentences that occur more frequently can be determined to be the general sentences of the sub-cluster.
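A frequency count of this kind can be sketched with `collections.Counter`. The cut-off `min_count` is an assumed parameter; the description only says that sentences with a higher frequency are selected.

```python
from collections import Counter


def select_general_sentences(similar_sentences: list[str], min_count: int = 2) -> list[str]:
    """Return the similar sentences seen at least min_count times, most frequent first."""
    counts = Counter(similar_sentences)
    return [sentence for sentence, count in counts.most_common() if count >= min_count]


general_sentences = select_general_sentences(
    ["特此通知", "请遵照执行", "特此通知", "特此通知", "请遵照执行", "另行通知"]
)
```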

Exemplarily, each text in the sub-cluster is searched for occurrences of the general sentence, and the positions found are recorded as the second position corresponding to the general sentence.
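The position search can be sketched as below. Representing a position as a pair (text index, sentence index) and representing each text as a list of already segmented sentences are assumptions made for the example; the description does not fix a representation.

```python
def locate_general_sentence(
    cluster_texts: list[list[str]], general_sentence: str
) -> list[tuple[int, int]]:
    """Return (text index, sentence index) for every occurrence in the sub-cluster."""
    positions = []
    for text_index, sentences in enumerate(cluster_texts):
        for sentence_index, sentence in enumerate(sentences):
            if sentence == general_sentence:
                positions.append((text_index, sentence_index))
    return positions


second_positions = locate_general_sentence(
    [
        ["各单位", "【机构名】禁止乱收费", "特此通知"],
        ["各部门", "特此通知"],
    ],
    "特此通知",
)
```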

Specifically, careful text processing and similarity calculation make it possible to accurately identify what the texts in a sub-cluster have in common and to extract general sentences. These general sentences reflect the fixed wording shared by the texts of the sub-cluster and are unrelated to the core event that the official document content needs to convey, which provides good support for subsequently generating official document content that must meet certain requirements or specifications.

Step S105: Perform keyword recognition on the general sentence to obtain a second word and a second type of the second word.

Exemplarily, the named entity recognition model is a deep-learning-based bidirectional long short-term memory network combined with a conditional random field (BiLSTM-CRF). The model performs sentence recognition on the general sentence to obtain a word type identifier for each word in the general sentence, and the second word of the general sentence and the second type of that second word are determined from the word type identifiers.

Step S106: Perform rule extraction on the key sentence according to the first word and the first type to obtain a first sentence rule.

Exemplarily, a machine learning classification algorithm performs event classification on the key sentence to obtain the first event type of the key sentence. The first event association between the first type and the first event type is then determined from the first word. The first word in the key sentence is replaced with the first type to obtain the first rule, and the first sentence rule of the key sentence is determined from the first event association and the first rule.

For example, event-related features are extracted from the key sentence, such as the subject of the event (a person name, institution name, or company name), the action, the time, and the place. The first word in the key sentence is replaced with its first type, and the first type is marked with square brackets or double quotation marks to distinguish it from the other content. The relationship between the first type represented by the first word and the determined event type is then established. For instance, if the first word is "school" and the event type is "education event", the type of "school" is closely associated with "education event", and the event relationship between the first type and the event type is determined to be that of event subject. If the first word is "school" (学校) and the first type is "institution name" (机构名), the key sentence "schools are prohibited from charging arbitrary fees" (学校禁止乱收费) becomes "[institution name] is prohibited from charging arbitrary fees" (【机构名】禁止乱收费).
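The replacement of words by their bracketed types can be sketched as follows. The function name `to_sentence_template` and the word-to-type dictionary are assumptions for illustration; in the described method the words and types would come from the named entity recognition step.

```python
import re


def to_sentence_template(key_sentence: str, typed_words: dict[str, str]) -> str:
    """Replace each recognized word with its type marked in 【】 brackets."""
    if not typed_words:
        return key_sentence
    # Longest words first so that a short word never pre-empts a longer one;
    # a single regex pass avoids re-replacing text inside inserted labels.
    alternatives = sorted(typed_words, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(word) for word in alternatives))
    return pattern.sub(lambda match: f"【{typed_words[match.group(0)]}】", key_sentence)


first_rule = to_sentence_template("学校禁止乱收费", {"学校": "机构名"})
```

With the example from the description, `first_rule` is `【机构名】禁止乱收费`.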

Step S107: Perform rule extraction on the general sentence according to the second word and the second type to obtain a second sentence rule.

Exemplarily, a machine learning classification algorithm performs event classification on the general sentence to obtain the second event type of the general sentence. The second event association between the second type and the second event type is then determined from the second word. The second word in the general sentence is replaced with the second type to obtain the second rule, and the second sentence rule of the general sentence is determined from the second event association and the second rule.

It should be noted that general sentences are the stock sentences commonly used when official document content is issued and are usually unrelated to the event being announced, whereas key sentences carry the event information of the official document content.

Step S108: Obtain a target keyword of the event to be announced and a third type of the target keyword.

Exemplarily, the event to be announced that the target user needs to publish is obtained. A named entity recognition model is applied to the event to be announced to obtain the word type of each word in it, and the target keyword of the event to be announced and the third type of the target keyword are obtained from the word types.

Step S109: Generate the target official document content of the event to be announced from the target keyword and the third type, in combination with the first sentence rule and the second sentence rule, according to the first position and the second position.

Exemplarily, similarity is calculated between the event to be announced and the event types corresponding to the first sentence rules and the second sentence rules, so that the first association rule and the second association rule corresponding to the event to be announced are selected from the first sentence rules and the second sentence rules.

Exemplarily, the event relationship between the target keyword and the event to be announced is determined. Based on the event relationship and the third type, the target keyword is used with the first association rule to generate the corresponding target key sentence, and with the second association rule to generate the corresponding target general sentence. The target official document content of the event to be announced is then generated according to the first position of the first association rule in the first text and the second position of the second association rule in the first text.

In some embodiments, generating the target official document content of the event to be announced from the target keyword and the third type, in combination with the first sentence rule and the second sentence rule, according to the first position and the second position comprises: obtaining a first generated sentence from the target keyword and the third type in combination with the first sentence rule; obtaining a second generated sentence from the target keyword and the third type in combination with the second sentence rule; merging the first generated sentence and the second generated sentence according to the first position and the second position to obtain a plurality of initial official document contents; performing text quality assessment on the initial official document contents with a quality assessment model to obtain the target text quality of each initial official document content; and selecting the target official document content of the event to be announced from the initial official document contents according to the target text quality.

Exemplarily, similarity is calculated between the event to be announced and the event types corresponding to the first sentence rules and the second sentence rules, so that the first association rule and the second association rule corresponding to the event to be announced are selected from the first sentence rules and the second sentence rules.

Exemplarily, the event relationship between the target keyword and the event to be announced is determined. Based on the event relationship and the third type, the target keyword is used with the first association rule to generate the corresponding first generated sentence, and with the second association rule to generate the corresponding second generated sentence.
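Generating a sentence from a rule can be pictured as the inverse of the earlier word-to-type replacement: each bracketed type in the rule is filled with the target keyword of that type. The sketch below assumes rules are stored as strings with 【type】 slots and that the target keywords are available as a type-to-keyword dictionary; both are illustrative assumptions, and the event relationship check is omitted.

```python
import re

# Matches a slot such as 【机构名】 and captures the type label inside it.
_SLOT = re.compile(r"【([^】]+)】")


def fill_sentence_rule(sentence_rule: str, keywords_by_type: dict[str, str]) -> str:
    """Fill each 【type】 slot with the keyword of that type; leave unknown slots as-is."""
    return _SLOT.sub(
        lambda match: keywords_by_type.get(match.group(1), match.group(0)),
        sentence_rule,
    )


first_generated_sentence = fill_sentence_rule("【机构名】禁止乱收费", {"机构名": "培训机构"})
```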

Exemplarily, the first position and the second position indicate the order and manner in which the first generated sentence and the second generated sentence are combined. The first generated sentence and the second generated sentence are merged as the first position and the second position require, yielding the initial merged content. A text generation model then expands the initial merged content to obtain a plurality of initial official document contents. The text generation model may be a neural-network-based model or a deep-learning-based model.
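The merging step can be sketched as a sort by position followed by concatenation. Treating each position as a single integer index is an assumption made for the example; the expansion by a text generation model is not shown.

```python
def merge_by_position(positioned_sentences: list[tuple[int, str]], separator: str = "") -> str:
    """Order generated sentences by their position and join them into one text."""
    ordered = sorted(positioned_sentences, key=lambda item: item[0])
    return separator.join(sentence for _, sentence in ordered)


initial_merged_content = merge_by_position(
    [
        (2, "特此通知。"),           # second generated sentence at its second position
        (0, "各单位："),
        (1, "培训机构禁止乱收费。"),  # first generated sentence at its first position
    ]
)
```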

Exemplarily, a model capable of comprehensively evaluating text quality is constructed. The model can take several factors into account, such as grammatical correctness, semantic coherence, logical soundness, and completeness of information. Each initial official document content is input into the quality assessment model and scored against the model's evaluation indicators and standards, yielding the target text quality of each initial official document content.

Exemplarily, a threshold on the target text quality is set, and the initial official document contents whose target text quality is greater than the threshold are selected as the target official document content of the event to be announced.
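The threshold filter can be sketched as follows, assuming each initial official document content has already been scored by some quality assessment model. The scores and the threshold value below are made up for illustration.

```python
def select_target_contents(
    scored_contents: list[tuple[str, float]], quality_threshold: float
) -> list[str]:
    """Keep the candidate contents whose quality score exceeds the threshold."""
    return [content for content, quality in scored_contents if quality > quality_threshold]


target_contents = select_target_contents(
    [("草稿A", 0.92), ("草稿B", 0.55), ("草稿C", 0.81)],
    quality_threshold=0.8,
)
```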

Referring to FIG. 2, FIG. 2 shows an artificial-intelligence-based official document content generation apparatus 200 provided in an embodiment of the present application. The apparatus 200 includes a sentence recognition module 201, a first word recognition module 202, a replacement processing module 203, a cluster analysis module 204, a second word recognition module 205, a first rule extraction module 206, a second rule extraction module 207, a data acquisition module 208, and an official document generation module 209, wherein:

the sentence recognition module 201 is configured to perform sentence recognition on a first text to obtain a key sentence and a first position of the key sentence;

the first word recognition module 202 is configured to perform keyword recognition on the key sentence to obtain a first word and a first type of the first word;

the replacement processing module 203 is configured to perform word replacement on the first word in the first text according to the first type to obtain a second text;

the cluster analysis module 204 is configured to perform text clustering according to the second text to obtain a target clustering result, and to perform similar-text analysis on each sub-cluster in the target clustering result to obtain a general sentence of the sub-cluster and a second position of the general sentence;

the second word recognition module 205 is configured to perform keyword recognition on the general sentence to obtain a second word and a second type of the second word;

the first rule extraction module 206 is configured to perform rule extraction on the key sentence according to the first word and the first type to obtain a first sentence rule;

the second rule extraction module 207 is configured to perform rule extraction on the general sentence according to the second word and the second type to obtain a second sentence rule;

the data acquisition module 208 is configured to obtain a target keyword of an event to be announced and a third type of the target keyword;

the official document generation module 209 is configured to generate the target official document content of the event to be announced from the target keyword and the third type, in combination with the first sentence rule and the second sentence rule, according to the first position and the second position.

In some embodiments, the artificial-intelligence-based official document content generation apparatus 200 can be applied to a terminal device.

It should be noted that, as those skilled in the art will clearly understand, for convenience and brevity of description, the specific working process of the artificial-intelligence-based official document content generation apparatus 200 described above can be found in the corresponding process of the foregoing method embodiment and is not repeated here.

Referring to FIG. 3, FIG. 3 is a schematic structural block diagram of a terminal device provided in an embodiment of the present invention.

As shown in FIG. 3, the terminal device 300 includes a processor 301 and a memory 302, which are connected via a bus 303, for example an I2C (Inter-Integrated Circuit) bus.

Specifically, the processor 301 provides computing and control capabilities and supports the operation of the entire terminal device. The processor 301 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.

Specifically, the memory 302 may be a Flash chip, a read-only memory (ROM), a magnetic disk, an optical disc, a USB flash drive, a removable hard disk, or the like.

Those skilled in the art will understand that the structure shown in FIG. 3 is merely a block diagram of the part of the structure related to the solution of the embodiment of the present invention and does not limit the terminal device to which the solution is applied. A specific terminal device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

The processor is configured to run a computer program stored in the memory and, when executing the computer program, to implement any one of the artificial-intelligence-based official document content generation methods provided by the embodiments of the present invention.

In one embodiment, the processor is configured to run a computer program stored in the memory and, when executing the computer program, to implement the following steps:

performing sentence recognition on a first text to obtain a key sentence and a first position of the key sentence;

performing keyword recognition on the key sentence to obtain a first word and a first type of the first word;

performing word replacement on the first word in the first text according to the first type to obtain a second text;

performing text clustering according to the second text to obtain a target clustering result, and performing similar-text analysis on each sub-cluster in the target clustering result to obtain a general sentence of the sub-cluster and a second position of the general sentence;

performing keyword recognition on the general sentence to obtain a second word and a second type of the second word;

performing rule extraction on the key sentence according to the first word and the first type to obtain a first sentence rule;

performing rule extraction on the general sentence according to the second word and the second type to obtain a second sentence rule;

obtaining a target keyword of an event to be announced and a third type of the target keyword;

generating the target official document content of the event to be announced from the target keyword and the third type, in combination with the first sentence rule and the second sentence rule, according to the first position and the second position.

It should be noted that, as those skilled in the art will clearly understand, for convenience and brevity of description, the specific working process of the terminal device described above can be found in the corresponding process of the foregoing method embodiment and is not repeated here.

An embodiment of the present invention further provides a storage medium for computer-readable storage. The storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of any one of the artificial-intelligence-based official document content generation methods provided in the description of the embodiments of the present invention.

The storage medium may be an internal storage unit of the terminal device described in the foregoing embodiment, such as a hard disk or memory of the terminal device. The storage medium may also be an external storage device of the terminal device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal device.

Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and apparatuses, may be implemented as software, firmware, hardware, or an appropriate combination thereof. In a hardware embodiment, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have several functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, a digital signal processor, or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

It should be understood that the term "and/or" as used in the specification and the appended claims of the present invention refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations. It should be noted that, herein, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes the element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments. The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or replacements within the technical scope disclosed by the present invention, and such modifications or replacements shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be determined by the scope of protection of the claims.

Claims (9)

1. A method for generating official document content based on artificial intelligence, characterized in that the method comprises:

performing sentence recognition on a first text to obtain a key sentence and a first position of the key sentence;

performing keyword recognition on the key sentence to obtain a first word and a first type of the first word;

performing word replacement on the first word in the first text according to the first type to obtain a second text;

performing text clustering according to the second text to obtain a target clustering result, and performing similar-text analysis on each sub-cluster in the target clustering result to obtain a general sentence of the sub-cluster and a second position of the general sentence;

performing keyword recognition on the general sentence to obtain a second word and a second type of the second word;

performing rule extraction on the key sentence according to the first word and the first type to obtain a first sentence rule;

performing rule extraction on the general sentence according to the second word and the second type to obtain a second sentence rule;

obtaining a target keyword of an event to be announced and a third type of the target keyword;

generating target official document content of the event to be announced from the target keyword and the third type, in combination with the first sentence rule and the second sentence rule, according to the first position and the second position;

wherein performing sentence recognition on the first text to obtain the key sentence and the first position of the key sentence comprises:

performing keyword extraction on the first text to obtain a third word corresponding to the first text and a word weight corresponding to the third word;

performing sentence segmentation on the first text to obtain initial sentences, and determining a sentence weight corresponding to each initial sentence according to the third word and the word weight;

determining any one of the initial sentences as a first sentence, and obtaining, from the initial sentences, a second sentence adjacent to the first sentence;

obtaining, from the sentence weights, a first weight corresponding to the first sentence and a second weight corresponding to the second sentence;

adjusting the first weight according to the second weight to obtain a target weight corresponding to the first sentence;

performing sentence recognition on the initial sentences according to the target weight to obtain the key sentence, and obtaining the first position corresponding to the key sentence from the first text according to the key sentence.

2. The method according to claim 1, characterized in that adjusting the first weight according to the second weight to obtain the target weight corresponding to the first sentence comprises:

obtaining, from the initial sentences, a first sequence of sentences located to the left of the second sentence, and a second sequence of sentences located to the right of the second sentence;

obtaining, from the third words, the associated keywords corresponding to each first sub-sentence in the first sequence of sentences, and obtaining, from the word weights, the relevant weights corresponding to the associated keywords;

obtaining, from the sentence weights, a third weight corresponding to each first sub-sentence in the first sequence of sentences and a fourth weight corresponding to each second sub-sentence in the second sequence of sentences;

adjusting the first weight using the second weight, according to the relevant weights in combination with the third weight and the fourth weight, to obtain the target weight corresponding to the first sentence;

wherein the target weight is obtained according to the following formula:

[formula not reproduced in this text]

wherein the quantities in the formula are: the target weight corresponding to the i-th first sentence; an adjustment parameter; the first weight corresponding to the i-th first sentence; n, the number of sentences in the first sequence of sentences; y, the number of associated keywords corresponding to the h-th first sub-sentence; the relevant weight corresponding to the k-th associated keyword of the h-th first sub-sentence; the third weight corresponding to the h-th first sub-sentence; g, the number of sentences in the second sequence of sentences; the fourth weight corresponding to the t-th second sub-sentence; and the second weight of the second sentence adjacent to the i-th first sentence.

3. The method according to claim 1, characterized in that performing text clustering according to the second text to obtain the target clustering result comprises:

performing keyword recognition on the second texts to obtain the text keywords corresponding to each second text;

merging the text keywords to obtain all keywords, and calculating the similarity between any two of all the keywords to obtain relevant similarity values;

classifying all the keywords according to the relevant similarity values to obtain a first word group and a second word group;

performing cluster classification according to the first word group using the text keywords corresponding to the second texts to obtain a first classification result, and obtaining, from the first classification result, a first central word corresponding to each first sub-classification result and a first central weight corresponding to the first central word;

performing cluster classification according to the second word group using the text keywords corresponding to the second texts to obtain a second classification result, and obtaining, from the second classification result, a second central word corresponding to each second sub-classification result and a second central weight corresponding to the second central word;

determining, according to the first central word and the second central word in combination with the first central weight and the second central weight, a first associated word corresponding to the first central word among the second central words and a second associated word corresponding to the second central word among the first central words;

matching the first classification result and the second classification result according to the first associated word and the second associated word to obtain fusion similarities corresponding to a plurality of fusion word groups;

calculating the text similarity between the second texts according to the fusion similarities and the fusion word groups;

performing text clustering on the second texts according to the text similarity to obtain the target clustering result.

4. 
The method according to claim 3, characterized in that the calculating the text similarity between the second texts according to the fusion similarity and the fusion phrase comprises: 从所述第二文本中确定第一子文本和第二子文本;determining a first subtext and a second subtext from the second text; 获得所述融合词组中每个子关键词在所述第一子文本中对应的第一频次和每个所述子关键词在所述第二子文本中对应的第二频次;Obtaining a first frequency corresponding to each sub-keyword in the fused phrase in the first sub-text and a second frequency corresponding to each sub-keyword in the second sub-text; 从所述第一分类结果中获得所述融合词组对应的第三子分类结果和从所述第二分类结果中获得所述融合词组对应的第四子分类结果;Obtain a third sub-classification result corresponding to the fused phrase from the first classification result and obtain a fourth sub-classification result corresponding to the fused phrase from the second classification result; 获得所述第三子分类结果中关键词对应的第一数量和获得所述第四子分类结果中关键词对应的第二数量;Obtaining a first quantity corresponding to the keyword in the third sub-category result and obtaining a second quantity corresponding to the keyword in the fourth sub-category result; 根据所述第一频次、所述第二频次、所述第一数量和所述第二数量融合所述融合词组对应的所述融合相似度获得所述第一子文本和所述第二子文本对应的所述文本相似度;According to the first frequency, the second frequency, the first number, and the second number, the fusion similarity corresponding to the fusion phrase is fused to obtain the text similarity corresponding to the first subtext and the second subtext; 其中,根据下列公式获得所述文本相似度:The text similarity is obtained according to the following formula: 其中,表示第i个所述第一子文本和第j个所述第二子文本对应的所述文本相似度,num2表示所述融合词组对应的数量,num1表示第q个所述融合词组中所述子关键词对应的词语数量,表示第q个所述融合词组中第r个所述子关键词在第i个所述第一子文本中对应的所述第一频次,表示第q个所述融合词组中第r个所述子关键词在第j个所述第二子文本中对应的所述第二频次,表示第q个所述融合词组对应的所述第三子分类结果中关键词对应的所述第一数量,表示第q个所述融合词组对应的所述第四子分类结果中关键词对应的所述第二数量,表示第q个所述融合词组对应的所述融合相似度。in, represents the text similarity between the i-th first subtext and the j-th second subtext, num2 represents the number of the fused phrases, num1 represents the number of words corresponding to the sub-keywords in the q-th fused phrase, represents the first frequency 
corresponding to the rth sub-keyword in the qth fused phrase in the ith first sub-text, represents the second frequency corresponding to the rth sub-keyword in the qth fused phrase in the jth second sub-text, represents the first number corresponding to the keywords in the third sub-classification result corresponding to the qth fused phrase, represents the second number corresponding to the keywords in the fourth sub-classification result corresponding to the qth fused phrase, represents the fusion similarity corresponding to the qth fused phrase. 5.根据权利要求1所述的方法,其特征在于,所述对所述目标聚类结果中每个子类簇进行相似文本分析获得所述子类簇的通用语句和所述通用语句的第二位置,包括:5. The method according to claim 1, characterized in that the step of performing similar text analysis on each sub-cluster in the target clustering result to obtain a common sentence of the sub-cluster and a second position of the common sentence comprises: 获得所述子类簇中对应的第三子文本和第四子文本,并对所述第三子文本进行文本分割获得第一分割结果和对所述第四子文本进行文本分割获得第二分割结果;Obtaining a third subtext and a fourth subtext corresponding to the subclass cluster, and performing text segmentation on the third subtext to obtain a first segmentation result and performing text segmentation on the fourth subtext to obtain a second segmentation result; 获得所述第一分割结果中对应的第三子语句和所述第二分割结果对应的第四子语句;Obtaining a third sub-sentence corresponding to the first segmentation result and a fourth sub-sentence corresponding to the second segmentation result; 对所述第三子语句进行关键词识别获得第一关键词和对所述第四子语句进行关键词识别获得第二关键词;Performing keyword recognition on the third sub-sentence to obtain a first keyword and performing keyword recognition on the fourth sub-sentence to obtain a second keyword; 对所述第一关键词和所述第二关键词进行交集处理获得所述第三子语句和所述第四子语句对应的相同关键词和所述相同关键词对应的目标数量;Performing intersection processing on the first keyword and the second keyword to obtain the same keyword corresponding to the third sub-sentence and the fourth sub-sentence and the target number corresponding to the same keyword; 对所述第一关键词进行数量统计获得第三数量,并对所述第三数量进行对数求解获得第一结果;Counting the first 
keyword to obtain a third quantity, and performing logarithmic solution on the third quantity to obtain a first result; 对所述第二关键词进行数量统计获得第四数量,并对所述第四数量进行对数求解获得第二结果;Counting the number of the second keyword to obtain a fourth number, and performing logarithmic solution on the fourth number to obtain a second result; 对所述第一结果和所述第二结果进行求和获得目标结果,并将所述目标数量和所述目标结果进行比值计算获得所述第三子语句和所述第四子语句之间对应的语句相似度;Summing the first result and the second result to obtain a target result, and performing a ratio calculation between the target quantity and the target result to obtain a sentence similarity corresponding to the third sub-sentence and the fourth sub-sentence; 根据所述语句相似度确定所述第三子文本和所述第四子文本之间对应的相似语句;Determining similar sentences corresponding to the third subtext and the fourth subtext according to the sentence similarity; 根据所述子类簇对所述相似语句进行数据统计获得所述子类簇对应的所述通用语句;Performing data statistics on the similar sentences according to the sub-class cluster to obtain the common sentences corresponding to the sub-class cluster; 根据所述通用语句在所述子类簇中进行位置查找获得所述通用语句对应的所述第二位置。A position search is performed in the subclass cluster according to the general statement to obtain the second position corresponding to the general statement. 6.根据权利要求1所述的方法,其特征在于,所述根据所述目标关键词和所述第三类型结合所述第一语句规则和所述第二语句规则按照所述第一位置和所述第二位置生成所述待公布事件的目标公文内容,包括:6. 
The method according to claim 1, characterized in that the step of generating the target official document content of the event to be announced according to the target keyword and the third type in combination with the first sentence rule and the second sentence rule according to the first position and the second position comprises: 根据所述目标关键词和所述第三类型结合所述第一语句规则获得第一生成语句;Obtaining a first generated sentence according to the target keyword and the third type in combination with the first sentence rule; 根据所述目标关键词和所述第三类型结合所述第二语句规则获得第二生成语句;Obtaining a second generated sentence according to the target keyword and the third type combined with the second sentence rule; 根据所述第一位置和所述第二位置对所述第一生成语句和所述第二生成语句进行语句合并获得多个初始公文内容;Merging the first generated sentence and the second generated sentence according to the first position and the second position to obtain a plurality of initial official document contents; 根据质量评估模型所述初始公文内容进行文本质量评估获得所述初始公文内容对应的目标文本质量;Performing text quality assessment on the initial official document content according to the quality assessment model to obtain a target text quality corresponding to the initial official document content; 根据所述目标文本质量从所述初始公文内容中筛选得到所述待公布事件的所述目标公文内容。The target official document content of the event to be announced is obtained by screening the initial official document content according to the target text quality. 7.一种基于人工智能的公文内容生成装置,其特征在于,包括:7. 
An artificial intelligence-based document content generation device, characterized by comprising: 语句识别模块,用于对第一文本进行语句识别获得关键句和所述关键句的第一位置,其中,所述对第一文本进行语句识别获得关键句和所述关键句的第一位置,包括:对所述第一文本进行关键词提取获得所述第一文本对应的第三词语和所述第三词语对应的词语权重;对所述第一文本进行语句分割获得初始语句,并根据所述第三词语和所述词语权重确定所述初始语句对应的语句权重;将所述初始语句中任意一个句子确定为第一语句,并从所述初始语句中获得所述第一语句相邻状态下对应的第二语句;从所述语句权重中获得所述第一语句对应的第一权重和所述第二语句对应的第二权重;根据所述第二权重对所述第一权重进行权重调节获得所述第一语句对应的目标权重;根据所述目标权重对所述初始语句进行语句识别获得所述关键句,并根据所述关键句从所述第一文本中获得所述关键句对应的所述第一位置;A sentence recognition module, used for performing sentence recognition on a first text to obtain a key sentence and a first position of the key sentence, wherein the performing sentence recognition on the first text to obtain a key sentence and a first position of the key sentence includes: performing keyword extraction on the first text to obtain a third word corresponding to the first text and a word weight corresponding to the third word; performing sentence segmentation on the first text to obtain an initial sentence, and determining a sentence weight corresponding to the initial sentence according to the third word and the word weight; determining any sentence in the initial sentence as a first sentence, and obtaining a second sentence corresponding to the first sentence in an adjacent state from the initial sentence; obtaining a first weight corresponding to the first sentence and a second weight corresponding to the second sentence from the sentence weight; weight-adjusting the first weight according to the second weight to obtain a target weight corresponding to the first sentence; performing sentence recognition on the initial sentence according to the target weight to obtain the key sentence, and obtaining the first position corresponding to the key sentence from the first text according to the key sentence; 第一词语识别模块,用于对所述关键句进行关键词识别获得第一词语和所述第一词语的第一类型;A first word recognition module, configured to perform keyword recognition on the key sentence to obtain a first word and a first type of the first word; 
替换处理模块,用于根据所述第一类型对所述第一文本中的所述第一词语进行词语替换获得第二文本;a replacement processing module, configured to replace the first word in the first text according to the first type to obtain a second text; 聚类分析模块,用于根据所述第二文本进行文本聚类获得目标聚类结果,并对所述目标聚类结果中每个子类簇进行相似文本分析获得所述子类簇的通用语句和所述通用语句的第二位置;A cluster analysis module, configured to perform text clustering according to the second text to obtain a target clustering result, and perform similar text analysis on each sub-cluster in the target clustering result to obtain a common sentence of the sub-cluster and a second position of the common sentence; 第二词语识别模块,用于对所述通用语句进行关键词识别获得第二词语和所述第二词语的第二类型;A second word recognition module, used for performing keyword recognition on the general sentence to obtain a second word and a second type of the second word; 第一规则提取模块,用于根据所述第一词语和所述第一类型对所述关键句进行规则提取获得第一语句规则;A first rule extraction module, configured to extract rules from the key sentence according to the first word and the first type to obtain a first sentence rule; 第二规则提取模块,用于根据所述第二词语和所述第二类型对所述通用语句进行规则提取获得第二语句规则;A second rule extraction module, configured to extract rules from the general sentence according to the second word and the second type to obtain a second sentence rule; 数据获取模块,用于获得待公布事件的目标关键词和所述目标关键词的第三类型;A data acquisition module, used to obtain a target keyword of an event to be announced and a third type of the target keyword; 公文生成模块,用于根据所述目标关键词和所述第三类型结合所述第一语句规则和所述第二语句规则按照所述第一位置和所述第二位置生成所述待公布事件的目标公文内容。The official document generation module is used to generate the target official document content of the event to be announced according to the first position and the second position in combination with the first sentence rule and the second sentence rule according to the target keyword and the third type. 8.一种终端设备,其特征在于,所述终端设备包括处理器、存储器;8. 
A terminal device, characterized in that the terminal device comprises a processor and a memory; 所述存储器用于存储计算机程序;The memory is used to store computer programs; 所述处理器用于执行所述计算机程序并在执行所述计算机程序时实现如权利要求1至6中任一项所述的基于人工智能的公文内容生成方法。The processor is used to execute the computer program and implement the method for generating official document content based on artificial intelligence as described in any one of claims 1 to 6 when executing the computer program. 9.一种计算机存储介质,用于计算机存储,其特征在于,所述计算机存储介质存储有一个或者多个程序,所述一个或者多个程序可被一个或者多个处理器执行,以实现权利要求1至6中任一项所述的基于人工智能的公文内容生成方法的步骤。9. A computer storage medium for computer storage, characterized in that the computer storage medium stores one or more programs, and the one or more programs can be executed by one or more processors to implement the steps of the document content generation method based on artificial intelligence as described in any one of claims 1 to 6.
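The key-sentence steps recited at the end of claim 1 (keyword weights, sentence weights, adjustment by the adjacent sentences, selection of the key sentence and its position) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the splitting rule, the sum-of-keyword-weights sentence score, the `alpha` blend with the neighboring weights, and all function names are hypothetical. In particular, the blend is only a stand-in, because the formula of claim 2 is not reproduced in the text.

```python
import re


def split_sentences(text):
    """Split on common Chinese/English sentence terminators (an assumed rule)."""
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_weights(sentences, word_weights):
    """Assumed scoring: a sentence's weight is the sum of the weights
    of the extracted keywords that occur in it."""
    return [sum(w for kw, w in word_weights.items() if kw in s) for s in sentences]


def target_weights(weights, alpha=0.5):
    """Adjust each sentence weight by the mean weight of its adjacent sentences.
    This blend is a placeholder, NOT the formula recited in claim 2."""
    adjusted = []
    for i, w in enumerate(weights):
        neighbors = weights[max(0, i - 1):i] + weights[i + 1:i + 2]
        mean_nb = sum(neighbors) / len(neighbors) if neighbors else 0.0
        adjusted.append(w + alpha * mean_nb)
    return adjusted


def key_sentence(text, word_weights, alpha=0.5):
    """Return (key sentence, position index), or (None, -1) for empty input."""
    sentences = split_sentences(text)
    if not sentences:
        return None, -1
    targets = target_weights(sentence_weights(sentences, word_weights), alpha)
    position = max(range(len(sentences)), key=targets.__getitem__)
    return sentences[position], position
```

Calling `key_sentence(text, word_weights)` with a dictionary of keyword weights returns the highest-scoring sentence together with its index in the text, which plays the role of the "first position".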
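Claim 5 defines the sentence similarity as the number of shared keywords divided by the sum of the logarithms of the two keyword counts. A small Python sketch of that ratio is given below. The function name, the use of the natural logarithm, the counting of distinct keywords, and the zero-denominator guard are assumptions made for illustration; the claim does not specify them.

```python
import math


def sentence_similarity(first_keywords, second_keywords):
    """Shared-keyword count divided by log(count_a) + log(count_b).

    Assumptions: keywords are counted as distinct items, the logarithm is
    natural, and a non-positive denominator (e.g. one keyword on each side,
    where log(1) + log(1) = 0) yields a similarity of 0.0.
    """
    set_a, set_b = set(first_keywords), set(second_keywords)
    if not set_a or not set_b:
        return 0.0
    target_number = len(set_a & set_b)                         # shared keywords
    target_result = math.log(len(set_a)) + math.log(len(set_b))
    if target_result <= 0:
        return 0.0
    return target_number / target_result
```

For example, keyword sets of sizes 4 and 3 that share 2 keywords give 2 / (ln 4 + ln 3). Pairs whose similarity exceeds a chosen threshold could then be treated as the "similar sentences" of the claim; the threshold itself is not stated in the text.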
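The merge-assess-screen flow of claim 6 can be illustrated in the same hedged way. In the Python sketch below, each generated sentence is represented as a `(position, [alternative wordings])` pair and the quality assessment model is passed in as a plain callable; these representations and the names `merge_by_position` and `select_best` are hypothetical, and the claim does not say how candidates are enumerated or scored.

```python
from itertools import product


def merge_by_position(first_generated, second_generated):
    """Order all sentence slots by position and enumerate every combination
    of alternative wordings as one candidate document."""
    slots = sorted(first_generated + second_generated, key=lambda slot: slot[0])
    options = [alternatives for _, alternatives in slots]
    return [" ".join(choice) for choice in product(*options)]


def select_best(candidates, quality_model):
    """Keep the candidate with the highest score from the supplied model."""
    return max(candidates, key=quality_model)


# Usage with a toy stand-in for the quality assessment model (text length).
first = [(0, ["Notice on the safety inspection.", "Safety inspection notice."])]
second = [(1, ["All departments shall comply."])]
candidates = merge_by_position(first, second)
best = select_best(candidates, quality_model=len)
```

Enumerating the full Cartesian product is only practical for a handful of slots; a real system would presumably prune or sample candidates before scoring them.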

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202510502214.7A | 2025-04-22 | 2025-04-22 | Method, device, terminal equipment and storage medium for generating document content based on artificial intelligence

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202510502214.7A | 2025-04-22 | 2025-04-22 | Method, device, terminal equipment and storage medium for generating document content based on artificial intelligence

Publications (2)

Publication Number | Publication Date
CN120012729A (en) | 2025-05-16
CN120012729B (en) | 2025-07-04

Family

ID=95676613

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202510502214.7A | Active | CN120012729B (en) | 2025-04-22 | 2025-04-22 | Method, device, terminal equipment and storage medium for generating document content based on artificial intelligence

Country Status (1)

Country | Link
CN (1) | CN120012729B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN112597312A (en) * | 2020-12-28 | 2021-04-02 | 深圳壹账通智能科技有限公司 | Text classification method and device, electronic equipment and readable storage medium
CN114860942A (en) * | 2022-07-05 | 2022-08-05 | 北京云迹科技股份有限公司 | Text intention classification method, device, equipment and storage medium

Family Cites Families (1)

Publication number | Priority date | Publication date | Assignee | Title
CN115169359A (en) * | 2022-07-20 | 2022-10-11 | 思必驰科技股份有限公司 | Sentence generation method, electronic device and storage medium for expanding corpus

Also Published As

Publication number | Publication date
CN120012729A (en) | 2025-05-16

Similar Documents

Publication | Title
US20210382878A1 (en) | Systems and methods for generating a contextually and conversationally correct response to a query
US8630989B2 (en) | Systems and methods for information extraction using contextual pattern discovery
Mustafa et al. | Multi-label classification of research articles using Word2Vec and identification of similarity threshold
WO2021139262A1 (en) | Document mesh term aggregation method and apparatus, computer device, and readable storage medium
US8170969B2 (en) | Automated computation of semantic similarity of pairs of named entity phrases using electronic document corpora as background knowledge
WO2017167067A1 (en) | Method and device for webpage text classification, method and device for webpage text recognition
WO2021175009A1 (en) | Early warning event graph construction method and apparatus, device, and storage medium
US20140214835A1 (en) | System and method for automatically classifying documents
CN113377927A (en) | Similar document detection method and device, electronic equipment and storage medium
Ekbal et al. | Multiobjective optimization for classifier ensemble and feature selection: an application to named entity recognition
Saravanan et al. | Identification of rhetorical roles for segmentation and summarization of a legal judgment
CN103678422A (en) | Web page classification method and device and training method and device of web page classifier
CN113761125B (en) | Dynamic summary determination method and device, computing device and computer storage medium
CN112000802A (en) | Software defect positioning method based on similarity integration
Li et al. | Emotion-cause span extraction: a new task to emotion cause identification in texts
Ali et al. | An Improved FakeBERT for Fake News Detection.
Mohemad et al. | Performance analysis in text clustering using k-means and k-medoids algorithms for Malay crime documents
Kumar et al. | Transformer-based models for language identification: A comparative study
Nguyen et al. | TabEAno: table to knowledge graph entity annotation
CN120012729B (en) | Method, device, terminal equipment and storage medium for generating document content based on artificial intelligence
Safikhani et al. | Enhancing autonlp with fine-tuned BERT models: an evaluation of text representation methods for autopytorch
Hirsch et al. | Evolving Lucene search queries for text classification
Li-Juan et al. | A classification method of Vietnamese news events based on maximum entropy model
US20170293863A1 (en) | Data analysis system, and control method, program, and recording medium therefor
Deny et al. | Inshort text summarization of news article

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant