CN114756673A - Method, device, electronic device and storage medium for generating policy text abstract - Google Patents

Publication number
CN114756673A
Authority
CN
China
Prior art keywords
policy
candidate
text
target
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210208867.0A
Other languages
Chinese (zh)
Inventor
陈芷昕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd
Priority to CN202210208867.0A
Publication of CN114756673A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention disclose a method, apparatus, electronic device and storage medium for generating policy text summaries. The method obtains the candidate title of each candidate policy text and classifies the title to obtain the policy category of that text; determines text keywords from the candidate policy text; determines the first frequency with which the text keywords appear in a preset keyword database; determines the text importance of the candidate policy text from the first frequency; selects target policy texts from the candidate policy texts of each policy category according to text importance; extracts target key sentences from the target policy texts; and generates, from the target key sentences, a policy summary for the policy category of each target policy text. Compared with manual compilation, the method generates policy summaries automatically and quickly, which improves both the efficiency and the accuracy of policy summary generation.

Description

Method, apparatus, electronic device and storage medium for generating policy text summaries

Technical Field

The present invention relates to the technical field of natural language processing, and in particular to a method, apparatus, electronic device and storage medium for generating policy text summaries.

Background

With the arrival of the information society, electronic policy news has replaced printed policy news. The policy morning briefing is one form of policy news: a collection of summaries of the main policies, generated from the policies updated daily in a policy database, so that readers can learn about the day's important policy developments in a short time. For people who follow policy trends closely but lack reading time, the policy morning briefing is therefore the first choice.

In the related art, the policy morning briefing is compiled manually at a fixed time each day. The compiler relies on experience to understand the policy texts, extracts policy summaries from them, and assembles the summaries into a briefing. Extracting policy summaries manually is labor-intensive and makes summary generation inefficient.

Summary of the Invention

The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of protection of the claims.

Embodiments of the present invention provide a method, apparatus, electronic device and storage medium for generating policy text summaries, which can improve the efficiency of generating policy text summaries.

In a first aspect, an embodiment of the present invention provides a method for generating a policy text summary, including:

obtaining a candidate title of a candidate policy text, and classifying the candidate title to obtain the policy category corresponding to the candidate policy text;

determining text keywords from the candidate policy text, determining a first frequency with which the text keywords appear in a preset keyword database, and determining the text importance of the candidate policy text according to the first frequency;

determining a target policy text among the candidate policy texts of each policy category according to the text importance; and

extracting target key sentences from the target policy text, and generating, according to the target key sentences, a policy summary of the policy category corresponding to the target policy text.

Further, classifying the candidate title to obtain the policy category corresponding to the candidate policy text includes:

performing word segmentation on the candidate title to obtain title keywords;

determining a second frequency with which each title keyword appears in the candidate policy text, and calculating a keyword vector of the title keyword according to the second frequency;

calculating a title vector of the candidate title according to the keyword vectors; and

classifying the candidate title according to the title vector to obtain the policy category corresponding to the candidate policy text.
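The title classification steps above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the keyword vector is built or which classifier is used, so the sketch uses a bag-of-words vector whose components are the second frequencies, and a nearest-prototype cosine classifier as a stand-in. The names `title_vector`, `classify`, `vocab` and `prototypes` are hypothetical.

```python
import math
from collections import Counter

def title_vector(title_words, body_words, vocab):
    """Build a title vector whose components are the 'second frequency'
    (occurrences in the policy body) of each title keyword."""
    body_counts = Counter(body_words)
    vec = [0.0] * len(vocab)
    for word in title_words:
        if word in vocab:
            vec[vocab[word]] += body_counts[word]
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def classify(vec, prototypes):
    """Assumed classifier: pick the category with the most similar prototype."""
    return max(prototypes, key=lambda c: cosine(vec, prototypes[c]))

# Hypothetical vocabulary and category prototypes, for illustration only.
vocab = {"digital": 0, "economy": 1, "region": 2, "industry": 3}
prototypes = {
    "economic": [1, 1, 0, 0],
    "local": [0, 0, 1, 0],
    "industry": [0, 0, 0, 1],
}
title = ["region", "digital", "economy"]  # already segmented title
body = ["digital", "economy", "digital", "economy", "region", "industry"]
vec = title_vector(title, body, vocab)
category = classify(vec, prototypes)
```

In practice the segmentation step would need a Chinese word segmenter, and a trained classifier could replace the prototype comparison.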

Further, determining text keywords from the candidate policy text includes:

performing word segmentation on the candidate policy text to obtain text candidate words;

calculating a keyword score for each text candidate word in the candidate policy text; and

sorting the text candidate words in descending order of keyword score and determining the text candidate words ranked before a first threshold as text keywords; or sorting the text candidate words in ascending order of keyword score and determining the text candidate words ranked after a second threshold as text keywords.
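A minimal sketch of the descending-order variant follows. The keyword score used here, relative term frequency, is an assumption made for illustration because this passage does not define the score; `top_keywords` and `first_threshold` are hypothetical names.

```python
from collections import Counter

def top_keywords(candidate_words, first_threshold):
    """Score each candidate word and keep those ranked before the threshold."""
    counts = Counter(candidate_words)
    total = len(candidate_words)
    # Assumed score: relative frequency of the word within the text.
    scores = {word: count / total for word, count in counts.items()}
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:first_threshold]

words = ["carbon", "neutrality", "carbon", "policy", "carbon", "policy"]
keywords = top_keywords(words, 2)
```

The ascending-order variant selects the same words by taking the tail of the reversed ranking.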

Further, determining the text importance of the candidate policy text according to the first frequency includes:

obtaining the word importance of each text keyword as the product of its first frequency and its keyword score; and

obtaining the text importance of the candidate policy text as the sum of the word importances of all text keywords in the candidate policy text.
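The two steps above amount to a weighted sum: text importance = sum over text keywords of (first frequency × keyword score). A short sketch, with hypothetical names and made-up sample values:

```python
def text_importance(keyword_scores, first_frequency):
    """Sum of (first frequency x keyword score) over all text keywords.

    keyword_scores: keyword -> keyword score within this policy text
    first_frequency: keyword -> occurrences in the preset keyword database
    """
    return sum(
        score * first_frequency.get(word, 0)
        for word, score in keyword_scores.items()
    )

# Sample values are illustrative only.
keyword_scores = {"carbon neutrality": 0.5, "subsidy": 0.25}
first_frequency = {"carbon neutrality": 100, "subsidy": 8}
importance = text_importance(keyword_scores, first_frequency)
```

A keyword that is absent from the database contributes zero, which is one possible convention rather than something the text specifies.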

Further, determining a target policy text among the candidate policy texts of each policy category according to the text importance includes:

in each policy category, sorting the candidate policy texts in descending order of text importance and determining the candidate policy texts ranked before a third threshold as the target policy texts of that policy category;

or, in each policy category, sorting the candidate policy texts in ascending order of text importance and determining the candidate policy texts ranked after a fourth threshold as the target policy texts of that policy category.
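The per-category selection can be sketched as a group-then-rank operation. The record layout (`title`, `category`, `importance`) and the function name `select_targets` are assumptions for illustration; only the descending-order variant is shown.

```python
from collections import defaultdict

def select_targets(candidates, third_threshold):
    """Keep, in each policy category, the texts ranked before the threshold."""
    by_category = defaultdict(list)
    for doc in candidates:
        by_category[doc["category"]].append(doc)
    return {
        category: sorted(docs, key=lambda d: d["importance"], reverse=True)[:third_threshold]
        for category, docs in by_category.items()
    }

# Illustrative records only.
candidates = [
    {"title": "A", "category": "economic", "importance": 52.0},
    {"title": "B", "category": "economic", "importance": 17.5},
    {"title": "C", "category": "local", "importance": 3.0},
]
targets = select_targets(candidates, 1)
```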

Further, extracting target key sentences from the target policy text includes:

splitting the target policy text into sentences to obtain target candidate sentences;

vectorizing the target candidate sentences to obtain candidate sentence vectors;

calculating a similarity value between every two target candidate sentences according to the candidate sentence vectors;

calculating a candidate sentence score for each target candidate sentence according to the similarity values; and

sorting the target candidate sentences in descending order of candidate sentence score and determining the target candidate sentences ranked before a fifth threshold as target key sentences; or sorting the target candidate sentences in ascending order of candidate sentence score and determining the target candidate sentences ranked after a sixth threshold as target key sentences.
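The sentence splitting, vectorization and pairwise similarity steps could look like the sketch below. It assumes bag-of-words sentence vectors and cosine similarity, neither of which is fixed by the text, and uses whitespace tokenization on an English sample purely for readability; Chinese text would need a word segmenter. All function names are hypothetical.

```python
import math
import re
from collections import Counter

def split_sentences(text):
    """Naive split on Chinese and Western sentence-ending punctuation."""
    parts = re.split(r"[。！？.!?]+", text)
    return [p.strip() for p in parts if p.strip()]

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def similarity_matrix(sentences):
    """Similarity value for every pair of sentences; the diagonal is zero."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    n = len(vectors)
    return [
        [0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]

text = "Carbon policy supports industry. Carbon policy supports cities. Weather was mild."
sentences = split_sentences(text)
sim = similarity_matrix(sentences)
```

The first two sentences share three of their four words, so they receive a high similarity, while the third shares none and receives zero.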

Further, calculating the candidate sentence score corresponding to each target candidate sentence according to the similarity values includes:

taking each target candidate sentence as a sentence node and each similarity value as the connecting edge between the two corresponding sentence nodes, and constructing a candidate sentence graph from the sentence nodes and connecting edges;

determining an initial score and a probability coefficient for the sentence nodes, where the probability coefficient represents the probability that a sentence node points to any other node in the candidate sentence graph; and

propagating and updating the initial scores through the candidate sentence graph according to the probability coefficient and the similarity values until convergence, and taking the converged score as the candidate sentence score of the corresponding target candidate sentence.
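The graph-based scoring described above resembles a TextRank-style iteration. The sketch below is one plausible reading, not the patent's exact update rule: the probability coefficient is treated as a damping factor with an assumed value of 0.85, every node starts with an assumed initial score of 1.0, and the tolerance and iteration cap are likewise assumptions.

```python
def rank_sentences(sim, damping=0.85, tol=1e-6, max_iter=100):
    """Propagate scores over a weighted sentence graph until they stabilize.

    sim[i][j] is the similarity (edge weight) between sentences i and j.
    """
    n = len(sim)
    if n == 0:
        return []
    scores = [1.0] * n                       # assumed initial score
    out_weight = [sum(row) for row in sim]   # total edge weight per node
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            incoming = sum(
                sim[j][i] / out_weight[j] * scores[j]
                for j in range(n)
                if j != i and out_weight[j] > 0
            )
            new_scores.append((1 - damping) + damping * incoming)
        converged = max(abs(a - b) for a, b in zip(new_scores, scores)) < tol
        scores = new_scores
        if converged:
            break
    return scores

# Illustrative similarity values: sentences 0 and 1 are close, 2 is an outlier.
sim = [
    [0.0, 0.75, 0.1],
    [0.75, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
scores = rank_sentences(sim)
```

Sentences that are similar to many other sentences accumulate higher scores, so the outlier sentence ends up ranked last.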

In a second aspect, an embodiment of the present invention further provides an apparatus for generating a policy text summary, including:

a policy text classification module, configured to obtain a candidate title of a candidate policy text and classify the candidate title to obtain the policy category corresponding to the candidate policy text;

a text importance determination module, configured to determine text keywords from the candidate policy text, determine a first frequency with which the text keywords appear in a preset keyword database, and determine the text importance of the candidate policy text according to the first frequency;

a target policy text determination module, configured to determine a target policy text among the candidate policy texts of each policy category according to the text importance; and

a policy summary generation module, configured to extract target key sentences from the target policy text and generate, according to the target key sentences, a policy summary of the policy category corresponding to the target policy text.

In a third aspect, an embodiment of the present invention further provides an electronic device including a memory and a processor, where the memory stores a computer program and the processor, when executing the computer program, implements the method for generating a policy text summary according to the first aspect.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a program that, when executed by a processor, implements the method for generating a policy text summary according to the first aspect.

The embodiments of the present invention provide at least the following beneficial effects:

The method for generating a policy text summary proposed by the embodiments of the present invention obtains the candidate title of each candidate policy text, classifies the title to obtain the policy category of the text, determines text keywords from the candidate policy text, determines the first frequency with which the text keywords appear in a preset keyword database, determines the text importance of the candidate policy text from the first frequency, determines target policy texts among the candidate policy texts of each policy category according to the text importance, extracts target key sentences from the target policy texts, and generates from them a policy summary of the corresponding policy category. Compared with manual compilation, policy summaries are generated automatically and quickly, which improves the efficiency of generating policy text summaries. Because the policy category of each candidate policy text is determined from its candidate title, each generated policy summary corresponds to the policy category of its target policy text, so the summaries are grouped by category and are easier to read. In addition, the text importance of each candidate policy text is determined from the first frequency of its text keywords, so the more important target policy texts in each policy category can be identified, and generating the summaries from those texts helps improve the accuracy of the generated policy summaries.

Other features and advantages of the present invention will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practicing the invention. The objectives and other advantages of the invention may be realized and attained by the structures particularly pointed out in the description, the claims and the drawings.

Description of the Drawings

The accompanying drawings provide a further understanding of the technical solutions of the present invention and constitute a part of the specification. Together with the embodiments of the present invention, they serve to explain the technical solutions and do not limit them.

FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention;

FIG. 2 is a schematic flowchart of a method for generating a policy text summary provided by an embodiment of the present invention;

FIG. 3 is a schematic flowchart of classifying candidate titles provided by an embodiment of the present invention;

FIG. 4 is a schematic diagram of an example classification of candidate policy texts provided by an embodiment of the present invention;

FIG. 5 is a schematic flowchart of determining text keywords provided by an embodiment of the present invention;

FIG. 6 is a schematic flowchart of determining text importance provided by an embodiment of the present invention;

FIG. 7 is a schematic flowchart of extracting target key sentences provided by an embodiment of the present invention;

FIG. 8 is a schematic flowchart of calculating candidate sentence scores provided by an embodiment of the present invention;

FIG. 9 is a schematic structural diagram of an apparatus for generating a policy text summary provided by an embodiment of the present invention;

FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.

Detailed Description

To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and not to limit it.

It should be understood that, in the description of the embodiments of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood to exclude the stated number, while "above", "below", "within" and the like are understood to include it. Terms such as "first" and "second" are used only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their order.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present invention. The terms used herein are only for the purpose of describing the embodiments of the present invention and are not intended to limit the present invention.

First, several terms used in this application are explained:

Artificial intelligence (AI): a new technical science that studies and develops theories, methods, technologies and application systems for simulating, extending and expanding human intelligence. Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce intelligent machines that respond in a manner similar to human intelligence. Research in this field includes robotics, language recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also the theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.

The embodiments of the present application may acquire and process relevant data based on artificial intelligence technology.

As described in the Background, the policy morning briefing is a collection of summaries of the main policies updated daily in a policy database, and in the related art it is compiled manually at a fixed time each day. Manual extraction of policy summaries is labor-intensive and makes summary generation inefficient.

On this basis, embodiments of the present invention provide a method, apparatus, electronic device and storage medium for generating policy text summaries, which can improve the efficiency of generating policy text summaries.

The method, apparatus, electronic device and storage medium provided by the embodiments of the present invention are described through the following embodiments, beginning with the method for generating a policy text summary.

Referring to FIG. 1, FIG. 1 is a schematic diagram of an implementation environment provided by an embodiment of the present invention. The implementation environment includes a terminal 101 and a server 102, which are connected through a communication network 103.

The server 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.

In addition, the server 102 may also be a node server in a blockchain network.

The terminal 101 may be, without limitation, a smartphone, tablet computer, notebook computer, desktop computer, smart speaker, smart watch or vehicle-mounted terminal. The terminal 101 and the server 102 may be connected directly or indirectly through wired or wireless communication, which is not limited in the embodiments of the present invention.

Based on the implementation environment shown in FIG. 1, an embodiment of the present invention provides a method for generating a policy text summary. The method may be executed by the server 102 shown in FIG. 1, by the terminal 101 shown in FIG. 1, or by the terminal 101 and the server 102 in cooperation. The embodiments of the present invention are described using execution by the server 102 as an example.

Referring to FIG. 2, FIG. 2 is a schematic flowchart of a method for generating a policy text summary provided by an embodiment of the present invention. The method includes, but is not limited to, the following steps 201 to 204.

Step 201: obtain the candidate title of a candidate policy text, and classify the candidate title to obtain the policy category corresponding to the candidate policy text;

Step 202: determine text keywords from the candidate policy text, determine the first frequency with which the text keywords appear in a preset keyword database, and determine the text importance of the candidate policy text according to the first frequency;

Step 203: determine a target policy text among the candidate policy texts of each policy category according to the text importance;

Step 204: extract target key sentences from the target policy text, and generate, according to the target key sentences, a policy summary of the policy category corresponding to the target policy text.

There may be multiple candidate policy texts. A candidate policy text may be a current affairs policy and may belong to different policy categories, including but not limited to economic policy, industry policy and local policy. In one possible implementation, the candidate policy texts are obtained by collecting the news texts that preset websites published on the current day under the news category "current affairs policy". Specifically, the news texts published that day are fetched from the preset websites, and those whose news category is current affairs policy are taken as candidate policy texts. The preset websites may include one or more of news websites, consulting websites, video websites and the like, and in practice the websites to be crawled can be configured as required. To identify the news texts published on the current day, the publication date of each news text is compared with the current date and only news texts whose publication date matches are kept; the news category of each remaining text is then checked, and those categorized as current affairs policy become the candidate policy texts.
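The date and category filter described above can be sketched in a few lines. The record fields (`title`, `published`, `category`), the category label string and the sample dates are assumptions made for illustration; fetching the news texts from websites is outside the scope of the sketch.

```python
from datetime import date

def select_candidate_texts(news_items, today, wanted_category="current affairs policy"):
    """Keep news texts published today whose category is current affairs policy."""
    return [
        item for item in news_items
        if item["published"] == today and item["category"] == wanted_category
    ]

# Sample records are illustrative only.
news_items = [
    {"title": "Interpretation of the digital economy policy system",
     "published": date(2022, 3, 4), "category": "current affairs policy"},
    {"title": "Local football results",
     "published": date(2022, 3, 4), "category": "sports"},
    {"title": "Last week's subsidy notice",
     "published": date(2022, 2, 25), "category": "current affairs policy"},
]
candidates = select_candidate_texts(news_items, today=date(2022, 3, 4))
```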

The candidate title summarizes the content of the candidate policy text. For example, the candidate title may be "Interpretation of the Digital Economy Industrial Policy System of Region L". Classifying the candidate title yields the policy category of the corresponding candidate policy text. For this example title, the policy category of the candidate policy text can be determined to be local policy or economic policy.

A candidate policy text contains many text words, and the frequency of occurrence in the preset keyword database does not need to be calculated for all of them. Only the key text words are taken as the text keywords of the candidate policy text, and the importance of the candidate policy text is then determined from the frequency with which these text keywords occur in the preset keyword database.

The keyword database can be set up in advance. In one possible implementation, the text keywords of all candidate policy texts within a preset historical time range are collected and stored in the preset keyword database, so that the database holds the text keywords of every candidate policy text within that range. For example, if the preset historical time range is the previous month, the text keywords of all candidate policy texts from the month before the current time are stored in the preset keyword database. The first frequency of a text keyword of the current candidate policy text is then obtained by checking whether that keyword matches the text keywords stored for each candidate policy text in the preset keyword database. For example, if the text keywords of 100 candidate policy texts stored in the preset keyword database contain "carbon neutrality", the first frequency of the text keyword "carbon neutrality" in the current candidate policy text is 100. The higher the first frequency of the text keywords in the preset keyword database, the higher the text importance of the candidate policy text.
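The counting described above can be sketched as follows. The in-memory list of keyword sets stands in for the preset keyword database; that representation and the sample contents are assumptions made for illustration.

```python
def first_frequency(keyword, keyword_db):
    """Count the stored policy texts whose keyword set contains `keyword`.

    `keyword_db` is assumed to hold one set of text keywords per candidate
    policy text from the preset historical time range.
    """
    return sum(1 for stored_keywords in keyword_db if keyword in stored_keywords)


# Made-up database: 100 stored texts mention "carbon neutrality", 20 do not.
keyword_db = [{"carbon neutrality", "energy"}] * 100 + [{"housing"}] * 20
print(first_frequency("carbon neutrality", keyword_db))  # 100
```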

In the embodiment of the present invention, the candidate policy texts are grouped by policy category, so that each policy category has its own candidate policy texts. The candidate policy text with the highest text importance can then be determined as the target policy text of the corresponding policy category. For example, suppose the candidate policy texts are A, B, C, D, E, F, G, and H, and grouping by policy category gives A, C, E in the first policy category; B, D, G in the second policy category; and F, H in the third policy category. If the candidate policy text with the highest text importance is C in the first policy category, D in the second, and F in the third, then the target policy text of the first policy category is C, that of the second policy category is D, and that of the third policy category is F. Determining the target policy texts of the different policy categories in this way makes it possible to compile a corresponding policy morning report, which saves users time when reviewing policy information. It can be understood that, when the candidate policy texts are classified, the same candidate policy text may be assigned to more than one policy category; for example, candidate policy text A above may be assigned to both the first policy category and the second policy category.

The method for generating a policy text summary provided by the embodiment of the present invention obtains the candidate titles of the candidate policy texts and classifies the candidate titles to obtain the policy categories of the candidate policy texts; determines text keywords from the candidate policy texts, determines the first frequency with which the text keywords occur in the preset keyword database, and determines the text importance of each candidate policy text from the first frequency; determines the target policy text among the candidate policy texts of each policy category according to the text importance; and extracts target key sentences from the target policy text and generates from them a policy summary for the policy category of the target policy text. Compared with manual compilation, the policy summaries are generated automatically and quickly, which improves the efficiency of generating policy text summaries. Because the policy category of each candidate policy text is determined from its candidate title, each generated policy summary corresponds to the policy category of its target policy text, so the summaries are organized by category and are easier to read. In addition, the text importance of a candidate policy text is determined from the first frequency of its text keywords, so the more important target policy texts in each policy category can be identified, and generating the policy summaries from these target policy texts helps improve the accuracy of the generated summaries.

In one possible implementation, referring to FIG. 3, which is a schematic flowchart of classifying candidate titles provided by an embodiment of the present invention, classifying the candidate titles in step 201 above to obtain the policy categories of the candidate policy texts may specifically include the following steps 301 to 304.

Step 301: Perform word segmentation on the candidate title to obtain title keywords;

Step 302: Determine the second frequency with which each title keyword occurs in the candidate policy text, and calculate the keyword vector of the title keyword according to the second frequency;

Step 303: Calculate the title vector of the candidate title according to the keyword vectors;

Step 304: Classify the candidate title according to the title vector to obtain the policy category of the candidate policy text.

Word segmentation of the candidate title can be performed with word segmentation software. The stop words in the candidate title are removed first, and the remaining title is then segmented to obtain the title keywords. Stop words include punctuation marks, digits, English letters, and the like, and the software used may be jieba. jieba offers three segmentation modes: full mode, precise mode, and search engine mode. Full mode scans out every string in the sentence that could form a word; precise mode tries to cut the sentence as accurately as possible and is suited to text analysis; search engine mode builds on precise mode by further splitting long words (more than 2 characters) to improve recall and is suited to search engine indexing. In this embodiment, full mode is used to extract all possible words from the candidate title. For example, if the candidate title is "Interpretation of the '1+N+S' Digital Economy Industrial Policy System of Region L", removing the stop words gives "Interpretation of the Digital Economy Industrial Policy System of Region L", and segmenting this with jieba gives the title keywords "Region L, digital, economy, industry, policy, system, interpretation".

After the candidate title has been split into title keywords, the second frequency with which each title keyword occurs in the candidate policy text is calculated, and the vector of each title keyword is determined from that second frequency. In one possible implementation, the second frequency itself is used as the vector of the title keyword; for example, if the second frequency is 3, the vector of the title keyword is "3". Alternatively, the vector can be calculated with TF-IDF (Term Frequency-Inverse Document Frequency). The main idea of TF-IDF is that if a word or phrase occurs with a high term frequency (TF) in one article and rarely occurs in other articles, it discriminates well between categories and is suitable for classification. TF-IDF is the product TF*IDF of the term frequency (TF) and the inverse document frequency (IDF). Here TF is the second frequency, that is, the frequency with which the term occurs in document d. The main idea of IDF is that the fewer documents contain term t, that is, the smaller n is, the larger the IDF, which indicates that term t discriminates well between categories. The TF-IDF method is therefore applied to the title keywords to calculate a weight for each title keyword in the candidate policy text, giving the vectors of the title keywords, which serve as the vector input for subsequent model training. For example, if word segmentation gives the title keywords "Region L, digital, economy, industry, policy, system, interpretation", the vectors calculated by the TF-IDF method are "0.023, 0.021, 0, 0, 0.002, ...".
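The TF*IDF calculation described above can be sketched in plain Python. The text does not give the exact IDF formula, so the smoothed variant used here is an assumption, as are the function name `tf_idf_vector` and the toy corpus.

```python
import math
from collections import Counter


def tf_idf_vector(keywords, doc_tokens, corpus):
    """Return one TF-IDF weight per keyword for a single document.

    TF  = occurrences of the term in the document / document length.
    IDF = log((1 + N) / (1 + n)) + 1, a smoothed variant chosen for this
          sketch (assumption); N is the corpus size and n the number of
          documents that contain the term.
    """
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    vector = []
    for term in keywords:
        tf = counts[term] / total if total else 0.0
        containing = sum(1 for doc in corpus if term in doc)
        idf = math.log((1 + n_docs) / (1 + containing)) + 1
        vector.append(tf * idf)
    return vector


# Toy corpus of already-segmented policy texts (made-up data).
corpus = [
    ["digital", "economy", "policy"],
    ["economy", "policy"],
    ["policy", "housing"],
]
title_keywords = ["digital", "economy", "policy", "system"]
vector = tf_idf_vector(title_keywords, corpus[0], corpus)
print(vector)
```

In this toy corpus "digital" occurs in only one document and so receives the largest weight, while "system" does not occur in the document at all and receives 0, which mirrors the zeros in the example vector above.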

After the vector value of each title keyword has been calculated, the vector values of all title keywords are concatenated to obtain the title vector of the candidate title. Continuing the example above, the title vector is [0.023, 0.021, 0, 0, 0.002, ...].

Next, the title vector is input into an SVM model to classify the candidate title. The SVM model, also called a support vector machine, is in this example a hard-margin SVM, an algorithm that finds the maximum-margin hyperplane in a linearly separable problem, subject to the constraint that the distance from each sample point to the decision boundary is greater than or equal to 1. The hard-margin SVM can be converted into an equivalent convex quadratic optimization problem and solved as such. Processing the title vector with the SVM model gives the classification value of the title vector, which represents the policy category to which the corresponding candidate policy text belongs.
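For reference, the hard-margin SVM mentioned above is usually written as the following optimization problem. This is the standard textbook form rather than a formula taken from this document; the symbols $w$, $b$, $x_i$, $y_i$ and $m$ are introduced here for illustration, with $x_i$ standing for a title vector and $y_i \in \{-1, +1\}$ for its label with respect to one policy category.

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1, \qquad i = 1, \dots, m
```

The objective is a convex quadratic and the constraints are linear, which is why the problem can be solved as a convex quadratic program. The geometric margin equals $1 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.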

The classification value represents the classification result of the title across the policy categories, and the policy category of the candidate policy text is determined from it. For example, suppose the classification value of the above title vector is [0 0 1 0 0 1 0]. The classification value has 7 elements, each corresponding in order to one of seven policy categories; an element of "1" indicates that the title vector belongs to that policy category, and an element of "0" indicates that it does not. From this classification value, the candidate policy text therefore belongs to the third policy category and the sixth policy category.
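Reading the policy categories off the classification value can be sketched as follows. The category names are placeholders, since the text does not name all seven categories.

```python
def decode_categories(classification, category_names):
    """Return the names of the categories whose element is 1."""
    if len(classification) != len(category_names):
        raise ValueError("classification value and category list differ in length")
    return [name for flag, name in zip(classification, category_names) if flag == 1]


# Placeholder names for the seven policy categories (assumption).
category_names = [f"category {i}" for i in range(1, 8)]
print(decode_categories([0, 0, 1, 0, 0, 1, 0], category_names))
# ['category 3', 'category 6']
```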

For example, referring to FIG. 4, which is a schematic diagram of a classification example for a candidate policy text provided by an embodiment of the present invention, suppose the title text of the input candidate policy text is "Interpretation of the '1+N+S' Digital Economy Industrial Policy System of Region L". Segmenting the title text gives the title keywords "Region L, digital, economy, industry, policy, system, interpretation". The TF-IDF method then gives the title keyword vectors "0.023, 0.021, 0, 0, 0.002, ...", and the SVM model classifies these vector values to give the classification value [0 0 1 0 0 1 0]. Assuming that the third and sixth policy categories are local policy and economic policy, the policy categories of this candidate policy text are local policy and economic policy.

In one possible implementation, referring to FIG. 5, which is a schematic flowchart of determining text keywords provided by an embodiment of the present invention, determining the text keywords from the candidate policy text in step 202 above may specifically include the following steps 501 to 503.

Step 501: Perform word segmentation on the candidate policy text to obtain text candidate words;

Step 502: Calculate the keyword score of each text candidate word in the candidate policy text;

Step 503: Sort the text candidate words in descending order of keyword score and determine the text candidate words ranked before a first threshold as the text keywords; or sort the text candidate words in ascending order of keyword score and determine the text candidate words ranked after a second threshold as the text keywords.

Word segmentation of the candidate policy text can likewise be performed with word segmentation software. The keyword score of each text candidate word in the candidate policy text can be calculated with the TF-IDF method described above. In this calculation, the IDF is computed over all historical policy texts, which makes the keyword scores more reasonable and accurate.

Once the keyword scores of the text candidate words have been obtained, the text candidate words are sorted by keyword score, and a preset number of top-ranked text candidate words are taken as the text keywords, that is, a few words that represent the candidate policy text. The sorting may be in descending or ascending order of keyword score, which is not limited in the embodiment of the present invention. The first threshold and the second threshold can be set according to the actual situation, for example 50, 70, or 100, which is likewise not limited. Taking descending order as an example, suppose the text candidate words are K1, K2, K3, K4, K5, K6, K7, and K8 and the first threshold is 3. If the sorted result is K3, K4, K5, K1, K2, K6, K7, K8, the text keywords are K3, K4, and K5.
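The descending-order selection in this example can be sketched as follows. The keyword scores are made up so that the ranking matches the example, and the first threshold is read as "keep the top N words", which is an interpretation of the text.

```python
def top_keywords(scores, first_threshold):
    """Sort candidate words by descending score and keep the first N."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:first_threshold]


# Made-up keyword scores chosen to reproduce the ranking in the example.
scores = {"K1": 0.5, "K2": 0.4, "K3": 0.9, "K4": 0.8,
          "K5": 0.7, "K6": 0.3, "K7": 0.2, "K8": 0.1}
print(top_keywords(scores, first_threshold=3))  # ['K3', 'K4', 'K5']
```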

The keyword scores of the text candidate words are used not only to determine the text keywords: the keyword score of each text keyword is also combined with the first frequency with which that text keyword occurs in the preset keyword database to determine the text importance of the candidate policy text. Referring to FIG. 6, which is a schematic flowchart of determining the text importance provided by an embodiment of the present invention, determining the text importance of the candidate policy text according to the first frequency in step 202 above may specifically include the following steps 601 to 602.

Step 601: Obtain the word importance of each text keyword as the product of its first frequency and its keyword score;

Step 602: Obtain the text importance of the candidate policy text as the sum of the word importances of all text keywords in the candidate policy text.

Specifically, the keyword score of each text keyword of the candidate policy text is multiplied by the first frequency with which that text keyword occurred in the preceding preset time period, giving the word importance of the text keyword. The word importances of all text keywords in the candidate policy text are then summed to give the text importance of the candidate policy text. The candidate policy texts can thus be screened by text importance, and the candidate policy text with the highest text importance is taken as the target policy text of its policy category.

The keyword score of each text candidate word over the full policy text can be calculated with TF-IDF. As noted above, the main idea of TF-IDF is that a word or phrase that occurs with a high term frequency in one article and rarely in other articles discriminates well between categories. The keyword score of each text candidate word is therefore determined from the frequency with which it occurs in the full candidate policy text, and the keyword score is used to judge how important the text candidate word is to the candidate policy text.

The text importance of a candidate policy text is calculated with the following formula:

$$s_j = \sum_{i} \mathit{tfidf}_i \times \mathit{kf}_i$$

where $s_j$ denotes the text importance of the $j$-th candidate policy text, $\mathit{tfidf}_i$ denotes the keyword score of the $i$-th text keyword of the current candidate policy text, $\mathit{kf}_i$ denotes the first frequency of that text keyword, and $i$ and $j$ are positive integers.

Multiplying the keyword score of each text keyword by the first frequency with which that keyword occurred in the preceding preset time period combines how prominent the keyword is within the text with how often it has recently appeared across policy texts. The resulting text importance therefore reflects the importance of the candidate policy text more reasonably and accurately.
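Steps 601 and 602 amount to a sum of products, which can be sketched as follows. The dictionary-based representation and the sample numbers are assumptions made for illustration.

```python
def text_importance(keyword_scores, first_frequencies):
    """Sum, over all text keywords, of keyword score x first frequency."""
    return sum(score * first_frequencies.get(keyword, 0)
               for keyword, score in keyword_scores.items())


# Made-up keyword scores (TF-IDF) and first frequencies for one policy text.
keyword_scores = {"carbon neutrality": 0.05, "energy": 0.02}
first_frequencies = {"carbon neutrality": 100, "energy": 40}
importance = text_importance(keyword_scores, first_frequencies)
print(round(importance, 6))  # 0.05 * 100 + 0.02 * 40 = 5.8
```

A keyword that is missing from the database contributes nothing to the sum, which is the behaviour chosen for this sketch rather than something the text specifies.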

In one possible implementation, determining the target policy text among the candidate policy texts of each policy category according to the text importance in step 203 above may specifically be done as follows: within each policy category, sort the candidate policy texts in descending order of text importance, and determine the candidate policy texts ranked before a third threshold as the target policy texts of that policy category;

or, within each policy category, sort the candidate policy texts in ascending order of text importance, and determine the candidate policy texts ranked after a fourth threshold as the target policy texts of that policy category.

The third threshold and the fourth threshold can be set according to the actual situation, for example 1, 2, or 3, which is not limited in the embodiment of the present invention. Taking descending order of text importance with a third threshold of 1 as an example, suppose the candidate policy texts are A, B, C, D, E, F, G, and H, and grouping by policy category gives A, C, E in the first policy category; B, D, G in the second policy category; and F, H in the third policy category. If the ranking by text importance is C, A, E in the first policy category, D, G, B in the second, and F, H in the third, then the target policy text of the first policy category is C, that of the second policy category is D, and that of the third policy category is F.
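The per-category selection in this example can be sketched as follows. The importance values are made up so that the rankings match the example, and the third threshold is read as "keep the top N texts per category", which is an interpretation of the text.

```python
from collections import defaultdict


def select_targets(importance, categories, third_threshold=1):
    """Pick the top-N most important texts within each policy category.

    `categories` maps a text id to the list of categories it belongs to,
    so one text may appear under several categories.
    """
    by_category = defaultdict(list)
    for text_id, cats in categories.items():
        for cat in cats:
            by_category[cat].append(text_id)
    return {
        cat: sorted(ids, key=lambda t: importance[t], reverse=True)[:third_threshold]
        for cat, ids in by_category.items()
    }


# Made-up importance values reproducing the rankings C, A, E / D, G, B / F, H.
importance = {"A": 0.6, "B": 0.2, "C": 0.9, "D": 0.8,
              "E": 0.1, "F": 0.7, "G": 0.5, "H": 0.3}
categories = {"A": ["first"], "C": ["first"], "E": ["first"],
              "B": ["second"], "D": ["second"], "G": ["second"],
              "F": ["third"], "H": ["third"]}
targets = select_targets(importance, categories, third_threshold=1)
print(targets)  # {'first': ['C'], 'second': ['D'], 'third': ['F']}
```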

In one possible implementation, referring to FIG. 7, which is a schematic flowchart of extracting target key sentences provided by an embodiment of the present invention, extracting the target key sentences from the target policy text in step 204 above may specifically include the following steps 701 to 705.

Step 701: Split the target policy text into sentences to obtain target candidate sentences;

Step 702: Vectorize the target candidate sentences to obtain candidate sentence vectors;

Step 703: Calculate the similarity value between every two target candidate sentences according to the candidate sentence vectors;

Step 704: Calculate the candidate sentence score of each target candidate sentence according to the similarity values;

Step 705: Sort the target candidate sentences in descending order of candidate sentence score and determine the target candidate sentences ranked before a fifth threshold as the target key sentences; or sort the target candidate sentences in ascending order of candidate sentence score and determine the target candidate sentences ranked after a sixth threshold as the target key sentences.

The target policy text can be split into sentences with text analysis software to obtain the target candidate sentences, and the split can be based on punctuation marks. A Word2vec model can then be used to convert each target candidate sentence into a candidate sentence vector. Word2vec is a word vector model: each word in a target candidate sentence of the target policy text is converted into its word vector, and the word vectors of the words in the sentence are concatenated to form the candidate sentence vector.
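Punctuation-based sentence splitting can be sketched with the standard library as follows. The set of sentence-ending marks is an assumption; the text only says that the split can be based on punctuation.

```python
import re

# Sentence-ending marks assumed for this sketch (Chinese and Western).
_SENTENCE_END = r"(?<=[。！？!?])"


def split_sentences(text):
    """Split text after each sentence-ending mark and drop empty pieces."""
    return [part.strip() for part in re.split(_SENTENCE_END, text) if part.strip()]


sample = "This is the first sentence! Is this the second? 这是第三句。"
print(split_sentences(sample))
# ['This is the first sentence!', 'Is this the second?', '这是第三句。']
```

The ASCII full stop is deliberately left out of the pattern so that decimal numbers such as "0.85" are not split; whether that choice suits a given corpus would need to be checked.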

The similarity value between every two target candidate sentences can be calculated from the candidate sentence vectors as the cosine similarity between the two vectors, or alternatively as the Euclidean distance between them, which is not limited in the embodiment of the present invention. Note that the similarity value is calculated for every pair of target candidate sentences. For example, if the target candidate sentences are S1, S2, S3, and S4, then for S1 the similarity values between S1 and S2, between S1 and S3, and between S1 and S4 all need to be calculated.
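The two measures and the pairwise calculation can be sketched as follows. The sketch assumes that all candidate sentence vectors have the same length, which the text does not guarantee when word vectors are concatenated; the helper names and sample vectors are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise_similarity(vectors):
    """Similarity matrix over all sentence pairs; the diagonal is left at 0."""
    n = len(vectors)
    return [[cosine_similarity(vectors[i], vectors[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


# Made-up sentence vectors for S1..S4.
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [2.0, 0.0]]
matrix = pairwise_similarity(vectors)
print(matrix[0])  # similarities of S1 to S1..S4
```

Cosine similarity grows with closeness while Euclidean distance shrinks, so a distance would have to be converted into a similarity before being used as an edge weight.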

After the candidate sentence score of each target candidate sentence has been calculated from the similarity values, the target candidate sentences are sorted by candidate sentence score to determine the target key sentences. The fifth threshold and the sixth threshold can be set according to the actual situation, for example 1, 2, or 3, which is not limited in the embodiment of the present invention. Taking descending order of candidate sentence score as an example, suppose the target candidate sentences are S1, S2, S3, S4, S5, S6, S7, and S8 and the fifth threshold is 2. If the sorted result is S3, S4, S5, S1, S2, S6, S7, S8, the target key sentences are S3 and S4.

In one possible implementation, referring to FIG. 8, which is a schematic flowchart of calculating the candidate sentence scores provided by an embodiment of the present invention, calculating the candidate sentence score of each target candidate sentence according to the similarity values in step 704 above may specifically include the following steps 801 to 803.

Step 801: Take each target candidate sentence as a sentence node and each similarity value as the connecting edge between the corresponding two sentence nodes, and construct a candidate sentence graph from the sentence nodes and connecting edges;

Step 802: Determine the initial score of each sentence node and the probability coefficient;

Step 803: Propagate and update the scores through the candidate sentence graph according to the probability coefficient and the similarity values, starting from the initial scores, until convergence, and take the score of each sentence node at convergence as the candidate sentence score of the corresponding target candidate sentence.

Each sentence node corresponds to one target candidate sentence. The candidate sentence graph can be expressed as G = (V, E), where G is the candidate sentence graph, V is the set of sentence nodes and E is the set of connecting edges. The weight of the connecting edge between any two sentence nodes V_i and V_j in the candidate sentence graph is W_ji; in the embodiments of the present invention the similarity value is used as the weight. For a given sentence node V_i, In(V_i) is the set of sentence nodes pointing to V_i, and Out(V_i) is the set of sentence nodes that V_i points to.
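As a rough illustration of the graph G = (V, E), the sketch below stores the edge weights in a symmetric matrix whose entry `weights[j][i]` plays the role of W_ji. Cosine similarity between candidate sentence vectors is an assumption made for the example, and the names `cosine` and `build_sentence_graph` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_sentence_graph(vectors):
    """Return an n x n weight matrix; node i is sentence i, weights[j][i] is W_ji."""
    n = len(vectors)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine(vectors[i], vectors[j])
            weights[i][j] = w  # similarity is symmetric, so the graph is undirected
            weights[j][i] = w
    return weights


# Three toy sentence vectors (hypothetical values).
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
graph = build_sentence_graph(vectors)
```

Because the similarity is symmetric, In(V_i) and Out(V_i) coincide for every node in this representation.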

The candidate sentence score can be calculated using the following formula:

Figure BDA0003530218050000151 (formula image for the candidate sentence score)

Here WS(V_i) is the candidate sentence score of sentence node V_i, WS(V_j) is the candidate sentence score of sentence node V_j, and d is the probability coefficient, which represents the probability that a sentence node points to any other node in the candidate sentence graph. In(V_i) is the set of sentence nodes pointing to V_i, Out(V_i) is the set of sentence nodes that V_i points to, V_j is a sentence node in the set In(V_i), V_k is a sentence node in the set Out(V_i), W_ji is the similarity value between sentence nodes V_i and V_j, and W_jk is the similarity value between sentence nodes V_k and V_j. The probability coefficient takes a value between 0 and 1. In the embodiments of the present invention it is set to 0.85; it can be understood that the probability coefficient may be set according to actual conditions and is not limited in the embodiments of the present invention.
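The formula itself survives only as an image reference. The symbols defined above (WS, d, In, Out, W_ji, W_jk) match the weighted TextRank update, so, assuming that is what the figure shows, it would read as follows. Note that the standard TextRank form normalizes over Out(V_j); in an undirected similarity graph each node's In and Out sets coincide.

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)}
          \frac{W_{ji}}{\sum_{V_k \in Out(V_j)} W_{jk}} \, WS(V_j)
```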

Each sentence node in the candidate sentence graph is assigned an initial score. The scores of the sentence nodes are then propagated and updated in the candidate sentence graph according to the probability coefficient and the similarity values until convergence, that is, the above candidate sentence score formula is computed iteratively, and the score of each node at convergence is taken as the candidate sentence score of the corresponding target candidate sentence. Constructing the candidate sentence graph and propagating scores through it according to the probability coefficient and the similarity values allows the candidate sentence scores to be calculated efficiently and accurately, so that the target key sentences can then be determined from them.
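The iterative propagation can be sketched as follows. This is a minimal illustration that assumes the weighted TextRank update, an initial score of 1.0 for every node, and a convergence test on the largest score change; the function name `textrank`, the tolerance, the iteration cap and the example weights are all hypothetical choices rather than values taken from the embodiment.

```python
def textrank(weights, d=0.85, tol=1e-6, max_iter=200):
    """Iterate sentence scores on a weighted graph until they stop changing.

    weights[j][i] is the similarity W_ji between sentence j and sentence i.
    d is the probability coefficient (0.85 in the embodiment).
    """
    n = len(weights)
    out_sum = [sum(row) for row in weights]  # total edge weight leaving each node
    scores = [1.0] * n                       # assumed initial score for every node
    for _ in range(max_iter):
        new_scores = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                if j != i and out_sum[j] > 0.0:
                    rank += weights[j][i] / out_sum[j] * scores[j]
            new_scores.append((1.0 - d) + d * rank)
        delta = max(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:  # converged: no score moved by more than tol
            break
    return scores


# Hypothetical similarity values for three sentences.
weights = [
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
scores = textrank(weights)
```

In this toy graph the second sentence is the most strongly connected to the others, so it ends up with the highest score.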

There may be one or more target key sentences. When multiple target key sentences are obtained, they can be merged to obtain the policy summary, and the policy summary finally generated serves as the policy summary for the policy category to which the text belongs. Because the policy morning briefing is meant to let readers quickly grasp the important policy content of the day, reading it should not take long. Policy summary extraction therefore captures the summarizing information in the policy text so that readers can understand the central idea of the policy without spending time reading the full text.
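One simple way to merge several target key sentences is sketched below. Restoring the sentences to their original order in the text before joining them is an assumption made for readability, not something the embodiment specifies, and `build_summary` and the sample sentences are hypothetical.

```python
def build_summary(sentences, key_indices, separator=" "):
    """Join the selected sentences in their original document order."""
    ordered = sorted(set(key_indices))  # drop duplicates, restore text order
    return separator.join(sentences[i] for i in ordered)


sentences = [
    "The subsidy applies to first-time buyers.",
    "Applications open on 1 June.",
    "Contact the local office for forms.",
]
# Indices of the key sentences, in ranking order.
summary = build_summary(sentences, [1, 0])
print(summary)
```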

It can be understood that although the steps in the above flowcharts are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated in this embodiment, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the above flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; nor are they necessarily executed one after another, but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In addition, referring to FIG. 9, which is a schematic structural diagram of the policy text summary generation apparatus provided by an embodiment of the present invention, an embodiment of the present invention further provides a policy text summary generation apparatus 900, which includes:

a policy text classification module 901, configured to obtain candidate titles of candidate policy texts and classify the candidate titles to obtain the policy categories corresponding to the candidate policy texts;

a text importance determination module 902, configured to determine text keywords from the candidate policy text, determine the first frequency with which the text keywords appear in a preset keyword database, and determine the text importance of the candidate policy text according to the first frequency;

a target policy text determination module 903, configured to determine the target policy text among the candidate policy texts of each policy category according to the text importance;

a policy summary generation module 904, configured to extract target key sentences from the target policy text and generate, according to the target key sentences, the policy summary of the policy category corresponding to the target policy text.
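The way the four modules hand data to one another can be sketched as a small pipeline. Everything here is illustrative: the class `PolicySummaryApparatus`, its field names, the `top_n` parameter and the toy stand-in functions are hypothetical, and the real modules would use the classification, keyword and key-sentence methods described in the embodiments.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class PolicySummaryApparatus:
    classify_title: Callable[[str], str]               # stands in for module 901
    text_importance: Callable[[str], float]            # stands in for module 902
    extract_key_sentences: Callable[[str], List[str]]  # stands in for module 904
    top_n: int = 1                                     # texts kept per category (module 903)

    def run(self, documents: Dict[str, str]) -> Dict[str, str]:
        """documents maps a candidate title to its candidate policy text."""
        by_category: Dict[str, List[str]] = {}
        for title, text in documents.items():
            by_category.setdefault(self.classify_title(title), []).append(text)
        summaries: Dict[str, str] = {}
        for category, texts in by_category.items():
            # Module 903: keep the most important texts in each category.
            targets = sorted(texts, key=self.text_importance, reverse=True)[: self.top_n]
            key_sentences = [s for t in targets for s in self.extract_key_sentences(t)]
            summaries[category] = " ".join(key_sentences)
        return summaries


# Toy stand-ins so the sketch runs end to end.
apparatus = PolicySummaryApparatus(
    classify_title=lambda title: "tax" if "tax" in title.lower() else "other",
    text_importance=lambda text: float(len(text)),
    extract_key_sentences=lambda text: [text.split(". ")[0].rstrip(".") + "."],
)
result = apparatus.run({
    "Tax relief notice": "Small firms get a tax cut. Details follow.",
    "Tax filing reminder": "File by May.",
    "Park opening hours": "Parks open at six. Bring water.",
})
print(result)
```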

The policy text summary generation apparatus 900 proposed in the embodiment of the present invention is based on the same inventive concept as the policy text summary generation method described above. The apparatus 900 obtains candidate titles of candidate policy texts and classifies them to obtain the policy category of each candidate policy text; determines text keywords from the candidate policy text, determines the first frequency with which the text keywords appear in a preset keyword database, and determines the text importance of the candidate policy text according to the first frequency; determines the target policy text among the candidate policy texts of each policy category according to the text importance; and extracts target key sentences from the target policy text and generates, from them, the policy summary of the policy category corresponding to the target policy text. Compared with manual collation, policy summaries can be generated automatically and quickly, which improves the efficiency of generating policy text summaries. Because the policy category of a candidate policy text is determined from its candidate title, each generated policy summary corresponds to the policy category of its target policy text, so policy summaries are grouped by category, which is convenient for readers. In addition, the text importance of each candidate policy text is determined from the first frequency of its text keywords, so the more important target policy texts in each policy category can be identified and the policy summary generated from them, which helps improve the accuracy of the generated policy summary.

It can be understood that the above policy text summary generation apparatus may also be specifically configured to execute the processes described in the above policy text summary generation method embodiments.

Referring to FIG. 10, FIG. 10 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention. The electronic device 1000 includes a memory 1001, a processor 1002, and a computer program stored in the memory 1001 and executable on the processor 1002; when run, the computer program executes the above policy text summary generation method.

The processor 1002 and the memory 1001 may be connected by a bus or by other means.

As a non-transitory computer-readable storage medium, the memory 1001 can be used to store non-transitory software programs and non-transitory computer-executable programs, such as the policy text summary generation method described in the embodiments of the present invention. The processor 1002 implements the above policy text summary generation method by running the non-transitory software programs and instructions stored in the memory 1001.

The memory 1001 may include a program storage area and a data storage area. The program storage area may store an operating system and the application program required by at least one function; the data storage area may store data used in executing the above policy text summary generation method. In addition, the memory 1001 may include high-speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device or other non-transitory solid-state storage device. In some embodiments, the memory 1001 may optionally include memory located remotely from the processor 1002, and such remote memory may be connected to the electronic device 1000 through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.

The non-transitory software programs and instructions required to implement the above policy text summary generation method are stored in the memory 1001 and, when executed by one or more processors 1002, carry out that method.

An embodiment of the present invention further provides a computer-readable storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the above policy text summary generation method.

In one embodiment, the computer-readable storage medium stores computer-executable instructions which, when executed by one or more control processors, implement the above policy text summary generation method.

The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

Those of ordinary skill in the art will understand that all or some of the steps and systems in the methods disclosed above may be implemented as software, firmware, hardware, or suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information such as computer-readable instructions, data structures, program modules or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

It should also be understood that the various implementations provided in the embodiments of the present invention may be combined arbitrarily to achieve different technical effects.

The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to these embodiments. Those skilled in the art may make various equivalent modifications or substitutions without departing from the spirit of the present invention, and such equivalent modifications or substitutions fall within the scope defined by the claims of the present invention.

Claims (10)

1. A policy text summary generation method is characterized by comprising the following steps:
obtaining candidate titles of candidate policy texts, and classifying the candidate titles to obtain policy categories corresponding to the candidate policy texts;
determining text keywords from the candidate policy text, determining first frequency of the text keywords appearing in a preset keyword database, and determining text importance of the candidate policy text according to the first frequency;
determining a target policy text in the candidate policy texts of the policy categories according to the text importance;
and extracting a target key sentence from the target policy text, and generating a policy summary of the policy category corresponding to the target policy text according to the target key sentence.
2. The method of claim 1, wherein the classifying the candidate titles to obtain the policy categories corresponding to the candidate policy texts comprises:
performing word segmentation processing on the candidate titles to obtain title keywords;
determining a second frequency of the title keywords appearing in the candidate policy text, and calculating keyword vectors of the title keywords according to the second frequency;
calculating title vectors of the candidate titles according to the keyword vectors;
and classifying the candidate titles according to the title vectors to obtain policy categories corresponding to the candidate policy texts.
3. The method of claim 1 wherein the determining text keywords from the candidate policy text comprises:
performing word segmentation processing on the candidate policy text to obtain text candidate words;
calculating a keyword score of the text candidate word in the candidate policy text;
sorting the text candidate words according to the descending order of the keyword scores, and determining the text candidate words with the ranking before a first threshold value as text keywords; or sorting the text candidate words according to the sequence of the scores of the keywords from small to large, and determining the text candidate words ranked after a second threshold value as text keywords.
4. The method of claim 3, wherein the determining the text importance of the candidate policy text according to the first frequency comprises:
obtaining word importance of the text keywords according to the product of the first frequency and the keyword scores corresponding to the text keywords;
and obtaining the text importance of the candidate policy text according to the sum of the word importance of all the text keywords in the candidate policy text.
5. The method of claim 1, wherein the determining a target policy text among the candidate policy texts in each policy category according to the text importance comprises:
in each policy category, sorting the candidate policy texts according to the sequence of the text importance degrees from large to small, and determining the candidate policy texts with the ranking before a third threshold value as target policy texts in the corresponding policy category;
or in each policy category, the candidate policy texts are sorted according to the order of the text importance degrees from small to large, and the candidate policy texts ranked after the fourth threshold are determined as the target policy texts in the corresponding policy category.
6. The method of claim 1, wherein the extracting target key sentences from the target policy text comprises:
performing sentence division processing on the target policy text to obtain a target candidate sentence;
vectorizing the target candidate sentence to obtain a candidate sentence vector;
calculating the similarity value between every two target candidate sentences according to the candidate sentence vectors;
calculating a candidate sentence score corresponding to the target candidate sentence according to the similarity value;
sorting the target candidate sentences in descending order of candidate sentence score, and determining the target candidate sentences ranked before a fifth threshold value as target key sentences; or sorting the target candidate sentences in ascending order of candidate sentence score, and determining the target candidate sentences ranked after a sixth threshold value as target key sentences.
7. The method of claim 6, wherein the calculating a candidate sentence score corresponding to the target candidate sentence according to the similarity value comprises:
taking the target candidate sentence as a sentence node, taking the similarity value as a connecting edge between two corresponding sentence nodes, and constructing a candidate sentence graph according to the sentence node and the connecting edge;
determining an initial score and a probability coefficient of the sentence node, wherein the probability coefficient is used for representing the probability that the sentence node points to any other node in the candidate sentence graph;
and propagating and updating the initial score in the candidate sentence graph according to the probability coefficient and the similarity value until convergence, and taking the score at convergence as the candidate sentence score corresponding to the target candidate sentence.
8. A policy text summary generation apparatus, comprising:
the policy text classification module is used for acquiring candidate titles of the candidate policy texts, classifying the candidate titles and obtaining policy categories corresponding to the candidate policy texts;
the text importance determining module is used for determining text keywords from the candidate policy text, determining the first frequency of the text keywords appearing in a preset keyword database, and determining the text importance of the candidate policy text according to the first frequency;
a target policy text determining module, configured to determine a target policy text from the candidate policy texts in each policy category according to the text importance;
and the policy summary generating module is used for extracting a target key sentence from the target policy text and generating a policy summary of the policy category corresponding to the target policy text according to the target key sentence.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program, and the processor implements the policy text summary generating method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that the storage medium stores a program, the program being executed by a processor to implement the policy text summary generation method according to any one of claims 1 to 7.
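The arithmetic in claims 4 and 5 can be illustrated with a short Python sketch: word importance is the product of a keyword's first frequency and its keyword score, text importance is the sum over all text keywords, and the highest-ranked texts in a category become target policy texts. The function names, the sample keywords and all numbers are hypothetical, and treating the third threshold as "keep the top n texts" is an assumed interpretation.

```python
def text_importance(keyword_scores, db_frequency):
    """Sum of (first frequency x keyword score) over a text's keywords."""
    return sum(
        db_frequency.get(word, 0) * score
        for word, score in keyword_scores.items()
    )


def select_targets(importance_by_text, top_n):
    """Return the ids of the top_n texts by importance within one category."""
    ranked = sorted(importance_by_text, key=importance_by_text.get, reverse=True)
    return ranked[:top_n]


# Hypothetical keyword scores for one text and frequencies in the keyword database.
keyword_scores = {"subsidy": 0.5, "housing": 0.25}
db_frequency = {"subsidy": 4, "housing": 2}
importance = text_importance(keyword_scores, db_frequency)  # 4*0.5 + 2*0.25

# Hypothetical importances for three texts in the same policy category.
targets = select_targets({"doc_a": importance, "doc_b": 1.0, "doc_c": 3.5}, 2)
```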
CN202210208867.0A 2022-03-03 2022-03-03 Method, device, electronic device and storage medium for generating policy text abstract Pending CN114756673A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210208867.0A CN114756673A (en) 2022-03-03 2022-03-03 Method, device, electronic device and storage medium for generating policy text abstract

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210208867.0A CN114756673A (en) 2022-03-03 2022-03-03 Method, device, electronic device and storage medium for generating policy text abstract

Publications (1)

Publication Number Publication Date
CN114756673A true CN114756673A (en) 2022-07-15

Family

ID=82324788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210208867.0A Pending CN114756673A (en) 2022-03-03 2022-03-03 Method, device, electronic device and storage medium for generating policy text abstract

Country Status (1)

Country Link
CN (1) CN114756673A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118395232A (en) * 2024-04-26 2024-07-26 中企知研(北京)科技有限公司 Digital service resource optimized storage method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635103A (en) * 2018-12-17 2019-04-16 北京百度网讯科技有限公司 Abstraction generating method and device
CN110069623A (en) * 2017-12-06 2019-07-30 腾讯科技(深圳)有限公司 Summary texts generation method, device, storage medium and computer equipment
CN112307205A (en) * 2020-10-22 2021-02-02 首都师范大学 Text classification method, system and computer storage medium based on automatic summarization


Similar Documents

Publication Publication Date Title
Negara et al. Topic modelling twitter data with latent dirichlet allocation method
CN108717408B (en) A sensitive word real-time monitoring method, electronic equipment, storage medium and system
US9589208B2 (en) Retrieval of similar images to a query image
Sebastiani Classification of text, automatic
Pereira et al. Using web information for author name disambiguation
US11227183B1 (en) Section segmentation based information retrieval with entity expansion
CN109271514B (en) Generation method, classification method, device and storage medium of short text classification model
CN109241277B (en) Text vector weighting method and system based on news keywords
KR20180011254A (en) Web page training methods and devices, and search intent identification methods and devices
CN109086355B (en) Hot-spot association relation analysis method and system based on news subject term
US20230074771A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
CN112989208B (en) Information recommendation method and device, electronic equipment and storage medium
CN112926297B (en) Method, apparatus, device and storage medium for processing information
CN110858217A (en) Method and device for detecting microblog sensitive topics and readable storage medium
CN109635157A (en) Model generating method, video searching method, device, terminal and storage medium
CN114255067A (en) Data pricing method and device, electronic equipment and storage medium
US20240168999A1 (en) Hierarchical clustering on graphs for taxonomy extraction and applications thereof
Jo Using K Nearest Neighbors for text segmentation with feature similarity
CN113157857B (en) News-oriented hot topic detection method, device and equipment
CN111930949B (en) Search string processing method and device, computer readable medium and electronic equipment
CN114756673A (en) Method, device, electronic device and storage medium for generating policy text abstract
Rousseau Graph-of-words: mining and retrieving text with networks of features
CN118194859A (en) Picture and text matching method, device, equipment and medium
CN111681731A (en) Method for automatically marking colors of inspection report
CN115496066B (en) Text analysis system, method, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination