
CN118747293A - Document writing intelligent recall method and device and document generation method and device - Google Patents

Info

Publication number
CN118747293A
Authority
CN
China
Prior art keywords
content
chapter
chapter content
recall
similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410619691.7A
Other languages
Chinese (zh)
Inventor
陶铸
周红喆
李兴栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Deep Intelligent Pharma Technology Co ltd
Original Assignee
Beijing Deep Intelligent Pharma Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Deep Intelligent Pharma Technology Co., Ltd.
Priority to CN202410619691.7A
Publication of CN118747293A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Pure & Applied Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Operations Research (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Machine Translation (AREA)

Abstract

The application provides an intelligent recall method and device for document writing, and a document generation method and device. The intelligent recall method for document writing comprises the following steps: performing structured processing on a reference file, preprocessing the structured reference file, and preprocessing target chapter content, wherein the target chapter content is the chapter title information of the content to be generated; calculating the vector cosine similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file, and calculating the semantic matching similarity between the preprocessed target chapter content and the preprocessed reference chapter content; and performing a weighted calculation on the cosine similarity and the semantic matching similarity to obtain a similarity score, and determining, based on the similarity score, whether to recall the reference chapter. The application solves the technical problem of inaccurate recalled content during document writing.

Description

Document writing intelligent recall method and device, and document generation method and device

Technical Field

The present application relates to the field of artificial intelligence technology, and in particular to an intelligent recall method and device for document writing, and a document generation method and device.

Background Art

With the development of artificial intelligence technology, and especially since the release of the GPT4 large language model, powerful text understanding and generation capabilities have opened up enormous possibilities for generated text content. Large language models can produce text or image content that looks very convincing. However, these models may also produce "hallucinations": information that seems reasonable but is in fact wrong or inconsistent with reality. At present, the hallucination problem of large language models is mainly addressed through information recall. Retrieved documents provide the generator with additional, topic-related information, which helps the model use more accurate facts and data when generating answers; by citing information from the retrieved documents, the generator can produce content that is more consistent with the facts and reduce hallucinations based on false information.

General-purpose information recall techniques split the text content into fixed-length chunks to avoid feeding too much content into the large language model and scattering its attention. Each chunk is then vectorized, the input content is vectorized as well, and similar content is recalled by computing vector similarity. Fixed-length chunking leaves each chunk with incomplete semantic information, and sentence-similarity computation alone cannot accurately recall similar content.

In the existing practice of automatic writing of medical documents, reference materials must be summarized and analyzed during the writing process. To provide the model with higher-quality professional texts for reference and to improve writing efficiency, a better text recall strategy is needed.

No effective solution to the above problems has been proposed so far.

Summary of the Invention

The embodiments of the present invention provide an intelligent recall method and device for document writing, and a document generation method and device, so as to at least solve the technical problem of inaccurate recalled content when writing a document.

According to one aspect of the embodiments of the present invention, an intelligent recall method for document writing is provided, comprising: performing structured processing on a reference file, preprocessing the structured reference file, and preprocessing target chapter content, wherein the target chapter content is the chapter title information of the content to be generated; calculating the vector cosine similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file, and calculating the semantic matching similarity between the preprocessed target chapter content and the reference chapter content; and performing a weighted calculation on the cosine similarity and the semantic matching similarity to obtain a similarity score, and determining, based on the similarity score, whether to recall the reference chapter.

According to another aspect of the embodiments of the present invention, a document generation method is also provided, comprising: receiving target chapter content; using the above intelligent recall method for document writing to determine whether to recall reference chapter content; and, when it is determined that the reference chapter content needs to be recalled, recalling the reference chapter content and generating a document based on the target chapter content and the reference chapter content.

According to yet another aspect of the embodiments of the present invention, an intelligent recall device for document writing is also provided, comprising: a processing module, configured to perform structured processing on a reference file, preprocess the structured reference file, and preprocess target chapter content, wherein the target chapter content is content generated intelligently based on the reference chapter content; a similarity calculation module, configured to calculate the vector cosine similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file, and to calculate the semantic matching similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file; and a recall module, configured to perform a weighted calculation on the cosine similarity and the semantic matching similarity to obtain a similarity score, and to determine, based on the similarity score, whether to recall the reference chapter content.

According to yet another aspect of the embodiments of the present invention, a document generation device is also provided, comprising: a receiving module, configured to receive target chapter content; a judging module, configured to use the above intelligent recall method for document writing to determine whether the reference chapter content needs to be recalled; and a recall module, configured to, when it is determined that the reference chapter content needs to be recalled, recall the reference chapter content and generate a document based on the target chapter content and the reference chapter content.

In the embodiments of the present invention, a reference file is structured, the structured reference file is preprocessed, and target chapter content is preprocessed, wherein the target chapter content is the chapter title information of the content to be generated; the vector cosine similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file is calculated, and the semantic matching similarity between the preprocessed target chapter content and the reference chapter content is calculated; and the cosine similarity and the semantic matching similarity are weighted to obtain a similarity score, based on which it is determined whether to recall the reference chapter. The above method solves the technical problem of inaccurate recalled content during document writing.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which constitute a part of the present application, are provided to give a further understanding of the present application. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not constitute an improper limitation of the present application. In the drawings:

FIG. 1 is a flow chart of an intelligent recall method for document writing according to an embodiment of the present application;

FIG. 2 is a flow chart of another intelligent recall method for document writing according to an embodiment of the present application;

FIG. 3 is a flow chart of a similarity calculation method according to an embodiment of the present application;

FIG. 4 is a flow chart of yet another intelligent recall method for document writing according to an embodiment of the present application;

FIG. 5 is a flow chart of a data preprocessing method according to an embodiment of the present application;

FIG. 6 is a flow chart of a document generation method according to an embodiment of the present application;

FIG. 7 is a schematic structural diagram of an intelligent recall device for document writing according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a document generation device according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of an electronic device suitable for implementing the embodiments of the present disclosure.

DETAILED DESCRIPTION

It should be noted that, provided there is no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings and in combination with the embodiments.

It should be noted that the terms used herein are only for describing specific embodiments and are not intended to limit the exemplary embodiments according to the present application. As used herein, unless the context clearly indicates otherwise, singular forms are also intended to include plural forms. In addition, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

Unless otherwise specifically stated, the relative arrangement of the components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the present application. Meanwhile, it should be understood that, for ease of description, the sizes of the various parts shown in the accompanying drawings are not drawn to actual scale. Techniques, methods and devices known to those of ordinary skill in the relevant art may not be discussed in detail. In all examples shown and discussed here, any specific value should be interpreted as merely exemplary rather than limiting; other examples of the exemplary embodiments may therefore have different values. It should be noted that similar reference numerals and letters denote similar items in the following drawings, so once an item has been defined in one drawing, it does not need to be discussed further in subsequent drawings.

Embodiment 1

In existing recall techniques, a document needs to be split into multiple text chunks before vector embedding. Setting aside the input-length limits and cost of large models, the goal is to minimize noise in the embedded content while maintaining semantic coherence, so as to find the parts of the document most relevant to the user's query more effectively. If a chunk is too large, it may contain too much irrelevant information, reducing retrieval accuracy. Conversely, if a chunk is too small, necessary contextual information may be lost, making the generated response lack coherence or depth. Fixed-length chunking leads to incomplete semantics and insufficient semantic matching; a vector-similarity method alone cannot handle semantic similarity well, and because different words in a sentence differ in importance, semantically similar text is lost.

The embodiments of the present application disclose an intelligent recall method for document writing. As shown in FIG. 1, the method comprises the following steps:

Step S102: perform structured processing on a reference file, preprocess the structured reference file, and preprocess target chapter content, wherein the target chapter content is the chapter title information of the content to be generated.

First, the reference file is structured. It is determined whether the reference file conforms to a preset format. If the reference file conforms to the preset format, chapter titles and the content corresponding to each chapter title are extracted from the reference file; if it does not, chapter titles and the content corresponding to each chapter title are identified and split out of the reference file based on its context. The content corresponding to a chapter title includes the reference chapter content. Structuring the reference file in this way ensures format consistency, making subsequent processing more efficient and accurate. Extracting chapter titles and their corresponding content also makes the document structure clearer and easier for users to understand and search. Even when the reference file does not conform to the preset format, context-based splitting can still effectively identify chapter titles and content while preserving the integrity and coherence of the document. Most importantly, this structured processing provides a more accurate data basis for subsequent text analysis and similarity calculation, improving the quality and accuracy of the recalled content.

Next, the structured reference file is preprocessed. For example, based on a medical dictionary, the reference chapter content in the structured reference file is segmented into words, yielding a first word-segmentation set corresponding to the reference chapter content, and stop words are filtered out of the first set. The words in the filtered first set that meet preset conditions are then expanded in meaning and weighted, and the word vectors of the words in the expanded and weighted first set are computed to obtain a first word-vector set. Finally, the first sentence vector of the whole sentence corresponding to each word vector in the first set is computed, and the word vectors in the first set and the corresponding first sentence vectors are summed with weights to obtain the reference content vector.
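
As a rough illustration of this preprocessing, the Python sketch below segments a text with a medical user dictionary, drops stop words, applies per-term weights, and fuses the word-level and sentence-level vectors. The dictionary and stop-word files, the TERM_WEIGHT table, the fusion weight alpha, and the embed lookup are placeholders rather than part of the disclosed method, and synonym expansion is omitted for brevity.

    import numpy as np
    import jieba

    jieba.load_userdict("medical_dict.txt")          # assumed dictionary of medical terms
    STOPWORDS = set(open("stopwords.txt", encoding="utf-8").read().split())
    TERM_WEIGHT = {"不良事件": 2.0, "药代动力学": 2.0}   # hypothetical domain-term weights

    def embed(text: str) -> np.ndarray:
        """Placeholder for a word2vec-style lookup (single word) or sentence encoder (whole text)."""
        raise NotImplementedError

    def content_vector(text: str, alpha: float = 0.5) -> np.ndarray:
        """Fuse weighted word vectors with a whole-sentence vector, as described above."""
        tokens = [t for t in jieba.lcut(text) if t.strip() and t not in STOPWORDS]
        word_part = np.mean([TERM_WEIGHT.get(t, 1.0) * embed(t) for t in tokens], axis=0)
        sent_part = embed(text)                      # global sentence-level representation
        return alpha * word_part + (1 - alpha) * sent_part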

At the same time, the target chapter content is preprocessed. For example, based on a medical dictionary, the target chapter content is segmented into words, yielding a second word-segmentation set corresponding to the target chapter content, and stop words are filtered out of the second set. The words in the filtered second set that meet preset conditions are expanded in meaning and weighted, and the word vectors of the words in the expanded and weighted second set are computed to obtain a second word-vector set. Finally, the second sentence vector of the whole sentence corresponding to each word vector in the second set is computed, and the word vectors in the second set and the corresponding second sentence vectors are summed with weights to obtain the target chapter content vector.

By performing word segmentation based on a medical dictionary, filtering stop words, and expanding and weighting word meanings, this embodiment represents medical terms and content more accurately, making the text-vector representation richer and more precise. Such preprocessing not only improves semantic understanding of medical texts but also strengthens the accuracy of similarity calculation, thereby improving the efficiency and precision of document recall. Applying the same preprocessing to the target chapter content ensures that the characteristics and semantic information of the text are taken into account in the subsequent similarity calculation, providing a more reliable basis for the final recall results.

Step S104: calculate the vector cosine similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file, and calculate the semantic matching similarity between the preprocessed target chapter content and the reference chapter content.

Vector cosine similarity measures how similar two vectors are in direction. In this scheme, computing the vector cosine similarity between the preprocessed target chapter content vector and the reference chapter content vector helps determine how semantically close they are: a higher cosine similarity indicates that the two vectors are semantically closer, which increases the accuracy and relevance of the recalled content.

In addition to the vector cosine similarity, this embodiment uses a semantic matching model (such as GPT4) to understand and analyze the target chapter content and the reference chapter content, yielding the semantic matching similarity between them. Semantic matching similarity puts more weight on the semantic relationship between the texts and better captures their meaning and contextual information, improving the precision and accuracy of the recalled content.

Step S106: perform a weighted calculation on the cosine similarity and the semantic matching similarity to obtain a similarity score, and determine, based on the similarity score, whether to recall the reference chapter.

After the vector cosine similarity and the semantic matching similarity have been computed, the two similarities are combined into a single comprehensive similarity score. This combined calculation considers the texts from several angles and provides a more accurate and reliable basis for the final recall of reference documents.

When computing the similarity score, a similarity threshold can be set, and whether to recall the reference chapter is determined by comparing the similarity score with the threshold. For example, if the similarity score is greater than a preset threshold, the reference chapter is recalled; if the similarity score is less than or equal to the preset threshold, it is not. By tuning the similarity threshold, the recall rate and precision of the recall results can be balanced to meet the requirements of different application scenarios and user needs.
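
A minimal sketch of this decision step is given below; the combination weight and threshold are illustrative values, not figures taken from the disclosure.

    def similarity_score(cos_sim: float, sem_sim: float, w: float = 0.6) -> float:
        """Weighted combination of the vector cosine similarity and the semantic matching similarity."""
        return w * cos_sim + (1 - w) * sem_sim

    def should_recall(cos_sim: float, sem_sim: float, threshold: float = 0.75) -> bool:
        """Recall the reference chapter only when the combined score exceeds the preset threshold."""
        return similarity_score(cos_sim, sem_sim) > threshold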

In the embodiments of the present application, documents are parsed in a structured way: following the medical document writing scenario, each chapter title and its content are treated as one unit. This maintains semantic coherence while preserving the structure of the original document. Structured parsing lets the large model better understand the structure and content of the reference document, reduces noise in the input, and improves the accuracy of the recalled content.

In addition, the embodiments of the present application weight word and sentence vectors with medical-domain terms. Weighting medical terms represents the importance of professional vocabulary more accurately in the vector space, improving the recall quality for queries involving professional terminology. Advanced language models such as GPT4 can understand and process complex natural language; drawing on their semantic analysis capability improves the understanding of query intent and extracts more relevant reference content. Finally, the vector similarity result and the GPT4 semantic-analysis similarity result are combined with weights to obtain the final, highly similar reference content. By adjusting the weighting scheme, the retrieval system can be customized for specific application scenarios or user needs, making it better suited to particular user groups.

Embodiment 2

The embodiments of the present application disclose another intelligent recall method for document writing. As shown in FIG. 2, the method comprises the following steps:

Step S202: upload reference files.

Users upload the reference files needed for medical writing. These files should be in docx or doc format for subsequent processing and analysis. A reference file is denoted Rd; multiple reference files are denoted Rd1, Rd2, Rd3, and so on.

Step S204: process the uploaded reference files.

Two cases need to be distinguished for reference files. For well-formatted files, chapter titles and the corresponding content are extracted automatically using font, text position, and size features. For files without a clear format, the text is divided into chunks and fed sequentially into a splitting model, which identifies and splits out chapter titles and their corresponding content based on the context, yielding the reference chapter content. The reference chapter content is denoted Rc; each Rc contains a chapter title Ct and chapter content Cc.

Step S206: determine whether to recall.

In a typical medical writing scenario, the content of the first chapter usually has no reference value, so it is automatically excluded. In addition, structured data such as experimental flow charts are key information in the literature; if present, they need to be specially extracted for later use.

The target chapter content is preprocessed, the similarity between the target chapter content and the reference chapter content is computed, and whether to recall the reference chapter content is determined based on the result of the similarity calculation.

The similarity calculation method, a key step of the whole pipeline that extracts the key reference text, is described in detail below. It involves two key points: a professional medical-domain vocabulary is used to expand the meanings of specialized terms, and a weighted similarity calculation combines word-level and sentence-level similarity with GPT4 semantic analysis, making the result more accurate.

Specifically, as shown in FIG. 3, the similarity calculation method comprises the following steps:

Step S302: perform structured processing on the reference file Rd.

Step S304: preprocess the structured reference file.

The text of the reference chapter content Cc within the chapter content Rc of the reference file Rd is segmented into words; a medical dictionary is used during segmentation to keep medical words intact. Stop-word filtering is then applied. Next, specialized terms are expanded in meaning using the medical-domain vocabulary and are marked with weights; specialized terms receive larger weights and are expanded with words of similar meaning from the dictionary. Finally, the filtered words are vectorized, and these word vectors are combined with the global vectorized representation of the whole sentence by a weighted sum, fusing word-level and sentence-level semantics.

Step S306: preprocess the target chapter content.

The processing is the same as in step S304 and is not repeated here. The target chapter content is denoted Gc.

Step S308: perform the similarity calculation.

The similarity calculation may include the following:

1) Calculate the first vector cosine similarity.

Compute the similarity of the vectors of the segmented sentences (the sentences Gci in Gc and the sentences Cci in Cc). For each word in a sentence, its word2vec vector is multiplied by its TF-IDF weight; here a sentence is denoted s and a segmented word in the sentence is denoted w_i. The formula is as follows:

tv(w_i, s) = tfidf(w_i, s, S) · v(w_i)

where tv(w_i, s) is the TF-IDF-weighted word embedding (word2vec) vector of word w_i in sentence s, v(w_i) is the word's word2vec vector, and tfidf(w_i, s, S) is the word's TF-IDF weight.

Based on the word embedding (word2vec) vectors, the average of the weighted word2vec vectors of all words is computed:

TV(s) = (1/n) · Σ_{i=1..n} tv(w_i, s)

where TV(s) is the text vector of sentence s and n is the number of words w_i in sentence s.

The cosine similarity is then computed between the vectors of the target chapter content and the reference chapter content:

Cos(TV_G, TV_C) = (TV_Gc · TV_Cc) / (|TV_Gc| · |TV_Cc|)

where Cos(TV_G, TV_C) is the vector cosine similarity, TV_Gc is the vector of the target chapter content, and TV_Cc is the vector of the reference chapter content; TV_Gc is the sum of the vectors obtained by substituting each sentence Gci of Gc into the TV(s) formula above, and TV_Cc is the sum of the vectors obtained by substituting each sentence Cci of Cc into the same formula.
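
A sketch of this first cosine similarity is given below, assuming a trained word2vec model w2v (a dict-like mapping from token to vector) and precomputed TF-IDF weights tfidf; both are stand-ins rather than components specified by the text.

    import numpy as np

    def text_vector(tokens, w2v, tfidf):
        """TV(s): mean of the TF-IDF-weighted word2vec vectors of the tokens of one sentence."""
        vecs = [tfidf.get(t, 1.0) * w2v[t] for t in tokens if t in w2v]
        return np.mean(vecs, axis=0)

    def chapter_vector(sentences, w2v, tfidf):
        """TV_Gc / TV_Cc: sum of the sentence vectors of a chapter, as described above."""
        return np.sum([text_vector(s, w2v, tfidf) for s in sentences], axis=0)

    def cosine(a, b):
        """Cos(a, b) = a·b / (|a|·|b|)."""
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))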

By multiplying word embeddings by TF-IDF weights, the sentence vector reflects the semantic content of the sentence more accurately, which improves the quality of sentence-level semantic representation. Taking the average of the weighted word2vec vectors then yields the text vector of the sentence; in this way the semantic contribution of each word is blended evenly, so the sentence vector represents the semantic information of the sentence more comprehensively, which benefits the accuracy and performance of subsequent tasks such as sentence similarity calculation and text classification.

2) Calculate the second vector cosine similarity.

The target chapter content and the reference chapter content are converted directly into sentence vectors by GPT4.

The cosine similarity is computed between the sentence vectors converted directly from the target chapter content and the reference chapter content:

Cos(V_G, V_C) = (V_Gc · V_Cc) / (|V_Gc| · |V_Cc|)

where Cos(V_G, V_C) is the vector cosine similarity, V_Gc is the sentence vector of the target chapter content, and V_Cc is the sentence vector of the reference chapter content.
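
A corresponding sketch for the second cosine similarity follows; embed_chapter stands in for whichever large-model embedding interface is actually used and is an assumption, not part of the disclosure.

    import numpy as np

    def embed_chapter(text: str) -> np.ndarray:
        """Placeholder for a large-model sentence-embedding call returning one dense vector."""
        raise NotImplementedError

    def second_cosine(target_chapter: str, reference_chapter: str) -> float:
        v_g = embed_chapter(target_chapter)      # V_Gc: target chapter sentence vector
        v_c = embed_chapter(reference_chapter)   # V_Cc: reference chapter sentence vector
        return float(np.dot(v_g, v_c) / (np.linalg.norm(v_g) * np.linalg.norm(v_c)))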

The first and second vector cosine similarities differ in how the vectors are generated.

3) Calculate the semantic matching similarity.

A semantic matching model such as GPT4 is used to directly understand and analyze the target chapter content and the reference chapter content, yielding the semantic matching similarity:

Mcos(V_G, V_C) = (MV_Gc · MV_Cc) / (|MV_Gc| · |MV_Cc|)

where Mcos(V_G, V_C) is the model's vector cosine similarity, MV_Gc is the vector of the target chapter content (the original content, without word segmentation, term expansion, or similar operations), and MV_Cc is the vector of the reference chapter content (likewise the original content).

Step S310: compute the similarity score, with the formula:

SimScore = λ · Cos(TV_G, TV_C) + σ · Cos(V_G, V_C) + (1 − σ − λ) · Mcos(V_G, V_C)

where SimScore is the similarity score, and σ and λ are the first and second weight coefficients, respectively (their sum lies between 0 and 1).
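
The weighted score itself reduces to one line; the values of λ and σ below are purely illustrative.

    def sim_score(cos_tv: float, cos_v: float, mcos_v: float,
                  lam: float = 0.4, sigma: float = 0.3) -> float:
        """SimScore = λ·Cos(TV_G, TV_C) + σ·Cos(V_G, V_C) + (1 − σ − λ)·Mcos(V_G, V_C)."""
        assert 0.0 <= lam + sigma <= 1.0
        return lam * cos_tv + sigma * cos_v + (1 - sigma - lam) * mcos_v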

The first and second vector cosine similarities and the semantic matching similarity computed above are combined with these weights, and the reference documents that meet the similarity threshold are recalled. Steps S304 to S310 are carried out in turn for the target chapters of all reference files, recalling every reference document that meets the requirements.

Through the structured processing of the reference text and the weighted similarity calculation at the word, sentence, and semantic levels, the present application effectively improves the semantic consistency and the accuracy of the recalled content. Specifically, the reference materials are split into structured chapters according to the medical writing scenario, and GPT is used to process files without a regular format. In addition, a professional vocabulary is used for the medical scenario, word- and sentence-vector similarity are fused, and the ability of large models to understand and process complex language is exploited, producing advanced recall results.

Embodiment 3

The embodiments of the present application disclose yet another intelligent recall method for document writing. As shown in FIG. 4, the method comprises the following steps:

Step S402: perform structured processing on the reference file.

1) For well-formatted files, chapter titles and the corresponding content are extracted automatically using font, text position, and size features.

First, the system identifies possible chapter titles by analyzing font features, including checking for different font styles such as bold, italic, or specific typefaces. Chapter titles usually use a larger font size, so the system can set a font-size threshold to identify candidate title text.

Second, the system analyzes the positional features of the text to determine likely chapter-title locations. Chapter titles are usually located at the top of a page or at fixed positions, so the system can identify potential title locations by checking the positional coordinates of the text. In addition, some files follow specific layout structures, for example with the title in the upper-left corner of the page or centered, and the system can use these layout features to further pin down the title location.

Once the system has identified possible title locations, it can start extracting the corresponding content. The system searches downward from a title location and applies preset rules to extract the text content that matches the title format. These rules take the title's format into account, such as specific identifiers, styles, or text structure. Using these rules, the system can accurately extract chapter titles and their corresponding content.

In general, this scheme for automatically extracting chapter titles and corresponding content from font, position, and size features analyzes the font, position, and layout of the text and combines them with rules to identify and extract the chapter structure of a file, enabling intelligent processing and content extraction for well-formatted files.
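
A minimal sketch of this kind of rule-based extraction for .docx files is shown below, assuming the python-docx library; the 14 pt threshold and the reliance on heading styles and bold runs are illustrative choices, not the claimed rules.

    from docx import Document  # pip install python-docx

    def extract_sections(path: str, min_heading_pt: float = 14.0):
        """Split a well-formatted .docx into (chapter title, chapter content) pairs."""
        sections, title, body = [], None, []
        for p in Document(path).paragraphs:
            text = p.text.strip()
            if not text:
                continue
            size = p.runs[0].font.size.pt if p.runs and p.runs[0].font.size else None
            looks_like_heading = p.style.name.startswith("Heading") or (
                size is not None and size >= min_heading_pt and all(r.bold for r in p.runs)
            )
            if looks_like_heading:
                if title is not None:
                    sections.append((title, "\n".join(body)))
                title, body = text, []
            else:
                body.append(text)
        if title is not None:
            sections.append((title, "\n".join(body)))
        return sections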

2) For files without a clear format, the text is chunked, and a splitting model identifies and splits out chapter titles and their corresponding content, yielding the reference chapter content.

First, the whole reference file is divided into smaller text chunks; line breaks, punctuation, and similar cues can be used to split the text into paragraph- or sentence-level chunks. Chunking in this way ensures that each chunk carries a certain amount of information while reducing processing complexity (a small chunking sketch is given at the end of this subsection).

The splitting model can be built with deep learning models such as recurrent neural networks (RNN), convolutional neural networks (CNN), or attention mechanisms. Its training data can be an annotated text dataset with chapter structure, together with the position information of the corresponding chapter titles and content. The splitting model is trained on this annotated data so that it can accurately identify the chapter titles in a text and their corresponding content. During training, a loss function such as cross-entropy and an optimizer such as stochastic gradient descent (SGD) or Adam can be used to optimize the model parameters.

The trained splitting model is then applied to actual document processing to split files without a clear format into chapters. For each text chunk, the splitting model identifies possible chapter titles and determines the corresponding content for each title. Through the above steps, chapter splitting of files without a clear format can be achieved. By combining a deep learning model with text chunking, this embodiment can effectively identify and split out the chapter structure of a file, providing a basis for subsequent content extraction and processing.
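
The chunking step referred to above can be as simple as the sketch below; the separators and length limit are illustrative and not prescribed by the text.

    import re

    def chunk_text(raw: str, max_chars: int = 500):
        """Split raw text into paragraph- or sentence-level chunks for the splitting model."""
        chunks = []
        for para in (p.strip() for p in raw.split("\n") if p.strip()):
            if len(para) <= max_chars:
                chunks.append(para)
            else:
                # fall back to sentence-level splitting on Chinese/Western sentence terminators
                chunks.extend(s.strip() for s in re.split(r"(?<=[。！？.!?])", para) if s.strip())
        return chunks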

Step S404: preprocess the reference file and the target chapter content.

The following preprocessing operations are applied to the text of the reference chapter content of the reference file and to the target chapter content. As shown in FIG. 5, the preprocessing comprises the following steps:

Step S4042: word segmentation.

A word segmentation tool, such as the jieba Chinese word segmentation library, is used to segment the text of the target chapter content and the reference chapter content. During segmentation, a medical dictionary is used as an additional vocabulary to keep medical words intact, which ensures that professional terms are not mistakenly split into multiple words.

Step S4044: stop-word filtering.

Stop-word filtering is applied to the segmentation results to remove words that are common in medical texts but carry no real meaning, such as “的” (“of”) and “是” (“is”). Stop-word filtering reduces noise and improves the efficiency and accuracy of subsequent processing.
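
Steps S4042 and S4044 amount to the following, where the user-dictionary and stop-word files are assumed resources rather than files named in the disclosure.

    import jieba

    jieba.load_userdict("medical_terms.txt")   # assumed medical user dictionary
    STOPWORDS = set(open("cn_stopwords.txt", encoding="utf-8").read().split())

    def tokenize(text: str):
        """Segment with the medical dictionary loaded, then drop stop words."""
        return [t for t in jieba.lcut(text) if t.strip() and t not in STOPWORDS]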

Step S4046: word-meaning expansion and weight marking.

Combined with the professional vocabulary of the medical field, the segmentation results are expanded in meaning and marked with weights. For medical terms, weight information can be obtained from a medical dictionary or a professional medical terminology base, and their importance is marked accordingly. Each word is assigned a weight according to its importance in the medical vocabulary; professional terms usually receive higher weights than general words.

Words in the vocabulary with similar meanings are used for synonym expansion. Looking up words of similar meaning in the dictionary expands the vocabulary and lets the model better understand the content of medical texts.
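
Step S4046 can be sketched as a dictionary lookup; the synonym map and weights below are hypothetical entries, not taken from any actual terminology base.

    SYNONYMS = {"心肌梗死": ["心梗", "急性心肌梗死"]}   # hypothetical near-synonym entries
    TERM_WEIGHT = {"心肌梗死": 2.0}                     # hypothetical domain-term weight

    def expand_and_weight(tokens):
        """Attach a weight to each token and add dictionary near-synonyms with the same weight."""
        out = []
        for t in tokens:
            w = TERM_WEIGHT.get(t, 1.0)
            out.append((t, w))
            out.extend((syn, w) for syn in SYNONYMS.get(t, []))
        return out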

Step S4048: vectorized representation.

The words that have been expanded and weighted are converted into vector representations. First, the frequency of each word in the target chapter content and the reference chapter content is computed; high-frequency words are usually important to the expression of the text. The inverse document frequency of each word, that is, the importance of the word across the whole corpus, is then computed using the document-frequency information of the training corpus. Multiplying term frequency by inverse document frequency gives each word's TF-IDF value, which highlights words that appear frequently in the current document but rarely in the corpus as a whole.

Natural language processing tools such as NLTK (Natural Language Toolkit) or spaCy are used to tag the text with parts of speech, extracting grammatical information such as nouns, verbs, and adjectives; these part-of-speech tags provide additional information about the structure and content of the text. By computing the correlation between each feature and the target variable (the similarity score), features highly correlated with the similarity score are selected; correlation analysis can use methods such as the Pearson or Spearman correlation coefficient. The information-gain method evaluates the importance of each feature for classification (the similarity score) so that the most discriminative features are selected; information gain measures how much the uncertainty of the classification is reduced once a feature is known.

The extracted and selected features are then converted into feature vectors. A bag-of-words (BoW) model or a TF-IDF-weighted bag-of-words model converts features such as term frequency and TF-IDF values into vector representations; non-numerical features such as part-of-speech tags can be converted into numerical feature vectors with one-hot encoding or word embedding techniques.
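
The TF-IDF part of this step maps directly onto scikit-learn, as the sketch below shows; part-of-speech features, correlation-based selection, and information gain are left out here.

    import jieba
    from sklearn.feature_extraction.text import TfidfVectorizer

    def tfidf_features(target_text: str, reference_texts: list):
        """Build TF-IDF feature vectors for the target chapter and the reference chapters."""
        vectorizer = TfidfVectorizer(tokenizer=jieba.lcut, lowercase=False)
        matrix = vectorizer.fit_transform([target_text] + list(reference_texts))
        return matrix[0], matrix[1:]   # row 0: target chapter; remaining rows: reference chapters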

By converting text into feature vectors and using information such as term frequency, TF-IDF values, and part-of-speech tags, this embodiment not only reflects the grammatical and semantic information of the text better but also highlights keywords that are important in the text yet rare in the corpus as a whole. Feature selection and correlation analysis pick out the features that are highly correlated with the similarity score, improving their discriminative power. Finally, converting the extracted and selected features into feature vectors makes the similarity calculation more precise, improving the accuracy and reliability of the text similarity score.

Step S406: compute the similarity.

The computed feature vectors are used for similarity calculation; methods such as vector cosine similarity can measure how similar the target chapter content and the reference chapter content are. At the same time, a semantic matching model such as GPT4 directly understands and analyzes the target chapter content and the reference chapter content to obtain the semantic matching similarity.

Finally, the vector cosine similarity and the semantic matching similarity are combined with weights to obtain the similarity score, and the chapters of the reference documents to be recalled are determined based on that score.

In some embodiments, the similarity score can be obtained as follows:

Sim(S_1, S_2) = β_1 · β_2

β_1 = k · λ

β_2 = α_1 · Cos(TV_G, TV_C) + α_2 · Cos(V_G, V_C) + α_3 · Mcos(V_G, V_C)

Sim(S_1, S_2) is the similarity score, and its value is determined by two parts: β_1 is the similarity adjustment coefficient and β_2 is the semantic similarity value. β_1 contains two parameters, k and λ. Here k is the sentence-weight coefficient, that is, the adjustment coefficient for sentence weights, set to 2·i/(m + n), where m and n are the numbers of weighted words contained in S_1 and S_2, respectively, and i is the number of weighted words that correspond between S_1 and S_2. λ is a negation coefficient: it is set to −1 when the two part-of-speech tags are completely inconsistent, and to 1 otherwise. S_1 and S_2 denote a sentence in the target chapter content and a sentence in the reference chapter content, respectively.

Because a sentence is divided into two weight shares by its professional terms, the value of β_2 is composed of the three similarity values Cos(TV_G, TV_C), Cos(V_G, V_C), and Mcos(V_G, V_C): α_1 is the ratio of the number of weighted words in the target chapter content Gc to the number of words in the sentence, α_2 is the ratio of the number of weighted words in the reference chapter content Cc to the number of words in the sentence, and α_3 is the ratio of the number of non-weighted words to the total number of words in the two sentences.
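
Putting these definitions together, the sentence-level score can be sketched as follows; the caller supplies the three cosine values, the word counts, and the α ratios exactly as defined above.

    def sentence_similarity(cos_tv: float, cos_v: float, mcos_v: float,
                            m: int, n: int, i: int,
                            alpha1: float, alpha2: float, alpha3: float,
                            pos_consistent: bool = True) -> float:
        """Sim(S1, S2) = β1 · β2 with β1 = k·λ and β2 = α1·Cos(TV) + α2·Cos(V) + α3·Mcos(V)."""
        k = 2 * i / (m + n)                  # sentence-weight adjustment coefficient
        lam = 1 if pos_consistent else -1    # negation coefficient for fully conflicting POS tags
        beta1 = k * lam
        beta2 = alpha1 * cos_tv + alpha2 * cos_v + alpha3 * mcos_v
        return beta1 * beta2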

This embodiment combines the weights of the words in a sentence with semantic similarity values and takes the consistency of part-of-speech tags into account, making the sentence similarity calculation more comprehensive and accurate. The calculation considers not only the contribution of each word in the sentence but also the importance of the vocabulary and the consistency of the part-of-speech tags, so it captures the semantic relationship between sentences better. This makes the sentence similarity calculation more fine-grained and accurate, supports more precise semantic comparison and analysis of texts, and improves the efficiency and accuracy of text processing tasks.

Embodiment 4

The embodiments of the present application provide a document generation method. As shown in FIG. 6, the method comprises the following steps:

Step S602: receive target chapter content.

Step S604: use the intelligent recall method for document writing to determine whether to recall reference chapter content.

The intelligent recall method for document writing is the method of Embodiments 1 to 3 and is not repeated here.

Step S606: when it is determined that the reference chapter content needs to be recalled, recall the reference chapter content and generate a document based on the target chapter content and the reference chapter content.

实施例5Example 5

本申请实施例提供了一种文档写作智能召回装置,如图7所示,包括处理模块72、相似度计算模块74和召回模块76。An embodiment of the present application provides a document writing intelligent recall device, as shown in FIG. 7 , including a processing module 72 , a similarity calculation module 74 and a recall module 76 .

处理模块72被配置为对参考文件进行结构化处理,对结构化处理后的所述参考文件进行预处理,并对目标章节内容进行预处理,其中,所述目标章节内容是基于所述参考章节内容智能生成的内容。The processing module 72 is configured to perform structural processing on the reference file, preprocess the reference file after structural processing, and preprocess the target chapter content, wherein the target chapter content is content intelligently generated based on the reference chapter content.

相似度计算模块74被配置为计算预处理后的所述目标章节内容和预处理后的所述参考文件的参考章节内容之间的向量余弦相似度,并计算预处理后的所述目标章节内容和与处理后的所述参考文件的参考章节内容之间的语义匹配相似度;The similarity calculation module 74 is configured to calculate the vector cosine similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file, and calculate the semantic matching similarity between the preprocessed target chapter content and the processed reference chapter content of the reference file;

召回模块76被配置为对所述余弦相似度和所述语义匹配相似度进行加权计算,得到相似度评分,并基于所述相似度评分确定是否召回所述参考章节内容。The recall module 76 is configured to perform weighted calculation on the cosine similarity and the semantic matching similarity to obtain a similarity score, and determine whether to recall the reference chapter content based on the similarity score.
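The three modules can be pictured as the following skeleton. The tokenisation, the two similarity measures, the 0.6/0.4 weights and the 0.8 threshold are deliberately simplified stand-ins so the sketch stays self-contained; they are not the implementations described in Examples 1 to 3.

    from dataclasses import dataclass

    class ProcessingModule:
        # Counterpart of processing module 72: preprocess the reference chapter content
        # and the target chapter content (a lowercase whitespace split stands in for the
        # dictionary-based segmentation and filtering).
        def preprocess(self, text: str) -> list[str]:
            return text.lower().split()

    class SimilarityModule:
        # Counterpart of similarity calculation module 74: one value for the vector
        # cosine similarity and one for the semantic matching similarity (both replaced
        # by a token Jaccard overlap here).
        def vector_cosine(self, a: list[str], b: list[str]) -> float:
            return self._overlap(a, b)

        def semantic_match(self, a: list[str], b: list[str]) -> float:
            return self._overlap(a, b)

        @staticmethod
        def _overlap(a: list[str], b: list[str]) -> float:
            sa, sb = set(a), set(b)
            return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

    @dataclass
    class RecallModule:
        # Counterpart of recall module 76: weight the two similarities into one score
        # and compare it with the preset threshold.
        w_cos: float = 0.6
        w_sem: float = 0.4
        threshold: float = 0.8

        def should_recall(self, cos_sim: float, sem_sim: float) -> bool:
            return self.w_cos * cos_sim + self.w_sem * sem_sim > self.threshold

    # Wiring the modules together for one target/reference pair.
    proc, sim, recall = ProcessingModule(), SimilarityModule(), RecallModule()
    target = proc.preprocess("Pharmacokinetics of the study drug")
    reference = proc.preprocess("Pharmacokinetics results of the previous study drug trial")
    print(recall.should_recall(sim.vector_cosine(target, reference), sim.semantic_match(target, reference)))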

本申请实施例还提供了一种文档生成装置,如图8所示,该文档生成装置包括:接收模块82,被配置为接收目标章节内容;判断模块84被配置为利用实施例1至3所述的文档写作智能召回方法,判断是否需要召回所述参考章节内容;生成模块86被配置为在判断需要召回所述参考章节内容的情况下,召回所述参考章节内容,并基于所述目标章节内容和所述参考章节内容生成文档。An embodiment of the present application also provides a document generation device, as shown in Figure 8, the document generation device includes: a receiving module 82, configured to receive target chapter content; a judgment module 84 is configured to use the document writing intelligent recall method described in Examples 1 to 3 to determine whether it is necessary to recall the reference chapter content; a generation module 86 is configured to recall the reference chapter content when it is determined that the reference chapter content needs to be recalled, and generate a document based on the target chapter content and the reference chapter content.

需要说明的是:上述实施例提供的装置,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的装置与相应的文档写作智能召回方法或相应的文档生成装置实施例属于同一构思,其具体实现过程详见方法实施例,此处不再赘述。It should be noted that the device provided in the above embodiment is only illustrated by the division of the above functional modules. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the device provided in the above embodiment and the corresponding document writing intelligent recall method or the corresponding document generation device embodiment belong to the same concept. The specific implementation process is detailed in the method embodiment and will not be repeated here.

实施例6Example 6

图9示出了适于用来实现本公开实施例的电子设备的结构示意图。需要说明的是,图9示出的电子设备仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。Fig. 9 shows a schematic diagram of the structure of an electronic device suitable for implementing the embodiment of the present disclosure. It should be noted that the electronic device shown in Fig. 9 is only an example and should not bring any limitation to the function and scope of use of the embodiment of the present disclosure.

如图9所示,该电子设备包括中央处理单元(CPU)1001,其可以根据存储在只读存储器(ROM)1002中的程序或者从存储部分1008加载到随机访问存储器(RAM)1003中的程序而执行各种适当的动作和处理。在RAM 1003中,还存储有系统操作所需的各种程序和数据。CPU1001、ROM 1002以及RAM 1003通过总线1004彼此相连。输入/输出(I/O)接口1005也连接至总线1004。As shown in FIG9 , the electronic device includes a central processing unit (CPU) 1001, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage portion 1008 into a random access memory (RAM) 1003. Various programs and data required for system operation are also stored in the RAM 1003. The CPU 1001, the ROM 1002, and the RAM 1003 are connected to each other via a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.

以下部件连接至I/O接口1005:包括键盘、鼠标等的输入部分1006;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分1007;包括硬盘等的存储部分1008;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分1009。通信部分1009经由诸如因特网的网络执行通信处理。驱动器1010也根据需要连接至I/O接口1005。可拆卸介质1011,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器1010上,以便于从其上读出的计算机程序根据需要被安装入存储部分1008。The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, etc.; an output section 1007 including a cathode ray tube (CRT), a liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1008 including a hard disk, etc.; and a communication section 1009 including a network interface card such as a LAN card, a modem, etc. The communication section 1009 performs communication processing via a network such as the Internet. A drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 1010 as needed, so that a computer program read therefrom is installed into the storage section 1008 as needed.

特别地,根据本公开的实施例,下文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信部分1009从网络上被下载和安装,和/或从可拆卸介质1011被安装。在该计算机程序被中央处理单元(CPU)1001执行时,执行本申请的方法和装置中限定的各种功能。在一些实施例中,电子设备还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。In particular, according to an embodiment of the present disclosure, the process described below with reference to the flowchart can be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable medium, and the computer program contains a program code for executing the method shown in the flowchart. In such an embodiment, the computer program can be downloaded and installed from the network through the communication part 1009, and/or installed from the removable medium 1011. When the computer program is executed by the central processing unit (CPU) 1001, various functions defined in the method and apparatus of the present application are executed. In some embodiments, the electronic device may also include an AI (Artificial Intelligence) processor, which is used to process computing operations related to machine learning.

需要说明的是,本公开所示的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:无线、电线、光缆、RF等等,或者上述的任意合适的组合。It should be noted that the computer-readable medium shown in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two. The computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium containing or storing a program that can be used by or in combination with an instruction execution system, device or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in a baseband or as part of a carrier wave, in which a computer-readable program code is carried. This propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, which may send, propagate or transmit a program for use by or in conjunction with an instruction execution system, apparatus or device. The program code contained on the computer-readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,上述模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图或流程图中的每个方框、以及框图或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。The flow chart and block diagram in the accompanying drawings illustrate the possible architecture, function and operation of the system, method and computer program product according to various embodiments of the present disclosure. In this regard, each box in the flow chart or block diagram can represent a module, a program segment, or a part of a code, and the above-mentioned module, program segment, or a part of a code contains one or more executable instructions for realizing the specified logical function. It should also be noted that in some alternative implementations, the functions marked in the box can also occur in a different order from the order marked in the accompanying drawings. For example, two boxes represented in succession can actually be executed substantially in parallel, and they can sometimes be executed in the opposite order, depending on the functions involved. It should also be noted that each box in the block diagram or flow chart, and the combination of the boxes in the block diagram or flow chart can be implemented with a dedicated hardware-based system that performs a specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.

描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现,所描述的单元也可以设置在处理器中。其中,这些单元的名称在某种情况下并不构成对该单元本身的限定。The units involved in the embodiments described in the present disclosure may be implemented by software or hardware, and the units described may also be arranged in a processor. The names of these units do not, in some cases, constitute limitations on the units themselves.

作为另一方面,本申请还提供了一种计算机可读介质,该计算机可读介质可以是上述实施例中描述的电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。As another aspect, the present application further provides a computer-readable medium, which may be included in the electronic device described in the above embodiment; or may exist independently without being assembled into the electronic device.

上述计算机可读介质承载有一个或者多个程序，当上述一个或者多个程序被该电子设备执行时，使得该电子设备实现如上述实施例中所述的方法。例如，所述的电子设备可以实现上述方法实施例的各个步骤等。The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device implements the methods described in the foregoing embodiments. For example, the electronic device can implement each step of the above method embodiments.

上述实施例中的集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在上述计算机可读取的存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在存储介质中,包括若干指令用以使得一台或多台计算机设备(可为个人计算机、服务器或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。If the integrated units in the above embodiments are implemented in the form of software functional units and sold or used as independent products, they can be stored in the above computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, or all or part of the technical solution can be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling one or more computer devices (which may be personal computers, servers, or network devices, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.

在本申请的上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。In the above embodiments of the present application, the description of each embodiment has its own emphasis. For parts that are not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

在本申请所提供的几个实施例中,应该理解到,所揭露的终端设备,可通过其它的方式实现。其中,以上所描述的装置实施例仅仅是示意性的,例如所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,单元或模块的间接耦合或通信连接,可以是电性或其它的形式。In the several embodiments provided in the present application, it should be understood that the disclosed terminal device can be implemented in other ways. Among them, the device embodiments described above are only schematic, for example, the division of the units is only a logical function division, and there may be other division methods in actual implementation, for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of units or modules, which can be electrical or other forms.

所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit. The above-mentioned integrated unit may be implemented in the form of hardware or in the form of software functional units.

以上所述仅是本申请的优选实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。The above is only a preferred implementation of the present application. It should be pointed out that for ordinary technicians in this technical field, several improvements and modifications can be made without departing from the principles of the present application. These improvements and modifications should also be regarded as the scope of protection of the present application.

Claims (10)

1. A document writing intelligent recall method, characterized by comprising:
performing structural processing on a reference file, preprocessing the structurally processed reference file, and preprocessing target chapter content, wherein the target chapter content is chapter title information of content to be generated;
calculating a vector cosine similarity between the preprocessed target chapter content and reference chapter content of the preprocessed reference file, and calculating a semantic matching similarity between the preprocessed target chapter content and the reference chapter content; and
performing a weighted calculation on the cosine similarity and the semantic matching similarity to obtain a similarity score, and determining, based on the similarity score, whether to recall the reference chapter.

2. The method according to claim 1, characterized in that performing structural processing on the reference file comprises:
determining whether the reference file conforms to a preset format;
when the reference file conforms to the preset format, extracting chapter titles and content corresponding to the chapter titles from the reference file; and
when the reference file does not conform to the preset format, identifying and splitting out the chapter titles and the content corresponding to the chapter titles from the reference file based on the context of the reference file, and taking the content corresponding to the chapter titles as the reference chapter content.

3. The method according to claim 2, characterized in that preprocessing the structurally processed reference file comprises:
performing, based on a medical dictionary, word segmentation on the reference chapter content in the structurally processed reference file to obtain a first word segmentation set corresponding to the reference chapter content, and filtering stop words out of the first word segmentation set;
performing word-sense expansion and weighting on the segmented words in the filtered first word segmentation set that meet a preset condition, and calculating word vectors of the segmented words in the first word segmentation set after word-sense expansion and weighting to obtain a first word vector set; and
calculating a first sentence vector of the whole sentence corresponding to the word vectors in the first word vector set, and performing a weighted summation of the word vectors in the first word vector set and the corresponding first sentence vector to obtain a reference content vector.

4. The method according to claim 3, characterized in that preprocessing the target chapter content comprises:
performing, based on the medical dictionary, word segmentation on the target chapter content to obtain a second word segmentation set corresponding to the target chapter content, and filtering stop words out of the second word segmentation set;
performing word-sense expansion and weighting on the segmented words in the filtered second word segmentation set that meet a preset condition, and calculating word vectors of the segmented words in the second word segmentation set after word-sense expansion and weighting to obtain a second word vector set; and
calculating a second sentence vector of the whole sentence corresponding to the word vectors in the second word vector set, and performing a weighted summation of the word vectors in the second word vector set and the corresponding second sentence vector to obtain a target chapter content vector.

5. The method according to claim 4, characterized in that determining whether to recall the reference chapter based on the similarity score comprises:
when the similarity score is greater than a preset threshold, determining to recall the reference chapter; and
when the similarity score is less than or equal to the preset threshold, determining not to recall the reference chapter.

6. A document generation method, characterized by comprising:
receiving target chapter content;
determining, by using the document writing intelligent recall method according to any one of claims 1 to 5, whether to recall reference chapter content in a reference file; and
when it is determined that the reference chapter content needs to be recalled, recalling the reference chapter content, and generating a document based on the target chapter content and the reference chapter content.

7. A document writing intelligent recall device, characterized by comprising:
a processing module, configured to perform structural processing on a reference file, preprocess the structurally processed reference file, and preprocess target chapter content, wherein the target chapter content is chapter title information of content to be generated;
a similarity calculation module, configured to calculate a vector cosine similarity between the preprocessed target chapter content and reference chapter content of the preprocessed reference file, and calculate a semantic matching similarity between the preprocessed target chapter content and the preprocessed reference chapter content of the reference file; and
a recall module, configured to perform a weighted calculation on the cosine similarity and the semantic matching similarity to obtain a similarity score, and determine, based on the similarity score, whether to recall the reference chapter content.

8. A document generation device, characterized by comprising:
a receiving module, configured to receive target chapter content;
a judgment module, configured to determine, by using the document writing intelligent recall method according to any one of claims 1 to 5, whether the reference chapter content needs to be recalled; and
a generation module, configured to recall the reference chapter content when it is determined that the reference chapter content needs to be recalled, and generate a document based on the target chapter content and the reference chapter content.

9. An electronic device, characterized by comprising:
a memory, configured to store a computer program; and
a processor, configured to cause a computer to execute the method according to any one of claims 1 to 5 when the program runs.

10. A computer-readable storage medium having a program stored thereon, characterized in that, when the program runs, a computer is caused to execute the method according to any one of claims 1 to 5.
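For readers who want to trace claims 3 and 4 into code, the following sketch shows the rough shape of the preprocessing pipeline: segmentation, stop-word filtering, term weighting, word vectors, a sentence vector, and a weighted fusion into a single content vector. The dictionary, stop-word list, weights and pseudo-embeddings are invented placeholders, and the word-sense expansion step is omitted.

    import numpy as np

    MEDICAL_TERMS = {"pharmacokinetics", "bioavailability"}   # placeholder "medical dictionary"
    STOP_WORDS = {"the", "of", "and", "a", "in"}              # placeholder stop-word list
    TERM_WEIGHT = 2.0        # illustrative boost for dictionary terms
    SENTENCE_WEIGHT = 0.5    # illustrative weight of the sentence vector in the fusion
    DIM = 64

    def embed(token: str) -> np.ndarray:
        # Pseudo-embedding seeded from the token hash (stable within one run),
        # so the sketch runs without a trained embedding model.
        rng = np.random.default_rng(abs(hash(token)) % (2 ** 32))
        return rng.standard_normal(DIM)

    def chapter_content_vector(text: str) -> np.ndarray:
        # Segment, drop stop words, weight dictionary terms, build word vectors,
        # derive a sentence vector, and fuse everything by weighted summation.
        tokens = [t for t in text.lower().split() if t not in STOP_WORDS]
        word_vectors = [(TERM_WEIGHT if t in MEDICAL_TERMS else 1.0) * embed(t) for t in tokens]
        if not word_vectors:
            return np.zeros(DIM)
        sentence_vector = np.mean(word_vectors, axis=0)
        return np.sum(word_vectors, axis=0) + SENTENCE_WEIGHT * sentence_vector

    print(chapter_content_vector("Pharmacokinetics of the study drug in healthy subjects").shape)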
CN202410619691.7A 2024-05-20 2024-05-20 Document writing intelligent recall method and device and document generation method and device Pending CN118747293A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410619691.7A CN118747293A (en) 2024-05-20 2024-05-20 Document writing intelligent recall method and device and document generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410619691.7A CN118747293A (en) 2024-05-20 2024-05-20 Document writing intelligent recall method and device and document generation method and device

Publications (1)

Publication Number Publication Date
CN118747293A true CN118747293A (en) 2024-10-08

Family

ID=92920385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410619691.7A Pending CN118747293A (en) 2024-05-20 2024-05-20 Document writing intelligent recall method and device and document generation method and device

Country Status (1)

Country Link
CN (1) CN118747293A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119538921A (en) * 2025-01-22 2025-02-28 浙江孚临科技有限公司 A Slicing Method for Non-standard Documents
CN119538921B (en) * 2025-01-22 2025-04-11 浙江孚临科技有限公司 Non-standard document slicing processing method
CN119830872A (en) * 2025-03-18 2025-04-15 中国标准科技集团有限公司 Intelligent writing system and method based on multi-layer feature fusion
CN119830872B (en) * 2025-03-18 2025-09-09 中国标准科技集团有限公司 An intelligent writing system and method based on multi-layer feature fusion


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination